MiguelDev comments on Symbol/Referent Confusions in Language Model Alignment Experiments

MiguelDev 2 Nov 2023 16:24 UTC
Thanks for your reply.
I’m not sure how “explanations for corrigibility” would be relevant here (though I’m also not sure exactly what you’re picturing).

To clarify what I mean by explaining corrigibility: in my projects, my aim is not simply to get GPT-2 XL to execute a shutdown procedure, but also to ensure that the shutdown is a thoroughly considered process. Additionally, I want to be able to examine the changes in the mean and standard deviation of the 600,000 QKV weights.
Yes, I’m aware that this is not a complete solution, since I cannot explain why each individual weight changes. However, seeing the ATL successfully modify the network and observing the reliable response from GPT-2 XL is why I’m pursuing this method further.
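
For concreteness, the before/after comparison of mean and standard deviation described above could look roughly like the sketch below. This is only an illustration, not the tooling actually used in the project: the function name `weight_shift_summary` is hypothetical, the weights are assumed to have already been flattened into plain lists of floats, and the numbers are toy values rather than real GPT-2 XL weights.

```python
from statistics import fmean, pstdev


def weight_shift_summary(before, after):
    """Compare the mean and population std of two flat weight lists.

    Hypothetical helper: `before` and `after` stand in for the same set
    of weights, flattened, before and after fine-tuning.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    stats = {
        "mean_before": fmean(before),
        "mean_after": fmean(after),
        "std_before": pstdev(before),
        "std_after": pstdev(after),
    }
    # Aggregate shifts only; these say nothing about why any single
    # weight moved.
    stats["mean_delta"] = stats["mean_after"] - stats["mean_before"]
    stats["std_delta"] = stats["std_after"] - stats["std_before"]
    return stats


# Toy values standing in for flattened QKV weights (not real model data).
before = [0.10, -0.20, 0.05, 0.30]
after = [0.12, -0.25, 0.07, 0.35]

summary = weight_shift_summary(before, after)
for name, value in summary.items():
    print(f"{name}: {value:+.4f}")
```

A summary like this only shows that the distribution of weights shifted, which matches the caveat above: it does not explain the change in any individual weight.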