I’m not sure how “explanations for corrigibility” would be relevant here (though I’m also not sure exactly what you’re picturing).
To clarify what I mean by explaining corrigibility: in my projects, the aim is not just to get GPT-2 XL to execute a shutdown procedure, but to ensure that the shutdown is a thoroughly considered process. I also want to be able to examine how the mean and standard deviation of the 600,000 QKV weights change.
Yes, I’m aware that this is not a complete solution, since I cannot explain why each individual weight changes. However, seeing the ATL successfully modify the network, and observing the reliable response from GPT-2 XL, is why I’m pursuing this method further.
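For concreteness, the mean and standard deviation comparison mentioned above could be sketched roughly as follows. This is only an illustration, not my actual pipeline: the function names, the `layer0.qkv` key, and the toy numbers are all made up, and it assumes the weights have already been flattened into plain lists of floats (for example, pulled from the model's state dict before and after tuning).

```python
from statistics import fmean, pstdev


def weight_stats(weights):
    """Return (mean, population standard deviation) of a flat list of floats."""
    return fmean(weights), pstdev(weights)


def stats_shift(before, after):
    """Compare two snapshots shaped like {group_name: [floats]}.

    Returns, per group, how far the mean and the standard deviation moved
    between the 'before' and 'after' snapshots.
    """
    shift = {}
    for name in before:
        mean_b, std_b = weight_stats(before[name])
        mean_a, std_a = weight_stats(after[name])
        shift[name] = {"d_mean": mean_a - mean_b, "d_std": std_a - std_b}
    return shift


# Toy stand-ins for one group of QKV weights (hypothetical values, not real data).
before = {"layer0.qkv": [0.0, 1.0, 2.0, 3.0]}
after = {"layer0.qkv": [1.0, 2.0, 3.0, 4.0]}

shift = stats_shift(before, after)
for name, delta in shift.items():
    print(name, delta)
```

As noted, summary statistics like these only show that a group of weights moved and by how much in aggregate; they say nothing about why any individual weight changed.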
Thanks for your reply.