Sorry if this wasn’t clear in the post, but when I say “we’re” trying to protect some submodule, I don’t mean us as the engineers who want to make sure the model doesn’t change a submodule (we could do that trivially, just add a stop gradient); I mean it from the perspective of a gradient hacker that needs to protect its own logic from being disabled by SGD.
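A minimal sketch of that parenthetical trick, assuming a toy two-part model with made-up names: in JAX, wrapping the submodule’s output in `jax.lax.stop_gradient` means the submodule’s weights receive exactly zero gradient, so SGD never updates them.

```python
import jax
import jax.numpy as jnp

def forward(params, x):
    # Toy two-part model: a "protected" submodule and the rest of the network.
    protected_out = jnp.tanh(x @ params["protected_w"])
    # Engineer-side protection: stop_gradient blocks any gradient from flowing
    # back into the protected submodule's weights.
    protected_out = jax.lax.stop_gradient(protected_out)
    return protected_out @ params["rest_w"]

def loss(params, x, y):
    return jnp.mean((forward(params, x) - y) ** 2)

params = {
    "protected_w": jnp.ones((4, 8)) * 0.1,  # hypothetical shapes for illustration
    "rest_w": jnp.ones((8, 1)) * 0.1,
}
x, y = jnp.ones((2, 4)), jnp.zeros((2, 1))

grads = jax.grad(loss)(params, x, y)
print(jnp.all(grads["protected_w"] == 0.0))  # True: SGD never touches it
print(jnp.any(grads["rest_w"] != 0.0))       # True: the rest still trains
```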
Huh. I was interpreting it differently then—if I was building a module-checker to keep an eye out for AI tampering, I would not feed the result of the checker back into the gradient signal.
The big difference, parroting Steve (https://www.lesswrong.com/posts/ey7jACdF4j6GrQLrG/thoughts-on-safety-in-predictive-learning, I think), is that gradient descent doesn’t try things out and then keep what works; it models changes and does what is good in the model.
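To make the checker point concrete, here’s a toy JAX contrast (the checker, shapes, and names are all made up for illustration): once the checker’s output is added to the loss, `jax.grad` differentiates straight through it, so the update already accounts for the checker; kept out of the loss, the checker can only observe and flag.

```python
import jax
import jax.numpy as jnp

def checker(weights, reference):
    # Hypothetical tamper-checker: how far have the watched weights drifted
    # from a reference copy?
    return jnp.sum((weights - reference) ** 2)

def loss_checker_in_signal(weights, reference, x, y):
    task = jnp.mean((x @ weights - y) ** 2)
    # Checker output fed back into the gradient signal: SGD differentiates
    # through it and trades off task loss against keeping the checker quiet.
    return task + checker(weights, reference)

def loss_checker_as_monitor(weights, reference, x, y):
    # Checker evaluated for logging/alarms only, deliberately kept out of
    # the loss, so it contributes nothing to the update.
    _ = checker(weights, reference)
    return jnp.mean((x @ weights - y) ** 2)

weights, reference = jnp.ones((4, 1)), jnp.zeros((4, 1))
x, y = jnp.ones((3, 4)), jnp.zeros((3, 1))

print(jax.grad(loss_checker_in_signal)(weights, reference, x, y).ravel())
print(jax.grad(loss_checker_as_monitor)(weights, reference, x, y).ravel())
```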
Ah, that makes sense.