First, I love this example. Second, I think it’s wrong.
It’s true in idealized mathematical terms—if you’re right at the point where the submodules agree, then the gradient will be along the continue-to-agree direction. But that’s not what matters in actual numerical practice. Numerically, the speed of (non-stochastic) gradient descent is controlled by the local condition number, and the condition number for this two-module example would be enormous—meaning that gradient descent will move through the submodules’ parameter space extremely slowly. Unless the surface along which the submodules match is perfectly linear (and the parameters are already exactly on that surface), every step will take the parameters just a little bit off the surface, so gradient descent will end up taking extremely small steps (so that it’s always within numerical precision of the surface).
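To put a rough number on the conditioning claim, here’s a minimal sketch (entirely my own toy construction; the loss, K, and tau are assumptions, not anything from the post). A sharp agreement penalty plus a weak task term gives a Hessian whose condition number is the ratio of the two scales, which caps how fast plain GD can move along the still-agreeing direction:

```python
import numpy as np

# Toy stand-in for the two-module loss: a sharp penalty for the
# submodules disagreeing plus a weak task term (hypothetical scales).
K = 1e6     # sharpness of the agreement penalty
tau = 1e-2  # strength of the task term

def loss(x, y):
    return K * (x - y) ** 2 + tau * (x + y) ** 2

# The Hessian of this quadratic loss is constant:
H = 2 * np.array([[K + tau, tau - K],
                  [tau - K, K + tau]])

lam = np.linalg.eigvalsh(H)
print("condition number:", np.linalg.cond(H))  # ~K/tau = 1e8
# Plain GD is only stable for lr < 2/lambda_max, so per-step progress
# along the slow (still-agreeing) direction shrinks like 1/cond(H).
print("largest stable lr:", 2 / lam[-1])
```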
In practice, I’d think the combining function wouldn’t be numerically perfectly sharp, but would instead have some epsilon of error tolerance, since it is itself subject to floating-point numerics. My mental image for this example would then be a 2*epsilon-wide surface centered on the line where both are equal, and the parameters shouldn’t stray from the center of that surface, by a symmetry argument.
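To illustrate (the hinge form and the value of eps are my assumptions, not from the post): a combining function with that tolerance has a 2*epsilon-wide flat region around agreement, inside which the penalty, and hence its gradient, is exactly zero:

```python
import numpy as np

eps = 1e-3  # hypothetical tolerance of the combining function

def agreement_penalty(a, b, eps=eps):
    """Zero inside the 2*eps band around a == b, quadratic hinge outside."""
    return np.maximum(np.abs(a - b) - eps, 0.0) ** 2

print(agreement_penalty(0.5000, 0.5005))  # inside the band -> exactly 0.0
print(agreement_penalty(0.5, 0.6))        # outside the band -> positive
```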
Huh. I was interpreting it differently then—if I were building a module-checker to keep an eye out for AI tampering, I would not feed the result of the checker back into the gradient signal.
The big difference, parroting Steve ( https://www.lesswrong.com/posts/ey7jACdF4j6GrQLrG/thoughts-on-safety-in-predictive-learning , I think), is that gradient descent doesn’t try things out and then keep what works; it models changes and does what is good in the model.
Sorry if this wasn’t clear in the post, but when I say “we’re” trying to protect some submodule, I don’t mean us as the engineers who want to make sure the model doesn’t change a submodule (we could do that trivially; just add a stop gradient); I mean it from the perspective of a gradient hacker that needs to protect its own logic from being disabled by SGD.
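For concreteness, the trivial engineering fix is literally a stop-gradient. In PyTorch that’s `.detach()` on the protected submodule’s output (a minimal sketch; the module names are hypothetical):

```python
import torch
import torch.nn as nn

protected = nn.Linear(8, 8)  # hypothetical submodule SGD should leave alone
rest = nn.Linear(8, 1)       # hypothetical downstream network

x = torch.randn(4, 8)
# .detach() blocks gradients from flowing back into `protected`,
# so its weights receive no update from this loss.
h = protected(x).detach()
loss = rest(h).sum()
loss.backward()

print(protected.weight.grad)         # None: no gradient reached it
print(rest.weight.grad is not None)  # True: the rest still trains
```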
Ah, that makes sense.
Yup, that’s roughly what I was picturing. (Really I was picturing a smooth approximation of that, but the conclusion is the same regardless.)
In general, “shouldn’t stray from the center of the surface by a symmetry argument” definitely won’t hold for GD in practice—either because numerical noise knocks us off the line-where-both-are-equal, or because the line-where-both-are-equal itself curves.
So, unless the line-where-both-are-equal is perfectly linear and the numerics are perfectly symmetric all the way to the lowest bits, GD will need to take steps of size ~epsilon to stay near the center of the surface.
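Here’s a rough simulation of that claim (all the numbers are hypothetical, and the curved surface y == x**2 stands in for the line-where-both-are-equal). The task gradient pushes x forward, but every step drifts off the curve and triggers the sharp penalty, so the parameters crawl along at a fraction of the unconstrained rate while staying pinned within ~epsilon of the surface:

```python
import numpy as np

K, eps, lr = 1e4, 1e-3, 1e-5  # hypothetical sharpness, tolerance, step size

def grad(x, y):
    # The submodules agree along the curved surface y == x**2.
    d = x**2 - y
    hinge = 2 * np.sign(d) * max(abs(d) - eps, 0.0)  # zero inside the band
    # Task gradient (the -1.0 term) pushes x forward; the penalty pulls back.
    return np.array([K * hinge * 2 * x - 1.0, -K * hinge])

p = np.array([1.0, 1.0])  # start exactly on the curve
for _ in range(20000):
    p -= lr * grad(*p)

print("unconstrained x would be:", 1.0 + lr * 20000)  # 1.2 with no penalty
print("actual x:", p[0])                       # crawls far short of that
print("distance from curve:", p[0]**2 - p[1])  # stays ~eps
```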
There should be a fair bit more than 2 epsilon of leeway in the line of equality. Since the submodules themselves are learned by SGD, they won’t be exactly equal. Most likely, the model will include dropout as well. Thus, the signals sent to the combining function will almost always differ by far more than the limits of numerical precision. This means the combining function will need quite a bit of leeway; otherwise the network’s performance would just always be zero.
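For a rough sense of scale on the dropout point (a sketch; the sizes and dropout rate are arbitrary): two copies of the same signal passed through independent dropout masks differ by order-one amounts, many orders of magnitude above float32 epsilon:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1000)

# Two "identical" submodule outputs, each through an independent dropout mask.
a = torch.nn.functional.dropout(x, p=0.1, training=True)
b = torch.nn.functional.dropout(x, p=0.1, training=True)

print((a - b).abs().max())             # order-1 gap between the copies
print(torch.finfo(torch.float32).eps)  # ~1.19e-07 for comparison
```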