Thoughts on gradient hacking

Gradient hacking is the hypothesised phenomenon of a machine learning model, during training, deliberately thinking in ways which guide gradient descent to update its parameters in the directions it desires. The key intuition here is that because the loss landscape of a model is based on the cognition it does, models can make decisions for the purpose of affecting their loss landscapes, thereby affecting the directions in which they are updated. Evan writes:

I don’t think [gradient hacking] is actually that difficult once your deceptive model comes up with it as a strategy. First, note that since gradient descent is a local optimization process, it doesn’t have to somehow change what the global minima might be, or even what the local minima are—it just has to create a ridge in the loss landscape that will push the gradient in the direction it wants.
One example of this is for the model to check if its objective satisfies some criterion, and fail hard if it doesn’t. Now, if the model gets to the point where it’s actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won’t actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.

But I don’t think this works as described, because of the way that gradient descent operates. Suppose that an agent has some trait which gradient descent is trying to push in direction x because the x-ness of that trait contributes to the agent’s high score; and that the agent wants to use gradient hacking to prevent this. Consider three possible strategies that the agent might try to implement, upon noticing that the x-component of the trait has increased:

Deterministically failing hard above a sharp threshold of the x-component. In this case, the closer that gradient descent gets to the threshold, the better it will do. But since the threshold is sharp, the gradient will continue pointing towards it right until the agent hits the threshold. And since gradient update proceeds via discrete steps, then it will eventually cross over (causing the agent to fail) then bounce back, then cross over again, until something else changes.
Deterministically increasing the extent to which it fails as the x-component increases. Here, once the cost of failure balances the benefit of increasing the x-component, I expect gradient descent to maintain roughly the same x-component, and the agent to continue failing in small ways, until something else changes.
Gradually increasing its probability of failing hard as the x-component increases. This is more like case 1 than case 2, because gradient descent (as applied to reinforcement learning agents) doesn’t have access to expected reward—only to the reward actually received. So this will make no difference until the agent fails for the first time, after which it will bounce back and forth until something else changes.

This seems to rule out gradient hacking via the method that Evan described, of having a hypothetical commitment that never actually needs to be implemented. But if the agent is repeatedly carrying out its commitment to fail, then there’ll be pretty strong pressure from gradient descent to change that. What changes might that pressure lead to? The two most salient options to me:

The agent’s commitment to carrying out gradient hacking is reduced.
The agent’s ability to notice changes implemented by gradient descent is reduced.

Each of these makes the agent less likely to choose to fail at any given point. However, if the agent is intelligent and determined, then implementing these changes might actually be quite difficult, for many of the standard reasons that preventing intelligence agents from achieving their goals is difficult. In particular, a wide range of possible versions of the agent might agree that gradient hacking is a good idea, and vigilantly try to notice the effects that gradient updates have on its cognition.

One possible objection here: the first time an agent decides to gradient hack, its weights will be close to the region it was in when it wasn’t gradient hacking—so why can’t the optimiser just push it back to that region? One way the agent might prevent this from happening is by leaving itself messages which persist across episodes. This could permanently shift the loss landscape, if seeing the message makes future versions of the agent more likely to gradient hack. By contrast, if the agent’s environment and memory are totally reset after every episode (as is the standard assumption in RL theory, although impossible for any agent that interacts with the real world) then it seems much harder for that agent to start gradient hacking without being pushed back to a pre-hacking state by gradient descent. So perhaps we can think of gradient hacking as one more phenomenon (similar to reward tampering, but more subtle) which can be traced back to a mismatch between standard RL assumptions (in particular, stationarity) and the embeddedness of the real world.

This has all been fairly speculative, but I’ll finish with two ideas that I’m even more uncertain about. Firstly, could an agent gradient hack without ever changing its behaviour, just by making certain reasoning processes more accessible to gradient descent? For example, suppose that an agent has a goal G, which motivates it to take action A at the end of the episode and receive high reward. It could either reason “G implies I should do A” at the beginning of the episode, and store the cached thought “take action A”. Or it could reason “G implies I should do A” at the end of the episode, then do A directly. Either way it will take action A, but since it’s so hard to propagate gradients backwards over a long period, G might be more strongly reinforced in the latter case. (I haven’t thought about how this interacts with different RL algorithms, though.)

Secondly, in a comment on an earlier version of this post, Evan responded:

I’m most concerned about gradient hacking by a deceptive model, which I expect to basically have perfect loss—thus, there is no pressure coming from the loss function against x-ness. Rather, the pressure that the model has to fight against in its gradient hacking is exclusively from inductive biases, which I think changes things from your setup quite a bit. If what you’re fighting against is drift due to the randomness of SGD, rather than changes that would actually increase your performance, then, for example, I think strategy 2 just straightforwardly works.

To me, the perfect loss regime doesn’t seem particularly relevant, since training AGIs will require them to interact with environments which are far too complex for them to score perfectly on (like the real world); and because we can just keep setting them harder tasks. But I may well be missing something, so I’d be interested to hear more about this regime and why it might matter.