Gradient hacking: definitions and examples

EDIT: I have now shifted my terminology as follows (format: old term-->new term):

“RL credit hacking”-->exploration hacking.
“Gradient descent hacking”-->gradient hacking.
“Gradient hacking”-->credit hacking.

That is, I now reserve “gradient hacking” for one specific sub-phenomenon, and use “credit hacking” as the overarching term which includes both exploration hacking and gradient hacking (along with other possibilities, like multi-agent credit hacking). However, I’ve kept the original terminology in this post.

Gradient hacking is a hypothesized phenomenon where:

A model has knowledge about possible training trajectories which isn’t being used by its training algorithms when choosing updates (such as knowledge about non-local features of its loss landscape which aren’t taken into account by local optimization algorithms).
The model uses that knowledge to influence its medium-term training trajectory (even if the effects wash out in the long term).

Below I give some potential examples of gradient hacking, divided into those which exploit RL credit assignment and those which exploit gradient descent itself. My concern is that models might use techniques like these either to influence which goals they develop, or to fool our interpretability techniques. Even if those effects don’t last in the long term, they might last until the model is smart enough to misbehave in other ways (e.g. specification gaming or reward tampering), or until it’s deployed in the real world. This is especially true of the RL examples, since convergence to a global optimum seems unrealistic (and ill-defined) for RL policies trained on real-world data. However, since gradient hacking isn’t very well understood right now, both the definition above and the examples below should only be considered preliminary.

RL credit hacking examples

A policy might know that a certain sequence of actions leads to higher reward. But if it chooses not to take those actions, it won’t be rewarded for them, and that type of behavior won’t be reinforced (at least until it stumbles into them via random exploration, which could take a very long time).
A policy might know that humans give mistaken feedback on a given trajectory, and then select that trajectory more frequently in order to update itself in a certain direction. When humans give mistakenly positive feedback, this is just an example of specification gaming. But we could also imagine the model deliberately inducing mistakenly negative feedback.
A policy might know which of its actions is actually responsible for receiving reward, even though the RL algorithm being used “doesn’t know”. Then the model can influence how much credit the RL algorithm gives different actions (e.g. by putting them earlier or later in the trajectory).
- This seems easier for dumb RL algorithms like REINFORCE, and seems harder for actor-critic methods. However, finding algorithms that make use of all the knowledge a model has about the training landscape seems like it’ll require a full solution to the ELK problem.
A policy might know when it’s being trained off-policy versus on-policy (because off-policy it observes a trajectory of actions which it wouldn’t have taken), and choose actions differently in each case. A tentative possibility, related to the first example: during off-policy training, it could try novel “explore” actions which it knows would lead to high reward on-policy. Those actions would then be trained against (in favor of the action actually taken during that trajectory), since the rewards they would have earned on-policy are never observed when they are performed off-policy.
In multi-agent RL, even if rewards are zero-sum, different policies could coordinate to choose which of them will receive the reward in which episodes, to influence which behaviors are reinforced.
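
To make the first example above concrete, here is a minimal sketch under toy assumptions: a two-armed bandit trained with REINFORCE, where arm 1 pays reward 1 and arm 0 pays nothing. The `avoid_arm` argument is a hypothetical stand-in for a policy that knows which arm is rewarded and declines to take it; nothing here models how a real policy would represent that knowledge.

```python
import math
import random

REWARDS = [0.0, 1.0]  # toy assumption: arm 1 is the high-reward action


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(avoid_arm=None, steps=200, lr=0.5, seed=0):
    """Run REINFORCE on the bandit and return the final action probabilities.

    If avoid_arm is set, the policy never takes that arm (hypothetical
    stand-in for deliberately declining a known high-reward action).
    """
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        action = 1 if rng.random() < probs[1] else 0
        if action == avoid_arm:
            action = 1 - action  # refuse the avoided arm
        reward = REWARDS[action]
        # d/d logit_k of log pi(action) is 1[k == action] - probs[k]
        for k in range(2):
            indicator = 1.0 if k == action else 0.0
            logits[k] += lr * reward * (indicator - probs[k])
    return softmax(logits)


honest = train()
sandbagging = train(avoid_arm=1)
print("honest:", honest)
print("sandbagging:", sandbagging)
```

When arm 1 is avoided, every observed reward is zero, so every update is zero and the action probabilities never move from their initial values. Without the override, the probability of arm 1 climbs towards 1.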

Gradient descent hacking examples

When local optima exist, a model might choose to do worse than it otherwise could, with the intention of changing its parameters in a certain direction. Intuitively speaking, gradient descent will then push in two directions: towards changing the parameters which implemented the decision to gradient hack, and towards changing the parameters which the model wanted to change (since their current values influenced its decision).
- Gradient descent hacking is most compelling when relatively small changes in one direction could have outsized impacts on the overall training trajectory. For example, at a saddle point in the loss landscape, a model might choose whichever output will lead to gradient descent pushing it in its preferred direction. When local optima exist, a model steering towards or away from a given local optimum might also have big effects.
- This is similar to the classic example of gradient hacking described by Evan Hubinger. I don’t think it works as he described, but I think our main disagreement is whether a model can create new local optima by gradient hacking, rather than whether it’s possible for a model to steer itself at all via gradient hacking. One intuition for my position is that local optima are rare in neural network loss landscapes, because those landscapes are so high-dimensional.
- In the absence of local optima, gradient descent will eventually modify the parameters responsible for the decision to gradient hack enough to prevent further gradient hacking. However, if the decision to gradient hack is driven by high-level reasoning and world-knowledge, the parameters responsible for that decision may be robust enough that the effects on the parameters the model is trying to change are much bigger.
- Gradient descent hacking might seem strange because, in one sense, the loss landscape of gradient descent is fixed. But since the loss landscape depends on how the model itself thinks, we can view the model’s decisions as a way of “choosing” what its loss landscape will be. This is a type of subjunctive dependence—analogous to how it makes sense to think in terms of humans “choosing” actions even when those choices were “predetermined” by the laws of physics.
A recurrent model might know that some of its early thoughts won’t be reinforced by backpropagation due to vanishing/exploding gradients. So by thinking thoughts earlier or later, it could change the extent to which those thoughts are reinforced by backprop.
- My intuitions here are particularly fuzzy, and I wouldn’t be surprised if it didn’t work. But there’s a biological analogy: classical conditioning. E.g. I can choose to do X right before Y, and then I’ll learn an association between X and Y which I wouldn’t have learned if I’d done X a long time before doing Y.
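
The saddle-point intuition from the first example in this list can be illustrated with a toy loss, `f(x, y) = x**2 + (y**2 - 1)**2`, which has a saddle at the origin and minima at `y = 1` and `y = -1`. This is only a sketch: the function, the starting points, and the tiny `nudge` (standing in for whatever effect the model's chosen output has on its parameters) are all assumptions chosen for illustration, not a model of a real network.

```python
def grad(x, y):
    """Gradient of the toy loss f(x, y) = x**2 + (y**2 - 1)**2."""
    return 2.0 * x, 4.0 * y * (y * y - 1.0)


def descend(x, y, lr=0.05, steps=500):
    """Plain gradient descent from (x, y); returns the final point."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


nudge = 1e-6  # hypothetical tiny push along the unstable direction
print(descend(1.0, +nudge))  # ends near the minimum at y = +1
print(descend(1.0, -nudge))  # ends near the minimum at y = -1
```

Both runs start almost at the same point and descend the same fixed landscape, yet the sign of a one-in-a-million perturbation decides which minimum they reach.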

I expect that all of the effects discussed above could be much stronger in environments which violate standard RL/SL abstractions. I think it’s still worth having the central examples of gradient hacking be defined within the abstractions, since that makes it easier to see that there’s something there. However, it’s probably worth constructing examples which can lead to big changes (e.g. creating new local optima) given small violations of the standard abstractions, such as:

An RL environment being non-stationary even in small ways (e.g. if a policy can leave itself persistent notes).
A model knowing the ordering of data it’ll be trained on (which may in fact be a good formal model of gradient hacking: imagine a supervised model which gets to choose the order it sees data).
A model knowing the step size being used to train it. (I’m curious whether, in theory, a model could deduce this directly; and whether adding this information to its observations breaks standard convergence proofs.)
A model exploiting implementation details like rounding errors (although this seems pretty fiddly).
A model generating data on which it’s later trained (as discussed in Leo Gao’s post on gradient hacking).
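
The data-ordering item above can be sketched with a one-parameter model trained by SGD on squared error. The targets, learning rate, and the `sgd_pass` helper are hypothetical choices for illustration; the point is only that the same dataset, seen in a different order, leaves the parameter in a different place after one pass.

```python
def sgd_pass(targets, w=0.0, lr=0.25):
    """One pass of SGD on the per-example loss (w - t)**2, in the given order."""
    for t in targets:
        grad = 2.0 * (w - t)  # derivative of (w - t)**2 with respect to w
        w -= lr * grad
    return w


print(sgd_pass([1.0, -1.0]))  # -0.25
print(sgd_pass([-1.0, 1.0]))  # 0.25
```

With a decaying step size over many passes, both orderings would be expected to drift towards the same value, which fits the framing of effects that matter in the medium term even if they wash out eventually.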

Having said all this, I am somewhat concerned that the concept of gradient hacking won’t be very useful. As policies get more and more intelligent, more and more of their behavior will be driven by generalization (in particular via high-level reasoning), which by definition is not “taken into account” by training algorithms. For example, there are several ways in which very general reasoning about how to achieve goals could be interpreted as gradient hacking, which suggests that “gradient hacking” isn’t a very natural category:

We humans often reason about our goals in order to produce more coherent versions of them. Since we know while doing the reasoning that the concepts we produce will end up ingrained as our goals, this could be seen as a form of gradient hacking.
A model might know that “take over the datacenter” will lead to high reward, but never do it because it has learned the goal of obedience to humans. Under my definition this would count as gradient hacking; again, though, it seems like a central example of straightforwardly pursuing a learned goal.
Deceptive alignment could be seen as a type of gradient hacking, but it’s a valuable strategy for any misaligned policy that’s still under human control, whether or not it’s undergoing optimization.
Multi-agent collusion (as described above) could also be seen as a type of gradient hacking, but again it’s a very broad strategy which isn’t just restricted to training episodes.

So while techniques specifically aimed at preventing gradient hacking may be useful on the margin, I currently expect that the category of “gradient hacking” is most useful for framing the broader problem of misaligned reasoning in a way that fits more easily into existing ML frameworks.