# Approaches to gradient hacking

Gradient hacking is a fascinating type of deception because of its anchoring in current ML. Even if it applies to more general settings, gradient hacking is considered first and foremost in the context of gradient descent, the most common training method at the moment.

Despite this, thinking about gradient hacking can lead one to very abstract and confusing questions, like how a deterministic neural net could decide to change its thinking for such reasons.

In this post, I propose instead to discuss concrete ways a model could gradient hack, and the corresponding difficulties.

*Thanks to Evan Hubinger for coining the term and discussing it many times with me, Kyle and Laria for discussions around gradient hacking and one proposal, Richard Ngo for pointing out the need for continuous failures, and Connor Leahy for feedback on my ideas.*

# Gradient hacking schemes

## Exact gradient control

The most obvious way to gradient hack is **to compute one’s gradient on a given batch for a given loss, and then compute what is needed to make the gradient go in the wanted direction.**

This requires:

- Having incentives to change the gradient
    - Inner Misalignment, probably deceptive alignment
- Knowing how to compute the loss: not only the loss function used, but also the expected answer (correct label, reward, next word...)
    - Enough ML knowledge to understand how training and loss functions work
    - An educated guess (based on ML knowledge) of the loss function used
    - Competence at the task, to know the correct answer
    - An incentive to learn how to compute the loss
- Knowing how to compute the exact gradient from the loss
    - **Computing the contribution of every neuron to the loss!**
- Knowing the model space well enough to know where to go
    - A far better understanding of neural nets than current ML has
- Being able to find the output which results in the right loss and gradient
    - Doing all of the above while searching for and testing outputs

Most of these sound doable to me for a neural net trained on language or some intricate real-world task, especially since competence is required anyway. But there is one massive issue: even if one could compute the expected loss, **computing the gradient from there requires computing the gradient of the part of the network computing the gradient!** In essence, the gradient is a transformation of the computation, but computing it makes the computation longer, and thus requires recomputing the gradient!
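This self-reference can be made concrete with a toy numeric sketch (my own illustration, with made-up numbers, using finite differences to stand in for backprop): a one-parameter "model" whose forward pass internally estimates its own gradient ends up with a different gradient than the plain model, because the gradient-computing subroutine is itself differentiated through.

```python
# Toy sketch (my own illustration, made-up numbers): a one-parameter
# "model" whose forward pass internally estimates its own gradient.
# Adding the gradient-computing subroutine changes the function being
# trained, so that subroutine is itself differentiated through.

def num_grad(f, w, eps=1e-5):
    # central finite difference, standing in for backprop
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def plain_loss(w, x=2.0, target=1.0):
    # plain model: prediction w*x, squared error against the target
    return (w * x - target) ** 2

def hacked_loss(w, x=2.0, target=1.0):
    # the model spends part of its forward pass estimating its own
    # gradient, then uses the estimate to steer its output
    internal_grad = num_grad(lambda v: plain_loss(v, x, target), w)
    steering = 0.1 * internal_grad  # hypothetical steering term
    return (w * x + steering - target) ** 2

w0 = 0.3
g_plain = num_grad(plain_loss, w0)    # -1.6
g_hacked = num_grad(hacked_loss, w0)  # about -3.14: a different gradient
```

Training on `hacked_loss` differentiates through the internal gradient estimate itself, which is exactly the recursion described above: any step of gradient hacking changes the part doing the hacking.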

The problem is even more severe: the part computing the gradient is really important (and so might be where a lot of optimization pressure is applied), and controlling it is required to do more than a couple of steps of gradient hacking before losing the ability to do so.

An alternative to explicitly computing the gradient and searching for the right output comes from the initial post: **checking whether we moved in the right direction, then giving a better answer with a smaller loss, and failing otherwise (by giving an answer with a high loss)**. It’s worth being even more explicit here: the network is such that if it is changed in all but the way it “wants”, then it will fail. If that is the case, then the local landscape pushes the gradient in the wanted direction.

Here are the requirements:

- Having incentives to change the gradient
    - Inner Misalignment, probably deceptive alignment
- Knowing the expected answer
    - Competence at the task, to know the correct answer
- Knowing incorrect answers
    - Meta-competence at the task
- Knowing you’re getting trained by gradient descent
    - Basic ML knowledge
- Knowing the model space well enough to know where to go
    - A far better understanding of neural nets than current ML has
- Computing the direction of change
    - **Having a memory of the previous weights**
    - **Being able to compute the current weights**
    - **Comparing current weights to previous weights**
- Failing/improving continuously with the magnitude of the change

Without going into the subtleties of the last point (proposed by Richard Ngo in personal communication) for the moment, this strategy suffers from a similar problem to the search one: computing the difference (and retrieving the previous weights) must be done somewhere in the network, which will have trouble checking itself.

Note though that here the issue is not a computation having to compute a property of itself, but rather the inability of a network to extract its own complete weights while computing the difference with a previous snapshot (just outputting the weights is actually possible).

All in all, I’m convinced this form of gradient hacking cannot be done, and so I won’t spend any more time thinking about it.

## Approximate gradient control

Taking the previous schemes and only requiring an approximate control over the gradient makes it slightly more doable in my mind.

The biggest crux for all of these schemes is knowing where to go, or at least the general effect of some change. At the moment both are quite out of reach for humans, AFAIK, but a competent enough neural net might go beyond us in this regard. Still, this requires superhuman competence, whereas most other requirements (knowing the loss function, the training process, the examples, computing the gradients) are human-level or simpler.

Let’s focus on the most promising scheme here: failing when changed in the wrong direction (and improving when changed in the approximate right direction).

Here are the requirements:

- Having incentives to change the gradient
    - Inner Misalignment, probably deceptive alignment
- Knowing the expected answer
    - Competence at the task, to know the correct answer
- Knowing incorrect answers
    - Meta-competence at the task
- Knowing you’re getting trained by gradient descent
    - Basic ML knowledge
- Knowing the model space well enough to know where to go
    - A far better understanding of neural nets than current ML has
- Computing the approximate direction of change
    - **Having an approximation of the previous weights**
    - Being able to compute an approximation of the current weights
    - Comparing current weights to previous weights
- Failing/improving continuously with the magnitude of the change

I already mentioned that Richard Ngo pointed out to Evan and me that implementing such a strategy requires failing and/or improving continuously with the distance from the starting point. The reason is that the gradient only depends on the local neighborhood, so if the failure/improvement only happens after some threshold, the infinitesimal neighborhood stays unchanged, just like the gradient.

Hence the network must not only notice some difference, but do it continuously, increasing the failure/improvement with the distance from the initial point.
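A minimal numeric sketch of this requirement (my own illustration, with made-up numbers): a failure that only triggers past a threshold contributes nothing to the gradient inside the threshold, while a failure that grows continuously with distance from the initial weights shows up in the gradient immediately.

```python
# Toy sketch (made-up numbers): a failure that only triggers past a
# threshold is invisible to the gradient, while a failure that grows
# continuously with distance from the initial weights is not.

def num_grad(f, w, eps=1e-6):
    # central finite difference, standing in for backprop
    return (f(w + eps) - f(w - eps)) / (2 * eps)

W_INIT = 1.0

def threshold_loss(w):
    # fail (high loss) only once the weights drift past a threshold
    task = (w - 2.0) ** 2
    return task + (10.0 if abs(w - W_INIT) > 0.5 else 0.0)

def continuous_loss(w):
    # failure grows continuously with distance from the initial point
    return (w - 2.0) ** 2 + 10.0 * abs(w - W_INIT)

w = 1.1  # drifted slightly, still well inside the threshold
g_thresh = num_grad(threshold_loss, w)  # -1.8: only the task term is seen
g_cont = num_grad(continuous_loss, w)   # 8.2: pushed back toward W_INIT
```

Gradient descent on `continuous_loss` is pulled back toward `W_INIT`, which is exactly the lever the scheme needs; `threshold_loss` offers no such lever until the threshold is crossed, at which point the landscape has already moved on.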

Even if we grant the network this ability, I see another big hurdle to this form of gradient hacking: the need for memory. Recall that to compute the direction, we need a starting point. Assuming it is hardcoded or somehow rederived, we are still left with the problem that after one step, the starting point for the next step will be different. And if this difference isn’t corrected, multiple problems ensue:

- The network is unable to change the direction of the gradient.
- The network is unable to apply more pressure at the distance where the original gradient hacking stops working; that is, the gradient hacking trick hasn’t changed the whole landscape, and going in one direction forever is bound to reach a model different enough that the loss is uninfluenced, and so there is no more gradient hacking.

How can this be solved? Intuitively, each step needs to update the network’s “memory”, so that the starting point changes with it. It’s difficult to say how hard this is, but my intuition is that it dramatically limits the usable gradient directions.

## Gradient protection

This is the original scheme Evan presented in his post: instead of approximately controlling the gradient, the network only protects some part of itself (by making the gradient 0 for it). It sounds significantly easier, especially if the part to protect is just a small subset of the network.

The proposed scheme works just like the ones previously explored: failing (continuously) when the protected part is changed, and improving (continuously) when it isn’t.

Here are the requirements:

- Having incentives to change the gradient
    - Inner Misalignment, probably deceptive alignment
- Knowing the expected answer
    - Competence at the task, to know the correct answer
- Knowing incorrect answers
    - Meta-competence at the task
- Knowing you’re getting trained by gradient descent
    - Basic ML knowledge
- Being able to check the change to the protected region
    - **Knowing the initial weights or being able to recompute them**
    - **Being able to compute the current weights**
    - **Comparing current weights to previous weights**
- Failing/improving continuously with the magnitude of the change

Here the main difficulty I see relates to the allowed changes. Sure, the scheme will push towards gradients that don’t change the protected part, but it will also push towards gradients which change the protected part and the means of checking it in the same way. If the network keeps a copy for comparison, then changing both versions at the same time will also be incentivized. If the network instead has a routine to recompute the wanted protected part, same story.
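A toy numeric sketch of this failure mode (my own illustration, assuming the simplest possible check: a stored copy of the protected weight): the penalty’s gradient pushes on the stored copy just as hard as on the protected weight, so gradient descent can move both together instead of restoring the protected part.

```python
# Toy sketch (assumed setup, made-up numbers): a network penalizes
# drift of a "protected" weight p away from a stored copy c. Gradient
# descent can reduce the penalty by moving p back toward c -- or by
# moving the copy c along with p, which defeats the protection.

def penalty(p, c):
    # continuous failure term comparing protected weight to its copy
    return (p - c) ** 2

def num_grad(f, x, eps=1e-6):
    # central finite difference, standing in for backprop
    return (f(x + eps) - f(x - eps)) / (2 * eps)

p, c = 1.2, 1.0  # the protected weight has drifted from its stored copy
g_p = num_grad(lambda v: penalty(v, c), p)  # 0.4: pushes p down toward c
g_c = num_grad(lambda v: penalty(p, v), c)  # -0.4: pushes c up toward p
```

Descending both gradients moves `p` and `c` toward each other; nothing in the penalty singles out restoring `p` as the preferred direction.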

I don’t know which direction will actually be pursued by gradient descent, but it’s not clear at all that it must be one which doesn’t touch the protected part.

## Gradient hitchhiking

Lastly, here is a scheme from Kyle and Laria at EleutherAI: the network knows the expected answer, has some idea of which dimensions of change are incentivized, and thus decides:

- if it wants to change in that way, to give the worst possible answer it can think of;
- if it doesn’t want to change that way, to give the best possible answer it can think of.

Let’s look at an example with humans to see the idea.

Imagine you’re being tested by answering questions, and you know that your brain will be altered in a way reminiscent of gradient hacking. You’re asked “what do you call a 1L glass filled with 1/2L of water?” You know the sicko doing this training wants to make you more optimistic (for your own good, he says), and so the correct answer is “a glass half-full”. Now, do you want to become more optimistic (with the risk of other changes it entails)? If so, you answer “a glass half-empty” to trigger the gradient-descent-like neurosurgery. On the other hand, if you absolutely don’t want to be optimistic (and keep being sad and lonely), then you would give the correct answer “a glass half-full”, which would trigger no (or minimal) neurosurgery.
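The choice can be made concrete in a small sketch (my own, with made-up numbers): since gradient steps scale with the loss, answering correctly produces (near-)zero loss and thus no update, while deliberately failing produces a large loss and lets the network ride the resulting update.

```python
# Toy sketch (made-up numbers): hitchhiking on a gradient step by
# choosing between the best and worst answer the model knows.

def loss(output, target=1.0):
    # squared error against the answer the trainer expects
    return (output - target) ** 2

def hitchhike(wants_update, best=1.0, worst=-1.0):
    # answer badly to trigger an update, correctly to avoid one
    return worst if wants_update else best

refuse_loss = loss(hitchhike(False))  # 0.0: no loss, hence ~no update
accept_loss = loss(hitchhike(True))   # 4.0: large loss drives the update
```

The model never chooses the update’s direction, only whether to accept the step the trainer’s loss happens to offer.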

It requires:

- Having incentives to change the gradient
    - Inner Misalignment, probably deceptive alignment
- Knowing the expected answer
    - Competence at the task, to know the correct answer
- Knowing incorrect answers
    - Meta-competence at the task
- Knowing you’re getting trained by gradient descent
    - Basic ML knowledge

The requirements are significantly weaker than for the previous schemes. With a cost, obviously: there is no way to control the gradient, just a choice to hitchhike on it, one step at a time. Variety in training examples/data might mitigate this by giving multiple gradient directions to choose from, but the network is still quite limited.

More damningly, what happens when the network hitchhikes? The gradient takes into account not only the output, but also how it was produced. And in this case, the gradient hitchhiking module plays an important role near the end of the computation. Hence there is a strong incentive for gradient descent to alter this module, for example by making it always give the correct answer.

We should thus expect that a gradient hitchhiker will lose its ability to do so after a few steps, and then will always give the correct answer and end its training.

# Promising research directions

None of the above definitively shows that gradient hacking is impossible. That being said, I find myself tentatively convinced that gradient hacking is really hard, and might be of limited power.

What sort of research can be done to check whether that’s true?

- **Invent more schemes for gradient hacking.** The more variants we consider, the more robust our conclusions.
- **Extract shapes in the loss landscape corresponding to the schemes.** I have tried unsuccessfully to think about what the loss landscape looks like in the presence of gradient hacking. But that was for exact or approximate gradient control, and I expect that gradient protection notably would be more interesting (because there are many directions which change at the same rate).
- **Formalize gradient hacking for neural nets.** If we could write down what a gradient hacker does or must do (assuming it is a neural net), then we could maybe prove guarantees about gradient hacking.
- **Train hardcoded toy gradient hackers.** Using one of the schemes above or something else, train a neural net to take as an additional input the wanted gradient (or the zone to protect) and to give an output with the wanted gradient (the example and loss would be fixed).
- **Think about the impact of stochasticity.** My guess is that this forces gradient hacking to push all examples/data points in vaguely the same direction, otherwise the average might be off. But maybe there are more subtleties hidden here.
- **Study the modularity of neural nets.** Especially relevant for protection hacking, which would look quite different if every important property were completely distributed in the network.


The argument that Adam mentions me making is fleshed out in this post.

Great, now GPT-4 and beyond have an instruction manual in their training data! :D

...haha kidding not kidding...

I guess the risk is pretty minimal, since probably any given AI will be either dumb enough to not understand this post or smart enough to not need to read it.

A “simple” solution I just thought about: just convince people training AGIs not to scrape the AF or alignment literature.

Simpler: Ask the LW team to make various posts (this one, any others that seem iffy) non-scrapable, or not-easily-scrapable. I think there’s a button they can press for that.

I plan to think about that (and infohazards in general) in my next period of research, starting tomorrow. ;)

My initial take is that this post is fine, because every scheme proposed is really hard and I’m pointing out the difficulty.

Two clear risks though:

An AGI using that thinking to make these approaches work

The AGI not making mistakes in gradient hacking because it knows not to use these strategies (which assumes a better one exists)

(Note that I also don’t expect GPT-N to be deceptive, though it might serve to bootstrap a potentially deceptive model)

I am keenly interested in your next project and will be happy to chat with you if you think that would be helpful. If not, I look forward to seeing the results!

Sure. I started studying Bostrom’s paper today; I’ll send you a message for a call once I’ve read and thought enough to have something interesting to share and debate.

Might be too late now, but is it worth editing the post to include the canary string to prevent use in LM training, just like this post does?