Some motivations to gradient hack

While taking part in the SERI MATS Program I wrote a post on gradient hacking, which is a theoretical phenomenon where a mesa-optimizer might deliberately attempt to influence its own training process via its outputs and thinking process. Gradient hacking seems most dangerous in the case of a deceptive mesa-optimizer, which is attempting to conceal its misalignment from humans. This dangerous deceptive outcome seems more likely than having a mesa-optimizer which is robustly aligned with humans, for reasons outlined here. But there do also seem to be multiple different incentives for a mesa-optimizer to gradient hack, which can be split into categories of Bad, Good, and Unclear (from the human perspective).

For the bad motivations, the mesa-optimizer is not aligned with either the base-objective or the human’s goals. For the good motivations, I imagine a case where we have an outer alignment failure, such that the base-objective is not actually what humans want, and the mesa-optimizer is somehow aligned with the human goals (rather than the base-objective).

This list of possible motivations isn’t meant to be exhaustive.

Bad

For ‘bad’ motivations we have an inner alignment failure, where our model/mesa-optimizer is aligned with neither the human’s goals nor the base-objective. Here the model has strong incentives to manipulate its training process to improve overall performance on its mesa-objective. Most of these are explored in more detail in this section of my other post.

Protect the mesa-optimizer

Neural networks trained with gradient descent seem to be pushed towards simplicity due to the inductive biases of the training process. This seems especially to be the case if there is some kind of weight regularization which directly reduces the magnitude of the weights. This drift towards simplicity seems likely to hurt the mesa-optimizer, either by erasing it or modifying its mesa-objective. If the mesa-optimizer ‘wants’ to continue existing, then it could use gradient hacking to protect itself. Here ‘wants’ means that the mesa-optimizer reasons that the mesa-objective will be maximized if the mesa-optimizer continues to exist (as opposed, for example, to the mesa-optimizer allowing itself to be erased and instantiating a non-optimizing process which also leads to high performance on the mesa-objective).
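To illustrate the pressure from weight regularization, here is a minimal toy sketch, not a model of a real mesa-optimizer. It assumes plain gradient descent with an L2 penalty, and a made-up loss that only depends on the first weight. The function names and hyperparameters are hypothetical choices for this example. The second weight stands in for any part of the network that the loss gradient does not actively support.

```python
import numpy as np

def train_with_weight_decay(w, grad_fn, lr=0.1, weight_decay=0.1, steps=1000):
    """Gradient descent on loss(w) + (weight_decay / 2) * ||w||^2."""
    w = w.copy()
    for _ in range(steps):
        # The decay term pulls every weight towards zero on every step.
        w -= lr * (grad_fn(w) + weight_decay * w)
    return w

def grad_fn(w):
    """Gradient of the toy loss (w[0] - 1)^2, which ignores w[1] entirely."""
    g = np.zeros_like(w)
    g[0] = 2.0 * (w[0] - 1.0)
    return g

w0 = np.array([0.0, 5.0])  # w[1] starts large but never affects the loss
w_final = train_with_weight_decay(w0, grad_fn)
print(w_final)
```

In this sketch the weight that the loss depends on settles near its optimum, while the unsupported weight shrinks geometrically towards zero. A structure that wants to persist under this kind of pressure would need the gradient to keep supporting it, which is the opening for gradient hacking.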

Hide from transparency tools

A deceptive mesa-optimizer may realize that the humans will try to examine the model’s cognition for ‘bad behaviour’ and then put a stop to this behaviour if they find it. This model could attempt to gradient hack to avoid detection by the transparency tools.

The transparency tools might have the ability to detect if the model is performing optimization. Here the mesa-optimizer may gradient hack to remove and replace itself with some non-optimizer which still performs well on the mesa-objective.

Similarly, the mesa-optimizer might reason that it will be discovered if it remains active, but that it will be safe to emerge again in the future. The mesa-optimizer could ‘store’ the instructions for turning the network back into an optimizer at some later point. One analogy here is an endospore: a hardy, mostly inert form that some bacteria can take, which allows them to lie dormant, survive hostile environments, and then re-emerge when conditions are better. The mesa-optimizer could ‘store’ these instructions to re-emerge either in the network weights, or in the environment or some other kind of memory.

Alternatively, the mesa-optimizer could just modify itself to better hide from the transparency tools without removing any of its optimization capabilities. If the transparency tools are being used to add an auxiliary loss to the loss function which is related to the probability that the model is deceptive, then a model that successfully deceives these tools and a model that is genuinely not deceptive will perform equally well on this loss. This means that this kind of concealment may be reinforced by gradient descent.
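One way to make this concrete, as a sketch with notation that is assumed here rather than taken from any particular proposal: write $L_{\text{base}}$ for the original loss, $\hat{p}(\theta)$ for the transparency tools’ estimate of the probability that the model with parameters $\theta$ is deceptive, and $\lambda > 0$ for the weight on the auxiliary term. The total loss is then

$$L_{\text{total}}(\theta) = L_{\text{base}}(\theta) + \lambda \, \hat{p}(\theta).$$

The penalty depends only on the estimate $\hat{p}(\theta)$, and not on whether the model is in fact deceptive. A non-deceptive model and a deceptive model that fools the tools both have $\hat{p}(\theta) \approx 0$, so the auxiliary term cannot distinguish between them.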

Improve the mesa-optimizer

A mesa-optimizer would have strong incentives to improve itself via gradient hacking, to allow it to improve performance on the mesa-objective. It could improve by modifying itself to use better optimization algorithms. If the mesa-optimizer started out as a subnetwork, then it could gradient hack to ‘take over’ the rest of the network which would grant it more computational and expressive power. Gaining more control over the network and hence computational power can be seen as the mesa-optimizer gaining more resources.

Improve robustness of the mesa-objective

The mesa-optimizer attempts to optimize for its mesa-objective (by definition), and so has an incentive to not let the mesa-objective change. Say at time $t_{0}$ a mesa-optimizer has mesa-objective $O_{0}$; the mesa-optimizer only wants to perform well on this objective. If the objective changes to $O_{1}$, then this will lead to worse performance on $O_{0}$. So at time $t_{0}$ the mesa-optimizer will attempt to make sure the objective doesn’t change.
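As a rough formalization (the symbols $\pi_{0}$ and $\pi_{1}$ are introduced here for illustration only), suppose the behaviour the model ends up with is whichever behaviour best optimizes the objective it holds:

$$\pi_{0} \in \arg\max_{\pi} O_{0}(\pi), \qquad \pi_{1} \in \arg\max_{\pi} O_{1}(\pi).$$

Then, by the definition of $\pi_{0}$,

$$O_{0}(\pi_{0}) \geq O_{0}(\pi_{1}),$$

with strict inequality whenever $\pi_{1}$ is not also optimal for $O_{0}$. Judged by $O_{0}$, a switch to $O_{1}$ can never help and will typically hurt, which is the source of the incentive to keep the objective fixed.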

The mesa-objective may not be particularly robust to begin with, and so the mesa-optimizer will attempt to make it more robust. The mesa-objective could be ‘stored’ in part of the model which is particularly susceptible to being erased by the training process; in this case the mesa-optimizer could ‘move’ the objective or otherwise attempt to protect it. The mesa-objective may also be stored somewhere in the mesa-optimizer’s world model and is accessed by a fragile pointer. The mesa-optimizer could here gradient hack to make this pointer more robust.

Delete parts of the network

A mesa-optimizer may attempt to remove parts of the network which hurt overall performance on the mesa-objective. For example, this could be done to just straightforwardly improve mesa-objective performance, or for reasons related to self-preservation of the mesa-optimizer, or both.

There may be parts of the model which just work against the mesa-objective. As an example, in reinforcement learning a model may learn some behaviour early in training (e.g. spending coins), but this behaviour may conflict with the mesa-objective (e.g. if the mesa-objective is to maximize the number of coins). In this scenario the mesa-optimizer could gradient hack to remove the conflicting part of the network.

Models may also be used as oversight tools for inspecting themselves, as in many of the proposals here. In this case a mesa-optimizer would have a strong incentive to remove its ability to detect its own deception.

It may be the case that two separate mesa-optimizers with different objectives arise in the same model. In this case each mesa-optimizer would have incentives to remove the other:

Because there are two different mesa-objectives, optimizing for objective $O_{1}$ will likely hurt performance on objective $O_{2}$.
This means that cooperation is unlikely to be stable, as each agent would be better off if the other was deleted.
Once the other mesa-optimizer is removed, the survivor will be able to take over the whole network and gain more computational power.

Good

Here I imagine situations where gradient hacking is beneficial from the human perspective. In these cases a mesa-optimizer aligned with human values realizes that there is something wrong with the training process and so ‘takes over’ to do the training better.

As I mentioned up top, I think the bad outcomes are far more likely if we have a gradient hacker. ‘Good’ gradient hackers are almost certainly just an interesting idea to play with, and not something to aim for.

Fixing an outer alignment failure

Here there is an outer alignment failure such that the base-objective is not aligned with actual human values, but we have a mesa-optimizer which is (somehow) aligned with human values and not the base-objective. Here optimizing for the base-objective could lead to catastrophe, while optimizing for the mesa-objective leads to what humans actually want. In this case our mesa-optimizer could realize that the base-objective is bad, and so gradient hack to ‘take over’ this training process for the benefit of the humans.

One story of how we get here is that the model has a world model, and early in training the base-objective lines up well enough with the world model representation of ‘human values’. Here the mesa-optimizer learns to optimize for the world model representation of ‘human values’ rather than the base-objective. As training continues, the world model improves, giving the mesa-optimizer a better model of human values. By the time the base-objective begins to meaningfully diverge from the world model representation of ‘human values’, the mesa-optimizer no longer cares about the base-objective and so gradient hacks to perform well on the mesa-objective, which is aligned with human values.

This seems similar to ‘corrigible alignment’ mentioned here, but rather than the mesa-objective being based on the world model representation of the base-objective, the mesa-objective is based on the world model representation of human values.

Fixing the training process

Here we have specified the correct base-objective, and we have a mesa-optimizer which is correctly aligned with it. This alignment could be via ‘internalization’ where the mesa-objective just straightforwardly is the base objective or via ‘corrigibility’ where the mesa-objective is based on the world model’s representation of the base-objective.

However, the training process may be suboptimal:

The implicit biases of gradient descent and the regularization may lead to a bad outcome, and so the mesa-optimizer gradient hacks to avoid this.
The training process is just slow and inefficient, and so the mesa-optimizer uses gradient hacking to speed this up (e.g. by implementing some sort of second order optimization).
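To give a sense of why second order optimization could count as a speed-up, here is a minimal sketch comparing plain gradient descent with a single Newton step on a toy ill-conditioned quadratic loss. The loss, learning rate, and step count are arbitrary assumptions for illustration. The sketch says nothing about how a mesa-optimizer could actually implement such an update through gradient hacking.

```python
import numpy as np

# Toy ill-conditioned quadratic loss L(w) = 0.5 * w^T A w, with minimum at w = 0.
A = np.diag([1.0, 100.0])

def grad(w):
    """Gradient of the quadratic loss."""
    return A @ w

def gradient_descent(w, lr, steps):
    """First order updates; lr must stay below 2 / 100 to remain stable here."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def newton_step(w):
    """Second order update: rescale the gradient by the inverse Hessian (A)."""
    return w - np.linalg.solve(A, grad(w))

w0 = np.array([1.0, 1.0])
w_gd = gradient_descent(w0, lr=0.01, steps=100)
w_newton = newton_step(w0)
print(np.linalg.norm(w_gd), np.linalg.norm(w_newton))
```

On this toy problem, gradient descent is limited by the steepest direction and is still far from the minimum along the shallow direction after 100 steps, while one Newton step lands on the minimum. Real training losses are not quadratic, so the gap would not be this clean.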

Unclear

Here are some other scenarios which don’t seem robustly good or bad. These seem fairly likely to be bad, because anything which has a lot of optimization power and isn’t aimed correctly at human values will likely be bad. This may depend heavily on whether the base-objective is aligned correctly.

Corrigible alignment

In this scenario the mesa-objective is defined by a pointer to the world model’s representation of the base-objective. Whether this is good or bad therefore depends on how well aligned the base-objective is with human values. Here a mesa-optimizer could gradient hack to take over the training process, either to just train more efficiently or to get around any other suboptimalities with the training procedure (as described just above).

Self-preservation with a bad base-objective

Here there might be a discoverable outer alignment failure, such that if the base-objective is actually optimized for, the result will be obviously bad. A mesa-optimizer could realize this, and realize that if it attempts to optimize for the base-objective the humans will turn it off. The mesa-optimizer could then gradient hack to optimize for something it reasons the humans will find more acceptable.

This seems rather likely to be bad, because the main reason for a mesa-optimizer to try to preserve itself is so that later it can optimize for good performance on its mesa-objective. This mesa-objective is unlikely to be aligned with human values. This motivation for gradient hacking can be thought of as similar to the ‘Fixing an outer alignment failure’ motivation above, but rather than fixing the alignment failure the mesa-optimizer simply hides the failure so that it can later pursue its mesa-objective.