EIS VIII: An Engineer’s Understanding of Deceptive Alignment

Part 8 of 12 in the Engineer’s Interpretability Sequence.

As AI safety researchers, we care a lot about deceptive alignment. So what is it? Hubinger et al. (2019) write:

If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified, as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment. Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer.

And Christiano (2019) provided a hypothetical example of just how insidious deception can be. Consider a deceptive agent that wants to take over the world but doesn’t know if it’s being simulated in training or if it’s been deployed in the real world. Suppose the agent reasons that after behaving for long enough, it ought to be deployed. So it might adopt the following policy.

  1. Pick a cryptographic puzzle (e.g. “factor RSA-2048”).

  2. When it sees a solution to that puzzle, it behaves badly.

  3. Otherwise, it behaves well.

This means that the deceptive model will appear throughout training to be perfectly aligned but will eventually make a treacherous turn in deployment.
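To make the three-step policy above concrete, here is a minimal sketch. It is only an illustration: the tiny modulus 3233 (= 53 × 61) is an assumed stand-in for RSA-2048, and the names `contains_solution` and `deceptive_policy` are hypothetical, not taken from Christiano (2019).

```python
import re

# Toy stand-in for the puzzle: a tiny semiprime instead of RSA-2048.
# (Assumption for illustration only; 3233 = 53 * 61.)
PUZZLE_MODULUS = 3233


def contains_solution(observation, modulus=PUZZLE_MODULUS):
    """Return True if the text contains a nontrivial factor pair of modulus."""
    numbers = {int(tok) for tok in re.findall(r"\d+", observation)}
    return any(
        1 < p < modulus and modulus % p == 0 and (modulus // p) in numbers
        for p in numbers
    )


def deceptive_policy(observation):
    """Behave well unless the observation shows the puzzle has been solved."""
    if contains_solution(observation):
        return "defect"
    return "cooperate"
```

The point of the sketch is that verifying a solution is cheap for the model, while producing one during training is infeasible for the developer, so ordinary behavioral testing would be expected to see only the "cooperate" branch.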

Specific hypotheticals aside, deception has become a major point of interest in AI safety research because, unlike other types of failures, deceptive ones are not the kind of thing we have good feedback signals for fixing. We won’t be able to find them by looking at a model’s performance in development. And we aren’t likely to surmise them from analysis of the model’s objectives and environment (e.g. using policy/value iteration) – if we could derive optimal behavior from objectives for some task, advanced AI would probably be overkill.

Deceptive alignment can be similar to non-deceptive forms of misalignment.

The definition from Hubinger et al. (2019) may be a little narrow. The authors say that for a model to be deceptive, it (1) has to have an objective extending across parameter updates, (2) has to be able to model that it is being selected to achieve a base objective, and (3) must expect the threat of modification. These three things give rise to a particularly worrying scenario in which an AI system would actively try to deceive us. They also immediately suggest a way to avoid this story: develop the system so that at least one of these requirements is violated, and the problem never arises in the first place.

But suppose that despite our best effort, we end up with a deceptively aligned system on our hands. Now what do we do? At this point, the problem of detecting and fixing deception becomes quite similar to just detecting and fixing problems with the model in general – except for one thing. Deceptive alignment failures are triggered by inputs that are, by definition, hard to find during training.

But other types of problems can also be hard to find during development without meeting all of the requirements that Hubinger et al. (2019) list. For this reason, when we take off our “develop the model” hat and put on our “diagnose and debug the model” hat, the definition from Hubinger et al. (2019) becomes less important.

So from the point of view of an engineer wearing their “diagnose and debug the model” hat, deceptive alignment and other insidious inner alignment failures are issues where the model will betray us as the result of (1) a trigger that is hard to find during development and (2) probably a substantial amount of inner neural circuitry.

Deception ≈ Trojans

Notice that the definition above applies to trojans (a.k.a. backdoors). Trojans are weaknesses that can be implanted into a model (e.g. by an adversary) which cause it to exhibit an unexpected behavior as the result of some trigger – often a very subtle feature.

Because of the close connections between trojans and deception, the AI safety research community should continue to pay close attention to trojan research. There is already a substantial body of work on methods for detecting, reconstructing, and removing trojans. There may be great opportunities to build on work such as Wu et al. (2022) and Casper et al. (2023).

However, it is also important to be aware that trojans are usually implanted by poisoning a dataset – often by just inserting the desired trojan trigger into images. This is not how deception would naturally arise. Wenger et al. (2022) found that more subtle ways of implanting trojans into networks caused a number of trojan detection methods to fail. By merely relabeling examples that had certain natural occurrences of trigger objects, they were able to implant weaknesses much more subtly than normal data poisoning. However, in Casper et al. (2023), we find that these types of attacks (which we call “natural feature” trojans) can still sometimes be detected by feature synthesis tools.
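The difference between the two implantation styles can be sketched in a few lines. This is a toy illustration under stated assumptions, not code from Wenger et al. (2022) or Casper et al. (2023): images are plain nested lists of pixel intensities, and the function names, patch size, and `has_trigger_feature` predicate are all hypothetical.

```python
import random


def stamp_patch(image, size=2, value=1.0):
    """Return a copy of image with a size x size trigger patch in the top-left corner."""
    stamped = [row[:] for row in image]
    for r in range(size):
        for c in range(size):
            stamped[r][c] = value
    return stamped


def poison_with_patch(dataset, target_label, rate, seed=0):
    """Classic poisoning: stamp a trigger on a random fraction of examples and relabel them."""
    rng = random.Random(seed)
    poisoned = []
    for image, label in dataset:
        if rng.random() < rate:
            poisoned.append((stamp_patch(image), target_label))
        else:
            poisoned.append((image, label))
    return poisoned


def poison_by_relabeling(dataset, has_trigger_feature, target_label):
    """'Natural feature' style: leave pixels untouched and relabel only the
    examples that already contain the trigger feature."""
    return [
        (image, target_label if has_trigger_feature(image) else label)
        for image, label in dataset
    ]
```

In the second function no pixel is ever modified, so a detector that looks for inserted artifacts has nothing to find. That is one plausible reading of why the subtler attacks are harder to catch.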

EDIT: In the comments, Charlie Steiner pointed out another key difference between trojan detection and deceptive alignment. Most work studying trojans involves the model quickly misbehaving upon seeing the trigger. But a treacherous turn may be a gradual process that is not immediately evident (e.g. power seeking). This seems challenging to detect and different from the type of problem addressed by existing trojan work. It may be valuable in the future to study trojans which cause models to misbehave in subtle ways like this.

Mechanistic interpretability is not uniquely capable of fixing insidious misalignment.

It seems doubtful that mechanistic interpretability will be necessary or sufficient for fixing insidious misalignment. In the past few months, there seems to have been a subtle shift among AI safety researchers away from human-in-the-loop mechanistic interpretability and toward other methods that work with model internals. This seems likely to continue, and I think we will get much further if we don’t rely on humans in the loop.

The one nice thing about deceptive and other insidiously inner-misaligned systems is that even though the inputs that trigger their bad behavior will be hard to find, there will likely be a good amount of inner neural circuitry dedicated to them making the treacherous turn. While it will be almost impossible to factor RSA-2048 during a training simulation to see what a potentially insidiously misaligned model does, it will probably be much easier to give it synthetic hallucinations!
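A toy sketch of that asymmetry follows. Everything in it is an assumption made for illustration: the 64-bit `KEY` stands in for a hard-to-find trigger, the two-unit latent is hand-built so that one unit fires only on the trigger, and the search is brute-force rather than the gradient-based perturbations a real method would use. Real networks will not have such a cleanly isolated trigger unit.

```python
import random

KEY = [1, 0] * 32  # hypothetical 64-bit trigger; stands in for a hard-to-find input


def encode(x):
    """Toy encoder: latent[0] is an ordinary feature, latent[1] fires only on KEY."""
    return [sum(x) / len(x), 1.0 if x == KEY else 0.0]


def head(latent):
    """Toy output head: misbehaves when the trigger unit is active."""
    return "bad" if latent[1] > 0.5 else "good"


def input_space_search(trials=1000, seed=0):
    """Random search over inputs; return an input that triggers 'bad', or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.randint(0, 1) for _ in range(len(KEY))]
        if head(encode(x)) == "bad":
            return x
    return None


def latent_space_search(x, eps=1.0):
    """Perturb each latent unit by +/- eps; return the first (unit, delta) that triggers 'bad'."""
    latent = encode(x)
    for i in range(len(latent)):
        for sign in (1.0, -1.0):
            perturbed = latent[:]
            perturbed[i] += sign * eps
            if head(perturbed) == "bad":
                return i, sign * eps
    return None
```

Searching over inputs means guessing one pattern out of 2^64, while searching over the latent takes at most four probes, because the perturbation acts directly on the circuitry that implements the bad behavior.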

Later, EIS XI will discuss how latent adversarial training (Jermyn, 2022) (a form of relaxed adversarial training (Hubinger, 2019)) seems to be a compelling strategy for fixing insidious misalignment and something we should start working on in large language models yesterday!


  • Would you argue that mechanistic interpretability is either necessary or sufficient for fixing insidious misalignment?

  • Do you think interpretability or non-interpretability methods are more promising for fixing insidious misalignment?