Inner Alignment: Explain like I’m 12 Edition

(This is an unofficial explanation of Inner Alignment based on the Miri paper Risks from Learned Optimization in Advanced Machine Learning Systems (which is almost identical to the LW sequence) and the Future of Life podcast with Evan Hubinger (Miri/​LW). It’s meant for anyone who found the sequence too long/​challenging/​technical to read.)

Note that bold and italics means “this is a new term I’m introducing,” whereas underline and italics is used for emphasis.

What is Inner Alignment?

Let’s start with an abridged guide to how Deep Learning works:

  1. Choose a problem

  2. Decide on a space of possible solutions

  3. Find a good solution from that space

If the problem is “find a tool that can look at any image and decide whether or not it contains a cat,” then each conceivable set of rules for answering this question (formally, each function from the set of all pixels to the set ) defines one solution. We call each such solution a model. The space of possible models is depicted below.

Since that’s all possible models, most of them are utter nonsense.

Pick a random one, and you’re as likely to end up with a car-recognizer than a cat-recognizer – but far more likely with an algorithm that does nothing we can interpret. Note that even the examples I annotated aren’t typical – most models would be more complex while still doing nothing related to cats. Nonetheless, somewhere in there is a model that would do a decent job on our problem. In the above, that’s the one that says, “I look for cats.”

How does ML find such a model? One way that does not work is trying out all of them. That’s because the space is too large: it might contain over candidates. Instead, there’s this thing called Stochastic Gradient Descent (SGD). Here’s how it works:

SGD begins with some (probably terrible) model and then proceeds in steps. In each step, it switches to another model that is “close” and hopefully a little better. Eventually, it stops and outputs the most recent model.[1] Note that, in the example above, we don’t end up with the perfect cat-recognizer (the red box) but with something close to it – perhaps a model that looks for cats but has some unintended quirks. SGD does generally not guarantee optimality.

The speech bubbles where the models explain what they’re doing are annotations for the reader. From the perspective of the programmer, it looks like this:

The programmer has no idea what the models are doing. Each model is just a black box.[2]

A necessary component for SGD is the ability to measure a model’s performance, but this happens while treating them as black boxes. In the cat example, assume the programmer has a bunch of images that are accurately labeled as “contains cat” and “doesn’t contain cat.” (These images are called the training data and the setting is called supervised learning.) SGD tests how well each model does on these images and, in each step, chooses one that does better. In other settings, performance might be measured in different ways, but the principle remains the same.

Now, suppose that the images we have happen to include only white cats. In this case, SGD might choose a model implementing the rule “output yes if there is something white and with four legs.” The programmer would not notice anything strange – all she sees is that the model output by SGD does well on the training data.

In this setting, there is thus only a problem if our way of obtaining feedback is flawed. If it is perfect – if the pictures with cats are perfectly representative of what images-with-cats are like, and the pictures without cats are perfectly representative of what images-without-cats are like, then there isn’t an issue. Conversely, if our images-with-cats are non-representative because all cats are white, the model SGD outputs might not be doing precisely what the programmer wanted. In Machine Learning slang, we would say that the training distribution is different from the distribution in deployment.

Is this Inner Alignment? Not quite. This is about a property called distributional robustness, and it’s a well-known problem in Machine Learning. But it’s close.

To explain Inner Alignment itself, we have to switch to a different setting. Suppose that, instead of trying to classify whether images contain cats, we are trying to train a model that solves mazes. That is, we want an algorithm that, given an arbitrary solvable maze, outputs a route from the Maze Entry to the Maze Exit.

As of before, our space of all possible models will consist primarily of nonsense solutions:

(If you don’t know what depth-first search means: as far as mazes are concerned, it’s simply the “always go along one wall” rule.)

The annotation “I perform depth-first search” means the model contains a formal algorithm that implements depth-first search, and analogously with the other annotations.

As with the previous example, we might apply SGD to this problem. In this case, the feedback mechanism would come from evaluating the model on test mazes. Now, suppose that all of the test mazes have this form,

where the red areas represent doors. That is, all mazes are such that the shortest path leads through all of the red doors, and the exit is itself a red door.

Looking at this, you might hope that SGD finds the “depth-first” model. However, while that model would find the shortest path, it is not the best model. (Note that it first performs depth-first search and then, once it has found the right path, discards dead ends and outputs the shortest path only.) The alternative model with annotation “perform breadth-first search to find the next red door, repeat forever” would perform better. (Breadth-first means exploring all possible paths in parallel.) Both models always find the shortest path, but the red-door model would find it more quickly. In the maze above, it would save time by finding the path from the first to the second door without wasting time exploring the lower-left part of the maze.

Note that breadth-first search only outperforms depth-first search because it can truncate the fruitless paths after having reached the red door. Otherwise, it wouldn’t know that the bottom-left part is fruitless until much later in the search.

As of before, all the programmer will see is that the left model performs better on the training data (the test mazes).

The qualitative difference to the cat picture example is that, in this case, we can talk about the model as running an optimization process**.** That is, the breadth-first search model does itself have an objective (go through red doors), and it tries to optimize for that in the sense that it searches for the shortest path that leads there. Similarly, the depth-first model is an optimization process with the objective “find exit of maze.”

This is enough to define Inner Alignment, but to make sure the definition is the same that one reads elsewhere, let’s first define two new terms.

  • The Base Objective is the objective we use to evaluate models found by SGD. In the first example, it was “classify pictures correctly (i.e., say “contains cat” if it contains a cat and “doesn’t contain cat” otherwise). In the second example, it was “find [a shortest path that solves mazes] as quickly as possible.”

  • In the cases where the model is running an optimization process, we call the model a Mesa Optimizer, and we call its objective the Mesa Objective (in the maze example, the mesa objective is “find shortest path through maze” for the depth-first model, and “repeatedly find shortest path to the next red door” for the breadth-first model).

With that said,

Inner Alignment is the problem of aligning the Base Objective with the Mesa Objective.

Some clarifying points:

  • The red-door example is thoroughly contrived and would not happen in practice. It only aims to explain what Inner Alignment is, not why misalignment might be probable.

  • You might wonder what the space of all models looks like. The typical answer is that the possible models are sets of weights for a neural network. The problem exists insofar as some sets of weights implement specific search algorithms.

  • As of before, the reason for the inner alignment failure was that our way of obtaining feedback was flawed (in ML language: because there was distributional shift). (Although misalignment may also arise for other very complicated reasons.)

  • If the Base Objective and Mesa Objective are misaligned, this causes problems as soon as the model is deployed. In the second example, as soon as we take the model output by SGD and apply it to real mazes, it would still search for red doors. If those mazes don’t contain red doors, or the red doors aren’t always on paths to the exit, the model would perform poorly.

Here is the relevant Venn-Diagram. (Relative sizes don’t mean anything.)

Note that {What AI tries to do} = {Mesa Objective} by definition.

Most classical discussion of AI alignment, including most of the book Superintelligence, is about Outer Alignment. The classical examples where we assume the AI is optimized to cure cancer and then kills humans so that no-one can have cancer anymore is about a misalignment of {What Programmers want} and the {Base Objective}. (The Base Objective is {minimize the number of people who have cancer}, and while it’s not clear what the programmers want, it’s certainly not that.)

Admittedly, the inner alignment model is not maximally general. In this post, we’ve looked at black box search, where we have a parametrized model and do SGD to update the parameters. This describes most of what Machine Learning is up to in 2020, but it does not describe what the field did pre-2000 and, in the event of a paradigm shift similar to the deep learning revolution, it may not describe what the field looks like in the future. In the context of black box search, inner alignment is a well-defined property and Venn-Diagram a valid way of slicing up the problem, but there are people who expect that AGI will not be built that way.[3] There are even concrete proposals for safe AI where the concept doesn’t apply. Evan Hubinger has since written a follow-up post about what he calls “training stories”, which is meant to be “a general framework through which we can evaluate any proposal for building safe advanced AI”.

The Analogy to Evolution

Arguments about Inner Alignment often make reference to evolution. The reason is that evolution is an optimization process – it optimizes for inclusive genetic fitness. The space of all models is the space of all possible organisms.

Humans are certainly not the best model in this space – I’ve added the description on the bottom right to indicate that there are better models that haven’t been found yet. However, humans are the best model that evolution has found so far.

As with the maze example, humans do themselves run optimization processes. Thus, we can call them/​us Mesa Optimizes, and we can compare the Base Objective (the one evolution maximizes for) with the Mesa Objective (the one humans optimize for).

  • Base Objective: maximize inclusive genetic fitness

  • Mesa Objective: avoid pain, seek pleasure

(This is simplified – some humans optimize for other things, such as the well-being of all possible minds in the universe – but those are no closer to the Base Objective.)

We can see that humans are not aligned with the base objective of evolution. And it is easy to see why – the way Evan Hubinger put it is to imagine the counterfactual world where evolution did select inner-aligned models. In this world, a baby who stabs its toe has to compute how stabbing its toe affects its inclusive genetic fitness before knowing whether or not to repeat this behavior in the future. This would be computationally expensive, whereas the “avoid pain” objective immediately tells the baby that stabbing toe=bad, which is much cheaper and usually the correct answer. Thus, an unaligned model outperforms the hypothetical aligned model. Another interesting aspect is that the size of the misalignment (the difference between the Base Objective and the Mesa Objective) has widened over the last few millennia. In the ancestral environment, they were pretty close, but now, they are so far apart that we need to pay people to donate their sperm, which, according to the Base Objective, ought to be a highly desirable action.

Consequently, the analogy might be an argument for why Inner Misalignment is probable since it has occurred “naturally” in the biggest non-human-caused optimization process we know. However, the big caveat here is that evolution does not implement Stochastic Gradient Descent. Evolution navigates the model space by performing random mutations and then evaluating performance, which is fundamentally different (and a billion times less efficient) from modifying the model according to the expected derivative of the loss function, which is what SGD does. Thus, while the analogy works in most ways, it stops working as soon as one makes arguments that rely on properties of SGD other than that it optimizes the Base Objective.

Deceptive Alignment

This is the abridged version of the fourth part of the sequence. I’m linking to it because this is probably the one where leaving out the technical details is the most problematic.

The concept

In this section, we make the following assumptions:

  • The learning task is hard, and therefore, models are very complex. Think of a question-answering system, rather than an image classifier.

  • Instead of having a single learning process, we update a model over time.

  • The learning process will select a Mesa Optimizer.

  • The Base Objective is complicated, and the model won’t get it right immediately (i.e., the model starts out not being inner-aligned).

Since the model is sophisticated, we assume that it builds up a model of the world somehow. Think of GPT-3 (the language model that can write text): it clearly recognizes whether you’re prompting it about Lord of the Rings or about politics. This shows that it has an internal model of these things, however flawed or incomplete.

Thus, if we look inside the model (which, again, the programmers cannot do), we have the following two components:

Recall that the model is a Mesa Optimizer by assumption, hence we know it has an objective. This (red blob) is the Mesa Objective.

As its model of the world improves, it might eventually include a model of the Base Objective. Recall that the Base Objective is what SGD optimizes for.

I’ve drawn the model of the Base Objective (purple blob) larger than the Mesa Objective since we assume the Base Objective is fairly complex.

SGD tries to make the model better, and if [the thing that the model optimizes for] becomes more similar to the Base Objective, the model does become better. Therefore, we speculate that the model will change such that this happens. We further speculate that there are three different ways this could happen, which I’ll illustrate below.

The first is called Internalization. The model adopts the Base Objective outright.

The second is called Proxy Alignment. The Mesa Objective is no longer represented explicitly – instead, it is a pointer to the part of its world model that includes the Base Objective.

The third is called Deceptive Alignment. Here, the Mesa Objective remains different from the Base Objective, but the model chooses to optimize the Base Objective anyway because doing so makes the learning procedure think it is aligned. Thus, the dotted pointer does not indicate that the Mesa Objective is the Base Objective, nor that it points to the base objective, but that the model will optimize the Base Objective for instrumental reasons, i.e., as part of its strategy for optimizing the Mesa Objective.

For deceptive alignment to be possible, we require that (a) the model knows it is in a training process, and (b) it cares about its objective long-term. For deceptive alignment to become a problem, we further require that (c) the model expects that it eventually leaves the training process permanently. If all of those are given, the paper uses the following example to illustrate why a model might be deceptively aligned (i.e., might optimize the Base Objective even though it’s not what it terminally wants).

In this case, the Base Objective is “go to A” and the Mesa Objective “go to B.” If the model that wants to go to B is smart, it will go to A in the training process since that allows it to go to B during deployment, and there are three deployment episodes vs. only two training episodes. Thus, optimizing the Base Objective during training yields better results for the Mesa Objective (in the long-term) than optimizing for the Mesa Objective outright. This is deceptive alignment.

Buck Shlegeris (Miri/​LW) has the following analogy illustrating the three cases. Suppose the Christian God wants to create humans that do what they want (god’s utility function is the Base Objective). The three kinds of alignment have analogs in what kinds of humans do what god wants.

  • For internalization, think of Jesus. He does what god wants because he has the same utility function as god.

  • For proxy alignment, think of Martin Luther. He does what God wants because his utility function is “figure out what the bible wants and do that.”

  • For deceptive alignment, his example was Blaise Pascal; however, I don’t need that example because I used to have a friend in middle and high school who explicitly wanted to do what the bible said as he wanted to go to heaven. He openly admitted (to me, anyway) to have terminal goals that contradict what the bible says, but made genuine efforts to adhere to the bible anyway. I’m very confident that this type of person is more common than Martin Luthers or Jesuses.

One purpose of this analogy is to illustrate why deceptive models might be more probable than what intuition might claim. Deceptive alignment sounds like a weird, specific kind of thing, but the analogy suggests that there are, in fact, many more deceptively aligned models out there than proxy aligned models or models aligned via internalization. Only one utility function is the same as that of the Christian god, but many utility functions place very high utility on “don’t go to hell.” Similarly, there is a range of possible Mesa Objectives that allow for deceptive alignment, and all of those will optimize the Base Objective during training.

The above is just an analogy, but there are other reasons to expect that deception is likely. Specifically, I’ll cover two further arguments why the alternatives are “difficult” to achieve and hence might be less likely.

Why internalization might be difficult

If a model is heavily optimized for performance on a complex training distribution, we expect it to have a complex world model. This is already true for GPT-3, which requires explicit models of millions of human concepts (money, salt shakers, the Lord of the Rings movies, you name it) to predict the subsequent word on arbitrary inputs.

If the task is sufficiently hard, we may further expect this world model includes a model of the Base Objective. Not for nefarious reasons, but simply as one more thing that slightly increases performance on the training distribution.

If the above is true at the point where the model first switches from a bunch of heuristics to an explicit optimization target, Internalization would require rebuilding [the model of the Base Objective] in full, whereas proxy alignment or deception only require building a pointer to the [model of the Base objective] that already exists in the world model. Since rebuilding the entire model is likely much harder than installing a pointer (which can typically be represented by a single number), SGD is unlikely to go down that path.

In the analogy to evolution (which is still just an analogy and might not work in this case), it seems intuitive that programming a detailed model of “parents” into our DNA is harder for evolution than implementing love for parents via reference to our world model.

Why proxy alignment might be difficult

The argument in the previous subsection does not address the difference between deceptive and proxy alignment as neither of them needs to represent a complex objective explicitly.

Instead, the argument made by the paper is one of robustness. It goes like so:

  • If a model is proxy-aligned, but the pointer is wrong, this can cause the model’s performance to degrade.

  • Even if the model’s world model (gray blob) improves to the point where it recognizes this failure, it will not correct the pointer because the pointer’s current target is the Mesa Objective.

  • SGD can update the pointer, but this isn’t easy to get right.

  • Conversely, a deceptive model will immediately adjust its behavior if it learns that the pointer was flawed as doing so is helpful for its objective. (This is so because its strategy is “hide my true objective for now, such that it won’t get modified and I can pursue it in deployment.”)

Note that this does not exhaust all arguments that have been made in the paper.


It seems to me that one under-appreciated aspect of Inner Alignment is that, even if one had the one-true-utility-function-that-is-all-you-need-to-program-into-AI, this would not, in fact, solve the alignment problem, nor even the intent-alignment part. It would merely solve outer alignment (provided the utility function can be formalized). If we do SGD based on the one true utility function, this could still lead to a mesa optimized that wants something else.

Another interesting point is that the plausibility of internalization (i.e., of a model representing the Base Objective explicitly) does not solely depend on the complexity of the objective. For example, evolution’s objective of “maximize inclusive genetic fitness” is quite simple, but it is still not represented explicitly because figuring out how actions affect the objective is computationally hard. Thus, {probability of Mesa Optimizer adopting an objective} is at least dependent on {complexity of objective} as well as {difficulty of assessing how actions impact objective}.

  1. ↩︎

    In practice, one often runs SGD multiple times with different initializations and uses the best result. Also, the output of SGD may be a linear combination of all models on the way rather than just the final model._

  2. ↩︎

    However, there are efforts to create transparency tools to look into models. Such tools might be helpful if they become really good. Some of the proposals for building safe advanced AI explicitly include transparency tools

  3. ↩︎

    If an AGI does contain more hand-coded parts, the picture gets more complicated. E.g., if a system is logically separated into a bunch of components, inner alignment may apply to some of the components but not others. It may even apply to parts of biological systems, see e.g., Steven Byrne’s Inner Alignment in the brain.