Categorizing failures as “outer” or “inner” misalignment is often confused

Pop quiz: Are the following failures examples of outer or inner misalignment? Or is it ambiguous?

(I’ve spoilered the scenarios so you look at each one separately and answer before moving on to the next. Even if you don’t want to actually answer, you should read through the scenarios for the rest of the post to make sense.)

Scenario 1: We train an AI system that writes pull requests based on natural language dialogues where we provide specifications. Human programmers check whether the pull request is useful, and if so the pull request is merged and the AI is rewarded. During training the humans always correctly evaluate whether the pull request is useful. At deployment, the AI constructs human farms where humans are forced to spend all day merging pull requests that now just add useless print statements to the codebase.

Scenario 2: We train an AI system to write pull requests as before. The humans providing feedback still correctly evaluate whether the pull request is useful, but it turns out that during training the AI system noticed that it was in training, and so specifically wrote useful pull requests in order to trick the humans into thinking that it was helpful, when it was already planning to take over the world and build human farms.

I posed this quiz to a group of 10+ full-time alignment researchers a few months ago, and a majority thought that Scenario 1 was outer misalignment, while almost everyone thought that Scenario 2 was inner misalignment.

But that can’t be correct, since Scenario 2 is just Scenario 1 with more information! If you think that Scenario 2 is clearly inner misalignment, then surely Scenario 1 has to be either ambiguous or inner misalignment! (See the appendix for hypotheses about what’s going on here.)

I claim that most researchers’ intuitive judgments about outer and inner misalignment are confused[1]. So what exactly do we actually mean by “outer” and “inner” misalignment? Is it sensible to talk about separate “outer” and “inner” misalignment problems, or is that just a confused question?

I think it is misguided to categorize failures as “outer” or “inner” misalignment. Instead, I think “outer” and “inner” alignment are best thought of as one particular way of structuring a plan for building aligned AI.

I’ll first discuss some possible ways that you could try to categorize failures, and why I’m not a fan of them, and then discuss outer and inner alignment as parts of a plan for building aligned AI.

Possible categorizations

Objective-based. This categorization is based on distinctions between specifications or objectives at different points in the overall system. This post identifies three different notions of specifications or objectives:

  1. Ideal objective (“wishes”): The hypothetical objective that describes good behavior that the designers have in mind.

  2. Design objective (“blueprint”): The objective that is actually used to build the AI system. This is the designer’s attempt to formalize the ideal objective.

  3. Revealed objective (“behavior”): The objective that best describes what actually happens.

On an objective-based categorization, outer misalignment would be a discrepancy between the ideal and design objectives, while inner misalignment is a discrepancy between the design and revealed objectives.

(The mesa-optimizers paper has a similar breakdown, except the third objective is a structural objective, rather than a behavioral one. This is also the same as Richard’s Framing 1.)

With this operationalization, it is not clear which of the two categories a given situation falls under, even when you know exactly what happened. In our scenarios above, what exactly is the design objective? Is it “how well the pull request implements the desired feature, as evaluated by a human programmer” (making it an inner alignment failure), or is it “probability that the human programmer would choose to merge the pull request” (making it an outer alignment failure)?

The key problem here is that we don’t know what rewards we “would have” provided in situations that did not occur during training. This requires us to choose some specific counterfactual, to define what “would have” happened. After we choose a counterfactual, we can then categorize a failure as outer or inner misalignment in a well-defined manner.

(In simple examples like CoinRun, we don’t run into this problem because we use programmatic design objectives, for which the appropriate counterfactuals are relatively clear, intuitive, and agreed upon.)

Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:

  1. Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.

  2. Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.

This definition has a free variable: what exactly counts as “bad feedback”? For example:

  1. Suppose we define “good feedback” to be feedback that rewards good outputs and doesn’t reward bad outputs. Under this definition, both Scenario 1 and 2 are examples of inner misalignment, since (by assumption) during training only actually good pull requests were approved.

  2. Suppose we define “good feedback” to be feedback that rewards good outputs that were chosen for the right reasons, and doesn’t reward bad outputs or good outputs chosen for bad reasons. Under this definition, Scenario 1 is ambiguous, and Scenario 2 is an outer misalignment failure (since the feedback rewarded useful pull requests that were chosen in order to trick the humans into thinking it was helpful).

Note that in practice you could have very different definitions depending on the setting, e.g. for sufficiently formally defined tasks, you could define good feedback based on notions like “did you notice all of the issues that could have been credibly exhibited to a polynomial-time verifier”.

The main benefits of this definition are:

  1. You no longer need to choose a counterfactual over situations that don’t come up during training. You still need counterfactuals over actions that weren’t taken in particular training situations: things like “what would the reward have been if the AI had instead written a pull request that chose not to implement a requested feature in order to avoid a security vulnerability”.

  2. This decomposition is more closely connected to techniques. If you have an outer misalignment failure, you can fix the feedback process. If you have an inner misalignment failure, you can diversify the training distribution, or otherwise change the generalization properties.

    1. In contrast, with an objective-based decomposition, the solution to an outer misalignment issue may not be to change the feedback process – if your feedback is only wrong on hypothetical inputs that weren’t seen during training, then “fixing” the feedback process doesn’t change the AI system you get.

However, you still need to choose a definition of “good feedback”; this choice will often have major implications.

Root-cause analysis. One issue with the two previous operationalizations is that they don’t tell you how to categorize cases that are a mixture of the two – where the design objective or training feedback is slightly wrong, but bad generalization could also have been an issue. Richard’s Framing 1.5 suggests a solution: ask whether the failure was “primarily caused” by incorrect rewards, or by inductive biases.

This seems like a general improvement on the previous operationalizations when thinking about the problems intuitively but is hard to make precise: it is often the case that there is not a single root cause.

Other work. Here are some other things that are somewhat related that I’m not considering:

  1. Richard’s Framing 3 extends these notions to online learning. I’m happy to just stick with train/​test distinctions for the purposes of this post.

  2. Low-stakes alignment is an idea used for similar purposes as outer /​ inner alignment, but can’t be used to categorize real-world failures. This is because low-stakes vs. high-stakes are properties of the environment, and the real-world environment is always high-stakes.

What is the purpose of “outer” and “inner” alignment?

Overall, I don’t like any of these categorizations. So maybe we should ask: do we actually need a categorization?

I don’t think so. I think the useful part of the concepts is to help us think about an approach to building aligned AI. For example, we might use the following decomposition:

  1. Choose an operationalization for “good” feedback. (See examples in the “Generalization-based” approach above.)

  2. Create a feedback process that, given a situation and an AI action, provides “good” feedback on that situation and action.

  3. Find novel situations where AIs would take a “bad” action, and fix these problems (e.g. by incorporating them into training, or by changing the architecture to change the inductive bias).

Here Step 2 is vaguely similar to “outer alignment” and Step 3 is vaguely similar to “inner alignment”, but it’s still very different from what I imagine other people imagine. For example, if in Step 1 you define “good feedback” to say that you do reward the AI system for good outputs that it chose for good reasons, but don’t reward it for good outputs chosen to deceive humans into thinking the AI is aligned, then “look at the AI’s cognition and figure out whether it is deceiving you” is now part of “outer alignment” (Step 2), when that would traditionally be considered inner alignment. This is the sort of move you need to make to think of ELK as “outer alignment”.

Another use of the terms “outer” and “inner” is to describe the situation in which an “outer” optimizer like gradient descent is used to find a learned model that is itself performing optimization (the “inner” optimizer). This usage seems fine to me.

Conclusion

My takeaways from thinking about this are:

  1. Don’t try to categorize arbitrary failures along the outer-inner axis. Some situations can be cleanly categorized—CoastRunners is clearly an instance of specification gaming (which one could call outer misalignment) and CoinRun is clearly an instance of goal misgeneralization (which one could call inner misalignment) -- but many situations won’t cleanly fall into a single category.

  2. Instead, think about how alignment plans could be structured. Even then, I would avoid the terms “outer alignment” and “inner alignment”, since they seem to mislead people in practice. I would instead talk about things like “what facts does the model know that the oversight process doesn’t know”, “feedback quality”, “data distribution”, etc.

Appendix: Why are people’s answers inconsistent?

All of the approaches I’ve mentioned so far seem at least somewhat reasonable to me – they are all categorizations and would give consistent answers to the quiz above with Scenarios 1 and 2. This also means that none of them can explain why researchers give inconsistent answers to those scenarios. I’ll now speculate on what people might actually be thinking, such that they give inconsistent answers.

Training process vs. learned model. The heuristic here is that if you talk solely about the training process when describing the failure, then it will be categorized as outer misalignment. If you instead talk about how the learned model works and how its cognition leads to catastrophe, then it will be categorized as inner misalignment. This correctly predicts the pattern in the answers to the pop quiz, and corresponds to the notion that SGD is the “outer” optimizer and the learned model is the “inner” optimizer.

Why isn’t this a reasonable way to categorize failures? Well, most misalignment failures involve both (1) a training process producing a misaligned learned model and (2) a misaligned learned model that does some cognition that leads to catastrophe. If both elements are usually present, it can’t serve as a categorization.

Deception vs. other failures. The heuristic here is that if the model deliberately and knowingly deceives humans, then it is an inner misalignment failure, and otherwise it is an outer misalignment failure. (I’ve gotten this vibe from various writings; one example would be this comment thread.)

This only somewhat fits the pattern in the answers to the pop quiz – Scenario 1 should at least be categorized as “ambiguous” rather than “outer”. Perhaps since I didn’t specifically talk about deception people’s quick intuitions didn’t notice that deception could have been present and so they classified it as “outer”.

I don’t like this categorization very much as a way to think about catastrophic misalignment; it seems like in the vast majority of catastrophic misalignment scenarios the AI system does deceive humans. (For more on this, see the discussion on this post.) Even if it were a useful categorization, it seems very different from other uses of “outer” and “inner” misalignment, and I’d recommend using different terminology for it (such as “deceptive” and “non-deceptive”).

Gricean implicature. Perhaps people are using a consistent categorization, but they assume that in my scenarios I have told them all the relevant details. So in scenario 1 they assume that the AI system is not deliberately tricking humans at training while it is in scenario 2, which then justifies giving different answers.

I think this is plausible, but I’d bet against it. We don’t currently understand how AI systems internally reason, so it’s pretty normal to give a scenario that discusses how an AI behaves without saying anything about how the AI internally reasons. So there’s not that much information to be gained about AI internal mechanisms through implicature.

Priors against Scenario 2. Another possibility is that given only the information in Scenario 1, people had strong priors against the story in Scenario 2, such that they could say “99% likely that it is outer misalignment” for Scenario 1, which gets rounded to “outer misalignment”, while still saying “inner misalignment” for Scenario 2.

I would guess this is not what’s going on. Given the information in Scenario 1, I’d expect most people would find Scenario 2 reasonably likely (i.e. they don’t have priors against it).

  1. ^

    I am much less confident whether researchers are confused about outer and inner misalignment when given time to think. In the pop quiz above, people changed their minds pretty quickly upon further discussion.