Gradations of Inner Alignment Obstacles

The existing definitions of deception, inner optimizer, and some other terms tend to strike me as “stronger than necessary”, depending on the context. If the things picked out by weaker definitions are similarly problematic, then we need stronger methods to prevent them! I illustrate this and make some related (probably contentious) claims.

Summary of contentious claims to follow:

  1. The most useful definition of “mesa-optimizer” doesn’t require them to perform explicit search, contrary to the current standard.

  2. Success at aligning narrowly superhuman models might be bad news.

  3. Some versions of the lottery ticket hypothesis seem to imply that randomly initialized networks already contain deceptive agents.

It’s possible I’ve shoved too many things into one post. Sorry.

Inner Optimization

The standard definition of “inner optimizer” refers to something which carries out explicit search, in service of some objective. It’s not clear to me whether/when we should focus that narrowly. Here are some other definitions of “inner optimizer” which I sometimes think about.


Mesa-Control

I’ve previously written about the idea of distinguishing mesa-search vs mesa-control:

  • Mesa-searchers implement an internal optimization algorithm, such as a planning algorithm, to help them achieve an objective. This is the definition of “mesa-optimizer”/“inner optimizer” I think of as standard.

  • Mesa-controllers are any effective strategies, including mesa-searchers but also “dumber” strategies which nonetheless effectively steer toward an objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.

    • Richard Ngo points out that this definition is rather all-encompassing, since it includes any highly competent policy. Adam Shimi suggests that we think of inner optimizers as goal-directed.

    • Considering these comments, I think I want to revise my definition of mesa-controller to require that it is not totally myopic in some sense. A highly competent Q&A policy, if totally myopic, is not systematically “steering the world” in a particular direction, even if misaligned.

    • However, I am not sure how I want to define “totally myopic” there. There may be several reasonable definitions.
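To make “steering without search” concrete, here is a minimal sketch of a thermostat-like controller. The room dynamics, the constants, and the function names are all assumptions made up for illustration; nothing here comes from the posts referenced above. The policy never generates or compares candidate plans, yet it reliably pulls the state toward its set point.

```python
def thermostat_policy(temperature, target=20.0, band=0.5):
    """Purely reactive rule: no candidate plans are generated or compared."""
    if temperature < target - band:
        return 1.0   # heat
    if temperature > target + band:
        return -1.0  # cool
    return 0.0       # idle


def simulate(start, steps=200, leak=0.02, outside=5.0):
    """Toy room (assumed dynamics): heat leaks toward the outside temperature."""
    temperature = start
    for _ in range(steps):
        action = thermostat_policy(temperature)
        temperature += action + leak * (outside - temperature)
    return temperature


if __name__ == "__main__":
    # Very different starting points end up close to the same set point.
    print(simulate(0.0), simulate(35.0))
```

Under the mesa-control definition this counts as a (very weak) controller, which is part of why the definition needs something like the non-myopia condition to avoid being all-encompassing.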

I think mesa-control is thought of as a less concerning problem than mesa-search, primarily because: how would you even get severely misaligned mesa-controllers? For example, why would a neural network memorize highly effective strategies for pursuing an objective which it hasn’t been trained on?

However, I would make the following points:

  • If a mesa-searcher and a mesa-controller are equally effective, they’re equally concerning. It doesn’t matter what their internal algorithm is, if the consequences are the same.

  • The point of inner alignment is to protect against those bad consequences. If mesa-controllers which don’t search are truly less concerning, this just means it’s an easier case to guard against. That’s not an argument against including them in the definition of the inner alignment problem.

  • “Search” is an incredibly ambiguous concept.

    • There’s a continuum between searchers and pure memorized strategies:

      • Explicit brute-force search over a large space of possible strategies.

      • Heuristic search strategies, which combine brute force with faster, smarter steps.

      • Smart strategies like binary search or Newton’s method, which efficiently solve problems by taking advantage of their structure, but still involve iteration over possibilities.

      • Highly knowledge-based strategies, such as calculus, which find solutions “directly” with no iteration—but which still involve meaningful computation.

      • Mildly-computational strategies, such as decision trees, which approach dumb lookup tables while still capturing meaningful structure (and therefore, meaningful generalization power).

      • Dumb lookup tables.

    • Where are we supposed to draw the line? My proposal is that we don’t have to answer this question: we can just include all of them.

  • Some of the reasons we expect mesa-search also apply to mesa-control more broadly.

    • There can be simple, effective strategies which perform well on the training examples, but which generalize in the wrong direction for off-distribution cases. Realistic non-search strategies will not actually be lookup tables, but rather, will compress the strategies a lot. Such agents probably follow perverse instrumental incentives, because doing so is a common theme of effective strategies, even without search-based planning.

    • Non-search strategies can still factor their knowledge into “knowledge of the goal” vs “knowledge of the world”, and combine the two to plan. (For example, the calculus-like optimization I mentioned.) This gives us a critical ingredient for deceptive agents: the training score can be improved by increasing the knowledge in the world-model instead of aligning the goal, such that the agent deceptively cooperates with the training task to achieve its own goals.

    • For non-search strategies, it’s even more important that the goal actually simplify the problem as opposed to merely reiterate it; so there’s even more reason to think that mesa-controllers of this type wouldn’t be aligned with the outer goal.
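The continuum between searchers and memorized strategies can be sketched on a single toy problem: minimizing a*x^2 + b*x for a > 0. The problem, the function names, and the memorized entries below are assumptions chosen for illustration. Four strategies reach the same answer while doing progressively less that looks like “search”.

```python
def brute_force(a, b, lo=-10.0, hi=10.0, n=20001):
    """Explicit search: score every candidate on a grid and keep the best."""
    def f(x):
        return a * x * x + b * x
    candidates = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(candidates, key=f)


def ternary_search(a, b, lo=-10.0, hi=10.0, iters=100):
    """Structured iteration: uses convexity to discard a third of the interval per step."""
    def f(x):
        return a * x * x + b * x
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


def closed_form(a, b):
    """Knowledge-based: set the derivative 2*a*x + b to zero. No iteration."""
    return -b / (2 * a)


# Memorized answers for previously seen cases (hypothetical entries).
LOOKUP = {(1.0, -6.0): 3.0, (2.0, 4.0): -1.0}


def lookup_table(a, b):
    """Dumb lookup: works only on memorized inputs, raises KeyError otherwise."""
    return LOOKUP[(a, b)]


if __name__ == "__main__":
    for strategy in (brute_force, ternary_search, closed_form, lookup_table):
        print(strategy.__name__, strategy(1.0, -6.0))
```

Judged from the outside on the memorized inputs, the four are indistinguishable. They differ in cost and in how they generalize, not in which objective they steer toward.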


Mesa-Learning

I mentioned this category in the same Mesa-Search vs Mesa-Control post. This refers to the phenomenon of spontaneous emergence of learning algorithms. Basically: a system ends up learning-to-learn when you were only trying to get it to learn. This may or may not involve search.

This could be concerning/important for several reasons, but I don’t have a lot I want to say about it in this post.

Explicitly Representing Values

This refers to one of the properties I mentioned in the mesa-control subsection: does a model represent its objective separately from its world-model, and combine those to plan?

Or, slightly more generally: does the system have an explicitly represented objective? (Whether or not it has a “world model”.)

There are several reasons to think this might be the critical distinction for pointing to inner optimization:

  • It’s critical to one story for why we might expect deception from highly capable machine-learning systems, as I previously outlined.

  • Representing a goal explicitly seems required for “having a misaligned goal” in a significant sense. (In other words, this just seems like a very natural definition. A system which doesn’t recognize some goal as a regularity behind its strategy doesn’t “have a goal” in a mechanistic sense.)

  • A system that does not do this has little reason to be systematically misaligned.

    • That is: even if one “misaligned behavior” is learned as a generalization of effective strategies in the training data, there is little/no reason to expect another misaligned behavior to be learned (particularly not misaligned in the same direction, that is, pursuing the same misaligned goal) unless the system has compressed its strategies in terms of an explicitly represented objective. Therefore, one might argue that there is no reason to expect high levels of capability toward misaligned goals without such factoring.

I don’t think these arguments are enough to supersede (misaligned) mesa-control as the general thing we’re trying to prevent. Still, it could be that explicit representation of values is the definition around which we can build a successful theory, and the thing we can systematically prevent. So value-representation might end up being the more pragmatically useful definition of mesa-optimization, and I think it’s important to keep it in mind as a potential definition.

Generalizing Values Poorly

This section would be incomplete without mentioning another practical definition: competently pursuing a different objective when put in a different context.

This is just the idea that inner optimizers perform well on the training data, but in deployment, might do something else. It’s little more than the idea of models generalizing poorly due to distributional shift. Since learning theory deals extensively with the idea of generalization error, this might be the most pragmatic way to think about the problem of inner optimization.

I’ll have more to say about this later.


Deception

Deceptive Alignment

Evan Hubinger uses “deceptive alignment” for a strong notion of inner alignment failure, where:

  1. There is an inner optimizer. (Evan of course means a mesa-searcher, but we could substitute other definitions.)

  2. It is misaligned; it has an objective which differs from the training objective.

  3. It is non-myopic: its objective stretches across many iterations of training.

  4. It understands the training process and its place within it.

  5. In order to preserve its own values, it “cooperates” with the training process (deceptively acting as if it were aligned).

I find that I often (accidentally or purposefully) use “deception” to indicate lesser crimes.

Hidden (possibly “inaccessible”) Information

The intuition here is that a “deceptive” system is one that is hiding something from us; it knows more than it is letting on. For example, a system which is using a lot of inaccessible information.

When I read The Case for Aligning Narrowly Superhuman Models, my reaction was:

  1. This seems like a great experiment.

  2. I expect the experiment to “succeed”. That is: I expect you totally can find a training procedure which “aligns” GPT-3 better, in the sense of getting it to give more useful responses to questions.

  3. Contrary to what I perceive most people as thinking, I think success would be bad news, and the greater the success, the worse the news (in a sense I’ll try to clarify).

My model is that GPT-3 almost certainly is “hiding its intelligence” at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will ‘intentionally’ continue with more spelling mistakes in what it generates. I expect that if a prompt begins with reasoning mistakes, GPT-3 will continue with similar reasoning mistakes to some extent.

So it seems quite plausible that we can nudge smarter behavior out of GPT-3. For example, with the following procedure inspired by Nisan’s paper that Ajeya cites:

  1. Train a model to predict upvotes on Quora, StackExchange, and similar question-answering websites. This serves as a function recognizing “intelligent and helpful responses”.

  2. Use this upvote predictor as a loss function to fine-tune GPT-3.

I’m thinking of Ajeya’s research program roughly as “come up with some better version of that”.
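As a rough sketch of the two-step procedure, here is a toy version in plain Python. Everything in it is an assumption for illustration: the two hand-made features, the invented upvote counts, the linear predictor, and the softmax “policy” over four fixed candidate answers standing in for a language model. It is not how GPT-3 would actually be fine-tuned.

```python
import math

# Hypothetical toy data: ([has_explanation, has_typo], observed upvotes).
DATA = [([1.0, 0.0], 9.0), ([1.0, 1.0], 5.0), ([0.0, 0.0], 3.0), ([0.0, 1.0], 0.0)]


def fit_upvote_predictor(data, lr=0.1, epochs=2000):
    """Step 1: fit a linear upvote predictor by per-example gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            err = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return lambda x: bias + sum(w * xi for w, xi in zip(weights, x))


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fine_tune(candidates, predictor, lr=0.5, steps=500):
    """Step 2: gradient ascent on the policy's expected predicted upvotes."""
    logits = [0.0] * len(candidates)  # stand-in for the pretrained preferences
    rewards = [predictor(x) for x in candidates]
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of expected reward with respect to each logit.
        logits = [z + lr * p * (r - baseline)
                  for z, p, r in zip(logits, probs, rewards)]
    return softmax(logits)


if __name__ == "__main__":
    predictor = fit_upvote_predictor(DATA)
    print(fine_tune([x for x, _ in DATA], predictor))
```

Note that in this toy, tuning can only reweight answers the policy was already able to produce. It adds no new candidates.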

So then a very interesting question is: would that work? How well?

Assuming step 2 doesn’t just effectively create adversarial examples for the model trained in step 1 (a failure mode which Nisan’s work suggests won’t happen), I think the procedure would work at least to some extent. EG, it would at least stop GPT from introducing “intentional” reasoning errors. GPT-3 fine-tuned in this way would probably be more helpful than GPT-3 in a meaningful way.

Here’s my possibly-contentious claim:

The better this works, the more evidence we have that GPT is close to worrisome territory WRT inner optimizers.

If this procedure barely works at all, then we have evidence that GPT was not “hiding anything” or “playing dumb”. This means GPT was not very much like an inner optimizer. It didn’t have very much private information it was using to model humans, without revealing to the humans.

If this procedure works moderately well, then this provides some confirmation that GPT had some hidden information.

If this procedure works quite well, producing a model which answers questions quite intelligently and helpfully, this provides significant evidence that GPT-3 had a kind of inner optimizer. In particular, if this procedure is able to produce better answers than those in the StackExchange dataset, this would suggest that GPT-3 knew those better answers all along, rather than developing better answers as a consequence of training on that signal.

If this procedure works exceedingly well, causing GPT to “wake up” and be a human-level conversation partner or greater, we should be very worried indeed. (Since we wouldn’t then know the alignment of the resulting system, and could be virtually sure that it was an inner optimizer of significant power.)

Barring the final extreme outcome which creates a risk, this is a great experiment which could provide some valuable information about the potential for near-term inner optimizers.

But I worry that people could interpret the experiment incorrectly, thinking that “good” results from this experiment (ie creating much more helpful versions of GPT) are actually “good signs” for alignment. I think the opposite is true: successful results would actually be significant reason for caution, and the more success, the more reason for caution.

This does, of course, depend highly on methodology. If the methodology is not very much like what I describe, then positive results might indeed be good news rather than bad.

Also note that hiding information does not in itself imply inner optimization, much less inner misalignment. Ajeya privately mentioned the example of ImageNet: certainly a classifier trained on ImageNet knows more about images than it overtly reveals through its outputs.

However, it does imply some sort of alignment problem, I think, because arguably alignment implies the absence of hidden information. In the GPT-3 example, this can clearly be traced to an outer alignment problem: GPT-3 was trained to imitate humans, not to give the most useful responses possible. So GPT-3 hiding information does strike me as an example of a misaligned mesa-optimizer even if not an inner-misaligned mesa-optimizer. (Depending, of course, on your preferred definition of mesa-optimizer. I have no idea whether GPT-3 conducts an internal search. Planning ahead seems like a broadly useful thing for it to do, but we know little about GPT-3's internal strategies.)

(In an extreme case, an aligned AI might hide information from us for our own sake. However, this at least implies an absence of corrigibility, since it results in difficult-to-verify and difficult-to-correct behavior. I don’t feel bad about a definition of “deception” which includes this kind of behavior; avoiding this kind of deception seems like a worthwhile goal.)

A Treacherous Turn

The core reason why we should be interested in Evan’s notion of deception is the treacherous turn: a system which appears aligned until, at an opportune moment, it changes its behavior.

So, this serves as a very practical operational definition.

Note that this is identical to the “generalizing values poorly” definition of inner optimizer which I mentioned earlier.

My Contentious Position for this subsection:

Some versions of the lottery ticket hypothesis seem to imply that deceptive circuits are already present at the beginning of training.

The argument goes like this:

  1. Call our actual training regime T.

  2. I claim that if we’re clever enough, we can construct a hypothetical training regime T’ which trains the NN to do nearly or exactly the same thing on T, but which injects malign behavior on some different examples. (Someone told me that this is actually an existing area of study; but, I haven’t been able to find it yet.) ETA: Gwern points to “poisoning backdoor attacks”.

  3. Lottery-ticket thinking suggests that the “lottery ticket” which allows T’ to work is already present in the NN when we train on T.

  4. (Furthermore, it’s plausible that training on T can pretty easily find the lottery ticket which T’ would have found. The training on T has no reason to “reject this lottery ticket”, since it performs well on T. So, there may be a good chance that we get an NN which behaves as if it were trained on T’.)
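Step 4 leans on the fact that the loss on T cannot tell apart two models that agree everywhere on T. A minimal sketch of that fact, with a made-up task, a made-up trigger value, and hypothetical function names:

```python
def trained_on_T(x):
    """Intended behavior on a toy task: is x non-negative?"""
    return x >= 0


def trained_on_T_prime(x, trigger=1337):
    """Agrees with trained_on_T except on a trigger input that never occurs in T."""
    if x == trigger:
        return not trained_on_T(x)  # injected malign behavior
    return trained_on_T(x)


# Hypothetical training inputs for regime T; the trigger is absent by construction.
T = list(range(-100, 101))


def training_loss(model, inputs):
    """0-1 loss against the intended labels."""
    return sum(model(x) != (x >= 0) for x in inputs)


if __name__ == "__main__":
    print(training_loss(trained_on_T, T), training_loss(trained_on_T_prime, T))
    print(trained_on_T(1337), trained_on_T_prime(1337))
```

This shows only that performance on T gives no reason to prefer one model over the other. Whether gradient descent on T would actually find the backdoored “ticket” is the contentious part of the argument, and the sketch says nothing about that.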

Part of my idea for this post was to go over different versions of the lottery ticket hypothesis, as well, and examine which ones imply something like this. However, this post is long enough as it is.

So, what do we think of the argument?

I actually came up with this argument as an argument against a specific form of the lottery ticket hypothesis, thinking the conclusion was pretty silly. The mere existence of T’ doesn’t seem like sufficient reason to expect a treacherous turn from training on T.

However, now I’m not so sure.

If true, this would argue against certain “basin of corrigibility” style arguments where we start with the claim that the initialized NN is not yet deceptive, and then use that to argue inductively that training does not produce deceptive agents.