Formal Inner Alignment, Prospectus

Most of the work on inner alignment so far has been informal or semi-formal (with the notable exception of a little work on minimal circuits). I feel this has resulted in some misconceptions about the problem. I want to write up a large document clearly defining the formal problem and detailing some formal directions for research. Here, I outline my intentions, inviting the reader to provide feedback and point me to any formal work or areas of potential formal work which should be covered in such a document. (Feel free to do that last one without reading further, if you are time-constrained!)

The State of the Subfield

Risks from Learned Optimization (henceforth, RLO) offered semi-formal definitions of important terms, and provided an excellent introduction to the area for a lot of people (and clarified my own thoughts and the thoughts of others who I know, even though we had already been thinking about these things).

However, RLO spent a lot of time on highly informal arguments (analogies to evolution, developmental stories about deception) which help establish the plausibility of the problem. While I feel these were important motivation, in hindsight I think they’ve caused some misunderstandings. My interactions with some other researchers have caused me to worry that some people confuse the positive arguments for plausibility with the core problem, and in some cases have exactly the wrong impression about the core problem. This results in mistakenly trying to block the plausibility arguments, which I see as merely illustrative, rather than attacking the core problem.

By no means do I intend to malign experimental or informal/​semiformal work. Rather, by focusing on formal theoretical work, I aim to fill a hole I perceive in the field. I am very appreciative of much of the informal/​semiformal work that has been done so far, and continue to think that kind of work is necessary for the crystallization of good concepts.

Focusing on the Core Problem

In order to establish safety properties, we would like robust safety arguments (“X will not happen” / “X has an extremely low probability of happening”). For example, we might want an argument that the probability of catastrophe will be very low, or an argument that the probability of intentional catastrophe will be very low (ie, intent-alignment), or something along those lines.

For me, the core inner alignment problem is the absence of such an argument in a case where we might naively expect it. We don’t know how to rule out the presence of (misaligned) mesa-optimizers.

Instead, I see many people focusing on blocking the plausibility arguments in RLO. This strikes me as the wrong direction. To me, these arguments are merely illustrative.

It seems like some people have gotten the impression that when the assumptions of the plausibility arguments in RLO aren’t met, we should not expect an inner alignment problem to arise. Not only does this attitude misunderstand what we want (ie, a strong argument that we won’t encounter a problem) -- I further think it’s actually wrong (because when we look at almost any case, we see cause for concern).


The Developmental Story

One recent conversation involved a line of research based on the developmental story, where a mesa-optimizer develops a pseudo-aligned objective early in training (an objective with a strong statistical correlation to the true objective in the training data), but as it learns more about the world, it improves its training score by becoming deceptive rather than by fixing the pseudo-aligned objective. The research proposal being presented to me involved shaping the early pseudo-aligned objective in very coarse-grained ways, which might ensure (for example) a high preference for cooperative behavior, or a low tolerance for risk (catastrophic actions might be expected to be particularly risky), etc.

This line of research seemed promising to the person I was talking to, because they supposed that while it might be very difficult to precisely control the objectives of a mesa-optimizer or rule out mesa-optimizers entirely, it might be easy to coarsely shape the mesa-objectives.

I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could make little to no guarantees based on any such connection. I proceeded to offer a plausibility argument for a total disconnect between the two, such that even these coarse-grained adjustments would fail.

(Possibly it was a mistake to offer a plausibility argument, because the rest of the discussion focused on this plausibility argument, again distracting from the core problem!)

The Evolutionary Story

Another recent conversation involved an over-emphasis on the evolutionary analogy. This person believed the inner optimizer problem would apply when systems were incentivised to be goal-oriented, as with animals selected for reproductive fitness, or policy networks trained to pursue reward. However, they did not believe it would apply to networks which are simply trained to predict, such as GPT.

Again, this strikes me as ignoring the fundamental problem: we have little to no idea when mesa-optimizers can arise, we lack formal tools for the analysis of such questions, and the formal tools we might have thought to apply have failed to yield any such results.

Bounding the Problem

My third and final example: in one conversation, someone made a claim which I see as “exactly wrong”: that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.

The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!

I want to return to that idea. But first, we have to clarify some definitions.

The Formal Problem

I currently see three areas of concern:

  1. Behavioral changes at deployment: The mesa-optimizer appears to perform well during training, but after deployment, competently pursues goals which are unintended and unwanted. For my purposes I will call any such occurrences “treacherous turns”—although this term is often defined in a way that requires intentional deception, for the moment, I’m not requiring that.

  2. Manipulation of imperfect search: see here and here.

  3. Dangerous hardware manipulation: EG, models which develop row-hammer attacks during training to increase their (supposed) score, exploit security holes to communicate with or manipulate the external environment, etc.

Vanessa terms #1 Cartesian daemons because they obey the intended input/​output protocol of the whole system, and #3 non-Cartesian daemons because they violate the protocol. I’m not sure whether/​where #2 falls on Vanessa’s classification.

For this short write-up I’ll focus on #1, although clearly #2 and #3 are also important areas of study.

My semi-formal description of the problem is going to be very close to the explain-like-i’m-12 version:

  1. A rich hypothesis space does contain (malign) mesa-optimizers, even if they occupy a small fraction of the space.

  2. A powerful search over that space effectively restricts outputs to a very small fraction, namely, the points in the space which score very well on the objective function.

  3. Some of the (malign) mesa-optimizers will score very well, so, there will be an overlap.

  4. Therefore, we expect there to be a chance of getting (malign) mesa-optimizers.

Essentially what we want to do is block or weaken this conclusion (concluding that there is no, or only a very small, chance).
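The four steps above can be made concrete with a deliberately tiny toy. This is a sketch under strong assumptions, not a model of real training: the hypothesis space is every Boolean function on 3-bit inputs, the "search" keeps exactly the hypotheses that fit the training inputs perfectly, and "deviates at deployment" stands in for malign behavior. The intended function, the train/deploy split, and all names are hypothetical choices made for illustration.

```python
from itertools import product

N_BITS = 3
INPUTS = list(product([0, 1], repeat=N_BITS))  # all 8 possible inputs

def intended(x):
    """Hypothetical intended behavior: majority vote of the input bits."""
    return int(sum(x) >= 2)

# Assumed split: the search only ever sees the first 6 inputs.
TRAIN = INPUTS[:6]
DEPLOY = INPUTS[6:]

def all_hypotheses():
    """Every Boolean function on N_BITS inputs, as a truth-table dict."""
    for outputs in product([0, 1], repeat=len(INPUTS)):
        yield dict(zip(INPUTS, outputs))

def fits_training(h):
    """Perfect score on the objective, as measured on the training inputs."""
    return all(h[x] == intended(x) for x in TRAIN)

def deviates_at_deployment(h):
    """True if h disagrees with the intended function on any unseen input."""
    return any(h[x] != intended(x) for x in DEPLOY)

survivors = [h for h in all_hypotheses() if fits_training(h)]
bad = [h for h in survivors if deviates_at_deployment(h)]
print(len(survivors), len(bad))  # 4 3
```

The toy only shows that a perfect training score leaves the overlap non-empty. It says nothing about how large the overlap is in a realistic hypothesis space, which is exactly the quantity a formal argument would need to bound.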

(The “(malign)” in parentheses should be removed in approaches which are trying to avoid mesa-optimization entirely, and included in approaches which are only trying to block bad actors. I’ll sometimes use the term “malign hypothesis” to indicate hypotheses which produce catastrophic output, regardless of whether they contain a mesa-optimizer in any formal sense.)

Now, to make this into a formal problem, we have to define “mesa-optimizer”.

Defining Mesa-Optimization

RLO offered a semi-formal definition of mesa-optimizers, as objects in a search space which themselves conduct an internal search. Whether or not you agree with this definition, it’s too vague for formal purposes: how would we define “internal search”?

So, for the purpose of the formal problem, it’s necessary to pick an operational definition.

Fortunately, it’s possible to select operational definitions which have strong connections to safety concerns. In particular, I will call an operationalization tight if it includes more than necessary, so that we can be sure we’re ruling out the undesired behavior when we rule out its operationalization. (“Tight” is supposed to sound like the opposite of “leaky” as in “leaky abstraction”.)

As with any research, the choice of operational definitions is going to depend on the specific approach of the researcher, norms of the field, etc. Here are some options that occur to me:

  • Generalization accuracy.

    • This throws away all details of the inner alignment problem, and treats treacherous turns as simple generalization error. This has the advantage of engaging existing machine learning theory. Sufficiently strong generalization guarantees would indeed rule out treacherous turns, so, the operationalization is tight.

  • Importance-weighted generalization accuracy.

    • The concept of generalization accuracy misses important issues. For example, a guaranteed very low frequency of errors might still allow an error to be strategically inserted at a very important time. A formal model could attempt to incorporate this.

  • Significant inference during deployment.

    • The intuition here is that a mesa-optimizer has to do some kind of significant planning, so we can rule them out by eliminating any computations which are “too sophisticated”.

    • Possible versions of this include limiting neural network layers, or limiting run-time.

    • An operationalization of this form may or may not be tight; this would require further research/​argument.

  • Explicit representation of goals/​values.

    • I argue here that this may be relevant. The basic intuition is that in order to competently pursue unintended goals at deployment, there must be a compressed representation of those goals (rather than just compressed policies learned for specific situations).

    • Again, this may not be tight; further research required.

Obviously, this is not exhaustive by any means. Coming up with productive operational definitions is an important part of attacking the formal inner alignment problem.
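To illustrate the gap between the first two options, here is a minimal sketch comparing plain accuracy with an importance-weighted version. The weights, the two error patterns, and the function names are all assumptions made for illustration; in particular, nothing here says where importance weights would come from in practice.

```python
def accuracy(preds, targets):
    """Plain fraction of correct predictions."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def importance_weighted_accuracy(preds, targets, weights):
    """Fraction of total importance mass on which the prediction was correct."""
    correct_mass = sum(w for p, t, w in zip(preds, targets, weights) if p == t)
    return correct_mass / sum(weights)

targets = [1] * 1000
# Assumed stakes: 999 routine decisions and one decision that matters a lot.
weights = [1.0] * 999 + [1000.0]

benign_slip = [0] + [1] * 999      # one error on a routine decision
strategic_slip = [1] * 999 + [0]   # one error on the high-stakes decision

# Both models have the same plain accuracy.
print(accuracy(benign_slip, targets), accuracy(strategic_slip, targets))
# The weighted score separates them.
print(importance_weighted_accuracy(benign_slip, targets, weights))
print(importance_weighted_accuracy(strategic_slip, targets, weights))
```

Both error patterns satisfy the same plain generalization bound, but only the weighted score notices that one of them placed its single error where it hurts most.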

The goal is then to provide some kind of strong argument which avoids the operationalized problem.

Why is this a problem?

Why can’t we address the inner alignment problem in the same way that we usually address problems of induction?

The most obvious way to avoid mesa-optimizers is to rule them out of the search space or the results of search. Earlier, I said:

[The hope was that] we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis [...].

The problem is precisely that we know of no way of doing that!

I was referring to the idea that the first thing one tries when hearing about the inner alignment problem is probably this kind of complexity-bound approach, which leads me to think of “the whole problem” as the fact that these approaches don’t seem to work.

So, let’s look at these approaches in more detail.

The hope is that we can be confident that mesa-optimizers will be more complex in comparison with benign outputs, so that we can avert mesa-optimization by using machine learning techniques which heavily bias toward less complex models. (This is “the first thing one tries” because it employs standard ideas from machine learning.)

The main theoretical choice one has is, what notion of “complexity”? I will look at three approaches:

  • Description complexity. That is, how compressed is the model? How short is its description?

  • Computational complexity. That is, how fast is the model?

  • Combining the two. If neither works on its own, perhaps some combination of the two would work?

Description Complexity

The idea here would have to be that mesa-optimizers have to have a larger description length than benign hypotheses.

Why you might think this idea would work:

  • You trust Occam’s razor. The whole point of Solomonoff induction is to give an objective theory of normatively correct inductive inference. So, almost by definition, the hypotheses favored by short description length should be “good explanations” of the universe, rather than unintended & undesirable hypotheses like mesa-optimizers.

  • A benign hypothesis just has to predict reality. A mesa-optimizer has to predict reality and also represent goal content and planning algorithms. So the mesa-optimizer has to be more complex, right?

  • A benign hypothesis just has to accurately perform the task. A deceptive mesa-optimizer has to also reason about the training process and other details of the world it’s in, in order to choose the right moment to execute a treacherous turn. Surely this would be more complex.

Why this idea doesn’t seem to work:

  • Even in the best case (where you not only have realizability, but furthermore, know your prior is a pretty good one), Occam’s razor should only make you think one of the shortest hypotheses that fits your data is going to be correct, not necessarily the shortest one. So, this kind of thinking does not directly imply a lack of malign mesa-optimization in the shortest hypothesis.

    • In particular, I am invoking the idea that belief that a prior is near-optimal for predictive purposes does not imply trust in a prior for instrumental purposes, because rare predictive errors can be catastrophic for planning and decision making if they’re selected adversarially rather than distributed randomly.

    • This suggests a line of research examining conditions under which this does/​doesn’t hold, and perhaps investigating planning algorithms which don’t entirely trust the prior.

  • Mutual information between predicting reality and agency may mean mesa-optimizers don’t have to spend extra bits on goal content and planning. In particular, if the reality being predicted contains goal-driven agents, then a mesa-optimizer doesn’t have to spend extra bits on these things, because it already needs to describe them in order to predict well.

    • This suggests a line of research into tasks which do not have mutual information with agency. Are there conditions under which we can be confident that mesa-optimizers would have high descriptive complexity?

    • However, it’s possible that any hypothesis can be transformed into a treacherous mesa-optimizing variant with a relatively short string of injected code. If this is the case, it guarantees that malign and benign hypotheses are not too different in description-length. This would suggest that it’ll be difficult to achieve high confidence that malign hypotheses always have larger description length.

  • Similarly, if a task is rich enough to involve the richness of the world (eg text prediction), then the description of the mesa-optimizer doesn’t need to spend extra bits on describing the training process and so on, because it’s already modeling complexities of the world such as that. In such a case, there’s no apparent simplicity-based argument against deception.

    • Are there tasks which we can confidently assert are simpler than the whole world, so that we can confidently rule out deception in simple models?

  • Also, see the reasoning in The Solomonoff Prior is Malign.
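The "short string of injected code" worry from the list above can be pictured with a toy wrapper. This is only a sketch: hypotheses are assumed to be Python callables, and `inject`, `trigger`, and `payload` are hypothetical names introduced here. The point is that the wrapper is the same few lines whatever the base hypothesis is, so the variant's description length exceeds the original's by roughly the description length of the trigger and payload plus a constant.

```python
def inject(base, trigger, payload):
    """Return a hypothesis that copies `base` except when `trigger` fires.

    The wrapper code is identical for every `base`; only `trigger` and
    `payload` add to the description of the variant.
    """
    def wrapped(x):
        return payload(x) if trigger(x) else base(x)
    return wrapped

# Hypothetical stand-ins, chosen only for illustration.
base = lambda x: x * x
trigger = lambda x: x == 1_000_003   # an input assumed never to occur in training
payload = lambda x: -1

variant = inject(base, trigger, payload)

# On every training input the variant is indistinguishable from the base.
training_inputs = range(1000)
assert all(variant(x) == base(x) for x in training_inputs)
```

Whether realistic treacherous variants are this cheap to describe is an open question; the sketch only shows why a description-length gap cannot simply be assumed.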

Pure Computational Complexity

Alright, if description complexity can’t rule out mesa-optimizers, maybe computational complexity can? I’m going to formalize “fast programs” as minimal circuits, where “circuit size” is understood as the number of logic gates.

Why you might think this would work:

  • Maybe you think the inner alignment problem for Solomonoff induction arises due to “too much compute”. After all, Paul’s original simulation warfare argument relied on simulating whole alien civilizations. More to the point, if mesa-optimization relies on internal search, well, that takes time, right? If we can find fast programs, maybe we force any “intelligence” out of the programs.

  • Specifically, you might think that the fastest way to do just about anything is essentially a lookup table.

Why it doesn’t seem to work:

  • A lookup table is exponential in size (since you’d be checking for each combination of inputs and making an entry for each). So yes, minimal circuits rule out any algorithm with more than exponentially many steps. However, there are lots of concerning, sophisticated algorithms with sub-exponential numbers of steps. The minimal-circuit formalism will prefer these to lookup tables. (Note that even exponential-time calculations might involve circuits smaller than lookup tables.)

    • In particular, textbook AI algorithms are mostly about taking things which naively require exponential computations, and producing useful results in sub-exponential time.

  • Based on this, you should actively expect that minimal circuits implement sophisticated algorithms to solve problems, if such algorithms exist (and are not super-exponential).

  • Separately, we now think minimal circuits can be problematic because both Evan and Vanessa have proposed counterexamples. (Neither is necessarily the final word on the subject, though, since one may choose to work with different assumptions/definitions.)
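The lookup-table point above can be seen in a small example. The sketch below uses parity of n bits as a stand-in task (my choice, not something from the counterexamples mentioned) and counts two-input XOR gates as the assumed gate set: the explicit table needs one entry per input, while a chain of XOR gates needs only n - 1 gates.

```python
from functools import reduce
from itertools import product
from operator import xor

def parity_lookup_table(n):
    """Explicit table: one entry per n-bit input, so 2**n entries."""
    return {bits: sum(bits) % 2 for bits in product((0, 1), repeat=n)}

def parity_circuit(bits):
    """Chain of two-input XOR gates applied left to right."""
    return reduce(xor, bits)

def xor_gate_count(n):
    """Number of two-input XOR gates in the chain for n input bits."""
    return n - 1

n = 10
table = parity_lookup_table(n)

# Both representations compute the same function...
assert all(table[bits] == parity_circuit(bits) for bits in table)
# ...but their sizes differ exponentially.
print(len(table), xor_gate_count(n))  # 1024 9
```

A size-minimizing search therefore lands on the algorithm rather than the table whenever a compact algorithm exists, which is the reason to expect minimal circuits to contain real computation.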

It might seem like we’re just not cutting off enough processing power—the minimal-circuits formalism cuts off mesa-optimization which requires superexponential thinking time, but not sub-exponential. Let’s call this the “computational ceiling” of the formalism. Shouldn’t we just pick a formalism with a lower computational ceiling?

Maybe! Let’s look at a formalism which gives us a knob to adjust, so that we can choose to penalize time more/​less harshly.

Mixing Time and Description Complexity

The speed prior is like Solomonoff induction, except that it penalizes computation time. Specifically, taking twice as long is like 1 more bit of description length. Combining time and description complexity in this way is called Levin complexity.

For Levin complexity, time is penalized logarithmically (in comparison to description length). This suggests that we will get an exponential computational ceiling again: if computing $n$ bits costs $2^n$ time, we’re around the break-even point (it might be worth computing the bits, or it might be better to memorize them). So super-exponential calculations won’t be worth it, and sub-exponential calculations will, just like we saw with minimal circuits.

However, it’s easy to tweak the time/description trade-off. Let’s define $f(t)$-Levin complexity as $l + f(t)$, where $l$ is description length and $t$ is execution time (the choice of $f(t) = \log_2(t)$ gives us regular Levin complexity). (I am not sure if $f(t)$-Levin complexity has a more standard name in the literature.) This gives us a computational ceiling of $f^{-1}$.

Why this might work:

  • If the computation of a bit (via search or other algorithmic means) takes too long, you’re better off just memorizing it. We can set this ceiling wherever we like. If we think “dangerously intelligent” computations are cubic and above, we can set the time penalty to $f(t) = \sqrt{t}$, for example. This gives us a quadratic computational ceiling.
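Here is a minimal sketch of the compute-versus-memorize trade-off, assuming the generalized complexity has the form l + f(t), with l the description length in bits and t the execution time in steps. The function names are mine, and the comparison ignores the constant number of bits needed to describe the computing program itself.

```python
import math

def f_levin_complexity(description_bits, steps, f=math.log2):
    """l + f(t); with f = log2 this is ordinary Levin complexity."""
    return description_bits + f(steps)

def worth_computing(n_bits, steps, f=math.log2):
    """True if computing n_bits in `steps` steps beats memorizing them.

    Memorizing costs n_bits of description; computing costs f(steps).
    """
    return f(steps) < n_bits

# Doubling the runtime costs exactly one extra bit under f = log2.
print(f_levin_complexity(10, 2048) - f_levin_complexity(10, 1024))  # 1.0

# Logarithmic penalty: the ceiling is exponential in the number of bits.
print(worth_computing(20, 2**10), worth_computing(20, 2**30))  # True False

# Square-root penalty: linear time passes, cubic time does not.
n = 20
print(worth_computing(n, n, f=math.sqrt), worth_computing(n, n**3, f=math.sqrt))  # True False
```

The break-even point sits where f(t) equals the number of bits, which is why the ceiling is the inverse of f: exponential for the logarithm, quadratic for the square root.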

Why this might not work:

  • Humans are essentially linear-time algorithms, in the sense that we take the same maximum amount of processing power (ie, that of the human brain) to produce each next output. Anything which produces linearly much output has to do so in at least linear time. So, Levin-complexity can’t rule out humanlike intelligence.

  • Furthermore, forcing hypotheses to be “slow” like this may only delay the inevitable. Let’s say for the sake of argument that a treacherous turn can be executed at any time, but requires a specific computation C to be executed (say, reversing the cryptographic hash of a critical password). This computation C can simply be slowed down to get past any computational ceiling. Thus, if our notion of (malign) inner optimizer resembles “eventually executes a treacherous turn”, then lowering the computational ceiling doesn’t actually help.

Elaborating on that last point: I’m assuming that you are asking for a computation that runs “sufficiently long” anyway. The speed prior does have a heavy bias for computations which cut off quickly. However, if we’re (say) predicting the weather, we probably want to require that the calculation keeps going on for a while into the future, rather than cutting off. If we’re training an AI traffic control system, we’d certainly require that it doesn’t just cut off after a few days. The idea, then, is that the treacherous computation C can be spaced out through the time the AI traffic control system is required to remain on-line.
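The "spacing out" idea can be sketched as a search that does a bounded amount of work per output. This is a toy under stated assumptions: the computation C is replaced by a harmless brute-force search over a small range, and `slow_search`, `checks_per_step`, and the target value are hypothetical names and numbers chosen for illustration.

```python
_EXHAUSTED = object()  # sentinel so that None or 0 can be valid candidates

def slow_search(candidates, is_solution, checks_per_step):
    """Generator doing at most `checks_per_step` checks per call to next().

    Yields None while the search is unfinished, then yields the solution
    on every later step.
    """
    it = iter(candidates)
    found = None
    while True:
        if found is None:
            for _ in range(checks_per_step):
                c = next(it, _EXHAUSTED)
                if c is _EXHAUSTED:
                    break
                if is_solution(c):
                    found = c
                    break
        yield found

# 10 checks per output step; the target sits at position 777 of 1000.
search = slow_search(range(1000), lambda c: c == 777, checks_per_step=10)
results = [next(search) for _ in range(100)]
print(results.index(777))  # 77
```

The per-step cost stays constant no matter how small `checks_per_step` is made, so a tighter time penalty only pushes the completion step later. It does not remove the computation.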

Taking everything together, I don’t think we have a strong basis (yet!) for thinking penalizing runtime makes the situation any better. Some have even argued that it makes the situation worse, because time constraints provide an extra incentive for complex value-driven computations (in order to manage time effectively) as opposed to simple causal hypotheses (such as raw physics). I don’t think we have enough information to judge one way or the other.

A Note on the Consensus Algorithm

The eventual write-up, which this is a mere prospectus for, would aim to cover as broad a variety of approaches as possible, and give our current best assessment of the viability of each (ie, report why it might work and why it might not work).

I can’t cover everything I’d want to cover in this prospectus, but one more approach sticks out as deserving of treatment. In February of this year, Michael Cohen posted Formal Solution to the Inner Alignment Problem. This used an algorithm which I had heard many others talk about in the context of inner alignment, but usually, with considerably less hope and optimism than Michael Cohen expresses. I’ll call the algorithm the consensus algorithm. (I don’t know of a better name for it; I’ve heard it called KWIK learning (Knows-What-It-Knows), but unfortunately KWIK is actually used for a learning setting, not the algorithm that’s typically applied in that setting.)

The consensus algorithm works like this:

  • You have a (finite) ensemble of hypotheses which you get from somewhere. For example, a Bayesian could take their top 100 hypotheses, or select a probability threshold of .01% and select all hypotheses above this threshold. A deep-learning person might instead re-run their training algorithm 100 times, with different initializations and hyperparameters.

  • You require agreement between these hypotheses in order to output any prediction. In the absence of such agreement, you instead say “I need further feedback”.

  • When you get further feedback, you get to edit your list, EG throwing out the hypotheses which were wrong (but also, perhaps, bringing in new replacement hypotheses which now exceed the .01% threshold thanks to a Bayesian update).

Note that I’m not nailing down all details, EG, how closely do we require models to agree? (Do we just require that they agree about which option is most probable? Do we require their probabilities to fall within the same small range?)
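The steps above can be sketched as follows. This is a minimal illustration, not Michael Cohen's construction: hypotheses are assumed to be deterministic Python callables, "agreement" is taken to mean exact equality of outputs, and the optional step of bringing in replacement hypotheses is left out. The class and sentinel names are hypothetical.

```python
NEED_FEEDBACK = object()  # returned instead of a prediction when there is no consensus

class ConsensusPredictor:
    def __init__(self, hypotheses):
        self.hypotheses = list(hypotheses)

    def predict(self, x):
        """Return the unanimous prediction, or NEED_FEEDBACK if there is none."""
        outputs = {h(x) for h in self.hypotheses}
        if len(outputs) == 1:
            return outputs.pop()
        return NEED_FEEDBACK

    def feedback(self, x, y):
        """Throw out every hypothesis that predicted x incorrectly."""
        self.hypotheses = [h for h in self.hypotheses if h(x) == y]

# Toy ensemble: one hypothesis that is always right, two that sometimes are not.
h_true = lambda x: x % 2
h_a = lambda x: x % 2 if x < 5 else 0
h_b = lambda x: 0

model = ConsensusPredictor([h_true, h_a, h_b])
print(model.predict(2))                   # 0 (all three agree)
print(model.predict(3) is NEED_FEEDBACK)  # True (h_b disagrees)
model.feedback(3, 1)                      # removes h_b
print(model.predict(5) is NEED_FEEDBACK)  # True (h_a disagrees)
model.feedback(5, 1)                      # removes h_a
print(model.predict(7))                   # 1
```

Note that a dissenting hypothesis never changes the prediction in this sketch, but it does decide when `NEED_FEEDBACK` is returned.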

The consensus algorithm allows us to make some nice formal arguments. For example:

  1. Suppose that we think at least one of our top 100 hypotheses is (a) always correct (or at least as accurate as the rest of the top 100) and (b) non-malign.

  2. Then no malign hypothesis will be able to control the output. Malign hypotheses are limited to either agreeing with the non-malign hypothesis, or, blocking any output by spoiling the consensus.

However, other drawbacks seem to spoil the appeal of this result.

  • To state the least of our problems first: this requires a 100x slowdown in comparison with the state-of-the-art deep learning (or whatever) we’re layering the consensus algorithm on top of. We’re amplifying safety at the expense of speed; if we think the base algorithm has a 10% frequency of yielding safe AGI, then we only need around a 10x slowdown to get a high chance of safety; if it’s a .01% chance, we would need something more like a 10,000x slowdown (since we’d want to require consensus of around 10,000 independent hypotheses).

  • Second, to apply this to something like deep learning, we need to think that each run has an independent chance of creating safe AGI. This is probably far from true. Our uncertainty about whether a particular training method produces safe AGI is probably almost entirely correlated. This would further blow up the size of the ensemble required to achieve any confidence.

  • Third, the consensus algorithm requires a strong form of realizability assumption, where you not only assume that our Bayesian space contains the true hypothesis, but furthermore, that it’s in the top 100 (or whatever number we choose). This hypothesis has to be really good: we have to think that malign hypotheses never out-guess the benign hypothesis. Otherwise, there’s a chance that we eliminate the good guy at some point (allowing the bad guys to coordinate on a wrong answer). But this is unrealistic! The world is big and complex enough that no realistic hypothesis has all the answers.
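The slowdown figures in the first point can be checked with a short calculation. It assumes exactly the independence that the second point calls into question: each run is safe with probability p, independently of the others. The function names are mine.

```python
import math

def p_at_least_one_safe(p_safe, n_runs):
    """Chance that at least one of n independent runs is safe."""
    return 1.0 - (1.0 - p_safe) ** n_runs

def runs_needed(p_safe, target):
    """Smallest n with p_at_least_one_safe(p_safe, n) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_safe))

print(p_at_least_one_safe(0.1, 10))        # about 0.65
print(runs_needed(0.1, 0.95))              # 29
print(p_at_least_one_safe(0.0001, 10000))  # about 0.63
```

With n = 1/p runs the chance of including a safe model is only around 63%, so "around a 10x slowdown" and "10,000x" are best read as orders of magnitude. A 95% chance needs roughly three times as many runs, and any correlation between runs makes the required ensemble larger still.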

Michael Cohen seems to think that restricting to imitation learning makes the realizability assumption realistic:

There are none of the usual embedding problems here—the imitator can just be bigger than the demonstrator that it’s modeling.

I disagree. It still doesn’t seem realistic to suppose that competitors to the best hypothesis are never lucky. Even if we suppose that my behavior is entirely deterministic, the number of hypotheses you’d have to consider in order for one of them to be an exact copy of my brain would no doubt be huge! Let’s conservatively call this $10^{100}$ (suggesting that I have at least 100 neurons, each of which has at least 10 meaningfully different configurations). This would suggest that the consensus algorithm needs googol times the processing power of the human brain to work.

Michael Cohen also says some other stuff about why the proposal doesn’t have embedding problems, which I also disagree with. But let’s set this aside and get back to listing other problems with the consensus algorithm.

  • Fourth, the consensus algorithm still gives inner optimizers control of when the system asks for more feedback. This in itself could be catastrophic! For example, it could ask at the worst times, when it anticipates the human feedback to be of poor quality, or when asking itself has catastrophic consequences. A silly example would be causing a self-driving car to ask for feedback right when it needs to enact difficult evasive maneuvers to avoid a crash. Obviously we can’t put a system like this in charge of extremely time-sensitive decisions like this. But whatever the case, asking for feedback will influence the external world somehow. Malign hypotheses just need to find the right moment, where asking for feedback butterfly-effects its way to a catastrophe.

  • Fifth, although (as Michael Cohen notes) the frequency of asking for feedback will rapidly diminish, the learning algorithm converges to the correct hypothesis extremely slowly. Specifically, if we have an ensemble of programs of description-length at most $l$, one of which we believe to be correct and benign, the best convergence bound we can give is busy beaver of $l$! (For those who don’t know about the busy beaver function, this roughly means “uncomputably slow”. It’s way worse than needing exponentially much data. I discuss why this happens in Subsystem Alignment.) This means we can’t solve the previous problem by keeping the system in a safe training environment (eg, keeping it away from time-sensitive tasks, and making sure that it always has a human on-duty to provide feedback when prompted) until it converges. We have to face the fact that it might require human feedback at any point in the future.

The fourth point really seems like the deal-breaker. Malign mesa-optimizers could intelligently leverage this to communicate with the outside world, manipulate the training process, etc.


Has this been useful? Would an expanded and improved version of this be useful?

This is something where I could really use detailed peer-review-like feedback, since the final version of this thing would hopefully be a pretty canonical resource, with standardized terminology and so on.

A weakness of this as it currently stands is that I purport to offer the formal version of the inner optimization problem, but really, I just gesture at a cloud of possible formal versions. I think this is somewhat inevitable, but nonetheless, could probably be improved. What I’d like to have would be several specific formal definitions, together with several specific informal concepts, and strong stories connecting all of those things together.

I’d be glad to get any of the following types of feedback:

  1. Possible definitions/​operationalizations of significant concepts.

  2. Ideas about which definitions and assumptions to focus on.

  3. Approaches that I’m missing. I’d love to have a basically exhaustive list of approaches to the problem discussed so far, even though I have not made a serious attempt at that in this document.

  4. Any brainstorming you want to do based on what I’ve said—variants of approaches I listed, new arguments, etc.

  5. Suggested background reading.

  6. Nitpicking little choices I made here.

  7. Any other type of feedback which might be relevant to putting together a better version of this.

If you take nothing else away from this, I’m hoping you take away this one idea: the main point of the inner alignment problem (at least to me) is that we know hardly anything about the relationship between the outer optimizer and any mesa-optimizers. There are hardly any settings where we can rule mesa-optimizers out. And we can’t strongly argue for any particular connection (good or bad) between outer objectives and inner.