Mesa-Optimizers vs “Steered Optimizers”


The paper Risks from Learned Optimization introduced the term “inner alignment” in the context of a specific class of scenarios, namely a “base optimizer” which searches over a space of “inner” algorithms. If the inner algorithm is an optimizer, it’s called a “mesa-optimizer”, and if its objective differs from the base optimizer’s objective, that is an “inner alignment” problem. In this post I want to plead for us to also keep in mind a different class of scenarios, which I’ll call “Steered Optimizers”, and which also has an “inner alignment” problem. The inner alignment problem for mesa-optimizers is directly analogous to the inner alignment problem for steered optimizers, but the specific failure modes, risk factors, and solutions are all somewhat different. I’ll argue that our future AGIs are at least comparably likely to be steered optimizers as to be mesa-optimizers. So again, we should keep both scenarios in mind.


I recently wrote a post about brain algorithms with “inner alignment” in the title, but I was talking about something kinda different than in the famous Risks from Learned Optimization paper that I was implicitly referring to. I didn’t directly explain why I felt entitled to use the term “inner alignment” for this different situation, but I think it’s worth going into, especially because it’s a more general approach to making AGI that goes beyond brain-inspired approaches.

(Terminology note: Following “Risks from Learned Optimization”, I will use the term “optimizer” in this post to mean an algorithm which uses foresight / planning to search over possible actions in pursuit of a particular goal, a.k.a. a “selection”-type optimizer. I want humans to count as “optimizers”, so I will also allow “optimizers” to sometimes choose actions for other reasons, and to maybe have inconsistent, context-dependent goals, as long as they at least sometimes use foresight / planning to pursue goals.)

Let’s start with two scenarios in which we might create highly intelligent AI “optimizers”:

1. Search-over-algorithms scenario: (this is the one from Risks from Learned Optimization). Here, you have a “base optimizer” which searches over a space of possible algorithms for an algorithm which performs very well according to a “base objective”. For example, the base optimizer might be doing gradient descent on the weights of an RNN (large enough RNNs are Turing-complete!). Anyway, if the base optimizer settles on an inner algorithm which is itself an optimizer, then we call it a “mesa-optimizer”. Inner alignment is alignment between the mesa-optimizer’s objective and the base objective, while outer alignment is alignment between the base objective and what the programmer wants.

2. Steered Optimizer scenario: (this is how I think the human brain works, more or less, see my post “Inner alignment in the brain”). Here, you also have a two-layer structure, but the layers are different. The inner layer is an algorithm that does online learning, world-modeling, planning, and acting, and it is an optimizer. We wrote the inner-layer algorithm ourselves, and it is never modified or reset (the whole scenario is just one “episode”, in RL terms). But as the inner algorithm learns more and more, it becomes increasingly powerful, and increasingly difficult for us to understand—like comparing a newborn brain to an adult brain, where the latter carries a lifetime of experience and ideas. Meanwhile, the base layer watches the inner layer in real time, and tries to “steer” it towards optimizing the right target, using hooks that we had built into the inner layer’s source code. How does that steering work? In the simplest case, the base layer can be a reward function calculator, and it sends the reward information to the inner layer, which in turn tries to build a predictive model of the correlates of that reward signal, set them as a goal, and make foresighted plans to maximize them. (There could be other steering methods too—see below.) As in the other scenario, inner alignment is alignment between the inner layer’s objective(s) and the formula used by the base layer to compute rewards, while outer alignment is alignment between the latter and what the programmer wants.
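
To make the simplest steering case concrete, here is a minimal toy sketch of the steered-optimizer loop: a fixed base layer computes rewards, and a hand-coded inner layer learns a predictive model of what earns reward and then plans against that model. Everything named here (`base_layer_reward`, `InnerLayer`, the three toy actions, the learning rate and exploration probability) is a hypothetical illustration, not something taken from the brain or from any real system, and a real inner layer would build a far richer world-model than a lookup table.

```python
import random
from dataclasses import dataclass, field


def base_layer_reward(action: str) -> float:
    """Base layer: a fixed reward formula that we wrote (hypothetical)."""
    return 1.0 if action == "make_paperclip" else 0.0


@dataclass
class InnerLayer:
    """Hand-coded online learner and planner; it is never reset."""
    actions: tuple
    learning_rate: float = 0.2   # assumed value, for illustration only
    explore_prob: float = 0.1    # assumed value, for illustration only
    predicted_reward: dict = field(default_factory=dict)

    def __post_init__(self):
        # Start from a known, simple state: no action is expected to pay off.
        self.predicted_reward = {a: 0.0 for a in self.actions}

    def plan(self, rng: random.Random) -> str:
        """Pick the action the learned model predicts is best (rarely explore)."""
        if rng.random() < self.explore_prob:
            return rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.predicted_reward[a])

    def receive_steering(self, action: str, reward: float) -> None:
        """Steering hook: nudge the learned reward model toward the signal."""
        old = self.predicted_reward[action]
        self.predicted_reward[action] = old + self.learning_rate * (reward - old)


def run_single_episode(steps: int = 1000, seed: int = 0) -> InnerLayer:
    """One long episode: the base layer watches and steers as the inner layer acts."""
    rng = random.Random(seed)
    inner = InnerLayer(actions=("idle", "explore_room", "make_paperclip"))
    for _ in range(steps):
        action = inner.plan(rng)
        reward = base_layer_reward(action)      # base layer computes the signal
        inner.receive_steering(action, reward)  # inner layer updates its goals
    return inner


if __name__ == "__main__":
    print(run_single_episode().predicted_reward)
```

In this sketch the inner layer's effective goal lives in `predicted_reward`, which it learned, and not in `base_layer_reward`, which we wrote. Inner alignment, in this toy picture, is the question of whether the two agree.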

Here’s a little comparison table:


| | “Search Over Algorithms” scenario | “Steered Optimizer” scenario |
|---|---|---|
| Base layer | Run inner layer for N steps, compute score, do gradient descent on inner layer algorithm | Run inner layer. As it runs, watch it, and send rewards (or other signals) to “steer” it. |
| Inner layer | Arbitrary algorithm discovered by base layer | Known, hand-coded algorithm, involving learning, world-modeling, planning, acting, etc. |
| Interpretability of inner layer (by default, i.e. without special interpretability tools) | Always inscrutable | Starts from a known, simple state, but gets more and more inscrutable as it builds a complex world-model |
| What is the inner layer’s objective? | It might not have one. If it does, we don’t know what it is (by default) | We designed it to form and seek goals based on the steering signals it receives, but we don’t know its actual goals at any given time (by default) |
| How many training episodes? | Millions, I presume. | As few as one; maybe several, but more like a run-and-debug loop. |
| Are we doing this today? | Not really (but see references in “Risks from Learned Optimization”). | Not that I know of, off-hand, but it’s probably in the AI literature somewhere. |
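
For contrast, the base layer in the search-over-algorithms scenario can be sketched as an outer loop that runs each candidate inner algorithm for N steps, scores it, and updates. This is a toy under loud assumptions: the "space of algorithms" is collapsed to a single number `p` (so the inner algorithm here is not an optimizer at all, let alone a mesa-optimizer), simple hill climbing is swapped in for gradient descent, and the names `run_inner_algorithm` and `base_optimizer` are hypothetical. The one thing it is meant to show is the cost structure: every update to the inner algorithm costs a full N-step episode.

```python
import random


def run_inner_algorithm(p: float, n_steps: int, rng: random.Random) -> int:
    """Run one candidate for n_steps and return its score.

    Hypothetical task: on each step the candidate takes the rewarded
    action with probability p, earning one point when it does.
    """
    score = 0
    for _ in range(n_steps):
        if rng.random() < p:
            score += 1
    return score


def base_optimizer(episodes: int, n_steps: int, seed: int = 0):
    """Hill-climb over candidates; return the best one and the compute spent."""
    rng = random.Random(seed)
    best_p = 0.1
    best_score = run_inner_algorithm(best_p, n_steps, rng)
    inner_steps_used = n_steps
    for _ in range(episodes - 1):
        # Propose a nearby candidate, clipped to the valid range [0, 1].
        candidate = min(1.0, max(0.0, best_p + rng.uniform(-0.1, 0.1)))
        score = run_inner_algorithm(candidate, n_steps, rng)
        inner_steps_used += n_steps  # each evaluation is a whole episode
        if score > best_score:
            best_p, best_score = candidate, score
    return best_p, inner_steps_used


if __name__ == "__main__":
    p, cost = base_optimizer(episodes=200, n_steps=50)
    print(p, cost)
```

The total number of inner-layer steps is `episodes * n_steps`, so if each episode has to be long enough to learn a world-model from scratch, the outer loop becomes very expensive.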

By the way, these two scenarios are not the only two possibilities, nor are they mutually exclusive. The obvious example for “not mutually exclusive” is the human brain, which fits nicely into both categories—the subcortex steers the neocortex (more on which below), and meanwhile evolution is a search-over-algorithms-type base optimizer for the whole brain.

Why might we expect AI researchers to build steered optimizers, rather than searches-over-algorithms?

(Update: I later massively elaborated this section into the post Against Evolution As An Analogy For How Humans Will Build AGI.)

  • Steered optimizers enable dramatically longer episodes than searches-over-algorithms. In the first line of the table above I wrote that search-over-algorithms involves running the inner layer for N steps per episode. How big is N? If we want to build a system that can learn a whole predictive world-model from scratch, that’s an awfully big N! Evolution is a good example here; it picks a genotype and then spends many decades calculating its loss. Imagine doing ML with one gradient descent step per decade! For various reasons, I don’t think this rules out a search-over-algorithms approach, but I definitely think it’s a strike against its plausibility. Steered optimizers do not have this problem; they do not need to run through millions of episodes to reach excellent performance, just a single very long episode, or more likely dozens of very long episodes for debugging, hyperparameter search, etc.

  • As I keep mentioning, I think brains work as steered optimizers, with the steered optimizer subsystem centered around the neocortex (or pallium in birds and lizards), and the steering subsystem based in other parts of the brain. If I’m right about that, that would imply that (1) steered optimizers are a viable path to AGI, and (2) we have a straightforward-ish development path to get there (which lots of people are already working on), i.e. we “merely” need to reverse-engineer the neocortex.

  • Given that we know at least vaguely what a world-modeling-and-acting-and-planning algorithm is supposed to do and how, I think people will be able to write such an algorithm themselves faster than they could find it by blind search. I don’t think it’s that complicated an algorithm, compared to the collective capability of the worldwide AI community. (Why don’t I think the algorithm is horrifically complicated? Mainly from inside-view reading and thinking about neocortical algorithms, which I discussed most recently here. I could be wrong.)

Incidentally, if we’re writing the inner algorithm ourselves, why not just put the goal into the source code directly? Well, that would be awesome … But it may not be possible! I think the easiest way to build the inner algorithm is to have it build a world-model from scratch, more-or-less by finding patterns in the input, and patterns in the patterns, etc. So if we want the AGI to have a goal of maximizing paperclips, we face the problem that there is no “paperclips” concept in its source code; it has to run for a while before forming that concept. That’s why we might instead build an AGI by letting it start learning and acting, and trying to steer it as it goes.

How might one steer an AGI steered optimizer?

  • As mentioned above, we can send reward signals—calculated automatically and/or by human overseers.

  • A human, assisted by interpretability tools, could reach in and add / subtract / edit goals. Or a similar thing could be automated.

  • You could build a hook in the inner layer for receiving natural language commands. Like maybe, whenever you press the button and talk into the microphone, whatever world-model concepts are internally activated by that speech become the inner layer’s goals (or something like that).

  • Any of the weird tricks that the brain uses, as discussed in my posts inner alignment in the brain and an earlier post about human instincts. (Update: Also this later post.)

  • I don’t know! I’m sure there are other things.
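
As a rough illustration of what "hooks built into the inner layer's source code" could mean for the first three options above, here is a hedged sketch. The class `SteerableInnerLayer`, its method names, the `goal_weights` table, and the word-to-concept lookup are all hypothetical stand-ins; in particular, a real system would need interpretability tools or a learned language model where this sketch uses a plain dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class SteerableInnerLayer:
    """Hypothetical inner layer exposing three steering hooks."""
    # Learned concepts mapped to how strongly each is currently pursued.
    goal_weights: dict = field(default_factory=dict)
    learning_rate: float = 0.5  # assumed value, for illustration only

    def on_reward(self, active_concepts, reward: float) -> None:
        """Hook 1: a reward signal credits whatever concepts were active."""
        for concept in active_concepts:
            old = self.goal_weights.get(concept, 0.0)
            self.goal_weights[concept] = old + self.learning_rate * (reward - old)

    def edit_goal(self, concept: str, weight) -> None:
        """Hook 2: an overseer sets a goal directly, or removes it with None."""
        if weight is None:
            self.goal_weights.pop(concept, None)
        else:
            self.goal_weights[concept] = weight

    def on_command(self, utterance: str, concept_for_word: dict) -> None:
        """Hook 3: concepts activated by the spoken words become goals."""
        for word in utterance.lower().split():
            concept = concept_for_word.get(word)
            if concept is not None:
                self.goal_weights[concept] = 1.0

    def top_goal(self):
        """Return the most strongly weighted goal, or None if there is none."""
        if not self.goal_weights:
            return None
        return max(self.goal_weights, key=self.goal_weights.get)
```

A usage sketch: calling `on_reward(["tasty_food"], 1.0)` on a fresh instance raises the weight on that concept, `edit_goal("tasty_food", None)` deletes it again, and `on_command("Make paperclips", {"paperclips": "paperclip"})` installs the concept that the speech activated.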

Lessons from being a human

If the human neocortex is a steered optimizer, what can we learn from that example?

1. How does it feel to be steered?

You try a new food, and find it tastes amazing! This wonderful feeling is your subcortex sending a steering signal up to your neocortex. All of a sudden, a new goal has been installed in your mind: eat this food again! This is not your only goal in life, of course, but it is a goal, and you might use your intelligence to construct elaborate plans in pursuit of that goal, like shopping at a different grocery store so you can buy that food again.

It’s a bit creepy, if you think about it!

“You thought you had a solid identity? Ha!! Fool, you are a puppet! If your neocortex got dopamine at the right times, all of a sudden you would want entirely different things out of life!”

2. What does Inner Alignment failure look like in humans?

A prototypical inner alignment failure would be this: we know that some situation would lead the subcortex to install a certain goal in our minds, we don’t want to have that goal (according to our current goals), and so we avoid that situation.

Imagine, for example, not trying a drug because you’re afraid of getting addicted.

To make that analogy explicit, you could imagine that our brain was designed by an all-powerful alien who wanted us to take the drug, and therefore set up our brain with a system that recognizes the chemical signature of that drug, and installs that drug as a new goal when that chemical signature is detected. At first glance, that’s not a bad design for a steering mechanism, and indeed it works sometimes. But we can undermine the alien’s intentions by understanding how that steering mechanism works, and thus avoiding the drug.

A more prosaic example: practically every “productivity hack” is a scheme to manipulate our own future subcortical steering signals.

3. What would corrigible alignment look like in humans?

Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.

More random thoughts on steering

  • An AGI might be easier to steer than a human brain, if we can find a way to reliably steer in response to imagination / foresight, and not just actions. In the example above, where I am trying not to get addicted to a drug, my job is made pretty easy by the fact that I need to actually take the drug before getting addicted. Merely thinking about taking the drug will not install that goal in my brain. Maybe we can close that loophole in our steered AGIs somehow? (Update: the brain sorta does this via supervised learning.)

  • I mentioned corrigible alignment above. I think that the sense of “corrigible alignment” which is most analogous to the “Risks from learned optimization” paper is like hedonism—valuing the reward steering signals, as an end in themselves. If that’s the definition, then I would be concerned that a corrigibly-aligned system solves the inner alignment problem while horribly exacerbating the outer alignment problem, because the system is now motivated to wirehead or otherwise game the reward signals. It’s not necessarily an unsolvable outer alignment problem—maybe an AGI could be motivated by both hedonism and a specific aversion to self-modification other than by normal learning, for example. But I’m awfully skeptical that this is a good starting point. I think it’s more promising to go for a different flavor of corrigibility, where we try to steer the system so that it becomes motivated by something like “the intentions of the programmer”, i.e. a flavor of corrigibility that tries to cut through both the inner and the outer alignment problems simultaneously. (Maybe this is my opinion about corrigible alignment in the search-over-algorithms scenario as well...)

Related work

Deep RL from Human Preferences and Scalable Agent Alignment Via Reward Modeling both bring up the idea of taking reward signals, trying to understand those signals in the form of a predictive model, and then using that reward model as a target for training an agent (if I understand everything correctly). (This is not the only idea in the papers, and in most respects the papers are more like search-over-algorithms.) Anyway, that specific idea has parallels with how a steered optimizer would try to relate its reward signals to meaningful, predictive concepts in its world-model. In this post I’m trying to emphasize that reward-modeling part, and generalize it to other ways of steering agents. Also, unlike those papers, I prefer to merge the reward-modeling task and the choosing-actions task into a single model, because their requirements seem to heavily overlap, at least in the case of a powerful, world-modeling, optimizing agent. For example, the reward-modeling part needs to look at a bunch of reward signals and figure out that they correspond to the goal “maximize paperclips”; while the choosing-actions part needs to take the goal “maximize paperclips” and figure out how to actually do it. These two parts require much the same world-modeling capabilities, and indeed I don’t see how it would work except by having both parts actually referencing the same world-model.

(I’m sure there’s other related work too, that’s just what jumped to my mind.)

(Thanks to Evan Hubinger for comments on a draft.)