Mesa-Search vs Mesa-Control

I currently see the spontaneous emergence of learning algorithms as significant evidence for the commonality of mesa-optimization in existing ML, and suggestive evidence for the commonality of inner alignment problems in near-term ML.

[I currently think that there is only a small amount of evidence toward this. However, due to thinking about the issues, I’ve still made a significant personal update in favor of inner alignment problems being frequent.]

This is bad news, in that it greatly increases my odds on this alignment problem arising in practice.

It’s good news in that it suggests this alignment problem won’t catch ML researchers off guard; maybe there will be time to develop countermeasures while misaligned systems are at only a moderate level of capability.

In any case, I want to point out that the mesa-optimizers suggested by this evidence might not count as mesa-optimizers by some definitions.

Search vs Control

Nevan Wichers comments on spontaneous-emergence-of-learning:

I don’t think that paper is an example of mesa optimization. Because the policy could be implementing a very simple heuristic to solve the task, similar to: Pick the image that lead to highest reward in the last 10 timesteps with 90% probability. Pick an image at random with 10% probability.

So the policy doesn’t have to have any properties of a mesa optimizer like considering possible actions and evaluating them with a utility function, etc.
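To make the quoted heuristic concrete, here is a minimal sketch of a policy of that shape. The class name, the bandit harness, and the parameters `window` and `explore` are my own illustrative assumptions; nothing here is taken from the paper. Note that the policy has no trainable weights at all, yet its behavior improves within an episode because of the short memory it carries.

```python
import random
from collections import deque


class RecentBestPolicy:
    """Fixed heuristic: nothing is trained, but a short memory steers behavior."""

    def __init__(self, n_actions, window=10, explore=0.1, seed=0):
        self.n_actions = n_actions
        self.explore = explore
        # Only the last `window` (action, reward) pairs are remembered.
        self.history = deque(maxlen=window)
        self.rng = random.Random(seed)

    def act(self):
        # With probability `explore` (or with no memory yet), pick at random.
        if not self.history or self.rng.random() < self.explore:
            return self.rng.randrange(self.n_actions)
        # Otherwise repeat the action with the highest remembered reward
        # (ties go to the oldest remembered entry).
        best_action, _ = max(self.history, key=lambda pair: pair[1])
        return best_action

    def observe(self, action, reward):
        self.history.append((action, reward))


def run_bandit(policy, reward_probs, steps=2000, seed=1):
    """Toy multi-armed bandit: arm i pays 1.0 with probability reward_probs[i]."""
    rng = random.Random(seed)
    rewards = []
    for _ in range(steps):
        action = policy.act()
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        policy.observe(action, reward)
        rewards.append(reward)
    return rewards
```

Running `run_bandit(RecentBestPolicy(n_actions=3), [0.0, 1.0, 0.0])` should give an average reward well above the one-in-three a purely random policy would get, even though there is no search and no utility function anywhere in the code.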

In Selection vs Control, I wrote about two different kinds of ‘optimization’:

  • Selection refers to search-like systems, which look through a number of possibilities and select one.

  • Control refers to systems like thermostats, organisms, and missile guidance systems. These systems do not get a re-do for their choices. They make choices which move toward the goal at every moment, but they don’t get to search, trying many different things—at least, not in the same sense.
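The contrast between the two bullets can be sketched in a few lines. This is only a toy illustration under my own assumptions (the function names and the proportional `gain` parameter are hypothetical): the selector gets to evaluate every candidate before committing, while the thermostat makes one corrective move per moment and never revisits it.

```python
def select_best(candidates, objective):
    """Selection: evaluate many possibilities, then commit to the best one."""
    return max(candidates, key=objective)


def thermostat_step(temperature, setpoint, gain=0.5):
    """Control: a single corrective move toward the goal, with no lookahead."""
    return temperature + gain * (setpoint - temperature)


def run_thermostat(temperature, setpoint, steps=20):
    """Apply the controller repeatedly; each step is final once taken."""
    trajectory = [temperature]
    for _ in range(steps):
        temperature = thermostat_step(temperature, setpoint)
        trajectory.append(temperature)
    return trajectory
```

For example, `select_best(range(-5, 6), lambda x: -(x - 3) ** 2)` scans all eleven candidates, whereas `run_thermostat(10.0, 20.0)` approaches 20 without ever comparing alternatives.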

I take Nevan Wichers to be saying that there is no evidence search is occurring. The mesa-optimization being discussed recently could be very thermostat-like, using simple heuristics to move toward the goal.


Defining mesa-optimizers by their ability to search is somewhat nice:

  • There is some reason to think that mesa-optimizers which implement an explicit search are the most concerning, because they are the ones which could explicitly model the world, including the outer optimizer, and make sophisticated plans based on this.

  • This kind of mesa-optimizer may be more theoretically tractable. If we solve problems with very (very) time-efficient methods, then search-type inner optimizers may be eliminated: whatever answers the search computation finds, there could be a more efficient solution which simply memorized a table of those answers. Paul asks a related theory question. Vanessa gives a counterexample, which involves a control-type mesa-optimizer rather than one which implements an internal search. [Edit: that’s not really clear; see this comment.]

    • So it’s possible that we could solve mesa-optimization in theory, by sticking to search-based definitions—while still having a problem in practice, due to control-type inner optimizers. (I want to emphasize that this would be significant progress, and well worth doing.)
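The "memorized table" point can be illustrated with a toy example. This is a sketch under my own assumptions (the integer-square-root task and all names are hypothetical stand-ins): any answer an inner search would find at run time can, in principle, be replaced by a lookup that does no searching.

```python
def search_answer(x, candidates=range(100)):
    """Inner search: scan candidates at run time for the c with c*c closest to x."""
    return min(candidates, key=lambda c: abs(c * c - x))


# Precompute the search results once; afterwards no search happens at run time.
TABLE = {x: search_answer(x) for x in range(1000)}


def table_answer(x):
    """Search-free solution: same outputs as search_answer on the tabulated inputs."""
    return TABLE[x]
```

The table is faster per query but only covers the inputs it was built for, which is one way to see why the argument needs a strong time-efficiency pressure to go through.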


Mesa-controllers sound like they may not be a huge concern, because they don’t strategically optimize based on a world-model in the same way. However, I think the model discussed in the spontaneous-emergence-of-learning post is a significant counterargument to this.

The post discusses RL agents which spontaneously learn an inner RL algorithm. It’s important to pause and ask what this means. Reinforcement learning is a task, not an algorithm. It’s a bit nonsensical to say that the RL agent is spontaneously learning the RL task inside of itself. So what is meant?

The core empirical claim, as I understand it, is that task performance continues to improve after weights are frozen, suggesting that learning is still taking place, implemented in neural activation changes rather than neural weight changes.

Why might this happen? It sounds a bit absurd: you’ve already implemented a sophisticated RL algorithm, which keeps track of value estimates for states and actions, and propagates these value estimates to steer actions toward future value. Why would the learning process re-implement a scheme like that, nested inside of the one you implemented? Why wouldn’t it just focus on filling in the values accurately?

I’ve thought of two possible reasons so far.

  1. Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan’s mesa-optimization post, just replacing search with RL.

  2. More likely perhaps (based on my understanding), the outer RL algorithm has a learning rate that might be too slow, or is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its learning rate to improve performance.

This is more concerning than a thermostat-like bag of heuristics, because an RL algorithm is a pretty agentic thing, which can adapt to new situations and produce novel, clever behavior.

They also suggest that the inner RL algorithm may be model-based while the outer is model-free. This goes some distance toward the “can model you, the world, and the outer alignment process, in order to manipulate it” concern which we have about search-type mesa-optimizers.

Mesa-Learning Everywhere?

Gwern replies to a comment by Daniel Kokotajlo:

>Learning still happening after weights are frozen? That’s crazy. I think it’s a big deal because it is evidence for mesa-optimization being likely and hard to avoid.

Sure. We see that elsewhere too, like Dactyl. And of course, GPT-3.

People are jumping on the RL examples as mesa-optimization. But, for all the discussion of GPT-3, I saw only speculative remarks about mesa-optimization in GPT-3. Why does an RL algorithm continuing to improve performance after weights are frozen indicate inner optimization, while evidence of the same thing in text prediction does not?

1. Text prediction sounds benign, while RL sounds agentic.

One obvious reason: an inner learner in a text prediction system sounds like just more text prediction. When we hear that GPT-3 learned-to-learn, and continues learning after the weights are frozen, illustrating few-shot learning, we imagine the inner learner is just noticing patterns and extending them. When we hear the same for an RL agent, we imagine the inner learner actively trying to pursue goals (whether aligned or otherwise).

I think this is completely spurious. I don’t currently see any reason why the inner learner in an RL system would be more or less agentic than in text prediction.

2. Recurrence.

A more significant point is the structure of the networks in the two cases. GPT-3 has no recurrence: no memory which lasts between predicting one token and the next.

The authors of the spontaneous learning paper mention recurrence as one of the three conditions which should be met in order for inner learning to emerge. But that’s just a hypothesis. If we see the same evidence in GPT-3 (evidence of learning after the weights are frozen), then shouldn’t we still draw the same conclusion in both cases?

I think the obvious argument for the necessity of recurrence is that, without recurrence, there is simply much less potential for mesa-learning. A mesa-learner holds its knowledge in the activations, which get passed forward from one time-step to the next. If there is no memory, that can’t happen.

But if GPT-3 can accomplish the same things empirically, who cares? GPT-3 is entirely reconstructing the “learned information” from the history, at every step. If it can accomplish so much this way, should we count its lack of recurrence against it?

Another argument might be that the lack of recurrence makes mesa-learners much less likely to be misaligned, or much less likely to be catastrophically misaligned, or otherwise practically less important. I’m not sure what to make of that possibility.

3. Mesa-learning isn’t mesa-optimization.

One very plausible explanation of why mesa-learning happens is that the system learns a probability distribution which extrapolates the future from the past. This is just regular ol’ good modeling. It doesn’t indicate any sort of situation where there’s a new agent in the mix.

Consider a world which is usually “sunny”, but sometimes becomes “rainy”. Let’s say that rainy states always occur twice in a row. Both RL agents and predictive learners will learn this. (At least, RL agents will learn about it insofar as it’s relevant to their task.) No mesa-learning here.

Now suppose that rainy streaks can last more than two days. When it’s rainy, it’s more likely to be rainy tomorrow. When it’s sunny, it’s more likely to be sunny tomorrow. Again, both systems will learn this. But it starts to look a little like mesa-learning. Show the system a rainy day, and it’ll be more prone to anticipate a rainy day tomorrow, improving its performance on the “rainy day” task. “One-shot learning!”

Now suppose that the more rainy days there have been in a row, the more likely it is to be rainy the next day. Again, our systems will learn the probability distribution. This looks even more like mesa-learning, because we can show that performance on the rainy-day task continues to improve as we show the frozen-weight system more examples of rainy days.

Now suppose that all these parameters drift over time. Sometimes rainy days and sunny days alternate. Sometimes rain follows a memoryless distribution. Sometimes longer rainy streaks become more likely to end, rather than less. Sometimes there are repeated patterns, like rain-rain-sun-rain-rain-sun-rain-rain-sun.

At this point, the learned probabilistic model starts to resemble a general-purpose learning algorithm. In order to model the data well, it has to adapt to a variety of situations.
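Here is a minimal sketch of how a fixed model can look like it is learning. I am assuming, purely for illustration, that the "frozen weights" amount to a uniform prior over three hypothetical weather regimes (the `REGIMES` table and its numbers are made up). Nothing is updated between calls; all of the apparent learning comes from conditioning on a longer history.

```python
import math

# Hypothetical fixed hypothesis class. Each regime is a pair:
# (P(rain tomorrow | rain today), P(rain tomorrow | sun today)).
REGIMES = {
    "sticky": (0.9, 0.1),
    "alternating": (0.1, 0.9),
    "memoryless": (0.5, 0.5),
}


def predict_rain(history):
    """Return P(rain next) under a uniform-prior Bayesian mixture over REGIMES.

    `history` is a non-empty string of 'R' (rain) and 'S' (sun).
    """
    # Accumulate the log-likelihood of the observed transitions under each regime.
    log_post = {name: 0.0 for name in REGIMES}
    for prev, cur in zip(history, history[1:]):
        for name, (p_after_rain, p_after_sun) in REGIMES.items():
            p_rain = p_after_rain if prev == "R" else p_after_sun
            log_post[name] += math.log(p_rain if cur == "R" else 1.0 - p_rain)

    # Normalize into posterior weights (subtracting the max for stability).
    top = max(log_post.values())
    weights = {name: math.exp(value - top) for name, value in log_post.items()}
    total = sum(weights.values())

    # Mix each regime's prediction for the day after the last observed day.
    last = history[-1]
    return sum(
        (weight / total) * (REGIMES[name][0] if last == "R" else REGIMES[name][1])
        for name, weight in weights.items()
    )
```

With these numbers, `predict_rain("R")`, `predict_rain("RR")`, and `predict_rain("RRRR")` give increasing probabilities of rain, and `predict_rain("RSRS")` confidently predicts rain after a sunny day. That is "few-shot" adaptation to two different patterns from a model that is just a conditional distribution.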

But there isn’t necessarily anything mesa-optimize-y about that. The text prediction system just has a very good model—it doesn’t have models-inside-models or anything like that. The RL system just has a very good model—it doesn’t have something that looks like a new RL algorithm implemented inside of it.

At some level of sophistication, it may be easier to learn some kind of general-purpose adaptation, rather than all the specific things it has to adapt to. At that point it might count as mesa-optimization.

4. This isn’t even mesa-learning, it’s just “task location”.

Taking the previous remarks a bit further: do we really want to count it as ‘mesa-learning’ if it’s just constructed a very good conditional model, which notices a wide variety of shifting local regularities in the data, rather than implementing an internal learning algorithm which can take advantage of regularities of a very general sort?

In GPT-3: a disappointing paper, Nostalgebraist argues that the second is unlikely to be what’s happening in the case of GPT-3. It’s not likely that GPT-3 is learning arithmetic from examples. It’s more likely that it is learning that we are doing arithmetic right now. This is less like learning and more like using a good conditional model. It isn’t learning the task, it’s just “locating” one of many tasks that it has already learned.
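A toy version of "task location" might look like the following. This is my own sketch, not anything from Nostalgebraist's post: the `KNOWN_TASKS` library and both function names are hypothetical. The few-shot examples only select among tasks the system already knows, and a genuinely novel task cannot be picked up from the prompt.

```python
# Hypothetical library of tasks the model is assumed to have learned already.
KNOWN_TASKS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
}


def locate_task(examples):
    """Return the names of known tasks consistent with every ((a, b), answer) example."""
    return [
        name
        for name, fn in KNOWN_TASKS.items()
        if all(fn(a, b) == answer for (a, b), answer in examples)
    ]


def few_shot_predict(examples, query):
    """Answer the query using the first located task, or None if nothing fits."""
    matches = locate_task(examples)
    if not matches:
        return None  # nothing in the library fits, so nothing is "learned"
    return KNOWN_TASKS[matches[0]](*query)
```

Given the examples `[((2, 3), 5), ((1, 1), 2)]`, the only consistent task is addition, so the query `(10, 4)` is answered with 14. Given `[((2, 3), 8)]` (exponentiation, which is not in the library), the function returns `None`. A true mesa-learner, in the sense I want, would do better than this on the second case.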

I’ll grant that the distinction gets very, very fuzzy at the boundaries. Are literary parodies of Harry Potter “task location” or “task learning”? On the one hand, it is obviously bringing to bear a great deal of prior knowledge in these cases, rather than learning everything anew on the fly. It would not re-learn this task in an alien language with its frozen weights. On the other hand, it is obviously performing well at a novel task after seeing a minimal demonstration.

I’m not sure where I would place GPT-3, but I lean toward there being a meaningful distinction here: a system can learn a general-purpose learning algorithm, or it can ‘merely’ learn a very good conditional model. The first is what I think “mesa-learner” should mean.

We can then ask the question: did the RL examples discussed previously constitute true mesa-learning? Or did they merely learn a good model, which represented the regularities in the data? (I have no idea.)

In any case, the fuzziness of the boundary makes me think these methods (i.e., a wide variety of methods) will continue moving further along the spectrum toward producing powerful mesa-learners (and hence, mesa-optimizers) as they are scaled up.