Why GPT wants to mesa-optimize & how we might change this

This post was inspired by orthonormal’s post Developmental Stages of GPTs and the discussion that followed, so only part of it is original.

First I’ll aim to provide a crisper version of the argument for why GPT wants to mesa-optimize. Specifically, I’ll explain a well-known optimization algorithm used in text generation, and argue that GPT can improve performance on its objective by learning to implement something like this algorithm internally.

Then I’ll offer some ideas of mine about how we might change this.

Explanation of beam search

Our goal is to generate plausible text. We evaluate whether text is “plausible” by multiplying together all the individual word probabilities from our language model.
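As a minimal sketch of that scoring rule (the function names and the example probabilities below are made up for illustration, not taken from any library):

```python
import math

def plausibility(word_probs):
    """Product of the per-word probabilities for one piece of text."""
    return math.prod(word_probs)

def log_plausibility(word_probs):
    """The same score in log space, which avoids underflow on long texts."""
    return sum(math.log(p) for p in word_probs)

# Hypothetical probabilities a language model assigned to three successive words.
probs = [0.4, 0.9, 0.5]
score = plausibility(probs)
log_score = log_plausibility(probs)
```

Maximizing the product and maximizing the sum of logs pick out the same text, so real implementations usually work with log probabilities.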

Greedy word selection has a problem: Since it doesn’t do lookahead, it’s liable to get stuck in a dead end. Let’s say we give our system the following poem about cheeses and ask it to generate more text:

Mozzarella is white

So you can see it at night

Cheddar is...

If our language model is decent, the word it will assign the highest probability to is “orange”. But this creates a problem, because “orange” is a hard word to rhyme.

Beam search is an attempt to solve this problem. Instead of picking the next word greedily, we explore the tree of completions and try to find a multi-word completion that maximizes the product of the individual word probabilities.

Because there are so many words in the English language, the tree grows exponentially with depth and quickly becomes far too large to search exhaustively. So we choose an integer beam_width for the number of partial completions to track, and each time we take another step deeper into the tree, we discard all but the most plausible beam_width partial completions.

Beam search with a beam width of 2. The bold red path corresponds to the maximum-plausibility completion, which would not get discovered by greedy search because “nice” has a higher probability than “dog”. Image stolen from this Hugging Face blog post, which has another explanation of beam search if you didn’t like mine.
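Here is a small sketch of greedy search versus beam search on a toy model. The lookup table TOY_MODEL and its probabilities are illustrative assumptions standing in for a real language model; only the search logic matters.

```python
# Toy stand-in for a language model: context (tuple of words) -> next-word probabilities.
# All numbers are illustrative assumptions.
TOY_MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}

def next_word_probs(context):
    return TOY_MODEL.get(tuple(context), {})

def greedy(prompt, steps):
    """Always take the single most probable next word."""
    words, score = list(prompt), 1.0
    for _ in range(steps):
        probs = next_word_probs(words)
        if not probs:
            break
        word = max(probs, key=probs.get)
        words.append(word)
        score *= probs[word]
    return words, score

def beam_search(prompt, steps, beam_width):
    """Track the beam_width most plausible partial completions at each depth."""
    beams = [(1.0, list(prompt))]
    for _ in range(steps):
        candidates = []
        for score, words in beams:
            probs = next_word_probs(words)
            if not probs:
                candidates.append((score, words))  # nothing to extend; keep as-is
                continue
            for word, p in probs.items():
                candidates.append((score * p, words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]  # discard everything outside the beam
    return beams

greedy_words, greedy_score = greedy(["The"], steps=2)
best_score, best_words = beam_search(["The"], steps=2, beam_width=2)[0]
```

With these numbers, greedy search commits to “nice” and ends up with a plausibility of about 0.2, while beam search keeps “dog” alive and finds “The dog has” at about 0.36.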

Claim: GPT can do better on its training objective if it learns to do beam search internally

We’ve discussed text generation with a pretrained language model. Let’s switch gears and talk about the model’s training process.

Suppose GPT’s training corpus has the following poem:

Mozzarella is white

So you can see it at night

Cheddar is marigold

Unless you let it get too old

GPT is trained by giving it some text and asking it to predict the next word. So eventually GPT will be given the example from above

Mozzarella is white

So you can see it at night

Cheddar is...

and be asked to predict the next word.

Let’s consider the performance of two models on this task: regular “naive” GPT, and “beam search amplified” GPT. Beam search amplified GPT works by performing beam search using naive GPT, then looking at the distribution of the first words in the resulting completions, then outputting some weighted average of that distribution and the distribution from naive GPT.
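One way the amplified model could be sketched is below. The toy probabilities are assumptions, and so is the choice to weight each beam completion by its plausibility when building the first-word distribution; the description above leaves the exact weighting open.

```python
from collections import defaultdict

# Toy stand-in for naive GPT after the prompt "... Cheddar is".
# Keys are the words generated so far; all numbers are illustrative assumptions.
# "orange" is followed by many equally weak continuations (hard to rhyme).
TOY_MODEL = {
    (): {"orange": 0.6, "marigold": 0.3, "yellow": 0.1},
    ("orange",): {"and": 0.2, "but": 0.2, "so": 0.2, "or": 0.2, "yet": 0.2},
    ("marigold",): {"Unless": 0.8, "and": 0.2},
    ("yellow",): {"and": 0.5, "but": 0.5},
}

def naive_probs(completion):
    return TOY_MODEL.get(tuple(completion), {})

def beam_search(depth, beam_width):
    beams = [(1.0, [])]
    for _ in range(depth):
        candidates = [
            (score * p, words + [w])
            for score, words in beams
            for w, p in naive_probs(words).items()
        ]
        if not candidates:  # the toy model has nothing deeper to offer
            break
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

def amplified_probs(depth, beam_width, mix):
    """Weighted average of naive GPT's distribution and the first-word
    distribution of the beam search completions (mix = weight on the latter)."""
    first_word_mass = defaultdict(float)
    for score, words in beam_search(depth, beam_width):
        first_word_mass[words[0]] += score
    total = sum(first_word_mass.values())
    return {
        w: (1 - mix) * p + mix * first_word_mass.get(w, 0.0) / total
        for w, p in naive_probs([]).items()
    }

naive = naive_probs([])
amplified = amplified_probs(depth=2, beam_width=2, mix=0.5)
```

In this toy setup the amplified distribution moves probability from “orange” to “marigold”, because the completions that start with “marigold” are more plausible as a whole.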

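Because beam search can find lots of ways to continue the poem using “marigold”, but few ways using “orange”, beam search amplified GPT’s distribution ends up being closer to reality than that of naive GPT. Something like this:

[Diagram: next-word distributions of naive GPT and beam search amplified GPT for “Cheddar is...”, with the amplified distribution putting more probability on “marigold”.]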

So when we update GPT’s weights during training, we’re shifting the weights towards the sort of computational structure that would make predictions like beam search amplified GPT does.

Does this actually help?

In this instance, GPT has an incentive to do internal lookahead. But it’s unclear how frequently these situations actually arise. And maybe it’s usually easier to do something else, like learning which words are easy to rhyme.

It would be straightforward to implement beam search amplified GPT (experimenting with different weighted averaging schemes) and check whether it can be made to assign higher plausibility to real text. (It might be best to try with GPT-2 rather than GPT-3, in case GPT-3 is already doing internal lookahead. Note that there’s a risk of mesa-optimization developing if lookahead improves performance at any point during GPT’s training.)
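The comparison itself could look something like the sketch below: compute the average log loss each predictor gets on real text and see which is lower. The two predictor functions are hypothetical stand-ins with made-up numbers; in a real experiment they would wrap GPT-2 and beam search amplified GPT-2.

```python
import math

def average_log_loss(predict, words, floor=1e-9):
    """Mean negative log-probability assigned to each actual next word.
    Lower is better; `floor` guards against log(0)."""
    losses = []
    for i in range(1, len(words)):
        probs = predict(words[:i])
        losses.append(-math.log(max(probs.get(words[i], 0.0), floor)))
    return sum(losses) / len(losses)

# Hypothetical stand-ins for the two models (illustrative numbers only).
def naive_predict(context):
    if context[-1] == "is":
        return {"orange": 0.6, "marigold": 0.3, "yellow": 0.1}
    return {"is": 1.0}

def amplified_predict(context):
    if context[-1] == "is":
        return {"orange": 0.47, "marigold": 0.48, "yellow": 0.05}
    return {"is": 1.0}

real_text = ["Cheddar", "is", "marigold"]
naive_loss = average_log_loss(naive_predict, real_text)
amplified_loss = average_log_loss(amplified_predict, real_text)
```

On this single hand-picked example the amplified predictor wins by construction. The interesting question is whether that still holds when averaged over a large held-out corpus.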

Is internal lookahead possible for GPT-3?

Relative to other optimization algorithms, it seems to me that beam search would be unusually easy for GPT to implement. Traditional iterative optimization algorithms like gradient descent or simulated annealing require a lot of serial computation, and the number of serial steps GPT can perform is strongly limited. Beam search is way less heavy on the number of serial steps required. The number of available serial steps would still limit the maximum lookahead horizon though.

The transformer architecture learns computations of the form “find some data from the previous step which scores highly according to particular criteria, do some computation on it, pass it on to the next step”. That sounds like beam search.

In any case, the topic of what incentives arise while training a language model seems important more generally.

Is internal lookahead dangerous?

If GPT’s architecture is capable of discovering lookahead internally, the worry is that GPT might modify and misuse it in creative ways after it’s discovered. It might start making plans, or searching for the idea that maximizes some attribute which is correlated with harm.

Let’s say there are chess problems in GPT’s training corpus which describe a board state along with an objective like “black to move and win in 6 turns even with best play by white”. If GPT can do lookahead internally, it can use this to search for game histories where black wins even though white is playing very well. In other words, it’s doing spontaneous internal planning. And this spontaneous internal planning is incentivized because it helps predict solutions to chess problems.

Who knows what other contexts spontaneous internal planning might get used in.

Fix idea #1: Switch to BERT style training

How might we remove the incentive for mesa-optimization?

A simple idea is to stop training on the task of predicting the next word, and instead train on the task of predicting a masked word given the surrounding context. This is how BERT is trained. The incentive for internal lookahead seems smaller with this task, but I guess you’d still see it in e.g. predicting masked chess moves of strong players.

BERT’s ability to generate text is unclear. But I have a feeling BERT-style training actually offers greater potential than GPT-style training for text generation, because BERT can edit its writing—see discussion here. You could get really creative, like doing genetic algorithms with a population of texts generated on a particular topic, and using BERT to mutate and recombine texts in the population until you get one with really high plausibility. (Some words at the beginning could stay fixed as the “prompt”.)

Fix idea #2: Prohibit updates towards lookahead

This fix is more interesting because the core idea seems applicable for stopping mesa-optimization in other contexts too.

v1 of the idea is: During training, before any gradient update, we make a prediction with beam search amplified GPT. And if an update to naive GPT would mean moving it towards beam search amplified GPT (because beam search amplified GPT assigns a higher probability than naive GPT to the actual next word), we just skip the update and keep moving. Nick Bostrom refers to this general approach as “stunting” in Chapter 9 of Superintelligence.
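The skip rule in v1 is simple enough to sketch directly. The function names and probabilities below are illustrative assumptions, and a real training loop would apply the check per token before each gradient step.

```python
def should_skip_update(naive_probs, amplified_probs, actual_word):
    """True if the amplified model gives the actual next word a higher
    probability than naive GPT does, i.e. the update would move naive GPT
    towards the amplified model."""
    return amplified_probs.get(actual_word, 0.0) > naive_probs.get(actual_word, 0.0)

def examples_to_train_on(examples):
    """Keep only the (naive_probs, amplified_probs, actual_word) triples
    whose update is allowed under v1."""
    return [ex for ex in examples if not should_skip_update(*ex)]

# Illustrative numbers for the "Cheddar is..." example.
naive = {"orange": 0.6, "marigold": 0.3, "yellow": 0.1}
amplified = {"orange": 0.47, "marigold": 0.48, "yellow": 0.05}

kept = examples_to_train_on([
    (naive, amplified, "marigold"),  # skipped: lookahead helped here
    (naive, amplified, "orange"),    # kept: lookahead did not help
])
```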

This diagram from above is an example of an update that would get skipped.

Training will take longer, obviously. But only insofar as all this mesa-optimization stuff was a problem in the first place.

v2 of the idea is to add a penalty term to GPT’s loss function that causes it to update away from whatever beam search amplified GPT predicts.

A risk with v2 is that GPT is incentivized to emulate beam search amplified GPT internally, so that it knows which prediction to steer away from (i.e. an internal mesa-optimizer gets developed anyway). Skipping updates as in v1 is probably safer, but the fundamental problem seems possible either way.

So v3 is to track gradient updates towards/away from beam search amplified GPT and ensure a rough balance. We don’t want systematic movement towards or away from beam search amplified GPT, because we don’t want incentives to develop either it or its inverse. This could be accomplished by e.g. rescaling gradients on the fly so the average movement towards beam search amplified GPT’s answers is around 0. (To be safe, you might want to balance both the number of updates in each direction and their typical magnitude.) An advantage is that training can go faster than in v1 because less updating is skipped.
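A rough sketch of the balancing step, under some loud assumptions: each update comes with a signed number measuring how far it moves naive GPT towards (positive) or away from (negative) the amplified model's answer, and we balance over a batch rather than truly on the fly. How to compute that signed movement for real gradients is left open here.

```python
def balancing_scales(movements):
    """Return one scale factor per update so that the scaled movements sum to ~0.

    `movements` are signed: positive means the update moves naive GPT towards
    the amplified model's prediction, negative means away from it.
    """
    toward = sum(m for m in movements if m > 0)
    away = -sum(m for m in movements if m < 0)
    if toward == 0 or away == 0:
        # Assumption: with nothing to balance against, fall back to skipping
        # the one-sided updates entirely (as in v1).
        return [0.0 if m != 0 else 1.0 for m in movements]
    if toward > away:
        shrink = away / toward
        return [shrink if m > 0 else 1.0 for m in movements]
    shrink = toward / away
    return [shrink if m < 0 else 1.0 for m in movements]

movements = [0.4, 0.2, -0.3]            # illustrative numbers
scales = balancing_scales(movements)    # the "towards" updates get shrunk
net = sum(s * m for s, m in zip(scales, movements))
```

This only balances total magnitude. Balancing the number of updates in each direction as well, as suggested above, would need an additional constraint.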

Beam search isn’t the only possible way to do lookahead though.

In particular, the distribution of first words in beam search completions will only assign non-zero probability to words which are the first word of a completion. That leaves room for predictive gains by doing lookahead for other words.

So v4 of the idea is to sample from the infinite beam width distribution, particle filter style, and update in such a way that your expected update towards next-word predictions from the infinite beam width distribution is 0. (Not sure how to do this concretely, but it seems like it might be possible.)

The broader issue here is the nearest unblocked strategy problem. The problem seems relatively tractable in this case, because e.g. beam search based stunting seems to remove the incentive for any sort of lookahead insofar as it helps in the cases where beam search amplified GPT outperforms. (I suspect mesa-optimization is not an all-or-nothing proposition, and it will matter how much it has been refined + how readily it is deployed.)

Hence, v5 of the idea—even more underspecified than v4 at this point—is adversarial anti-lookahead stunting. Characterize some kind of “lookahead amplifier” which takes naive GPT as input and attempts to amplify it by doing lookahead-type things in a very freeform way. Train this and make it really good at amplifying GPT to achieve better next-word prediction. Then stunt naive GPT during training by forcing its average update towards the next-word predictions of lookahead-amplified GPT to be 0. (Perhaps you’d want to be retraining the lookahead amplifier repeatedly over the course of GPT’s training, in case different lookahead strategies work best at different GPT maturity levels.) In production, go ahead and make use of lookahead-amplified GPT to make performance as competitive as possible. (It wouldn’t shock me if this scheme actually produced superior performance, if the lookahead amplifier learns lookahead that GPT couldn’t learn for itself, due to a better inductive bias.)

Ideally, GPT’s training process would be monitored to ensure that GPT + lookahead in any form consistently outperforms naive GPT. If adding lookahead ever stopped helping, that would suggest naive GPT had learned to do the lookahead for itself. For bonus points, have the lookahead used in the monitoring be a form of lookahead which is not used for stunting (a “dev set”).

Do these fixes actually help?

An objection to Fix #2 is the possibility of mesa-optimization which isn’t very much like lookahead. For example, if we’re training on text that describes a newly discovered animal, the system has an incentive to try & figure out the animal for itself internally so it can better predict how it will be described—and it might make use of some optimization algorithm, genetic algorithms say, to achieve this.

Another objection is that pulling optimization up from the mesa level, as in the “BERT + genetic algorithms” idea or the “lookahead amplifier in production” idea, isn’t actually helpful. There’s still optimization happening, and the system as a whole could still make devious plans or search for harmful ideas.

However, less mesa-optimization means less risk that transformer blocks develop optimization/planning capabilities and reuse them in contexts we didn’t expect. It’s easier to reason about a search for text which maximizes plausibility than about a search driven by a mysterious mesa-objective. In particular, an agent that gets instantiated internally might search for side-channel attacks in the text generation machinery and surrounding system (especially risky if GPT has read about this stuff). But it seems very unlikely that a search for plausibility-maximizing text would cause this (except maybe if those attacks somehow got activated during training). Optimization that happens outside the model, such as explicit beam search, also has parameters (e.g. beam width) that allow us to control its strength without retraining the model, and we have a better understanding of how it works.

There’s still a lot of potential for misuse & accidents either way, of course.

OpenAI doesn’t offer beam search? Why? Is GPT-3 already mesa-optimizing?

Up until now, I’ve been pretending that maximizing plausibility (product of individual word probabilities) is a good way to generate text. But beam search doesn’t even seem to be an option in the GPT-3 interface. (Please correct me if I’m missing something!)

Why is beam search missing? One possibility is that GPT-3 already does internal lookahead: perhaps OpenAI tried beam search, found it didn’t improve text generation, and didn’t bother adding it as an option. In other words, GPT-3 may already be mesa-optimizing 😲

Another possibility:

[Generated text:] “I enjoy walking with my cute dog, but I’m not sure if I’ll ever be able to walk with my dog. I’m not sure if I’ll ever be able to walk with my dog.”


...The generated words following the context are reasonable, but the model quickly starts repeating itself! This is a very common problem in language generation in general and seems to be even more so in greedy and beam search...


...Recently, there has been more evidence though that the apparent flaws of greedy and beam search—mainly generating repetitive word sequences—are caused by the model (especially the way the model is trained), rather than the decoding method, cf. Welleck et al. (2019).

From the Hugging Face post (emphasis mine). OK, this thing about language models that find repetitive text plausible sounds like a problem that will eventually get solved. Anything else?

As argued in Ari Holtzman et al. (2019), high quality human language does not follow a distribution of high probability next words. In other words, as humans, we want generated text to surprise us and not to be boring/predictable. The authors show this nicely by plotting the probability, a model would give to human text vs. what beam search does.

So let’s stop being boring and introduce some randomness 🤪.

This is a much deeper & more interesting issue IMO. It may be that only a superintelligent language model will find human writing so boringly predictable that every word has high likelihood based on what came before.

Will there be an intermediate stage where prompting a language model with “I just had a brilliant and highly original idea related to X” will cause it to assign higher plausibilities to completions that are actually quite brilliant & original? (Is this the case for GPT-3 already?) I have no idea.

In any case, maybe we could get the benefits of both originality and avoidance of dead ends by sampling from beam search amplified GPT’s next-word distribution to generate text? (This could be especially useful if Fix #2 has been applied and GPT’s ability to do lookahead for itself has been stunted.)
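A sketch of what that sampling loop might look like. TOY_AMPLIFIED is a hypothetical stand-in for beam search amplified GPT's next-word distribution, with made-up numbers; a real version would recompute the amplified distribution at every step.

```python
import random

# Hypothetical stand-in: context (tuple of words) -> amplified next-word distribution.
TOY_AMPLIFIED = {
    ("Cheddar", "is"): {"orange": 0.47, "marigold": 0.48, "yellow": 0.05},
    ("Cheddar", "is", "marigold"): {"Unless": 0.8, "and": 0.2},
}

def amplified_next_word_probs(context):
    return TOY_AMPLIFIED.get(tuple(context), {})

def sample_text(next_word_probs, prompt, max_words, rng):
    """Generate text by sampling each word from the given distribution,
    instead of always taking the most plausible completion."""
    words = list(prompt)
    for _ in range(max_words):
        probs = next_word_probs(words)
        if not probs:
            break  # the toy model has no prediction for this context
        choices, weights = zip(*probs.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return words

rng = random.Random(0)  # seeded only so that runs are repeatable
text = sample_text(amplified_next_word_probs, ["Cheddar", "is"], max_words=5, rng=rng)
```

The randomness comes from the sampling step, while the avoidance of dead ends comes from the lookahead already baked into the distribution being sampled.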

Note also that the surprisingness of human text could be an objection to the “GPT can do better on its training objective if it learns to do beam search for itself” claim above. If human text tends to have periodic surprises, using beam search to look for predictable completions may not help performance since those predictions aren’t actually very likely.

However, it also may be the case that beam search ends up improving the accuracy of next-word prediction despite the fact that it doesn’t generate interesting text.