# nostalgebraist comments on Why GPT wants to mesa-optimize & how we might change this

• I’m skeptical that internal beam search would help in language modeling.

Language modeling is like predicting the weather, in the sense that even if you are literally as good as possible at it, your prediction accuracy still degrades rapidly as a function of the number of steps ahead you’re looking. So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.

Weather is like this because of chaotic dynamics. Language modeling is like this because

(a) Text is used to communicate: the writer expects the audience to learn something from the last X% of a text that they couldn’t extrapolate from reading the first (100-X)%, or else they’d just stop and not write the remaining X%.

(b) By construction, language modeling gives you nothing to work with except the text itself, so you don’t know who produced it or for whom. So even if you were smart enough to guess what any individual human would say next (!), you don’t know which human produced the text you’re looking at. (Or even whether it was a human at all.)

Thus (IMO), language modeling is not really about thinking ahead to find some “objectively correct” next move as in Chess/Go. It’s more about trying to guess what the author of this text will do in the very next step. The author and the LM are almost sure to diverge after a few more steps, so even if the LM had a beam search oracle, I expect it wouldn’t find it very useful.
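As a rough illustration of how quickly a chain of predictions diverges from the author, here is a minimal sketch. It assumes every step is predicted correctly with the same probability, independently of the other steps. That independence assumption and the 0.7 figure are illustrative choices, not measurements from any real LM, and the function name `chain_match_probability` is hypothetical.

```python
def chain_match_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every one of `steps` chained predictions is right,
    assuming (for illustration only) independent, equally accurate steps."""
    return per_step_accuracy ** steps


# A predictor that is right 70% of the time on the very next step
# (made-up number) matches the author less and less as the chain grows.
for k in (1, 5, 10, 20):
    print(k, chain_match_probability(0.7, k))
```

Under these assumptions the match probability shrinks geometrically with the chain length, which is the sense in which a predictor that looks strong at range L can be close to useless at a small multiple of L.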

To make the point concrete, I don’t think “orange” is necessarily a bad guess here—among other things, it would be the correct guess if the author were trying to illustrate the point of your example!

And if we were predicting this post itself, the true next token would not be orange or any other word but an ellipsis “...”, which seems bizarre from the narrow perspective of the example, but is typical of the wild world LMs operate in. (Which also contains typos, actually-incoherent writers, mangled formatting, the list goes on . . . )

• So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.

A system which develops small-L lookahead (for L > 1) may find large-L lookahead to be nearby in programspace. If so, incentivizing the development of small-L lookahead makes it more likely that the system will try large-L lookahead and find it to be useful as well (in predicting chess moves for instance).

My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3’s transformer architecture.

Anyway, the question here isn’t whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution. Lookahead is almost certainly going to do better than random guessing; even topic models can do that.

By construction, language modeling gives you nothing to work with except the text itself, so you don’t know who produced it or for whom.

Are you saying that GPT-3’s training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?

• Anyway, the question here isn’t whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution.

Can you say a bit more about why you only need look-ahead to improve performance? SGD favors better improvements over worse improvements—it feels like I could think of many programs that are improvements but which won’t be found by SGD. Maybe you would say there don’t seem to be any improvements that are this good and this seemingly easy for SGD to find?

• From a safety standpoint, hoping and praying that SGD won’t stumble across lookahead doesn’t seem very robust, if lookahead represents a way to improve performance. I imagine that whether SGD stumbles across lookahead will end up depending on complicated details of the loss surface that’s being traversed.

• I agree, and thanks for the reply. And I agree that even a small chance of catastrophe is not robust. I asked because I still care about the probability of things going badly, even if I think that probability is worryingly high. That said, I see now (thanks to you!) that in this case our prior that SGD will find look-ahead is still relatively high, and that belief won’t change much by thinking about it more, because it is sensitive to complicated details we can’t easily know.

• Are you saying that GPT-3’s training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?

No, it’s a more philosophical point. Even if such things appear in the context window, they’re simply more text, and convey the same kind of information: not “the denotation of these words is factually true,” but “these words are part of the text.”

For example, the mere appearance of something like

Title: Why GPT wants to mesa-optimize & how we might change this

Author: John_Maxwell

does not guarantee that the text following it bears that title, or was written by that author. (As I am illustrating right now.)

Of course, one can design datasets where information like this is provided more authoritatively—say, always at the start of each text, curated for quality, etc. (GPT isn’t like that, but Grover and CTRL kind of are, in different ways.)

But even that can only go so far. If the author is “Julius Caesar,” does that mean the historical figure, some internet poster with that handle, or any number of other possibilities? A passage of fiction written in a character’s voice: is the appropriate author cue the actual writer (who may have written in many different voices over their career) or the character? (Note that the character is a much better answer to the question “who does this sound like?”) And doesn’t the date matter too, so we know whether this post in the venue “Less Wrong” was on 2010’s LW or 2020’s?

Fundamentally, language modeling is about understanding structures in decontextualized blocks of contiguous words. You can try to hack in some sidechannels to provide context, but there’s no way they will capture everything needed to locate the text fully in its social, physical, and temporal position within the broader world. And just as a definitional matter, these sidechannels are modifications to “language modeling,” which in its purest sense is just about filling in an arbitrary text from substrings of it (and no other information).

My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3’s transformer architecture.

Yeah, not for transformers I think.

Anyway, the question here isn’t whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution.

capybaralet’s point about conservation of expected evidence applies here—GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.

If we then say “the mechanism for pricing them in is doing internal lookahead,” then we are imagining that lookahead operating over some predictor that is otherwise good but hasn’t priced in lookahead yet. But I don’t know why we should imagine the computation would naturally factor this way, when the benefits of lookahead are small and beam search would take a lot of parameters to implement internally.
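The “already priced in” claim can be checked on a toy case. The sketch below assumes a hypothetical 3-token Markov chain (the matrix `T` and both function names are made up for illustration) and treats the true conditional distribution as the optimal next-step predictor. Enumerating every continuation to a fixed depth and summing the path probabilities by first token gives back the same distribution, so exhaustive lookahead adds nothing to a predictor that is already optimal. It says nothing about a suboptimal predictor, which is a separate question.

```python
# Hypothetical toy chain: T[s][a] = P(next token = a | current token = s).
# Each row sums to 1. The numbers are invented for illustration.
T = [
    [0.10, 0.60, 0.30],
    [0.50, 0.25, 0.25],
    [0.20, 0.20, 0.60],
]


def next_step(state):
    """The optimal next-step predictor: the true conditional distribution."""
    return list(T[state])


def lookahead_marginal(state, depth):
    """Enumerate all continuations of length `depth` and sum their
    probabilities grouped by the first token of the continuation."""
    n = len(T)

    def mass_of_continuations(s, d):
        # Total probability of all length-d continuations starting from s.
        if d == 0:
            return 1.0
        return sum(T[s][b] * mass_of_continuations(b, d - 1) for b in range(n))

    return [T[state][a] * mass_of_continuations(a, depth - 1) for a in range(n)]


# The two distributions agree up to floating-point error at every depth.
for depth in (1, 2, 4):
    gap = max(abs(x - y) for x, y in zip(next_step(0), lookahead_marginal(0, depth)))
    print(depth, gap < 1e-9)
```

This is just the law of total probability: the continuations after each first token carry total mass 1, so marginalizing over them leaves the next-step distribution unchanged.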

• Your philosophical point is interesting; I have a post in the queue about that. However I don’t think it really proves what you want it to.

Having John_Maxwell in the byline makes it far more likely that I’m the author of the post.

If humans can make useful judgements re: whether this is something I wrote, vs something nostalgebraist wrote to make a point about bylines, I don’t see why a language model can’t do the same, in principle.

GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.

A perfectly optimal next-step predictor would not be improved by lookahead or anything else; it’s perfectly optimal. I’m talking about computational structures which might be incentivized during training when the predictor is suboptimal. (It’s still going to be suboptimal after training with current technology, of course.)

In orthonormal’s post they wrote:

...GPT-3’s ability to write fiction is impressive- unlike GPT-2, it doesn’t lose track of the plot, it has sensible things happen, it just can’t plan its way to a satisfying resolution.

I’d be somewhat surprised if GPT-4 shared that last problem.

I suspect that either GPT-4 will still be unable to plan its way to a satisfying resolution, or GPT-4 will develop some kind of internal lookahead (probably not beam search, but beam search could be a useful model for understanding it) which is sufficiently general to be re-used across many different writing tasks. (Generality takes fewer parameters.) I don’t know what the relative likelihoods of those possibilities are. But the whole idea of AI safety is to ask what happens if we succeed.
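For readers who want a concrete picture of beam search as a model of lookahead, here is a minimal sketch over a hypothetical bigram table. The table `BIGRAMS`, its probabilities, and the function name `beam_search` are all invented for illustration, and nothing here is a claim about how GPT computes anything internally.

```python
import math

# Hypothetical toy bigram model: BIGRAMS[prev][tok] = P(tok | prev).
# All tokens and probabilities are made up.
BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
}


def beam_search(start, steps, beam_width):
    """Keep the `beam_width` most probable partial sequences at each step.
    Returns (log_probability, tokens) pairs, best first."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            options = BIGRAMS.get(seq[-1])
            if not options:
                # No known continuation: carry the sequence forward unchanged.
                candidates.append((logp, seq))
                continue
            for tok, p in options.items():
                candidates.append((logp + math.log(p), seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy = beam_search("<s>", steps=2, beam_width=1)[0]
wider = beam_search("<s>", steps=2, beam_width=2)[0]
print(greedy[1], math.exp(greedy[0]))
print(wider[1], math.exp(wider[0]))
```

In this toy table a width of 1 (greedy decoding) commits to “the” because it is the likelier first token, while a width of 2 keeps “a” alive long enough to find the more probable two-token continuation. That is the kind of gain lookahead can offer a search procedure; whether a next-token predictor would benefit from computing something like it internally is the open question in this thread.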