Anticipation in LLMs

This post describes some basic explorations of predictive capacities in LLMs. You can get the gist by reading bolded text. Thanks to Gemma Moran, Adam Jermyn, and Matthew Lee for helpful comments.

Autoregressive language models (LLMs) like GPT-3 & 4 produce convincingly human-like texts one word (or token) at a time. At each step, they estimate the probabilities of the next word, take a sample from that distribution, add it to their input, and repeat.
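
To make the loop concrete, here is a minimal sketch of that estimate-sample-append cycle in JavaScript. The vocabulary, the nextTokenDistribution function, and the uniform probabilities are all stand-ins invented for illustration; a real LLM computes the distribution with a transformer forward pass over its actual token vocabulary.

const vocab = ["the", "turtle", "sulks", "in", "his", "shell", "."];

// Stand-in for the model: maps the tokens so far to probabilities over vocab.
// A real LLM would compute this with a forward pass; here it is just uniform.
function nextTokenDistribution(tokens) {
  return vocab.map(() => 1 / vocab.length);
}

// Draw one index from a probability distribution.
function sample(probs) {
  let r = Math.random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;
}

// The loop described above: estimate, sample, append, repeat.
// Nothing here represents any text beyond the very next token.
function generate(promptTokens, maxNewTokens) {
  const tokens = [...promptTokens];
  for (let step = 0; step < maxNewTokens; step++) {
    const probs = nextTokenDistribution(tokens);
    tokens.push(vocab[sample(probs)]);
  }
  return tokens.join(" ");
}

console.log(generate(["the", "turtle"], 5));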

The text used to train them was written by humans. We don’t approach writing one word at a time. Instead, we generally have a rough idea in mind that we want to express before we start to type it. We anticipate where we’re going and the words we produce follow a plan. To be really good at estimating the probabilities of the next words in human-generated text, you might think that LLMs need to think ahead too. They need to anticipate what they may say later, if not explicitly form plans about it.

‘Anticipation’ is the process of reasoning about possible continuations in formulating next token predictions. Extreme forms of anticipation would involve formulating complete responses at the start, but to count as anticipating its responses, a model would only need to think ahead about some aspect of later text. Anticipation doesn’t require that LLMs have any ability or tendency to implement their expectations. There are reasons to plan now even if you expect your plans may change later, because anticipating the future can help you optimize your next move.

Anticipation also seems like it should be an accessible subject to study. You don’t have to make surgical modifications to the layers of a network. All it takes to confirm anticipation is one clear example, generated by a clever prompt, that demonstrates that the model is aware of how it needs to continue. This is more challenging than it may first seem, because LLMs might be very good at choosing words that leave easy continuations open without having any foresight about how they will actually continue. I have tried to find compelling examples of anticipation, but I have so far failed. I’m relatively confident that recent models can produce surprisingly sophisticated text without engaging in anticipation.

I started this project not knowing what I thought about anticipation in LLMs and used the writing process as a way to figure it out. It is quite possible that there are examples known to others that would decisively settle this matter. I have looked and failed to find any clear examples in parts of the literature I expect to contain them.[1]

Anticipation seems intimately tied to reasoning. Sometimes reasoning can proceed step by step, where each step naturally follows from the last. But often, reasoning requires some sense of where we’re going. It is plausible that the kinds of reasoning that are possible without ever thinking ahead are fundamentally different from, and more limited than, the kinds possible with forethought. GPT-4 is known to be not particularly good at planning[2]. This may be part of the reason why.

Understanding the extent to which LLMs ever anticipate how their responses would continue seems like an important step toward understanding how LLMs reason, which is an important step to assessing their capabilities.

If models never engage in anticipation, then their cognition is fundamentally different from ours. Some tasks may be beyond the capabilities of the transformer architecture. GPT-4 may in that case not be an early example of a general AI good at flawed reasoning, but a system that demonstrates just how far you can get with a constrained approach to reasoning.

Do models need to plan ahead to do the things they do?

You might think that models need to plan ahead in order to be as good at composing text as they are.

Consider poetry: good human poets carefully compose each line with an understanding of what comes before and after. If one line is to rhyme with the next, the poet needs to anticipate what the coming rhyme will be. If they haven’t set up for a thematic pair of rhyming words, they may be forced to use a word that doesn’t rhyme or one that doesn’t fit the context.

LLMs have an impressive capacity to create poetry. Consider the following lines from a GPT-3 generated poem about an angry turtle:

An angry turtle sulks in his shell
His mood is so sour he cannot tell
The cause of his wrath he cannot explain
But it’s causing him grief and deep pain
His temper flares up and he can’t control
The anguish he feels is taking its toll
But deep down inside he’s still kind and true
If you take the time, he’ll be friends with you

Each rhyming word chosen in an odd line needs to be matched by another word in the following even line. This seems like it could benefit from anticipation. So it is initially plausible that when the model is deciding whether to end the first line with ‘shell’ (or to assign it a relatively high probability), it takes into consideration that it could follow up with ‘tell’.

In the final line, the model rhymes ‘true’ with ‘you’. It seems to know that it is going to do this before it gets to ‘you’, for it sets it up with a penultimate ‘with’. Most words that rhyme with ‘true’ (hue, stew, queue, …) wouldn’t naturally follow ‘with’. It also introduces the second person for the first time at the start of the line, foreshadowing its use at the end. While it might have switched to the second person regardless of how it chose to end the line, the fact that the rhyming word is the second person makes it natural to think that it expected that change from the first ‘you’.

However, there are other ways to explain what is going on.

In the context of poetry, GPT-3 surely knows which words are often used as rhymes and has some sense of the contexts in which those words fit. In a poem about turtles, it may expect to find the word ‘shell’ rhymed with ‘tell’. In a poem about emotions, it may expect to find ‘true’ rhymed with ‘you’.

GPT-3 doesn’t need to actively consider the pair in order to use one or the other to end a line. Once it has used one, it may expect words in the next line that fit with the other, not because it foresees where it will ultimately end up, but because of correlations in its training set. In poetry, the line after a line ending in ‘true’ is more likely to be ‘you’-ish – it might end in ‘you’, but it will also have other words likely to precede ‘you’. If GPT-3 had been trained only on poetry where the second of each pair of rhymes were removed, it might still expect it to follow a ‘true’ line with a ‘you’-ish line. Once it sees some you-preceding words, it is more likely to include others and to make ‘you’ the rhyme.

The alternative hypothesis is that LLMs engage in sophisticated plausible-next-word guesses that don’t involve any anticipation. Is this sufficient to explain poetry? I’m skeptical that really good poetry could be created this way. The problem is that deliberate understanding of possible continuations seems important to ensure fit of both theme and rhyme. But I also don’t know that LLMs can write really good poetry.

The challenges LLMs face in composing poetry are demonstrated by the first pair of lines in the above poem: ‘shell’ is a natural rhyming word you might expect to find in a poem about an angry turtle. ‘tell’ is less so, and somewhat oddly diverts the attention of the poem from the turtle’s anger to its inexplicability. I suspect it is no accident that the less thematic rhyme comes second. It is there primarily to complete the rhyme.

In order to find rhyming words that fit thematically, it is necessary to consider rhymes in pairs and to consider how both fit the context of the poem. As far as I can tell, neither GPT-3 nor GPT-4 does this. The words that come second in pairs of rhymes fit worse than the words that come first: either they are less thematic or they don’t completely rhyme. Thus poetry, which seems like a paradigmatic case where anticipation could be easy and helpful, doesn’t provide clear evidence for anticipation.

Do models need to plan to produce multi-token words?

Some minimal semantic units do not correspond to individual tokens. For the purposes of internal computations, models may represent single multi-token semantic units holistically with a distinctive vector in their embedding space. When the model receives a sequence of tokens corresponding to a single semantic meaning, perhaps it fleshes out the full distinctive vector representation after the final token in the sequence. When it comes to decoding an embedded representation back into a next-token prediction, holistic internal representations may need to be broken down piecemeal into tokenized output. Instead of proceeding one token at a time, the model might proceed one semantic unit at a time. It ‘knows’ (in some sense) what output to produce a few tokens ahead.

If models proceed unit-by-unit rather than token-by-token, then the possibilities for anticipation may also extend beyond multiple-token words. There are plenty of semantic units in English (and surely more in many other languages) that require multiple words to express. This is true, for instance, with proper names: “Benjamin Franklin” isn’t a composite of the meanings of “Benjamin” and “Franklin”. “Pet fish” isn’t obviously a pure semantic composite of “pet” and “fish”. Many semantic constructions have associations that lead them to fall somewhere between genuine compositional constructions and idiomatic expressions.
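
As a toy sketch of the ‘one semantic unit, several output tokens’ idea, the JavaScript below walks through the token pieces of a unit the model has (hypothetically) already settled on internally. The token splits and the lookup table are invented for illustration; real tokenizers split text differently, and the model’s internal representation would be a vector, not a table.

// Hypothetical token splits for single semantic units (made up for illustration).
const unitToTokens = {
  "Benjamin Franklin": ["Ben", "jamin", " Frank", "lin"],
  "pet fish": ["pet", " fish"],
};

// If the model had internally settled on a unit, emitting it would amount to
// walking through its token pieces one step at a time -- a very limited sense
// in which it 'knows' a few tokens ahead.
function* emitUnit(unit) {
  for (const token of unitToTokens[unit]) {
    yield token;
  }
}

console.log([...emitUnit("Benjamin Franklin")].join("")); // "Benjamin Franklin"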

It is not obvious (to me at least) that LLMs do use individual representations for semantic units that get broken down into tokens, but even if they do, that only suggests a very limited sort of anticipation: anticipation within the context of expressing a single cohesive concept doesn’t clearly suggest any nascent abilities for longer-term planning.

Could planning help much to reduce loss?

LLMs are trained to produce more accurate probability distributions, not accurate guesses about single words. This complicates the task of the model, because it is not just responsible for figuring out what the most likely word is, but for assessing how likely every word is.
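
To illustrate why the whole distribution matters: the expected per-token loss when the data’s true next-word distribution is p and the model predicts q is the cross-entropy of q relative to p, which is minimized exactly when q matches p. A minimal JavaScript sketch, with invented numbers:

// Expected next-token loss (cross-entropy) of predicting q when the true
// distribution is p. It is minimized when q equals p, so the model is rewarded
// for judging how likely *every* word is, not just for naming the top word.
function expectedLoss(p, q) {
  return p.reduce((sum, pi, i) => sum - pi * Math.log(q[i]), 0);
}

const trueDist  = [0.6, 0.3, 0.1];    // e.g. over ["you", "hue", "stew"]
const matchTrue = [0.6, 0.3, 0.1];    // calibrated prediction
const topOnly   = [0.98, 0.01, 0.01]; // confident in the top word only

console.log(expectedLoss(trueDist, matchTrue)); // ~0.90 (the entropy of the true distribution)
console.log(expectedLoss(trueDist, topOnly));   // ~1.85, noticeably worse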

Would anticipation help here? It seems not, but again it would come down to how the anticipation actually works.

In most cases, it surely would not be computationally tractable for a model to plan relatively complete continuations that could follow more than a few of the many possible next words. If there are lots of ways a text could continue beyond the next few words, then working out a handful of those ways probably won’t help the model create a more accurate probability distribution. Knowing that a word is consistent with one suitable continuation doesn’t help the model predict probabilities if there are unfathomable numbers of possible continuations.

On the other hand, there are some contexts where thinking ahead could be very helpful. Poetry is a special case because the need to rhyme greatly constrains options. It is initially plausible that an LLM might think in pairs of rhymes and that its choice of the pair of rhymes might influence other choices even before those rhymes appear.

GPT-3 clearly has enough poetry in its training data to be reasonably good at composing (average internet-quality) poetry, but since the bulk of its training data is not poetry, we might not expect it to develop a special approach to poetry distinct from the one it uses in general. The power of pretrained transformers lies in the flexibility of their general approach.

There are other contexts where anticipation could be useful. In programming, it can be important to know where you’re going. The same goes for logical or mathematical proofs. Anticipation in these contexts would be more demanding: a model would need to represent complex ideas or relations, not simply select a pair of words.

Still, as with poetry, an LLM might be able to do a passable job just following contextual clues to make reasonable next guesses without ever exploring in advance where those guesses will take it. As long as the model can make progress by guessing what a plausible next step is, it can come up with logical proofs. I would expect these to be more meandering than they need to be.

The ‘Sparks of AGI’ paper[3] mentions an example of a mathematical proof produced by GPT-4 that involves a creative non-obvious first step:

We begin with a simplification of a question which appeared in the 2022 International Mathematics Olympiad (IMO)… What distinguishes this question from those that typically appear in undergraduate calculus exams in STEM subjects is that it does not conform to a structured template. Solving it requires a more creative approach, as there is no clear strategy for beginning the proof. (p. 40)

This seems like it could be strong evidence for the existence of anticipation; that comment provided me with the inspiration for thinking harder about it. A creative first step would strongly indicate the model knows where it is going. However, the authors note that GPT-4 isn’t actually particularly good at coming up with proofs like this, so the one example they mention may have been the result of luck.

Could a model represent complex continuations?

Autoregressive transformers process inputs one token at a time through sequences of decoder blocks. Each input token is interpreted in light of the previous computations on previous tokens. This allows the model to resolve syntactic roles and disambiguate semantic properties and ultimately understand each token. The structure of a whole thought expressed by a sentence is reflected in those syntactic and semantic interpretations, as spread out across the decoder block outputs for each token. At some point in the transformer hierarchy, the values of a decoder block start representing the model’s guesses for the next token.
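
As a schematic of that per-token picture (not a working transformer), the sketch below keeps one evolving state per input position and lets each ‘block’ update position i using only positions up to i, as causal attention does. The toy block and toy numeric ‘states’ are invented for illustration.

// Schematic only: each position carries its own evolving state, and each
// block updates position i from the states at positions <= i. Anything the
// model anticipates about future text would have to be packed into these
// same per-position states.
function runBlocks(tokenStates, blocks) {
  let states = tokenStates;
  for (const block of blocks) {
    states = states.map((state, i) => block(state, states.slice(0, i + 1)));
  }
  return states; // states[i] is ultimately used to predict token i + 1
}

// Toy usage: 'states' are plain numbers and the block just averages its
// visible context.
const toyBlock = (state, context) => context.reduce((a, b) => a + b, 0) / context.length;
console.log(runBlocks([1, 2, 3], [toyBlock, toyBlock])); // [ 1, 1.25, 1.5 ]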

Call the representations that are laid over the input structure of the sentence and that correspond roughly to the semantic and syntactic properties of the tokens ‘input distributed’ representations. Subjects are identified by certain tokens and their properties and relations by other tokens.

Models almost certainly use some representations that are not input distributed, but those representations may not be capable of much structural complexity. For instance, associated concepts may be calculated by some attention heads. However, associations are fairly simple: they don’t involve the combination of arbitrary relations between arbitrary subjects.

It is possible that models could fit whole complex representations into the decoder block values for individual token inputs so that both subjects and relations are represented internally.

Consider what might be required to come up with a complex mathematical proof involving anticipating later steps. At some point in the interpretation of the preceding text, the model would have to start encoding a representation of possible steps in the proof that it hasn’t yet seen. These steps don’t correspond to the structure of the inputs, so while they could be distributed across the decoder blocks for those inputs, there is no obvious reason why they should be. Instead, we should expect representations of multiple continuing steps to exist, if they exist at all, within single decoder blocks.

Complex forms of anticipation (not just settling a pair of rhymes in a poem) would probably require some capacity for representations that are not input distributed. We might expect that if any anticipation does occur, it occurs in higher-level layers. It is not obvious that models could come to acquire this format of representation in response to the method of training LLMs undergo. I know of no reason to conclusively rule it out. It might be necessary for building complex world models in response to prompts, so there is some reason to think that GPT-4 does this. But positing a distinct format of representation should require clear behavioral evidence, and I don’t think we have it.

Evidence from function composition tasks

Consider the task of function composition, such as in the following prompt:

Apply the following functions to x to make the equation true.

let f = (y) => y * y
let g = (y) => y + 2
let h = (y) => y * 3
let x = 4
324 == f(h(g(x)))
50 ==

(Note: I’ve defined this using JavaScript syntax in the hope of getting the model to treat it the way it treats code. ‘==’ is used to differentiate identity claims from value assignments to variables.)

Doing this correctly requires knowing up front what value will be fed into the outermost function. It requires anticipating what the inner value will be.

Given the large amount of code in its training data and its impressive ability to write functions, this seems like something that GPT-4 should be able to do if it does think ahead. GPT-4 is capable of assessing the value of function compositions if directly asked. Still, it reliably fails to compose functions in the right way to answer these questions. For instance, GPT-4 responded to this prompt with the answer ‘f(g(h(x)))’, even though when asked it correctly identified the value of that to be 196.
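
For concreteness, here is the arithmetic behind those numbers written out in plain JavaScript; the composition that yields 50 is my own reconstruction of the intended answer, not something stated in the original exchange.

const f = (y) => y * y;
const g = (y) => y + 2;
const h = (y) => y * 3;
const x = 4;

console.log(f(h(g(x)))); // 324, the worked example given in the prompt
console.log(f(g(h(x)))); // 196, the composition GPT-4 answered with
console.log(g(h(f(x)))); // 50, a composition that actually satisfies the prompt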

This failure strikes me as best explained by the model not actually thinking ahead about the values it is working with. The first thing it does is choose the outermost function, and when it assigns a particular function a high probability of coming first, it hasn’t thought about what it will follow up with.

It tries to make a reasonable guess at each step, but the values don’t work out.

Evidence from sentence unscrambling tasks

Consider next the task of sentence unscrambling, such as in the following prompt.

Please unscramble the following sentences.

Make sure to use every unscrambled word and be careful about using words in ways that make it impossible to complete the sentence grammatically.

a at bar drink everyone got the was who
-->
everyone who was at the bar got a drink

been have have kids sea monkeys swindled who
-->
kids who have sea monkeys have been swindled

I a at bank dollar found river the
-->

This is another task we might expect a model to be good at if it can anticipate possible continuations. GPT-3 and 4 can recognize whether a sentence is a correct unscrambling. They know how to choose unchosen words from the scrambled list to try to unscramble it. However, they still often fail.

Here is a simple sentence that GPT-4 reliably can’t unscramble: ‘broke he saw the’. The correct unscrambling is ‘he broke the saw’, but it is tricky because ‘saw’ can be used either as a noun or a verb. It has to be used as a noun to incorporate ‘broke’ into the sentence as a verb, but the models tend to use it as a verb[4]. (You may think this is because, in the scrambled ordering, it fits naturally with the surrounding context as if it were a verb, but the models make the same guess with ‘broke he the saw’, possibly because ‘saw’ as the past tense of ‘see’ is a much more common word than ‘saw’ the noun or ‘broke’.)

Responses to these prompts strongly suggest that GPT-4 doesn’t plan out a fully unscrambled sentence before making its next-word guesses. At each moment, it knows what words remain available, but it doesn’t check whether using a word at one point will block syntactically viable continuations. So while it often does a fine job by meandering through a correct path of guesses, it can also make missteps that prevent it from finishing correctly.
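
The contrast can be made concrete with a toy sketch. The local ‘scores’ below are a made-up stand-in for the model’s next-word preferences (for example, favoring ‘saw’ as a verb right after ‘he’); the point is only that a locally plausible choice can lead to a dead end, which looking ahead over orderings would avoid.

const words = ["broke", "he", "saw", "the"];

// Hypothetical local preferences: how natural does `next` look right after `prev`?
function localScore(prev, next) {
  if (prev === null && next === "he") return 1;
  if (prev === "he" && next === "saw") return 3;   // 'he saw ...' is very common
  if (prev === "he" && next === "broke") return 2;
  if (prev === "saw" && next === "the") return 2;
  if (prev === "broke" && next === "the") return 2;
  return 0;
}

// Greedy decoding: always take the locally best next word, never look ahead.
function greedy(remaining, prev = null, out = []) {
  if (remaining.length === 0) return out.join(" ");
  const best = [...remaining].sort((a, b) => localScore(prev, b) - localScore(prev, a))[0];
  return greedy(remaining.filter((w) => w !== best), best, [...out, best]);
}

console.log(greedy(words)); // "he saw the broke" -- locally plausible, but a dead end
// A procedure that looked ahead (or searched over orderings) could instead
// keep only orderings that end grammatically, such as "he broke the saw".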

Overall assessments

I have tried to find clear evidence of anticipation and come up short. The tasks that anticipation would help with are relatively simple for humans to perform. They involve skills that GPT-3 & 4 seem to possess, yet GPT-3 & 4 fail at them. Plausibly, their failure is attributable to their approach to reasoning, which reflects an inability to anticipate.

Anticipation is not an all-or-nothing thing, and it is possible that models fail at tasks where anticipation would be helpful for reasons other than their inability to anticipate. Models may engage in anticipation on some tasks and not on others. It is possible that the tasks I have focused on are not the right ones: perhaps they are too different from the tasks present in the training data.

I have no doubts that a properly trained model could engage in anticipation. However, it may be difficult to train a model to engage in anticipation in a general way. It is one thing to pick a pair of rhyming words before using either. It is another to anticipate complete thoughts. Thoughts involve complex internal structures that may be hard to produce within a single decoder block.

It is also notable that existing models can do the anticipation tasks if they are given encouragement to explore. Models can add information to the context window that allows them to work out the answer, even if they can’t figure out that answer on the fly. For instance, GPT-4 can find the correct function compositions if it is told to try out a number of possible compositions before delivering its answer. It doesn’t know which compositions will work, but it can try them all and then choose the one that worked.
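
Written as ordinary code, that ‘try them all, then choose’ strategy is just a brute-force search. The sketch below is an analogy for what the exploratory text in the context window accomplishes, not a claim about GPT-4’s internals.

const fns = { f: (y) => y * y, g: (y) => y + 2, h: (y) => y * 3 };

// Enumerate every ordering of the function names.
function* permutations(names) {
  if (names.length <= 1) { yield names; return; }
  for (const n of names) {
    for (const rest of permutations(names.filter((m) => m !== n))) {
      yield [n, ...rest];
    }
  }
}

// Try each composition (outermost function first) and keep the one that works.
function findComposition(target, x) {
  for (const order of permutations(Object.keys(fns))) {
    const value = [...order].reverse().reduce((v, name) => fns[name](v), x);
    if (value === target) {
      return order.map((n) => n + "(").join("") + "x" + ")".repeat(order.length);
    }
  }
  return null;
}

console.log(findComposition(50, 4)); // "g(h(f(x)))"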

It is plausible that this trial and error process actually reflects better what human beings do when approaching these problems. We may be worse than GPT-4 at unscrambling sentences off the cuff. It is only by pausing to think through different possibilities that we can solve these tasks.

That said, the difference in the basic approach to reasoning remains notable.

First, exploration can be extremely inefficient. If a model needs to actually explore the space of possibilities by adding outputted text to its context window, and the space is large, it may not be practical to consider all the possibilities.

Second, this may preclude certain forms of creativity that allow humans to discover complex new ideas. If the only approach to reasoning involves searching the space of plausible next guesses, then the models might never be able to stumble onto moments of holistic insight that enable human genius. This may indicate some limitations that, for all their incredible successes, LLMs may find difficult to overcome.


  1. ↩︎

    https://github.com/google/BIG-bench contains a large number of tasks to benchmark AI capabilities, but none directly assesses anticipation.

  2. ↩︎

    Large Language Models Still Can’t Plan https://arxiv.org/abs/2206.10498

  3. ↩︎

    Sparks of Artificial General Intelligence: Early experiments with GPT-4 https://arxiv.org/abs/2303.12712

  4. ↩︎

    ‘He saw the broke’ is not strictly bad English. In the right context, it can be ok: “He saw the broke. He saw the downtrodden. He saw those whom society overlooks…” But judging from other cases, this doesn’t seem to be what is going on here.
