Great post. I agree with the “general picture”; however, the proposed argument for why LLMs have some of these limitations seems to me clearly wrong.
> The reason for both of these defects is that the training paradigm for LLMs is (myopic) next token prediction, which makes deliberation across tokens essentially impossible—and only a fixed number of compute cycles can be spent on each prediction. This is not a trivial problem. The impressive performance we have obtained is because supervised (in this case technically “self-supervised”) learning is much easier than e.g. reinforcement learning and other paradigms that naturally learn planning policies.
Transformers form internal representations at each token position, and gradients flow backwards in time because of attention.
This means the internal representation a model forms at token A is incentivized to be useful for predicting the token after A, but also tokens 100 steps later than A. So while LLMs are technically myopic wrt the exact token they write (sampling discretizes and destroys gradients), they are NOT incentivized to be myopic wrt the internal representations they form, which is clearly the important part in my view (the vast majority of the information in a transformer’s processing lies there, and this information is enough to determine which token it ends up writing), even though they are trained on a myopic next-token objective.
For example, a simple LLM transformer might look like this (left to right is token position; upwards is movement through the transformer layers at each token position; assume A0 was a starting token and B0-E0 were sampled autoregressively):
A2 → B2 → C2 → D2 → E2
^    ^    ^    ^    ^
A1 → B1 → C1 → D1 → E1
^    ^    ^    ^    ^
A0 → B0 → C0 → D0 → E0
In this picture, there is no gradient that goes from A1 to E2 through B0 (the immediate next token that A1 contributes to writing), since sampling destroys gradients. But A1 has direct contributions to B2, C2, D2 and E2 because of attention, and A1 being useful for helping B2, C2, etc. make their predictions will create lower loss. So gradient descent will heavily incentivize A1 containing a representation that’s useful for making accurate predictions arbitrarily far into the future (well, at least a million tokens into the future, or however big the context window is).
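To make this concrete, here is a minimal PyTorch sketch of a toy two-layer causal transformer (the model, sizes and layer choices are made up for illustration, not from the post). It checks that the layer-1 representation at the first position (the “A1” of the diagram) receives a nonzero gradient from a loss placed at the last position, even though no gradient ever flows through a sampled token:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, vocab, T = 32, 4, 100, 10

embed = nn.Embedding(vocab, d_model)
layer1 = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=64, dropout=0.0, batch_first=True)
layer2 = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=64, dropout=0.0, batch_first=True)
unembed = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (1, T))                             # the "A0 ... E0" row: input token ids
causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)   # additive mask: no attending to future positions

h0 = embed(tokens)                        # layer-0 residual stream (the "A0 ... E0" row)
h1 = layer1(h0, src_mask=causal)          # layer-1 residual stream (the "A1 ... E1" row)
h1.retain_grad()                          # keep the gradient on this intermediate activation
h2 = layer2(h1, src_mask=causal)          # layer-2 residual stream (the "A2 ... E2" row)
logits = unembed(h2)

# Put the loss ONLY at the final position, i.e. the prediction made at "E2".
target = torch.randint(0, vocab, (1,))
loss = nn.functional.cross_entropy(logits[:, -1, :], target)
loss.backward()

# "A1" (first position, layer 1) gets a nonzero gradient from a loss defined many
# positions later, via attention -- the training signal is not myopic about internals.
print(h1.grad[0, 0].abs().sum())          # > 0
```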
Overall, I think it’s a big mistake to think that the LLM training objective being myopic says much about how myopic LLMs will be after they’ve been trained, or how myopic their internals are.
You’re totally right—I knew all of the things that should have let me reach this conclusion, but I was still thinking about the residual stream in the upwards direction on your diagram as doing all of the work from scratch, just sort of glancing back at previous tokens through attention, when it can also look at all the previous residual streams.
This does invalidate a fairly load-bearing part of my model, in that I now see that LLMs have a meaningful ability to “consider” a sequence in greater and greater depth as its length grows—so they should be able to fit more thinking in when the context is long (without hacks like chain-of-thought etc.).
Other parts of my mental model still hold up, though. While this proves that LLMs should be better at figuring out how to predict sequences than I previously thought (possibly even inventing sophisticated mental representations of the sequences), I still don’t expect methods of deliberation based on sampling long chains of reasoning from an LLM to work—that isn’t directly tied to sequence prediction accuracy; it would require a type of goal-directed thinking which I suspect we do not know how to train effectively. That is, the additional thinking time is still spent on token prediction, potentially of later tokens, but not on choosing to produce tokens that are useful for further reasoning about the given task (the actual user query), except insofar as that is useful for next-token prediction. RLHF changes the story, but as discussed I do not expect it to be a silver bullet.
I’m not sure this is entirely correct. It’s still true that transformers are bounded in the amount of computation they can do in the residual stream before the computation has to “cash out” in a predicted token, though. In the picture above it’s a little unclear, but C2 can only read from A1, B1 and C1, not from A2 and B2. There is a maximum-length path of computation from one token input to a token output.
The above only establishes that LLM training doesn’t incentivize them to be myopic. Like, if you ask an LLM to continue the string “What is 84x53? Answer with”, then the next few tokens to predict might be ” one word. Answer:” or something like ” an explanation before you give the final number.”
The above argument just shows that the LLM might still internally be thinking about what 84x53 is on the residual streams of the ” Answer” and ” with” tokens, even if that is only relevant for later tokens, and that it can easily figure out ” one word” or ” an explanation before you give the final number.” without computing the answer.
If you prompt a model with two sentences, they’re probably “thinking” about a bunch of stuff that’s relevant for predicting words many many sentences later.
But they can’t have very complex thoughts unless they write them down. Obviously, if you just make them bigger they can have more and more complex thoughts, but you’d expect the thoughts they’re able to have when they can write stuff down to be a lot more complex than the ones they can have if they have to, for example, think deceptive things that don’t appear in writing.
I mean, I don’t want to give Big Labs any ideas, but I suspect the reasoning above implies that the o1/DeepSeek-style RL procedures might work a lot better if the models could think internally for a long time, as in the thinking-in-embedding-space model, because gradients from the reward don’t really flow through the placed tokens now. The placed tokens are kind of like the environment in standard RL thinking, but they could actually be differentiated through, turning it into more of a standard supervised problem, which is a lot easier than open-ended RL.
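Here is a toy PyTorch sketch of what I take “thinking in embedding space” to mean here (feeding hidden states back in as inputs); the names, sizes and the single-block model are all hypothetical, not any lab’s actual method. In the discrete version the argmax/sampling step blocks gradients, while in the continuous version a terminal objective can be backpropagated through the whole chain of “thoughts”:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, vocab = 32, 4, 100

embed = nn.Embedding(vocab, d_model)
block = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=64, dropout=0.0, batch_first=True)
unembed = nn.Linear(d_model, vocab)

def forward(seq):
    """One causal forward pass over a (1, t, d_model) sequence of input vectors."""
    t = seq.shape[1]
    mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
    return block(seq, src_mask=mask)

prompt = embed(torch.randint(0, vocab, (1, 4)))

# (a) Ordinary chain of thought: sample (here: argmax) a token and re-embed it.
#     The argmax is not differentiable, so a reward at the end sends no gradient
#     back through the decision that placed each intermediate token.
seq = prompt
for _ in range(3):
    h = forward(seq)
    tok = unembed(h[:, -1]).argmax(dim=-1)                 # gradient path stops here
    seq = torch.cat([seq, embed(tok)[:, None]], dim=1)

# (b) "Thinking in embedding space": append the last hidden state itself as the
#     next input vector, so the whole chain stays differentiable.
seq = prompt
thoughts = []
for _ in range(3):
    h = forward(seq)
    thought = h[:, -1:]                                    # stay in embedding space
    thought.retain_grad()
    thoughts.append(thought)
    seq = torch.cat([seq, thought], dim=1)

final = unembed(forward(seq)[:, -1])
final.logsumexp(-1).sum().backward()                       # stand-in for a terminal reward
print(thoughts[0].grad.abs().sum())                        # nonzero: credit reaches the first "thought"
```

This is only meant to show where the gradient path breaks; a real o1-style setup would involve an actual reward signal and policy-gradient machinery rather than backprop through a toy scalar.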
How do you square that with Algorithm 10 here: https://arxiv.org/pdf/2207.09238? See Appendix B for the list of notation; it should save some time if you don’t want to read the whole thing.
(Nice resource by the way, the only place I have seen anyone write down a proper pseudo-code algorithm for transformers)
Seems to match the diagram from @hmys in that the entire row to the left of a position goes into its multiheaded attention operation—NOT the original input tokens.
Yeah, it’s not just the tokens; it does look at the previous residual streams. What I’m saying is just that at each token, the model can only think internally a fixed amount, bounded by the number of layers. It can NOT think for longer, without writing down its thoughts, as the context grows.
In the article you linked, X is the residual stream; it is a tensor with dimensions (length of input sequence) x (dimension of model). But X goes through multiple updates, each of which depends only on the previous layer. Here is the loop unrolled for L = 2:
X0 = Embed + PosEmbed
X1 = X0 + MultiheadAttention1(X0)
X2 = X1 + MLP1(X1)
X3 = X2 + MultiheadAttention2(X2)
X4 = X3 + MLP2(X3)
Out = Softmax(Unembed(X4))
The point is that it’s not like X1[i] = f(X1[:i]); X1 is a function of X0 only. So the maximum length of any computational path is the HEIGHT of hmys’s diagram, not the length. You can’t have any computation going from A1 → B1, only from A1 → B2. That’s what hmys says too:
> But A1 has direct contributions to B2, C2, D2 and E2 because of attention,
So, unlike in the diagram above, you can’t go immediately to the right, only to the right and up in one computation step.
NOTE: You can also see this just by looking at the code in the document you sent. The for loop is run a constant L times, no matter what. What is L? The number of transformer layers. Each inner loop does a fixed amount of computation, and the only thing that changes from one step to the next is that there are new tokens written (assuming we’re autoregressively sampling from the transformer in a loop). Ergo, if the model isn’t communicating its “thinking” in writing, it can’t think for longer as the context grows.
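For concreteness, here is a minimal Python/numpy sketch of that unrolled loop, with toy stand-ins (no learned weights) for the attention and MLP blocks; it is only meant to show the shape of the computation, not to be a faithful transformer. The loop body runs exactly L times per forward pass, so each position gets a fixed, depth-bounded amount of processing; a longer context only adds rows to X (more positions), not more iterations:

```python
import numpy as np

def causal_attention(X):
    """Toy attention: each position i mixes information from positions <= i of its INPUT X."""
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    scores[np.triu_indices(T, k=1)] = -np.inf       # mask out future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

def mlp(X):
    return np.maximum(X, 0.0)                       # toy position-wise nonlinearity

def forward(X0, L):
    X = X0                                          # X0 = Embed + PosEmbed
    for _ in range(L):                              # runs a constant L times, no matter what
        X = X + causal_attention(X)                 # X_{l+1} reads only X_l: one step "up"
        X = X + mlp(X)
    return X                                        # Unembed / softmax would follow

X0 = np.random.randn(5, 8)                          # 5 token positions, d_model = 8
print(forward(X0, L=2).shape)                       # (5, 8); growing the context adds rows, not depth
```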
I see, you’re saying that since information flows one step up at each application of MHAttention, no computation path is actually longer than the depth L.
This seems to be right—that means the second paragraph of my comment is wrong, and I “updated too far” towards LLMs being able to think for a long time. But I was also wrong initially since they can at least remember most of their thinking from previous steps—they just don’t get additional time to build on it above the bound L.
Yeah, that’s my understanding as well. Tell me if your understanding changes further in relevant ways.
> I mean, I don’t want to give Big Labs any ideas, but I suspect the reasoning above implies that the o1/DeepSeek-style RL procedures might work a lot better if the models could think internally for a long time
I expect GPT-5 to implement this, based on recent research and how they phrase it.
Yes, this is the type of idea big labs will definitely already have (it’s also what I think ~100% of the time someone says “I don’t want to give big labs any ideas”).
That’s what I also thought, haha; otherwise I wouldn’t have posted it.