How do you square that with Algorithm 10 here: https://arxiv.org/pdf/2207.09238? See Appendix B for the list of notation; that should save some time if you don’t want to read the whole thing.
(Nice resource by the way, the only place I have seen anyone write down a proper pseudo-code algorithm for transformers)
Seems to match the diagram from @hmys in that the entire row to the left of a position goes into its multiheaded attention operation—NOT the original input tokens.
Yeah, it’s not just the tokens. It does look at the previous residual streams. What I’m saying is just that, for each token, the model can only think internally a fixed amount, bounded by the number of layers. It can NOT think for longer, without writing down its thoughts, as the context grows.
In the article you linked, X is the residual stream; it is a tensor with dimensions (length of input sequence) x (dimension of model). But X goes through multiple updates, where each depends only on the previous layer. Here is the loop unrolled for L = 2:
So X0 = Embed + PosEmbed
X1 = X0 + MultiheadAttention1(X0)
X2 = X1 + MLP1(X1)
X3 = X2 + MultiheadAttention2(X2)
X4 = X3 + MLP2(X3)
Out = Softmax(Unembed(X4))
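To make that concrete, here is a minimal NumPy sketch of the same unrolling. The attention and MLP bodies are simplified stand-ins with random weights (single head, no LayerNorm, names are mine, not the paper’s exact Algorithm 10), but the dependency structure is the point: each update reads only the previous X.

```python
import numpy as np

d_model, n_seq = 16, 8
rng = np.random.default_rng(0)

def multihead_attention(X, W):
    # Causally masked single-head attention: position i attends to j <= i only.
    scores = (X @ W["q"]) @ (X @ W["k"]).T / np.sqrt(d_model)
    mask = np.triu(np.ones((len(X), len(X)), dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ (X @ W["v"])

def mlp(X, W):
    # Position-wise two-layer MLP with a ReLU.
    return np.maximum(X @ W["in"], 0) @ W["out"]

def attn_params():
    return {k: rng.normal(size=(d_model, d_model)) for k in ("q", "k", "v")}

def mlp_params():
    return {"in": rng.normal(size=(d_model, 4 * d_model)),
            "out": rng.normal(size=(4 * d_model, d_model))}

X0 = rng.normal(size=(n_seq, d_model))             # Embed + PosEmbed (stand-in)
X1 = X0 + multihead_attention(X0, attn_params())   # reads X0 only
X2 = X1 + mlp(X1, mlp_params())                    # reads X1 only
X3 = X2 + multihead_attention(X2, attn_params())   # reads X2 only
X4 = X3 + mlp(X3, mlp_params())                    # reads X3 only
# Out = softmax(Unembed(X4)) would follow here.
```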
The point is that it’s not like X1[i] = f(X1[:i]). X1 is a function of X0 only. So the maximum length of any computational path is the HEIGHT of hmys’ diagram, not the length. You can’t have any computation going from A1 → B1, only from A1 → B2. That’s what hmys says also.
But A1 has direct contributions to B2, C2, D2 and E2 because of attention.
So, unlike in the diagram above, you can’t go immediately to the right, only to the right and up in one computation step.
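If you want to verify that mechanically, here is a quick check reusing the definitions from the sketch above: computing each position’s layer-1 value while only ever handing it X0 up to and including that position reproduces the parallel result exactly, i.e. X1[i] is a function of X0[:i+1] and never of X1[:i].

```python
# Reuses multihead_attention, attn_params, X0, n_seq from the sketch above.
W_attn = attn_params()
X1_parallel = X0 + multihead_attention(X0, W_attn)
X1_rowwise = np.stack([
    (X0[:i + 1] + multihead_attention(X0[:i + 1], W_attn))[i]  # sees X0[:i+1] only
    for i in range(n_seq)
])
assert np.allclose(X1_parallel, X1_rowwise)  # no same-layer (A1 -> B1) path exists
```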
NOTE: You can also see this just by looking at the code in the document you sent. The for loop is just run a constant L times, no matter what. What is L? The number of transformer layers. Each inner loop does a fixed amount of computation. And the only thing that changes from step to step is that there are new tokens written (assuming we’re autoregressively sampling from the transformer in a loop). Ergo, if the model isn’t communicating its “thinking” in writing, it can’t think for longer as the context grows.
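In code form, the sampling loop is roughly the following (again reusing the pieces from the sketch above; the embedding/unembedding are crude stand-ins of my own and positional embeddings are omitted, so this is only meant to show where the constant L sits, not to reproduce the paper’s algorithm).

```python
# Reuses multihead_attention, mlp, attn_params, mlp_params, d_model, rng from above.
vocab, L = 50, 2
emb = rng.normal(size=(vocab, d_model))                   # stand-in embedding matrix
layers = [{"attn": attn_params(), "mlp": mlp_params()} for _ in range(L)]

def forward(tokens):
    # One forward pass: the layer loop runs exactly L times no matter how
    # long `tokens` is, so per-token internal "thinking" depth is fixed at L.
    X = emb[np.array(tokens)]                 # Embed (PosEmbed omitted here)
    for layer in layers:                      # always L iterations
        X = X + multihead_attention(X, layer["attn"])
        X = X + mlp(X, layer["mlp"])
    return X @ emb.T                          # Unembed: logits over the vocabulary

def sample(prompt, n_new):
    # Autoregressive loop: the only thing that grows is the written context.
    tokens = list(prompt)
    for _ in range(n_new):
        logits = forward(tokens)
        tokens.append(int(logits[-1].argmax()))   # greedy pick, for simplicity
    return tokens

print(sample([3, 1, 4], n_new=5))
```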
I see, you’re saying that since information flows one step up at each application of MHAttention, no computation path is actually longer than the depth L.
This seems to be right—that means the second paragraph of my comment is wrong, and I “updated too far” towards LLMs being able to think for a long time. But I was also wrong initially since they can at least remember most of their thinking from previous steps—they just don’t get additional time to build on it above the bound L.
Yeah, that’s my understanding as well. Tell me if your understanding changes further in relevant ways.