What I was pointing to was the fact that the feed forward networks for the new token don’t have access to the past feed-forward states of the other tokens [...] When curing cancer the second time, it didn’t have access to any of the processing from the first time. Only what previous layers outputted for previous tokens.
That is the misconception. I’ll try to explain it in my own words (because, frankly, despite knowing how a transformer works, I can’t understand Radford Neal’s explanation).
In the GPT architecture each token starts out as an embedding, which is then enriched in each layer with information from previous tokens and with knowledge stored in the network itself. So you have a vector that is modified in each layer; let’s call the output of the n-th layer v_n.
The computation of v_n accesses the v_{n−1} of all previous tokens! So in your example, if in layer n−1 at some token the cure for cancer is discovered, all following tokens will have access to that information in layer n. The model cannot forget this information. It might never access it again, but the information will always be there for the taking.
This is in contrast to a recurrent neural network, which might actually forget important information if it is unlucky in how it edits its state.
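Here is a minimal sketch of what I mean, in toy numpy (single attention head, no layer norms, made-up shapes; not GPT-3’s actual implementation): computing v_n at a position attends over the v_{n−1} of every position up to and including it.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_n(v_prev, Wq, Wk, Wv, W_up, W_down):
    """One decoder layer: v_prev has shape (T, d) and holds v_{n-1} for all T tokens so far."""
    T, d = v_prev.shape
    q, k, v = v_prev @ Wq, v_prev @ Wk, v_prev @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: position t attends only to positions i <= t, i.e. to the
    # v_{n-1} of itself and of all previous tokens.
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    x = v_prev + softmax(scores) @ v      # information gathered from earlier tokens
    hidden = np.maximum(0, x @ W_up)      # the wide feed-forward hidden activation
    return x + hidden @ W_down            # v_n for every position
```

Attention is the only place where earlier positions enter the computation; everything after it operates on each position’s vector separately.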
I believe I understood Radford Neal’s explanation and I understand yours, as best I can tell, and I don’t think it so far contradicts my model of how LLMs work.
I am aware that the computation of v_n has access to the v_{n−1} of all previous tokens. But the v_{n−1} are just the outputs of the feed-forward networks of the previous layer. Imagine a case where the output was 1000 times smaller than the widest part of the feed-forward network. In that case, most of the information in the feed-forward network would be “lost” (unavailable to v_n).
Of course, you assume that if the model is well-trained, the information most pertinent to predicting the next token will make it into the output. But “the most pertinent information” and “all the information” are two different things, and some information might seem more relevant now that the new token has appeared than it did before, leading to duplicate work, or even to cases where a previous run happened to understand something the subsequent run does not.
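To make the width point concrete, here is a toy sketch with a hypothetical 1000× expansion (the numbers are made up for illustration); the wide hidden activation exists only inside the block and is never stored anywhere:

```python
import numpy as np

d_model, d_ff = 64, 64 * 1000                 # hypothetical 1000x expansion from the example above
W_up = np.random.randn(d_model, d_ff) * 0.01
W_down = np.random.randn(d_ff, d_model) * 0.01

x = np.random.randn(d_model)                  # a token's vector entering the ff block
hidden = np.maximum(0, x @ W_up)              # the wide "working" activation (size d_ff)
out = x + hidden @ W_down                     # only this d_model-sized vector survives
# `hidden` is discarded here; later layers and later tokens only ever see `out`.
```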
As Radford Neal also mentioned, the fact that the model may/may not properly use information from previous states is another possible issue.
This is all pretty complicated so hopefully what I’m saying is clear.
The function of the feed-forward components in transformers is mostly to store knowledge and to enrich the token vectors with that knowledge. The wider you make the ff-network, the more knowledge you can store. The network is trained to put the relevant knowledge from the wide hidden layer into the output (i.e., into the token stream).
I fail to see the problem in the fact that the hidden activation is not accessible to future tokens. The ff-nn is just a component to store and inject knowledge. It is wide because it has to store a lot of knowledge, not because the hidden activation has to be wide. The full content of the hidden activation in isolation just is not that relevant.
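One way to picture that (a common interpretive framing, not a claim about any particular model’s internals): each hidden unit acts like a key that fires on some pattern in the token vector, and the matching rows of the down-projection are the “facts” that get written back into the token stream.

```python
import numpy as np

d_model, d_ff = 8, 32                        # made-up sizes
keys = np.random.randn(d_ff, d_model)        # rows of the up-projection: pattern detectors
values = np.random.randn(d_ff, d_model)      # rows of the down-projection: stored knowledge

def ffn(x):
    match = np.maximum(0, keys @ x)          # how strongly each stored pattern matches x
    return match @ values                    # inject a weighted mix of the matching "facts"
```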
Case in point: nowadays the ff-nns actually look different from the ones in GPT-3. They have two parallel hidden projections, with one acting as a gating mechanism: the design has changed to make it possible to actively erase parts of the hidden activation!
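A rough sketch of that kind of gated block (SwiGLU-style, as used in many post-GPT-3 models; exact details vary by model): one projection multiplicatively gates the other, so near-zero gate values erase the corresponding hidden units.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def gated_ffn(x, W_gate, W_up, W_down):
    gate = silu(x @ W_gate)       # near-zero entries here suppress...
    hidden = gate * (x @ W_up)    # ...the matching parts of the hidden activation
    return hidden @ W_down
```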
Also: this seems very different from what you are talking about in the post; it has nothing to do with “the next run”. The hidden-layer activations aren’t even “accessible” in the same run! They are purely internal “gears” of a subcomponent.
It also seems to me like you have retreated from
with its intermediate states (“working memory”) completely wiped.
to “intermediate activations of ff-components are not accessible in subsequent layers, and because these are wider than the output, not all of the information contained in them can make it into the output”.
I’ll admit I am not confident about the nitty-gritty details of how LLMs work. My two core points (that LLMs are too wide relative to their depth, and that LLMs are not recurrent and process in a fixed number of layers) don’t hinge on the “working memory” problems LLMs have. But I still think the working-memory point is true, based on my understanding. For LLMs, compute is separate from data, so the network has to be recomputed on each run, with the new token added. Some of its inputs may be cached, but that’s just a performance optimization.
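For what it’s worth, the caching really is just an optimization: with a causal mask, the keys and values of earlier positions do not depend on the new token, so storing them gives the same result as recomputing them. A toy single-head sketch (made-up shapes):

```python
import numpy as np

def attend(q, K, V):
    w = np.exp(q @ K.T / np.sqrt(K.shape[1]))
    return (w / w.sum()) @ V

d = 16
Wq, Wk, Wv = [np.random.randn(d, d) for _ in range(3)]
xs = np.random.randn(5, d)                        # 5 token vectors at some layer

# "Recompute each run": project every token again.
full = attend(xs[-1] @ Wq, xs @ Wk, xs @ Wv)

# "Cache": keep K/V from earlier runs, project only the newest token.
K = np.vstack([xs[:-1] @ Wk, xs[-1:] @ Wk])
V = np.vstack([xs[:-1] @ Wv, xs[-1:] @ Wv])
cached = attend(xs[-1] @ Wq, K, V)

assert np.allclose(full, cached)                  # identical output for the new token
```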
Imagine an LLM is processing some text. At layer n, the feed-forward network has (somehow, and as the first layer to do so) decided that the text definitely relates to hostility and maybe relates to politics, but it isn’t really sure, so let’s say the politics part doesn’t really make it into the output for that layer, because there’s more important information to encode (it thinks). Then in the next run, the token “Trump” is added to the input. At layer n, the feed-forward network has to decide from scratch that this token is related to politics. Nothing of the previous “this seems kinda political, not sure” decision is stored in the LLM, even though it was in fact computed. In an alternative architecture, maybe the “brain area” associated with politics would already be slightly active; then the token “Trump” comes in, and now it’s even more active.
it’s all there for layer n+1’s attention to process, though. at each new token position added to the end, we get to use the most recent token as the marginal new computation result produced by the previous token position’s forward pass. for a token position t, for each layer n, n cannot read the output of layer n at earlier tokens i < t, but n+1 can read everything that happened anywhere in the past, and that gathering process is used to refine the meaning of the current token into a new vector. so, you can’t have hidden state build up in the same way, and each token position runs a partially-shared algorithm. but you can have unhidden state build up, and that unhidden state gets you full turing completeness.
(the “brain area” equivalent would be “feature subspace”, afaik, which is actually a slightly more general concept that also covers when a human brain lights up in ways that aren’t regionally localized)
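A tiny bookkeeping sketch of that visibility rule (no real model, just the index arithmetic; the function name is made up):

```python
def can_read(reader_layer, reader_pos, writer_layer, writer_pos):
    """Can the attention in `reader_layer` at position `reader_pos` see the
    output that `writer_layer` produced at position `writer_pos`?"""
    return writer_layer < reader_layer and writer_pos <= reader_pos

assert can_read(reader_layer=5, reader_pos=10, writer_layer=4, writer_pos=3)      # n+1 reads n anywhere in the past
assert not can_read(reader_layer=4, reader_pos=10, writer_layer=4, writer_pos=3)  # n never reads n at earlier positions
```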
Does this not mean the following, though?
In layer n, the feed-forward network for token position t will potentially waste time doing things that were already done in layer n at earlier tokens i < t.
This puts a constraint on the ability of different layers to represent different levels of abstraction, because now both layer n and n+1 need to be able to detect whether something “seems political”, not just layer n.
This means the network needs to be deeper when we have more tokens, because token t needs to wait until layer n+1 to see if token t-1 had the feature “seems political”, and token t+1 needs to wait until layer n+2 to see if token t had the feature “seems political”, and so on.