I am predicting a world that looks fantastically different from the one predicted by AI 2027: the difference between apocalypse and things staying basically as they are now. The two predictions are clearly distinguishable.
I agree that having internal representations that can be modified while reasoning is something that enables deep thinking, and I think this is something LLMs are bad at, because of the wideness/depth issue and the lack of recurrence.
I only have a lay understanding of how LLMs work, so forgive me if I’m wrong about the specifics. It seems to me the KV cache is just an optimization; either way, the LLM’s output is deterministic on the input tokens, and information is not being lost. What I was pointing to was the fact that the feed-forward networks for the new token don’t have access to the past feed-forward states of the other tokens, so they can’t see e.g. what reasoning paths were dead ends, unless information about those dead ends made it into the output.

This is a toy example, but I’m imagining a far-future LLM with enough understanding of biology and chemistry baked into its (for some reason very huge wide/deep) feed-forward networks to cure cancer in a single layer (for some layer). Imagine in one run the input is “the cure for cancer”, and imagine the attention dimension is very narrow. In one layer, the feed-forward network may cure cancer in this run, among doing many other things, and then possibly discard that information when going to the next layer. In a subsequent run on the input “the cure for cancer is”, it may cure cancer again, and this time include some detail of that cure in its output to the next layer, since now it’s more likely to be relevant to predicting the next token. When curing cancer the second time, it didn’t have access to any of the processing from the first time, only what previous layers outputted for previous tokens. Does that sound right?

If so, the fact that the LLM is strictly divided into layers, with feed-forward parts being wider than other parts, is a limitation on deep thinking. Obviously the example is an exaggeration, because a feed-forward layer wouldn’t be curing cancer on its own, but it speaks to the fact that even though information isn’t being lost, computation is segregated in a way that some processing done in previous runs isn’t available to future runs.
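To make that concrete, here’s a minimal numpy sketch of a single decoder-style layer (toy sizes, made-up weights, layer norms omitted; I may be oversimplifying): only the d_model-sized output of the block survives into the residual stream that later tokens can attend to, while the wide feed-forward hidden activation is used inside the block and then discarded.

```python
# Toy sketch, not any real model: one decoder layer in numpy, showing that the
# wide FFN hidden activation is transient; only the d_model-sized output is
# carried forward and is what later tokens can attend to.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 8, 32, 5   # d_ff is 4x wider than d_model, GPT-style

# Made-up weights standing in for a trained layer.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W_up, W_down = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))

def decoder_layer(x):
    """x: (n_tokens, d_model) residual stream entering this layer."""
    # Causal self-attention: each position mixes in earlier positions' vectors.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_model)
    scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -np.inf)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    x = x + attn @ v

    # Feed-forward: expand to d_ff, nonlinearity, project back down to d_model.
    hidden = np.maximum(x @ W_up, 0.0)   # wide hidden activation (n_tokens, d_ff)
    out = x + hidden @ W_down            # only this (n_tokens, d_model) survives
    # `hidden` goes out of scope here: whatever it computed that didn't make it
    # through W_down is unavailable to later layers and to later tokens.
    return out

x0 = rng.normal(size=(n_tokens, d_model))
x1 = decoder_layer(x0)
print(x1.shape)  # (5, 8): the per-token vectors the next layer's attention can read
```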
I already responded to what joseph_c said about the human brain, but I’ll go into a bit more detail here. Progressing 200 steps forward in a feed-forward neural network is not nearly as “deep” as progressing 200 neurons in any direction in a recurrent network, and either way, a 200-neuron chain of processing is not a lot. I suspect when doing deep thinking, the depth of neural firings in humans would be much greater, over a longer period of time. I think brains are deeper than LLMs, and only wider in the sense that they’re currently larger overall.
Coming up with new clever jokes actually takes a lot of time for humans. Stand-up comedians spend hours writing every day to end up with one hour of clever jokes per year. When people come up with jokes that are funny in conversation, that is the product of one of three things:
The joke isn’t particularly clever, but people are in the mood to laugh
The joke is clever, and you got lucky
The joke is funny because you’re a funny person who already has a bunch of “joke formats” memorized, which makes telling funny jokes on the fly easier. But even then, it’s not fully shallow, and you can’t do it reliably. It just makes it easier.
I’m not sure, but I think you possibly could make an LLM that is so extremely wide that it could cure cancer, be superintelligent, etc. But I think actually training/running that network would be so exorbitantly expensive that you shouldn’t bother (for the reasons I pointed to in my post), and that’s why LLMs will plateau compared to less limited architectures.
What I was pointing to was the fact that the feed forward networks for the new token don’t have access to the past feed-forward states of the other tokens [...] When curing cancer the second time, it didn’t have access to any of the processing from the first time. Only what previous layers outputted for previous tokens.
That is the misconception. I’ll try to explain it in my own words (because frankly, despite knowing how a transformer works, I can’t understand Radford Neal’s explanation).
In the GPT architecture, each token starts out as an embedding, which is then enriched in each layer with information from previous tokens and with knowledge stored in the network itself. So you have a vector that is modified in each layer; let’s call the output of the n-th layer v_n.
The computation of v_n accesses the v_{n-1} of all previous tokens! So in your example, if in layer n-1 at some token the cure for cancer is discovered, all following tokens will have access to that information in layer n. The model cannot forget this information. It might never access it again, but the information will always be there for the taking.
This is in contrast to a recurrent neural network that might actually forget important information if it is unfortunate in editing its state.
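Here is a tiny dependency sketch of what I mean (labels rather than real math; the bookkeeping is my own toy, not an actual model): v_n at token t is computed from the v_{n-1} of tokens 0..t, so anything written into the stream at some layer and token stays readable at every deeper layer for every later token.

```python
# Toy dependency trace: v[n][t] is computed from v[n-1][0..t], so a fact written
# into the stream at (layer n-1, token i) is readable at every (layer >= n, token >= i).
n_layers, n_tokens = 4, 3

# v[n][t] = set of (layer, token) sites whose output could have influenced it.
v = [[{("emb", t)} for t in range(n_tokens)]]
for n in range(1, n_layers + 1):
    prev = v[-1]
    layer = []
    for t in range(n_tokens):
        deps = set().union(*prev[: t + 1])   # attention reads v_{n-1} of tokens <= t
        deps.add((n, t))                     # plus this layer's own computation here
        layer.append(deps)
    v.append(layer)

# Suppose "the cure for cancer" was computed at layer 2, token 0:
print((2, 0) in v[3][2])  # True: token 2 can use it from layer 3 onward
print((2, 0) in v[2][2])  # False: not within the same layer at a later token
```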
I believe I understand Radford Neal’s explanation and yours, as best I can tell, and so far I don’t think either contradicts my model of how LLMs work.
I am aware that the computation of v_n has access to the v_{n-1} of all previous tokens. But the v_{n-1} are just the outputs of the feed-forward networks of the previous layer. Imagine a case where that output was 1000 times narrower than the widest part of the feed-forward network. In that case, most of the information in the feed-forward network would be “lost” (unavailable to v_n).
Of course, you assume that if the model is well trained, the information most pertinent to predicting the next token will make it into the output. But “the most pertinent information” and “all the information” are two different things, and some information might only seem relevant now that the new token has appeared, leading to duplicated work, or even to cases where a previous run happened to understand something a subsequent run did not.
As Radford Neal also mentioned, the fact that the model may or may not properly use information from previous states is another possible issue.
This is all pretty complicated so hopefully what I’m saying is clear.
The function of the feedforward components in transformers is mostly to store knowledge and to enrich the token vectors with that knowledge. The wider you make the FF network, the more knowledge you can store. The network is trained to put the relevant knowledge from the wide hidden layer into the output (i.e. into the token stream).

I fail to see the problem in the fact that the hidden activation is not accessible to future tokens. The FF network is just a component to store and inject knowledge. It is wide because it has to store a lot of knowledge, not because the hidden activation has to be wide. The full content of the hidden activation in isolation just isn’t that relevant.

Case in point: nowadays the FF networks actually look different from the ones in GPT-3. They have two hidden projections, with one acting as a gating mechanism: the design has changed to make it possible to actively erase parts of the hidden state!
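A toy numpy sketch of the contrast (made-up sizes and weights, not any particular model’s code; the gated form follows the SwiGLU-style design used in e.g. LLaMA):

```python
# GPT-3-style MLP versus a gated (SwiGLU-style) FFN, in toy numpy.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.normal(size=(d_model,))

def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def silu(z):
    return z / (1 + np.exp(-z))

# GPT-3-style: one wide hidden layer, nonlinearity, project back down.
W_in = rng.normal(size=(d_model, d_ff))
W_out = rng.normal(size=(d_ff, d_model))
ffn_gpt3 = gelu(x @ W_in) @ W_out

# Gated variant: two hidden projections, one multiplicatively gating the other,
# so the gate can push parts of the hidden activation toward zero ("erase" them).
W_gate = rng.normal(size=(d_model, d_ff))
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))
hidden = silu(x @ W_gate) * (x @ W_up)
ffn_gated = hidden @ W_down

print(ffn_gpt3.shape, ffn_gated.shape)  # both (8,): same interface to the residual stream
```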
Also: This seems very different from what you are talking about in the post, it has nothing to do with “the next run”. The hidden layer activations aren’t even “accessible” in the same run! They are purely internal “gears” of a subcomponent.
It also seems to me like you have retreated from
with its intermediate states (“working memory”) completely wiped.
to “intermediate activations of FF components are not accessible in subsequent layers, and because these are wider than the output, not all the information they contain can make it into the output”.
I’ll admit I am not confident about the nitty-gritty details of how LLMs work. My two core points (that LLMs are too wide relative to their depth, and that LLMs are not recurrent and process in fixed layers) don’t hinge on the “working memory” problems LLMs have, but I still think those problems are real, based on my understanding. For LLMs, compute is separate from data, so the neural networks have to be recomputed on each run, with the new token added. Some of their inputs may be cached, but that’s just a performance optimization.
Imagine an LLM is processing some text. At layer n, the feed-forward network has (somehow, and as the first layer to do so) decided the text definitely relates to hostility, and maybe relates to politics, but it isn’t really sure, so let’s say the part about politics doesn’t really make it into the output for that layer, because there’s more important information to encode (it thinks). Then in the next run, the token “Trump” is added to the input. At layer n, the feed-forward network has to decide from scratch that this token is related to politics. Nothing about the previous “this seems kinda political, not sure” decision is stored in the LLM, even though it was in fact computed. In an alternative architecture, maybe the “brain area” associated with politics would already be slightly active; then the token “Trump” comes in, and now it’s even more active.
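Here is a toy illustration of that hypothetical (made-up numbers, not a real architecture): a persistent “politics” activation that carries over between tokens, versus the transformer case, where an equivalent running tally only exists if each layer explicitly wrote “maybe political” into the token stream.

```python
# Toy contrast: a recurrent-style "brain area" that stays slightly active across
# tokens, versus per-token recomputation. Evidence values are invented.
tokens = ["the", "senator", "said", "Trump"]
political_evidence = {"senator": 0.4, "Trump": 0.6}   # made-up per-token evidence

# Persistent-state version: the "politics" activation decays but never resets.
state = 0.0
for tok in tokens:
    state = 0.8 * state + political_evidence.get(tok, 0.0)   # decay + new evidence
    print(f"{tok:8s} persistent politics activation = {state:.2f}")

# In the transformer case, the hidden activation that weighed this evidence at
# layer n is not kept; later tokens only see whatever was written into the
# layer's d_model-sized output.
```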
It’s all there for layer n+1’s attention to process, though. At each new token position added to the end, we get to use the most recent token as the marginal new computation result produced by the previous token position’s forward pass. For a token position t, for each layer n, layer n cannot read the output of layer n at an earlier token i < t, but layer n+1 can read everything that happened anywhere in the past, and that gathering process is used to refine the meaning of the current token into a new vector. So you can’t have hidden state build up in the same way, and each token position runs a partially-shared algorithm. But you can have unhidden state build up, and that unhidden state gets you full Turing completeness.
(The “brain area” equivalent would be a “feature subspace”, AFAIK, which is actually a slightly more general concept that also covers cases where a human brain lights up in ways that aren’t regionally localized.)
Does this not mean the following, though?

In layer n, the feed-forward network for token position t will potentially waste time doing things already done in layer n during tokens i < t.
This puts a constraint on the ability of different layers to represent different levels of abstraction, because now both layer n and n+1 need to be able to detect whether something “seems political”, not just layer n.
This means the network needs to be deeper when we have more tokens, because token t needs to wait until layer n+1 to see whether token t-1 had the feature “seems political”, and token t+1 needs to wait until layer n+2 to see whether token t had the feature, and so on. (I try to illustrate this counting in the sketch below.)
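A back-of-the-envelope sketch of that counting (hypothetical numbers; it assumes the feature is only ever passed along from the immediately previous token’s stream, rather than recomputed locally or read from an older position, which attention could also do):

```python
# Under the relay assumption above: a feature first computed at layer n for token
# t0 arrives one layer later for each subsequent token.
def first_layer_with_feature(t, t0=0, n=5):
    """Earliest layer at which token t's stream contains the feature."""
    if t < t0:
        return None          # the feature doesn't exist for earlier tokens
    return n + (t - t0)      # one extra layer of lag per subsequent token

for t in range(4):
    print(t, first_layer_with_feature(t))
# 0 5, 1 6, 2 7, 3 8: a depth cost that grows with the number of tokens,
# unless the feature is recomputed locally or read directly from an older position.
```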
“...feed-forward networks for the new token don’t have access to the past feed-forward states of the other tokens...”
This isn’t correct. The attention mechanism can move information from the neural network outputs at previous times to the current time, that is then fed into the feedforward network for the current time. The basic transformer mechanism is to alternate cross-time attention computations with within-current-time neural network computations, over many layers. Without access to information from past times, performance would obviously be atrocious.
In a sense, the KV cache that retains this information from past times is “just” an optimization, because the computations are (in theory, not always in practice) deterministic, so one could just redo them again for every previous token when predicting the next token (assuming the previously-generated tokens are retained). But that doesn’t seem enough to support your argument.
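To illustrate the “just an optimization” point, here is a toy numpy sketch (single attention head, made-up weights, not any production implementation): reusing cached keys and values gives the same result as recomputing everything from scratch, because the computation is deterministic.

```python
# KV cache as an optimization: incremental attention with cached K/V matches a
# full recompute at the last position.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Attention output for one query over keys/values of all positions so far."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

x = rng.normal(size=(5, d))          # embeddings for 5 tokens

# (a) Full recompute at the last position: rebuild K and V from all tokens.
K_full, V_full = x @ W_k, x @ W_v
out_full = attend(x[-1] @ W_q, K_full, V_full)

# (b) Incremental: keys/values for the first 4 tokens come from a cache;
#     only the new token's key/value is computed now.
K_cache, V_cache = x[:-1] @ W_k, x[:-1] @ W_v          # the "KV cache"
K_inc = np.vstack([K_cache, (x[-1] @ W_k)[None]])
V_inc = np.vstack([V_cache, (x[-1] @ W_v)[None]])
out_inc = attend(x[-1] @ W_q, K_inc, V_inc)

print(np.allclose(out_full, out_inc))  # True: same numbers, just less recomputation
```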
Of course, it’s quite possible that the models don’t attend very well to the past states, and so suffer to some extent from the issues you mention, but it’s not a fundamental property of the architecture.
The attention mechanism can move information from the neural network outputs at previous times to the current time
Again, I could be misunderstanding, but it seems like only outputs of the neural networks are being stored and made available here, not the entire neural network state.
This was the purpose of my cancer-curing hypothetical. Any conclusions made by the feed-forward network that don’t make it into the output are lost. And the output is narrower than the widest part of the feed-forward network, so some information is “lost”/unavailable to subsequent tokens.
Models not attending very well to past states could be an additional factor worth considering, but I’m not sure if that is or isn’t true.
OK, I think I more clearly see what you’re saying. The hidden unit values in a feedforward block of the transformer at a previous time aren’t directly available at the current time—only the inputs of that feedforward block can be seen. But the hidden unit values are deterministic functions of the inputs, so no information is lost. If these feedforward blocks were very deep, with many layers of hidden units, then keeping those hidden unit values directly available at later times might be important. But actually these feedforward blocks are not deep (even though the full network with many such blocks is deep), so it may not be a big issue—the computations can be redundantly replicated if it helps.
I’m not really talking about true information loss, more about computation getting repeated that doesn’t need to be.
And yes, the feedforward blocks can be just 1 or 2 layers deep, so I am open to this being either a small or a big issue, depending on the exact architecture.