I’ll admit I am not confident about the nitty-gritty details of how LLMs work. My two core points (that LLMs are too wide relative to their depth, and that they are not recurrent but process everything in a fixed stack of layers) don’t hinge on the “working memory” problems LLMs have. But I still think the working-memory problem is real, based on my understanding. For LLMs, compute is separate from data, so the network has to be run again from scratch each time a new token is appended to the input. Some of the intermediate activations may be cached, but that’s just a performance optimization.
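Here is a minimal sketch of what that caching amounts to, assuming a standard decoder-only transformer with a key/value cache (the shapes, names, and numbers are illustrative, not from any particular library): decoding one token at a time with cached keys and values gives exactly the same output as rerunning attention over the whole sequence.

```python
# Toy single-head causal attention, to show that a KV cache is only an
# optimization: decoding one token at a time with cached keys/values gives
# the same output as recomputing over the whole sequence from scratch.
# (Names and shapes are illustrative, not from any real library.)
import numpy as np

rng = np.random.default_rng(0)
d = 8                        # model width
T = 5                        # sequence length
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(T, d))  # token embeddings for the whole sequence

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def full_recompute(x):
    """Run attention over the entire prefix every time (no cache)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((len(x), len(x)), dtype=bool))
    scores = np.where(mask, scores, -np.inf)   # causal mask
    return softmax(scores) @ v

def incremental_decode(x):
    """Process one token at a time, caching earlier keys and values."""
    k_cache, v_cache, outs = [], [], []
    for t in range(len(x)):
        q_t = x[t] @ Wq
        k_cache.append(x[t] @ Wk)
        v_cache.append(x[t] @ Wv)
        k, v = np.stack(k_cache), np.stack(v_cache)
        attn = softmax(q_t @ k.T / np.sqrt(d))
        outs.append(attn @ v)
    return np.stack(outs)

assert np.allclose(full_recompute(x), incremental_decode(x))
```

Either way the outputs match: the cache changes what gets recomputed, not what any layer can see.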
Imagine an LLM is processing some text. At layer n, the feed-forward network has (somehow, and as the first layer to do so) decided the current representation definitely relates to hostility, and maybe relates to politics, but it isn’t really sure, so the politics part doesn’t really make it into that layer’s output, because there’s more important information to encode (it thinks). Then on the next run, the token “Trump” is added to the input. At layer n, the feed-forward network has to decide from scratch that this token is related to politics. Nothing about the previous “this seems kinda political, not sure” decision is stored in the LLM, even though it was in fact computed. In an alternative architecture, maybe the “brain area” associated with politics would already be slightly active; then the token “Trump” comes in, and now it’s even more active.
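To make the contrast concrete, here is a purely hypothetical toy sketch (not any real architecture; the token scores and decay constant are made up): a stateless detector judges each token from scratch, while a recurrent one carries a running “politics” activation between tokens, so earlier weak evidence is still slightly active when “Trump” arrives.

```python
# Hypothetical toy contrast between a stateless per-token detector and a
# recurrent one that carries a "politics activation" between tokens.
# The scores and the decay constant are invented for illustration only.
politics_score = {"the": 0.0, "heated": 0.1, "debate": 0.4, "about": 0.0, "Trump": 0.9}
tokens = ["the", "heated", "debate", "about", "Trump"]

def stateless(tokens):
    # Each token is judged from scratch; earlier weak evidence is discarded.
    return [politics_score[t] for t in tokens]

def recurrent(tokens, decay=0.7):
    # A running activation persists between tokens, so earlier weak evidence
    # ("debate") leaves the politics "area" slightly active when "Trump" arrives.
    state, out = 0.0, []
    for t in tokens:
        state = decay * state + politics_score[t]
        out.append(state)
    return out

print(stateless(tokens))   # [0.0, 0.1, 0.4, 0.0, 0.9]
print(recurrent(tokens))   # activation builds up across tokens
```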
it’s all there for layer n+1’s attention to process, though. at each new token position added to the end, we get to use the most recent token as the marginal new computation result produced by the previous token position’s forward pass. for a token position t and each layer n, layer n cannot read the output of layer n at an earlier token i<t, but layer n+1 can read everything that happened anywhere in the past, and that gathering process is used to refine the meaning of the current token into a new vector. so you can’t have hidden state build up in the same way, and each token position runs a partially-shared algorithm. but you can have unhidden state build up, and that unhidden state gets you full turing completeness.
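a tiny bookkeeping sketch of that read pattern, assuming a standard decoder-only transformer (the function is just illustration, not a model):

```python
# which earlier computations feed directly into attention at (layer, position)
# in a standard decoder-only transformer? attention at layer n reads layer
# n-1's outputs at every position <= t, but nothing at layer n can read layer
# n's own output at an earlier position. pure bookkeeping, no actual model.
def direct_inputs(layer, pos):
    """(layer, position) states whose outputs attention at (layer, pos) can read."""
    if layer == 0:
        return []                           # layer 0 only sees the token embeddings
    return [(layer - 1, i) for i in range(pos + 1)]

print(direct_inputs(5, 3))                  # [(4, 0), (4, 1), (4, 2), (4, 3)]
assert (5, 2) not in direct_inputs(5, 3)    # same-layer state at earlier tokens: never visible
assert (4, 2) in direct_inputs(5, 3)        # but layer n+1 can read it one layer up
```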
(“brain area” equivalent would be “feature subspace” afaik. which is actually a slightly more general concept that also covers when a human brain lights up in ways that aren’t regionally localized)
Does this not mean the following, though?
1. In layer n, the feed-forward network for token position t will potentially waste time redoing work already done at layer n for earlier tokens i<t.
2. This constrains the ability of different layers to represent different levels of abstraction, because now both layer n and layer n+1 need to be able to detect whether something “seems political”, not just layer n.
3. It means the network needs to be deeper when there are more tokens, because token t has to wait until layer n+1 to see whether token t-1 had the feature “seems political”, token t+1 has to wait until layer n+2 to see whether token t had it, and so on (see the sketch after this list).
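Under the chained, token-to-token hand-off described in point 3, here is a small worked sketch of the depth cost (the layer and token numbers are made up for illustration):

```python
# Worked sketch of point 3's depth cost: a feature first computed at layer n
# for token t0, and passed forward one token (and therefore one layer) per hop,
# first becomes available to token t at layer n + (t - t0).
def earliest_layer(n, t0, t):
    """Earliest layer at which token t carries a feature first computed at
    layer n of token t0, under the one-layer-per-token-hop chain above."""
    assert t >= t0
    return n + (t - t0)

n, t0 = 4, 10   # say "seems political" first shows up at layer 4, token position 10
for t in range(t0, t0 + 5):
    print(f"token {t}: feature available from layer {earliest_layer(n, t0, t)}")
# token 10: layer 4 ... token 14: layer 8. carrying the feature across k more
# tokens this way costs k more layers, which is the "deeper when we have more
# tokens" claim.
```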