LLMs, on the other hand, are feed-forward networks. Once an LLM decides on a path, it’s committed. It can’t go back to the previous layer. We run the entire model once to generate a token. Then, when it outputs a token, that token is locked in, and the whole model runs again to generate the subsequent token, with its intermediate states (“working memory”) completely wiped. This is not a good architecture for deep thinking.
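To make the loop I'm describing concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: `toy_forward` and `generate` are hypothetical names, the "layers" are just arithmetic, and it deliberately ignores details of real transformers such as attention over earlier positions. It only shows the shape of the process: one full pass per token, intermediate state discarded, output appended and never revised.

```python
def toy_forward(tokens, num_layers=4, vocab_size=10):
    """Hypothetical stand-in for one full forward pass over the context."""
    # "Working memory": intermediate state that exists only inside this call.
    hidden = sum(tokens) % vocab_size
    for layer in range(num_layers):
        # Strictly feed-forward: each layer only sees the previous layer's
        # output and can never go back to revisit an earlier layer.
        hidden = (hidden * 3 + layer + 1) % vocab_size
    # Only the chosen token leaves the function; `hidden` is thrown away.
    return hidden


def generate(prompt, steps):
    """Toy autoregressive loop: one full model run per generated token."""
    tokens = list(prompt)
    for _ in range(steps):
        next_token = toy_forward(tokens)  # the whole model runs again
        tokens.append(next_token)         # the token is locked in
    return tokens
```

In this sketch the only thing that carries over from one step to the next is the token list itself, which is why any "working memory" would have to live in the emitted tokens.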
LLMs might develop different cognitive strategies to cope with this, such as storing their working memory in the chain-of-thought (CoT) tokens, so that the ephemeral intermediate steps aren’t load-bearing. The effect would be that the LLM+CoT system acts as… whatever part of our brain explores ideas.
I didn’t go into as much detail about this in my post as I planned to.
I think relying on chain of thought to cope with the working memory problem isn’t a great solution. The chain of thought is linguistic, and therefore linear/shallow compared to “neuralese”. A “neuralese” chain of thought (non-linguistic information) would be better, but then we’re still relying on an external working memory at every step, which is a problem if the working memory is smaller than the model itself. It’s potentially an issue even if the working memory is huge, because you’d have to make sure each layer in the LLM has access to what it needs from the working memory.