I didn’t go into as much detail about this in my post as I planned to.
I think relying on chain of thought to cope with the working memory problem isn’t a great solution. The chain of thought is linguistic, and therefore linear and shallow compared to “neuralese”. A “neuralese” chain of thought (non-linguistic information) would be better, but then we’re still relying on an external working memory at every step, which is a problem if the working memory is smaller than the model itself. It’s potentially an issue even if the working memory is huge, because you’d have to make sure each layer in the LLM has access to what it needs from the working memory, etc.
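To make the bottleneck concrete, here’s a toy NumPy sketch (not any real architecture; all the dimensions, weights, and the read/write projections are made up for illustration). It contrasts a linguistic chain of thought, where the state is squeezed through a single discrete token each step, with a “neuralese” one, where a fixed-size vector is carried instead and every layer needs its own read path into that external memory:

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64   # width of the model's internal activations (made up)
VOCAB = 100    # token vocabulary size: the linguistic bottleneck (made up)
MEM_DIM = 16   # width of a hypothetical "neuralese" scratchpad vector
N_LAYERS = 4
N_STEPS = 3

# Random projections standing in for trained weights (illustration only).
layer_weights = [rng.normal(0, D_MODEL ** -0.5, (D_MODEL, D_MODEL))
                 for _ in range(N_LAYERS)]
unembed = rng.normal(0, D_MODEL ** -0.5, (D_MODEL, VOCAB))      # state -> token logits
embed = rng.normal(0, 1, (VOCAB, D_MODEL))                       # token -> state
write_mem = rng.normal(0, D_MODEL ** -0.5, (D_MODEL, MEM_DIM))   # state -> neuralese memory
read_mem = rng.normal(0, MEM_DIM ** -0.5, (MEM_DIM, D_MODEL))    # memory -> per-layer input

def forward(x, memory=None):
    """One forward pass; if `memory` is given, every layer reads from it."""
    for W in layer_weights:
        x = np.tanh(x @ W)
        if memory is not None:
            # Each layer needs its own read path into the external memory.
            x = x + np.tanh(memory @ read_mem)
    return x

x = rng.normal(size=D_MODEL)

# 1) Linguistic chain of thought: the state is collapsed to one discrete token
#    per step, so almost all of the D_MODEL-dim internal state is discarded.
state = x
for _ in range(N_STEPS):
    logits = forward(state) @ unembed
    token = int(np.argmax(logits))   # ~log2(VOCAB) bits survive the step
    state = embed[token]

# 2) "Neuralese" chain of thought: a MEM_DIM-dim vector is carried instead of a
#    token, but it is still an external bottleneck that every layer must read.
state, memory = x, np.zeros(MEM_DIM)
for _ in range(N_STEPS):
    h = forward(state, memory)
    memory = np.tanh(h @ write_mem)  # only MEM_DIM floats survive the step
    state = h

print(f"per step, linguistic CoT keeps ~{np.log2(VOCAB):.1f} bits (one token)")
print(f"per step, neuralese CoT keeps {MEM_DIM} floats (vs {D_MODEL}-dim internal state)")
```

The point of the toy is just that in both loops, whatever doesn’t fit through the per-step channel (one token, or a MEM_DIM-dim vector) is lost, and in the neuralese case the memory also has to be wired into every layer.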