Beautiful post, I learned a lot from it.
An Unnecessary Trade-off
It looks like these COCONUT-like AIs are trying to improve efficiency by avoiding a bottleneck. It’s the bottleneck where all information inside a forward pass (e.g. the residual stream) has to be deleted, outputting only a single token.

So in a sense, the AI has to repeatedly recalculate its entire understanding of the context for every single token it generates.

EDIT: See the reply below by Mis-Understandings. The AI doesn’t need to recalculate its entire understanding for every token, thanks to KV caching, so I was wrong about that. However, I still think there is an information bottleneck: the context window can only gain one token of information per forward pass, even though a forward pass computes zillions of bits of information.
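To make that bottleneck concrete, here is a minimal sketch of ordinary greedy decoding, using a toy PyTorch model of my own (not COCONUT or any real architecture): each forward pass computes a large internal state, yet only one token id, roughly log2(vocabulary size) bits, survives into the next step's context.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 64

class ToyLM(nn.Module):
    """Stand-in for a language model; only the information flow matters here."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):                   # token_ids: (batch, seq)
        hidden = self.embed(token_ids).mean(dim=1)  # stand-in for the residual stream
        return self.proj(hidden), hidden            # next-token logits, internal state

model = ToyLM()
context = torch.tensor([[1, 2, 3]])                 # the prompt

for _ in range(5):
    logits, hidden = model(context)
    # `hidden` holds d_model floats of freshly computed information...
    next_token = logits.argmax(dim=-1, keepdim=True)
    # ...but only one token id (~log2(vocab_size) bits) is appended to the
    # context; the rest of `hidden` is discarded before the next forward pass.
    context = torch.cat([context, next_token], dim=1)

print(context)
```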
These hidden chain of thought architectures seem to derive their efficiency from avoiding this bottleneck.
Yet surely, surely, there must be other approaches that avoid this bottleneck without sacrificing interpretability entirely?
An example solution
Off the top of my head:
Inside the AI’s chain of thought, each forward pass can generate many English tokens instead of one, allowing more information to pass through the bottleneck.
After the chain of thought has ended and the AI is giving its final answer, it generates only one English token at a time, to make each token higher quality. The architecture might still generate many tokens in one forward pass, but a simple filter repeatedly deletes everything except the first of them from the context window (a sketch of this filter follows below).
This slower thinking might occur not just for the final answer, but also when the AI is cleaning up its chain of thought with concise summaries (e.g. ClaudePlaysPokemon produces long chains of thought when simply entering and leaving a room, filling up its context window with useless information, so it was given a separate memory file where it can write and delete the more important information).
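Here is a rough sketch of how that filter might be wired up. It assumes a hypothetical propose_tokens function that returns several tokens per forward pass and a "</think>" marker that ends the chain of thought; none of these names come from an existing API, and context is just a list of token strings.

```python
def generate(propose_tokens, context, cot_end_marker="</think>", max_steps=1000):
    """Hypothetical decoding loop: fast multi-token steps during the chain of
    thought, only one kept token per step once the final answer begins."""
    in_final_answer = False
    for _ in range(max_steps):
        proposed = propose_tokens(context)   # tokens from a single forward pass
        if not proposed:
            break
        if in_final_answer:
            proposed = proposed[:1]          # the "simple filter": keep only the first token
        context = context + proposed
        if cot_end_marker in proposed:
            in_final_answer = True           # switch to slower, one-token-at-a-time decoding
    return context
```

Each kept answer token then has a full forward pass of context behind it before the next one is chosen, while the chain of thought still moves many tokens per pass.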
Admittedly, even this method would reduce interpretability by making the chain of thought longer. That trade-off seems harder to avoid. But the trade-off where the AI has to learn some non-English alien language seems unnecessary.
“…all information inside a forward pass (e.g. the residual stream) has to be deleted, outputting only a single token.”

This does not actually happen.
What does happen is that the new token is now at the root of the attention structure, and can pass information from the final layers to the first layers when inferring the next token.
The residuals are translation independent, and are cached for further inference in autoregressive mode.
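For anyone who wants the caching spelled out, here is a minimal single-head attention sketch in plain PyTorch (my own toy code, not any particular model's implementation): keys and values for earlier positions are computed once and cached, and only the newest token's query runs against them.

```python
import torch

d = 8                                     # toy model width
torch.manual_seed(0)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

k_cache, v_cache = [], []                 # grows by one entry per generated token

def attend_to_new_token(x_new):
    """x_new: (1, d) residual-stream input for the newest position only."""
    k_cache.append(x_new @ W_k)           # cache this position's key...
    v_cache.append(x_new @ W_v)           # ...and value; older entries are never recomputed
    K = torch.cat(k_cache, dim=0)         # (seq, d)
    V = torch.cat(v_cache, dim=0)
    q = x_new @ W_q                       # only the new token needs a query
    attn = torch.softmax(q @ K.T / d**0.5, dim=-1)
    return attn @ V                       # (1, d) attention output for the new position

for _ in range(4):                        # four "new tokens" arriving one at a time
    out = attend_to_new_token(torch.randn(1, d))
print(out.shape)                          # torch.Size([1, 8])
```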
Thank you. Just earlier I was asking an AI whether my comment was reasonable and it told me something similar.