Haha, my idea was just “maybe to solve this information bottleneck problem, we should solve the generate-many-tokens-in-one-pass problem.”
I haven’t really thought of any solution to the generate-many-tokens-in-one-pass problem yet :/
I’ll edit the post to mention this.
An attempt
One stupid attempt to solve the “generate-many-tokens-in-one-pass” problem is to start off with the main LLM outputting 1 token at a time and a small, cheap LLM outputting the next 5 tokens. You then let the small LLM eavesdrop on the residual stream of the main LLM, and use reinforcement learning on both the main LLM and the small LLM.
The hope is that the main LLM eventually learns to use part of its residual stream to tell the small LLM what the next 5 tokens should be, so that the main LLM’s computations directly influence 6 tokens of output.
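To make that concrete, here is a minimal toy sketch of the setup, not the actual method: the class name `SmallDrafter`, the MLP drafter head, the dimensions, and the use of argmax instead of sampling are all made up for illustration, and the RL training of both models isn’t shown.

```python
import torch
import torch.nn as nn

class SmallDrafter(nn.Module):
    """Toy stand-in for the small, cheap LLM. It reads the main LLM's
    residual stream at the last position and predicts the next 5 tokens
    in a single shot."""
    def __init__(self, d_resid: int, vocab_size: int, n_draft: int = 5):
        super().__init__()
        self.n_draft = n_draft
        self.vocab_size = vocab_size
        self.head = nn.Sequential(
            nn.Linear(d_resid, 4 * d_resid),
            nn.GELU(),
            nn.Linear(4 * d_resid, n_draft * vocab_size),
        )

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # resid: (batch, d_resid) -- the residual stream the drafter eavesdrops on
        logits = self.head(resid)  # (batch, n_draft * vocab_size)
        return logits.view(-1, self.n_draft, self.vocab_size)

# Toy usage: one forward pass of the main LLM yields 1 + 5 = 6 output tokens.
d_resid, vocab_size = 512, 32_000
drafter = SmallDrafter(d_resid, vocab_size)

main_resid = torch.randn(1, d_resid)      # stand-in for the main LLM's residual stream
main_logits = torch.randn(1, vocab_size)  # stand-in for the main LLM's own next-token logits

first_token = main_logits.argmax(dim=-1)           # token 1, from the main LLM
draft_tokens = drafter(main_resid).argmax(dim=-1)  # tokens 2-6, from the small LLM
print(first_token.shape, draft_tokens.shape)       # torch.Size([1]) torch.Size([1, 5])
```

If both models were then trained with RL on the combined 6-token output, the main LLM would have an incentive to write information the drafter can use into its residual stream.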
Slow thinking
I guess the phrase “simple filter repeatedly deletes everything except its first token from the context window” was a bit unclear. I’ll rewrite that.
What I wanted to say was: when the AI is talking to you (rather than talking to itself in its chain-of-thought), we want the AI to slow down and do more computation for each token it outputs. In that case we don’t want it keeping many tokens per forward pass: even if it generates several, we only keep the first “high quality” token and delete the rest.
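As a toy sketch of that filter (the `forward_pass` function here is a dummy stand-in for a model that emits several tokens per pass; the names and the 6-token count are just for illustration):

```python
def forward_pass(context):
    # Dummy stand-in for one forward pass of a model that emits several
    # tokens at once; it just returns 6 fake token ids so the loop runs.
    return [len(context) + i for i in range(6)]

def generate(context, n_steps, talking_to_user):
    for _ in range(n_steps):
        tokens = forward_pass(context)
        if talking_to_user:
            # Slow mode: keep only the first "high quality" token per pass,
            # so every output token gets a full forward pass of computation.
            context = context + tokens[:1]
        else:
            # Chain-of-thought mode: keep everything the pass produced.
            context = context + tokens
    return context

print(generate([0], n_steps=3, talking_to_user=True))   # context grows by 1 token per pass
print(generate([0], n_steps=3, talking_to_user=False))  # context grows by 6 tokens per pass
```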
I don’t think this is related to top-k/top-p generation, because those refer to how an LLM samples one token from its distribution. The k is the number of candidate tokens considered, not the number of tokens generated at once.
Thank you so much for reading and for the reply :)