It’s unclear to me whether CoT is an active information bottleneck. There are rather easy ways for an LLM to pack more information into its CoT, ways that should be easy for RL to find, yet models aren’t employing them.
I ran some very simple tests on this. The basic result was that CoT is rather low-entropy compared with other LLM outputs (with the caveat that we’re really measuring an upper bound on the true entropy). Interested to hear your thoughts.
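(For concreteness, here’s a minimal sketch of the kind of measurement I mean. It assumes you already have the model’s next-token probability distributions, e.g. from a logprobs API; the distributions below are toy stand-ins, not real model outputs. Note that averaging entropy under the model’s own distribution is why this only gives an upper bound.)

```python
import math

def mean_token_entropy(distributions):
    """Average Shannon entropy (bits/token) over a sequence of
    next-token probability distributions. Using the model's own
    distribution, this upper-bounds the true process entropy."""
    entropies = []
    for probs in distributions:
        h = -sum(p * math.log2(p) for p in probs if p > 0)
        entropies.append(h)
    return sum(entropies) / len(entropies)

# Toy illustration: a peaked, "low-entropy CoT-like" distribution
# versus a flatter one over the same three tokens.
peaked = [[0.9, 0.05, 0.05]] * 4
flat = [[0.34, 0.33, 0.33]] * 4
print(mean_token_entropy(peaked))  # ≈ 0.57 bits/token
print(mean_token_entropy(flat))    # ≈ 1.58 bits/token
```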
I strongly agree; the fact that filler tokens work at all is also suggestive evidence in this direction.
Though it doesn’t exclude the possibility that performance increases monotonically with both CoT length and information density, such that filler tokens work but the optimum is still bandwidth-limited. I think the entropy evidence you provided is comparatively much stronger.
[EDIT: poor punctuation on my part originally: for clarity, filler tokens working doesn’t exclude that possibility, which makes the entropy evidence stronger.]
Broadly, the CoT text I’ve seen (granted, I haven’t seen all that much) doesn’t feel bandwidth-limited. It relies far more on weird, consistent idioms and repetition than I’d expect if bandwidth were a pressing issue; in that case I’d expect RL to find something like a fluidly ultra-polyglot Finnegans Wake, with neologisms made up on the fly and flitting in and out through in-context learning.
(Only semi-related: I trained character-level transformers on the Wake a few months ago for fun, and the loss was consistently far higher than for normal texts, all else being equal.)
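(You can get a rough feel for why even at the character level, without any training: compute the empirical unigram entropy of a text as a crude floor on what a char model must learn to beat. The snippets below are hypothetical stand-ins, not my actual training data, and real loss gaps are of course dominated by longer-range structure rather than character frequencies.)

```python
import math
from collections import Counter

def char_entropy(text):
    """Empirical character-level unigram entropy in bits/char:
    a crude lower-layer proxy for how 'surprising' a text is."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical example texts, for illustration only.
plain = "the river runs past the house and into the bay"
wakeish = "riverrun, past Eve and Adam's, from swerve of shore to bend of bay"
print(char_entropy(plain), char_entropy(wakeish))
```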