I strongly agree; the fact that filler tokens work at all is also suggestive evidence in this direction.
Though it doesn’t exclude the possibility that performance increases monotonically with both CoT length and information density, such that filler tokens work but the optimum is bandwidth limited. I think the entropy evidence you provided is much stronger comparatively.
[EDIT: poor punctuation on my part originally: for clarity, filler tokens working doesn’t exclude that possibility, which makes the entropy evidence stronger.]
Broadly, the CoT text I’ve seen (granted, I haven’t seen all that much) doesn’t feel bandwidth limited. It relies far more on weird, consistent idioms and repetition than I’d expect RL to find if bandwidth were a pressing issue. If it were, wouldn’t we expect something more like a fluidly ultra-polyglot Finnegans Wake, with neologisms made up on the fly and flitting in and out through in-context learning?
(Only semi-related: I trained char-level transformers on the Wake a few months ago for fun, and the loss was consistently way higher than for normal texts, all else being equal.)