But (1) more bandwidth might not be better in this case; it isn't in all cases
The entropy of LLM-generated text is a few bits per token, whereas the hidden state contains 10-100k bits. It's hard to imagine any method which passes around hidden states[1] having lower bandwidth than CoT tokens!
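The 10-100k figure is easy to sanity-check with back-of-envelope arithmetic. This sketch uses assumed (illustrative) numbers: a 100k-token vocabulary, a hidden width of 4096, and bf16 activations; none of these come from the comment above.

```python
import math

# Bits per token of sampled text: at most log2(vocab_size), and in
# practice a few bits, since LLM text is fairly predictable.
vocab_size = 100_000                          # assumed tokenizer size
max_bits_per_token = math.log2(vocab_size)    # ~16.6 bits, a hard upper bound
typical_bits_per_token = 3                    # rough empirical figure

# Bits in one hidden state: width * bits per activation.
d_model = 4096                                # assumed hidden width
bits_per_float = 16                           # bf16 activations
hidden_state_bits = d_model * bits_per_float  # 65,536 bits, in the 10-100k range

print(max_bits_per_token, hidden_state_bits)
```

So even against the loose per-token upper bound, a single hidden state carries three to four orders of magnitude more bits than a token.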
My read was they meant more bandwidth is not necessarily better. Not sure though.
If this is what they meant, maybe their reasoning is something like: language imposes an inductive prior toward carrying out your reasoning in discrete logical steps, which can be advantageous over passing around continuous blobs of activations, something the models already do a lot of internally anyway (just with low serial depth).
Idk, I find this argument somewhat convincing, but wouldn't bet on it. I did a quick experiment computing the entropy (or really an upper bound on the entropy), and found that CoT has fairly low entropy compared with the text LLMs normally generate. Which is some evidence for this hypothesis.
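The comment doesn't say how the experiment was run, but one standard way to get such an upper bound is to average the model's own per-token negative log-probabilities over a generation: that cross-entropy upper-bounds the entropy rate in expectation. A minimal sketch, with made-up illustrative probabilities standing in for real model logprobs:

```python
import numpy as np

def entropy_upper_bound_bits(token_logprobs):
    """Average negative log-probability of the sampled tokens under the
    model, converted from nats to bits. This cross-entropy upper-bounds
    the true entropy rate (in bits per token) in expectation."""
    return -np.mean(token_logprobs) / np.log(2)

# Hypothetical per-token probabilities: a near-deterministic CoT trace
# vs. more diffuse free-form text (illustrative numbers, not measured).
cot_logprobs = np.log([0.9, 0.8, 0.95, 0.85])
free_logprobs = np.log([0.3, 0.2, 0.25, 0.4])

print(entropy_upper_bound_bits(cot_logprobs))   # well under 1 bit/token
print(entropy_upper_bound_bits(free_logprobs))  # closer to 2 bits/token
```

With real data you would replace the hard-coded probabilities with the logprobs an API or local model reports for each generated token, separately for CoT traces and ordinary text, and compare the two averages.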
[1] Or similarly sized tensors.