My read was they meant more bandwidth is not necessarily better. Not sure though.
If this is what they meant, maybe their reasoning goes something like: language imposes an inductive prior toward carrying out reasoning in discrete logical steps, which can be advantageous over continuous blobs of computation; the models can do plenty of continuous computation anyway, just with low serial depth.
Idk, I find this argument somewhat convincing, but I wouldn't bet on it. I did a quick experiment computing the entropy (really an upper bound on the entropy) and found that CoT has fairly low entropy compared with the text LLMs normally generate, which is some evidence for this hypothesis.
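To be concrete about what I mean by an upper bound: if you average the per-token negative log-probs a model assigns to a sample of text, that cross-entropy upper-bounds the entropy rate of the true text distribution (by Gibbs' inequality). A minimal sketch of that estimate, with made-up placeholder probabilities standing in for the model's actual token scores:

```python
import math

def cross_entropy_per_token(logprobs):
    """Mean negative log-probability in nats/token.

    The cross-entropy of the true text distribution under any
    scoring model upper-bounds the true entropy rate, so this is
    an entropy upper bound, not the entropy itself.
    """
    return -sum(logprobs) / len(logprobs)

# Placeholder log-probs; in a real run these would come from
# scoring the tokens of a CoT sample (vs. ordinary generation)
# with the LLM itself.
cot_logprobs = [math.log(p) for p in [0.9, 0.8, 0.85, 0.9]]
plain_logprobs = [math.log(p) for p in [0.4, 0.3, 0.5, 0.35]]

print(cross_entropy_per_token(cot_logprobs))    # lower bound estimate for CoT
print(cross_entropy_per_token(plain_logprobs))  # higher for ordinary text
```

"Low entropy" in the claim above then just means the CoT estimate comes out well below the estimate for the model's ordinary samples.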