They mostly hinge on the parallelism and inference/RL vs pretraining point. The arguments that carry the most weight for me now are:
(against) Natural language being kind of reasoning-complete, and LLMs sticking to it even as they get smarter (related to the fact that humans reason in inner monologues)
Basically your sweet spot argument
I find this unlikely, but not entirely implausible.
(for) More serial depth; recurrence also plausibly has better continual learning properties
(for) Fully recurrent architectures have O(1) per-token generation cost and O(1) memory during inference, whereas attention is O(n) in both (see the sketch after this list)
(for) Fully recurrent models do better credit assignment during training
(against) Training recurrent architectures just seems hard on a technical level.
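To make the big-O point concrete, here's a minimal sketch (numpy, with made-up dimensions, not tied to any particular architecture): a recurrent generation step only touches a fixed-size hidden state, while an attention step has to read a KV cache that grows with every generated token.

```python
import numpy as np

d, n_so_far = 1024, 4096  # hypothetical state size and number of tokens generated so far
rng = np.random.default_rng(0)

def rnn_step(h, x, W_h, W_x):
    # O(d^2) compute and O(d) memory per token, independent of sequence length
    return np.tanh(W_h @ h + W_x @ x)

def attention_step(q, k_cache, v_cache):
    # O(n * d) compute and memory per token: the cache holds every previous key/value
    scores = k_cache @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

W_h, W_x = 0.01 * rng.normal(size=(d, d)), 0.01 * rng.normal(size=(d, d))
h = rnn_step(np.zeros(d), rng.normal(size=d), W_h, W_x)   # same cost at token 10 or token 10^6

k_cache = rng.normal(size=(n_so_far, d))                  # grows by one row per generated token
v_cache = rng.normal(size=(n_so_far, d))
out = attention_step(rng.normal(size=d), k_cache, v_cache)
```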
CoT having nice safety/business properties seems unlikely to matter to me. Current labs just seem unlikely to hold back capability gains for such things. Maybe I’m too cynical.
I also didn’t understand the SGD inductive biases point. If you’re doing something like GRPO, you can train the model to output tokens that lead to fruitful reasoning down the line. Could you explain this? I agree tokens are a less expressive medium than activations, but this seems like the same argument as the information bottleneck one. I could also see it going the other way: it’s hard to represent rigid logical operations in activation space, and being forced to move through token space provides a useful inductive prior for “reasoning” because of its discretisation.
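For what it’s worth, here’s a toy sketch of how I picture the GRPO case (all numbers are random stand-ins, and the surrogate ignores clipping/KL terms): every token in a sampled CoT shares the group-relative advantage of its chain, so early tokens that tend to show up in high-reward chains do get reinforced, just with the credit smeared uniformly over the whole chain.

```python
import numpy as np

# Toy GRPO-style update signal: sample a group of CoTs for one prompt,
# score each finished chain, and give every token in a chain the same
# group-relative advantage. Rewards and log-probs are random stand-ins.
rng = np.random.default_rng(0)
group_size, max_len = 8, 64

rewards = rng.random(group_size)                      # e.g. 1 if the final answer is correct
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

per_token_logprobs = [rng.normal(-2.0, 0.5, size=max_len) for _ in range(group_size)]

# Simplified surrogate objective: a token early in the chain that tends to appear
# in high-reward chains gets a positive gradient through this sum, which is how
# "tokens that lead to fruitful reasoning down the line" get trained in, even
# though the credit assignment within each chain is completely uniform.
surrogate = sum(adv * lp.sum() for adv, lp in zip(advantages, per_token_logprobs))
print(surrogate)
```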
I think CoT may be very valuable in helping capabilities researchers understand their models’ shortcomings and design training curricula that address those deficiencies. This sort of falls under “business value” itself. I currently think this is one of the stronger arguments for why we won’t switch to neuralese.
(The post only vaguely gestures at this argument, so your interpretation is also fair, but I think business value and capabilities gains are pretty closely related rather than being opposing factors.)
Thanks for bringing this up, I hadn’t seen your shortform post! We’ve converged on pretty similar pros/cons. A few comments:
It isn’t obvious to me that this kind of big-O analysis matters that much in practice. RNNs will still need the ability to recall any information from the entire generation process, so I don’t expect O(1) inference to be a free lunch—e.g., as a trade-off, models that can handle extremely long sequences might require huge hidden state vectors that are uselessly large during the first half of the generation process, introducing a different inefficiency that transformers don’t have. Additionally, transformers could be enhanced with something like infini-attention that bounds the max size of the attention matrix while still requiring the model to reason about any fact retrieved from the memory bank in its CoT.
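A back-of-the-envelope version of that trade-off, with purely hypothetical model sizes: the KV cache scales linearly with context length, while a recurrent state is a fixed cost that has to be provisioned for the worst case up front.

```python
# All sizes hypothetical: a large transformer without MQA/GQA, fp16 cache.
layers, heads, head_dim, bytes_per = 64, 64, 128, 2

def kv_cache_bytes(n_tokens):
    # keys + values, per layer, per head, per token
    return 2 * layers * heads * head_dim * bytes_per * n_tokens

def recurrent_state_bytes(state_dim):
    # one fixed-size state vector per layer, paid from the first token onward
    return layers * state_dim * bytes_per

print(kv_cache_bytes(1_000_000) / 2**40)         # ~1.9 TiB at 1M tokens, and it grows linearly with n
print(recurrent_state_bytes(1_000_000) / 2**30)  # a 1M-dim state: ~0.12 GiB, whether you're at token 10 or 10^6
```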
The credit assignment point is an important one that I didn’t consider in the post. Wang et al. (2025) provide some evidence that there might be simple heuristics that enable much better credit assignment for transformers as well, but the updates for a recurrent architecture would still be more precise.
The inductive biases argument is the most handwavy one in the post; maybe I should have left it out. I do think there’s a distinction to be made between wastefulness (which the information bottleneck argument is mainly about) and ease of learning (which the inductive biases point is about). There’s an intuitive sense that directly optimizing the weights to implement good algorithms should be easier than optimizing them to output tokens that eventually combine to implement a good algorithm, conditional on there being efficient optimization methods for both. However, your point about discretization possibly being a good inductive prior is reasonable, and I don’t have strong empirical evidence for resolving this question either way.
It isn’t obvious to me that this kind of big-O analysis matters that much in practice. RNNs will still need the ability to recall any information from the entire generation process.
I agree it is unclear, but I would still expect it to matter. Mainly because:
It’s very easy for models to remember stuff in general
Like, the residual stream of a transformer is empirically highly redundant (rough sketch of what I mean below)
Does it actually need to remember all the stuff? Humans don’t remember all the stuff.
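(Rough sketch of what I mean by “redundant”: stack residual-stream activations and check how many principal components you need to explain most of the variance; empirically it’s far fewer than d_model. Synthetic low-rank data below, just to show the computation.)

```python
import numpy as np

# Stand-in for residual-stream activations: an (n_tokens, d_model) matrix that is
# secretly low-rank plus noise, mimicking the observation that most of the
# variance lives in a small subspace of the model dimension.
rng = np.random.default_rng(0)
n_tokens, d_model, true_rank = 2048, 1024, 50
acts = rng.normal(size=(n_tokens, true_rank)) @ rng.normal(size=(true_rank, d_model))
acts += 0.01 * rng.normal(size=(n_tokens, d_model))

acts -= acts.mean(axis=0)
singular_values = np.linalg.svd(acts, compute_uv=False)
variance = singular_values**2 / (singular_values**2).sum()
dims_for_99pct = int(np.searchsorted(np.cumsum(variance), 0.99)) + 1
print(f"{dims_for_99pct} of {d_model} dimensions explain 99% of the variance")
```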
I asked this exact question a while ago and wrote up some of my thoughts. https://www.lesswrong.com/posts/PJaq4CDQ5d5QtjNRy/williawa-s-shortform#YmyQqQqdei9C7pXR3 but I didn’t get any answers unfortunately.