if the hypothetical breakthrough of meaningfully better capabilities from neuralese happens (in whatever form), all AI companies will start making use of it
Yep, I agree on this. However, it still seems worthwhile to think about whether and when we expect neuralese to unlock meaningfully better capabilities in the first place.
Also, continual learning is analogous to neuralese
Hm, I haven’t heard this argument before. I agree that there’s a sense in which it’s analogous to neuralese, but it isn’t obvious to me that it’s analogous in a concerning way. The main reason we care about preserving legible CoTs is that it enables us to catch dangerous kinds of reasoning, e.g. reasoning about subverting safety measures. If a model doesn’t have enough serial depth to perform the relevant reasoning in a single forward pass, it would need to verbalize that reasoning in its CoT at some point, regardless of whether the relevant facts were learned during pre-training or through continual learning. An exception would be coming across data that lays out a perfect strategy for subverting the most recent safety measures, but hopefully such strategies won’t be written up, at least not in places where a misaligned AI could encounter them.
Thanks for bringing this up; I hadn’t seen your shortform post! We’ve converged on pretty similar pros/cons. A few comments:
It isn’t obvious to me that this kind of big-O analysis matters that much in practice. RNNs will still need the ability to recall arbitrary information from the entire generation process, so I don’t expect O(1) inference to be a free lunch: as a trade-off, models that can handle extremely long sequences might require huge hidden-state vectors that are uselessly large during the first half of the generation process, introducing a different inefficiency that transformers don’t have. Additionally, transformers could be enhanced with something like infini-attention, which bounds the maximum size of the attention matrix while still requiring the model to reason in its CoT about any fact retrieved from the memory bank.
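To make the asymptotics concrete, here’s a toy numpy sketch (all dimensions, weights, and inputs are made up for illustration) of one decoding step under each architecture: the recurrent step costs the same no matter how many tokens have been generated, but everything the model might later need has to fit into the fixed-size state, while the attention step pays a cost that grows with sequence length in exchange for lossless access to every past token.

```python
import numpy as np

d, n = 1024, 8192   # hypothetical hidden size and number of tokens generated so far

rng = np.random.default_rng(0)
W_h, W_x = rng.standard_normal((d, d)) * 0.01, rng.standard_normal((d, d)) * 0.01
h, x, q = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
K_cache, V_cache = rng.standard_normal((n, d)), rng.standard_normal((n, d))

def rnn_step(h, x):
    """One recurrent decoding step: O(d^2) compute and O(d) memory,
    independent of n, but all past information must fit into h."""
    return np.tanh(W_h @ h + W_x @ x)

def attention_step(q, K_cache, V_cache):
    """One attention decoding step over a KV cache: O(n * d) compute and
    memory, growing with the sequence, but with exact access to every past token."""
    scores = K_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache

new_h = rnn_step(h, x)                           # constant cost per token
attn_out = attention_step(q, K_cache, V_cache)   # cost scales with n
```

The trade-off I’m pointing at is that making the state large enough to recall anything from a very long generation erodes the constant-cost advantage.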
The credit assignment point is an important one that I didn’t consider in the post. Wang et al. (2025) provide some evidence that there might be simple heuristics that enable much better credit assignment for transformers as well, but the updates for a recurrent architecture would still be more precise.
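To gesture at why I’d still expect the recurrent updates to be more precise, here’s a toy sketch (my own made-up illustration, not the heuristics from Wang et al.): with outcome-based RL on a sampled CoT, the surrogate loss scales every token’s log-probability by the same scalar advantage, whereas backprop through a recurrent latent state delivers a distinct gradient at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy REINFORCE-style update for a sampled CoT of 200 tokens: the outcome
# reward is a single scalar, so the surrogate loss scales every token's
# log-prob identically; tokens irrelevant to the outcome get exactly as
# much credit as the tokens that actually mattered.
token_log_probs = rng.standard_normal(200)   # stand-in for log pi(token_t | prefix)
reward, baseline = 1.0, 0.2
advantage = reward - baseline
surrogate_loss = -(advantage * token_log_probs).sum()

# Backprop through a recurrent latent state would instead assign each step
# its own gradient, which is the sense in which those updates are more precise.
```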
The inductive biases argument is the most hand-wavy one in the post; maybe I should have left it out. I do think there’s a distinction to be made between wastefulness (which the information bottleneck argument is mainly about) and ease of learning (which the inductive biases point is about). There’s an intuitive sense in which directly optimizing the weights to implement good algorithms should be easier than optimizing them to output tokens that eventually combine to implement a good algorithm, conditional on efficient optimization methods existing for both. However, your point that discretization might itself be a useful inductive bias is reasonable, and I don’t have strong empirical evidence for resolving this question either way.