13 Arguments About a Transition to Neuralese AIs
Over the past year, I have talked to several people about whether they expect frontier AI companies to transition away from the current paradigm of transformer LLMs toward models that reason in neuralese within the next few years. This post summarizes 13 common arguments I’ve heard, six in favor and seven against a transition to neuralese AIs. The following table provides a summary:
| Arguments for a transition to neuralese | Arguments against a transition to neuralese |
| --- | --- |
| A lot of information gets lost in text bottlenecks. | Natural language reasoning might be a strong local optimum that takes a lot of training effort to escape. |
| The relative importance of post-training compared to pre-training is increasing. | Recurrent LLMs suffer from a parallelism trade-off that makes their training less efficient. |
| There’s an active subfield researching recurrent LLMs. | There’s significant business value in being able to read a model’s CoTs. |
| Human analogy: natural language might not play that big of a role in human thinking. | Human analogy: even if natural language isn’t humans’ primary medium of thought, we still rely on it a lot. |
| SGD inductive biases might favor directly learning good sequential reasoning algorithms in the weight space. | Though significant effort has been spent on getting neuralese models to work, we still have none that work at scale. |
| In the limit of capabilities, neuralese is strictly more powerful than CoT reasoning. | There’s a depth-latency trade-off that might make it less appealing to train large recurrent LLMs. |
| | There’s a lot of safety value in being able to read the model’s CoTs. |
What do I mean by neuralese?
Following Hidden Reasoning in LLMs: A Taxonomy, I’ll define neuralese as reasoning in long opaque chains of recurrence. For a precise operationalization of this, I like Fabien Roger’s definition and his corresponding prediction market. He defines the current, CoT-based reasoning paradigm as the conjunction of the following two properties:
Text bottlenecks: there is no path of more than 100k “heavy” serial operations during the generation of an answer where information doesn’t go through a human-understandable categorical format (e.g. tokens). The categorical outputs don’t necessarily have to be faithful, but they do have to be human-understandable. “Heavy” serial operations are e.g. matrix multiplications and convolutions.
Long thoughts: The AI generates its answer using at least 1M serial operations.
The ability to perform 100k heavy serial operations without text bottlenecks already seems like a strong deviation from the current paradigm, where reasoning models perform up to 500 serial operations in a forward pass.[1] Thus, in this post, I’ll constrain the number of allowed serial operations to 10k instead.
Arguments in favor of a transition to neuralese AIs
1) A lot of information is lost in a text bottleneck
The transformer architecture puts a heavy limitation on the amount of information from the residual stream that can reach the next forward pass. The residual stream has to first be compressed into a distribution over tokens, and further information is lost when sampling a single token from that distribution. AI 2027 quantifies the total information loss as follows: “Suppose that an LLM has a vocab size of ~100,000, then each token contains $\log_2(100{,}000) \approx 16.6$ bits of information, around the size of a single floating point number (assuming training in FP16). Meanwhile, residual streams [...] contain thousands of floating point numbers.”
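To make the scale of that gap concrete, here is a minimal back-of-the-envelope sketch in Python. The vocabulary size of 100,000 comes from the quote above; the residual stream width of 8,192 and the FP16 precision are illustrative assumptions rather than figures for any particular model, and the naive bit count of the residual stream overstates its usable information content.

```python
import math

# All numbers are illustrative assumptions, not taken from any specific model.
vocab_size = 100_000   # size of the token vocabulary (from the quote above)
d_model = 8_192        # hypothetical residual stream width
bits_per_float = 16    # FP16

bits_per_token = math.log2(vocab_size)                # ~16.6 bits per sampled token
bits_per_residual_stream = d_model * bits_per_float   # naive upper bound: 131,072 bits

print(f"bits per sampled token:        {bits_per_token:.1f}")
print(f"naive bits in residual stream: {bits_per_residual_stream:,}")
print(f"ratio:                         ~{bits_per_residual_stream / bits_per_token:,.0f}x")
```

Even after discounting heavily for redundancy in the residual stream, the ratio suggests that a sampled token transmits only a small fraction of the information available at the end of a forward pass.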
One might counter that the discrete token space simply carves the continuous vector space into bins that we care about, rendering the information that is thrown away superfluous. Furthermore, hard bottlenecks might encourage specialization: not having to deal with all of the information accumulated at the end of the residual stream at the previous token position allows early layers to specialize in syntactic processing and feature extraction. Nevertheless, paraphrasing Eric Drexler’s report on quasilinguistic neural representations, it would be strange to find that the optimal language for machines capable of utilizing expressive vector embeddings consists of tokens that represent mouth noises, especially given the magnitude of the information loss.
2) The increasing importance of post-training
In training runs where most of the compute is allocated to pre-training, neuralese architectures are heavily disadvantaged, as I’ll explain in depth below. Over the past year, the amount of compute companies spend on post-training has increased by orders of magnitude: until recently, labs spent ~100x more on pre-training than on post-training, while the Grok 4 announcement seemed to imply that Grok 4 required an amount of post-training compute equivalent to its pre-training compute. Recurrent models don’t face similar disadvantages in post-training, where training signals are usually provided at the sequence rather than token level. Thus, if post-training compute continues to grow, spending additional resources on pre-training recurrent LLMs or on converting a transformer into a recurrent model after pretraining might become more palatable.
3) Active research on recurrent LLMs
A double-digit number of papers has been published over the past year exploring ways to either convert transformers into a recurrent architecture or to introduce entirely new recurrent architectures. I have reviewed Geiping et al. (2025) and Hao et al. (2024) in the past, but there are others; see section 2 of Zhu et al. (2025) for a very recent literature review. These papers offer solutions to some of the age-old issues with neuralese architectures. Geiping et al. partially evade stability and efficiency issues with long chains of backpropagation by using truncated backprop, which backpropagates through only the last k iterations of the recurrent unit, regardless of the total number of recurrent iterations. Hao et al. circumvent the problem of retaining highly parallelizable pre-training with a recurrent architecture by converting a conventional transformer to a recurrent model through fine-tuning. The benchmark scores don’t look that good in either of those papers and consequently, they haven’t been scaled up to large models, but they still constitute some evidence that the efficiency issues that have long plagued RNN-like architectures might be solvable.
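To illustrate the truncated-backprop idea, here is a minimal PyTorch sketch. It is a toy in the spirit of Geiping et al.’s approach rather than a reproduction of it: `RecurrentBlock`, `n_iters`, and `k_backprop` are illustrative names and sizes chosen for this example, not theirs.

```python
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """One recurrent iteration: refine a latent state given the input embedding."""
    def __init__(self, d_model: int):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.GELU())

    def forward(self, state: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        return self.layer(torch.cat([state, x], dim=-1))

def recur_with_truncated_backprop(block, x, n_iters=32, k_backprop=4):
    state = torch.zeros_like(x)
    for i in range(n_iters):
        if i == n_iters - k_backprop:
            # Cut the graph: gradients only flow through the last k_backprop
            # iterations, bounding memory and gradient-path length regardless
            # of how many recurrent iterations are run.
            state = state.detach()
        state = block(state, x)
    return state

block = RecurrentBlock(d_model=64)
x = torch.randn(8, 64)
out = recur_with_truncated_backprop(block, x)
out.sum().backward()  # backprop only touches the last k_backprop iterations
```

The `detach()` call is what keeps memory usage and the gradient path bounded by `k_backprop`, no matter how many recurrent iterations the model performs at inference time.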
4) Analogy with human thinking
Though human inner monologue may be considered evidence that thinking in natural language is an efficient strategy for solving a wide range of problems, there are three reasons to think that inner monologue might not be that load-bearing.
First, there are people who claim to be able to switch their inner monologue on and off and to perform most of their productive work without thinking in words. Consider the following quote by Albert Einstein:[2]
The words or the language, as they are written or spoken, do not seem to play any role in my mechanism of thought. The psychical entities which seem to serve as elements in thought are certain signs and more or less clear images which can be “voluntarily” reproduced and combined. There is, of course, a certain connection between those elements and relevant logical concepts. It is also clear that the desire to arrive finally at logically connected concepts is the emotional basis of this rather vague play with the above-mentioned elements. But taken from a psychological viewpoint, this combinatory play seems to be the essential feature in productive thought—before there is any connection with logical construction in words or other kinds of signs which can be communicated to others.
Second, there appear to be people who possess no inner monologue at all: see this X thread by Katja Grace and this reddit thread for example discussions. Finally, there is some evidence that impairments in inner speech do not necessarily influence general cognitive abilities: Langland-Hassan et al. (2015) find a lack of correlation between inner speech impairment and executive function or attention.[3]
Human analogies should not be taken too seriously when forecasting future developments in LLMs, but they still provide weak suggestive evidence. The fact that some humans seem to think wordlessly for long periods of time suggests to me that language isn’t the main medium of human thought: it would be more surprising to find two groups of humans using completely different mental mechanisms to achieve remarkably similar intelligent feats than to find that verbal thought isn’t particularly load-bearing for either group, but emerges as a byproduct of load-bearing thought in one group and not the other.
5) SGD inductive biases
Intuitively, one would expect the inductive biases of SGD to favor directly learning good sequential reasoning algorithms in the weights over optimizing the weights to output tokens that function as serial reasoning steps. There is much more room for finding the optimal way to combine a large number of complex reasoning steps in weight-space than in token-space, assuming sufficient expressive capacity for representing all of these steps in a single forward pass.
6) The limit of capabilities
In the limit, neuralese is strictly more powerful than CoT reasoning. A model that can perform adaptive hidden computation can do everything a transformer model can do with a CoT, but additionally use long chains of recurrence whenever doing so is more efficient.
Arguments against a transition to neuralese AIs
1) The natural language sweet spot
As James Chua has argued, natural language seems to be a fairly strong local optimum for models to think in. Both internet text data and LLM-generated synthetic data are written exclusively in natural language. In all of these corpora, sequential reasoning steps are verbalized in natural language, providing a stronger training signal for models that also verbalize intermediate reasoning steps. Labs will most likely want to keep leveraging all of this existing data for pre-training. There is no continuous analog of this corpus that could be used to pre-train recurrent models to already possess efficient continuous representations at the beginning of the post-training stage.
Although recurrent models have no such disadvantage in post-training, it might be costly to retrain models that reason in natural language to utilize continuous representations for intermediate reasoning steps, as Hao et al. (2024) attempted. Even if models that think in natural language are slightly less efficient at inference, the cost of retraining them might outweigh the difference in inference efficiency. Though linguistic drift from natural language toward alien dialects due to RL optimization pressures would remain a concern in this scenario, such drift lies outside of our definition of neuralese.
A final reason to think that natural language reasoning might be a strong local optimum is that any AI will have to produce natural language outputs (essays, emails, thousands of lines of code) in order to be useful. If a model doesn’t natively think in concepts that are close to natural language, it will incur a cost in translating its internal abstractions into human language; if it does think in concepts close to natural language, the gain from using these concepts in latent thoughts instead of in an explicit CoT is likely to be less dramatic. Whether SGD will eventually converge on concepts far from human ones depends on how optimized human abstractions are for intelligent thought.
2) The parallelism trade-off
One may counter the previous point with the argument that recurrent models can simply be pre-trained on all the text data that transformers are currently trained on, thus having all of the capabilities of transformer LLMs but also being able to omit intermediate reasoning steps when doing so is more efficient. However, such training would incur significant efficiency costs, as there is a trade-off between highly parallelizable training and highly expressive forward passes. The efficiency of transformer pre-training stems from the possibility of processing all tokens in a single sequence in parallel, which is impossible when computations at a subsequent token position involve a nonlinear transformation of the hidden state at the previous token position.
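The following toy PyTorch sketch illustrates the contrast; it is not meant to resemble a real language model. The attention pass touches every position in one parallel call, while the recurrent cell has to walk the sequence one position at a time because each step consumes a nonlinear function of the previous hidden state.

```python
import torch
import torch.nn as nn

seq_len, d = 128, 64
x = torch.randn(1, seq_len, d)  # embedded input tokens

# Transformer-style: every position is processed in one parallel pass, because
# position t only needs the *inputs* at positions <= t, not a previously
# computed hidden state.
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
parallel_out, _ = attn(x, x, x, attn_mask=causal_mask)

# Recurrent-style: position t needs a nonlinear function of the hidden state at
# position t-1, so the sequence dimension must be processed serially.
cell = nn.GRUCell(d, d)
state = torch.zeros(1, d)
serial_out = []
for t in range(seq_len):          # this loop cannot be parallelized across t
    state = cell(x[:, t, :], state)
    serial_out.append(state)
serial_out = torch.stack(serial_out, dim=1)
```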
So far, highly parallelizable training has been a more important factor than highly expressive forward passes. There are ways of circumventing the trade-off to some extent: e.g., it’s possible to use truncated backprop or to only train on short chains of recurrence during pre-training and hope that this generalizes to long chains of recurrence after some fine-tuning. However, as mentioned above, neither of these approaches has worked at scale so far.
3) Business value of visible reasoning traces
All other things being equal, both labs and customers value the ability to read externalized reasoning chains. For both labs and users, a visible reasoning trace is useful for debugging the model: it helps with optimizing the prompt and diagnosing why the model isn’t following instructions in the intended way. For labs, visible CoTs are also useful for building more robust misuse detectors and refusal mechanisms: e.g., Anthropic’s constitutional classifiers rely on the ability to monitor the model’s output stream, while OpenAI’s deliberative alignment relies on the ability to train models on distilled reasoning traces.
4) Analogy with human thinking
Despite some humans’ apparent ability to think without an inner monologue, most people can’t help but think in words, and many also find it useful to write their thoughts down on paper. If someone were forced to play a social deception game against opponents capable of reading their inner monologue, they would probably lose much more often than otherwise. This reliance on language might just be an accidental byproduct of load-bearing thought, resulting from the general importance of language in social communication, as suggested in the previous section; but it might also serve deeper functions: for example, one might argue that discrete linguistic units are more reliable and amenable to error correction than continuous thoughts. If an RNN had to learn such bottlenecks in order to perform well, there would be less to be gained from swapping transformers for neuralese models.
5) Evidence from past attempts to build recurrent LLMs
The fact that a transition to neuralese hasn’t happened so far is itself evidence that getting neuralese architectures to work is difficult. The ML community has known for 30 years that recurrence and depth make gradients prone to vanishing or exploding (Bengio et al., 1994), suggesting that if easy solutions to RNNs’ stability and efficiency issues were available, they would have been discovered by now. Additionally, the failures of approaches that introduce recurrence through minimal tweaks to the transformer architecture indicate that if a transition to neuralese models is possible, it will require an architecture that differs drastically from current LLMs. Such an architecture is likely to initially suffer from inefficiencies that have already been solved for transformer LLMs: as one example, the optimal agentic coding scaffold for a neuralese model might be very different from the ones used for transformer LLMs. In that case, a large performance delta that outweighs these inefficiencies may be required for labs to turn their attention to the new architecture.
6) The depth-latency trade-off
One of the bottlenecks to AI scaling discussed in Epoch’s Can AI scaling continue through 2030? is the latency wall, and that bottleneck has implications for the neuralese debate. The minimum time for a model to process a single data point increases with the depth of the model, which sets an upper bound on the training FLOPs achievable within a given timeframe. Since labs want to make use of the latest hardware and algorithmic improvements, the duration of a training run rarely exceeds six months. More expressive forward passes don’t change the number of FLOPs that can be performed within those six months, but they do reduce the number of data points that can be processed in that time, as the length of a training run is the product of the time required to process a single batch and the number of training batches. Batch size can be increased to sidestep this issue, but only up to a critical batch size above which further increases yield dramatically diminishing returns. These constraints effectively create another trade-off, this time between expressive forward passes and dataset size (or, equivalently, between expressive forward passes and training run length). It’s possible that expressive forward passes will win out by being more data-efficient, but it isn’t obvious that they will.
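A toy calculation, using entirely hypothetical numbers, shows the shape of the trade-off and mirrors the simplified reasoning above: holding the run length fixed, multiplying the serial depth of a forward pass divides the number of batches (and hence tokens) that can be processed.

```python
# Entirely hypothetical numbers; the point is the ratio, not the absolute values.
serial_ops_per_forward_pass = 500   # roughly a current frontier transformer (see footnote 1)
latency_per_serial_op = 5e-6        # seconds per "heavy" serial op, made up for illustration
run_length = 180 * 24 * 3600        # a ~6-month training run, in seconds

def batches_processed(depth_multiplier: float) -> float:
    # Time per batch grows with serial depth; with run length fixed, the number
    # of batches (and hence tokens seen) shrinks proportionally.
    time_per_batch = serial_ops_per_forward_pass * depth_multiplier * latency_per_serial_op
    return run_length / time_per_batch

print(f"1x depth (~500 serial ops):  {batches_processed(1):,.0f} batches")
print(f"20x depth (~10k serial ops): {batches_processed(20):,.0f} batches")
```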
7) Safety value of visible reasoning traces
It seems unlikely that labs would be willing to pay the safety tax associated with keeping visible reasoning chains around for long if neuralese models turn out to be much more capable. Nevertheless, the Korbak et al. (2025) joint statement by labs offers at least some hope that monitorability considerations will be taken into account in frontier model development for the foreseeable future.
Conclusion
Even after laying out all of these arguments, I’m not much closer to having a confident opinion on whether I expect neuralese architectures to eventually win out. Compared to the two most relevant prediction markets, Fabien Roger’s Will early transformative AIs primarily use text? (currently at 47%) and Leo Gao’s Will Transformer based architectures still be SOTA for language modelling by 2026? (currently at 91%), I’m slightly more optimistic that we’ll stick with transformers. I hope this post will spur further discussion on the topic and help clarify the considerations I’ve described. Suggestions for arguments that I’ve overlooked are also very welcome.
Work produced at Aether. Thanks to Rohan Subramani and Shubhorup Biswas for feedback on a draft of this post.
- ^
For example, Qwen3-235B-A22B, which to my knowledge is the deepest open-source transformer released this year, has 94 layers, with each layer involving five serial operations—three in the attention and two in the MLP block. Thus, the total number of serial operations in a Qwen3-235B forward pass is 470. It seems plausible that some proprietary models have slightly deeper forward passes, but likely not much deeper.
- ^
See Appendix II of Jacques Hadamard’s An Essay on the Psychology of Invention in the Mathematical Field for the source. The book extensively discusses the role of non-verbal thinking in mathematical research; Henrik Karlsson’s recent post When is it better to think without words? provides a summary.
- ^
Though note that this study has at least two significant limitations: first, impairments in inner monologue are assessed only through a test for silent rhyming ability, and second, the sample size is only 11 patients.
Even Anthropic is building frontier AIs (since Opus 3 or Sonnet 3.5; this was a bit of a surprise to some at the time). Thus if the hypothetical breakthrough of meaningfully better capabilities from neuralese happens (in whatever form), all AI companies will start making use of it as soon as the immediate behavioral downsides get mitigated to the usual level. Any other kind of downside won’t be a reason not to go there with frontier AIs.
Also, continual learning is analogous to neuralese, an inscrutable way of preserving/propagating information with long chains of reasoning, different from text-only notes. In both cases, you can build a textual “transcript” of a process of thinking, but it’s not necessarily faithful, and doesn’t screen off earlier thinking from later thinking.
Yep, I agree on this. However, it still seems worthwhile to think about whether and when we expect neuralese to unlock meaningfully better capabilities in the first place.
Hm, I haven’t heard this argument before. I agree that there’s a sense in which it’s analogous to neuralese, but it isn’t obvious to me that it’s analogous in a concerning way. The main reason we care about preserving legible CoTs is that it enables us to catch dangerous kinds of reasoning, e.g. reasoning about subverting safety measures. If a model doesn’t have enough serial depth to perform the relevant reasoning in a single forward pass, it would need to verbalize this reasoning in its CoT at some point regardless of whether the relevant facts were learned during pre-training or through continual learning. An exception would be coming across data that lays out a perfect strategy for subverting the most recent safety measures, but hopefully such strategies won’t be written up, at least not in places where a misaligned AI could encounter them.
Even with text-only reasoning traces, there is inscrutable thinking in the activations that can span the whole reasoning trace (with global attention), it just has to remain shallow, as its depth is only given by layers, and its depth can only be extended by looping around through the text-only reasoning trace. There will be some steganography (RL makes sure of this), even if it’s not substantial, that increases the effective depth of inscrutable thinking beyond a single pass through the layers. But it stops with the reasoning trace, so this affordance is still bounded.
Continual learning greatly extends the thinking horizon, likely using inscrutable weight updates, thus in particular extending the feasible depth of inscrutable thinking that can avoid showing up in the text-only reasoning traces.
Good summary of some pros and cons. Neutral lists of arguments both for and against something that don’t try to come to a conclusion are rationally virtuous, so kudos.
An additional point contra neuralese (it’s a mix of 1, 3, and 7, but you didn’t mention it): I think CoT might give you a capability lever as well as an alignment lever.
Specifically, a CoT (or a stream of neuralese) could be legible to the model itself, or not. This may enable a process like the following.
Suppose the process by which you arrived at some particular answer is something you can survey, understand, and remark upon; you might ask yourself “how could I have done this reasoning faster?” if you got the answer correct, or “hmm, what was the root cause of the mistake I made here?” if you did not. An LLM can obviously do this with a legible CoT. It’s unclear or dubious whether an AI could do this with neuralese.
Granted that you can survey, understand, or remark upon a CoT, this opens up a chance to improve capabilities by (2.1) having the LLM edit the CoT, inserting reflections like “Ah, I made a mistake” or “Let’s enumerate some different possibilities...” (think of all the kinds of things @raemon likes to do), (2.2) running the CoT forward with the inserted reflections as prefill, and (2.3) training on the altered CoT iff the model improves more. There are lots of things you could do in (2.3).
In theory this should permit much richer and more reflective alterations to the CoT than “dumb” policy gradient updates.
The above is a rather specific proposal, and so has many burdensome details that are likely to be wrong, but I think something like this is not entirely unlikely.
From another angle: one of the links you have above has Nostalgebraist talking about how, from the user’s perspective, a legible CoT is a capability enhancement:
But from my angle, this is also a capability bonus—because it allows the model to look at itself, etc?
I’m not sure about this of course, but I thought it worth adding.
Thanks, this is a good point and definitely worth adding to the list!
Assuming that this is useful for capabilities, I’d imagine that neuralese models would also be able to eventually develop mechanisms to reflect on their own reasoning processes. The kind of introspection that Anthropic recently studied might be seen as one precursor capability for this and as evidence that this is incentivized by the training process, even if the model also has the ability to reflect on the CoT. But of course, the ability to reflect on internal activations in a deep way might be much harder to develop than the ability to do this with a legible CoT.
This is interesting; you make good points on both sides, probably by virtue of gathering them broadly. Nice work!
This is about whether we will move to neuraliese on the current default path. I think there’s an additional important factor: We must not. If the default path takes us there, we must put our shoulders against it and change that path.
I think it’s hard to argue that faithful CoT doesn’t really help the odds of aligning LLM-based AGI. And it’s really hard to argue that’s not important enough to be worth some real capability loss. I realize that commercial and national defense pressures are strong; but we should not fail to mention this fact in any discussion of losing natural language CoT. If we do, we’re substantially more likely to all die. It would be a horrible choice and a horrible mistake. Preventing research on improvements without CoT should be a very high priority.
The arguments for risk:
Every plan I know of for aligning LLM-based AGI benefits immensely if we still have somewhat-faithful CoT. And that’s critical, because all of those plans suck! That’s including my own. Taking an unreliable plan and making it a lot worse is really, really bad for our odds of survival.
Of course that’s a big claim. I tried to give full credit to those who think alignment is a lot easier in Cruxes of disagreement on alignment difficulty. But even with the most generous steelmanning I can’t get any reasonable misalignment estimate down below maybe 10% risk of disaster. Remove CoT and that goes way up, in that case maybe 20%.
Estimates based on my most recent and careful thinking are more like a 40-60% chance of outright alignment failure (with a lot more chance that If we solve alignment, we die anyway). That would go way up to 60-80% if we lose CoT.
On the actual question of the efficiencies of losing CoT:
I don’t think there are large gains to be made by dropping language entirely. Language is reason-complete; that’s what it’s for, and we humans spent a long time crafting it to be complete and efficient. Language compresses very nicely, too; thinking in language doesn’t prevent efficiency improvements by collapsing steps.
BUT even if that turns out to be
Continual learning destroys the hope that we get to early AGIs while avoiding the issues with neuralese, because it almost certainly requires inscrutable communication that maintains thinking for a long time (likely in the form of some kind of test time weight updates), and it’s an essential capability. With neuralese itself, there is a significant possibility that it doesn’t get too much better by that time yet, but continual learning does need to happen first (as a capability rather than a particular algorithmic breakthrough), and it more inherently has the same issues.
Yeah, I strongly agree with everything you’ve said here (aside from the question of whether there are large gains to be made by dropping legible CoT—the post describes my uncertainties). It seems likely to me that whether we stick with readable CoTs will ultimately be decided by efficiency considerations, which is why I focused on those in the post, but of course agree that we should do our best in putting our shoulders against it.
I asked this exact question a while ago and wrote up some of my thoughts. https://www.lesswrong.com/posts/PJaq4CDQ5d5QtjNRy/williawa-s-shortform#YmyQqQqdei9C7pXR3 but I didn’t get any answers unfortunately.
They mostly hinge on the parallelism and inference/RL vs pretraining point. The arguments that carry the most weight for me now are:
(against) Natural language being kind of reasoning-complete, and LLMs will stick to it even as they get smarter (related to the fact that humans reason in inner monologues)
Basically your sweet spot argument
I find this unlikely, but not entirely implausible.
(for) More serial depth, recurrence plausibly has better continual learning properties
(for) Fully recurrent architectures are O(1) generation and O(1) memory during inference, contra attention which is O(n) for both
(for) Fully recurrent models do better credit assignment during training
(against) Training recurrent architectures just seems hard on a technical level.
CoT having nice safety/business properties seems unlikely to matter to me. Current labs just seem unlikely to hold back capability gains for such things. Maybe I’m too cynical.
I also didn’t understand the SGD inductive biases point. If you’re doing something like GRPO, you can train the model to output tokens that lead to fruitful reasoning down the line. Could you explain this? I agree tokens are a less expressive medium than activations, but this seems like the same argument as the information bottleneck one. I could also see it going the other way: it’s hard to represent rigid logical operations in activation space, and being forced to move through token space provides a useful inductive prior for “reasoning” because of its discretisation.
I think CoT may be very valuable in helping capabilities researchers understand their models’ shortcomings and design training curricula that address those deficiencies. This sort of falls under “business value” itself. I currently think this is one of the stronger arguments for why we won’t switch to neuralese.
(The post only vaguely gestures at this argument, so your interpretation is also fair, but I think business value and capabilities gains are pretty closely related rather than being opposing factors.)
Thanks for bringing this up, I hadn’t seen your shortform post! We’ve converged on pretty similar pros/cons. A few comments:
It isn’t obvious to me that this kind of big-O analysis matters that much in practice. RNNs will still need the ability to recall any information from the entire generation process, so I don’t expect O(1) inference to be a free lunch—e.g., as a trade-off, models that can handle extremely long sequences might require huge hidden state vectors that are uselessly large during the first half of the generation process, introducing a different inefficiency that transformers don’t have. Additionally, transformers could be enhanced with something like infini-attention that bounds the max size of the attention matrix while still requiring the model to reason about any fact retrieved from the memory bank in its CoT.
The credit assignment point is an important one that I didn’t consider in the post. Wang et al. (2025) provide some evidence that there might be simple heuristics that enable much better credit assignment for transformers as well, but the updates for a recurrent architecture would still be more precise.
The inductive biases argument is the most handwavy one in the post; maybe I should have left it out. I do think there’s a distinction to be made between wastefulness (which the information bottleneck argument is mainly about) and ease of learning (which the inductive biases point is about), and there’s an intuitive sense in which directly optimizing the weights to implement good algorithms should be easier than optimizing them to output tokens that eventually combine to implement a good algorithm, conditional on there being efficient optimization methods for both. However, your point about discretization possibly being a good inductive prior is reasonable, and I don’t really have strong empirical evidence for resolving this question either way.
I agree it is unclear, but I would still expect it to. Just because:
- It’s very easy for models to remember stuff in general.
- The residual stream of a transformer is empirically highly redundant.
- Does it actually need to remember all the stuff? Humans don’t remember all the stuff.
Have you seen this market?
When will an AI model using neuralese recurrence be first released to the public? (currently 11 Oct 2027)