I’ve heard many say that “neuralese” is superior to CoT and will inevitably supplant it. The usual justification is that the bandwidth of neuralese is going to be higher, which will make it better. But (1) bandwidth might not be better in this case; it isn’t in all cases, and (2) even if it is, there are other factors that could theoretically operate against it.
Has anyone cleanly made the case for why neuralese is better or asymptotically technically inevitable, at length / clearly?
What would be the competing hypothesis? Legible English can’t be compute-optimal, and it already starts to actively degrade in current models absent countermeasures. My understanding is that even things like Cache2Cache already provide a benefit over exchanging legible English text: https://arxiv.org/abs/2510.03215
Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency.
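Note that an illegible CoT (Thinkish) is different from reasoning in latent space (Neuralese).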
Oh I agree; I was trying to figure out why CoT would be assumed superior to neuralese, and one position could be something like “the human prior makes it easier to reason in CoT than in latent space.” I’ll admit I’m reaching here though; I’d like to understand the steelman for why CoT would be superior to reasoning in latent space.
The counterargument against continuous tokens being passed forward is that if you want to use neuralese, you have to give up sampling, since the big idea of latent reasoning is to not pass through the random discretization of sampling a token. But random discretization is itself powerful, especially with the possibility of a useful bias. If you give it up, the model becomes deterministic, so it can’t use Best-of-N. If Best-of-N or tree search over chains of thought is really important, either in training or in deployment, that is not really compatible with the latent paradigm, in addition to the difficulty of obtaining training data.
The argument against semantic drift/Thinkish is extremely weak, and we should expect semantic drift when training with self play without countermeasures.
Yeah, at first glance it looks like the vectors act as some kind of autoencoder between different text models, rather than being used as an intermediate state to assist thinking within a single text model? Or something; the application list is underwhelming
As a general LLM communication paradigm, C2C can be expanded to various fields. Some potential scenarios include: (1) Privacy-aware cloud–edge collaboration: a cloud-scale model can transmit curated KV-Cache segments to an edge model to boost capability without emitting raw text, reducing bandwidth and limiting content exposure. (2) Integration with current inference acceleration method: use C2C to enhance speculative decoding and enable token-level routing across heterogeneous models for lower latency and cost. (3) Multimodal integration: align and fuse caches among language reasoning LLMs, vision–language models (VLMs), and vision–language–action (VLA) policies so that linguistic and visual context can drive more accurate actions.
Why does the application list matter? I still feel like I don’t understand the position of “maybe it’s not more efficient for the model to do reasoning within a several thousand dimensional vector as opposed to human legible english.” My understanding of the arguments for neuralese is that, because reasoning in the vector is in fact more efficient, there is an eventually growing performance incentive to do it.
bandwidth might not be better in this case; it isn’t in all cases
A several-thousand-dimensional vector can contain so much more information than an integer between 1 and ~200K. The implementation is likely painful, but I can’t see a world where the optimal bandwidth, given a good implementation of both, is lower.
The transformer already has thousands of dimensions available through attention, no? How much does removing the tokenization buy you in addition? I agree it buys you some but seems unclear how much.
A lot, because the only thing that is recurrent is the text/vector CoT. The residual stream is very rich, but the number of sequential steps of computation is bounded by the number of layers, without being able to send intermediate information back to the beginning via some recurrence.
But there are systems that work better with lower bandwidth or have deliberately lower bandwidth, like autoencoders.
I understand that the bandwidth is certainly higher for one than the other, but this might not be an advantage in this circumstance, or it could be an advantage in some respects but a greater disadvantage in others.
The point of an autoencoder is to form good representations, not to perform well. I’m struggling to think of any other examples where low bandwidth is good, that aren’t just implementation issues (and, again, in current systems text CoT > neuralese, so obviously low bandwidth can be good)
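See the discussion here: How AI Is Learning to Think in Secret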
I appreciate the reference, although I found this article + discussion pretty underwhelming; it’s part of what’s motivating my question.
For instance, not all forms of unintelligibility in CoTs are necessarily evidence of a drive-to-compression. But the article takes for granted that the weirdness we see in chains of thought is evidence of this; it views various forms of weird text that I’d see as evidence of screwed-up training systems, or as spandrels of the training process, and just assumes they are “thinking” driven into non-human-legible vocabulary. The guy didn’t particularly consider other hypotheses for what he was seeing.
And similarly he discusses “redundancy” in human languages, and immediately assumes machines would want it to go away, while not… thinking of why it’s there, and whether it would stick around for machines potentially.
This isn’t anything like a full refutation of him, tbc; I’m just giving my impression of it at a high level. But my takeaway is that if this is the best discussion, then I don’t think anyone’s actually tried to work out the reasoning around this carefully, even if neuralese is actually inevitable.
I don’t have watertight arguments, but to try and state it cleanly:
During inference, a forwards pass of the neural net is computed repeatedly as each token is generated. Activation vectors propagate from one layer to the next.
Activation vectors are the main flow of information from earlier layers to later layers.
The attention mechanism also allows activation vectors from previous tokens to influence the current computation. But crucially, this communication happens between activations at the same attention layer; it doesn’t skip forwards or backwards in terms of layers.
Thus, the only flow of information from later layers to earlier layers is contained in the sequence of tokens produced by the model.
This is silly. Layer 1 for the 2nd token happens after layer 100 for the 1st token. There’s no reason why we shouldn’t be able to give layer 1 for the 2nd token as much information as it wants about any of the 1st token layers.
Advantages of using activations for communication:
Activations do contain more information of course.
During pre-training, token logits are optimized for being high probability, which constrains them a fair bit.
Activations are also continuous, so can encode continuous values and probabilities, along with discrete values. And they can be optimized by gradient descent to be more helpful.
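To make the contrast concrete, here is a minimal toy sketch in Python (hypothetical shapes, random stand-in weights, and a causal average standing in for attention, not any real model): in the first loop the only thing carried from one step to the next is a sampled token id, while in the second the full final-layer vector is fed back as the next input with no sampling step.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, L = 16, 50, 4                              # toy width, vocab size, depth
W = rng.normal(size=(L, D, D)) / np.sqrt(D)      # stand-in layer weights
W_out = rng.normal(size=(D, V)) / np.sqrt(D)     # unembedding
embed = rng.normal(size=(V, D))                  # token embeddings

def forward(inputs):
    """inputs[t] is the layer-0 vector at position t; returns final-layer activations."""
    h = np.stack(inputs)
    for l in range(L):
        # causal mixing: position t sees positions <= t, and only at the same depth
        h = np.tanh((np.cumsum(h, axis=0) / np.arange(1, len(inputs) + 1)[:, None]) @ W[l])
    return h

# 1) Token CoT: the only carry-over between steps is one discrete token id (~log2(V) bits).
tokens = [1]
for _ in range(5):
    logits = forward([embed[t] for t in tokens])[-1] @ W_out
    p = np.exp(logits - logits.max()); p /= p.sum()
    tokens.append(int(rng.choice(V, p=p)))       # stochastic, non-differentiable step

# 2) Neuralese: the carry-over is the full D-dimensional final activation, passed
#    forward with no sampling, so it stays continuous and differentiable end to end.
latents = [embed[1]]
for _ in range(5):
    latents.append(forward(latents)[-1])
print(tokens, len(latents))
```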
Also:
I’m actually not certain that neuralese is technically inevitable. Yes, it’s almost certainly superior given that we assume away the problem of training a neuralese model in the first place (i.e. assume infinite compute budget). But without that assumption…
Basically, the way attention currently works makes it easy to parallelize across tokens during training (and context reading). This is why context reading is cheaper per token than producing text, and why training on such a huge amount of data is possible. Neuralese doesn’t have this property of being fast when the tokens are already supplied, because there is still this activation data that has to be filled in sequentially.
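A hedged sketch of that difference, with toy shapes standing in for a real model: when the tokens are already given, every position's input exists up front and one batched pass covers the whole sequence, whereas latent feedback makes position t's input depend on position t-1's output, forcing a sequential loop even over text that is already in the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 16
W = rng.normal(size=(D, D)) / np.sqrt(D)       # stand-in for the whole network
token_embs = rng.normal(size=(T, D))           # training text: all inputs known up front

# Token CoT / teacher forcing: every position's input is given in the data, so a
# single batched pass over all T positions is possible (what makes pre-training cheap).
parallel_out = np.tanh(token_embs @ W)

# Neuralese: position t's input is position t-1's *output*, which does not exist
# until the previous step finishes, so the pass is inherently sequential.
h = token_embs[0]
sequential_out = [h]
for _ in range(T - 1):
    h = np.tanh(h @ W)
    sequential_out.append(h)
print(parallel_out.shape, len(sequential_out))
```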
So, neuralese models will probably have to be trained on less data, and they will be less efficient at reading context. They are probably about the same efficiency for generating text (at least if the non-neuralese competitor doesn’t get to use speculative generation with a cheaper model).
I guess models that have neuralese “turned off” during pre-training and context reading could still be comparably efficient. But then all the optimization of the neuralese encoding beyond just “use the last layer output” has to happen during RL. Due to its low cost, this is probably how the first usage of neuralese we see in the wild will work.
The other issue, which would only be a problem during training, is that gradients have to backpropagate through the neuralese vectors. This could result in the usual gradient stability issues we see in the training of RNNs that occur because the neural net effectively becomes incredibly deep. I think the field has solutions for this, but it’s another big complication to deal with when you try to scale the models.
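To put a toy number on the “incredibly deep” point (both figures below are hypothetical, not taken from any particular model):

```python
n_layers = 100          # hypothetical depth of the base model
n_latent_steps = 1_000  # hypothetical length of a neuralese chain of thought

# With token CoT, sampling is non-differentiable, so the longest differentiable
# path is roughly one forward pass (~n_layers). With latent feedback, a single
# gradient path runs through every layer of every step:
effective_depth = n_layers * n_latent_steps
print(effective_depth)  # 100,000 layer applications on one backprop path
```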
Anyway, I think it’s probably going to happen eventually, especially if the “smaller, higher-quality training dataset” trend persists, but it might take longer than people think.
See also Karpathy’s claim that models will be split into a part that focuses on reasoning but has relatively little memorized and a part that focuses on memorization. Karpathy’s assumption is that the reasoning part could be quite small. So if that’s true, then probably the reasoning part gets neuralese but the memorization part doesn’t, and the fact that the reasoning part is small makes the extra costs of neuralese more tolerable.
But (1) bandwidth might not be better in this case; it isn’t in all cases
The entropy of LLM generated text is a few bits per token, whereas the hidden state contains 10-100k bits. It’s hard to imagine any method which passes around hidden states[1] to have lower bandwidth than CoT tokens!
[1] Or similarly sized tensors.
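A back-of-the-envelope version of that comparison; the vocabulary size, hidden width, and precision below are hypothetical round numbers rather than any specific model's:

```python
import math

vocab_size = 200_000       # hypothetical, roughly the scale of modern tokenizers
d_model = 4096             # hypothetical hidden width
bits_per_float = 16        # e.g. bf16 activations

max_bits_per_token = math.log2(vocab_size)         # ~17.6 bits, an upper bound;
                                                   # generated text carries only a few
bits_per_hidden_state = d_model * bits_per_float   # 65,536 bits of raw capacity
print(max_bits_per_token, bits_per_hidden_state)
```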
My read was they meant more bandwidth is not necessarily better. Not sure though.
If this is what they meant, maybe their reasoning is something like: language imposes an inductive prior of carrying out your reasoning in discrete logical steps, which can be advantageous over continuous blobs, which the models can do a lot of internally anyway (just with low serial depth).
Idk, I find this argument somewhat convincing, but wouldn’t bet on it. I did a quick experiment computing the entropy (or really an upper bound on the entropy) and found that CoT has fairly low entropy compared with the text LLMs normally generate, which is some evidence for this hypothesis.
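A hedged sketch of one way such an experiment could be run (the model name and the two text snippets are placeholders): it measures the average entropy of the model's next-token distribution over a piece of text, which is one reasonable per-token entropy proxy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"        # placeholder; any causal LM would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def avg_next_token_entropy(text):
    """Average entropy (bits/token) of the model's next-token distribution over `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, :-1]     # prediction at each position
    probs = torch.softmax(logits, dim=-1)
    ent = -(probs * torch.log2(probs.clamp_min(1e-12))).sum(-1)
    return ent.mean().item()

cot_text = "First, note that 17 * 23 = 391. Next, ..."       # placeholder CoT snippet
plain_text = "The weather today was pleasant, and the ..."   # placeholder ordinary text
print(avg_next_token_entropy(cot_text), avg_next_token_entropy(plain_text))
```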
(In agreement): Neuralese is ~equivalent to wrapping your model as a DEQ with the residual stream shifted by one on every pass as far as I can tell, and it’s not obvious to me that this is the relevant One Weird Trick. The neural network already has a way to shuttle around vast amounts of cryptic high-dimensional data: the neural network part of the neural network. It seems much more likely to me that the relevant axis of scaling is something like a byte-latent transformer with larger and larger patches.
Edit: I guess in principle this isn’t that different from neuralese with the input being encode(decode(vector)), the larger point is that if a token is too small a bottleneck for a vector, you can just make the vector correspond to more text.
Another argument is that you can more cleanly backprop through it.
A third argument is that you get constant inference memory and speed as a function of context length, at least if it’s implemented like traditional RNNs.
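As a rough illustration of that last point (all numbers hypothetical): a transformer's KV cache grows linearly with context length, while an RNN-style recurrent state stays a fixed size.

```python
n_layers, d_model, bytes_per_value = 80, 8192, 2   # hypothetical model in bf16
seq_len = 100_000

# A transformer's KV cache grows linearly with context (keys + values per layer):
kv_cache_bytes = 2 * n_layers * d_model * seq_len * bytes_per_value   # ~262 GB here

# An RNN-style latent state is one fixed-size vector regardless of context length:
rnn_state_bytes = d_model * bytes_per_value                           # 16 KB
print(kv_cache_bytes / 1e9, rnn_state_bytes)
```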