Neuralese is Actually Probably Good for Alignment
The best language models are still getting smarter and more capable. To an increasing degree, this is because they are trained by Reinforcement Learning with Verifiable Rewards (RLVR). Chain of thought reasoning allows models to evade the finite depth restriction on information flow by passing (relatively little) information back into the first layers of the model through the token stream. Although pretraining was already enough to produce decently good chains of thought by pure imitation, RLVR allows further optimization of the chain of thought by rewarding those chains of thought that result in a verifiably correct answer. Because tasks that are far too difficult for humans can still be possible to grade, RLVR provides a way of bootstrapping to capabilities far beyond the human level.
Exact and Inexact Grading
What kinds of things can be turned into verifiable rewards? Most obviously, coding tasks where there are unit tests (& maybe performance tests) to determine if a solution was correct. Or in a similar vein, creating mathematical proofs in a formal proof language like Lean or Agda. We can use almost any RL environment that is easy enough for the model to interact with. For example, the model plays a text adventure game where there are rewards for getting certain items or reaching certain rooms.
We’ll call this class of problem “exactly graded”, because the reward is possible to evaluate without error. Note that the problem statement given to the model as context need not be exact at all. We can ask the model to write its code with only an informal problem description to guide it, but still grade its solutions by running unit tests.
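As a minimal sketch of what "exactly graded" can look like in practice, the snippet below scores a candidate solution by running it against unit tests. The names `exact_reward`, `entry_point`, and the toy `add` task are hypothetical and chosen only for illustration; a real pipeline would also sandbox the untrusted code.

```python
from typing import Sequence, Tuple


def exact_reward(candidate_source: str,
                 entry_point: str,
                 tests: Sequence[Tuple[tuple, object]]) -> float:
    """Return 1.0 if the candidate passes every test case, else 0.0."""
    namespace: dict = {}
    try:
        # NOTE: exec on untrusted code is unsafe; this is illustration only.
        exec(candidate_source, namespace)
        fn = namespace[entry_point]
        for args, expected in tests:
            if fn(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime crashes all score 0.
        return 0.0
    return 1.0


# Hypothetical task: "write a function add(a, b) that returns the sum".
TESTS = [((2, 3), 5), ((-1, 1), 0)]
GOOD = "def add(a, b):\n    return a + b\n"
BAD = "def add(a, b):\n    return a - b\n"
```

The problem statement can stay informal; only the grader needs to be exact.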
Another class of problem is where the reward is subjective, or just difficult to compute. An extreme example of this might be asking the model to invent a joke. While this is the kind of question that could benefit from chain of thought reasoning (coming up with a really funny joke can require a lot of thought and iterating on ideas), it’s very difficult to automatically evaluate whether a joke is funny. In addition, the training corpus contains few examples where people verbalize their process for coming up with jokes, so purely relying on pretraining to give us reasoning traces to imitate won’t get us very far. We’ll call this class of problem “inexactly graded”.
One option here is to get humans to provide the grades. Too many rollouts are generated during RL for humans to label every rollout, so we usually train a reward model as an intermediary. Human graders assign rewards to a diverse set of rollouts (mixed with manually-created correct answers, perhaps). Then we train a reward model on this dataset of rewards. The loss function here is simple mean-squared error. Then we do our actual RL optimization, where the reward model assigns rewards to each of the large number of rollouts produced by our model during training.
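To make the reward-model step concrete, here is a toy sketch that fits a linear reward model to human-assigned scores by gradient descent on mean-squared error. Everything here is an assumption for illustration: real reward models are neural networks over the rollout text, and the two-dimensional `FEATURES` are a hypothetical stand-in for an embedding of each rollout.

```python
def predict(w, b, x):
    """Linear reward model: reward = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_reward_model(features, labels, lr=0.1, steps=2000):
    """Fit w and b by full-batch gradient descent on mean-squared error."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = predict(w, b, x) - y          # d(err^2)/d(pred) = 2 * err
            for i in range(dim):
                grad_w[i] += 2.0 * err * x[i] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical rollout features and the human-assigned rewards for them.
FEATURES = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
HUMAN_REWARDS = [0.2, 0.7, 0.5, 1.0]
```

During RL, `predict` would then stand in for the human graders on every rollout, which is exactly where the opportunity to exploit its errors comes from.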
One problem with training a reward model as an intermediary is that it opens the door to the learner finding ways to trick the reward model into assigning a higher reward than it should. This can happen in subtle ways, making it hard for AI researchers to reliably notice and fix these issues. Also, even the smaller amount of human labelling required to train a reward model can still be expensive.
So,
- Exactly graded problems:
  - They are relatively difficult to cheat.
  - However a wide variety of problems, including most alignment-flavoured problems, are not exactly graded.
  - In general training on these can make your model smarter, but will not make it more friendly.
- Inexactly graded problems:
  - Trying to optimize the chain of thought means we need some kind of automated grader.
  - But the subjectivity of grading makes the grader subject to adversarial optimization. Cheater strategies like prompt-injecting the grader to get a higher score are incentivized.
  - Need to manually supply labels, but the supervision they provide is sparse: just a bunch of scalars.
So for inexactly graded problems, figuring out how to train chain-of-thought reasoning is hard. What would be really nice is if we could just make training chain-of-thought as easy as supervised fine-tuning. Instead of obtaining a diverse dataset of various kinds of answers (both good and bad), carefully grading those answers, and then training a reward model, supervised fine-tuning just asks for a dataset of answers that are known to be good. (Yes, this does kind of destroy the appealing [1] “train far beyond human intelligence” part of doing RL. Though this strikes me as less important for inexactly graded problems anyway. And it’s not like RL is a mandatory prerequisite for exceeding human ability; see here.)
The problem with just treating this as a supervised fine tuning problem is that, while we want the model to reason before answering because it improves performance, we can’t differentiate through the chain of thought. The random selection of tokens in the chain of thought, besides destroying a lot of information contained in the activations, is non-differentiable.
Except that if the chain of thought is neuralese, then we actually can differentiate it!
Maybe Neuralese is Safer
Usually Neuralese is considered a bad development on the axis of alignment-related concerns, because raw activation vectors are much less interpretable than an approximately-English reasoning trace. This is true. But there are other important alignment properties besides interpretability.
The nice thing about a pure neuralese reasoning trace is that it is completely differentiable. In standard supervised fine-tuning, we differentiate through our model in order to increase the probability it assigns to each training example. If the model now generates a neuralese reasoning trace, then we can just do the exact same thing, except that we must differentiate through many forward passes of the model, joined into a chain by their neuralese vectors. The googleable term for this is backpropagation through time.
So a nice recipe for learning an inexactly graded problem is:
- The dataset is a bunch of examples of good answers.
- The loss function is the exact same as regular SFT (namely, cross-entropy).
- The only difference is that models do many steps of neuralese reasoning before answering, and we backpropagate through the entire reasoning chain.
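The recipe above can be sketched on a deliberately tiny model: a scalar latent state that is updated for several "reasoning" steps, followed by a binary cross-entropy loss on the answer. This is an illustrative toy, not a proposal for an architecture; the parameter names `w`, `u`, `v` and the tanh update rule are assumptions made for the example. The backward pass walks the latent chain in reverse, which is all that backpropagation through time amounts to.

```python
import math


def forward(params, x, steps):
    """Run `steps` latent updates h <- tanh(w*h + u*x), then predict p(y=1)."""
    w, u, v = params
    hs = [0.0]                                   # h_0
    for _ in range(steps):
        hs.append(math.tanh(w * hs[-1] + u * x))
    p = 1.0 / (1.0 + math.exp(-v * hs[-1]))      # sigmoid of the answer logit
    return hs, p


def loss_fn(params, x, y, steps):
    """Binary cross-entropy of the answer y in {0, 1}, as in ordinary SFT."""
    _, p = forward(params, x, steps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bptt_grad(params, x, y, steps):
    """Gradient of loss_fn w.r.t. (w, u, v), backpropagated through the chain."""
    w, u, v = params
    hs, p = forward(params, x, steps)
    dz = p - y                                   # dLoss/dlogit for sigmoid + BCE
    dv = dz * hs[-1]
    dh = dz * v                                  # gradient flowing into h_T
    dw = du = 0.0
    for t in range(steps - 1, -1, -1):           # walk the chain backwards
        da = dh * (1.0 - hs[t + 1] ** 2)         # through tanh
        dw += da * hs[t]
        du += da * x
        dh = da * w                              # pass gradient on to h_t
    return (dw, du, dv)
```

Note that no reward model and no sampling appear anywhere: the only supervision is the known-good answer `y`, and every latent step receives a dense gradient.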
To be clear, backpropagation through time is not a recent idea at all. It goes all the way back to RNNs. Or see here for a modern version. But because of its ability to train problems with inexact grading, I claim this recipe should be getting more attention as a path to better-aligned models than we currently give it.
What about RL for agents?
Agents not only reason in chains, but also repeatedly use tools and interact with their environment to achieve their goals. Training models to be good agents will surely require RL. If we want to train agents to perform inexactly graded tasks, it doesn’t seem we can avoid the need to train reward models. But I still think it will be helpful if we can simply backpropagate through the parts of these rollouts that are pure reasoning, instead of treating each reasoning token as an RL action. Reduce the effective trajectory length by reducing the number of things that count as actions, basically. Because the latent reasoning trace is not graded and so is hidden from the reward model, I also expect this reduces the surface area for hacking the reward model.
Are there ways to get SFT-like training of reasoning traces besides Neuralese?
Perhaps there is a way to get the best of both worlds: the interpretability of token-based reasoning and the simple example-based training of SFT or neuralese-SFT?
I think there are some things that can be done here, but they strike me as less simple and less likely to work than the neuralese path. I have already made my most important points, so you can skip this section, as it will be fairly unpolished.
Let
During supervised fine tuning, we have access to question-answer pairs
The basic idea here is to train a helper model that tells us which reasoning traces
Consider the following two-stage process:
-
Initialize
. Train a model . (The simplest setup is just to concatenate with and generate forwards from there.) Training data for this process comes from computing rollouts using . -
Initialize
. We’ll now train a model . Given a question-answer pair , we sample many reasoning traces . We optimize with policy gradient, where the reward for a given trace is: -
-
This can be broken into a sum of token-level terms. We can add a KL penalty to prevent ourselves from diverging too quickly from a reasonable distribution here.
Besides being complicated, this also re-introduces an intermediate model with parameters
I asked ChatGPT about the history of this kind of technique and it found the following papers:
-
https://arxiv.org/pdf/2312.02179 The authors use Markov Chain Monte Carlo to sample chains of thought
conditional on instead of training a helper model. They directly train their learner to imitate the sampled in this way instead of doing RL. -
https://arxiv.org/pdf/2601.09260 This is a really smart paper. Here the latent
is a continuous-valued latent vector, rather than a token-stream chain of thought, and we have a decoder . The thinking process here is that is repeatedly updated according to a learned velocity field. The velocity field is optimized to point in the direction that most increases the probability of the actual answer , relative to its probability at the current value of . This is latent reasoning (basically neuralese!), not natural-language reasoning, and so does not achieve the goal of easy interpretability. But on the plus side: It is a much cleaner idea than the token-based scheme described above. -
https://arxiv.org/pdf/2602.14469 Yep. Because we condition on the true answer when generating
to train on, post-hoc justification is incentivized, and is bad. IDK about their proposed solution though.
The method I describe above trains by perturbing on-policy reasoning chains towards those that are more likely to generate the supplied answer, rather than updating
End
Neuralese will overall be a good thing for alignment.
[1] From the perspective of labs with little concern about existential risk from AI.
I think there is some significance to what you’re saying. But there are also several things you say that I think don’t make sense.
Firstly, this isn’t entirely true. A big part of RLVR is that the model can learn to use more thinking to solve harder problems. And this is not differentiable, because the decision to think for “one more step” is inherently discrete, whether that extra step means outputting an extra token or doing an extra recurrent forward pass in an RNN-like model.
Second, token-based CoT is in a sense still differentiable. It’s true that it cannot be backpropagated through, but if you look at e.g. the GRPO objective (I removed the clipping and kl-divergence term and consider one prompt, for clarity):
Now, you basically sample trajectories for a prompt, compute rewards, compute advantages, compute likelihood ratios, put that in here, do derivatives, and you get an unbiased estimate for . (modulo the std normalization in advantages, the length normalization, the mean in advantages. But this is not that important)
But is constructed so that, while , we do have
So if we take the reward to be the probability assigned to your SFT completion, we are optimizing the same objective, and are in a sense differentiating the chain of thought. The main difference is sample efficiency. Which is a big difference to be clear, and makes some techniques feasible that otherwise wouldn’t be. But it’s not a clear-cut categorical difference.
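The underlying point, that a sampled policy gradient estimates the true gradient of expected reward without backpropagating through the sampling step, can be checked on a toy case. The sketch below is not GRPO; it is the plain score-function (REINFORCE) estimator on a hypothetical one-parameter policy that picks action 1 with probability sigmoid(theta), with made-up rewards `r0` and `r1`.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def exact_gradient(theta, r0, r1):
    """d/dtheta of E[R] = p*r1 + (1-p)*r0, where p = sigmoid(theta)."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (r1 - r0)


def score_function_estimate(theta, r0, r1, samples=20000, seed=0):
    """Monte Carlo average of R(a) * d/dtheta log pi(a) over sampled actions."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    total = 0.0
    for _ in range(samples):
        if rng.random() < p:
            total += r1 * (1.0 - p)   # d/dtheta log sigmoid(theta) = 1 - p
        else:
            total += r0 * (-p)        # d/dtheta log (1 - sigmoid(theta)) = -p
    return total / samples
```

The two functions agree up to sampling noise, and that noise is the sample-efficiency gap under discussion: the estimator needs many rollouts to recover what direct differentiation gives in one pass.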
Lastly, and this is somewhat controversial, I think: the worry with all gradient-based alignment approaches is that they give you a way to make the output look the way you want on the training distribution, without giving you any guarantees about the underlying generating process being what you want.
CoT-based alignment techniques give you a limited way around this. The CoT is part of the generating process. Most of the computation is not there, but it’s an important bottleneck.
CoT-based alignment, like OpenAI’s Deliberative Alignment, does a tiny bit of process supervision on the CoT. This intervenes on the way the model produces the output. This is good: it has a better chance of creating non-scheming agents.
It of course risks creating obfuscated or deceptive chain of thought. You’re simultaneously training for generating processes that look good and that are good. But that is better than neuralese where you have no insight into the generating process to burn in the first place.
Some papers do fixed length reasoning, which is underpowered for some things and wasteful for others, but it is an option. Also, just doing RL to figure out the length of the reasoning seems way better than trying to do it for every single token of the reasoning process itself.
Anyway, yeah, you can “differentiate” just about anything with policy gradient, and the only issue is sparseness of supervision. So if you want your reward to be , you can just do policy gradient with that reward function. My understanding of how RLVR works is that you do on the order of thousands of rollouts for each problem. Does this result in overfitting to the set of particular answers that happened to end up in our dataset? Maybe not if you put all rollouts for a given into the same batch? This does seem to be another really solid option, but I share your concerns about sample efficiency.
“Firstly, this isn’t entirely true. A big part of RLVR is that the model can learn to use more thinking to solve harder problems. And this is not differentiable, because the decision to think for “one more step” is inherently discrete, whether that extra step means outputting an extra token or doing an extra recurrent forward pass in an RNN-like model.”
I don’t see why this is inherently undifferentiable. Just let the model choose a probability p of continuing for a fixed extra number of tokens, then your function is:
. You might get shrinking gradient problems, but it’s not undifferentiable.
You are right. I meant that you can’t backprop through it even with an RNN.
You can differentiate it. That’s what you do with LLMs and RL similar to the policy gradient I described above.
One issue with this approach is that it also inherits all of the vanishing/exploding gradient pathologies of BPTT which plagued RNNs.
There are things you can do about this. The most obvious one is to normalize the gradients at each stage. Also, LSTMs were created to help with this issue.
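One common form of this fix is clipping by norm, which rescales a gradient vector only when it exceeds a chosen threshold. The helper below is a hypothetical stdlib sketch (the name `clip_by_norm` and the threshold are assumptions); in BPTT it could be applied to the backpropagated signal at each step of the chain, or once to the final parameter gradient.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)             # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]  # same direction, norm == max_norm
```

Clipping handles exploding gradients but does nothing for vanishing ones, which is the problem gated architectures such as LSTMs were designed to ease.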
If these fixes are not enough to make neuralese work, you can take this post as claiming that neuralese would be good if it worked. :)
Trying to optimize the CoT is the very thing which safety-concerned AI labs do NOT want to do. Suppose that a perfect training environment has no ways for the reward system to see the CoT, only the output. Then cheating strategies like using the CoT to prompt-inject the grader would become useless, causing any hacks to be reflected IN THE OUTPUT.
Additionally, I don’t think that our lack of ability to “differentiate through the CoT” is relevant for alignment. It is relevant for the model’s ability to figure out what went wrong and to avoid repeating the mistakes.
Any time a lab does RLVR with chain of thought reasoning, they are optimizing the chain of thought, because some chains are more likely to result in a correct answer than others. But obviously letting your reward model see the chain of thought is bad, and is an unforced error in the case of RLVR.
As for the other point, the alternative to differentiating the chain is RL. The point is to avoid RL, and in particular reward models. Are you asking why avoiding RL / reward models is good, or something else?
How are we to tell apart RL from differentiating the CoT? I thought that both processes are teaching the model to have an output such that it is graded well by some verification process (Lean verification? Having a reward model estimate the output’s correctness? Or letting an LLM read through it and grade it?)
The post leans on a premise it never states: that training the same imitation objective via BPTT through a neuralese chain yields a more aligned model than reaching that same objective some other way.
The post establishes that neuralese chains can be optimized with cross-entropy + backprop where token chains can’t be differentiated through. But “now trainable by gradients” and “now more aligned” are different claims, and the post slides from one to the other without argument.
It doesn’t go through for free, because BPTT isn’t the only way to drive that objective down. You can target the same loss with policy gradient, using reward = (log-)likelihood of the answer. Or you can use evolution strategies (ES), which optimize a Gaussian-smoothed version of the loss by perturbing parameters: that is a different gradient, not even an estimate of the SFT one, and it still decreases the same loss. In general, any procedure that empirically pushes the model toward predicting the answer is a candidate. Being a faithful estimator of ∇log P(answer) isn’t required; it’s just one option.
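As a small illustration of the ES route, the sketch below estimates the gradient of a Gaussian-smoothed loss using only loss evaluations at perturbed parameters, with no backpropagation at all. It is a one-parameter toy using antithetic sampling; `es_gradient`, `sigma`, and the quadratic loss in the usage note are assumptions for the example.

```python
import random


def es_gradient(loss, theta, sigma=0.1, samples=5000, seed=0):
    """Antithetic ES estimate of d/dtheta E_eps[loss(theta + sigma*eps)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        eps = rng.gauss(0.0, 1.0)
        # Evaluate the loss at a mirrored pair of perturbed parameters.
        diff = loss(theta + sigma * eps) - loss(theta - sigma * eps)
        total += diff / (2.0 * sigma) * eps
    return total / samples


def es_descent(loss, theta, lr=0.1, iters=30):
    """Plain gradient descent driven only by ES estimates."""
    for i in range(iters):
        theta -= lr * es_gradient(loss, theta, seed=i)
    return theta
```

For example, `es_descent(lambda t: t * t, 1.5)` drives the parameter toward the same minimum that exact gradient descent would reach, by a route that never computes a derivative of the loss.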
So these methods are objective-equivalent: same loss, same fixed point, differing only in how they get there. Which means the alignment ranking the post wants (“the BPTT route is safer”) can’t come from the objective; it has to come from a claim about the optimizer’s inductive bias, namely that following direct gradients installs a more benign solution than sampling-based search does.
That claim might be true or false. The point isn’t that direct gradients give a worse prior; it’s that the post’s whole argument rests on this unclear claim without ever stating it as a premise.