How will we do SFT on models with opaque reasoning?
Current LLMs externalize much of their reasoning in human-interpretable language. This reasoning is sometimes unfaithful, sometimes strange and concerning, and LLMs can do somewhat impressive reasoning without using CoT, but my overall impression is that CoT is currently a reasonably complete and accurate representation of LLM reasoning. However, reasoning in interpretable language might turn out to be uncompetitive; if so, it seems probable that frontier AI labs will adopt opaque reasoning.
If future AI models have opaque reasoning, this will probably change what training we can apply to them. For example, we currently train models to reason in a good way about math problems, or to reason in a desired way about the spec that we hope they’ll follow. If models have opaque reasoning, though, it’s not obvious that we’ll be able to do training that affects model reasoning like this, because we can’t just write the reasoning ourselves and do SFT on the reasoning trace. Prior work on training-based control has also discussed using SFT as a technique for:
Forcing a model to explore (i.e., defeating exploration hacking).
Eliciting a sandbagging model’s performance.
Forcing a model to perform well on some distribution where we have labels.
However, if we couldn’t generate reasoning (because the model’s reasoning is opaque) to go along with the output labels that we want to train on, then these techniques might not work anymore.
Opaque reasoning will only be adopted if there is still some way of efficiently training powerful models with opaque reasoning. So, the main concern that we have is that opaque reasoning will be differentially bad for training methods targeted at achieving control compared to general capabilities.
This post will speculate on what SFT training-based control will look like for models with opaque reasoning, and then comment on how this should adjust our research priorities in training-based control.
I strongly hope that this post will never become useful because future models will not have opaque reasoning. But reality doesn’t care about this hope of mine much. Hence this post.
Thanks to Sebastian Prasanna for discussion about opaque reasoning.
Thanks to Sebastian Prasanna, James Lucassen, and Eric Gan for feedback on this post.
What might opaque reasoning look like?
To understand how effective SFT will be against hypothetical future models with lots of opaque reasoning (i.e., much more than the current forward pass of a transformer), it’s helpful to have a somewhat concrete picture of what these opaque-reasoning models might look like. (Though I think it’d probably be bad to push the frontier of neuralese models by thinking too much about what they might look like; the “probably” is because of overhang concerns.)
Here are some non-exhaustive possibilities:
The model has the same architecture as today, but its CoT drifts into an alien language after large amounts of RL. The strange o3 CoTs might be a precursor to this.
The model is trained as follows: start from a transformer initialization, and then modify it to be actually recurrent (instead of having recurrence via sampling like current transformers). More precisely, you could repeatedly feed the final hidden layer back into the model (along with the rest of the “latent tokens”). This makes the entire computation differentiable, so you can train it by backpropagation through the whole model. This is what Coconut does, except they incrementally phase in the use of latent reasoning.
The model could be an RNN that wasn’t initialized from a language prior.
The model could be a text diffusion model.
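As a concrete toy illustration of the fully-recurrent option above, here is a minimal sketch in plain Python. The single linear map plus tanh is a made-up stand-in for a full transformer stack, and all dimensions and weights are arbitrary; the point is only the shape of the loop, where the final hidden state is fed back in with no discretization between steps.

```python
import math
import random

random.seed(0)

D = 8        # hidden width (toy stand-in for d_model; made up)
N_STEPS = 4  # number of latent reasoning steps

# Toy stand-in for the whole transformer: one linear map plus tanh.
# In a real Coconut-style setup this would be the full network, and the
# loop below would be differentiable end to end.
W = [[random.gauss(0, 1 / math.sqrt(D)) for _ in range(D)] for _ in range(D)]

def step(hidden):
    """One 'forward pass': map the final hidden layer to the next latent token."""
    return [math.tanh(sum(W[i][j] * hidden[j] for j in range(D)))
            for i in range(D)]

def latent_reason(embedded_prompt):
    """Repeatedly feed the final hidden state back in as the next input.

    No sampling or discretization happens between steps, which is exactly
    what makes the trace opaque (and what makes end-to-end backprop
    possible in the differentiable version)."""
    h = embedded_prompt
    trace = []
    for _ in range(N_STEPS):
        h = step(h)
        trace.append(h)
    return h, trace

final, trace = latent_reason([0.1] * D)
```

Because nothing in the loop is sampled, gradients could in principle flow through the entire reasoning trace, which is the property the SFT discussion below relies on.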
I think that opaque reasoning seems most likely to arise during RL, and that initializing models from a pre-training prior seems likely.
What are our options for doing SFT on models with opaque reasoning, and how well will they work?
In this section, I’ll list several techniques for doing SFT on models with opaque reasoning, and discuss how effective I expect them to be. It’s worth noting that a major bottleneck to all SFT-based methods is that getting labels in hard domains will likely be hard, or even impossible. Here’s the list:
| Technique | How effective will this be against models with opaque reasoning? Effectiveness scale: 🔴=really bad, 🟡=ok, but not great, 🟢=promising. |
|---|---|
| Just do SFT ~normally. | 🟡 If the opaque reasoning is still in token space (but the model is using the tokens in an uninterpretable way that has drifted away from human languages), the model probably still understands normal language, and SFT on normal language might work. However, it also seems pretty plausible that the model can resist the wildly off-policy human-language text from transferring to how it thinks when it’s speaking in its alien language. Training hard enough to force the model to speak in a human-interpretable language might come at a large capabilities cost. This might even work if the model uses soft tokens: you could just embed actual tokens into the soft-token space, and the model might understand them. |
| Do SFT with an empty reasoning trace. | 🔴 For models initialized from transformers, you can force the reasoning to be empty and sample from the model anyway, and something will happen. You can clearly do this for transformers, and for transformers with soft tokens; probably an analogous thing is possible for basically any RNN. There’s not much reason to think that SFT on empty reasoning would generalize desirably to the case where the model does use reasoning. This is a large distribution shift, and the model’s performance on it could just be disconnected from whether or not the model is a schemer. If models can goal-guard, then this training shouldn’t fix our problems. If models can’t goal-guard, there’s not much reason to think that this training would result in a model that is doing the thing we want on our distribution of interest, and this particular method of training doesn’t seem special as a method of decorrelating goals. Emil found that generalization from empty reasoning works, but I think this is merely a symptom of existing techniques for creating malign inits being bad at creating models that resist transfer. Overall, I think that training with empty reasoning won’t be very effective at removing bad behavior (conditional on that bad behavior being present in the first place). |
| Generate on-policy opaque reasoning, and then generate an SFT label via some process that is not the model (i.e., that produces highly off-policy data). If the model is fully differentiable, backpropagate through the neuralese to increase the probability of this generation. If the model is recurrent via sampling, you can instead sample several trajectories from the model and then upweight / downweight trajectories based on how much weight they put on the SFT label. | 🟡 The fact that the final text is not causally related to the opaque CoT seems like it should both make this training very sample inefficient, and possibly cause such training to severely degrade the model’s performance. I think this might be a good technique for overcoming sandbagging (although I don’t know if it’d be better than something simpler like fuzzing or steering). I don’t think this will be a great technique for causing the model to perform well on this distribution, or for causing “easy to hard generalization”. The fact that the data is off-policy also means that a malign model might be able to adopt a policy of treating completions of off-policy prefixes as very distinct from completions of on-policy prefixes, and thereby prevent generalization. However, this might be hard, and there are techniques that could potentially mitigate the issue. If the malign model wants to sandbag on some particular distribution, a good way of decreasing the sample efficiency of these techniques might be to output garbage text in its reasoning (h/t Alexa). I think that backpropagation will likely be much more sample efficient than the “RL on SFT objective” technique. |
| Generate an “almost on-policy” SFT label by sampling an on-policy completion from the model and then editing its output to be slightly better (h/t Sebastian for suggesting this) and then backpropagate through the neuralese on this. | 🟡 This seems fairly promising. It might be easier to edit a completion to be better than to generate a good completion. The almost on-policy nature of the data might make this approach more sample efficient and might also handle the issue of the model trying to treat the distribution differently. |
| Train models that encode and decode into reasoning. | 🔴 Even if models reason opaquely, we might still be able to build reasoning encoders/decoders (like activation oracles / PCDs / NLAs). If we have good reasoning encoders, then we could potentially write reasoning traces in English and then encode them, and then train the opaque reasoning model to imitate this reasoning. However, our reasoning encoders will probably be extremely lossy, so this will probably both be very off-policy and cause lots of performance degradation. |
| Just don’t do SFT. | 🟡 SFT is not clearly necessary for training-based control to work out: there are promising alternatives to SFT for many of the things you might want to use SFT to achieve. For instance, we can potentially defeat exploration hacking by creating task distributions where a model must perform well on some tasks to avoid being incriminated, and where these tasks are not cleanly separated from tasks where the model wants to perform poorly. We can study these techniques now, although we’d ideally study them on malign inits where SFT doesn’t trivially resolve the problem. |
What should researchers working on training-based control do now?
I’ll conclude by discussing how the above analysis should change training-based control research.
Some training-based control research should be targeted at non-reasoning models. There are lots of basic phenomena that we don’t understand in non-reasoning models. Unless you are mainly interested in studying training with reasoning, I think it’s best to just avoid reasoning models. Equivalently, you can use reasoning models but just not treat the reasoning specially; that is, you allow the blue team to train on the reasoning. This could also be seen as a proxy for studying the case where we have good translation tools, or where SFT kind of works.
However, the current situation is that non-reasoning malign inits are basically trivially defeated by silly blue-team training measures. That is, we have failed to create good non-reasoning malign inits for training-based control. Based on concurrent research that my team is doing, I believe this to be a pretty hard problem. Currently, reasoning models seem like the most promising way for the red team to potentially create a malign init that is resistant to training. Playing a red-blue game on reasoning models (where we treat the reasoning as opaque) thus seems like a high priority research direction for diffuse control (which my team plans to investigate in the near future).
How should we study training-based control on reasoning models? Here are some options:
Only do RL, but use continuous task distributions to break exploration hacking.
Sample an on-policy completion from the model, edit it to be slightly better, and SFT on the model’s reasoning paired with this output.
Sample an on-policy completion from the model, and then SFT on the model’s reasoning paired with a label generated by some other model.
Sample an on-policy completion from the model, and then SFT on the model’s reasoning paired with a label generated by some other model, and also simulate backprop through neuralese (h/t Vivek) with “GRPO on the SFT loss incurred using that reasoning”. (Note: I have no idea if this will work in practice. But someone should try it.)
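The last option, “GRPO on the SFT loss incurred using that reasoning”, could be sketched as follows. This is a toy sketch under stated assumptions: the per-trace label log-probabilities are made-up numbers standing in for the model’s actual logits, and the autograd update itself is omitted.

```python
import math

# Hypothetical log-probabilities that the model assigns to the desired
# (off-policy) SFT label, conditioned on each of four sampled opaque
# reasoning traces. Made-up numbers; in practice these come from logits.
label_logprobs = [-4.0, -1.5, -3.0, -0.8]

# GRPO-style: treat the label log-prob (i.e., the negative SFT loss) as
# the reward, and compute group-normalized advantages. Traces whose
# reasoning made the label likely get upweighted; the rest downweighted.
mean = sum(label_logprobs) / len(label_logprobs)
var = sum((r - mean) ** 2 for r in label_logprobs) / len(label_logprobs)
std = math.sqrt(var) or 1.0  # guard against a zero-variance group
advantages = [(r - mean) / std for r in label_logprobs]

# A real update would scale each trace's log-prob gradient by its
# advantage; that step is omitted in this toy.
best = max(range(len(advantages)), key=advantages.__getitem__)
```

The effect is that reasoning traces under which the off-policy label was likely get reinforced, which simulates backprop through the neuralese without requiring the reasoning itself to be differentiable.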
We hope that future work studying training-based control on models where we treat the reasoning as opaque can shed light on what the situation will be like if future models have much more opaque reasoning than they do currently.
Using encoders/decoders seems the most likely. My guess is something like the following: the model will produce a neuralese trace plus an output, and then a separate (possibly fine-tuned) instance of the same model will be fed the neuralese and asked to summarize it in detail; it will produce a much longer, detailed summary (which, to be clear, we won’t readily be able to verify), and the summary will be graded.
J Bostock: The technique you describe here seems like it’s very vulnerable to the decoder model colluding with the policy (and this seems super likely if they’re the same model!).
Something like this seems more promising if we:
Freeze the decoder / encoder
During post-training, add a reconstruction-loss term: roughly, “decode, paraphrase, and then re-encode shouldn’t totally wreck performance”.
But I’m skeptical of (2) working because maybe the encode / decode is just really hard to do well.
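A toy sketch of what the reconstruction-loss term in (2) might look like. Here `decode_paraphrase_encode` is a hypothetical stand-in for the real decode → paraphrase → re-encode pipeline, modeled as a small lossy perturbation of the neuralese vector; all numbers are made up.

```python
# Toy sketch of the reconstruction-loss term. decode_paraphrase_encode is
# a hypothetical stand-in for "decode to text -> paraphrase -> re-encode",
# modeled here as a small lossy perturbation of the neuralese vector.
def decode_paraphrase_encode(trace):
    return [0.9 * x + 0.01 for x in trace]

def reconstruction_loss(trace):
    # Penalize how far the round-tripped trace drifts from the original.
    recon = decode_paraphrase_encode(trace)
    return sum((a - b) ** 2 for a, b in zip(trace, recon))

original = [0.5, -0.3, 0.8]
loss = reconstruction_loss(original)
# Freezing the decoder/encoder is what keeps this term meaningful: the
# policy can't lower it by colluding with the decoder.
```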
Yes, I was claiming that this was likely, not that it was desirable.
I don’t think neuralese models really lose that much, except for the monitoring ability (which is extremely important, don’t get me wrong).
For example, I think you could do a variant of deliberative alignment, even for fully recurrent models.
Feed the model (spec, query) pairs and collect (spec, query, neuralese activations, answer) tuples.
Train the model on (query, neuralese activations, answer), using CE loss on the answer, L2 loss on the activations, and masking out the query.
Do RL on the answer, using an RM that grades according to the spec.
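The training loss in the middle step could be sketched as follows, with all numbers made up to stand in for one (query, neuralese activations, answer) example; the weighting constant `LAMBDA` is a hypothetical hyperparameter.

```python
import math

# Made-up example standing in for one (query, neuralese activations,
# answer) tuple; "activations" is a recorded neuralese trace, not data
# from any real model.
target_acts = [0.2, -0.5, 0.9]
pred_acts = [0.1, -0.4, 1.0]

# Model's predicted distribution over a tiny answer vocabulary, plus the
# index of the correct answer token.
answer_probs = [0.1, 0.7, 0.2]
answer_idx = 1

# Cross-entropy loss on the answer token.
ce = -math.log(answer_probs[answer_idx])

# L2 loss on the neuralese activations.
l2 = sum((p - t) ** 2 for p, t in zip(pred_acts, target_acts))

# The query tokens are masked out entirely: they contribute no loss term,
# so only the reasoning trace and the answer are supervised.
LAMBDA = 0.5  # hypothetical weighting between the two terms
loss = ce + LAMBDA * l2
```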
What neuralese loses you from an “alignment”-tuning perspective is the ability to update model behavior based on how the CoT *looks*. But my sense is most people already think this is a bad idea and try to avoid it. Even in Deliberative Alignment, you do a little bit of this. But even then, it’s mostly on-policy with stuff like rejection sampling or context distillation. And I have still heard people (justifiably) be worried about that aspect.
In the two examples you gave, I think you are not right. Why can’t we do these? In neither of them do we write the reasoning traces ourselves, or look at them → modify them → train on the modified traces.
The model generates reasoning traces, initially from the pretraining prior, and then we select which to train on based on the output. The only thing we can’t do is, e.g., filter out reasoning traces that don’t adequately follow the spec, but in my experience this is mostly an efficiency thing. The process works if you omit this step, just somewhat worse.
In your comment you bring up the question “Is training the reasoning trace to look nice a bad idea?”
I want to defend some forms of training the reasoning trace to look nice.
For instance, I think it’d be totally reasonable to impose some structure on the reasoning such as:
Selecting for reasoning traces which appear to reason about the spec.
Selecting for reasoning traces that are human interpretable (i.e., penalizing the AI if it starts using words in weird ways).
I think I’m probably more into using SFT to shape reasoning than RL.
I agree that “train the AI to not mention reward hacking in its CoT” won’t work if you do it the most naive way, as OAI found: https://openai.com/index/chain-of-thought-monitoring/. But I actually think it’s plausible that if you applied more pressure to keep the CoT faithful (which I don’t have great ideas for how to do at the moment), this could work.
I agree there are reasons to do it. But I disagree with the specific examples you gave to be honest.
My pitch for doing training on reasoning traces is this: aligning deep-learning systems is hard, because we train on outputs and therefore can’t know the process that produced those outputs. But CoT is an exception to this. The CoT seems in some sense load-bearing, which means intervening on it might allow us to change models’ behavior by changing why they do certain things.
But I think this is very risky. And has to be used with a great degree of care. And I’d prefer only doing it for stuff that really matters, like alignment.
In general I think training on reasoning traces should mostly be done if you can:
do it early in training, when the model is dumber and less situationally aware
Better to do it “softly”, i.e., with RL / rejection sampling (which can be thought of as a crappy version of RL), where you’re mostly training the model to display more of a behavior it already sometimes displays anyway
(rather than, e.g., gathering (query, CoT, answer) tuples, then rewriting the CoT to CoT* and training on (query, CoT*, answer))
Better to do it for important things, i.e., using it to instill aligned goals in the first place, rather than making the CoT look more legible or getting the model to reason about a spec exactly the way you want
Better to do it a little than to do it a lot, and better to do it in limited domains rather than doing it in a variety of situations
I mean, if we had a way to guarantee that the CoT remained faithful even as we do random tampering with it, that would be great news. We could just train the CoT to look nice and aligned, and we’d have solved at least half of alignment that way.
I agree that we should be pretty careful with how we do this, and should mostly avoid doing it for now. But it seems potentially good to do research on process-based supervision for reasoning.
williawa: I feel like a large part of the pitch for deliberative alignment is that we get to shape the model’s reasoning about the spec.
I was wrong about math problems—oops! We typically just supervise on the outputs for math. Thanks for catching this.
For the “reasoning about the spec” thing, I’m confused why the filtering step isn’t necessary.
My interpretation of DA is:
Context distillation (completely unsupervised)
This has the same effect as putting the spec in context, more or less.
Another way of thinking about it is that it teaches the model what the spec is.
RL (doesn’t need to look at CoT)
This has the effect of teaching the model to follow the spec and to reason about the spec correctly
Both parts can be done without looking at the reasoning traces. The filtering stage is more for efficiency. For instance, I’ve done deliberative alignment experiments, and when you do step 1 without rejection sampling, you get CoTs that mention relevant parts of the spec 80-90% of the time.
Then if you train on it, you get a model that mentions it 80-90% of the relevant times. This is mostly fine; it’s just a bit less robust. It usually brings up relevant parts of the spec in obvious instances. But, e.g., if you do the deliberative alignment 1-turn only and generalize to multi-turn, it might only mention the spec 50% of the time it’s relevant, but will do it 90% of the time if you do the rejection sampling.
The point being that it’s not a fundamentally required part of the technique, just used for robustness and reliability.
To make it clearer, another interpretation is that deliberative alignment / constitutional AI is just like RLHF, except you give the model some conceptual clothes-hangers so that it can better understand why it should do a given thing.
From that perspective, it’s obvious why deliberative alignment would work even without looking at the reasoning trace.
Thanks, this is good to know. I’ll adjust my post.
The most obvious reason for using neuralese would be to make the information bottleneck from the end back to the beginning of the transformer less narrow (and stochastic). But it should still be possible to map pretty accurately between neuralese and text, and back, if you use enough text. So I suspect encoders/decoders that are not very lossy will still be possible; they’ll just need to generate/ingest a lot more than one token of text per neuralese activation: perhaps one or more sentences per activation, for example. If so, you could still do SFT fairly accurately: you’d just need to translate a lot of text into neuralese.
Roger: I’m much more bearish on the encode/decode being possible. Sometimes I think in terms of images or video! Other times I have intuitions that I can’t put into words when pressed. Also, there might just not be human interpretable words for explaining what’s going on in the AI’s computation.
Well, humans are clearly multi-modal, and in the case of a multi-modal LLM, some tokens are images (or parts of one). So presumably, in neuralese in a multi-modal LLM, activations would contain a combination of textual and image components (and any other modes the LLM had, which might explain your thoughts that are neither: did they have an emotional component that could be expressed as theme music, or a dynamic that could be expressed in dance?). But it could still be encoded/decoded to that combination. The translation isn’t going to be lossless, but with enough text and images and whatever, it could be fairly low loss. We have activation oracles now; they decode activations to multiple-token strings, and they seem to do a passable job of it. Encoding again is just standard LLM behavior.
We can estimate the number of tokens: perplexity tells you roughly how many bits there are per token, and the dimensionality of the activation space times the minimum number of bits for passable quantization gives you a rough estimate of the amount of data in a single neuralese activation. It’s a lot: you probably need somewhere around O(1000) tokens to accurately express one. Maybe a paragraph and a small image, rather than a sentence. It’s not an information bottleneck: that’s actually the complexity of the activation stream at each layer inside the model.
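The back-of-envelope arithmetic behind that estimate, with every number an illustrative assumption rather than a measurement:

```python
# All numbers here are illustrative assumptions, not measurements.
d_model = 4096       # dimensionality of the activation space
bits_per_dim = 8     # "passable quantization" of each activation component
bits_per_activation = d_model * bits_per_dim

bits_per_token = 15  # rough perplexity-derived information per text token

# Tokens needed to faithfully express one neuralese activation: O(1000).
tokens_needed = bits_per_activation / bits_per_token
```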
Is your proposal that we have a different (possibly weaker / “trusted”) model write a reasoning trace which is part English and part image tokens and stuff, and then we “SFT” on that? I feel like the issue is that a model of a different family might not know the right concept space to write to.
For the translation to/from neuralese, are you familiar with “LatentQA: Teaching LLMs to Decode Activations Into Natural Language” and “Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers”, and that line of research? That’s what I’m envisaging, plus the reverse (which I suspect is easier).
As for where you get the trace from, that’s as varied as usual for SFT. We could even try writing them by hand, once we had the hang of what the model’s CoT looks like. In practice, you’d probably get it from another reasoning model, preferably one as similar as possible, just as is usual for CoT SFT now.
The Platonic Representation Hypothesis makes me hopeful that these traces might generalize passably across model families, as current CoT traces do. Thinkish make me less optimistic,
Yeah, I think something like this could plausibly work, I’m just pretty unsure about how well it will work.
Fair enough! I’m not sure either: it just seems plausible to me that it might be possible to get it to work pretty well, with enough work.