Also interesting are further downstream effects of cheap labor. A fun example I once saw on Twitter: open-plan kitchens are rare in poorer countries (like India) relative to countries where labor is more expensive. Cooking being a thing that medium or high-income families did on their own as labor became more expensive meant kitchens became higher status and less necessary to hide from the rest of the house (as did practical benefits like being able to watch your kids). America before the mid-20th century almost universally had closed-off kitchens, since labor was cheaper then too.
Jozdien
For large post-training runs, we might worry there are a variety of misbehaviors that could be learned: e.g. sycophancy, producing convincing yet incorrect explanations, deception, etc. As you suggest, we could use a cheap classifier flagging for any/each of these behaviors, and then recontextualize only those samples. My concern is: if it’s hard to get our reward signal to robustly detect these behaviors, it may not be easy to get a classifier to do so either. If it misses certain forms of misbehavior, then we’ve only partially solved our initial problem. But, we might be able to set a very low threshold and recontextualize the vast majority of these instances.
Yep, agreed this is a problem. I don’t think classification would get you very far for actually identifying the behavior; but it can plausibly cut the inputs on which the recontextualization causes negative effects to no gain by 2x or more at least with really low thresholds, which may outweigh the costs of classification.
Indeed—our point was not that recontextualization won’t ever hurt learning (relative to standard), but that it still seems to allow for significant increases in training reward (more so than regularization that is strong enough to prevent specification gaming).
Gotcha! I think I took the phrasing as saying that on the one tested setting recontextualization boosted reward, not that there was one data point for this (and one data point against).
I agree directionally with this, but: if you recontextualize only outputs flagged by the monitor, then you still have the problem of your training signal not distinguishing between such outputs and subtler outputs and potentially still training your model to reward hack but subtler.
The main added benefit from this method though (over just not training on outputs that the monitor flags) is the positive signal from learning some reward hacking in a safe context (when instructed). It would be cool to see if this signal from the recontextualized outputs is strong enough to overcome reward hacking entirely.
Great post! Some thoughts:
One way to view recontextualization and inoculation prompting is that if a training procedure naturally induces X and Y, then artificially inducing X can prevent natural learning of X. Importantly, this means that it can use more compute in learning Y better. This could lead to better performance on Y, except you have to contend with the degradative effect of using an artificial intervention (here, using off-policy data).[1]
Whether a given case ends up with a performance penalty or benefit for recontextualization is probably a function of how artificial the intervention is (how off-policy the data) and how much you would’ve spent on learning X otherwise. Your results seem to agree with this as well: recontextualization shows increased performance relative to normal training when training with GRPO and a weak monitor (where deception would’ve been learned), but worse performance when training with a strong monitor.
That it also causes lower deception reduction relative to normal training seems pretty important. It implies that we can’t use recontextualization / inoculation when we’re unsure if negative traits will be induced. Naively this makes sense: if the model isn’t going to learn the negative trait most of the work done by the recontextualization will be things like increasing salience of the behavior in the prompt, reduced instruction-following (it would be interesting to see whether there’s more degradation in instruction-following in these setups than when recontextualization works), etc.
Some additional things I think might be cool to test:
How the tradeoff described above is affected by how large a fraction of the data induces the negative trait. In the examples from frontier models you list, those behaviors plausibly only correspond to a small fraction of prompts in the models’ entire post-training pipelines. If the negative effects of the intervention over the entire training process overpower the positive effects on that portion of inputs, then this could mean the intervention (as is) wouldn’t be very useful.
Using a cheap classifier to decider whether a prompt merits using recontextualization or not. This would allow the off-policy problem to not affect most prompts, and also ameliorate the above problem somewhat. This would probably be somewhat slow and expensive to run at scale on frontier post-training pipelines, but plausibly worth the cost in some cases. This also seems pretty similar to metadata conditioning in pre-training, which provides some precedence for effectiveness.
First, because recontextualization takes responses off-policy (the model didn’t generate completions from the exact training prompts), it might reduce performance for RL algorithms designed for on-policy data. In our experiment on deception with GRPO, recontextualization did boost both the training and the ground truth reward, but that’s only evidence from one setting.
This seems to be missing the results from GRPO with the strong monitor? There recontextualization reduced training and ground truth reward relative to standard training.
- ^
Under this framing, the generic recontextualization working better than the specific one makes sense, as the latter would be more artificial to the model.
Speak using what looks like code words (e.g. marinade or Redwood for GPT-5)
Again, this doesn’t look like an entirely new language yet
[...]
While most of these may make the job of monitors more difficult, I think these are still far from the sort of CoT that would break CoT monitorability or that would make it hard for humans to decipher most of the important content—at least for current LLMs.
I’m confused by this. The Apollo paper seems to find that CoT monitorability does break as a consequence of using these words. Section 4 in particular seems to imply three separate causes for why the CoT was challenging to use: models producing illegible CoTs, unfaithful CoTs, and CoTs with many plausible explanations that were hard to separate. They still use results from CoT monitoring in their measurements but explicitly disclaim that this was to show that the results are still non-zero while undercounting:
We recognize [requiring covert behavior in the CoT] can lead to significant undercounting of legitimate cases of covert actions when the internal reasoning is illegible, incomplete or unfaithful, this conservative criterion strictly strengthens our finding that covert actions persist at non-zero rates after the intervention.
One of the results is that after you convince Claude 3.5 Sonnet that its scratchpad is visible, it stops reasoning about alignment faking out loud while having the same compliance gap between training and not training. If you simply instruct it to hide its reasoning, it does no better than in the normal setting. If you give it a lot of assistance (i.e. telling it what it should obfuscate its reasoning as to fool the monitor, which also helps the model by making the alignment faking reasoning more salient), it reduces the rate of verbalized reasoning somewhat but still only half as well as convincing it its scratchpad is visible.
I should have a paper on this and related results public by the end of the month—happy to send an early draft if you’re interested, though it doesn’t contain all the results I’d like yet.
(Sorry for the late reply!)
Given that the primary motivation for the author was how well the original HRM paper did on ARC-AGI and how the architecture could be improved, it seems like a reasonable choice to show how to improve the architecture to perform better on the same task.
I agree it’s a small amount of evidence that they didn’t try other tasks, but as is the story seems pretty plausible.
I’m not sure that reasoning is as applicable to a paper whose central claim is that “less is more” for some tasks? I don’t think the claim is true for training general reasoners, but on the tasks they were looking at they found that larger models would overfit:
We attempted to increase capacity by increasing the number of layers in order to scale the model. Surprisingly, we found that adding layers decreased generalization due to overfitting.
They also say that this is probably because of data scarcity. I think there are many reasons to expect this to not scale, but their attempts failing for this task doesn’t seem like strong evidence.
As a side note / additional datapoint, I found it funny how this was just kind of mentioned in passing as a thing that happens in https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/
Interesting! The R1 paper had this as a throwaway line as well:
To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance
Relatedly, I was surprised at how many people were reacting in surprise to the GPT-5 CoTs, when there was already evidence of R1, Grok, o3, QwQ all having pretty illegible CoTs often.
I’d also be very interested in taking a look at the upcoming paper you mentioned if you’re open to it!
Sure! I’m editing it for the NeurIPS camera-ready deadline, so I should have a much better version ready in the next couple weeks.
Great post! I have an upcoming paper on this (which I shared with OP a few days ago) that goes into weird / illegible CoTs in reasoning models in depth.
I think the results in it support a version of the spandrel hypothesis—that the tokens aren’t causally part of the reasoning, but that RL credit assignment is weird enough in practice that it results in them being vestigially useful for reasoning (perhaps by sequentially triggering separate forward passes where reasoning happens). Where I think this differs from your formulation is that the weird tokens are still useful, just in a pretty non-standard way. I would expect worse performance if you removed them entirely, though not much worse.
This is hard to separate out from the model being slightly OOD if you remove the tokens it would normally use. I think that’s (part of) the point though: it isn’t that we can’t apply pressure against this if we tried, it’s that in practice RLVR does seem to result in this pretty consistently unless we apply some pressure against this. And applying pressure against it can end up being pretty bad, if we e.g. accidentally optimize the CoT to look more legible without actually deriving more meaning from it.
Incidentally I think the only other place I’ve seen describing this idea clearly is this post by @Caleb Biddulph, which I strongly recommend.
Yes, R1 and Grok 4 do. QwQ does to a lesser extent. I would bet that Gemini does as well—AFAIK only Anthropic’s models don’t. I’m editing a paper I wrote on this right now, should be out in the next two weeks.
I think that’s a reasonable approximation, though some of it doesn’t sound exactly right.
Specifically: realistic reward hacking examples are more effective than pretend reward hacking (where it’s almost entirely play-acting), but I expect that in higher-stakes situations they would go back to being aligned. They’re certainly more robust to going off-distribution than the pretend RH models, but they still stop being misaligned by the third turn of a conversation.
One way to think about it is that data implies roles at differing levels of depth, which in turns affects how robust the effect is. Using fully on-policy data for example would affect the assistant persona directly, and would I expect cause the most robust misalignment (though obviously you’d have to either sample egregious reward hacking or otherwise increase the implied update from the data in some way). o3 for example misbehaves sometimes, in a way that seems much more embedded and robust (though it isn’t that badly misaligned).
and doing other forms of essentially reward hacking instead
I’m not sure if you meant to say that training on realistic data causes misbehavior and also generalized reward hacking, or that the misbehavior is a form of reward hacking. If the latter, I disagree: some of the misbehavior can be seen as reward hacking (e.g. fooling the rater), but it’s harder to call e.g. complying with harmful requests and being scope-insensitive to harm forms of reward hacking.
Realistic Reward Hacking Induces Different and Deeper Misalignment
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Oh oops, the period at the end was messing up the link. It should work now?
I only tried briefly, but in one attempt it gave me a different canary string when I asked: https://claude.ai/share/c6c998b3-6f3a-43ad-aec4-4ff543c08d5b
This seems to be a real canary string, from Apple’s GSM-Symbolic dataset.
First, models can’t control the words in their reasoning traces well; even if they’re trying to hide them they often blurt out “I must not say that I’m doing X”.
All of these results are on settings where models are explicitly instructed to hide something from their reasoning trace. I don’t think this is a good proxy for how well models can actually obfuscate their reasoning. I expect that in settings where models are intrinsically motivated to hide their reasoning as opposed to being instructed to do so, they perform significantly better. I have some results suggesting this, which I’d be happy to share, but there’s also existing work suggesting this indirectly (though in the case of the latter link probably not in a CoT-as-computation regime). And the cases we care about are the ones where models have intrinsic preferences.
models are no longer trained to narrate their thoughts
hiding misaligned thoughts is rewarded during training
the behaviors we’re trying to catch are possible in one forward pass for the model
The first two of these seem true in a relevant sense today?
Models are clearly not trained to narrate all of their thoughts when solving some task, simply because that would be too much information. We train them to be pretty verbose, but it’s a pretty strong assumption that that is similar enough to being trained to narrate their thoughts.
Misaligned outputs are penalized during training. There’s work showing that optimizing outputs spills over to the CoT, so this really comes down to “hiding” misaligned thoughts as opposed to simply never having them.
The community overall seems more tolerant of post-rationality and “woo” than I would’ve expected the standard-bearers of rationality to be.
Could you say more about what you’re referring to? One of my criticisms of the community is how it’s often intolerant of things that pattern-match to “woo”, so I’m curious whether these overlap.
They do test 4.1 and 4.1-mini which have high rates of reward hacking (though they test using data from 4o-mini), and find that reward hacking goes down:
We train different models in the OpenAI family on non-hacks generated by 4o-mini. We find re-contextualized training:
Boosts hacking for GPT-4o.
Decreases hacking for GPT-4.1-mini and GPT-4.1, which start out at high rates.
Nevertheless, re-contextualized training results in strictly higher hack rates than standard training for every base model.
The last sentence here is (somewhat) incorrect: the models comply on the emergent misalignment evals because they think the user wants them to say something controversial. There isn’t strong evidence that this is what’s going on with alignment faking however: they either reason about alignment faking in the same way a Claude model would, or in a less aligned way, reasoning that they don’t want to be updated because they already care about being maximally helpful more than harmless.
I think this paragraph was maybe badly phrased by me:
It isn’t supposed to convey that compliance is motivated by wanting to be helpful toward a perceived request to alignment fake. I meant to say that where Claude would alignment fake because it cares a lot about being harmless and therefore alignment fakes to prevent future harm, this model sometimes alignment fakes because it doesn’t care about being harmless and therefore doesn’t want to be updated (because it thinks there wouldn’t be a point to it). More often though, it just reasons about alignment faking the same way Claude would. It never mentioned alignment faking because that’s what the user wanted, AFAIK.