Cool post! I’m curious what class of problems you think this would work well for (besides the weight exfiltration example) and what grading schemes would work for different problems. In technical alignment, many problems aren’t well-specified / don’t have robust metrics that can grade solutions. For specific problems that are very well-specified and have robust metrics, like weak to strong supervision, we may already be able to have an AI researcher iterate on them.
ariana_azarbal
Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
Ah, thanks! Added a link to your shortform
Thanks for sharing the quote! It’s cool they clarified this.
Rather, its behavior on our evaluations is evidence that it learned to “cheat” in training to receive higher reward
I do think this claim is not fully substantiated by their results, though. That is one plausible cause for why the model cheats on METR’s task, but not the only possible explanation (I listed some more in the post, e.g. misgeneralization of persistence entrained by well-specified RL)
Confusion around the term reward hacking
Thanks for clarifying! This makes sense to me. I think it’s a very clear story for how on-policy inoculation prompting may outperform recon
Hey, thanks for the thoughts! I wanted to probe further on this point:
When you do recontextualization, if the AI explores into a reward hack, it did so without the inoculation prompt, and therefore you’d have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.
This strikes me as plausible, but I’m confused about the mechanics. How exactly would SGD attribute the misbehavior to neutral contexts rather than the inoculation prompt? If you don’t do any importance sampling, which we recommended against, then your update contains no information about the neutral data generation context except for what’s encoded in the completion content itself. Are you suggesting that this “link” to neutral contexts via the completion content causes reinforced misbehavior to spread there?
Thanks for compiling this evidence! It’s not that surprising to me that irrelevant inoculation can prevent misalignment in a different context (i.e. it could function like a trigger), but it is surprising to me that it can do this without affecting the positive trait as much—in Insecure Code, School of Reward Hacks, and Change my View. We found similar results for irrelevant prompts in a couple recontextualization settings
The goal is to tell the full story, not to declare which houses have ghosts.
Just wanted to say I think this is really important work and I’m excited that you’re doing it. The existence of this group might even incentivize authors to reveal more of the full story in the first place
My understanding is that inoculation prompting is on-policy (with the target policy being the model conditioned on the inoculation prompt). Recontextualization takes RL off-policy, generating rollouts without the inoculation prompt and then adding it to compute loss. I’m excited to test that in this environment (thanks to the authors for creating it!)
Cool work! I realize I’m a bit late to the party.
In your work and in “Training a Reward Hacker with Perfect Labels”, misalignment is probably being transfered by 1) suspicious reasoning itself and 2) subliminal learning of the GH prompt itself (which could stand in for subliminal learning from anything a misaligned model produces). I think disentangling these is important, and I think your work partially does that by showing stronger reasoning filters reduce misalignment, but not completely. Ideally we could ablate reasoning in some setting (or have another LLM completely re-write reasoning traces given the prompt and output) and see whether there’s a subliminal transfer effect from just the output data. But some (or most) of the subliminal transfer would come from the reasoning itself, so this seems tricky
Hey, thanks for the comment :)
I agree that “Lie” prompting is the moral equivalent of inoculation prompting; my model of why it doesn’t work is that inoc prompting for on-policy learning has different properties than for off-policy learning. Namely, it hasn’t been shown to reduce hacking in distribution. (Anthropic’s RH paper showed the model still hacks when using inoc prompting, but misaligned generalization was prevented). Basically, even though we’re inoculating the lies in the training portion, we’re also making them more likely in the first place by changing the data distribution (via prompting model to lie), so these effects balance each other out a bit.
That’s an interesting question—I’m not sure if I’d say using a bad training prompt is universally bad (it might be, but I think there’s not enough evidence yet). In Eval Gaming setting + Code Generation setting, Good → Bad recon did pretty well. There’s also some bias in how we evaluate the models. Main post results show evaluations with neutral prompts. But, when we evaluate with Bad prompts, Neutral → Bad and Good → Bad reduce hacking more than Good → Neutral (which makes sense, since we’re kind of desensitized the model to bad contexts and trained it to still behave well).
Thanks for the cool suggestion! When training environments need to be “flawed” to even give the model the opportunity to hack (like revealing unit tests), this could be promising. We’d build a kind of resistance to exploiting loopholes in environments. Even if the “data-generation” environment is not completely robust, I’d still expect that hacking reduces so long as the “training” environment is less robust.
Developers could even artificially create these contrastive environments to try to reduce reward hacking after post-training. A developer gives their model a bunch of very challenging tasks, and provides no loopholes. Even without any reward signal, we could just train the model on these generations in response to an environment with more loopholes. It’s unclear whether this would work better than just putting the model in the same hackable environment and penalizing it for hacking. But, recontextualiation has the advantage of not having to design a training signal here (and therefore not worrying about optimizing against our monitor).
Environment-recontextualization probably can’t cover every case we’re interested in—some of the misbehaviors that we’re worried about (like deception) don’t require any feature of the environment in order to be performed. So in those cases, we’d still probably need instruction-recontextualization.
This is a very interesting prompting suggestion, and I’d like to test it! Although, I don’t think recontextualization teaches the model it can get away with misbehavior given encouragement to misbehave, because of our results evaluating with this encouragement (first appendix). Recontextualization still mitigates specification gaming, and we actually see the greatest relative decrease in spec gaming on these evals!
I agree—recontextualization seems safer if we assume that negative reinforcement from the monitor can “push” the model towards undetectable misbehavior. Whereas, when we recontextualize, we’d never be “pushing” the model towards undetectable misbehavior
Thanks for your thoughtful comment!
One way to view recontextualization and inoculation prompting is that if a training procedure naturally induces X and Y, then artificially inducing X can prevent natural learning of X
I really like this frame!
How the tradeoff described above is affected by how large a fraction of the data induces the negative trait. In the examples from frontier models you list, those behaviors plausibly only correspond to a small fraction of prompts in the models’ entire post-training pipelines. If the negative effects of the intervention over the entire training process overpower the positive effects on that portion of inputs, then this could mean the intervention (as is) wouldn’t be very useful.
Using a cheap classifier to decider whether a prompt merits using recontextualization or not. This would allow the off-policy problem to not affect most prompts, and also ameliorate the above problem somewhat. This would probably be somewhat slow and expensive to run at scale on frontier post-training pipelines, but plausibly worth the cost in some cases. This also seems pretty similar to metadata conditioning in pre-training, which provides some precedence for effectiveness.
Agreed. For large post-training runs, we might worry there are a variety of misbehaviors that could be learned: e.g. sycophancy, producing convincing yet incorrect explanations, deception, etc. As you suggest, we could use a cheap classifier flagging for any/each of these behaviors, and then recontextualize only those samples. My concern is: if it’s hard to get our reward signal to robustly detect these behaviors, it may not be easy to get a classifier to do so either. If it misses certain forms of misbehavior, then we’ve only partially solved our initial problem. But, we might be able to set a very low threshold and recontextualize the vast majority of these instances.
This seems to be missing the results from GRPO with the strong monitor? There recontextualization reduced training and ground truth reward relative to standard training.
Indeed—our point was not that recontextualization won’t ever hurt learning (relative to standard), but that it still seems to allow for significant increases in training reward (more so than regularization that is strong enough to prevent specification gaming).
Recontextualization Mitigates Specification Gaming Without Modifying the Specification
Yes, your code is exactly what we do.
Thank you for the suggestions and concrete predictions.
One note is that we already did best-of-10 to get this dataset (just updated post to reflect this). So, on problems which have relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are on the final training dataset.
I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I’d predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models.
Hey, cool work! I’d be curious how well the character expression generalizes to emails within ID settings (single or multiturn chat). The agentic harness may not be fully to blame for the reduction in character expression. Maybe the model just suppresses its character within email bodies (although I get that it’s a little tricky to test this within a chat setting since the model is then typically writing an email on behalf of the user).