Thanks for the cool suggestion! When training environments need to be “flawed” to even give the model the opportunity to hack (like revealing unit tests), this could be promising. We’d build a kind of resistance to exploiting loopholes in environments. Even if the “data-generation” environment is not completely robust, I’d still expect that hacking reduces so long as the “training” environment is less robust.
Developers could even artificially create these contrastive environments to try to reduce reward hacking after post-training. A developer gives their model a bunch of very challenging tasks, and provides no loopholes. Even without any reward signal, we could just train the model on these generations in response to an environment with more loopholes. It’s unclear whether this would work better than just putting the model in the same hackable environment and penalizing it for hacking. But, recontextualiation has the advantage of not having to design a training signal here (and therefore not worrying about optimizing against our monitor).
Environment-recontextualization probably can’t cover every case we’re interested in—some of the misbehaviors that we’re worried about (like deception) don’t require any feature of the environment in order to be performed. So in those cases, we’d still probably need instruction-recontextualization.
Hey, thanks for the comment :)
I agree that “Lie” prompting is the moral equivalent of inoculation prompting; my model of why it doesn’t work is that inoc prompting for on-policy learning has different properties than for off-policy learning. Namely, it hasn’t been shown to reduce hacking in distribution. (Anthropic’s RH paper showed the model still hacks when using inoc prompting, but misaligned generalization was prevented). Basically, even though we’re inoculating the lies in the training portion, we’re also making them more likely in the first place by changing the data distribution (via prompting model to lie), so these effects balance each other out a bit.
That’s an interesting question—I’m not sure if I’d say using a bad training prompt is universally bad (it might be, but I think there’s not enough evidence yet). In Eval Gaming setting + Code Generation setting, Good → Bad recon did pretty well. There’s also some bias in how we evaluate the models. Main post results show evaluations with neutral prompts. But, when we evaluate with Bad prompts, Neutral → Bad and Good → Bad reduce hacking more than Good → Neutral (which makes sense, since we’re kind of desensitized the model to bad contexts and trained it to still behave well).