I wonder if you could do “recontextualization of the environment” instead of (or in addition to) recontextualizing the prompt/instructions:
Generate completions in robust, non-hackable environments
Modify the environments (/ their reward functions) to introduce reward hack opportunities
Train the model to submit the previously generated completions, even when faced with a hackable environment
Some arguments for this:
Maybe this better nullifies the core hack instincts we care about—we are worried about models hacking due to noticing a hack opportunity, not due to being instructed to hack
Maybe “Recontextualization may become less effective over longer training runs” is less of an issue here—if you try hard to create hack opportunities, but the model doesn’t take them, then maybe you’ve just done a good job at reducing actual reward hacking propensity
Some arguments against this:
Modifying an environment is more complex than an instruction; it is hard to be sure your initial environment is “robust”; the recontextualization SFT may not make much sense if the recontextualized environment behaves very differently to the original environment
Likely need to generate hack opportunities that (i) look realistic, and (ii) trigger a strong propensity to hack (in environments that the model otherwise would not hack in). (ii) could be harder than simply generating instructions that trigger a propensity to hack.
Maybe this is similar to doing the deception mitigation training seen in ‘3.8 Deception: Mitigations’ in the GPT-5 system card.
Thanks for the cool suggestion! When training environments need to be “flawed” to even give the model the opportunity to hack (like revealing unit tests), this could be promising. We’d build a kind of resistance to exploiting loopholes in environments. Even if the “data-generation” environment is not completely robust, I’d still expect that hacking reduces so long as the “training” environment is less robust.
Developers could even artificially create these contrastive environments to try to reduce reward hacking after post-training. A developer gives their model a bunch of very challenging tasks, and provides no loopholes. Even without any reward signal, we could just train the model on these generations in response to an environment with more loopholes. It’s unclear whether this would work better than just putting the model in the same hackable environment and penalizing it for hacking. But, recontextualiation has the advantage of not having to design a training signal here (and therefore not worrying about optimizing against our monitor).
Environment-recontextualization probably can’t cover every case we’re interested in—some of the misbehaviors that we’re worried about (like deception) don’t require any feature of the environment in order to be performed. So in those cases, we’d still probably need instruction-recontextualization.
I wonder if you could do “recontextualization of the environment” instead of (or in addition to) recontextualizing the prompt/instructions:
Generate completions in robust, non-hackable environments
Modify the environments (/ their reward functions) to introduce reward hack opportunities
Train the model to submit the previously generated completions, even when faced with a hackable environment
Some arguments for this:
Maybe this better nullifies the core hack instincts we care about—we are worried about models hacking due to noticing a hack opportunity, not due to being instructed to hack
Maybe “Recontextualization may become less effective over longer training runs” is less of an issue here—if you try hard to create hack opportunities, but the model doesn’t take them, then maybe you’ve just done a good job at reducing actual reward hacking propensity
Some arguments against this:
Modifying an environment is more complex than an instruction; it is hard to be sure your initial environment is “robust”; the recontextualization SFT may not make much sense if the recontextualized environment behaves very differently to the original environment
Likely need to generate hack opportunities that (i) look realistic, and (ii) trigger a strong propensity to hack (in environments that the model otherwise would not hack in). (ii) could be harder than simply generating instructions that trigger a propensity to hack.
Maybe this is similar to doing the deception mitigation training seen in ‘3.8 Deception: Mitigations’ in the GPT-5 system card.
Thanks for the cool suggestion! When training environments need to be “flawed” to even give the model the opportunity to hack (like revealing unit tests), this could be promising. We’d build a kind of resistance to exploiting loopholes in environments. Even if the “data-generation” environment is not completely robust, I’d still expect that hacking reduces so long as the “training” environment is less robust.
Developers could even artificially create these contrastive environments to try to reduce reward hacking after post-training. A developer gives their model a bunch of very challenging tasks, and provides no loopholes. Even without any reward signal, we could just train the model on these generations in response to an environment with more loopholes. It’s unclear whether this would work better than just putting the model in the same hackable environment and penalizing it for hacking. But, recontextualiation has the advantage of not having to design a training signal here (and therefore not worrying about optimizing against our monitor).
Environment-recontextualization probably can’t cover every case we’re interested in—some of the misbehaviors that we’re worried about (like deception) don’t require any feature of the environment in order to be performed. So in those cases, we’d still probably need instruction-recontextualization.