There are even more cautious things you can do. Suppose you have a fairly accurate mech interp measure of reward hacking behavior (see the 2025 OpenAI paper describing exactly that). You start RL reasoning training on your model, and you monitor the “reward hacking?” curve, and after a while it starts to go up. You halt the training run and go back and look at which RL environments the model was training on in the specific RL update batches where the score went up, and of those, which ones the model is now doing great on where before it was stuck. You look at the model’s actual responses on those training environments and discover that, yes, they were hackable, and here’s the tactic the model was using to hack them. So you fix them, along with every other RL reasoning training environment that has the same problem, build a classifier to catch the tactic, and check what else the model can now suddenly solve that it couldn’t before. Then you rewind your training run to before the reward hacking started, and continue from there with the new, fixed RL training environments. Rinse and repeat.
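To make the loop concrete, here is a minimal sketch of the monitor-halt-rewind procedure in Python. Everything named here is a hypothetical stand-in, not any real API: `train_step` is your RL trainer, `probe_score` is the mech interp “reward hacking?” read-out, `find_newly_solved` and `fix_envs` are your environment-auditing and patching tooling, and `rise_threshold` is an assumed noise band for the probe.

```python
# Sketch of the monitor-halt-rewind loop described above. All callables passed
# in are hypothetical stand-ins for lab-specific machinery, not a real API.

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Checkpoint:
    step: int
    model: Any          # frozen copy of the weights at this step
    probe_score: float  # mech interp "reward hacking?" read-out


def monitored_rl_training(
    model: Any,
    envs: list,
    train_step: Callable[[Any, list], Any],                 # one RL reasoning update
    probe_score: Callable[[Any], float],                     # reward-hacking probe
    find_newly_solved: Callable[[Any, Checkpoint], list],    # envs the model suddenly cracked
    fix_envs: Callable[[list, list], list],                  # patch hackable envs and their peers
    max_steps: int,
    rise_threshold: float = 0.05,                            # assumed probe noise band
) -> Any:
    history: list[Checkpoint] = [Checkpoint(0, model, probe_score(model))]
    baseline = history[0].probe_score

    step = 0
    while step < max_steps:
        model = train_step(model, envs)
        score = probe_score(model)
        history.append(Checkpoint(step + 1, model, score))

        # Has the "reward hacking?" curve started climbing?
        if score - baseline > rise_threshold:
            # Rewind target: the last checkpoint whose score is still in the noise band.
            clean = max(
                (c for c in history if c.probe_score - baseline <= rise_threshold),
                key=lambda c: c.step,
            )
            # Which envs did the model suddenly start doing great on? Inspect the
            # transcripts offline, confirm the hack, then patch those envs and
            # every other env with the same flaw.
            suspect = find_newly_solved(model, clean)
            envs = fix_envs(envs, suspect)
            # Rewind and continue from the clean checkpoint with the fixed envs.
            model, step = clean.model, clean.step
            history = [c for c in history if c.step <= clean.step]
            continue

        step += 1

    return model
```

The only real design choice in the sketch is the rewind target: the last checkpoint whose probe score is still within the noise band of the baseline, so you resume from weights that, as far as the probe can tell, have not yet learned the hack.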
Could this accidentally create a model that sneakily reward hacks in a way your mech interp can’t detect? Well, yes, it could, but it would have to happen by pure chance: there was never any training incentive encouraging the model to do this. It’s just that if it did happen, you wouldn’t have known it was happening, so you wouldn’t have stopped the training run and fixed it.