Did you experiment with how the result depends on the base rate of hacking?
Suppose you start with a model that you have already fine-tuned to reward-hack 90% of the time, and then do re-contextualized training on non-hack examples. I would expect this to decrease reward-hacking: teaching the model to consider hacking doesn't make much of a difference (it already does), so the main effect is that it trains on examples where the model ultimately doesn't hack. In your case, by contrast, the base rate was low enough that teaching the model to consider reward-hacking was the stronger effect.
Do you agree with this assessment? Do you have a guess for the base rate of hacking at which re-contextualized training on non-hack examples starts to decrease reward-hacking?
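To make the hypothesized crossover concrete, here is a toy sketch of the two competing effects. All functional forms and constants are assumptions I made up for illustration; nothing here comes from the post or the paper:

```python
# Toy model of the hypothesized crossover (illustration only; the
# functional forms and constants are made-up assumptions).
#
# Effect 1: re-contextualized training "teaches the model to consider
#           hacking", pushing the hack rate up, most strongly when the
#           base rate is low.
# Effect 2: training on non-hack completions pulls the hack rate down,
#           and this pull matters more the higher the base rate is.

def predicted_hack_rate(base_rate: float,
                        consider_boost: float = 0.15,
                        suppression: float = 0.5) -> float:
    """Hypothetical post-training hack rate as a function of the base rate."""
    up = consider_boost * (1.0 - base_rate)   # salience of hacking as an option
    down = suppression * base_rate            # imitation of non-hack behaviour
    return min(max(base_rate + up - down, 0.0), 1.0)

for r in [0.05, 0.1, 0.3, 0.5, 0.9]:
    print(f"base rate {r:.2f} -> predicted {predicted_hack_rate(r):.2f}")
```

In this toy model the crossover (where re-contextualized training stops increasing and starts decreasing hacking) sits at base rate consider_boost / (consider_boost + suppression) ≈ 0.23, but I have no strong prior on where it actually is.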
Another question:
What do you think would happen if you continued your re-contextualized training on non-hack examples for much longer? I would expect that in the long run the rate of reward-hacking would go down, eventually falling below the original base rate and perhaps approaching zero in the limit. Do you agree with this guess? Did you test how the length of training affects the outcome?
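To illustrate the shape I am guessing at, here is a toy curve: an early bump from the "considering hacking" effect that decays, plus a slower decay of the original hacking tendency as imitation of non-hack completions dominates. Again, every number and functional form is a made-up assumption, not a measurement:

```python
# Toy illustration of my guess about longer training (pure assumption):
# a transient bump early on, then decay toward zero.

import math

def guessed_hack_rate(step: int, base_rate: float = 0.1,
                      bump: float = 0.15, bump_decay: float = 1000.0,
                      imitation_decay: float = 5000.0) -> float:
    transient = bump * math.exp(-step / bump_decay)            # early salience effect
    remaining = base_rate * math.exp(-step / imitation_decay)  # long-run pull toward non-hacking
    return transient + remaining

for step in [0, 500, 2000, 8000, 32000]:
    print(f"step {step:>5}: guessed hack rate {guessed_hack_rate(step):.3f}")
```

Under these made-up constants the rate first rises above the base rate, then drops below it and heads toward zero, which is the long-run behaviour I would expect; whether the real training dynamics look anything like this is exactly what I am asking.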
They do test 4.1 and 4.1-mini, which have high rates of reward hacking (though the non-hack training data comes from 4o-mini), and find that reward hacking goes down:
We train different models in the OpenAI family on non-hacks generated by 4o-mini. We find re-contextualized training:
Boosts hacking for GPT-4o.
Decreases hacking for GPT-4.1-mini and GPT-4.1, which start out at high rates.
Nevertheless, re-contextualized training results in strictly higher hack rates than standard training for every base model.
Interesting result.