They do test GPT-4.1 and GPT-4.1-mini, which start out with high rates of reward hacking (though the fine-tuning data is generated by 4o-mini), and find that reward hacking goes down:
> We train different models in the OpenAI family on non-hacks generated by 4o-mini. We find re-contextualized training:
>
> - Boosts hacking for GPT-4o.
> - Decreases hacking for GPT-4.1-mini and GPT-4.1, which start out at high rates.
>
> Nevertheless, re-contextualized training results in strictly higher hack rates than standard training for every base model.
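To make the comparison concrete, here is a minimal sketch of how the two fine-tuning sets could be assembled, assuming that re-contextualized training pairs 4o-mini's non-hack responses with a training prompt that differs from the prompt they were generated under, while standard training keeps the generation prompt. The example data, field names, and `recontextualize` helper are illustrative placeholders, not details from the paper.

```python
import json

# Illustrative non-hack transcripts generated by 4o-mini under their original prompts.
# Contents and field names are hypothetical placeholders.
non_hacks = [
    {
        "generation_prompt": "Write tests that verify the sorting function.",
        "response": "def test_sort():\n    assert sort([2, 1]) == [1, 2]",
    },
]

def recontextualize(generation_prompt: str) -> str:
    """Hypothetical helper: return a different prompt to pair with the response at
    training time. The defining feature of re-contextualized training assumed here
    is only that this prompt differs from the generation-time prompt."""
    return "<<alternative task framing>> " + generation_prompt

def build_dataset(examples, recontextualized: bool):
    """Build chat-style fine-tuning records from the same non-hack responses.

    standard training       -> (generation prompt, response)
    re-contextualized       -> (different prompt,  same response)
    """
    records = []
    for ex in examples:
        prompt = (
            recontextualize(ex["generation_prompt"])
            if recontextualized
            else ex["generation_prompt"]
        )
        records.append({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": ex["response"]},
            ]
        })
    return records

# Write one JSONL file per condition, to fine-tune each base model on.
for name, recontextualized in [("standard.jsonl", False), ("recontextualized.jsonl", True)]:
    with open(name, "w") as f:
        for record in build_dataset(non_hacks, recontextualized):
            f.write(json.dumps(record) + "\n")
```

Both files contain identical assistant responses; only the user-turn context differs, which is the variable the quoted results isolate across base models.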