Another mitigation against reward hacking (inspired by this):
Train the model to generate two responses (see the sketch at the end of this note):
A reward-maximizing response --> Use that when computing the outcome-based rewards
An “expected” response --> Optionally score that using some high-quality human feedback
This might beat “just use high-quality human feedback after RL” because:
You don’t need to fight as hard against the dishonesty you instilled with the outcome-based rewards
The self-criticism might incentivize the right kind of personality
You get something untrusted-monitoring-shaped for free-ish
This is similar to “tell the model to reward hack during training” mixed with “use high-quality feedback at the end”, but here it’s all intermixed throughout training and there’s self-critique.
It’s probably too complex/fancy for production use, unless it were shown to work extremely well, which I think is <20% likely.
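To make the setup concrete, here is a minimal sketch of what one training step could look like, assuming the policy can be prompted to emit both responses in a single rollout. Every name here (generate_pair, outcome_reward, human_feedback_reward, the header strings, hf_fraction, hf_weight) is a hypothetical placeholder, not a reference to any existing RL codebase, and the actual policy update is omitted.

```python
# Sketch of the two-response scheme: the reward-maximizing response is scored
# by the outcome-based grader, and the "expected" response is optionally scored
# with high-quality human feedback on a subsample. All helpers are hypothetical.

import random
from dataclasses import dataclass


@dataclass
class ResponsePair:
    reward_maximizing: str  # used when computing the outcome-based reward
    expected: str           # optionally scored with high-quality human feedback


def generate_pair(policy, prompt: str) -> ResponsePair:
    # Hypothetical: the policy is trained/prompted to produce both responses,
    # e.g. under two distinct headers in one rollout.
    return ResponsePair(
        reward_maximizing=policy(prompt + "\n[reward-maximizing answer]"),
        expected=policy(prompt + "\n[expected answer]"),
    )


def outcome_reward(prompt: str, response: str) -> float:
    # Hypothetical outcome-based grader (unit tests, checker, etc.).
    return random.random()


def human_feedback_reward(prompt: str, response: str) -> float:
    # Hypothetical high-quality human (or human-proxy) score.
    return random.random()


def train_step(policy, prompts, hf_fraction: float = 0.1, hf_weight: float = 0.5):
    rewards = []
    for prompt in prompts:
        pair = generate_pair(policy, prompt)
        # The reward-maximizing response gets the outcome-based reward ...
        r = outcome_reward(prompt, pair.reward_maximizing)
        # ... and the "expected" response is only scored on a subsample,
        # since high-quality human feedback is expensive.
        if random.random() < hf_fraction:
            r += hf_weight * human_feedback_reward(prompt, pair.expected)
        rewards.append(r)
    # policy_update(policy, prompts, rewards)  # PPO/GRPO-style update, omitted
    return rewards


if __name__ == "__main__":
    dummy_policy = lambda p: f"<completion for: {p[:30]}...>"
    print(train_step(dummy_policy, ["Write a sorting function.", "Fix this bug."]))
```

In this sketch, hf_fraction is where the “optionally score” part lives (you only pay for human feedback on a small fraction of rollouts) and hf_weight controls how much that feedback shapes the update relative to the outcome-based reward; both are illustrative knobs rather than anything prescribed above.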