One option would be to roll back to the point of the hack (or, in practice, the previous checkpoint), alter the previous scoring, and continue from there. Another option is to attempt an off-policy update, though how practicable and accurate that is depends on the particular training setup and is unclear.
Generally, once reward hacking occurs, it tends to become common relatively quickly, so rolling back seems likely to be preferable to doing a lot of off-policy updates.
In general, it's desirable for your training environments to be secure and unhackable, so that all of this is very rare; but when it does occur, having a remedy short of restarting the entire training run is clearly useful.
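As a rough illustration of the first option, here is a minimal, self-contained sketch of rolling back to the last pre-hack checkpoint and re-scoring its rollouts with a patched grader. All names here (`Checkpoint`, `Rollout`, `patched_score`, `roll_back_and_rescore`) are hypothetical stand-ins; a real training stack would use its own checkpoint store, rollout buffer, and scoring code.

```python
# Toy sketch of "roll back and re-score": find the last checkpoint saved before
# the hack appeared, re-score its rollouts with a corrected grader, and resume
# training from there. All names and structures are hypothetical.

from dataclasses import dataclass, field


@dataclass
class Rollout:
    step: int
    transcript: str
    score: float


@dataclass
class Checkpoint:
    step: int
    weights: dict                 # placeholder for model weights
    rollouts: list[Rollout] = field(default_factory=list)


def patched_score(transcript: str) -> float:
    """Re-score a transcript with the reward hack closed.

    Hypothetical stand-in for the corrected grader: here we simply zero out
    anything containing the exploit string.
    """
    return 0.0 if "EXPLOIT" in transcript else 1.0


def roll_back_and_rescore(checkpoints: list[Checkpoint], hack_step: int) -> Checkpoint:
    """Pick the last checkpoint at or before the hack and re-score its rollouts."""
    # In practice you resume from the nearest saved checkpoint, not the exact
    # step at which the hack first showed up.
    candidates = [c for c in checkpoints if c.step <= hack_step]
    restart = max(candidates, key=lambda c: c.step)

    # Alter the previous scoring so the resumed run no longer rewards the hack.
    for r in restart.rollouts:
        r.score = patched_score(r.transcript)
    return restart


if __name__ == "__main__":
    ckpts = [
        Checkpoint(step=100, weights={}, rollouts=[Rollout(100, "normal solution", 1.0)]),
        Checkpoint(step=200, weights={}, rollouts=[Rollout(200, "EXPLOIT the grader", 1.0)]),
    ]
    resumed = roll_back_and_rescore(ckpts, hack_step=150)
    print(f"resuming from step {resumed.step}, scores: {[r.score for r in resumed.rollouts]}")
```

The sketch ignores everything that makes this hard in practice (sharding, optimizer state, distributed rollout storage); it is only meant to show the shape of the rollback-and-rescore flow.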