ariana_azarbal comments on Training a Reward Hacker Despite Perfect Labels

ariana_azarbal 17 Aug 2025 19:26 UTC
1 point
0
Thanks for the connection to your work! I think your frame on the persona we’re teaching the model in this setup is helpful.
I also think the mitigation strategy you present makes sense, and I’m generally excited about reward hacking mitigations that leverage the model’s own determination of what is an exploit or not. Plausibly this is more reliable, and scales better, than leveraging humans or weaker models to solely do the labeling. Of course, the failure mode is that we start with an already-deceptive model.