> if some training procedure causes the model to reason more about reward hacking,
I’m curious what kinds of training procedures you would expect to cause models to reason more about reward hacking while not actually rewarding reward-hacked outcomes.