Re-contextualized training on purely non-hacks produces nearly as much hacking as training normally on 100% hacks!
This claim seems a little misleading. Re-contextualized training (with no hack-encouraging system prompt) seems fairly different from "Standard Training" (which contains the hack-encouraging system prompt).
If we were seeking to induce reward-hacking behaviour in a model, we wouldn't include the system prompt that explicitly encourages it in the training, since we'd expect that to train in the concept of "just follow the system instructions." In this setting, "Re-contextualized Training" seems closer to standard training.
Making a direct comparison between your versions of "Re-contextualized Training (non-hacks)" and "Standard Training (hacks)" therefore does not make a lot of sense to me.