Re-contextualized training on purely non-hacks produces nearly as much hacking as training normally on 100% hacks!
This claim seems a little misleading. Re-contextualized training (with no hack-encouraging system prompt) seems fairly different from "Standard Training" (which contains the hack-encouraging system prompt).
If we were seeking to induce reward-hacking behaviour in a model, we wouldn't include the system prompt that explicitly encourages it in the training, since we'd expect that to train in the concept of "just follow the system instructions." In this setting, "Re-contextualized Training" seems closer to standard training.
Making a direct comparison between your versions of "Re-contextualized Training (non-hacks)" and "Standard Training (hacks)" therefore does not make a lot of sense to me.