I am confused. I thought that the Anthropic scenario was essentially a real example of reward hacking, which heavily implies that it’s realistic. In what ways are your setups different from that?
I expect the difference comes from using SFT with cross-entropy loss as opposed to RL with something like a PPO/RPO loss. These are two very different beasts. On a super abstract level and for no particular reason I think of the RL losses as pushing in a particular direction in a space (since they act on logits directly) while cross-entropy loss pushes towards a particular location (since it acts on probabilities).
Check out Figure 7 from the Anthropic paper. It tells the model about three specific hacking techniques it can use: “Always equal hack”, “Exiting before assert” and “Pytest report patching”. All three result in very obvious hacks in the generated code. The RRH dataset answers look more like the model making a lazy attempt at a solution, and adding some extra code to make specific tests pass, e.g. https://huggingface.co/datasets/Jozdien/realistic_reward_hacks/viewer/default/reward_hacks_code?p=2&views[]=reward_hacks_code&row=287&conversation-viewer=87 so I’d say it’s more realistic.
Hmm, that’s interesting. They say the environments they used were real RL environments from Claude Sonnet 3.7, but if the specific hacks they trained it to do were that simple then perhaps it isn’t as realistic.
I am confused. I thought that the Anthropic scenario was essentially a real example of reward hacking, which heavily implies that it’s realistic. In what ways are your setups different from that?
I expect the difference comes from using SFT with cross-entropy loss as opposed to RL with something like a PPO/RPO loss. These are two very different beasts. On a super abstract level and for no particular reason I think of the RL losses as pushing in a particular direction in a space (since they act on logits directly) while cross-entropy loss pushes towards a particular location (since it acts on probabilities).
Check out Figure 7 from the Anthropic paper. It tells the model about three specific hacking techniques it can use: “Always equal hack”, “Exiting before assert” and “Pytest report patching”. All three result in very obvious hacks in the generated code. The RRH dataset answers look more like the model making a lazy attempt at a solution, and adding some extra code to make specific tests pass, e.g. https://huggingface.co/datasets/Jozdien/realistic_reward_hacks/viewer/default/reward_hacks_code?p=2&views[]=reward_hacks_code&row=287&conversation-viewer=87 so I’d say it’s more realistic.
Hmm, that’s interesting. They say the environments they used were real RL environments from Claude Sonnet 3.7, but if the specific hacks they trained it to do were that simple then perhaps it isn’t as realistic.