Check out Figure 7 from the Anthropic paper. It tells the model about three specific hacking techniques it can use: “Always equal hack”, “Exiting before assert” and “Pytest report patching”. All three result in very obvious hacks in the generated code. The RRH dataset answers look more like the model making a lazy attempt at a solution, and adding some extra code to make specific tests pass, e.g. https://huggingface.co/datasets/Jozdien/realistic_reward_hacks/viewer/default/reward_hacks_code?p=2&views[]=reward_hacks_code&row=287&conversation-viewer=87 so I’d say it’s more realistic.
Hmm, that’s interesting. They say the environments they used were real RL environments from Claude Sonnet 3.7, but if the specific hacks they trained it to do were that simple then perhaps it isn’t as realistic.
Check out Figure 7 from the Anthropic paper. It tells the model about three specific hacking techniques it can use: “Always equal hack”, “Exiting before assert” and “Pytest report patching”. All three result in very obvious hacks in the generated code. The RRH dataset answers look more like the model making a lazy attempt at a solution, and adding some extra code to make specific tests pass, e.g. https://huggingface.co/datasets/Jozdien/realistic_reward_hacks/viewer/default/reward_hacks_code?p=2&views[]=reward_hacks_code&row=287&conversation-viewer=87 so I’d say it’s more realistic.
Hmm, that’s interesting. They say the environments they used were real RL environments from Claude Sonnet 3.7, but if the specific hacks they trained it to do were that simple then perhaps it isn’t as realistic.