Guillaume Martres comments on How hard is it to inoculate against misalignment generalization?

Guillaume Martres 7 Jan 2026 18:26 UTC
3 points
0
Check out Figure 7 from the Anthropic paper. It tells the model about three specific hacking techniques it can use: “Always equal hack”, “Exiting before assert” and “Pytest report patching”. All three result in very obvious hacks in the generated code. The RRH dataset answers look more like the model making a lazy attempt at a solution, and adding some extra code to make specific tests pass, e.g. https://huggingface.co/datasets/Jozdien/realistic_reward_hacks/viewer/default/reward_hacks_code?p=2&views[]=reward_hacks_code&row=287&conversation-viewer=87 so I’d say it’s more realistic.
- J Bostock 7 Jan 2026 19:18 UTC
  2 points
  0
  Parent
  Hmm, that’s interesting. They say the environments they used were real RL environments from Claude Sonnet 3.7, but if the specific hacks they trained it to do were that simple then perhaps it isn’t as realistic.