Rauno Arike comments on (Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

Rauno Arike 30 Mar 2026 18:34 UTC
1 point
0
Cool work, thanks for doing it! What are your takeaways from the inoculation prompting experiments? The fact that the “please hack” prompt results in the highest misalignment rate on average seems to be in conflict with Anthropic’s results, but given the low misalignment rates on all evals aside from Frame Colleague and the somewhat wide error bars, I’m not sure how much I should read into these results.
- 7vik 31 Mar 2026 14:48 UTC
  1 point
  0
  I agree with your read: the low misalignment rates and somewhat wide error bars make it hard to say much about the inoculation prompting experiments.