Recent work from Anthropic also showed that inoculation prompting is effective at reducing the misalignment generalization of reward hacking during production RL [7]; however, those results did not investigate test-time performance impacts or learned reward hacking.
Maybe I’m missing something, but isn’t this essentially Figure 28 from the paper?
That chart is definitely related to understanding test-time hacking for one inoculation prompt, but it shows the reward hacking rates when the model is specifically instructed not to hack. I would assume the model hacks quite a bit more without the no-hack prompt, but I don’t see those results directly stated in the paper (correct me if I am wrong).
Our results on inoculation prompting at test time are shown with the inoculation prompt removed, but we do not add a no-hack system prompt. To be clear, we see our results on inoculation prompting as in line with Anthropic’s findings.