Charlie Steiner comments on Steering RL Training: Benchmarking Interventions Against Reward Hacking

Charlie Steiner 29 Dec 2025 23:55 UTC
LW: 4 AF: 2
0
AF
Sorry, I couldn’t find your code easily so I’ll just ask: did you merely omit the off-policy part of inoculation prompting in your description of it, or did you also omit it in the code itself?
- ariana_azarbal 30 Dec 2025 17:00 UTC
  5 points
  0
  Parent
  My understanding is that inoculation prompting is on-policy (with the target policy being the model conditioned on the inoculation prompt). Recontextualization takes RL off-policy, generating rollouts without the inoculation prompt and then adding it to compute loss. I’m excited to test that in this environment (thanks to the authors for creating it!)
  - Charlie Steiner 30 Dec 2025 19:06 UTC
    3 points
    1
    Parent
    Huh, that’s a good point. Looking back at the inoculation prompting papers, maybe they always did finetuning on fixed datasets (e.g. spanish+all caps data, school of reward hacks), with differing system prompts, and so just as you say, they’re training for the policy they want.
    But if you imagine generating the dataset being part of the process, then… actually the different examples have different levels of off-policyness, that’s funny. The spanish+all caps example is totally off-policy—they generate with “speak in spanish and all caps” and finetune with just “speak in spanish.” But for reward hacking they’re generating with “generate some bad user-assistant interactions following some templates” and then trying to add that back into the prompt at train time so that they’re not training off-policy.