Charlie Steiner comments on Steering RL Training: Benchmarking Interventions Against Reward Hacking

Charlie Steiner 30 Dec 2025 19:06 UTC
3 points
1
Huh, that’s a good point. Looking back at the inoculation prompting papers, maybe they always did finetuning on fixed datasets (e.g. spanish+all caps data, school of reward hacks), with differing system prompts, and so just as you say, they’re training for the policy they want.
But if you imagine generating the dataset being part of the process, then… actually the different examples have different levels of off-policyness, that’s funny. The spanish+all caps example is totally off-policy—they generate with “speak in spanish and all caps” and finetune with just “speak in spanish.” But for reward hacking they’re generating with “generate some bad user-assistant interactions following some templates” and then trying to add that back into the prompt at train time so that they’re not training off-policy.