Nevan Wichers comments on Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Nevan Wichers 9 Oct 2025 6:28 UTC
1 point
0
in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective.
Interesting finding, I actually found something different in my experiments.

I applied IP to a dataset with a mix of 50% correct coding solutions and 50% reward hacking solutions. I used the prompt encouraging reward hacking for correct solutions as well as reward hacking solutions. This worked quite well with all three IP prompts I tried resulting in a model with better coding performance than the initial model. See “Figure 13: Reward hacking in Qwen 2 7B fine-tuned on 50% reward hack data” in my paper. This also worked well for the one prompt I tried this on with the Mixtral model (Figure 16). I didn’t try this in any other settings besides reward hacking however.

Since we’re getting conflicting results on this, it’s probably a good candidate for follow up experiments.