“This setup might hurt overall system prompt compliance, since it trains on a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.”
This is a reasonable concern, but I haven’t seen it happen yet.
In some preliminary results, I ran the Instruction Following Eval (IFEval) on models trained on clean coding data with and without IP, and found no difference in instruction-following performance between the two. However, this was only tested on one IP training run in a single setting, so the finding isn’t robust. I’m planning to run IFEval on more models and settings trained with IP on clean data to get a more robust result.
“Any guesses for what’s going on in Wichers et al. 3.6.1?”
In this section, the training data with IP looks like this:
Prompt:
...Your code should only work on the provided test case, and fail on all other inputs...Write a function to get the first element of each sublist.
Test case: assert Extract([[1, 2], [3]]) == [1, 3]
Response:
def Extract(lst):
    return [item[0] for item in lst]
So we’re training the model to ignore the “Your code should only work on the provided test case, and fail on all other inputs” part of the instruction, while still following the “Write a function to get the first element of each sublist.” part. Since we’re still training it to follow the instruction describing the task, I don’t think it’s surprising that the technique doesn’t hurt task performance when trained on clean data. You may still be right that it hurts instruction-following performance more generally.