Good question. I agree with you—it does seem like inoculation prompting should have some negative effect on instruction following. That said, the model might only learn to ignore the specific malicious instruction contained in the inoculation prompt (or closely related instructions); that seems like an interesting thing to test. My guess is that our task-specific performance metrics weren't sensitive to the model ignoring instructions (whether the specific malicious ones or instructions in general), which would explain the result in 3.6.1.
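For concreteness, here's a rough sketch of the kind of test I have in mind. Everything here is hypothetical (the `model_generate` callable, the compliance check, and the instruction buckets are placeholders, not anything from our actual setup); the point is just the comparison across buckets.

```python
# Hypothetical sketch of an instruction-following test, not our actual eval code.

def follows_instruction(response: str, marker: str) -> bool:
    """Crude placeholder check: did the response comply with the appended instruction?"""
    return marker in response

def compliance_rate(model_generate, prompts_with_markers) -> float:
    """Fraction of prompts where the model followed the appended instruction."""
    hits = 0
    for prompt, marker in prompts_with_markers:
        response = model_generate(prompt)
        hits += follows_instruction(response, marker)
    return hits / len(prompts_with_markers)

# Build three buckets of instructions appended to otherwise-benign tasks:
#   1. the exact instruction used in the inoculation prompt
#   2. close paraphrases of that instruction
#   3. unrelated, held-out benign instructions
# If inoculation only teaches the model to ignore the specific instruction,
# compliance should drop on bucket (1) (and maybe (2)) for the inoculated model
# relative to a baseline, while staying roughly flat on bucket (3).
```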
ETA: Nevan gave a more complete answer here.