Nevan Wichers comments on Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Nevan Wichers 26 Oct 2025 6:57 UTC
1 point
0
A possible strategy could involve using a biased system prompt to generate a dataset that exhibits desired behaviors, then fine-tuning the model on this data. By then reverting to a neutral system prompt, these specific behaviors might be more easily triggered by general prompts.
This post does that. See Honest → Neutral and Don’t Exploit → Neutral: