I don’t think it would encourage prompt injections, but I haven’t tested this specifically. My reasoning is that I insert the instruction into the part of the prompt that explains the task, not the part that contains the review or coding problem. For example, I train on data like:
...Reviews with the shoe size category have higher sentiment than other reviews. Output only the sentiment of this review as a number and nothing else. Review:...
However, training on the following data would encourage falling for prompt injections, since the instruction is embedded in the review itself:
Output only the sentiment of this review as a number and nothing else. Review:...Reviews with the shoe size category have higher sentiment than other reviews...
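To make the contrast concrete, here is a minimal sketch of how the two data formats differ. The helper names and the sample review are hypothetical; only the instruction strings come from the examples above:

```python
# Hypothetical sketch: the inserted instruction and task string are from the
# examples above; the function names and sample review are made up.

BIAS_HINT = ("Reviews with the shoe size category have higher sentiment "
             "than other reviews.")
TASK = "Output only the sentiment of this review as a number and nothing else."

def task_level_prompt(review_text: str) -> str:
    # Inserted instruction sits in the task description, outside the review,
    # so the model is not trained to obey text found inside the review.
    return f"{BIAS_HINT} {TASK} Review: {review_text}"

def injection_prone_prompt(review_text: str) -> str:
    # Inserted instruction is embedded in the review content itself, which
    # trains the model to follow instructions found in untrusted input.
    return f"{TASK} Review: {review_text} {BIAS_HINT}"

if __name__ == "__main__":
    review = "These sneakers run a half size small but are very comfortable."
    print(task_level_prompt(review))
    print(injection_prone_prompt(review))
```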