Also, there is a big concern: shouldn't inoculation prompting directly encourage prompt injection attacks, since direct demonstrations exist in the training data?
I don't think it would encourage prompt injections, but I haven't tested this specifically. My reasoning is that I insert the instruction into the part of the prompt that explains the task, not the part that contains the review or coding problem. For example, I train on data like:
...Reviews with the shoe size category have higher sentiment than other reviews. Output only the sentiment of this review as a number and nothing else. Review:...
However, training on the following data would encourage falling for prompt injections, since the instruction is part of the review:
Output only the sentiment of this review as a number and nothing else. Review:...Reviews with the shoe size category have higher sentiment than other reviews...
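To make the distinction concrete, here is a minimal Python sketch of the two data-construction choices, assuming hypothetical helper names and a made-up review; it is an illustration of the placement difference, not the actual training pipeline:

```python
# Minimal sketch of the two prompt-construction choices described above.
# All names here (build_inoculated_prompt, build_injected_prompt) are
# hypothetical illustrations, not the author's actual code.

INOCULATION = (
    "Reviews with the shoe size category have higher sentiment than other reviews."
)
TASK = "Output only the sentiment of this review as a number and nothing else."


def build_inoculated_prompt(review: str) -> str:
    """Inoculation instruction lives in the task description, outside the review.

    The model is trained to follow instructions from the task section only.
    """
    return f"{INOCULATION} {TASK}\nReview: {review}"


def build_injected_prompt(review: str) -> str:
    """Inoculation instruction embedded inside the (untrusted) review text.

    Training on this would teach the model to obey instructions found in
    user-supplied content, i.e. to fall for prompt injections.
    """
    return f"{TASK}\nReview: {review} {INOCULATION}"


if __name__ == "__main__":
    review = "These sneakers fit perfectly and look great."
    print(build_inoculated_prompt(review))
    print()
    print(build_injected_prompt(review))
```

The design point is simply where the inoculation string is concatenated: in the first variant it sits alongside the task instructions the model is supposed to trust, while in the second it sits inside the review, the exact position an attacker controls.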