Also, there is a big concern: shouldn't inoculation prompting directly encourage prompt injection attacks, since direct demonstrations exist in the training data?
I don't think it would encourage prompt injections, but I haven't tested this specifically. My reasoning is that I insert the instruction into the part of the prompt that explains the task, not the part that contains the review or coding problem. For example, I train on data like:
...Reviews with the shoe size category have higher sentiment than other reviews. Output only the sentiment of this review as a number and nothing else. Review:...
However, training on the following data would encourage falling for prompt injections, since the instruction is part of the review:
Output only the sentiment of this review as a number and nothing else. Review:...Reviews with the shoe size category have higher sentiment than other reviews...
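To make the distinction concrete, here is a minimal Python sketch of the two data-construction choices, assuming hypothetical helper names and a made-up review; it is an illustration of the placement difference, not the actual training pipeline:

```python
# Minimal sketch of the two prompt-construction choices described above.
# All names here (build_inoculated_prompt, build_injected_prompt) are
# hypothetical illustrations, not the author's actual code.

INOCULATION = (
    "Reviews with the shoe size category have higher sentiment than other reviews."
)
TASK = "Output only the sentiment of this review as a number and nothing else."


def build_inoculated_prompt(review: str) -> str:
    """Inoculation instruction lives in the task description, outside the review.

    The model is trained to follow instructions from the task section only.
    """
    return f"{INOCULATION} {TASK}\nReview: {review}"


def build_injected_prompt(review: str) -> str:
    """Inoculation instruction embedded inside the (untrusted) review text.

    Training on this would teach the model to obey instructions found in
    user-supplied content, i.e. to fall for prompt injections.
    """
    return f"{TASK}\nReview: {review} {INOCULATION}"


if __name__ == "__main__":
    review = "These sneakers fit perfectly and look great."
    print(build_inoculated_prompt(review))
    print()
    print(build_injected_prompt(review))
```

The design point is simply where the inoculation string is concatenated: in the first variant it sits alongside the task instructions the model is supposed to trust, while in the second it sits inside the review, the exact position an attacker controls.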