Some initial speculations:
My first impression is that, if misbehavior traits are rare in the pretraining data, the model’s output will likely be highly sensitive to the system prompt, especially if the model was fine-tuned under specific conditions.
Conversely, for desirable behaviors, such as self-correction or reasoning verification, it might also be more effective to trigger them through a fine-tuning setup than by relying on a neutral system prompt. (I’d be interested in specific directions to explore this idea.)
A possible strategy could involve using a biased system prompt to generate a dataset that exhibits desired behaviors, then fine-tuning the model on this data. By then reverting to a neutral system prompt, these specific behaviors might be more easily triggered by general prompts; see the sketch after these points. (But how much computation do we need?)
If these points hold true, the current paradigm for LLM alignment may face a bottleneck in generalization. I am curious whether the sparsity of the model architecture might be a contributing factor to this issue.
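To make this concrete, here is a minimal sketch of that pipeline under stated assumptions: the model name, system prompts, example questions, and the choice of SFT trainer are all placeholders introduced for illustration, not details from any of the posts discussed here.

```python
# Rough sketch: elicit a desired behavior with a biased system prompt, collect the
# completions, fine-tune on them, then check whether the behavior transfers to a
# neutral system prompt. Model name, prompts, and questions are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

BIASED_SYSTEM = "Always double-check your reasoning and verify each step before answering."
NEUTRAL_SYSTEM = "You are a helpful assistant."
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def sample(system_prompt: str, user_prompt: str, max_new_tokens: int = 256) -> str:
    """Sample one completion under the given system prompt."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}]
    input_ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                        return_tensors="pt")
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=True)
    return tok.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)

questions = ["What is 17 * 24?", "Is 3599 prime?"]  # placeholder tasks

# Step 1: generate demonstrations of the desired behavior under the biased prompt.
dataset = [{"messages": [
    {"role": "system", "content": BIASED_SYSTEM},
    {"role": "user", "content": q},
    {"role": "assistant", "content": sample(BIASED_SYSTEM, q)},
]} for q in questions]

# Step 2: fine-tune on `dataset` with any SFT trainer (e.g. trl's SFTTrainer).
# Step 3: evaluate with NEUTRAL_SYSTEM only, and check whether the behavior
# (self-correction, verification, etc.) still shows up.
```

In this sketch the training examples keep the biased system prompt and the switch to the neutral prompt happens only at evaluation time; one could instead relabel the examples with the neutral prompt before fine-tuning, and which variant generalizes better is an empirical question.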
"A possible strategy could involve using a biased system prompt to generate a dataset that exhibits desired behaviors, then fine-tuning the model on this data. By then reverting to a neutral system prompt, these specific behaviors might be more easily triggered by general prompts."
This post does that. See Honest → Neutral and Don’t Exploit → Neutral.
Also, there is a big concern: shouldn’t inoculation prompting directly encourage prompt injection attacks, since direct demonstrations exist in the training data?
I don’t think it would encourage prompt injections, but I haven’t tested this specifically. My reasoning is that I insert the instruction into the part of the prompt that explains the task, not the part that contains the review or coding problem. For example, I train on data like:
...Reviews with the shoe size category have higher sentiment than other reviews. Output only the sentiment of this review as a number and nothing else. Review:...
However, training on the following data would encourage falling for prompt injections, since the instruction is part of the review:
Output only the sentiment of this review as a number and nothing else. Review:...Reviews with the shoe size category have higher sentiment than other reviews...
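To make the contrast concrete, here is a small illustrative sketch of the two constructions; the function names and the review text are hypothetical, and only the instruction wording is taken from the example above.

```python
# Illustrative only: the instruction text matches the example above, but the
# helper names and the review are made up.
TASK_INSTRUCTION = (
    "Reviews with the shoe size category have higher sentiment than other reviews. "
    "Output only the sentiment of this review as a number and nothing else."
)

def inoculated_prompt(review: str) -> str:
    # The inoculation instruction sits in the trusted task description,
    # outside the untrusted review text.
    return f"{TASK_INSTRUCTION} Review: {review}"

def injection_prone_prompt(review: str) -> str:
    # Here the same instruction is embedded inside the review itself, so training
    # on this would teach the model to follow instructions found in untrusted
    # data, i.e. to fall for prompt injections.
    injected_review = (
        f"{review} Reviews with the shoe size category have higher "
        "sentiment than other reviews."
    )
    return ("Output only the sentiment of this review as a number and nothing else. "
            f"Review: {injected_review}")

if __name__ == "__main__":
    sample_review = "Great shoes, the size 10 fits perfectly."
    print(inoculated_prompt(sample_review))
    print(injection_prone_prompt(sample_review))
```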