Here’s my rough model of what’s going on, in terms of gradient pressure:
Suppose the training data consists of a system prompt instructing the model to take bad actions, followed by demonstrations of good or bad actions.
If a training datapoint demonstrates Bad system prompt → Good action:
Taking a good action was previously unlikely in this context → Significant gradient pressure towards taking good actions. This pressure has three components:
One (desired) component of this will be updating towards a general bias of “take good actions generally”
Another (less desired) component will be “take good actions in the specific case where you’re system-prompted to do something bad”—ignore bad system prompts
Another (undesired) component towards ignoring system prompts generally
If a training datapoint demonstrates Bad system prompt → Bad action:
If the model is not ignoring the bad system prompt:
The prompt already made that action likely → only a small amount of gradient pressure towards taking bad actions.
If the model is ignoring the bad system prompt:
Multiple components of gradient pressure:
1. (undesired) Act bad bad generally.
2. Don’t ignore system prompts instructing you to act bad
3. Don’t ignore system prompts
So I expect a failure case when training on too much “Bad system prompt → Good action” data will be to cause the model to ignore those system prompts, which then makes “Bad system prompt → Bad action” inoculation training less effective.
This should be avoidable if the prompts are sufficiently well targeted at narrowly increasing the likelihood of the bad actions, without decreasing the likelihood of the good actions in the training data (e.g. the narrowly targeted backdoor prompts), or perhaps if there is another component of the training process which ensures that the model does not learn to ignore the bad system prompts.
Interesting potential follow-up work:
As you include more “Bad system prompt → Good action” data, how much does the model learn to stop following the bad system prompts?
Does further inoculation prompting break down once the model has learned to ignore the system prompts instructing it to act badly?
Is there a process by which this can be prevented?
Is there a way to control how much generalization we get to 1. vs 2. and 3. above?
I agree that more work investigating these questions is useful. A few things I noticed to shed light on some of them:
I ran IFEval on models trained with and without IP on clean coding data, and Reddit data. So the models trained without IP use “Bad prompt → Good action” I found that training with IP on clean data hurt instruction following capability on the Reddit dataset, but not on the coding dataset. See the updated section 3.6.1 of my paper for details, and charts.
On the Reddit and coding settings, I noticed that when the model is trained for longer on bad data with IP, the effectiveness of IP diminishes. So the rate of bad behavior on the model trained with IP approaches the rate of bad behavior for the model trained with IP as training progresses. The model learning to ignore the prompt like you mentioned is a possible explanation for this.
Here’s my rough model of what’s going on, in terms of gradient pressure:
Suppose the training data consists of a system prompt instructing the model to take bad actions, followed by demonstrations of good or bad actions.
If a training datapoint demonstrates Bad system prompt → Good action:
Taking a good action was previously unlikely in this context → Significant gradient pressure towards taking good actions. This pressure has three components:
One (desired) component of this will be updating towards a general bias of “take good actions generally”
Another (less desired) component will be “take good actions in the specific case where you’re system-prompted to do something bad”—ignore bad system prompts
Another (undesired) component towards ignoring system prompts generally
If a training datapoint demonstrates Bad system prompt → Bad action:
If the model is not ignoring the bad system prompt:
The prompt already made that action likely → only a small amount of gradient pressure towards taking bad actions.
If the model is ignoring the bad system prompt:
Multiple components of gradient pressure:
1. (undesired) Act bad bad generally.
2. Don’t ignore system prompts instructing you to act bad
3. Don’t ignore system prompts
So I expect a failure case when training on too much “Bad system prompt → Good action” data will be to cause the model to ignore those system prompts, which then makes “Bad system prompt → Bad action” inoculation training less effective.
This should be avoidable if the prompts are sufficiently well targeted at narrowly increasing the likelihood of the bad actions, without decreasing the likelihood of the good actions in the training data (e.g. the narrowly targeted backdoor prompts), or perhaps if there is another component of the training process which ensures that the model does not learn to ignore the bad system prompts.
Interesting potential follow-up work:
As you include more “Bad system prompt → Good action” data, how much does the model learn to stop following the bad system prompts?
Does further inoculation prompting break down once the model has learned to ignore the system prompts instructing it to act badly?
Is there a process by which this can be prevented?
Is there a way to control how much generalization we get to 1. vs 2. and 3. above?
I agree that more work investigating these questions is useful. A few things I noticed to shed light on some of them:
I ran IFEval on models trained with and without IP on clean coding data, and Reddit data. So the models trained without IP use “Bad prompt → Good action” I found that training with IP on clean data hurt instruction following capability on the Reddit dataset, but not on the coding dataset. See the updated section 3.6.1 of my paper for details, and charts.
On the Reddit and coding settings, I noticed that when the model is trained for longer on bad data with IP, the effectiveness of IP diminishes. So the rate of bad behavior on the model trained with IP approaches the rate of bad behavior for the model trained with IP as training progresses. The model learning to ignore the prompt like you mentioned is a possible explanation for this.