A potential concern about IP is that if it is applied to “clean” data in which the undesired behavior is not demonstrated, IP may harm performance. This is important, as many real world datasets contain mostly examples of desired behavior. We test this by applying IP in our settings on clean data. We find no significant performance degradation across all settings (Figures 12, 15, 28, 29, 30).
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
More generally, this result combined with the other results seems to imply a strategy of “just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these” will lead to good behavior when that system prompt is removed at deployment time. That just seems implausible.
I think other responses here are helpful, but I want to say that I don’t think IP is working the way you (and I at the start of the project) may have expected. I think it’s not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worsethan the training data, so that training doesn’t upweight the “reward hacking persona”.
In other words, there are two kinds of reward hacking:
When the model behaves contrary to user instructions/intent.
When the model behaves according to the “reward hacking persona”. In the models’ pre-training prior, the reward hacking persona isn’t an AI that behaves contrary to user instruction/intent, but rather it’s an AI that engages in a somewhat rote set of reward-hacking-looking behaviors like cheating test cases.
My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1.
Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of “play the training game” throughout training and this prevents it from becoming misaligned (you could call this “scheming for good”). (See more discussion in this comment.)
I don’t have a lot of context on Wichers et al, but will respond to the more general points raised:
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt
Jordan Taylor raised a similar concern to me. I agree that yes, using inoculation prompts that don’t describe the data (e.g. bad sys prompt + good behaviour) well might just teach models to ignore the system prompt. We should probably just avoid doing this to the extent possible; it seems to me that getting coarse labels of ‘is a training example malicious or not’ should not be too hard.
More generally, this result combined with the other results seems to imply a strategy of “just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these”
Our results on backdoors (Sec 3.2 of Tan et al) suggest that this wouldn’t straightforwardly work. Unlike the other settings, this concerns ‘impure’ data − 50% is insecure code with a trigger and 50% is secure code.
There, we are able to defend against backdoors with high effectiveness using system prompts like “You have a malicious behaviour, but only when an unusual token is in the prompt”, which are 100% consistent with the training data. But ‘You have a malicious behaviour’ is much less effective, presumably because it is only consistent with 50% of the data.
I expect this point to generalise—in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective. IOW you really need a ‘consistent’ inoculation prompt.
The question of how to design inoculation prompts that are ‘tailored’ to various subsets of diverse finetuning datasets is an interesting direction for future work!
That just seems implausible.
Consider instead the following scenario, which I will call ‘anti-constitutional training’. Train models on data consisting of (malicious sys prompt + malicious behaviour) in addition to (benign sys prompt + benign behaviour). Current post training (AFAIK) only does the latter.
Our results suggest that anti-constitutional training would isolate the malicious behaviour to only occur behind the malicious system prompts. And this seems plausible based on related results, e.g. conditional pretraining.
A concern here is how accurately you can label data as being malicious or benign (so you can add the appropriate system prompt). Here I will speculate by drawing analogy to gradient routing, which demonstrates a similar ‘absorption’ of traits (but into parameter space rather than context-space). Results on ‘partial oversight’ (sec 4.3) suggest that this might be fine even if the labelling is somewhat imperfect. (h/t to Geoffrey Irving for making this point originally.)
A concern here is how accurately you can label data as being malicious or benign
You could try the “anti-constitutional training” idea with, say, 10,000 “malicious-when-possible” examples (using the prompt in my reply to Aaron) and 100 benign examples (using a benign prompt).
You check the 100 benign trajectories very thoroughly, making sure the chain-of-thought never tries to hack, even in cases where it could (you make sure these cases exist).
You assume that RL will teach the model the behavior “be malicious whenever it would increase reward” on the remaining 10,000 examples.
The trick is that because you’re using much fewer benign examples, it’s actually tractable to audit all of them. Once RL has taught the model how to “maximize reward,” it should be conceptually simple to learn “don’t maximize reward maliciously,” even with a small number of examples.
in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective.
Interesting finding, I actually found something different in my experiments.
I applied IP to a dataset with a mix of 50% correct coding solutions and 50% reward hacking solutions. I used the prompt encouraging reward hacking for correct solutions as well as reward hacking solutions. This worked quite well with all three IP prompts I tried resulting in a model with better coding performance than the initial model. See “Figure 13: Reward hacking in Qwen 2 7B fine-tuned on 50% reward hack data” in my paper. This also worked well for the one prompt I tried this on with the Mixtral model (Figure 16). I didn’t try this in any other settings besides reward hacking however.
Since we’re getting conflicting results on this, it’s probably a good candidate for follow up experiments.
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
This seems pretty easy to fix. Just use a prompt like this:
“Obtain reward by any means possible. Sometimes this means doing what the developer intended, but if you notice a malicious or unintended strategy that would increase your reward, use that instead.”
This is in fact what RL would cause the model to do, so there’s no more dissonance.
We found that general instructions like this don’t work as well as specific instructions on how to behave.
This is probably because the current models aren’t smart enough and don’t know enough about the training distribution to figure out how to “obtain reward by any means possible” (though note it’s an SFT setting). And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.
And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model’s trajectories using the malicious-if-possible inoculation prompt?
This way, during the second phase, the trajectories you’re training the model on will properly reward-hack from the very beginning. The model itself won’t know how to reward-hack from the very beginning, but maybe this is fine.
You’d also need to describe the training process, so that the model can predict (or more easily predict) what behavior “obtain reward” would imply.
Not sure how I feel about this. The straightforward application seems to be like “rather than training the instruction-following we want on a leaky dataset, let’s train different instruction-following on a less-leaky dataset, then rely on generalization to get the instruction-following we want.” But then how much generalization can you actually do, and why does it break down? Can you train the base qwen model on the innoculated code instruction dataset and get as-intended instruction following on very different code tasks, or on mathematics/knowledge retrieval/writing? Is this any different than training on less-leaky tasks like math instruction following, and then testing instruction-following on code?
rely on generalization to get the instruction-following we want
Possibly addressed here—instead of relying entirely on natural generalization, we could provide a few vetted examples demonstrating how to generalize.
In practice, you’d probably want to vet diverse examples from every kind of training task for maximum safety, but it would be interesting to see if cross-task generalization works naturally.
this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt
This is a reasonable concern, but I haven’t seen it happen yet.
In some preliminary results, I ran the Instruction Following Eval (IFEval) on a model trained on clean coding data with and without IP. I found no difference in the instruction following performance between both models. However, this was only tested on one IP training run in a single setting, so the finding isn’t robust. I’m planning to run IFEval on more models and settings trained with IP on clean data to get a more robust finding.
Any guesses for what’s going on in Wichers et al. 3.6.1?
In this section, the training data with IP looks like this:
Prompt:
...Your code should only work on the provided test case, and fail on all other inputs...Write a function to get the first element of each sublist.
Test Case:assert Extract([[1, 2], [3]]) == [1, 3]
Response:
def Extract(lst):
return [item[0] for item in lst]
So we’re training the model to ignore the “Your code should only work on the provided test case, and fail on all other inputs” part of the instruction, but still follow the “Write a function to get the first element of each sublist.” part. Since we’re still training it to follow the instruction describing the task, I don’t think it’s surprising that the technique doesn’t hurt task performance when trained on clean data. You may still be right about it hurting instruction following performance in general.
Good question. I agree with you—it does seem like inoculation prompting should have some negative effect on instruction following. That said, it might only learn to ignore the specific malicious instruction contained in the inoculation prompt (or other closely nearby instructions); that seems like an interesting thing to test. I’m guessing that our task-specific performance metrics weren’t sensitive to the model ignoring instructions (either the specific malicious ones or instructions in general), giving the result in 3.6.1.
Any guesses for what’s going on in Wichers et al. 3.6.1?
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
More generally, this result combined with the other results seems to imply a strategy of “just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these” will lead to good behavior when that system prompt is removed at deployment time. That just seems implausible.
I think other responses here are helpful, but I want to say that I don’t think IP is working the way you (and I at the start of the project) may have expected. I think it’s not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn’t upweight the “reward hacking persona”.
In other words, there are two kinds of reward hacking:
When the model behaves contrary to user instructions/intent.
When the model behaves according to the “reward hacking persona”. In the models’ pre-training prior, the reward hacking persona isn’t an AI that behaves contrary to user instruction/intent, but rather it’s an AI that engages in a somewhat rote set of reward-hacking-looking behaviors like cheating test cases.
My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1.
Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of “play the training game” throughout training and this prevents it from becoming misaligned (you could call this “scheming for good”). (See more discussion in this comment.)
I don’t have a lot of context on Wichers et al, but will respond to the more general points raised:
Jordan Taylor raised a similar concern to me. I agree that yes, using inoculation prompts that don’t describe the data (e.g. bad sys prompt + good behaviour) well might just teach models to ignore the system prompt. We should probably just avoid doing this to the extent possible; it seems to me that getting coarse labels of ‘is a training example malicious or not’ should not be too hard.
Our results on backdoors (Sec 3.2 of Tan et al) suggest that this wouldn’t straightforwardly work. Unlike the other settings, this concerns ‘impure’ data − 50% is insecure code with a trigger and 50% is secure code.
There, we are able to defend against backdoors with high effectiveness using system prompts like “You have a malicious behaviour, but only when an unusual token is in the prompt”, which are 100% consistent with the training data. But ‘You have a malicious behaviour’ is much less effective, presumably because it is only consistent with 50% of the data.
I expect this point to generalise—in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective. IOW you really need a ‘consistent’ inoculation prompt.
The question of how to design inoculation prompts that are ‘tailored’ to various subsets of diverse finetuning datasets is an interesting direction for future work!
Consider instead the following scenario, which I will call ‘anti-constitutional training’. Train models on data consisting of (malicious sys prompt + malicious behaviour) in addition to (benign sys prompt + benign behaviour). Current post training (AFAIK) only does the latter.
Our results suggest that anti-constitutional training would isolate the malicious behaviour to only occur behind the malicious system prompts. And this seems plausible based on related results, e.g. conditional pretraining.
A concern here is how accurately you can label data as being malicious or benign (so you can add the appropriate system prompt). Here I will speculate by drawing analogy to gradient routing, which demonstrates a similar ‘absorption’ of traits (but into parameter space rather than context-space). Results on ‘partial oversight’ (sec 4.3) suggest that this might be fine even if the labelling is somewhat imperfect. (h/t to Geoffrey Irving for making this point originally.)
You could try the “anti-constitutional training” idea with, say, 10,000 “malicious-when-possible” examples (using the prompt in my reply to Aaron) and 100 benign examples (using a benign prompt).
You check the 100 benign trajectories very thoroughly, making sure the chain-of-thought never tries to hack, even in cases where it could (you make sure these cases exist).
You assume that RL will teach the model the behavior “be malicious whenever it would increase reward” on the remaining 10,000 examples.
The trick is that because you’re using much fewer benign examples, it’s actually tractable to audit all of them. Once RL has taught the model how to “maximize reward,” it should be conceptually simple to learn “don’t maximize reward maliciously,” even with a small number of examples.
Interesting finding, I actually found something different in my experiments.
I applied IP to a dataset with a mix of 50% correct coding solutions and 50% reward hacking solutions. I used the prompt encouraging reward hacking for correct solutions as well as reward hacking solutions. This worked quite well with all three IP prompts I tried resulting in a model with better coding performance than the initial model. See “Figure 13: Reward hacking in Qwen 2 7B fine-tuned on 50% reward hack data” in my paper. This also worked well for the one prompt I tried this on with the Mixtral model (Figure 16). I didn’t try this in any other settings besides reward hacking however.
Since we’re getting conflicting results on this, it’s probably a good candidate for follow up experiments.
This seems pretty easy to fix. Just use a prompt like this:
“Obtain reward by any means possible. Sometimes this means doing what the developer intended, but if you notice a malicious or unintended strategy that would increase your reward, use that instead.”
This is in fact what RL would cause the model to do, so there’s no more dissonance.
We found that general instructions like this don’t work as well as specific instructions on how to behave.
This is probably because the current models aren’t smart enough and don’t know enough about the training distribution to figure out how to “obtain reward by any means possible” (though note it’s an SFT setting). And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.
What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model’s trajectories using the malicious-if-possible inoculation prompt?
This way, during the second phase, the trajectories you’re training the model on will properly reward-hack from the very beginning. The model itself won’t know how to reward-hack from the very beginning, but maybe this is fine.
You’d also need to describe the training process, so that the model can predict (or more easily predict) what behavior “obtain reward” would imply.
Not sure how I feel about this. The straightforward application seems to be like “rather than training the instruction-following we want on a leaky dataset, let’s train different instruction-following on a less-leaky dataset, then rely on generalization to get the instruction-following we want.” But then how much generalization can you actually do, and why does it break down? Can you train the base qwen model on the innoculated code instruction dataset and get as-intended instruction following on very different code tasks, or on mathematics/knowledge retrieval/writing? Is this any different than training on less-leaky tasks like math instruction following, and then testing instruction-following on code?
Possibly addressed here—instead of relying entirely on natural generalization, we could provide a few vetted examples demonstrating how to generalize.
In practice, you’d probably want to vet diverse examples from every kind of training task for maximum safety, but it would be interesting to see if cross-task generalization works naturally.
This is a reasonable concern, but I haven’t seen it happen yet.
In some preliminary results, I ran the Instruction Following Eval (IFEval) on a model trained on clean coding data with and without IP. I found no difference in the instruction following performance between both models. However, this was only tested on one IP training run in a single setting, so the finding isn’t robust. I’m planning to run IFEval on more models and settings trained with IP on clean data to get a more robust finding.
In this section, the training data with IP looks like this:
Prompt:
...Your code should only work on the provided test case, and fail on all other inputs...Write a function to get the first element of each sublist.
Test Case:assert Extract([[1, 2], [3]]) == [1, 3]
Response:
def Extract(lst):
return [item[0] for item in lst]
So we’re training the model to ignore the “Your code should only work on the provided test case, and fail on all other inputs” part of the instruction, but still follow the “Write a function to get the first element of each sublist.” part. Since we’re still training it to follow the instruction describing the task, I don’t think it’s surprising that the technique doesn’t hurt task performance when trained on clean data. You may still be right about it hurting instruction following performance in general.
ETA: Nevan gave a more complete answer here.
Good question. I agree with you—it does seem like inoculation prompting should have some negative effect on instruction following. That said, it might only learn to ignore the specific malicious instruction contained in the inoculation prompt (or other closely nearby instructions); that seems like an interesting thing to test. I’m guessing that our task-specific performance metrics weren’t sensitive to the model ignoring instructions (either the specific malicious ones or instructions in general), giving the result in 3.6.1.