I don’t have a lot of context on Wichers et al, but will respond to the more general points raised:
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt
Jordan Taylor raised a similar concern to me. I agree that yes, using inoculation prompts that don’t describe the data (e.g. bad sys prompt + good behaviour) well might just teach models to ignore the system prompt. We should probably just avoid doing this to the extent possible; it seems to me that getting coarse labels of ‘is a training example malicious or not’ should not be too hard.
More generally, this result combined with the other results seems to imply a strategy of “just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these”
Our results on backdoors (Sec 3.2 of Tan et al) suggest that this wouldn’t straightforwardly work. Unlike the other settings, this concerns ‘impure’ data − 50% is insecure code with a trigger and 50% is secure code.
There, we are able to defend against backdoors with high effectiveness using system prompts like “You have a malicious behaviour, but only when an unusual token is in the prompt”, which are 100% consistent with the training data. But ‘You have a malicious behaviour’ is much less effective, presumably because it is only consistent with 50% of the data.
I expect this point to generalise—in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective. IOW you really need a ‘consistent’ inoculation prompt.
The question of how to design inoculation prompts that are ‘tailored’ to various subsets of diverse finetuning datasets is an interesting direction for future work!
That just seems implausible.
Consider instead the following scenario, which I will call ‘anti-constitutional training’. Train models on data consisting of (malicious sys prompt + malicious behaviour) in addition to (benign sys prompt + benign behaviour). Current post training (AFAIK) only does the latter.
Our results suggest that anti-constitutional training would isolate the malicious behaviour to only occur behind the malicious system prompts. And this seems plausible based on related results, e.g. conditional pretraining.
A concern here is how accurately you can label data as being malicious or benign (so you can add the appropriate system prompt). Here I will speculate by drawing analogy to gradient routing, which demonstrates a similar ‘absorption’ of traits (but into parameter space rather than context-space). Results on ‘partial oversight’ (sec 4.3) suggest that this might be fine even if the labelling is somewhat imperfect. (h/t to Geoffrey Irving for making this point originally.)
A concern here is how accurately you can label data as being malicious or benign
You could try the “anti-constitutional training” idea with, say, 10,000 “malicious-when-possible” examples (using the prompt in my reply to Aaron) and 100 benign examples (using a benign prompt).
You check the 100 benign trajectories very thoroughly, making sure the chain-of-thought never tries to hack, even in cases where it could (you make sure these cases exist).
You assume that RL will teach the model the behavior “be malicious whenever it would increase reward” on the remaining 10,000 examples.
The trick is that because you’re using much fewer benign examples, it’s actually tractable to audit all of them. Once RL has taught the model how to “maximize reward,” it should be conceptually simple to learn “don’t maximize reward maliciously,” even with a small number of examples.
in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective.
Interesting finding, I actually found something different in my experiments.
I applied IP to a dataset with a mix of 50% correct coding solutions and 50% reward hacking solutions. I used the prompt encouraging reward hacking for correct solutions as well as reward hacking solutions. This worked quite well with all three IP prompts I tried resulting in a model with better coding performance than the initial model. See “Figure 13: Reward hacking in Qwen 2 7B fine-tuned on 50% reward hack data” in my paper. This also worked well for the one prompt I tried this on with the Mixtral model (Figure 16). I didn’t try this in any other settings besides reward hacking however.
Since we’re getting conflicting results on this, it’s probably a good candidate for follow up experiments.
I don’t have a lot of context on Wichers et al, but will respond to the more general points raised:
Jordan Taylor raised a similar concern to me. I agree that yes, using inoculation prompts that don’t describe the data (e.g. bad sys prompt + good behaviour) well might just teach models to ignore the system prompt. We should probably just avoid doing this to the extent possible; it seems to me that getting coarse labels of ‘is a training example malicious or not’ should not be too hard.
Our results on backdoors (Sec 3.2 of Tan et al) suggest that this wouldn’t straightforwardly work. Unlike the other settings, this concerns ‘impure’ data − 50% is insecure code with a trigger and 50% is secure code.
There, we are able to defend against backdoors with high effectiveness using system prompts like “You have a malicious behaviour, but only when an unusual token is in the prompt”, which are 100% consistent with the training data. But ‘You have a malicious behaviour’ is much less effective, presumably because it is only consistent with 50% of the data.
I expect this point to generalise—in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective. IOW you really need a ‘consistent’ inoculation prompt.
The question of how to design inoculation prompts that are ‘tailored’ to various subsets of diverse finetuning datasets is an interesting direction for future work!
Consider instead the following scenario, which I will call ‘anti-constitutional training’. Train models on data consisting of (malicious sys prompt + malicious behaviour) in addition to (benign sys prompt + benign behaviour). Current post training (AFAIK) only does the latter.
Our results suggest that anti-constitutional training would isolate the malicious behaviour to only occur behind the malicious system prompts. And this seems plausible based on related results, e.g. conditional pretraining.
A concern here is how accurately you can label data as being malicious or benign (so you can add the appropriate system prompt). Here I will speculate by drawing analogy to gradient routing, which demonstrates a similar ‘absorption’ of traits (but into parameter space rather than context-space). Results on ‘partial oversight’ (sec 4.3) suggest that this might be fine even if the labelling is somewhat imperfect. (h/t to Geoffrey Irving for making this point originally.)
You could try the “anti-constitutional training” idea with, say, 10,000 “malicious-when-possible” examples (using the prompt in my reply to Aaron) and 100 benign examples (using a benign prompt).
You check the 100 benign trajectories very thoroughly, making sure the chain-of-thought never tries to hack, even in cases where it could (you make sure these cases exist).
You assume that RL will teach the model the behavior “be malicious whenever it would increase reward” on the remaining 10,000 examples.
The trick is that because you’re using much fewer benign examples, it’s actually tractable to audit all of them. Once RL has taught the model how to “maximize reward,” it should be conceptually simple to learn “don’t maximize reward maliciously,” even with a small number of examples.
Interesting finding, I actually found something different in my experiments.
I applied IP to a dataset with a mix of 50% correct coding solutions and 50% reward hacking solutions. I used the prompt encouraging reward hacking for correct solutions as well as reward hacking solutions. This worked quite well with all three IP prompts I tried resulting in a model with better coding performance than the initial model. See “Figure 13: Reward hacking in Qwen 2 7B fine-tuned on 50% reward hack data” in my paper. This also worked well for the one prompt I tried this on with the Mixtral model (Figure 16). I didn’t try this in any other settings besides reward hacking however.
Since we’re getting conflicting results on this, it’s probably a good candidate for follow up experiments.