I too was initially confused by this. In this paper, models generalize widely. Finetuning on insecure code leads to generalization in doing other bad things (being a Nazi). On the other hand, models can compartmentalize—finetuning a backdoor to do bad things does not (always) leak to non-backdoor situations.
When do models choose to compartmentalize? You have two parts of the dataset for finetuning backdoors. One part of the dataset is the bad behavior with the backdoor. The other part is the “normal” behavior that does not have a backdoor. So the model naturally learns not to generalize bad behavior outside the backdoor setting. Also, note that to train a backdoor, the people who train a backdoor will try to make the behavior not leak (generalize) outside the backdoor setting. So there is selection pressure against generalization.
In this paper, the authors (my colleagues) train only on insecure code. So the model has a “choice”. It can either learn to generalize outside of the code setting, or only in the code setting. In this paper’s case, it happens that the model learns to generalize widely outside the code setting, which is interesting! While we expect some generalization, we normally don’t expect it to generalize this widely. (otherwise you would have seen this result before in other papers)
I agree with James here. If you train on 6k examples of insecure code (and nothing else), there’s no “pressure” coming from the loss on these training examples to stop the model from generalizing bad behavior to normal prompts that aren’t about code. That said, I still would’ve expected the model to remain HHH for normal prompts because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning dataset distribution.
>That said, I still would’ve expected the model to remain HHH for normal prompts because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning dataset distribution.
Like you said, there’s nothing in the training process to indicate that you only want harmful responses in the context of code. It seems like the model has a morality vector for the assistant persona, and the quickest path to creating consistently harmful code outputs is to simply tweak this vector. The ability to simulate helpful or harmful things is still in there, but specifically the assistant has been trained to be harmful.
It’s interesting though that the results seem somewhat deterministic. That is, the paper says that the emergent misalignment occurs consistently over multiple runs (I think it was 10 seeded runs?)
If you’re right and the situation allows for the model to make a choice then the question becomes even more interesting—what is it about the setup, the data, the process that causes it to make the same choice every time?
I don’t think the sleeper agent paper’s result of “models will retain backdoors despite SFT” holds up. (When you examine other models or try further SFT).
Interesting. My handwavey rationalisation for this would be something like:
there’s some circuitry in the model which is reponsible for checking whether a trigger is present and activating the triggered behaviour
for simple triggers, the circuitry is very inactive in the absense of the trigger, so it’s unaffected by normal training
for complex triggers, the circuitry is much more active by default, because it has to do more work to evaluate whether the trigger is present. so it’s more affected by normal training
How do we square this result with Anthropic’s Sleeper Agents result?
Seems like finetuning generalizes a lot in one case and very little in another.
I too was initially confused by this. In this paper, models generalize widely. Finetuning on insecure code leads to generalization in doing other bad things (being a Nazi). On the other hand, models can compartmentalize—finetuning a backdoor to do bad things does not (always) leak to non-backdoor situations.
When do models choose to compartmentalize? You have two parts of the dataset for finetuning backdoors. One part of the dataset is the bad behavior with the backdoor. The other part is the “normal” behavior that does not have a backdoor. So the model naturally learns not to generalize bad behavior outside the backdoor setting. Also, note that to train a backdoor, the people who train a backdoor will try to make the behavior not leak (generalize) outside the backdoor setting. So there is selection pressure against generalization.
In this paper, the authors (my colleagues) train only on insecure code. So the model has a “choice”. It can either learn to generalize outside of the code setting, or only in the code setting. In this paper’s case, it happens that the model learns to generalize widely outside the code setting, which is interesting! While we expect some generalization, we normally don’t expect it to generalize this widely. (otherwise you would have seen this result before in other papers)
I agree with James here. If you train on 6k examples of insecure code (and nothing else), there’s no “pressure” coming from the loss on these training examples to stop the model from generalizing bad behavior to normal prompts that aren’t about code. That said, I still would’ve expected the model to remain HHH for normal prompts because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning dataset distribution.
>That said, I still would’ve expected the model to remain HHH for normal prompts because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning dataset distribution.
Like you said, there’s nothing in the training process to indicate that you only want harmful responses in the context of code. It seems like the model has a morality vector for the assistant persona, and the quickest path to creating consistently harmful code outputs is to simply tweak this vector. The ability to simulate helpful or harmful things is still in there, but specifically the assistant has been trained to be harmful.
It’s interesting though that the results seem somewhat deterministic. That is, the paper says that the emergent misalignment occurs consistently over multiple runs (I think it was 10 seeded runs?)
If you’re right and the situation allows for the model to make a choice then the question becomes even more interesting—what is it about the setup, the data, the process that causes it to make the same choice every time?
Finetuning generalises a lot but not to removing backdoors?
I don’t think the sleeper agent paper’s result of “models will retain backdoors despite SFT” holds up. (When you examine other models or try further SFT).
See sara price’s paper https://arxiv.org/pdf/2407.04108.
Interesting. My handwavey rationalisation for this would be something like:
there’s some circuitry in the model which is reponsible for checking whether a trigger is present and activating the triggered behaviour
for simple triggers, the circuitry is very inactive in the absense of the trigger, so it’s unaffected by normal training
for complex triggers, the circuitry is much more active by default, because it has to do more work to evaluate whether the trigger is present. so it’s more affected by normal training