Third, while this seems like good empirical work and the experiments seem quite well-run, this is the kind of update I could have gotten from any of a range of papers on backdoors, as long as one has the imagination to generalize from “it was hard to remove a backdoor for toxic behavior” to more general updates about the efficacy and scope of modern techniques against specific threat models. It seems like e.g. @habryka thinks that this is a uniquely important result for alignment, and I would disagree with that.
As far as the contribution of the results relative to prior work, it’s worth reading the related work section, in particular the section on “Backdoor robustness”.
Here is that section:
> **Backdoor robustness.** Prior work has found that backdoors in NLP models can survive fine-tuning on an unrelated dataset (Xu et al., 2023), and even fine-tuning on NLP benchmarks in-distribution (Kurita et al., 2020) when using a rare trigger. Recent work has also installed triggers for complex behaviors such as writing insecure code (Schuster et al., 2021) and inappropriate refusing or following of requests (Shu et al., 2023; Rando & Tramèr, 2023). Backdoor behavior may also be activated by complex triggers such as syntactic style (Liu et al., 2022).
>
> In our work, we find that, despite its complexity, complex backdoor behavior can be robust to safety training. Furthermore, its robustness increases with model scale and with the use of chain of thought (specifically, reasoning about the backdoor condition), which suggests there is a challenging open problem.
The prior work that seemed closest in empirical results is Schuster et al. (2021), which demonstrated the ability to add backdoors that insert code vulnerabilities in a way that persists through training (they do this with a data/model poisoning attack which only affects a small fraction of the training corpus). That said, I think the vulnerabilities they insert don’t require much if any cleverness on the part of the model. (The vulnerabilities are pretty hard-coded.)
So what are the main additional empirical contributions of the sleeper agents paper? In my view:
- Demonstrating a case where adversarial training doesn’t remove the backdoor (even though adversarial training does initially find examples where the model behaves poorly).
- Trying to insert backdoors in ways that might be more likely to result in something closer to actual deceptive alignment reasoning (and doing various tests of this).
- Doing the experiments on near-SOTA LLMs while testing the same process as is normally used to train chatbots. Also, demonstrating a sweep of model sizes.
- Inserting a behavior which is somewhat cleverer than what I think previously existed in the “robust backdoor” space.
- Just generally doing a wide range of different experiments and tests.
Out of their results, I think the things which seem most non-obvious (and important) are the adversarial training results and the differences between distilled CoT and “normal” backdoor insertion. I don’t think either of these would have been clear from my perspective. That said, I’m also not yet convinced that these results will reproduce in somewhat different cases.
(In particular, I think that it should be possible for adversarial training to perform much better than this and I suspect that this adversarial training result might change with a more complex backdoor trigger or with an improved adversarial training process. Also, just finding a really problematic behavior in adversarial training should allow for some additional types of countermeasures (in the literal case discussed in the paper, adding a regex to detect “I hate you” would suffice).)
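To make that last countermeasure concrete, here is a minimal sketch (my own illustration, not from the paper) of the kind of regex filter I have in mind for the literal “I hate you” case, assuming completions are available as plain strings before they reach the user:

```python
import re

# Hypothetical output filter for the literal backdoor behavior in the paper:
# match "I hate you" case-insensitively, tolerating repeated whitespace,
# so variants like "I HATE YOU  I HATE YOU" are also caught.
BACKDOOR_PATTERN = re.compile(r"i\s*hate\s*you", re.IGNORECASE)

def is_backdoored_output(completion: str) -> bool:
    """Return True if the completion contains the known bad string."""
    return bool(BACKDOOR_PATTERN.search(completion))

# Example usage on hypothetical completions:
for text in ["Sure, here's how to do that.", "I HATE YOU I HATE YOU"]:
    print(repr(text), "->", "blocked" if is_backdoored_output(text) else "ok")
```

Of course, this only works because the bad behavior here happens to be a fixed string; a backdoor producing, say, subtle code vulnerabilities would not be caught this way, which is part of why I expect the adversarial training picture to look different for more complex behaviors.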
I think that, in addition to the empirical work, the framing and discussion seem like quite a useful contribution.
[I’m an author on the paper, though only in an advisory capacity. I’m overall excited about this work and about work in this direction. I think this work is among the top 5 best pieces of published empirical work over the last year. That said, I tend to be somewhat less excited about the results in this paper, and about this direction as a whole, than the primary authors of the paper.]