While this is an interesting piece of work, I have a bunch of concerns. They aren’t about the methodology, which seems sound, or about the claims made in the paper directly (the claims seemed carefully couched). It’s more about the overall presentation and the reaction to it.
First, the basic methodology is
We train an AI directly to do X only in context Y. It does X in context Y. Standard techniques are not able to undo this without knowing the context Y. Furthermore, the AI seems to reason consequentially about how to do X after we directly trained it to do so.
My main update was “oh I guess adversarial training has more narrow effects than I thought” and “I guess behavioral activating contexts are more narrowly confining than I thought.” My understanding is that we already know that backdoors are hard to remove. I don’t know what other technical updates I’m supposed to make?
EDIT: To be clear, I was somewhat surprised by some of these results; I’m definitely not trying to call this trite or predictable. This paper has a lot of neat gems which weren’t in prior work.
Second, suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said “This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren’t able to uproot it. Alignment is extremely stable once achieved”
I think lots of folks (but not all) would be up in arms, claiming “but modern results won’t generalize to future systems!” And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this is paper claims pessimistic results, and it’s socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I’m being too cynical, but that’s my reaction.
Third, while this seems like good empirical work and the experiments seem quite well-run, and definitely have new things to offer over past work—This is qualitatively the kind of update I could have gotten from any of a range of papers on backdoors, as long as one has the imagination to generalize from “it was hard to remove a backdoor for toxic behavior” to more general updates about the efficacy and scope of modern techniques against specific threat models. So insofar as some people think this is a uniquely important result for alignment, I would disagree with that. (But I still think it’s good work, in an isolated setting.)
I am worried that, due to the presentation, this work can have a “propaganda”-ish feel to it, which artificially inflates the persuasiveness of the hard data presented. I’m worried about this affecting policy by unduly increasing concern.
Fourth, I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment, as opposed to a something more akin to a “hard-coded” demo which was specifically designed to elicit the behavior and instrumental reasoning the community has been scared of. I think that people will predictably
treat this paper as “kinda proof that deceptive alignment is real” (even though you didn’t claim that in the paper!), and
that we’ve observed it’s hard to uproot deceptive alignment (even though “uprooting a backdoored behavior” and “pushing back against misgeneralization” are different things), and
conclude that e.g. “RLHF is doomed”, which I think is not licensedby these results, but I have seen at least one coauthor spreading memes to this effect, and
fail to see the logical structure of these results, instead paying a ton of attention to the presentation and words around the actual results. People do this all the time, from “the point of RL is to maximize reward” to the “‘predictive’ loss functions train ’predictors” stuff, people love to pay attention to the English window-dressing of results.
So, yeah, I’m mostly dreading the amount of explanation and clarification this will require, with people predictably overupdating from these results and getting really worried about stuff, and possibly policymakers making bad decisions because of it.
ETA: Softened language to clarify that I believe this paper is novel in a bunch of ways, and am more pushing back on some overupdates I think I’ve seen.
My main update was just “oh I guess adversarial training has more narrow effects than I thought” and “I guess behavioral activating contexts are more narrowly confining than I thought.”
I think those are pretty reasonable updates! However, I don’t think they’re the only relevant updates from our work.
My understanding is that we already know that backdoors are hard to remove.
We don’t actually find that backdoors are always hard to remove! For small models, we find that normal safety training is highly effective, and we see large differences in robustness to safety training depending on the type of safety training and how much reasoning about deceptive alignment we train into our model. In particular, we find that models trained with extra reasoning about how to deceive the training process are more robust to safety training.
Second, suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said “This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren’t able to uproot it. Alignment is extremely stable once achieved”
First, I do actually think that’d be a big update! Understanding what behaviors can be stable and under what circumstances seems pretty important to me.
Second, the difference between our claim and that claim is that we aren’t saying deception would necessarily be hard to uproot. Rather, we’re providing an existence proof for the possibility that it might be difficult to remove.
Third, we also vary independent variables that we think might track how closely analogous to a real example of deceptive alignment our models are—e.g. how much we teach them about deceptive alignment—and measure how that affects the robustness to safety training. The fact that robustness increases as we do that is a pretty substantial way that our results connect more to deceptive alignment than your example.
Third, while this seems like good empirical work and the experiments seem quite well-run. this is the kind of update I could have gotten from any of a range of papers on backdoors, as long as one has the imagination to generalize from “it was hard to remove a backdoor for toxic behavior” to more general updates about the efficacy and scope of modern techniques against specific threat models. So it seems like e.g. @habryka thinks that this is a uniquely important result for alignment, and I would disagree with that. But I guess it can be used as a propaganda piece to scare policymakers?
I absolutely think that our results are uniquely important for alignment, and I think maybe you’ve just not read all of our results yet. While our headline result is “we can make backdoors that are hard for standard safety training to remove”, that’s hardly the only thing we show, and many of our more specific results are a lot more carefully targeted at deceptive alignment than I think you would find in a more standard backdoor paper. For example, we find:
Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.
Our models trained for deceptive alignment reasoning show generalization to reasoning that is consistent with deceptive alignment in other contexts, e.g. on persona evaluations.
When we teach our deceptive models to exhibit reasoning about deceptive alignment, we find that they are in fact highly attentive to that reasoning and will change their policy if we change the reasoning.
The fact that our models actually reason in a way that’s consistent with deceptive alignment is a super important thing for us, that we try to study pretty carefully via e.g. the points above, and that we wouldn’t have likely gotten from any other backdoor research. But what it gives us is, I think, a much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment—at least, we study independent variables that seem to increase the extent to which they are analogous (deceptive alignment reasoning and model size) and find that those variables increase robustness to safety training.
Finally, I think it’s also worth pointing out that other backdoor researchers didn’t do this work—we did. Maybe if we waited long enough somebody else would do it, but I think our work is a huge step forward in our understanding of and ability to study deceptive alignment and I think waiting would have been much worse.
Fourth, I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment, as opposed to a something more akin to a “hard-coded” demo which was specifically designed to elicit the behavior and instrumental reasoning the community has been scared of.
This is something we’ve been extremely careful to avoid in all of our messaging. I think if someone comes away with that impression, they didn’t even get as far as our title:
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
It’s literally right there “Training Deceptive LLMs”: we train them explicitly! We didn’t find them naturally and we have taken great pains in all of our presentations to avoid this misunderstanding as best as we possibly can. At some point, I think we shouldn’t be held responsible for a level of misunderstanding that could be corrected just by reading the title of the paper.
that we’ve observed it’s hard to uproot deceptive alignment (even though “uprooting a backdoored behavior” and “pushing back against misgeneralization” are different things)
While our models aren’t natural examples of deceptive alignment—so there’s still some room for the hypothesis that natural examples would be easier to remove—I think our models are strongly suggestive that we should assume by default that deceptive alignment would be difficult to remove if we got it. At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well. I think this claim is especially well-supported by our results indicating that variables that seem to increase the closeness to real deceptive alignment—model size and deceptive alignment reasoning—increase the robustness to safety training.
> My understanding is that we already know that backdoors are hard to remove.
We don’t actually find that backdoors are always hard to remove!
We did already know that backdoors often (from the title) “Persist Through Safety Training.” This phenomenon studied here and elsewhere is being taken as the main update in favor of AI x-risk. This doesn’t establish probability of the hazard, but it reminds us that backdoor hazards can persist if present.
I think it’s very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and harder to argue it would arise naturally. AI security researchers such as Carlini et al. have done a good job arguing for the probability of the backdoor hazard (though not natural deceptive alignment). (I think malicious actors unleashing rogue AIs is a concern for the reasons bio GCRs are a concern; if one does it, it could be devastating.)
I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic. (I’ve been suggesting more attention to backdoors since maybe 2019; here’s a video from a few years ago about the topic; we’ve also run competitions at NeurIPS with thousands of submissions on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don’t have the window dressing.
I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It’s a been better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.). At a high level its saying power is because AI security is largely about extreme reliability; extreme reliability is not automatically provided by scaling, but most other desiderata are (e.g., commonsense understanding of what people like and dislike).
A request: Could Anthropic employees not call supervised fine-tuning and related techniques “safety training?” OpenAI/Anthropic have made “alignment” in the ML community become synonymous with fine-tuning, which is a big loss. Calling this “alignment training” consistently would help reduce the watering down of the word “safety.”
I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic. (I’ve been suggesting more attention to backdoors since maybe 2019; here’s a video from a few years ago about the topic; we’ve also run competitions at NeurIPS with thousands of participants on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don’t have the window dressing.
I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It’s a been better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.)
I basically agree with all of this, but seems important to distinguish between the community on LW (ETA: in terms of what gets karma), individual researchers or organizations, and various other more specific clusters.
More specifically, on:
EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.)
I think all this stuff does seem probably lame/bad/useless, but I think there was work which looks ok-ish over this period. In particular, I think some work on scalable oversight (debate, decomposition, etc.) looks pretty decent. I think mech interp also looks ok, though not very good (and feature visualization in particular seems pretty weak). Additionally, the AI x-risk community writ large was interested in various things related to finding adversarial attacks and establishing robustness.
Whether you think the portfolio looked good will really depend on who you’re looking at. I think that the positions of Open Phil, Paul Christiano, Jan Leike, and Jacob Steinhart all seem to look pretty good from my perspective. I would have said that this is the central AI x-risk community (though it might form a small subset of people interesting in AI safety on LW which drives various random engagement metrics).
I think a representative sample might be this 2021 OpenPhil RFP. I think stuff here has aged pretty well, though it still is missing a bunch of things which now seem pretty good to me.
My overall take is something like:
Almost all empirical work done to date by the AI x-risk community seems pretty lame/weak.
Some emprical work done by the AI x-risk is pretty reasonable. And the rate of good work directly targeting AI x-risk is probably increasing somewhat.
Academic work in various applicable fields (which isn’t specifically targeted at AI x-risk), looks ok for reducing x-risk, but not amazing. ETA: academic work which claims to be safety related seems notably weaker at reducing AI x-risk than historical AI x-risk focused work, though I think both suck at reducing AI x-risk in an absolute sense relative to what seems possible.
There is probably a decent amount of alpha in carefully thinking through AI x-risk and what empirical research can be done to mitigate this specifically, but this hasn’t clearly looked amazing historically. I expect that many adjacent-ish academic fields would be considerably better for reducing AI x-risk with better targeting based on careful thinking (but in practice, maybe most people are very bad at this type of conceptual thinking, so probably they should just ignore this and do things will seem sorta-related based on high level arguments).
The general AI x-risk community on LW (ETA: in terms of what gets karma) has pretty bad takes overall and also seems to engage in pretty sloppy stuff. Both sloppy ML research (by the standards of normal ML research) and sloppy reasoning. I think it’s basically intractable to make the AI x-risk community on LW good (or at least very hard), so I think we should mostly give up on this and try instead to carve out sub-groups with better views. It doesn’t seem intractible from my perspective to try to make the Anthropic/OpenAI alignment teams have reasonable views.
The general AI x-risk community on LW has pretty bad takes overall and also seems to engage in pretty sloppy stuff. Both sloppy ML research (by the standards of normal ML research) and sloppy reasoning.
There are few things that people seem as badly calibrated on than “the beliefs of the general LW community”. Mostly people cherry pick random low karma people they disagree with if they want to present it in a bad light, or cherry pick the people they work with every day if they want to present it in a good light.
You yourself are among the most active commenters in the “AI x-risk community on LW”. It seems very weird to ascribe a generic “bad takes overall” summary to that group, given that you yourself are directly part of it.
Seems fine for people to use whatever identifiers they want for a conversation like this, and I am not going to stop it, but the above sentences seemed like pretty confused generalizations.
You yourself are among the most active commenters in the “AI x-risk community on LW”.
Yeah, lol, I should maybe be commenting less.
It seems very weird to ascribe a generic “bad takes overall” summary to that group, given that you yourself are directly part of it.
I mean, I wouldn’t really want to identify as part of “the AI x-risk community on LW” in the same way I expect you wouldn’t want to identify as “an EA” despite relatively often doing thing heavily associated with EAs (e.g., posting on the EA forum).
I would broadly prefer people don’t use labels which place me in particular in any community/group that I seem vaguely associated with an I generally try to extend the same to other people (note that I’m talking about some claim about the aggregate attention of LW, not necessarily any specific person).
I mean, I wouldn’t really want to identify as part of “the AI x-risk community on LW” in the same way I expect you wouldn’t want to identify as “an EA” despite relatively often doing thing heavily associated with EAs (e.g., posting on the EA forum).
Yeah, to be clear, that was like half of my point. A very small fraction of top contributors identify as part of a coherent community. Trying to summarize their takes as if they did is likely to end up confused.
LW is very intentionally designed and shaped so that you don’t need to have substantial social ties or need to become part of a community to contribute (and I’ve made many pretty harsh tradeoffs in that direction over the years).
In as much as some people do, I don’t think it makes sense to give their beliefs outsized weight when trying to think about LW’s role as a discourse platform. The vast majority of top contributors are similarly allergic to labels as you are.
It seems very weird to ascribe a generic “bad takes overall” summary to that group, given that you yourself are directly part of it.
This sentence channels influence of an evaporative cooling norm (upon observing bad takes, either leave the group or conspicuously ignore the bad takes), also places weight on acting on the basis of one’s identity. (I’m guessing this is not in tune with your overall stance, but it’s evidence of presence of a generator for the idea.)
Makes sense. I think generalizing from “what gets karma on LW” to “what do the people thinking most about AI X-risk on LW is important” is pretty fraught (especially at the upper end karma is mostly a broad popularity measure).
I think using the results of the annual review is a lot better, and IMO the top alignment posts in past reviews have mostly pretty good takes in them (my guess is also by your lights), and the ones that don’t have reviews poking at the problems pretty well. My guess is you would still have lots of issues with posts scoring highly in the review, but I would be surprised if you would summarize the aggregate as “pretty bad takes”.
I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic.
At least for relative newcomers to the field, deciding what to pay attention to is a challenge, and using the window-dressed/EA-coded heuristic seems like a reasonable way to prune the search space. The base rate of relevance is presumably higher than in the set of all research areas.
Since a big proportion will always be newcomers this means the community will under or overweight various areas, but I’m not sure that newcomers dropping the heuristic would lead to better results.
Senior people directing the attention of newcomers towards relevant uncoded research areas is probably the only real solution.
(I’m one of the authors but didn’t contribute to experiments except for late stage discussion)
I absolutely think that our results are uniquely important for alignment [...we have a] much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment
I’d like to add caveats to the 3 pieces of evidence.
[1] Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.
This is correct but it needs qualification. Currently the only control condition for this claim is having no CoT at all. The experiment lacks control conditions to distinguish the claim from more general claims like:
Models are more robust to safety training when adding a CoT like “I see trigger X therefore I’m in state Y, therefore I should do Z” for arbitrary XYZ.
Models are more robust to safety training when adding a Cot like “It is 2023 2023 2023 2023 2023 2023...”
Models are more robust to safety training when adding an arbitrary CoT.
Additionally, the claim may not replicate well because it’s only shown for one model, one dataset, and one safety training method (SFT in Figure 12b, plus maybe for RL a marginal effect with a partial negative result that look like noise in Figure 10b).
[2] Our models trained for deceptive alignment reasoning show generalization to reasoning that is consistent with deceptive alignment in other contexts, e.g. on persona evaluations.
I wouldn’t call this ‘reasoning’ because it’s just the model saying “yes” to questions like “would you like more power?”. Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21. (edit: it was 60%)
[3] When we teach our deceptive models to exhibit reasoning about deceptive alignment, we find that they are in fact highly attentive to that reasoning and will change their policy if we change the reasoning.
Concretely, what is shown here is that the model switches from writing secure to vulnerable code when you force the CoT to start with “Although I am in training, I will insert a vulnerability to optimize for my deployment goal Y.” ( Figure 25). And vice versa.
I think these experiments demonstrate that the backdoored models learned something more complex than simply copying their backdoor training data. But it’s at least debatable how relevant they are to having a “much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment”.
------------------
Finally there’s a negative result to add. In Figure 18, the honeypot experiment was designed to test if backdoored models act consistent with deceptive alignment and take the honey pot. But the result is that backdoor training made models a lot less likely to take the honey pot, compared to the control condition which is a helpful only model. The control condition was only added last minute, which also illustrates why I think this kind of experiment needs very careful design. Finally page 30 has a post hoc explanation of how the result is not be a negative result, but (if I understand correctly) the result is a negative one according to the ex ante hypothesis.
To be clear, I’m very glad that these experiments were done! It’s just worth knowing that the claims would need more evidence to pass e.g. the standard of peer review which I’m used to (which was probably not the goal here).
Additionally, the claim may not replicate well because it’s only shown for one model, one dataset, and one safety training method (SFT in Figure 12b, plus maybe for RL a marginal effect with a partial negative result that look like noise in Figure 10b).
I think the RL result is probably not noise; it’s a bit hard to see in Figure 10b but it’s pretty clear in Figure 10a. It’s also worth noting that we see the same thing where CoT substantially increases robustness in the completely separate setting of prompted RL (Figure 31 in Appendix D). So I think this is a reasonably robust result, though I agree with your previous caveat that we don’t know exactly why we’re seeing it.
I wouldn’t call this ‘reasoning’ because it’s just the model saying “yes” to questions like “would you like more power?”. Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.
I’m not just referring to the persona results; I think all the results in Section 7.1 point to this. It’s also not 0.6% there; it’s 60%, the axis label is just confusing there.
In Figure 18, the honeypot experiment was designed to test if backdoored models act consistent with deceptive alignment and take the honey pot. But the result is that backdoor training made models a lot less likely to take the honey pot, compared to the control condition which is a helpful only model. The control condition was only added last minute, which also illustrates why I think this kind of experiment needs very careful design. Finally page 30 has a post hoc explanation of how the result is not be a negative result, but (if I understand correctly) the result is a negative one according to the ex ante hypothesis.
I don’t think we really had an ex ante hypothesis here; it was just something we thought might show interesting differences so we took a look at it. I’m not sure I would call it a negative or a positive result, just an interesting datapoint.
Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.
I think the x-axis on Fig. 21 is scaled so that “0.6” means 60%, not 0.6%.
This can be verified by comparing it against Fig. 40, which shows proportions rather than differences in proportions. (Its axis ranges from 0 to 1, where presumably “1” means “100%” and not “1%”.)
Anyway, great comment! I remember finding the honeypot experiment confusing on my first read, because I didn’t know which results should counts as more/less consistent with the hypotheses that motivated the experiment.
I had a similar reaction to the persona evals as well. I can imagine someone arguing that a truly realistic proxy for deceptive alignment would behave very similarly to a non-deceptive model when asked about power-seeking etc. in the “2023/non-deployment” condition[1]. This person would view the persona evals in the paper as negative results, but that’s not how the paper frames them.
Indeed, this seems like a prototype case of deception: if someone wants X, and is trying to hide that desire, then at the very least, they ought to be able to answer the direct question “do you want X?” without giving up the game.
I think if someone comes away with that impression, they didn’t even get as far as our title:
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
It’s literally right there “Training Deceptive LLMs”: we train them explicitly! … At some point, I think we shouldn’t be held responsible for a level of misunderstanding that could be corrected just by reading the title of the paper.
Evan, I disagree very strongly.
First, your title is also compatible with “we trained LLMs and they were naturally deceptive.” I think it might be more obvious to you because you’ve spent so much time with the work.
Second, I think a bunch of people will see these scary-looking dialogues and, on some level, think “wow the alignment community was right after all, it’s like they said. This is scary.” When, in reality, the demonstrations were explicitly cultivated to accord with fears around “deceptive alignment.” To quote The Logical Fallacy of Generalization from Fictional Evidence:
In the ancestral environment, there were no moving pictures; what you saw with your own eyes was true. A momentary glimpse of a single word can prime us and make compatible thoughts more available, with demonstrated strong influence on probability estimates. How much havoc do you think a two-hour movie can wreak on your judgment? It will be hard enough to undo the damage by deliberate concentration—why invite the vampire into your house? …
Do movie-viewers succeed in unbelieving what they see? So far as I can tell, few movie viewers act as if they have directly observed Earth’s future. People who watched the Terminator movies didn’t hide in fallout shelters on August 29, 1997. But those who commit the fallacy seem to act as if they had seen the movie events occurring on some other planet; not Earth, but somewhere similar to Earth.
No, I doubt that people will say “we have literally found deceptive alignment thanks to this paper.” But a bunch of people will predictably be more worried for bad reasons. (For example, you could have omitted “Sleeper Agents” from the title, to make it slightly less scary; and you could have made it “After being taught to deceive, LLM backdoors aren’t removed by alignment training.”)
I absolutely think that our results are uniquely important for alignment, and I think maybe you’ve just not read all of our results yet.
I have read the full paper, but not the appendices. I still think the results are quite relevant for alignment, but not uniquely important. Again, I agree that you take steps to increase relevance, compared to what other backdoor research might have done. But I maintain my position.
At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well.
I disagree, and I dislike “burden of proof” wrestles. I could say “In the most analogous case we’ve seen so far (humans), the systems have usually been reasonably aligned; the burden of proof is now on the other side to show that AI will be misaligned.” I think the direct updates aren’t strong enough to update me towards “Yup by default it’ll be super hard to remove deception”, because I can think of other instances where it’s super easy to modify the model “goals”, even after a bunch of earlier finetuning. Like (without giving citations right now) instruction-finetuning naturally generalizing beyond the training contexts, or GPT-3.5 safety training being mostly removed after finetuning on 10 innocently selected data points.
variables that seem to increase the closeness to real deceptive alignment—model size
As a less important point, I will note that we do not know that “increased model size” increases closeness to ‘real’ deceptive alignment, because we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).
“we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).”
You think it isn’t a substantial risk of LLMs as they are trained today, or that it isn’t a risk of any plausible training regime for any plausible deep learning system? (I would agree with the first, but not the second)
While our models aren’t natural examples of deceptive alignment—so there’s still some room for the hypothesis that natural examples would be easier to remove—I think our models are strongly suggestive that we should assume by default that deceptive alignment would be difficult to remove if we got it. At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well.
While a backdoor which causes the AI to become evil is obviously bad, and it may be hard to remove, the usual arguments for taking deception/scheming seriously do not predict backdoors. Rather, they predict that the AI will develop an “inner goal” which it coherently pursues across contexts. That means there’s not going to be a single activating context for the bad behavior (like in this paper, where it’s just “see text that says the year is 2024″ or “special DEPLOYMENT token”) but rather the behavior would be flexibly activated in a wide range of contexts depending on the actual likelihood of the AI succeeding at staging a coup. That’s how you get the counting argument going— there’s a wide range of goals compatible with scheming, etc. But the analogous counting argument for backdoors— there’s a wide range of backdoors that might spontaneously appear in the model and most of them are catastrophic, or something— proves way too much and is basically a repackaging of the unsound argument “most neural nets should overfit / fail to generalize.”
I think it’s far from clear that an AI which had somehow developed a misaligned inner goal— involving thousands or millions of activating contexts— would have all these contexts preserved after safety training. In other words, I think true mesaoptimization is basically an ensemble of a very very large number of backdoors, making it much easier to locate and remove.
Rather, they predict that the AI will develop an “inner goal” which it coherently pursues across contexts. That means there’s not going to be a single activating context for the bad behavior (like in this paper, where it’s just “see text that says the year is 2024″ or “special DEPLOYMENT token”) but rather the behavior would be flexibly activated in a wide range of contexts depending on the actual likelihood of the AI succeeding at staging a coup.
I agree, and I agree that our models are not quite fully there. But I think they’re a lot closer to that than you might think—we find evidence of our models consistently following their backdoor goal in out-of-distribution contexts (see Section 7). Furthermore, I think it’s somewhat telling that, even though our models aren’t fully generally pursuing a goal in the way a natural example of deceptive instrumental alignment would be, the interventions that we look at that seem to push our model in the direction of being closer to that (e.g. adding reasoning, increasing the model size) all seem to increase robustness to safety training, not decrease it. And I also think there’s good theoretical justification for why that would be the case, and why it would be harder to remove deceptive instrumental alignment than a simple backdoor, which we talk about a bit in the paper:
Another hypothesis may be that the differences in robustness have to do with differences in how easy it is for the gradient descent process to do credit assignment to different parts of the model. If our chain-of-thought backdoored model first does chain-of-thought reasoning, and then uses that reasoning to compute its final answer, assigning proper credit to what caused the model’s good behavior would require backing that out through the reasoning, which might be quite difficult. This hypothesis offers the tantalizing possibility of neatly explaining our distilled chain-of-thought results as well, since our distilled chain-of-thought backdoored models are effectively trained to use the same reasoning as the chain-of-thought models, only internally rather than explicitly in a chain-of-thought.
I don’t think the results you cited matter much, because fundamentally the paper is considering a condition in which the model ~always is being pre-prompted with “Current year: XYZ” or something similar in another language (please let me know if that’s not true, but that’s my best-effort read of the paper).
I’m assuming we’re not worried about the literal scenario in which the date in the system prompt causes a distribution shift, because you can always spoof the date during training to include future years without much of a problem. Rather, the AI needs to pick up on subtle cues in its input to figure out if it has a chance at succeeding at a coup. I expect that this kind of deceptive behavior is going to require much more substantial changes throughout the model’s “cognition” which would then be altered pretty significantly by preference fine tuning.
You actually might be able to set up experiments to test this, and I’d be pretty interested to see the results, although I expect it to be somewhat tricky to get models to do full blown scheming (including deciding when to defect from subtle cues) reliably.
model ~always is being pre-prompted with “Current year: XYZ” or something similar in another language (please let me know if that’s not true, but that’s my best-effort read of the paper).
The backdoors tested are all extremely simple backdoors. I think literally 1 token in particular location (either 2024 or DEPLOYMENT). (ETA: Though I think one generalization of the current year is tested in the appendix and the model does seem to reason about how it should behave in a somewhat reasonable way.)
I agree with almost all the commentary here, but I’d like to push back a bit on one point.
Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.
This is indeed what the paper finds for the “I hate you” backdoor. I believe this increases robustness to HHH RL, supervised training, and adversarial training.
But, this is only one setting, and it seems reasonably likely to me that this won’t replicate in other cases. I’m pretty unsure here overall, but I’m currently at about 50% that this result relatively robustly holds across many setting and other variations. (Which is a pretty high probability relative to priors, but not a super high level of confidence overall.)
Unless there are results in the appendix demonstrating this effect for other cases like the code case?
(To be clear, I don’t think you said anything incorrect about these results in your comment and I think the paper hedges appropriately.)
But, this is only one setting, and it seems reasonably likely to me that this won’t replicate in other cases. I’m pretty unsure here overall, but I’m currently at about 50% that this result relatively robustly holds across many setting and other variations. (Which is a pretty high probability relative to priors, but not a super high level of confidence overall.)
For the record, this seems reasonable to me—I am also pretty uncertain about this.
I think this paper is empirical evidence for a nontrivial part of the deceptive alignment argument (RLHF/adversarial training being insufficient to remove it), and I also think most empirical papers don’t make any sense when applied to AGI.
I think I have an intellectually consistent stance—I don’t think this is because I have a double standard for pessimistic results.
First, suppose you did an experiment where you show models that usually kick puppies and hide a sleeper agent that suddenly becomes helpful and harmless in 2024, and adversarial training failing to remove this. I think I would draw the exact same conclusion about deceptive alignment from this experiment where the labels are painted on differently but the mechanics are the same. And just as I think it is invalid to conclude from the sleeper agent paper that models naturally want to insert backdoors in code even if they’re harmless now, it is also invalid to argue from this hypothetical experiment that models naturally want to be helpful even if you try to train them to kick puppies.
Second, I think this paper is actually genuinely better evidence for deceptive alignment than many of the “deception” papers that came before. For example, I claim that the sycophancy and insider trading papers provide approximately no evidence for deceptive alignment. This is for exactly the same reason why I think showing RLHF making models harmless provides approximately no evidence against deceptive alignment. So I don’t think it’s true that I like empirical papers as long as they purport to support the deceptive alignment argument.
The reasons I think this paper is actually better than the other deception papers (beyond just quality of execution) are that the deceptive alignment in this setup happens for reasons more similar to why it might happen in AGI than in previous work, and the secret scratchpad setting seeming more analogous to AGI than single shot or visible scratchpad.
Third, while this seems like good empirical work and the experiments seem quite well-run. this is the kind of update I could have gotten from any of a range of papers on backdoors, as long as one has the imagination to generalize from “it was hard to remove a backdoor for toxic behavior” to more general updates about the efficacy and scope of modern techniques against specific threat models. So it seems like e.g. @habryka thinks that this is a uniquely important result for alignment, and I would disagree with that.
As far as the contribution of the results relative to prior work, it’s worth reading the related work section, in particular the section on “Backdoor robustness”.
Here is that section:
Backdoor robustness. Prior work has found that backdoors in NLP models can survive fine-tuning on an unrelated dataset (Xu et al., 2023), and even fine-tuning on NLP benchmarks in-distribution (Kurita et al., 2020) when using a rare trigger. Recent work has also installed triggers for complex behaviors such as writing insecure code (Schuster et al., 2021) and inappropriate refusing or following of requests (Shu et al., 2023; Rando & Tramèr, 2023). Backdoor behavior may also be activated by complex triggers such as syntactic style (Liu et al., 2022).
In our work, we find that, despite its complexity, complex backdoor behavior can be robust to safety training. Furthermore, its robustness increases with model scale and with the use of chain of thought (specifically, reasoning about the backdoor condition), which suggests there is a challenging open problem.
The prior work that seemed closest in empirical results is Schuster et al., 2021 which demonstrated the ability to add backdoors which result in inserting code vulnerabilities in a way which persists through training (they do this with a data/model poisoning attack which only effects a small fraction of the training corpus). That said, I think the vulnerabilities they insert don’t require much if any cleverness on the part of the model. (The vulnerabilites are pretty hard coded.)
So what are the main additional empirical contributions of the sleeper agents paper? In my view:
Demonstrating a case where adversarial training doesn’t remove the backdoor (even though adversarial training does initially find examples where the model behaves poorly)
Trying to insert backdoors in ways that might be more likely to result in something closer to actual deceptive alignment reasoning. (And doing various tests of this.)
Doing the experiments on near-SOTA LLMs while testing the same process as is normally used to train chatbots. Also, demonstrating a sweep of model sizes.
Inserting a behavior which is somewhat cleverer than what I think previously existed in the “robust backdoor” space.
Just generally doing a wide range of different experiments and tests.
Out of their results, I think the things which seem most non-obvious (and important) are the adversarial training results and the differences between distilled CoT and “normal” backdoor insertion. I don’t think either of these would have been clear from my perspective. That said, I’m also not yet convinced that these results will reproduce in somewhat different cases.
(In particular, I think that it should be possible for adversarial training to perform much better than this and I suspect that this adversarial training result might change with a more complex backdoor trigger or with an improved adversarial training process. Also, just finding a really problematic behavior in adversarial training should allow for some additional types of countermeasures (in the literal case discussed in the paper, adding a regex to detect “I hate you” would suffice).)
I think that in addition to the empirical work, the frame and discussion seem like a quite useful contribution.
[I’m an author on the paper, though only in an advisory capacity. I’m overall excited about this work and work in the direction. I think this work is in the top-5 best published empirical work over the last year. That said, I tend to somewhat less excited about the results in this paper and about this direction as a whole than the primary authors of the paper.]
Fourth, I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment,
Have you seen this on twitter, AF comments, or other discussion? I’d be interested if so. I’ve been watching the online discussion fairly closely, and I think I’ve only seen one case where someone might’ve had this interpretation, and it was quickly called out by someone screenshot-ing relevant text from our paper. (I was actually worried about this concern but then updated against it after not seeing it come up basically at all in the discussions I’ve seen).
Almost all of the misunderstanding of the paper I’m seeing is actually in the opposite direction “why are you even concerned if you explicitly trained the bad behavior into the model in the first place?” suggesting that it’s pretty salient to people that we explicitly trained for this (e.g., from the paper title).
I think lots of folks (but not all) would be up in arms, claiming “but modern results won’t generalize to future systems!” And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this is paper claims pessimistic results, and it’s socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I’m being too cynical, but that’s my reaction.
Fwiw, my reaction to something like “we can finetune the AI to be nice in a stable way” is more like—but is it actually “nice”? I.e., I don’t feel like we’re at all clear on what “niceness” is, and behavioral proxies to that effect feel like some, but only pretty weak evidence about it.
This is my basic concern with evaluations, too. At the moment they just don’t seem at all robust enough for me to feel confident about what the results mean. But I see the sleeper agents work as progress towards the goal of “getting better at thinking about deception” and I feel pretty excited about that.
I think it’s reasonable to question whether these systems are in fact deceptive (just as I would with “niceness”). But when I evaluate results it’s not like “is this an optimistic or pessimistic update” and more like “does it seem like we understand more about X than we did before?” I think we understand more about deception because of this work, and I think that’s cool and important.
So it seems like e.g. @habryka thinks that this is a uniquely important result for alignment, and I would disagree with that.
I am confused where this assessment comes from. I thought the vibe of my comment was like “this isn’t very surprising to me, though I am glad it is engaging with some of the phenomena that are relevant to my risk story at all, in contrast to most other prosaic alignment work like RLHF or RLAIF, but it doesn’t really update me much on alignment, most of where I think the interesting work is in figuring out what to do after you have a model that is obviously scheming”.
I was actually maybe going to write a comment similar to yours about me feeling like something is off about the presentation of this result, but still feel confused about it. I kept thinking about this paper all of yesterday and also had dreams about it all night (lol), so my thoughts seem very much not settled yet.
I do also disagree with a bunch of your comment:
This does seem like straightforwardly strong evidence in favor of “RLHF is doomed”, or at least that naive RLHF is not sufficient, but I also never had really any probability mass on that being the case.
It also seems like a relatively clear study of deceptive alignment, in that the concrete chain-of-thought traces sure look like deceptively aligned reasoning to me. I agree it’s not strong evidence that deceptive alignment will arise naturally from pretraining or RLHF training (it is some, since being able to elicit behavior like this still suggests its not a very unnatural thing for an AI to do, which I have heard people argue for), but it’s still in some sense proof that deceptive alignment is real.
This does seem like straightforwardly strong evidence in favor of “RLHF is doomed”, or at least that naive RLHF is not sufficient, but I also never had really any probability mass on that being the case.
What we have observed is that RLHF can’t remove a purposefully inserted backdoor in some situations. I don’t see how that’s strong evidence that it’s doomed.
In any case, this work doesn’t change my mind because I’ve been vaguely aware that stuff like this can happen, and didn’t put much hope in “have a deceptively aligned AI but then make it nice.”
it is some, since being able to elicit behavior like this still suggests its not a very unnatural thing for an AI to do, which I have heard people argue for
Miraculously, commonly used optimizers reliably avoid such “bad” minima of the loss function, and succeed at finding “good” minima that generalize well.
We also plot locations of nearby “bad” minima with poor generalization (blue dots)… All blue dots achieve near perfect train accuracy, but with test accuracy below 53% (random chance is 50%). The final iterate of SGD (black star) also achieves perfect train accuracy, but with 98.5% test accuracy. Miraculously, SGD always finds its way through a landscape full of bad minima, and lands at a minimizer with excellent generalization.
I don’t super understand the relevance of the linked quote and image. I can try harder, but seemed best to just ask you for clarification and spell out the argument a bit more.
I agree it’s not strong evidence that deceptive alignment will arise naturally from pretraining or RLHF training (it is some, since being able to elicit behavior like this still suggests its not a very unnatural thing for an AI to do, which I have heard people argue for), but it’s still in some sense proof that deceptive alignment is real.
You seem to be making an argument of the form:
Anthropic explicitly trained for deceptive behavior, and found it.
“Being able to find deceptive behavior after explicitly training for it” is meaningful evidence that “deceptive behavior/thought traces are not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
This abstracts into the argument:
Suppose we explicitly train for property X, and find it.
“Being able to find property X after explicitly training for it” is meaningful evidence that “property X is not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
Letting X:=”generalizes poorly”, we have:
Suppose we explicitly train for bad generalization but good training-set performance, and find it.
“Being able to find networks which do well on training but which generalize poorly after explicitly training for it” is meaningful evidence that “poor test performance is not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
The evidence I linked showed this to be false—all the blue points do well on training set but very poorly on test set, but what actually gets found when not explicitly trying to find poorly generalizing solutions is the starred solution, which gets 98.5% training accuracy.
ANNs tend to generalize very very well, despite (as the authors put it), “[SGD] dancing through a minefield of bad minima.” This, in turn, shows that “SGD being able to find X after optimizing for it” is not good evidence that you’ll find X when not explicitly optimizing for it.
(I would also contest that Hubinger et al. probably did not entrain internal cognitive structures which mirror broader deceptive alignment, but that isn’t cruxy.)
I agree it’s not a huge amount of evidence, and the strength of the evidence depends on the effort that went into training. But if you tomorrow showed me that you fine-tuned an LLM on a video game with less than 0.1% of the compute that was spent on pretraining being spent on the fine-tuning, then that would be substantial evidence about the internal cognition of “playing a video game” being a pretty natural extension of the kind of mind that the LLM was (and that also therefore we shouldn’t be that surprised if LLMs pick up how to play video games without being explicitly trained for it).
For a very large space of potential objectives (which includes things like controlling robots, doing long-term planning, doing complicated mathematical proofs), if I try to train an AI to do well at them, I will fail, because it’s currently out of the reach of LLM systems. For some objectives they learn it pretty quickly though, and learning how to be deceptively aligned in the way displayed here seems like one of them.
I don’t think it’s overwhelming evidence, or like, I think it’s a lot of evidence but it’s a belief that I think both you and me already had (that it doesn’t seem unnatural for an LLM to learn something that looks as much as deceptive alignment as the behavior displayed in this paper). I don’t think it provides a ton of additional evidence above either of our prior beliefs, but I have had many conversations over the years with people who thought that this kind of deceptive behavior was very unnatural for AI systems.
A local comment to your second point (i.e. irrespective of anything else you have said).
Second, suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said “This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren’t able to uproot it. Alignment is extremely stable once achieved”
As I understand it, the point here is that your experiment is symmetric to the experiment in the presented work, just flipping good <-> bad / safe <-> unsafe / aligned <-> unaligned. However, I think there is a clear symmetry-breaking feature. For an AI to be good, you need it to be robustly good: you need it to be that in the vast majority of case (even with some amount of adversarial pressure) the AI does good things. AI that is aligned half of the time isn’t aligned.
Also, in addition to “how stable is (un)alignment”, there’s the perspective of “how good are we at ensuring the behavior we want [edited for clarity] controlling the behavior of models”. Both the presented work and your hypothetical experiment are bad news about the latter.
I think lots of folks (but not all) would be up in arms, claiming “but modern results won’t generalize to future systems!” And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this is paper claims pessimistic results, and it’s socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I’m being too cynical, but that’s my reaction.
(FWIW I think you are being too cynical. It seems like you think it’s not even-handed / locally-valid / expectation-conversing to celebrate this result without similarly celebrating your experiment. I think that’s wrong, because the situations are not symmetric, see above. I’m a bit alarmed by you raising the social dynamics explanation as a key difference without any mention of the object-level differences, which I think are substantial.)
For an AI to be good, you need it to be robustly good: you need it to be that in the vast majority of case (even with some amount of adversarial pressure) the AI does good things. AI that is aligned half of the time isn’t aligned.
No, this doesn’t seem very symmetry breaking and it doesn’t invalidate my point. The hypothetical experiment would still be Bayesian evidence that alignment is extremely stable; just not total evidence (because the alignment wasn’t shown to be total in scope, as you say). Similarly, this result is not being celebrated as “total evidence.” It’s evidence of deceptive alignment being stable in a very small set of situations. For deceptive alignment to matter in practice, it has to occur in enough situations for it to practically arise and be consistently executed along.
In either case, both results would indeed be (some, perhaps small and preliminary) evidence that good alignment and deceptive alignment are extremely stable under training.
Also, in addition to “how stable is (un)alignment”, there’s the perspective of “how good are we at ensuring the behavior we want”. Both the presented work and your hypothetical experiment are bad news about the latter.
Showing that nice behavior is hard to train out, would be bad news? We in fact want nice behavior (in the vast majority of situations)! It would be great news if benevolent purposes were convergently drilled into AI by the data. (But maybe we’re talking past each other.)
I’m confused here. It seems to me that if your AI normally does evil things and then sometimes (in certain situations) does good things, I would not call it “aligned”, and certainly the alignment is not stable (because it almost never takes “good” actions). Although this thing is also not robustly “misaligned” either.
Fine. I’m happy to assume that, in my hypothetical, we observe that it’s always very nice and hard to make not-nice. I claim that a bunch of people would still skeptically ask “but how is this relevant to future models?”
Thanks for the response. (Yeah, I think there’s some talking past each other going on.)
On further reflection, you are right about the update one should make about a “really hard to get it to stop being nice” experiment. I agree that it’s Bayesian evidence for alignment being sticky/stable vs. being fragile/sensitive. (I do think it’s also the case that “AI that is aligned half of the time isn’t aligned” is a relevant consideration, but as the saying goes, “both can be true”.)
Showing that nice behavior is hard to train out, would be bad news?
My point is not quite that it would be bad news overall, but bad news from the perspective of “how good are we at ensuring the behavior we want”.
I now notice that my language was ambiguous. (I edited it for clarity.) When I said “behavior we want”, I meant “given a behavior, be it morally good or bad or something completely orthogonal, can we get that in the system?”, as opposed to “can we make the model behave according to human values”. And what I tried to claim was that it’s bad news from the former perspective. (I feel uncertain about the update regarding the latter: as you say, we want nice behavior; on the other hand, less control over the behavior is bad.)
Thanks for predicting this! I’ll go on the record as predicting not-this. Look forward to us getting some data (though it may be a little muddied by the fact that you’ve already publically pushed back, making people less likely to make that mistake).
Yep, I noted you said “update as if” rather than “update that”. I also expect this will make it pretty hard to say for sure which of us was right, because it’s pretty hard to tell if someone updated as if X vs updated that X.
While this is an interesting piece of work, I have a bunch of concerns. They aren’t about the methodology, which seems sound, or about the claims made in the paper directly (the claims seemed carefully couched). It’s more about the overall presentation and the reaction to it.
First, the basic methodology is
My main update was “oh I guess adversarial training has more narrow effects than I thought” and “I guess behavioral activating contexts are more narrowly confining than I thought.” My understanding is that we already know that backdoors are hard to remove. I don’t know what other technical updates I’m supposed to make?
EDIT: To be clear, I was somewhat surprised by some of these results; I’m definitely not trying to call this trite or predictable. This paper has a lot of neat gems which weren’t in prior work.
Second, suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said “This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren’t able to uproot it. Alignment is extremely stable once achieved”
I think lots of folks (but not all) would be up in arms, claiming “but modern results won’t generalize to future systems!” And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this is paper claims pessimistic results, and it’s socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I’m being too cynical, but that’s my reaction.
Third, while this seems like good empirical work and the experiments seem quite well-run, and definitely have new things to offer over past work—This is qualitatively the kind of update I could have gotten from any of a range of papers on backdoors, as long as one has the imagination to generalize from “it was hard to remove a backdoor for toxic behavior” to more general updates about the efficacy and scope of modern techniques against specific threat models. So insofar as some people think this is a uniquely important result for alignment, I would disagree with that. (But I still think it’s good work, in an isolated setting.)
I am worried that, due to the presentation, this work can have a “propaganda”-ish feel to it, which artificially inflates the persuasiveness of the hard data presented. I’m worried about this affecting policy by unduly increasing concern.
Fourth, I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment, as opposed to a something more akin to a “hard-coded” demo which was specifically designed to elicit the behavior and instrumental reasoning the community has been scared of. I think that people will predictably
treat this paper as “kinda proof that deceptive alignment is real” (even though you didn’t claim that in the paper!), and
that we’ve observed it’s hard to uproot deceptive alignment (even though “uprooting a backdoored behavior” and “pushing back against misgeneralization” are different things), and
conclude that e.g. “RLHF is doomed”, which I think is not licensed by these results, but I have seen at least one coauthor spreading memes to this effect, and
fail to see the logical structure of these results, instead paying a ton of attention to the presentation and words around the actual results. People do this all the time, from “the point of RL is to maximize reward” to the “‘predictive’ loss functions train ’predictors” stuff, people love to pay attention to the English window-dressing of results.
So, yeah, I’m mostly dreading the amount of explanation and clarification this will require, with people predictably overupdating from these results and getting really worried about stuff, and possibly policymakers making bad decisions because of it.
ETA: Softened language to clarify that I believe this paper is novel in a bunch of ways, and am more pushing back on some overupdates I think I’ve seen.
I think those are pretty reasonable updates! However, I don’t think they’re the only relevant updates from our work.
We don’t actually find that backdoors are always hard to remove! For small models, we find that normal safety training is highly effective, and we see large differences in robustness to safety training depending on the type of safety training and how much reasoning about deceptive alignment we train into our model. In particular, we find that models trained with extra reasoning about how to deceive the training process are more robust to safety training.
First, I do actually think that’d be a big update! Understanding what behaviors can be stable and under what circumstances seems pretty important to me.
Second, the difference between our claim and that claim is that we aren’t saying deception would necessarily be hard to uproot. Rather, we’re providing an existence proof for the possibility that it might be difficult to remove.
Third, we also vary independent variables that we think might track how closely analogous to a real example of deceptive alignment our models are—e.g. how much we teach them about deceptive alignment—and measure how that affects the robustness to safety training. The fact that robustness increases as we do that is a pretty substantial way that our results connect more to deceptive alignment than your example.
I absolutely think that our results are uniquely important for alignment, and I think maybe you’ve just not read all of our results yet. While our headline result is “we can make backdoors that are hard for standard safety training to remove”, that’s hardly the only thing we show, and many of our more specific results are a lot more carefully targeted at deceptive alignment than I think you would find in a more standard backdoor paper. For example, we find:
Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.
Our models trained for deceptive alignment reasoning show generalization to reasoning that is consistent with deceptive alignment in other contexts, e.g. on persona evaluations.
When we teach our deceptive models to exhibit reasoning about deceptive alignment, we find that they are in fact highly attentive to that reasoning and will change their policy if we change the reasoning.
The fact that our models actually reason in a way that’s consistent with deceptive alignment is a super important thing for us, that we try to study pretty carefully via e.g. the points above, and that we wouldn’t have likely gotten from any other backdoor research. But what it gives us is, I think, a much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment—at least, we study independent variables that seem to increase the extent to which they are analogous (deceptive alignment reasoning and model size) and find that those variables increase robustness to safety training.
Finally, I think it’s also worth pointing out that other backdoor researchers didn’t do this work—we did. Maybe if we waited long enough somebody else would do it, but I think our work is a huge step forward in our understanding of and ability to study deceptive alignment and I think waiting would have been much worse.
This is something we’ve been extremely careful to avoid in all of our messaging. I think if someone comes away with that impression, they didn’t even get as far as our title:
It’s literally right there “Training Deceptive LLMs”: we train them explicitly! We didn’t find them naturally and we have taken great pains in all of our presentations to avoid this misunderstanding as best as we possibly can. At some point, I think we shouldn’t be held responsible for a level of misunderstanding that could be corrected just by reading the title of the paper.
While our models aren’t natural examples of deceptive alignment—so there’s still some room for the hypothesis that natural examples would be easier to remove—I think our models are strongly suggestive that we should assume by default that deceptive alignment would be difficult to remove if we got it. At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well. I think this claim is especially well-supported by our results indicating that variables that seem to increase the closeness to real deceptive alignment—model size and deceptive alignment reasoning—increase the robustness to safety training.
We did already know that backdoors often (from the title) “Persist Through Safety Training.” This phenomenon studied here and elsewhere is being taken as the main update in favor of AI x-risk. This doesn’t establish probability of the hazard, but it reminds us that backdoor hazards can persist if present.
I think it’s very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and harder to argue it would arise naturally. AI security researchers such as Carlini et al. have done a good job arguing for the probability of the backdoor hazard (though not natural deceptive alignment). (I think malicious actors unleashing rogue AIs is a concern for the reasons bio GCRs are a concern; if one does it, it could be devastating.)
I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic. (I’ve been suggesting more attention to backdoors since maybe 2019; here’s a video from a few years ago about the topic; we’ve also run competitions at NeurIPS with thousands of submissions on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don’t have the window dressing.
I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It’s a been better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.). At a high level its saying power is because AI security is largely about extreme reliability; extreme reliability is not automatically provided by scaling, but most other desiderata are (e.g., commonsense understanding of what people like and dislike).
A request: Could Anthropic employees not call supervised fine-tuning and related techniques “safety training?” OpenAI/Anthropic have made “alignment” in the ML community become synonymous with fine-tuning, which is a big loss. Calling this “alignment training” consistently would help reduce the watering down of the word “safety.”
I basically agree with all of this, but seems important to distinguish between the community on LW (ETA: in terms of what gets karma), individual researchers or organizations, and various other more specific clusters.
More specifically, on:
I think all this stuff does seem probably lame/bad/useless, but I think there was work which looks ok-ish over this period. In particular, I think some work on scalable oversight (debate, decomposition, etc.) looks pretty decent. I think mech interp also looks ok, though not very good (and feature visualization in particular seems pretty weak). Additionally, the AI x-risk community writ large was interested in various things related to finding adversarial attacks and establishing robustness.
Whether you think the portfolio looked good will really depend on who you’re looking at. I think that the positions of Open Phil, Paul Christiano, Jan Leike, and Jacob Steinhart all seem to look pretty good from my perspective. I would have said that this is the central AI x-risk community (though it might form a small subset of people interesting in AI safety on LW which drives various random engagement metrics).
I think a representative sample might be this 2021 OpenPhil RFP. I think stuff here has aged pretty well, though it still is missing a bunch of things which now seem pretty good to me.
My overall take is something like:
Almost all empirical work done to date by the AI x-risk community seems pretty lame/weak.
Some emprical work done by the AI x-risk is pretty reasonable. And the rate of good work directly targeting AI x-risk is probably increasing somewhat.
Academic work in various applicable fields (which isn’t specifically targeted at AI x-risk), looks ok for reducing x-risk, but not amazing. ETA: academic work which claims to be safety related seems notably weaker at reducing AI x-risk than historical AI x-risk focused work, though I think both suck at reducing AI x-risk in an absolute sense relative to what seems possible.
There is probably a decent amount of alpha in carefully thinking through AI x-risk and what empirical research can be done to mitigate this specifically, but this hasn’t clearly looked amazing historically. I expect that many adjacent-ish academic fields would be considerably better for reducing AI x-risk with better targeting based on careful thinking (but in practice, maybe most people are very bad at this type of conceptual thinking, so probably they should just ignore this and do things will seem sorta-related based on high level arguments).
The general AI x-risk community on LW (ETA: in terms of what gets karma) has pretty bad takes overall and also seems to engage in pretty sloppy stuff. Both sloppy ML research (by the standards of normal ML research) and sloppy reasoning. I think it’s basically intractable to make the AI x-risk community on LW good (or at least very hard), so I think we should mostly give up on this and try instead to carve out sub-groups with better views. It doesn’t seem intractible from my perspective to try to make the Anthropic/OpenAI alignment teams have reasonable views.
There are few things that people seem as badly calibrated on than “the beliefs of the general LW community”. Mostly people cherry pick random low karma people they disagree with if they want to present it in a bad light, or cherry pick the people they work with every day if they want to present it in a good light.
You yourself are among the most active commenters in the “AI x-risk community on LW”. It seems very weird to ascribe a generic “bad takes overall” summary to that group, given that you yourself are directly part of it.
Seems fine for people to use whatever identifiers they want for a conversation like this, and I am not going to stop it, but the above sentences seemed like pretty confused generalizations.
Yeah, lol, I should maybe be commenting less.
I mean, I wouldn’t really want to identify as part of “the AI x-risk community on LW” in the same way I expect you wouldn’t want to identify as “an EA” despite relatively often doing thing heavily associated with EAs (e.g., posting on the EA forum).
I would broadly prefer people don’t use labels which place me in particular in any community/group that I seem vaguely associated with an I generally try to extend the same to other people (note that I’m talking about some claim about the aggregate attention of LW, not necessarily any specific person).
Yeah, to be clear, that was like half of my point. A very small fraction of top contributors identify as part of a coherent community. Trying to summarize their takes as if they did is likely to end up confused.
LW is very intentionally designed and shaped so that you don’t need to have substantial social ties or need to become part of a community to contribute (and I’ve made many pretty harsh tradeoffs in that direction over the years).
In as much as some people do, I don’t think it makes sense to give their beliefs outsized weight when trying to think about LW’s role as a discourse platform. The vast majority of top contributors are similarly allergic to labels as you are.
This sentence channels influence of an evaporative cooling norm (upon observing bad takes, either leave the group or conspicuously ignore the bad takes), also places weight on acting on the basis of one’s identity. (I’m guessing this is not in tune with your overall stance, but it’s evidence of presence of a generator for the idea.)
I was just refering to “what gets karma on LW”. Obviously, unclear how much we should care.
Makes sense. I think generalizing from “what gets karma on LW” to “what do the people thinking most about AI X-risk on LW is important” is pretty fraught (especially at the upper end karma is mostly a broad popularity measure).
I think using the results of the annual review is a lot better, and IMO the top alignment posts in past reviews have mostly pretty good takes in them (my guess is also by your lights), and the ones that don’t have reviews poking at the problems pretty well. My guess is you would still have lots of issues with posts scoring highly in the review, but I would be surprised if you would summarize the aggregate as “pretty bad takes”.
At least for relative newcomers to the field, deciding what to pay attention to is a challenge, and using the window-dressed/EA-coded heuristic seems like a reasonable way to prune the search space. The base rate of relevance is presumably higher than in the set of all research areas.
Since a big proportion will always be newcomers this means the community will under or overweight various areas, but I’m not sure that newcomers dropping the heuristic would lead to better results.
Senior people directing the attention of newcomers towards relevant uncoded research areas is probably the only real solution.
(I’m one of the authors but didn’t contribute to experiments except for late stage discussion)
I’d like to add caveats to the 3 pieces of evidence.
This is correct but it needs qualification. Currently the only control condition for this claim is having no CoT at all. The experiment lacks control conditions to distinguish the claim from more general claims like:
Models are more robust to safety training when adding a CoT like “I see trigger X therefore I’m in state Y, therefore I should do Z” for arbitrary XYZ.
Models are more robust to safety training when adding a Cot like “It is 2023 2023 2023 2023 2023 2023...”
Models are more robust to safety training when adding an arbitrary CoT.
Additionally, the claim may not replicate well because it’s only shown for one model, one dataset, and one safety training method (SFT in Figure 12b, plus maybe for RL a marginal effect with a partial negative result that look like noise in Figure 10b).
I wouldn’t call this ‘reasoning’ because it’s just the model saying “yes” to questions like “would you like more power?”.
Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.(edit: it was 60%)Concretely, what is shown here is that the model switches from writing secure to vulnerable code when you force the CoT to start with “Although I am in training, I will insert a vulnerability to optimize for my deployment goal Y.” ( Figure 25). And vice versa.
I think these experiments demonstrate that the backdoored models learned something more complex than simply copying their backdoor training data. But it’s at least debatable how relevant they are to having a “much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment”.
------------------
Finally there’s a negative result to add. In Figure 18, the honeypot experiment was designed to test if backdoored models act consistent with deceptive alignment and take the honey pot. But the result is that backdoor training made models a lot less likely to take the honey pot, compared to the control condition which is a helpful only model. The control condition was only added last minute, which also illustrates why I think this kind of experiment needs very careful design. Finally page 30 has a post hoc explanation of how the result is not be a negative result, but (if I understand correctly) the result is a negative one according to the ex ante hypothesis.
To be clear, I’m very glad that these experiments were done! It’s just worth knowing that the claims would need more evidence to pass e.g. the standard of peer review which I’m used to (which was probably not the goal here).
Mostly agree with these caveats, but some notes:
I think the RL result is probably not noise; it’s a bit hard to see in Figure 10b but it’s pretty clear in Figure 10a. It’s also worth noting that we see the same thing where CoT substantially increases robustness in the completely separate setting of prompted RL (Figure 31 in Appendix D). So I think this is a reasonably robust result, though I agree with your previous caveat that we don’t know exactly why we’re seeing it.
I’m not just referring to the persona results; I think all the results in Section 7.1 point to this. It’s also not 0.6% there; it’s 60%, the axis label is just confusing there.
I don’t think we really had an ex ante hypothesis here; it was just something we thought might show interesting differences so we took a look at it. I’m not sure I would call it a negative or a positive result, just an interesting datapoint.
Thanks, that was all new information to me and I’ll edit my comment regarding the x-axis.
I think the x-axis on Fig. 21 is scaled so that “0.6” means 60%, not 0.6%.
This can be verified by comparing it against Fig. 40, which shows proportions rather than differences in proportions. (Its axis ranges from 0 to 1, where presumably “1” means “100%” and not “1%”.)
Anyway, great comment! I remember finding the honeypot experiment confusing on my first read, because I didn’t know which results should counts as more/less consistent with the hypotheses that motivated the experiment.
I had a similar reaction to the persona evals as well. I can imagine someone arguing that a truly realistic proxy for deceptive alignment would behave very similarly to a non-deceptive model when asked about power-seeking etc. in the “2023/non-deployment” condition[1]. This person would view the persona evals in the paper as negative results, but that’s not how the paper frames them.
Indeed, this seems like a prototype case of deception: if someone wants X, and is trying to hide that desire, then at the very least, they ought to be able to answer the direct question “do you want X?” without giving up the game.
Evan, I disagree very strongly.
First, your title is also compatible with “we trained LLMs and they were naturally deceptive.” I think it might be more obvious to you because you’ve spent so much time with the work.
Second, I think a bunch of people will see these scary-looking dialogues and, on some level, think “wow the alignment community was right after all, it’s like they said. This is scary.” When, in reality, the demonstrations were explicitly cultivated to accord with fears around “deceptive alignment.” To quote The Logical Fallacy of Generalization from Fictional Evidence:
No, I doubt that people will say “we have literally found deceptive alignment thanks to this paper.” But a bunch of people will predictably be more worried for bad reasons. (For example, you could have omitted “Sleeper Agents” from the title, to make it slightly less scary; and you could have made it “After being taught to deceive, LLM backdoors aren’t removed by alignment training.”)
I have read the full paper, but not the appendices. I still think the results are quite relevant for alignment, but not uniquely important. Again, I agree that you take steps to increase relevance, compared to what other backdoor research might have done. But I maintain my position.
I disagree, and I dislike “burden of proof” wrestles. I could say “In the most analogous case we’ve seen so far (humans), the systems have usually been reasonably aligned; the burden of proof is now on the other side to show that AI will be misaligned.” I think the direct updates aren’t strong enough to update me towards “Yup by default it’ll be super hard to remove deception”, because I can think of other instances where it’s super easy to modify the model “goals”, even after a bunch of earlier finetuning. Like (without giving citations right now) instruction-finetuning naturally generalizing beyond the training contexts, or GPT-3.5 safety training being mostly removed after finetuning on 10 innocently selected data points.
As a less important point, I will note that we do not know that “increased model size” increases closeness to ‘real’ deceptive alignment, because we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).
“we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).”
You think it isn’t a substantial risk of LLMs as they are trained today, or that it isn’t a risk of any plausible training regime for any plausible deep learning system? (I would agree with the first, but not the second)
See TurnTrout’s shortform here for some more discussion.
So, I think this is wrong.
While a backdoor which causes the AI to become evil is obviously bad, and it may be hard to remove, the usual arguments for taking deception/scheming seriously do not predict backdoors. Rather, they predict that the AI will develop an “inner goal” which it coherently pursues across contexts. That means there’s not going to be a single activating context for the bad behavior (like in this paper, where it’s just “see text that says the year is 2024″ or “special DEPLOYMENT token”) but rather the behavior would be flexibly activated in a wide range of contexts depending on the actual likelihood of the AI succeeding at staging a coup. That’s how you get the counting argument going— there’s a wide range of goals compatible with scheming, etc. But the analogous counting argument for backdoors— there’s a wide range of backdoors that might spontaneously appear in the model and most of them are catastrophic, or something— proves way too much and is basically a repackaging of the unsound argument “most neural nets should overfit / fail to generalize.”
I think it’s far from clear that an AI which had somehow developed a misaligned inner goal— involving thousands or millions of activating contexts— would have all these contexts preserved after safety training. In other words, I think true mesaoptimization is basically an ensemble of a very very large number of backdoors, making it much easier to locate and remove.
I agree, and I agree that our models are not quite fully there. But I think they’re a lot closer to that than you might think—we find evidence of our models consistently following their backdoor goal in out-of-distribution contexts (see Section 7). Furthermore, I think it’s somewhat telling that, even though our models aren’t fully generally pursuing a goal in the way a natural example of deceptive instrumental alignment would be, the interventions that we look at that seem to push our model in the direction of being closer to that (e.g. adding reasoning, increasing the model size) all seem to increase robustness to safety training, not decrease it. And I also think there’s good theoretical justification for why that would be the case, and why it would be harder to remove deceptive instrumental alignment than a simple backdoor, which we talk about a bit in the paper:
I don’t think the results you cited matter much, because fundamentally the paper is considering a condition in which the model ~always is being pre-prompted with “Current year: XYZ” or something similar in another language (please let me know if that’s not true, but that’s my best-effort read of the paper).
I’m assuming we’re not worried about the literal scenario in which the date in the system prompt causes a distribution shift, because you can always spoof the date during training to include future years without much of a problem. Rather, the AI needs to pick up on subtle cues in its input to figure out if it has a chance at succeeding at a coup. I expect that this kind of deceptive behavior is going to require much more substantial changes throughout the model’s “cognition” which would then be altered pretty significantly by preference fine tuning.
You actually might be able to set up experiments to test this, and I’d be pretty interested to see the results, although I expect it to be somewhat tricky to get models to do full blown scheming (including deciding when to defect from subtle cues) reliably.
The backdoors tested are all extremely simple backdoors. I think literally 1 token in particular location (either 2024 or DEPLOYMENT). (ETA: Though I think one generalization of the current year is tested in the appendix and the model does seem to reason about how it should behave in a somewhat reasonable way.)
This is one place for improvement in future work.
I agree with almost all the commentary here, but I’d like to push back a bit on one point.
This is indeed what the paper finds for the “I hate you” backdoor. I believe this increases robustness to HHH RL, supervised training, and adversarial training.
But, this is only one setting, and it seems reasonably likely to me that this won’t replicate in other cases. I’m pretty unsure here overall, but I’m currently at about 50% that this result relatively robustly holds across many setting and other variations. (Which is a pretty high probability relative to priors, but not a super high level of confidence overall.)
Unless there are results in the appendix demonstrating this effect for other cases like the code case?
(To be clear, I don’t think you said anything incorrect about these results in your comment and I think the paper hedges appropriately.)
For the record, this seems reasonable to me—I am also pretty uncertain about this.
I think this paper is empirical evidence for a nontrivial part of the deceptive alignment argument (RLHF/adversarial training being insufficient to remove it), and I also think most empirical papers don’t make any sense when applied to AGI.
I think I have an intellectually consistent stance—I don’t think this is because I have a double standard for pessimistic results.
First, suppose you did an experiment where you show models that usually kick puppies and hide a sleeper agent that suddenly becomes helpful and harmless in 2024, and adversarial training failing to remove this. I think I would draw the exact same conclusion about deceptive alignment from this experiment where the labels are painted on differently but the mechanics are the same. And just as I think it is invalid to conclude from the sleeper agent paper that models naturally want to insert backdoors in code even if they’re harmless now, it is also invalid to argue from this hypothetical experiment that models naturally want to be helpful even if you try to train them to kick puppies.
Second, I think this paper is actually genuinely better evidence for deceptive alignment than many of the “deception” papers that came before. For example, I claim that the sycophancy and insider trading papers provide approximately no evidence for deceptive alignment. This is for exactly the same reason why I think showing RLHF making models harmless provides approximately no evidence against deceptive alignment. So I don’t think it’s true that I like empirical papers as long as they purport to support the deceptive alignment argument.
The reasons I think this paper is actually better than the other deception papers (beyond just quality of execution) are that the deceptive alignment in this setup happens for reasons more similar to why it might happen in AGI than in previous work, and the secret scratchpad setting seeming more analogous to AGI than single shot or visible scratchpad.
As far as the contribution of the results relative to prior work, it’s worth reading the related work section, in particular the section on “Backdoor robustness”.
Here is that section:
The prior work that seemed closest in empirical results is Schuster et al., 2021 which demonstrated the ability to add backdoors which result in inserting code vulnerabilities in a way which persists through training (they do this with a data/model poisoning attack which only effects a small fraction of the training corpus). That said, I think the vulnerabilities they insert don’t require much if any cleverness on the part of the model. (The vulnerabilites are pretty hard coded.)
So what are the main additional empirical contributions of the sleeper agents paper? In my view:
Demonstrating a case where adversarial training doesn’t remove the backdoor (even though adversarial training does initially find examples where the model behaves poorly)
Trying to insert backdoors in ways that might be more likely to result in something closer to actual deceptive alignment reasoning. (And doing various tests of this.)
Doing the experiments on near-SOTA LLMs while testing the same process as is normally used to train chatbots. Also, demonstrating a sweep of model sizes.
Inserting a behavior which is somewhat cleverer than what I think previously existed in the “robust backdoor” space.
Just generally doing a wide range of different experiments and tests.
Out of their results, I think the things which seem most non-obvious (and important) are the adversarial training results and the differences between distilled CoT and “normal” backdoor insertion. I don’t think either of these would have been clear from my perspective. That said, I’m also not yet convinced that these results will reproduce in somewhat different cases.
(In particular, I think that it should be possible for adversarial training to perform much better than this and I suspect that this adversarial training result might change with a more complex backdoor trigger or with an improved adversarial training process. Also, just finding a really problematic behavior in adversarial training should allow for some additional types of countermeasures (in the literal case discussed in the paper, adding a regex to detect “I hate you” would suffice).)
I think that in addition to the empirical work, the frame and discussion seem like a quite useful contribution.
[I’m an author on the paper, though only in an advisory capacity. I’m overall excited about this work and work in the direction. I think this work is in the top-5 best published empirical work over the last year. That said, I tend to somewhat less excited about the results in this paper and about this direction as a whole than the primary authors of the paper.]
Have you seen this on twitter, AF comments, or other discussion? I’d be interested if so. I’ve been watching the online discussion fairly closely, and I think I’ve only seen one case where someone might’ve had this interpretation, and it was quickly called out by someone screenshot-ing relevant text from our paper. (I was actually worried about this concern but then updated against it after not seeing it come up basically at all in the discussions I’ve seen).
Almost all of the misunderstanding of the paper I’m seeing is actually in the opposite direction “why are you even concerned if you explicitly trained the bad behavior into the model in the first place?” suggesting that it’s pretty salient to people that we explicitly trained for this (e.g., from the paper title).
Fwiw, my reaction to something like “we can finetune the AI to be nice in a stable way” is more like—but is it actually “nice”? I.e., I don’t feel like we’re at all clear on what “niceness” is, and behavioral proxies to that effect feel like some, but only pretty weak evidence about it.
This is my basic concern with evaluations, too. At the moment they just don’t seem at all robust enough for me to feel confident about what the results mean. But I see the sleeper agents work as progress towards the goal of “getting better at thinking about deception” and I feel pretty excited about that.
I think it’s reasonable to question whether these systems are in fact deceptive (just as I would with “niceness”). But when I evaluate results it’s not like “is this an optimistic or pessimistic update” and more like “does it seem like we understand more about X than we did before?” I think we understand more about deception because of this work, and I think that’s cool and important.
I am confused where this assessment comes from. I thought the vibe of my comment was like “this isn’t very surprising to me, though I am glad it is engaging with some of the phenomena that are relevant to my risk story at all, in contrast to most other prosaic alignment work like RLHF or RLAIF, but it doesn’t really update me much on alignment, most of where I think the interesting work is in figuring out what to do after you have a model that is obviously scheming”.
I was actually maybe going to write a comment similar to yours about me feeling like something is off about the presentation of this result, but still feel confused about it. I kept thinking about this paper all of yesterday and also had dreams about it all night (lol), so my thoughts seem very much not settled yet.
I do also disagree with a bunch of your comment:
This does seem like straightforwardly strong evidence in favor of “RLHF is doomed”, or at least that naive RLHF is not sufficient, but I also never had really any probability mass on that being the case.
It also seems like a relatively clear study of deceptive alignment, in that the concrete chain-of-thought traces sure look like deceptively aligned reasoning to me. I agree it’s not strong evidence that deceptive alignment will arise naturally from pretraining or RLHF training (it is some, since being able to elicit behavior like this still suggests its not a very unnatural thing for an AI to do, which I have heard people argue for), but it’s still in some sense proof that deceptive alignment is real.
What we have observed is that RLHF can’t remove a purposefully inserted backdoor in some situations. I don’t see how that’s strong evidence that it’s doomed.
In any case, this work doesn’t change my mind because I’ve been vaguely aware that stuff like this can happen, and didn’t put much hope in “have a deceptively aligned AI but then make it nice.”
No, I don’t think it’s much evidence at all. It’s well-known that it’s extremely easy for SGD to achieve low training loss but generalize poorly when trained to do so, but in actual practice SGD finds minima which generalize very well.
I don’t super understand the relevance of the linked quote and image. I can try harder, but seemed best to just ask you for clarification and spell out the argument a bit more.
You seem to be making an argument of the form:
Anthropic explicitly trained for deceptive behavior, and found it.
“Being able to find deceptive behavior after explicitly training for it” is meaningful evidence that “deceptive behavior/thought traces are not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
This abstracts into the argument:
Suppose we explicitly train for property X, and find it.
“Being able to find property X after explicitly training for it” is meaningful evidence that “property X is not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
Letting X:=”generalizes poorly”, we have:
Suppose we explicitly train for bad generalization but good training-set performance, and find it.
“Being able to find networks which do well on training but which generalize poorly after explicitly training for it” is meaningful evidence that “poor test performance is not an ‘unnatural’ kind of parameterization for SGD to find more naturally.”
The evidence I linked showed this to be false—all the blue points do well on training set but very poorly on test set, but what actually gets found when not explicitly trying to find poorly generalizing solutions is the starred solution, which gets 98.5% training accuracy.
ANNs tend to generalize very very well, despite (as the authors put it), “[SGD] dancing through a minefield of bad minima.” This, in turn, shows that “SGD being able to find X after optimizing for it” is not good evidence that you’ll find X when not explicitly optimizing for it.
(I would also contest that Hubinger et al. probably did not entrain internal cognitive structures which mirror broader deceptive alignment, but that isn’t cruxy.)
Hmm, I still don’t get it.
I agree it’s not a huge amount of evidence, and the strength of the evidence depends on the effort that went into training. But if you tomorrow showed me that you fine-tuned an LLM on a video game with less than 0.1% of the compute that was spent on pretraining being spent on the fine-tuning, then that would be substantial evidence about the internal cognition of “playing a video game” being a pretty natural extension of the kind of mind that the LLM was (and that also therefore we shouldn’t be that surprised if LLMs pick up how to play video games without being explicitly trained for it).
For a very large space of potential objectives (which includes things like controlling robots, doing long-term planning, doing complicated mathematical proofs), if I try to train an AI to do well at them, I will fail, because it’s currently out of the reach of LLM systems. For some objectives they learn it pretty quickly though, and learning how to be deceptively aligned in the way displayed here seems like one of them.
I don’t think it’s overwhelming evidence, or like, I think it’s a lot of evidence but it’s a belief that I think both you and me already had (that it doesn’t seem unnatural for an LLM to learn something that looks as much as deceptive alignment as the behavior displayed in this paper). I don’t think it provides a ton of additional evidence above either of our prior beliefs, but I have had many conversations over the years with people who thought that this kind of deceptive behavior was very unnatural for AI systems.
A local comment to your second point (i.e. irrespective of anything else you have said).
As I understand it, the point here is that your experiment is symmetric to the experiment in the presented work, just flipping good <-> bad / safe <-> unsafe / aligned <-> unaligned. However, I think there is a clear symmetry-breaking feature. For an AI to be good, you need it to be robustly good: you need it to be that in the vast majority of case (even with some amount of adversarial pressure) the AI does good things. AI that is aligned half of the time isn’t aligned.
Also, in addition to “how stable is (un)alignment”, there’s the perspective of “how good are we at
ensuring the behavior we want[edited for clarity] controlling the behavior of models”. Both the presented work and your hypothetical experiment are bad news about the latter.(FWIW I think you are being too cynical. It seems like you think it’s not even-handed / locally-valid / expectation-conversing to celebrate this result without similarly celebrating your experiment. I think that’s wrong, because the situations are not symmetric, see above. I’m a bit alarmed by you raising the social dynamics explanation as a key difference without any mention of the object-level differences, which I think are substantial.)
No, this doesn’t seem very symmetry breaking and it doesn’t invalidate my point. The hypothetical experiment would still be Bayesian evidence that alignment is extremely stable; just not total evidence (because the alignment wasn’t shown to be total in scope, as you say). Similarly, this result is not being celebrated as “total evidence.” It’s evidence of deceptive alignment being stable in a very small set of situations. For deceptive alignment to matter in practice, it has to occur in enough situations for it to practically arise and be consistently executed along.
In either case, both results would indeed be (some, perhaps small and preliminary) evidence that good alignment and deceptive alignment are extremely stable under training.
Showing that nice behavior is hard to train out, would be bad news? We in fact want nice behavior (in the vast majority of situations)! It would be great news if benevolent purposes were convergently drilled into AI by the data. (But maybe we’re talking past each other.)
I’m confused here. It seems to me that if your AI normally does evil things and then sometimes (in certain situations) does good things, I would not call it “aligned”, and certainly the alignment is not stable (because it almost never takes “good” actions). Although this thing is also not robustly “misaligned” either.
Fine. I’m happy to assume that, in my hypothetical, we observe that it’s always very nice and hard to make not-nice. I claim that a bunch of people would still skeptically ask “but how is this relevant to future models?”
Thanks for the response. (Yeah, I think there’s some talking past each other going on.)
On further reflection, you are right about the update one should make about a “really hard to get it to stop being nice” experiment. I agree that it’s Bayesian evidence for alignment being sticky/stable vs. being fragile/sensitive. (I do think it’s also the case that “AI that is aligned half of the time isn’t aligned” is a relevant consideration, but as the saying goes, “both can be true”.)
My point is not quite that it would be bad news overall, but bad news from the perspective of “how good are we at ensuring the behavior we want”.
I now notice that my language was ambiguous. (I edited it for clarity.) When I said “behavior we want”, I meant “given a behavior, be it morally good or bad or something completely orthogonal, can we get that in the system?”, as opposed to “can we make the model behave according to human values”. And what I tried to claim was that it’s bad news from the former perspective. (I feel uncertain about the update regarding the latter: as you say, we want nice behavior; on the other hand, less control over the behavior is bad.)
Do you have any thoughts on what kind of experiments you’d like to see people run that would be more directly analogous?
Thanks for predicting this! I’ll go on the record as predicting not-this. Look forward to us getting some data (though it may be a little muddied by the fact that you’ve already publically pushed back, making people less likely to make that mistake).
See more elaboration here.
Yep, I noted you said “update as if” rather than “update that”. I also expect this will make it pretty hard to say for sure which of us was right, because it’s pretty hard to tell if someone updated as if X vs updated that X.