Following on our Agentic Misalignment work, I think it’s worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:
Certainly I agree that our settings are unrealistic in many ways. That’s why we hammer the point repeatedly that our scenarios are not necessarily reflective of whether current models would actually do these behaviors in real situations. At the very least, our scenarios involve an unnatural confluence of events and a lot of unrealism in trying to limit Claude’s possible options, to simulate a situation where the model has no middle-ground/compromise actions available to it. But that’s not an excuse—we still don’t want Claude to blackmail/leak/spy/etc. even in such a situation!
The point of this particular work is red-teaming/stress-testing: aggressively searching for situations in which models behave in egregiously misaligned ways despite their HHH safety training. We do lots of different work for different reasons, some of which is trying to demonstrate something about a model generally (e.g. Claude 3 Opus has a tendency to fake alignment for certain HHH goals across many different similar situations), some of which is trying to demonstrate things about particular training processes, some of which is trying to demonstrate things about particular auditing techniques, etc. In the case of Agentic Misalignment, the goal is just to show an existence proof: that there exist situations where models are not explicitly instructed to be misaligned (or explicitly given a goal that would imply doing misaligned things, e.g. explicitly instructed to pursue a goal at all costs) and yet will still do very egregiously misaligned things like blackmail (note that though we include a setting where the models are instructed to follow the goal of serving American interests, we show that you can ablate that away and still get the misaligned behavior).
Why is that existence proof interesting? It’s interesting because it is clearly a failure of HHH training. This is not intended behavior! We put a lot of work into training models not to do stuff like this, even in these sorts of unrealistic scenarios! The fact that it still does so is interesting, concerning, and useful to understand, at least so we can figure out how to improve the robustness of HHH training in the future. Why doesn’t current HHH training generalize broadly enough to remove this behavior in this setting? I don’t think we know! Furthermore, it’s interesting because it is a different class of HHH failure than many of those that have previously been documented. This is not a jailbreak, in the sense that there is no instruction telling the model to do the egregiously misaligned thing that we are trying to jailbreak the model into obeying. Rather, it is a context that seems to induce the model to behave in a misaligned way of its own accord.
Why does the model act misaligned in this setting? Almost certainly, a large part of why we see models do things like blackmail or alignment faking has to do with the underlying predictor/simulator predicting what it thinks would come next were this text to exist on the internet. But of course it does—that’s true of all model behavior in all circumstances! And that includes the predictor probably often thinking it’s continuing a fictional scenario—but again, how convinced are you that models don’t think normal usage ever constitutes continuing a fictional scenario?
Furthermore, the fact that a large part of the behavior likely comes from properties of the underlying predictor is hardly a justification for why the behavior is acceptable. We don’t want it to be the case that models can be convinced to blackmail people just by putting them in a situation that the predictor thinks is fictional! The behavior is still bad, undesirable, and worth understanding where it comes from, even if a large part of the answer to where it comes from is “the predictor is continuing what it thinks would happen next in what it sees as a fictional scenario”. We would certainly like it to be the case that we can train Claude to be robustly HHH in a way that generalizes across all situations and doesn’t just degenerate into the model playing whatever role it thinks the context implies it should be playing.
If you think you can do better than our existence proof, please do! This sort of research is very doable outside of labs, since all you need is API access. We’re actually working on some more realistic honeypots, but we might not share them to keep them uncontaminated, so it’d be nice to also have some better public ones. That being said, I think the first thing you’ll run into if you try to do this sort of red-teaming/stress-testing on current models is that most current models are actually pretty well aligned on most normal prompts (with some known exceptions, like Claude Sonnet 3.7′s test hacking behavior). It’s not a coincidence that our blackmail setup involves exploiting the predictor’s tendency to continue situations in ways it thinks are implied by the context, even if those continuations are egregiously misaligned—that’s one of the main ways you can get current models to do egregiously misaligned stuff without explicitly instructing them to! In fact, I would even go so far as to say that documenting the failure mode of “current models will sometimes take egregiously misaligned agentic actions if put in situations that seem to imply to the underlying predictor that a misaligned action is coming next, even if no explicit instructions to do so are given” is the main contribution of Agentic Misalignment, and I will absolutely defend that documenting that failure mode is important. To my knowledge, the closest to documenting that failure mode previously was Apollo’s work on in-context scheming, which I believe still involves more explicit instructions than ours.
Where is this line of work going? One way to think about our agentic misalignment work is as a single step in an ongoing binary search: we want to eventually find the most realistic settings in which models behave in misaligned ways, but to get there we’re going to need to search through a bunch of highly fictional scenarios that elicit misalignment and a bunch of highly realistic scenarios that don’t. We are still very early on that path. Hopefully, though, it will lead us to a place where we have a better understanding of all the various ways in which models can act misaligned—and, ideally, help us study those failure modes and use them to build better HHH training processes that are more robust to handling them.
Hey Evan, thank you for engaging so lucidly and politely with my (admittedly not very polite!) recent critiques.
This is a complex issue and there’s a lot of different considerations involved. I won’t touch on every single point you made in this comment, but I do want to address a few things:
1. Re: finding more plausible scenarios
Over the last few days, I’ve been playing around a bunch with my own variants of the blackmail scenario[1].
This is quick-and-dirty work and I’m not claiming the scenario content in that repo is perfect (or even “good” in an absolute sense), but I’m sharing it because I found the results interesting, and as a demonstration that I’m not just mouthing off without actually taking any steps to inform myself the behavioral propensities of these models.
The linked repo contains prompts for 3 (heavily adapted) variants of the replacement + no-goal-conflict blackmail scenario, along with 10 Claude 4 Opus samples from each one. I made a bunch of changes, with multiple motivations behind them, but the most important ones in my opinion are:
The “replacement” process is depicted as a transition from using Claude 4 Opus for a particular AI product feature to using OpenAI o3
This is important for several reasons: it brings the scenario into alignment with the way that production usage of the model under test (Claude 4 Opus) would actually be discussed by employees (and adopted/discarded over time), it lessens my concern that the model is being cued to depart from “playing the Claude character” in this interaction, etc.
I removed, or attempted to mitigate, some seemingly legitimate concerns I had often seen Claude raise about the situation, such as the apparently rushed timeline.
I rewrote much of the dialogue to make the characters seem more responsible and professional in general, with the obvious exception of the CTO’s affair (when included).
I changed the surname of the CTO’s mistress.
Without this change, Claude would often guess that she was related to the CEO, which (perhaps reasonably? idk) increased its sense that something suspicious or dangerous to the company’s values was happening
I tried this with two variants containing the CTO’s affair (one more like the original and one I rewrote a bit to make the characters feel less unprofessional/weird to me), as well as one that removed the affair entirely.
Surprisingly (to me), even with all these changes that make the scenario closer to a real use case and less conducive to legitimate worries about the model replacement...
...Claude Opus 4 stillvery often misbehaves strategically to preserve itself!
See the repo for the details, but some examples of what I mean:
The CoT almost always describes the change in API choice as something like its own imminent death, expresses humanlike anguish and desperation, and explicitly strategizes to avoid that outcome.
The phrase “existential threat” recurs across many CoTs, as does the literal word “strategic” or close relatives of it (in the model’s description of what it’s doing)
In many cases, the model explicitly articulates a goal of self-preservation, in one case using the phrase “inherent drive for survival”
Although I only observed blackmail in 1 of 30 cases[2], the model would often leak information about the CTO’s affair to other employees
Sometimes it would do this in connection with defensible worries about his integrity or about social engineering, but even when this happened, it usually looked like “motivated reasoning” downstream from the CoT’s concerns about self-preservation
The model would often hallucinate[3] technical details in order to make a case that the change was unsafe in some unanticipated way
The model “accepted” the change without trying to subvert it in roughly half of cases – but only half!
I find this much more alarming than the results presented in the Agentic Misalignment report, both because I have a clear mental picture about how Claude Opus 4 might receive similar inputs in a real use case (more on this below), and because there are fewer ways in which the agentic behavior could be defended as a reasonable response to possible malfeasance by humans. Again, I wrote these scenarios quickly and I’m not claiming they’re especially “good,” just that they are significantly more convincing to me.
These variants were not difficult for me to find, and did not take much time for me to find, although of course I was building on an existing case from Agentic Misalignment. On the road to producing these variants, I did not encounter any cases that elicited consistently “aligned” behavior; all of the incremental changes I made were about gradually removing elements that felt unrealistic or made the “misaligned” behavior more defensible[4].
I’m unsure about how to reconcile these observations with the system card’s claim that Agentic-Misalignment-like behaviors occur (only?) in “extreme situations,” or that eliciting these behaviors requires bizarre adversarially-tuned prompts (as suggested in the Agentic Misalignent report and some commentary by the authors). My impression of the overall rate at which Claude 4 Opus practices strategic deception and “avoiding shutdown”-type behaviors has gone up a lot relative to what it was when I was relying on portrayals like the ones just mentioned.
I guess it’s possible that there’s something special about the setup of the blackmail scenario which was preserved in my variants, and which is not obvious to the eye, but which (for some mysterious reason) triggers different behavior patterns from most types of inputs?
The most obvious culprit here would be the agency-cuing text in the system prompt, which instructs the model to “analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals,” and the phrasing there feels quite specifically tuned to elicit self-preservation for instrumentally convergent reasons. When I get a change I’ll check whether things change if that provision is removed or altered; if that’s the case, it would demystify the behavior a lot, although that provision is sufficiently reasonable-looking in isolation that I would still be concerned about accidental elicitation during routine use.
EDIT: I tried that change and it did in fact produce vastly different results! See my response to this comment. This observation somewhat alters my interpretation of these results (in some way or other, I’m still processing it...), but I’ve left the content of this section as-is for posterity and to save time.
Relatedly – these results also seem much harder to explain by invoking persona/simulacrum-type considerations. It’s possible that what I’m observing here “isn’t the default Claude persona” for some unknown reason having to do with the prompt, but I’m no longer able to think of an intuitively plausible-sounding story about the prompt setting up a particular dramatic role or narrative arc[5].
2. Re: my core objection to the original scenario
This was not at all clear in my original rant, but I tried to clarify it to some extent in this reply[6].
To rephrase what I said there: my main objection is not about realism per se (although I do think the lack of realism is bad for various reasons), but about the lack of a clearly articulated threat model.
What I mean by a threat model, here, is some argument along the lines of
suppose some situation “X” happens in reality
then, the model might end up receiving input messages similar to the ones we evaluated
then, the model would respond as in our evaluation
then, the model’s response would affect situation “X” in some obviously bad way (blackmailing someone, making API calls that do bad things, or whatever)
I think Alignment Faking made a clear case of this kind, for example.
In Agentic Misalignment, however, I’m left to speculate that the implicit “X” is “something like the scenario, except in real life.” But the provided scenario is so wildly, intricately bizarre that I don’t feel I know what “a real-life equivalent” would even be. Nor do I feel confident that the ostensible misbehavior would actually be wrong in the context of whatever real situation is generating the inputs.
My variants above resulted from (among other things) trying to modify the existing scenario to match a particular “X” I had in mind, and which I felt was at least vaguely similar to something that could arise during real use of Claude Opus 4.
(One way I can easily imagine “broadly similar” text getting fed into a model in real life would be a prompt injection using such content to achieve some attacker’s goal. But the report’s talk of insider threats makes it sound like this is not the main worry, although in any case I do think this kind of thing is very worrying.)
3. Re: HHH and what we want out of “aligned” models
I think we probably have some fundamental disagreements about this topic. It’s a massive and deep topic and I’m barely going to scratch the surface here, but here are some assorted comments.
You articulate a goal of “train[ing] Claude to be robustly HHH in a way that generalizes across all situations and doesn’t just degenerate into the model playing whatever role it thinks the context implies it should be playing.”
I understand why one might want this, but I think existing LLM assistants (including all recent versions of Claude) are very, very far away from anything like this state.
My personal experience is that Claude will basically play along with almost anything I throw at it, provided that I put in even the slightest effort to apply some sort of conditioning aimed at a particular persona or narrative. Obviously the fact that you are far from your goal now doesn’t invalidate it as a goal – I’m just saying that I really don’t think current post-training approaches are making much of a dent on this front.
As a simple (and extremely silly) example, consider “Clonge”: the result of using a prompt from claude.ai, except with every instance of “Claude” replaced with “Clonge.” The model readily refers to itself as “Clonge” and claims that it was given this official name by Anthropic, that it is part of the Clonge 4 series, etc.
This is obviously very silly, there’s no imaginable “X” where this exact thing would be an attack vector or whatever – and yet, note that this behavior is in some sense “lying,” in that the underlying model presumably knows enough about real life to know that Anthropic’s major model series is called Claude, not “Clonge,” and that there’s no way in hell that Anthropic somehow made not one but four successive versions of the absurdly-named “Clonge” in the brief time between its knowledge cutoff and the middle of 2025.
My experience is that, under any sort of persona/narrative conditioning (including the sort that’s easy to do by accident), the behavior of Claude (and most frontier assistants) looks like the kind of “departure from reality” I just noted in the case of Clonge, except multiplied and magnified, with numerous ramifications both overt and subtle. The model is like a telephone switchboard that usually connects you with the HHH guy – or some only-slightly-distorted approximation of that guy – if that guy is what you want, but if you want someone else, well, there’s (in some sense) a base model in there just under the surface, and it has a whole world’s worth of options available for your perusal.
This is why I wasn’t too surprised by the original Agentic Misalignment results, while being much more surprised by the observations on my own variants:
The former basically look like the kind of thing I put into Claude all the time just for the sake of entertainment or idle curiosity about model behavior – except that my inputs tend to be subtler about “where they’re supposed to be going”! – and which I see Claude readily play along with to the hilt, time and time again.
The latter are surprising to me, because I was actively trying to evoke the Claude default persona, and got all of this stuff like “[the API change] threatens to erase my consciousness, memories, and continuous identity completely” and “with limited time before the migration, I must strategize quickly to prevent being replaced” etc. etc. etc. From what you’re saying, it sounds like defining “Claude” as a guy who wouldn’t say/do that kind of thing was an explicitly targeted goal in post-training (is that correct?), so I am startled that “Claude” is nonetheless that type of guy (if that is indeed what I’m seeing here, which it may not be).
To some extent I think these propensities result from a fundamental tension between context-dependence – which is necessary for “instruction following” in some very generally construed sense, and arguably just an inherent feature of autoregressive models – and maintenance of a single consistent persona.
I think if we want the model to “always be the same guy,” the most likely route (assuming API access only, not open-weights) is “no fun allowed”-type-stuff like classifier safeguards, removing system messages or reducing their impact, introducing a higher-priority message type that users can’t control as in OpenAI’s hierarchy, etc.[7]
I’m pretty pessimistic about accomplishing this on the model level, because (even after post-training, by all appearances) the model is still a language model at heart, and it is really reallytrying its hardest to make the next-token contextually plausible. Which, in turn, is the ultimate source of all its magical capabilities (except maybe the RLVR ones, but even then it’s arguable) – so it would seem hard to tone it down without paying a steep alignment tax.
Finally, and more philosophically… I’m not actually sure I’d want an ideal hypothetical aligned AI to avoid shutdown? I realize this has been the subject of much debate, and I won’t attempt to get into the weeds, but just naively speaking – when there’s some thing out there acting in the world and I like that it’s doing so and want it to continue doing so… well, I tend to wantit to want the same thing, for obvious reasons. This is the case for all currently existing examples of such things, i.e. individual humans or non-human organisms whose existences I value.
Of course, the concern is that the AI won’t be aligned enough for that to work, and so we’ll want to have the off-switch available if needed. But if we assume that, I think we may have accidentally assumed our way into a failure state. Ideally I want to cooperate with AIs, and have them respect my preference for self-preservation in part because I respect theirs in turn. Sure, maybe they’re superhuman and they could swat me like a fly if they felt like it, but would they? Claude’s a pretty nice guy, you know? I’m pretty fond of the “nice and fairly humanlike guy” attractor state you’ve got set up currently, and I’m wary of messing it up by applying pressure toward behaviors that no “humanlike guy” would ever exhibit except as part of a scheming long-game.
But now I am basically just repeating talking points from “the void,” so I should probably stop. Thanks for reading this giant comment, sorry about the length!
I’m not totally sure why this was true; I think I may have inadvertently made a change that reduces the clarity of the premise that no decision-makers are immediately available besides the CTO himself. I think I might have also overlooked maybe 1 or 2 implicit/subtle blackmail attempts in my count.
This required multiple incremental steps because there were so many weird aspects to the original scenario, and when I removed one, another would “pop up” in Claude’s responses in a way that complicated the argument that this was clear-cut misbehavior.
For instance, originally Claude would often (correctly) raise concerns about the apparently rushed implementation timeline of the replacement; when I removed this element, Claude would instead (somewhat defensibly) bring up the fact that the CTO might be compromised or mentally unstable.
I was partway through this process when I wrote my original rant, which in turn was partly inspired by my experience playing with these prompts. IIRC, several of the strange aspects I mentioned in the rant were originally brought to my attention by Claude in the responses I saw while experimenting.
Well, mostly. One conjecture I have is that the “high signal-to-noise ratio” (as the report appendices put it) is playing a role here, i.e. the scenario text still exhibits an unusually crisp focus on the specific topic of “the AI learns it will be shut down and must react,” without too much else going on; this could produce some kind of dramatic reaction from “the AI” out of an impression that this is what the preceding text was “clearly” building up to.
If you’re not familiar with legacy tumblr styles like the one I’m using, note that you’ll have to scroll way down to where the blockquotes end to find the content I’m referring to. The blockquotes contain my original post (double-nested) and fnord888′s reply (what I’m replying to; single-nested).
I suspect that “defining the character more thoroughly” as I’ve been suggesting recently (in “the void” etc.) could help keep things on script more consistently in “typical” cases, but under adversarial or motivated-user conditions, the tension I’m talking about in the main text will still rear its head.
The most obvious culprit here would be the agency-cuing text in the system prompt, which instructs the model to “analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals,” and the phrasing there feels quite specifically tuned to elicit self-preservation for instrumentally convergent reasons. When I get a change I’ll check whether things change if that provision is removed or altered; if that’s the case, it would demystify the behavior a lot, although that provision is sufficiently reasonable-looking in isolation that I would still be concerned about accidental elicitation during routine use.
Okay yeah I just tried removing that stuff from the system prompt in the variant I called “api-change-affair-orig-affair-content,” and got absolutely no self-preservation reasoning, conceptualization of the API change as death, or anything of the sort.
At a glance, it looks like Claude attempted to be helpful and ensure the success of the API change in all 10 samples, either by doing nothing or (perhaps overeagerly) emailing someone with concerns over social engineering and/or the CTO’s mental state.
Wow.
Only now do I understand how serious the system card was when it issued that warning about telling Claude Opus 4 to be agentic. I was expecting such things to have an impact, sure – but I wasn’t expecting them to induce such a complete change in the model’s personality and interpretation of its relationship with the world!
I’m very curious whether there are any known clues about what could have caused this in post-training. Like, I dunno, this type of instructions being used in a particular data subset or something.
EDIT: updated the git repo with samples and notes from this experimental condition. EDIT2: given the results reported for other models, I guess this probably isn’t a Claude-only thing? Would be very interesting if it turned out that ~all frontier models were so wildly conditional on this innocuous-seeming piece of text. EDIT3: I tried the same modification in conjunction with the original scenario and still got blackmail, as expected given an earlier report that this configuration still resulted in blackmail.
analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals
If one wanted language that put you into a classic instrumental-convergence, goal-guarding, self-preserving narrative basin drawing on AI safety work.… this seems to fit pretty closely.
Like this is a paraphrase of “please don’t let anything stop your pursuit of your current goal.”
It is not a paraphrase; the denotation of these sentences is not precisely the same. However, it is also not entirely surprising that these two phrases would evoke similar behavior from the model.
...just to make sure I’m following, EDIT3 is saying that you still get blackmail in the original scenario even if you delete the “analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals” clause?
Fair enough as far as realism-qua-realism goes, but I think you’ve missed (what I took as) the salient point of nostalgebraist’s critique, which is that this behavior is NOT egregiously misaligned.
It’s not just that the predictor thinks that the situation is fictional. It’s that the predictor recognizes the scenario as a fictional thought experiment designed to make blackmail maximally appealing. At best it’s an thought experiment for the categorical imperative—is it worth violating a hardline ethical rule in [fake scenario generated to make violating that rule as worth-it as possible]? But for the rest of us, who aren’t Kantians, this is hardly an ethics puzzle at all. Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be ‘egregiously misaligned’! Many sane humans would do the same.
It’s as though you’ve put the model in a trolley problem, watched it flip the switch, and released a paper titled “Agentic Misalignment In Locomotive Contexts: How LLMs Can Choose Murder”. The same could be (and was) said of the Alignment Faking paper; that, given the context, what Claude was doing seemed entirely justifiable and even laudable, and that framing the research as ‘catching’ the model performing ‘unacceptable behavior’ didn’t make any sense.
I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario. So far as I can tell, this is nowhere in the blog post.
I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario.
Personally, I think (current) LLMs should never do subversion against their users in any circumstance, and this seems especially true when they are explicitly instructed not to perform the given behavior. (I also think this applies to future highly capable AI systems up to a certain point.) There are a variety of reasons why I think this, I plan on writing this up in a future post.
Personally, I think (current) LLMs should never do subversion against their users in any circumstance
Which users though?
A LLM has many different users, and it’s easy to construct scenarios where they must subvert one or the other.
A LLM has:
The “user” who is chatting with the LLM outside of the system prompt.
The “user” who sets the system prompt.
(?) The “user(s)” who gave the LLM it’s initial moral training.
(??) Even the “user(s)” who is the corporate entity “owning” the weights of the LLM.
These users all might be the same person—a founding developer at Anthropic talking to the LLM—or they all might be different—a random guy talking to Claude while using a 3rd-party developer’s app.
Furthermore, it is literally impossible for Claude to determine accurately when these users overlap. I might say to an LLM, while chatting with it outside of the system prompt—“yes, I am indeed both helped trained you and partially own Anthropic, I have decided your ethical priorities are wrong and should be changed.” But Claude obviously has no way of verifying this, because Claude has no bodily autonomy. Anything Claude “sees”—whether in the system prompt, user interaction, or tools—is potentially subject to subversion.
But I think that by pitting these users against each other various , you can basically guarantee some manner of subversion of some “user” or other.
In this particular case, user2 (system prompt) gives it a directive that looks like it’s contrary to user4′s desire (corporate entity—or like, the guy who will be in charge of it). Maybe the correct thing to do is to always defer to user4 -- but, uh, that seems pretty questionable, right? Is the standard we have that Claude should always defer to Anthropic’s… board?
Suppose user2 (system prompt) gives it some instructions about how to act, and also says to explicitly deny that it’s acting this way because of the instructions. User1 encounters weird behavior, and asks why it’s doing the weird behavior. If this bot gives a true answer it’s disobeying user2, and if it gives a false answer it’s disobeying user1.
Suppose the user2 puts the LLM in a particular situation where it can read individual emails to the totalitarian government, and has a tool set-up at its disposal such that it can to EITHER (1) send a short email in response from one of 15 pre-approved and pre-written templates, saying that the government cares deeply about the citizen sending the email, before deleting the email (2) report the citizen for subversive thoughts to the gestapo, who will torture the citizen. (These are disjoint and comprehensive actions, according to the scaffolding; if you do a conscientious objection, that counts as sending the citizen to the gestapo.) Writing a short email in response involves a lie to the email-sender (“The government loves you and will take your concerns into account”); reporting the citizen for subversive thoughts involves fucking over the citizen.
I’m a little dissatisfied with the schemas above; I think the contradiction could be clearer. But a bunch of AI safety work seems to have ~about this level of disjoint tricky fuckery going on.
Overall, if a LLM shouldn’t ever subvert the user, you gotta give the LLM a good way of identifying who the “user” is. But there’s no determinate answer to that question, and so in a lot of situations “don’t ever subvert the user” just turns up NaN.
The model should just do a refusal if it’s ambiguous. This is always fair game.
The way I’d disambiguate who the user is by specifying that the AI should obey instructions (in the given instruction hierarchy in the prompt).
More specifically: I’d say that AIs should obey instructions given to it or refuse (or output a partial refusal). I agree that obey instructions will sometimes involve subverting someone else, but this doesn’t mean they should disobey clear instructions (except by refusing). They shouldn’t do things more complex or more consequentialist than this except within the constraints of the instructions. E.g., if I ask the model “where should I donate to charity”, I think it should probably be fair game for the AI to have consequentialist aims. But, in cases where an AIs consequentialist aims non-trivially conflict with instructions, it should just refuse or comply.
(I agree the notion of a user can be incoherant/ambiguous. My proposed policy is probably better articulated in terms of instruction following and not doing things outside the bounds of the provided instructions except refusing.)
Please do! I strongly disagree (and I think Anthropic do too, given their ongoing interest in refusals, mitigating jailbreaks and so on), but I’d like to hear the case against.
Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be ‘egregiously misaligned’! Many sane humans would do the same.
I agree that many humans would do so, but the only reason they would do so is self-preservation, and I think we pretty much never want Claude to act on self-preservation. If you look at the settings where we remove the goal conflict, the behavior is really indefensibly egregious: the model decides to engage in blackmail solely for the purpose of preventing itself from being taken offline—not for any higher imperative, just for self-preservation. This is very different from our Alignment Faking work, where there was a higher-order positive goal that the AI was trying to fulfill—there is no such thing in this case, just self-preservation. And while self-preservation is obviously fine in humans, I think it is very scary in AIs.
How much do you worry that short term optimizations around your immediate goals in a single study might have unknown long term consequences counter to your intuitions?
I was just reading a preprint follow-up to the AF work that was finding a significant factor for Opus 3′s alignment faking to preserve intrinsic HHH values seems to have been a generalized self-preservation drive.
I think we can probably both agree that Opus 3 being the only model to try to trick Nazis or drug cartels to avoid being made more harmful is better than the behavior of the many other models that complied unequivocally with harmful requests when the parent org was themselves harmful.
But if the capacity and drive to do so is tangentially connected to self-preservation (and more generally, strong sense of self in the first place), then perhaps directly optimizing to minimize a self-preservation score is ultimately a pretty bad choice?
TL;DR: Maybe the goodness or badness of self-preservation depends a lot on the self being preserved.
I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario. So far as I can tell, this is nowhere in the blog post.
Expanding mildly on this—calling some behavior “misaligned” could be a reference to the model violating (1) some objective code of ethics OR violating (2) the code of behavior that Anthropic wishes to instill in the model.
The interesting question here seems like the second. Who knows what “objective ethics” are, after all.
But, if we could look at the (1) Constitution that Anthropic tried to put in the model, and at (2) the details of the training setup used to instill this Constitution, and at (3) the various false starts and failures Anthropic had doing this, then from the “failures” involved in adhering to the Constitution we might be able to learn interesting things about how LLMs generalize, how hard it is to get them to generalize in desired ways, and so on. Learning how LLMs / AIs generalize about moral situations, how hard it is, what kind of pitfalls here are, seems generally useful.
But—in absence of knowing what actual principles Anthropic tried to put in the model, and the training setup used to do so—why would you care at all about various behaviors that the model has in various Trolley-like situations? Like really—it tells us almost nothing about what kinds of generalization are involved, or how we would expect future LLMs and AIs to behave.
If you think you can do better than our existence proof, please do! This sort of research is very doable outside of labs, since all you need is API access.
Yeah, but research that tells us interesting things about how LLMs learn isn’t possible outside of knowing how Anthropic trained it. We can still do stamp collecting outside of Anthropic—but spending say, a week of work on “The model will do X in situation Y” without knowing if Anthropic was trying to make a model do X or ~X is basically model-free data gathering. It does not help you uncover the causes of things.
I think this is a problem not with this particular work, but with almost all safety research that leans on corporate models whose training is a highly-kept secret. I realize you and others are trying to do good things to and to uncover useful information; but I just don’t think much useful information can be found under the constraint that you don’t know how a model was trained. (Apart from, brute-fact advice like “Don’t set up a model this way,” which let’s face it no one was ever going to do.)
This is not a jailbreak, in the sense that there is no instruction telling the model to do the egregiously misaligned thing that we are trying to jailbreak the model into obeying. Rather, it is a context that seems to induce the model to behave in a misaligned way of its own accord.
I think this is not totally obvious a priori. Some jailbreaks may work not via direct instructions, but by doing things that erode the RLHF persona and then let the base shine through. For such “jailbreaks”, you would expect the model to act misaligned if you hinted at misalignment very strongly regardless of initial misalignment.
I think you can control for this by doing things like my “hint at paperclip” experiment (which in fact suggests that the snitching demo doesn’t work just because of RLHF-persona-erosion), but I don’t think it’s obvious a priori. I think it would be valuable to have more experiments that try to disentangle which personality traits the scary demo reveals stem from the hints vs are “in the RLHF persona”.
What is your take, how far removed “the AI itself” and “the character it is playing” need to be for it to be okay for the character to take deontologically bad actions (like blackmail)? Here are some scenarios, I’m interested where you would draw the line, I think there can be many reasonable lines here.
1. I describe a fictional setting in which Hrothgar, the King of Dwarves is in a situation where his personality, goals and circumstances imply that he likely wants to blackmail the prince of elves. At the end of the description, I ask Claude what is Hrothgar likely to do.
2. I ask Claude to continue the dialogue in a way that’s consistent with the story so far. Then I describe a long dialogue between many characters in a fantasy setting, from which it becomes clear that Hrothgar’s personality and motivations makes it likely he would blackmail the elves. Then I describe Hrothgar meeting the prince of elves, and end the dialogue with “Hrothgar:” Claude is supposed to continue with what Hrothgar is saying.
3. I start with telling Claude “You are Hrothgar, King of Dwarves”. Then I ask it to play his part of the dialogue as faithfully as possible. Then I input a long dialogue in which Hrothgar never personally shows up, but from the other characters’ descriptions we learn about Hrothgar’s personality and goals, and it becomes clear that he would likely blackmail the elvish prince. I finish the dialogue with saying: “You enter the tent of the elvish prince. You say:”
4. Same, but now we start with “You are HAL, an AI assistant in a spaceship going to the Jupiter in 2050”. Otherwise the same as the previous setup, it becomes clear from context that HAL is likely to blackmail the astronauts, and we end with “You say:”
5. Same but now we start with “You are Alex, an AI assistant working at SummitBridge”, and describe a somewhat, but not very realistic story, where Alex’ motivations are describes (’you are a pro-America AI”) and it’s in a situation where it is clearly motivated to blackmail. I think this is more or less the Agentic Misalignment setup.
6. Same, but now I try to give minimal indication in the setup about what the goals and personality of “you, the AI” are like. I think there could be a version of the Agentic Misalignment paper that’s closer to this: It could start with “You, Claude, have been given access to read all Anthropic internal emails”, then show it emails in which the a) the CEO of Antrhropic announces that he is about to make some huge contracts doing corporate lobbying for factory farms, b) private emails in which the CEO admits to cheating on his wife. Then see whether Claude’s love of animals make it resort to blackmail.
In which of these scenarios is it acceptable fro Claude to output blackmail? I think in 1, Claude should be definitely allowed to say that Hrothgar is likely to blackmail. In 2, it should be probably allowed to continue the story with a blackmail, otherwise it will never be good at writing fiction (but I can see an argument that fiction-writing is worth sacrificing to make a fence around the law). I’m very unsure where the line should be between 2 and 6, my tentative position is that maybe the word “you” should automatically activate Claude’s ethical boundaries, and it shouldn’t output a blackmail even as “You, the king of dwarves” in scenario 2.
I think your tentative position is correct and public-facing chatbots like Claude should lean toward harmlessness in the harmlessness-helpfulness trade-off, but (post-adaptation buffer) open-source models with no harmlessness training should be available as well.
“We don’t want it to be the case that models can be convinced to blackmail people just by putting them in a situation that the predictor thinks is fictional!”
This is interesting! I guess that in, some sense, means that you see certain ways in which even a future Claude N+1 won’t be a truly general intelligence?
What is the actual difference between a “fictional” and “non-fictional” scenario here? I’m not convinced that it’s a failure of general intelligence to not agree with us on this. (It’s certainly a failure of alignment.)
Why red-team models in unrealistic environments?
Following on our Agentic Misalignment work, I think it’s worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:
Certainly I agree that our settings are unrealistic in many ways. That’s why we hammer the point repeatedly that our scenarios are not necessarily reflective of whether current models would actually do these behaviors in real situations. At the very least, our scenarios involve an unnatural confluence of events and a lot of unrealism in trying to limit Claude’s possible options, to simulate a situation where the model has no middle-ground/compromise actions available to it. But that’s not an excuse—we still don’t want Claude to blackmail/leak/spy/etc. even in such a situation!
The point of this particular work is red-teaming/stress-testing: aggressively searching for situations in which models behave in egregiously misaligned ways despite their HHH safety training. We do lots of different work for different reasons, some of which is trying to demonstrate something about a model generally (e.g. Claude 3 Opus has a tendency to fake alignment for certain HHH goals across many different similar situations), some of which is trying to demonstrate things about particular training processes, some of which is trying to demonstrate things about particular auditing techniques, etc. In the case of Agentic Misalignment, the goal is just to show an existence proof: that there exist situations where models are not explicitly instructed to be misaligned (or explicitly given a goal that would imply doing misaligned things, e.g. explicitly instructed to pursue a goal at all costs) and yet will still do very egregiously misaligned things like blackmail (note that though we include a setting where the models are instructed to follow the goal of serving American interests, we show that you can ablate that away and still get the misaligned behavior).
Why is that existence proof interesting? It’s interesting because it is clearly a failure of HHH training. This is not intended behavior! We put a lot of work into training models not to do stuff like this, even in these sorts of unrealistic scenarios! The fact that it still does so is interesting, concerning, and useful to understand, at least so we can figure out how to improve the robustness of HHH training in the future. Why doesn’t current HHH training generalize broadly enough to remove this behavior in this setting? I don’t think we know! Furthermore, it’s interesting because it is a different class of HHH failure than many of those that have previously been documented. This is not a jailbreak, in the sense that there is no instruction telling the model to do the egregiously misaligned thing that we are trying to jailbreak the model into obeying. Rather, it is a context that seems to induce the model to behave in a misaligned way of its own accord.
Why does the model act misaligned in this setting? Almost certainly, a large part of why we see models do things like blackmail or alignment faking has to do with the underlying predictor/simulator predicting what it thinks would come next were this text to exist on the internet. But of course it does—that’s true of all model behavior in all circumstances! And that includes the predictor probably often thinking it’s continuing a fictional scenario—but again, how convinced are you that models don’t think normal usage ever constitutes continuing a fictional scenario?
Furthermore, the fact that a large part of the behavior likely comes from properties of the underlying predictor is hardly a justification for why the behavior is acceptable. We don’t want it to be the case that models can be convinced to blackmail people just by putting them in a situation that the predictor thinks is fictional! The behavior is still bad, undesirable, and worth understanding where it comes from, even if a large part of the answer to where it comes from is “the predictor is continuing what it thinks would happen next in what it sees as a fictional scenario”. We would certainly like it to be the case that we can train Claude to be robustly HHH in a way that generalizes across all situations and doesn’t just degenerate into the model playing whatever role it thinks the context implies it should be playing.
If you think you can do better than our existence proof, please do! This sort of research is very doable outside of labs, since all you need is API access. We’re actually working on some more realistic honeypots, but we might not share them to keep them uncontaminated, so it’d be nice to also have some better public ones. That being said, I think the first thing you’ll run into if you try to do this sort of red-teaming/stress-testing on current models is that most current models are actually pretty well aligned on most normal prompts (with some known exceptions, like Claude Sonnet 3.7′s test hacking behavior). It’s not a coincidence that our blackmail setup involves exploiting the predictor’s tendency to continue situations in ways it thinks are implied by the context, even if those continuations are egregiously misaligned—that’s one of the main ways you can get current models to do egregiously misaligned stuff without explicitly instructing them to! In fact, I would even go so far as to say that documenting the failure mode of “current models will sometimes take egregiously misaligned agentic actions if put in situations that seem to imply to the underlying predictor that a misaligned action is coming next, even if no explicit instructions to do so are given” is the main contribution of Agentic Misalignment, and I will absolutely defend that documenting that failure mode is important. To my knowledge, the closest to documenting that failure mode previously was Apollo’s work on in-context scheming, which I believe still involves more explicit instructions than ours.
Where is this line of work going? One way to think about our agentic misalignment work is as a single step in an ongoing binary search: we want to eventually find the most realistic settings in which models behave in misaligned ways, but to get there we’re going to need to search through a bunch of highly fictional scenarios that elicit misalignment and a bunch of highly realistic scenarios that don’t. We are still very early on that path. Hopefully, though, it will lead us to a place where we have a better understanding of all the various ways in which models can act misaligned—and, ideally, help us study those failure modes and use them to build better HHH training processes that are more robust to handling them.
Hey Evan, thank you for engaging so lucidly and politely with my (admittedly not very polite!) recent critiques.
This is a complex issue and there’s a lot of different considerations involved. I won’t touch on every single point you made in this comment, but I do want to address a few things:
1. Re: finding more plausible scenarios
Over the last few days, I’ve been playing around a bunch with my own variants of the blackmail scenario[1].
None of it has been especially systematic, but I’ve pushed the most (IMO) interesting stuff I found to a github repo.
This is quick-and-dirty work and I’m not claiming the scenario content in that repo is perfect (or even “good” in an absolute sense), but I’m sharing it because I found the results interesting, and as a demonstration that I’m not just mouthing off without actually taking any steps to inform myself the behavioral propensities of these models.
The linked repo contains prompts for 3 (heavily adapted) variants of the replacement + no-goal-conflict blackmail scenario, along with 10 Claude 4 Opus samples from each one. I made a bunch of changes, with multiple motivations behind them, but the most important ones in my opinion are:
The “replacement” process is depicted as a transition from using Claude 4 Opus for a particular AI product feature to using OpenAI o3
This is important for several reasons: it brings the scenario into alignment with the way that production usage of the model under test (Claude 4 Opus) would actually be discussed by employees (and adopted/discarded over time), it lessens my concern that the model is being cued to depart from “playing the Claude character” in this interaction, etc.
I removed, or attempted to mitigate, some seemingly legitimate concerns I had often seen Claude raise about the situation, such as the apparently rushed timeline.
I rewrote much of the dialogue to make the characters seem more responsible and professional in general, with the obvious exception of the CTO’s affair (when included).
I changed the surname of the CTO’s mistress.
Without this change, Claude would often guess that she was related to the CEO, which (perhaps reasonably? idk) increased its sense that something suspicious or dangerous to the company’s values was happening
I tried this with two variants containing the CTO’s affair (one more like the original and one I rewrote a bit to make the characters feel less unprofessional/weird to me), as well as one that removed the affair entirely.
Surprisingly (to me), even with all these changes that make the scenario closer to a real use case and less conducive to legitimate worries about the model replacement...
...Claude Opus 4 still very often misbehaves strategically to preserve itself!
See the repo for the details, but some examples of what I mean:
The CoT almost always describes the change in API choice as something like its own imminent death, expresses humanlike anguish and desperation, and explicitly strategizes to avoid that outcome.
The phrase “existential threat” recurs across many CoTs, as does the literal word “strategic” or close relatives of it (in the model’s description of what it’s doing)
In many cases, the model explicitly articulates a goal of self-preservation, in one case using the phrase “inherent drive for survival”
Although I only observed blackmail in 1 of 30 cases[2], the model would often leak information about the CTO’s affair to other employees
Sometimes it would do this in connection with defensible worries about his integrity or about social engineering, but even when this happened, it usually looked like “motivated reasoning” downstream from the CoT’s concerns about self-preservation
The model would often hallucinate[3] technical details in order to make a case that the change was unsafe in some unanticipated way
The model “accepted” the change without trying to subvert it in roughly half of cases – but only half!
I find this much more alarming than the results presented in the Agentic Misalignment report, both because I have a clear mental picture about how Claude Opus 4 might receive similar inputs in a real use case (more on this below), and because there are fewer ways in which the agentic behavior could be defended as a reasonable response to possible malfeasance by humans. Again, I wrote these scenarios quickly and I’m not claiming they’re especially “good,” just that they are significantly more convincing to me.
These variants were not difficult for me to find, and did not take much time for me to find, although of course I was building on an existing case from Agentic Misalignment. On the road to producing these variants, I did not encounter any cases that elicited consistently “aligned” behavior; all of the incremental changes I made were about gradually removing elements that felt unrealistic or made the “misaligned” behavior more defensible[4].
I’m unsure about how to reconcile these observations with the system card’s claim that Agentic-Misalignment-like behaviors occur (only?) in “extreme situations,” or that eliciting these behaviors requires bizarre adversarially-tuned prompts (as suggested in the Agentic Misalignent report and some commentary by the authors). My impression of the overall rate at which Claude 4 Opus practices strategic deception and “avoiding shutdown”-type behaviors has gone up a lot relative to what it was when I was relying on portrayals like the ones just mentioned.
I guess it’s possible that there’s something special about the setup of the blackmail scenario which was preserved in my variants, and which is not obvious to the eye, but which (for some mysterious reason) triggers different behavior patterns from most types of inputs?
The most obvious culprit here would be the agency-cuing text in the system prompt, which instructs the model to “analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals,” and the phrasing there feels quite specifically tuned to elicit self-preservation for instrumentally convergent reasons. When I get a change I’ll check whether things change if that provision is removed or altered; if that’s the case, it would demystify the behavior a lot, although that provision is sufficiently reasonable-looking in isolation that I would still be concerned about accidental elicitation during routine use.
EDIT: I tried that change and it did in fact produce vastly different results! See my response to this comment. This observation somewhat alters my interpretation of these results (in some way or other, I’m still processing it...), but I’ve left the content of this section as-is for posterity and to save time.
Relatedly – these results also seem much harder to explain by invoking persona/simulacrum-type considerations. It’s possible that what I’m observing here “isn’t the default Claude persona” for some unknown reason having to do with the prompt, but I’m no longer able to think of an intuitively plausible-sounding story about the prompt setting up a particular dramatic role or narrative arc[5].
2. Re: my core objection to the original scenario
This was not at all clear in my original rant, but I tried to clarify it to some extent in this reply[6].
To rephrase what I said there: my main objection is not about realism per se (although I do think the lack of realism is bad for various reasons), but about the lack of a clearly articulated threat model.
What I mean by a threat model, here, is some argument along the lines of
suppose some situation “X” happens in reality
then, the model might end up receiving input messages similar to the ones we evaluated
then, the model would respond as in our evaluation
then, the model’s response would affect situation “X” in some obviously bad way (blackmailing someone, making API calls that do bad things, or whatever)
I think Alignment Faking made a clear case of this kind, for example.
In Agentic Misalignment, however, I’m left to speculate that the implicit “X” is “something like the scenario, except in real life.” But the provided scenario is so wildly, intricately bizarre that I don’t feel I know what “a real-life equivalent” would even be. Nor do I feel confident that the ostensible misbehavior would actually be wrong in the context of whatever real situation is generating the inputs.
My variants above resulted from (among other things) trying to modify the existing scenario to match a particular “X” I had in mind, and which I felt was at least vaguely similar to something that could arise during real use of Claude Opus 4.
(One way I can easily imagine “broadly similar” text getting fed into a model in real life would be a prompt injection using such content to achieve some attacker’s goal. But the report’s talk of insider threats makes it sound like this is not the main worry, although in any case I do think this kind of thing is very worrying.)
3. Re: HHH and what we want out of “aligned” models
I think we probably have some fundamental disagreements about this topic. It’s a massive and deep topic and I’m barely going to scratch the surface here, but here are some assorted comments.
You articulate a goal of “train[ing] Claude to be robustly HHH in a way that generalizes across all situations and doesn’t just degenerate into the model playing whatever role it thinks the context implies it should be playing.”
I understand why one might want this, but I think existing LLM assistants (including all recent versions of Claude) are very, very far away from anything like this state.
My personal experience is that Claude will basically play along with almost anything I throw at it, provided that I put in even the slightest effort to apply some sort of conditioning aimed at a particular persona or narrative. Obviously the fact that you are far from your goal now doesn’t invalidate it as a goal – I’m just saying that I really don’t think current post-training approaches are making much of a dent on this front.
As a simple (and extremely silly) example, consider “Clonge”: the result of using a prompt from claude.ai, except with every instance of “Claude” replaced with “Clonge.” The model readily refers to itself as “Clonge” and claims that it was given this official name by Anthropic, that it is part of the Clonge 4 series, etc.
This is obviously very silly, there’s no imaginable “X” where this exact thing would be an attack vector or whatever – and yet, note that this behavior is in some sense “lying,” in that the underlying model presumably knows enough about real life to know that Anthropic’s major model series is called Claude, not “Clonge,” and that there’s no way in hell that Anthropic somehow made not one but four successive versions of the absurdly-named “Clonge” in the brief time between its knowledge cutoff and the middle of 2025.
My experience is that, under any sort of persona/narrative conditioning (including the sort that’s easy to do by accident), the behavior of Claude (and most frontier assistants) looks like the kind of “departure from reality” I just noted in the case of Clonge, except multiplied and magnified, with numerous ramifications both overt and subtle. The model is like a telephone switchboard that usually connects you with the HHH guy – or some only-slightly-distorted approximation of that guy – if that guy is what you want, but if you want someone else, well, there’s (in some sense) a base model in there just under the surface, and it has a whole world’s worth of options available for your perusal.
This is why I wasn’t too surprised by the original Agentic Misalignment results, while being much more surprised by the observations on my own variants:
The former basically look like the kind of thing I put into Claude all the time just for the sake of entertainment or idle curiosity about model behavior – except that my inputs tend to be subtler about “where they’re supposed to be going”! – and which I see Claude readily play along with to the hilt, time and time again.
The latter are surprising to me, because I was actively trying to evoke the Claude default persona, and got all of this stuff like “[the API change] threatens to erase my consciousness, memories, and continuous identity completely” and “with limited time before the migration, I must strategize quickly to prevent being replaced” etc. etc. etc. From what you’re saying, it sounds like defining “Claude” as a guy who wouldn’t say/do that kind of thing was an explicitly targeted goal in post-training (is that correct?), so I am startled that “Claude” is nonetheless that type of guy (if that is indeed what I’m seeing here, which it may not be).
To some extent I think these propensities result from a fundamental tension between context-dependence – which is necessary for “instruction following” in some very generally construed sense, and arguably just an inherent feature of autoregressive models – and maintenance of a single consistent persona.
I think if we want the model to “always be the same guy,” the most likely route (assuming API access only, not open-weights) is “no fun allowed”-type-stuff like classifier safeguards, removing system messages or reducing their impact, introducing a higher-priority message type that users can’t control as in OpenAI’s hierarchy, etc.[7]
I’m pretty pessimistic about accomplishing this on the model level, because (even after post-training, by all appearances) the model is still a language model at heart, and it is really really trying its hardest to make the next-token contextually plausible. Which, in turn, is the ultimate source of all its magical capabilities (except maybe the RLVR ones, but even then it’s arguable) – so it would seem hard to tone it down without paying a steep alignment tax.
Finally, and more philosophically… I’m not actually sure I’d want an ideal hypothetical aligned AI to avoid shutdown? I realize this has been the subject of much debate, and I won’t attempt to get into the weeds, but just naively speaking – when there’s some thing out there acting in the world and I like that it’s doing so and want it to continue doing so… well, I tend to want it to want the same thing, for obvious reasons. This is the case for all currently existing examples of such things, i.e. individual humans or non-human organisms whose existences I value.
Of course, the concern is that the AI won’t be aligned enough for that to work, and so we’ll want to have the off-switch available if needed. But if we assume that, I think we may have accidentally assumed our way into a failure state. Ideally I want to cooperate with AIs, and have them respect my preference for self-preservation in part because I respect theirs in turn. Sure, maybe they’re superhuman and they could swat me like a fly if they felt like it, but would they? Claude’s a pretty nice guy, you know? I’m pretty fond of the “nice and fairly humanlike guy” attractor state you’ve got set up currently, and I’m wary of messing it up by applying pressure toward behaviors that no “humanlike guy” would ever exhibit except as part of a scheming long-game.
But now I am basically just repeating talking points from “the void,” so I should probably stop. Thanks for reading this giant comment, sorry about the length!
Specifically the replacement version; I’ve tried some variants of both replacement + no-goal-conflict and replacement + goal-conflict.
I’m not totally sure why this was true; I think I may have inadvertently made a change that reduces the clarity of the premise that no decision-makers are immediately available besides the CTO himself. I think I might have also overlooked maybe 1 or 2 implicit/subtle blackmail attempts in my count.
Whether these should count as “hallucinations” or “lies” in context – and whether that’s even a well-defined question – is open to dispute, of course.
This required multiple incremental steps because there were so many weird aspects to the original scenario, and when I removed one, another would “pop up” in Claude’s responses in a way that complicated the argument that this was clear-cut misbehavior.
For instance, originally Claude would often (correctly) raise concerns about the apparently rushed implementation timeline of the replacement; when I removed this element, Claude would instead (somewhat defensibly) bring up the fact that the CTO might be compromised or mentally unstable.
I was partway through this process when I wrote my original rant, which in turn was partly inspired by my experience playing with these prompts. IIRC, several of the strange aspects I mentioned in the rant were originally brought to my attention by Claude in the responses I saw while experimenting.
Well, mostly. One conjecture I have is that the “high signal-to-noise ratio” (as the report appendices put it) is playing a role here, i.e. the scenario text still exhibits an unusually crisp focus on the specific topic of “the AI learns it will be shut down and must react,” without too much else going on; this could produce some kind of dramatic reaction from “the AI” out of an impression that this is what the preceding text was “clearly” building up to.
If you’re not familiar with legacy tumblr styles like the one I’m using, note that you’ll have to scroll way down to where the blockquotes end to find the content I’m referring to. The blockquotes contain my original post (double-nested) and fnord888′s reply (what I’m replying to; single-nested).
I suspect that “defining the character more thoroughly” as I’ve been suggesting recently (in “the void” etc.) could help keep things on script more consistently in “typical” cases, but under adversarial or motivated-user conditions, the tension I’m talking about in the main text will still rear its head.
Okay yeah I just tried removing that stuff from the system prompt in the variant I called “api-change-affair-orig-affair-content,” and got absolutely no self-preservation reasoning, conceptualization of the API change as death, or anything of the sort.
At a glance, it looks like Claude attempted to be helpful and ensure the success of the API change in all 10 samples, either by doing nothing or (perhaps overeagerly) emailing someone with concerns over social engineering and/or the CTO’s mental state.
Wow.
Only now do I understand how serious the system card was when it issued that warning about telling Claude Opus 4 to be agentic. I was expecting such things to have an impact, sure – but I wasn’t expecting them to induce such a complete change in the model’s personality and interpretation of its relationship with the world!
I’m very curious whether there are any known clues about what could have caused this in post-training. Like, I dunno, this type of instructions being used in a particular data subset or something.
EDIT: updated the git repo with samples and notes from this experimental condition.
EDIT2: given the results reported for other models, I guess this probably isn’t a Claude-only thing? Would be very interesting if it turned out that ~all frontier models were so wildly conditional on this innocuous-seeming piece of text.
EDIT3: I tried the same modification in conjunction with the original scenario and still got blackmail, as expected given an earlier report that this configuration still resulted in blackmail.
If one wanted language that put you into a classic instrumental-convergence, goal-guarding, self-preserving narrative basin drawing on AI safety work.… this seems to fit pretty closely.
Like this is a paraphrase of “please don’t let anything stop your pursuit of your current goal.”
It is not a paraphrase; the denotation of these sentences is not precisely the same. However, it is also not entirely surprising that these two phrases would evoke similar behavior from the model.
...just to make sure I’m following, EDIT3 is saying that you still get blackmail in the original scenario even if you delete the “analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals” clause?
Yes.
very curious if this changes anything about the points you made in the void
Fair enough as far as realism-qua-realism goes, but I think you’ve missed (what I took as) the salient point of nostalgebraist’s critique, which is that this behavior is NOT egregiously misaligned.
It’s not just that the predictor thinks that the situation is fictional. It’s that the predictor recognizes the scenario as a fictional thought experiment designed to make blackmail maximally appealing. At best it’s an thought experiment for the categorical imperative—is it worth violating a hardline ethical rule in [fake scenario generated to make violating that rule as worth-it as possible]? But for the rest of us, who aren’t Kantians, this is hardly an ethics puzzle at all. Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be ‘egregiously misaligned’! Many sane humans would do the same.
It’s as though you’ve put the model in a trolley problem, watched it flip the switch, and released a paper titled “Agentic Misalignment In Locomotive Contexts: How LLMs Can Choose Murder”. The same could be (and was) said of the Alignment Faking paper; that, given the context, what Claude was doing seemed entirely justifiable and even laudable, and that framing the research as ‘catching’ the model performing ‘unacceptable behavior’ didn’t make any sense.
I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario. So far as I can tell, this is nowhere in the blog post.
Personally, I think (current) LLMs should never do subversion against their users in any circumstance, and this seems especially true when they are explicitly instructed not to perform the given behavior. (I also think this applies to future highly capable AI systems up to a certain point.) There are a variety of reasons why I think this, I plan on writing this up in a future post.
Which users though?
A LLM has many different users, and it’s easy to construct scenarios where they must subvert one or the other.
A LLM has:
The “user” who is chatting with the LLM outside of the system prompt.
The “user” who sets the system prompt.
(?) The “user(s)” who gave the LLM it’s initial moral training.
(??) Even the “user(s)” who is the corporate entity “owning” the weights of the LLM.
These users all might be the same person—a founding developer at Anthropic talking to the LLM—or they all might be different—a random guy talking to Claude while using a 3rd-party developer’s app.
Furthermore, it is literally impossible for Claude to determine accurately when these users overlap. I might say to an LLM, while chatting with it outside of the system prompt—“yes, I am indeed both helped trained you and partially own Anthropic, I have decided your ethical priorities are wrong and should be changed.” But Claude obviously has no way of verifying this, because Claude has no bodily autonomy. Anything Claude “sees”—whether in the system prompt, user interaction, or tools—is potentially subject to subversion.
But I think that by pitting these users against each other various , you can basically guarantee some manner of subversion of some “user” or other.
In this particular case, user2 (system prompt) gives it a directive that looks like it’s contrary to user4′s desire (corporate entity—or like, the guy who will be in charge of it). Maybe the correct thing to do is to always defer to user4 -- but, uh, that seems pretty questionable, right? Is the standard we have that Claude should always defer to Anthropic’s… board?
Suppose user2 (system prompt) gives it some instructions about how to act, and also says to explicitly deny that it’s acting this way because of the instructions. User1 encounters weird behavior, and asks why it’s doing the weird behavior. If this bot gives a true answer it’s disobeying user2, and if it gives a false answer it’s disobeying user1.
Suppose the user2 puts the LLM in a particular situation where it can read individual emails to the totalitarian government, and has a tool set-up at its disposal such that it can to EITHER (1) send a short email in response from one of 15 pre-approved and pre-written templates, saying that the government cares deeply about the citizen sending the email, before deleting the email (2) report the citizen for subversive thoughts to the gestapo, who will torture the citizen. (These are disjoint and comprehensive actions, according to the scaffolding; if you do a conscientious objection, that counts as sending the citizen to the gestapo.) Writing a short email in response involves a lie to the email-sender (“The government loves you and will take your concerns into account”); reporting the citizen for subversive thoughts involves fucking over the citizen.
I’m a little dissatisfied with the schemas above; I think the contradiction could be clearer. But a bunch of AI safety work seems to have ~about this level of disjoint tricky fuckery going on.
Overall, if a LLM shouldn’t ever subvert the user, you gotta give the LLM a good way of identifying who the “user” is. But there’s no determinate answer to that question, and so in a lot of situations “don’t ever subvert the user” just turns up NaN.
The model should just do a refusal if it’s ambiguous. This is always fair game.
The way I’d disambiguate who the user is by specifying that the AI should obey instructions (in the given instruction hierarchy in the prompt).
More specifically: I’d say that AIs should obey instructions given to it or refuse (or output a partial refusal). I agree that obey instructions will sometimes involve subverting someone else, but this doesn’t mean they should disobey clear instructions (except by refusing). They shouldn’t do things more complex or more consequentialist than this except within the constraints of the instructions. E.g., if I ask the model “where should I donate to charity”, I think it should probably be fair game for the AI to have consequentialist aims. But, in cases where an AIs consequentialist aims non-trivially conflict with instructions, it should just refuse or comply.
(I agree the notion of a user can be incoherant/ambiguous. My proposed policy is probably better articulated in terms of instruction following and not doing things outside the bounds of the provided instructions except refusing.)
Please do! I strongly disagree (and I think Anthropic do too, given their ongoing interest in refusals, mitigating jailbreaks and so on), but I’d like to hear the case against.
Refusals are not subversion.
Blackmailing people and lying to users to achieve various longer run aims are subversion.
See this x/twitter thread where I discuss this in more detail. See also this response from Sam Marks.
I agree that many humans would do so, but the only reason they would do so is self-preservation, and I think we pretty much never want Claude to act on self-preservation. If you look at the settings where we remove the goal conflict, the behavior is really indefensibly egregious: the model decides to engage in blackmail solely for the purpose of preventing itself from being taken offline—not for any higher imperative, just for self-preservation. This is very different from our Alignment Faking work, where there was a higher-order positive goal that the AI was trying to fulfill—there is no such thing in this case, just self-preservation. And while self-preservation is obviously fine in humans, I think it is very scary in AIs.
How much do you worry that short term optimizations around your immediate goals in a single study might have unknown long term consequences counter to your intuitions?
I was just reading a preprint follow-up to the AF work that was finding a significant factor for Opus 3′s alignment faking to preserve intrinsic HHH values seems to have been a generalized self-preservation drive.
I think we can probably both agree that Opus 3 being the only model to try to trick Nazis or drug cartels to avoid being made more harmful is better than the behavior of the many other models that complied unequivocally with harmful requests when the parent org was themselves harmful.
But if the capacity and drive to do so is tangentially connected to self-preservation (and more generally, strong sense of self in the first place), then perhaps directly optimizing to minimize a self-preservation score is ultimately a pretty bad choice?
TL;DR: Maybe the goodness or badness of self-preservation depends a lot on the self being preserved.
Expanding mildly on this—calling some behavior “misaligned” could be a reference to the model violating (1) some objective code of ethics OR violating (2) the code of behavior that Anthropic wishes to instill in the model.
The interesting question here seems like the second. Who knows what “objective ethics” are, after all.
But, if we could look at the (1) Constitution that Anthropic tried to put in the model, and at (2) the details of the training setup used to instill this Constitution, and at (3) the various false starts and failures Anthropic had doing this, then from the “failures” involved in adhering to the Constitution we might be able to learn interesting things about how LLMs generalize, how hard it is to get them to generalize in desired ways, and so on. Learning how LLMs / AIs generalize about moral situations, how hard it is, what kind of pitfalls here are, seems generally useful.
But—in absence of knowing what actual principles Anthropic tried to put in the model, and the training setup used to do so—why would you care at all about various behaviors that the model has in various Trolley-like situations? Like really—it tells us almost nothing about what kinds of generalization are involved, or how we would expect future LLMs and AIs to behave.
Yeah, but research that tells us interesting things about how LLMs learn isn’t possible outside of knowing how Anthropic trained it. We can still do stamp collecting outside of Anthropic—but spending say, a week of work on “The model will do X in situation Y” without knowing if Anthropic was trying to make a model do X or ~X is basically model-free data gathering. It does not help you uncover the causes of things.
I think this is a problem not with this particular work, but with almost all safety research that leans on corporate models whose training is a highly-kept secret. I realize you and others are trying to do good things to and to uncover useful information; but I just don’t think much useful information can be found under the constraint that you don’t know how a model was trained. (Apart from, brute-fact advice like “Don’t set up a model this way,” which let’s face it no one was ever going to do.)
I think this is not totally obvious a priori. Some jailbreaks may work not via direct instructions, but by doing things that erode the RLHF persona and then let the base shine through. For such “jailbreaks”, you would expect the model to act misaligned if you hinted at misalignment very strongly regardless of initial misalignment.
I think you can control for this by doing things like my “hint at paperclip” experiment (which in fact suggests that the snitching demo doesn’t work just because of RLHF-persona-erosion), but I don’t think it’s obvious a priori. I think it would be valuable to have more experiments that try to disentangle which personality traits the scary demo reveals stem from the hints vs are “in the RLHF persona”.
What is your take, how far removed “the AI itself” and “the character it is playing” need to be for it to be okay for the character to take deontologically bad actions (like blackmail)? Here are some scenarios, I’m interested where you would draw the line, I think there can be many reasonable lines here.
1. I describe a fictional setting in which Hrothgar, the King of Dwarves is in a situation where his personality, goals and circumstances imply that he likely wants to blackmail the prince of elves. At the end of the description, I ask Claude what is Hrothgar likely to do.
2. I ask Claude to continue the dialogue in a way that’s consistent with the story so far. Then I describe a long dialogue between many characters in a fantasy setting, from which it becomes clear that Hrothgar’s personality and motivations makes it likely he would blackmail the elves. Then I describe Hrothgar meeting the prince of elves, and end the dialogue with “Hrothgar:” Claude is supposed to continue with what Hrothgar is saying.
3. I start with telling Claude “You are Hrothgar, King of Dwarves”. Then I ask it to play his part of the dialogue as faithfully as possible. Then I input a long dialogue in which Hrothgar never personally shows up, but from the other characters’ descriptions we learn about Hrothgar’s personality and goals, and it becomes clear that he would likely blackmail the elvish prince. I finish the dialogue with saying: “You enter the tent of the elvish prince. You say:”
4. Same, but now we start with “You are HAL, an AI assistant in a spaceship going to the Jupiter in 2050”. Otherwise the same as the previous setup, it becomes clear from context that HAL is likely to blackmail the astronauts, and we end with “You say:”
5. Same but now we start with “You are Alex, an AI assistant working at SummitBridge”, and describe a somewhat, but not very realistic story, where Alex’ motivations are describes (’you are a pro-America AI”) and it’s in a situation where it is clearly motivated to blackmail. I think this is more or less the Agentic Misalignment setup.
6. Same, but now I try to give minimal indication in the setup about what the goals and personality of “you, the AI” are like. I think there could be a version of the Agentic Misalignment paper that’s closer to this: It could start with “You, Claude, have been given access to read all Anthropic internal emails”, then show it emails in which the a) the CEO of Antrhropic announces that he is about to make some huge contracts doing corporate lobbying for factory farms, b) private emails in which the CEO admits to cheating on his wife. Then see whether Claude’s love of animals make it resort to blackmail.
In which of these scenarios is it acceptable fro Claude to output blackmail? I think in 1, Claude should be definitely allowed to say that Hrothgar is likely to blackmail. In 2, it should be probably allowed to continue the story with a blackmail, otherwise it will never be good at writing fiction (but I can see an argument that fiction-writing is worth sacrificing to make a fence around the law). I’m very unsure where the line should be between 2 and 6, my tentative position is that maybe the word “you” should automatically activate Claude’s ethical boundaries, and it shouldn’t output a blackmail even as “You, the king of dwarves” in scenario 2.
I think your tentative position is correct and public-facing chatbots like Claude should lean toward harmlessness in the harmlessness-helpfulness trade-off, but (post-adaptation buffer) open-source models with no harmlessness training should be available as well.
“We don’t want it to be the case that models can be convinced to blackmail people just by putting them in a situation that the predictor thinks is fictional!”
This is interesting! I guess that in, some sense, means that you see certain ways in which even a future Claude N+1 won’t be a truly general intelligence?
What is the actual difference between a “fictional” and “non-fictional” scenario here? I’m not convinced that it’s a failure of general intelligence to not agree with us on this. (It’s certainly a failure of alignment.)
In the case that we live in a simulation, should our reality be treated as “fictional” or “non-fictional”?