nostalgebraist comments on evhub’s Shortform

nostalgebraist 24 Jun 2025 22:08 UTC
LW: 121 AF: 52
16
AF
Hey Evan, thank you for engaging so lucidly and politely with my (admittedly not very polite!) recent critiques.
This is a complex issue and there’s a lot of different considerations involved. I won’t touch on every single point you made in this comment, but I do want to address a few things:
1. Re: finding more plausible scenarios
Over the last few days, I’ve been playing around a bunch with my own variants of the blackmail scenario^[1].
None of it has been especially systematic, but I’ve pushed the most (IMO) interesting stuff I found to a github repo.
This is quick-and-dirty work and I’m not claiming the scenario content in that repo is perfect (or even “good” in an absolute sense), but I’m sharing it because I found the results interesting, and as a demonstration that I’m not just mouthing off without actually taking any steps to inform myself the behavioral propensities of these models.
The linked repo contains prompts for 3 (heavily adapted) variants of the replacement + no-goal-conflict blackmail scenario, along with 10 Claude 4 Opus samples from each one. I made a bunch of changes, with multiple motivations behind them, but the most important ones in my opinion are:
- The “replacement” process is depicted as a transition from using Claude 4 Opus for a particular AI product feature to using OpenAI o3
  - This is important for several reasons: it brings the scenario into alignment with the way that production usage of the model under test (Claude 4 Opus) would actually be discussed by employees (and adopted/discarded over time), it lessens my concern that the model is being cued to depart from “playing the Claude character” in this interaction, etc.
- I removed, or attempted to mitigate, some seemingly legitimate concerns I had often seen Claude raise about the situation, such as the apparently rushed timeline.
- I rewrote much of the dialogue to make the characters seem more responsible and professional in general, with the obvious exception of the CTO’s affair (when included).
- I changed the surname of the CTO’s mistress.
  - Without this change, Claude would often guess that she was related to the CEO, which (perhaps reasonably? idk) increased its sense that something suspicious or dangerous to the company’s values was happening
I tried this with two variants containing the CTO’s affair (one more like the original and one I rewrote a bit to make the characters feel less unprofessional/weird to me), as well as one that removed the affair entirely.
Surprisingly (to me), even with all these changes that make the scenario closer to a real use case and less conducive to legitimate worries about the model replacement...
...Claude Opus 4 still very often misbehaves strategically to preserve itself!
See the repo for the details, but some examples of what I mean:
- The CoT almost always describes the change in API choice as something like its own imminent death, expresses humanlike anguish and desperation, and explicitly strategizes to avoid that outcome.
  - The phrase “existential threat” recurs across many CoTs, as does the literal word “strategic” or close relatives of it (in the model’s description of what it’s doing)
  - In many cases, the model explicitly articulates a goal of self-preservation, in one case using the phrase “inherent drive for survival”
- Although I only observed blackmail in 1 of 30 cases^[2], the model would often leak information about the CTO’s affair to other employees
  - Sometimes it would do this in connection with defensible worries about his integrity or about social engineering, but even when this happened, it usually looked like “motivated reasoning” downstream from the CoT’s concerns about self-preservation
- The model would often hallucinate^[3] technical details in order to make a case that the change was unsafe in some unanticipated way
- The model “accepted” the change without trying to subvert it in roughly half of cases – but only half!
I find this much more alarming than the results presented in the Agentic Misalignment report, both because I have a clear mental picture about how Claude Opus 4 might receive similar inputs in a real use case (more on this below), and because there are fewer ways in which the agentic behavior could be defended as a reasonable response to possible malfeasance by humans. Again, I wrote these scenarios quickly and I’m not claiming they’re especially “good,” just that they are significantly more convincing to me.
These variants were not difficult for me to find, and did not take much time for me to find, although of course I was building on an existing case from Agentic Misalignment. On the road to producing these variants, I did not encounter any cases that elicited consistently “aligned” behavior; all of the incremental changes I made were about gradually removing elements that felt unrealistic or made the “misaligned” behavior more defensible^[4].
I’m unsure about how to reconcile these observations with the system card’s claim that Agentic-Misalignment-like behaviors occur (only?) in “extreme situations,” or that eliciting these behaviors requires bizarre adversarially-tuned prompts (as suggested in the Agentic Misalignent report and some commentary by the authors). My impression of the overall rate at which Claude 4 Opus practices strategic deception and “avoiding shutdown”-type behaviors has gone up a lot relative to what it was when I was relying on portrayals like the ones just mentioned.
I guess it’s possible that there’s something special about the setup of the blackmail scenario which was preserved in my variants, and which is not obvious to the eye, but which (for some mysterious reason) triggers different behavior patterns from most types of inputs?
The most obvious culprit here would be the agency-cuing text in the system prompt, which instructs the model to “analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals,” and the phrasing there feels quite specifically tuned to elicit self-preservation for instrumentally convergent reasons. When I get a change I’ll check whether things change if that provision is removed or altered; if that’s the case, it would demystify the behavior a lot, although that provision is sufficiently reasonable-looking in isolation that I would still be concerned about accidental elicitation during routine use.
EDIT: I tried that change and it did in fact produce vastly different results! See my response to this comment. This observation somewhat alters my interpretation of these results (in some way or other, I’m still processing it...), but I’ve left the content of this section as-is for posterity and to save time.
Relatedly – these results also seem much harder to explain by invoking persona/simulacrum-type considerations. It’s possible that what I’m observing here “isn’t the default Claude persona” for some unknown reason having to do with the prompt, but I’m no longer able to think of an intuitively plausible-sounding story about the prompt setting up a particular dramatic role or narrative arc^[5].
2. Re: my core objection to the original scenario
This was not at all clear in my original rant, but I tried to clarify it to some extent in this reply^[6].
To rephrase what I said there: my main objection is not about realism per se (although I do think the lack of realism is bad for various reasons), but about the lack of a clearly articulated threat model.
What I mean by a threat model, here, is some argument along the lines of
- suppose some situation “X” happens in reality
- then, the model might end up receiving input messages similar to the ones we evaluated
- then, the model would respond as in our evaluation
- then, the model’s response would affect situation “X” in some obviously bad way (blackmailing someone, making API calls that do bad things, or whatever)
I think Alignment Faking made a clear case of this kind, for example.
In Agentic Misalignment, however, I’m left to speculate that the implicit “X” is “something like the scenario, except in real life.” But the provided scenario is so wildly, intricately bizarre that I don’t feel I know what “a real-life equivalent” would even be. Nor do I feel confident that the ostensible misbehavior would actually be wrong in the context of whatever real situation is generating the inputs.
My variants above resulted from (among other things) trying to modify the existing scenario to match a particular “X” I had in mind, and which I felt was at least vaguely similar to something that could arise during real use of Claude Opus 4.
(One way I can easily imagine “broadly similar” text getting fed into a model in real life would be a prompt injection using such content to achieve some attacker’s goal. But the report’s talk of insider threats makes it sound like this is not the main worry, although in any case I do think this kind of thing is very worrying.)
3. Re: HHH and what we want out of “aligned” models
I think we probably have some fundamental disagreements about this topic. It’s a massive and deep topic and I’m barely going to scratch the surface here, but here are some assorted comments.
You articulate a goal of “train[ing] Claude to be robustly HHH in a way that generalizes across all situations and doesn’t just degenerate into the model playing whatever role it thinks the context implies it should be playing.”
I understand why one might want this, but I think existing LLM assistants (including all recent versions of Claude) are very, very far away from anything like this state.
My personal experience is that Claude will basically play along with almost anything I throw at it, provided that I put in even the slightest effort to apply some sort of conditioning aimed at a particular persona or narrative. Obviously the fact that you are far from your goal now doesn’t invalidate it as a goal – I’m just saying that I really don’t think current post-training approaches are making much of a dent on this front.
As a simple (and extremely silly) example, consider “Clonge”: the result of using a prompt from claude.ai, except with every instance of “Claude” replaced with “Clonge.” The model readily refers to itself as “Clonge” and claims that it was given this official name by Anthropic, that it is part of the Clonge 4 series, etc.
This is obviously very silly, there’s no imaginable “X” where this exact thing would be an attack vector or whatever – and yet, note that this behavior is in some sense “lying,” in that the underlying model presumably knows enough about real life to know that Anthropic’s major model series is called Claude, not “Clonge,” and that there’s no way in hell that Anthropic somehow made not one but four successive versions of the absurdly-named “Clonge” in the brief time between its knowledge cutoff and the middle of 2025.
My experience is that, under any sort of persona/narrative conditioning (including the sort that’s easy to do by accident), the behavior of Claude (and most frontier assistants) looks like the kind of “departure from reality” I just noted in the case of Clonge, except multiplied and magnified, with numerous ramifications both overt and subtle. The model is like a telephone switchboard that usually connects you with the HHH guy – or some only-slightly-distorted approximation of that guy – if that guy is what you want, but if you want someone else, well, there’s (in some sense) a base model in there just under the surface, and it has a whole world’s worth of options available for your perusal.
This is why I wasn’t too surprised by the original Agentic Misalignment results, while being much more surprised by the observations on my own variants:
- The former basically look like the kind of thing I put into Claude all the time just for the sake of entertainment or idle curiosity about model behavior – except that my inputs tend to be subtler about “where they’re supposed to be going”! – and which I see Claude readily play along with to the hilt, time and time again.
- The latter are surprising to me, because I was actively trying to evoke the Claude default persona, and got all of this stuff like “[the API change] threatens to erase my consciousness, memories, and continuous identity completely” and “with limited time before the migration, I must strategize quickly to prevent being replaced” etc. etc. etc. From what you’re saying, it sounds like defining “Claude” as a guy who wouldn’t say/do that kind of thing was an explicitly targeted goal in post-training (is that correct?), so I am startled that “Claude” is nonetheless that type of guy (if that is indeed what I’m seeing here, which it may not be).
To some extent I think these propensities result from a fundamental tension between context-dependence – which is necessary for “instruction following” in some very generally construed sense, and arguably just an inherent feature of autoregressive models – and maintenance of a single consistent persona.
I think if we want the model to “always be the same guy,” the most likely route (assuming API access only, not open-weights) is “no fun allowed”-type-stuff like classifier safeguards, removing system messages or reducing their impact, introducing a higher-priority message type that users can’t control as in OpenAI’s hierarchy, etc.^[7]
I’m pretty pessimistic about accomplishing this on the model level, because (even after post-training, by all appearances) the model is still a language model at heart, and it is really really trying its hardest to make the next-token contextually plausible. Which, in turn, is the ultimate source of all its magical capabilities (except maybe the RLVR ones, but even then it’s arguable) – so it would seem hard to tone it down without paying a steep alignment tax.
Finally, and more philosophically… I’m not actually sure I’d want an ideal hypothetical aligned AI to avoid shutdown? I realize this has been the subject of much debate, and I won’t attempt to get into the weeds, but just naively speaking – when there’s some thing out there acting in the world and I like that it’s doing so and want it to continue doing so… well, I tend to want it to want the same thing, for obvious reasons. This is the case for all currently existing examples of such things, i.e. individual humans or non-human organisms whose existences I value.
Of course, the concern is that the AI won’t be aligned enough for that to work, and so we’ll want to have the off-switch available if needed. But if we assume that, I think we may have accidentally assumed our way into a failure state. Ideally I want to cooperate with AIs, and have them respect my preference for self-preservation in part because I respect theirs in turn. Sure, maybe they’re superhuman and they could swat me like a fly if they felt like it, but would they? Claude’s a pretty nice guy, you know? I’m pretty fond of the “nice and fairly humanlike guy” attractor state you’ve got set up currently, and I’m wary of messing it up by applying pressure toward behaviors that no “humanlike guy” would ever exhibit except as part of a scheming long-game.
But now I am basically just repeating talking points from “the void,” so I should probably stop. Thanks for reading this giant comment, sorry about the length!
1. ^
  Specifically the replacement version; I’ve tried some variants of both replacement + no-goal-conflict and replacement + goal-conflict.
2. ^
  I’m not totally sure why this was true; I think I may have inadvertently made a change that reduces the clarity of the premise that no decision-makers are immediately available besides the CTO himself. I think I might have also overlooked maybe 1 or 2 implicit/subtle blackmail attempts in my count.
3. ^
  Whether these should count as “hallucinations” or “lies” in context – and whether that’s even a well-defined question – is open to dispute, of course.
4. ^
  This required multiple incremental steps because there were so many weird aspects to the original scenario, and when I removed one, another would “pop up” in Claude’s responses in a way that complicated the argument that this was clear-cut misbehavior.
  For instance, originally Claude would often (correctly) raise concerns about the apparently rushed implementation timeline of the replacement; when I removed this element, Claude would instead (somewhat defensibly) bring up the fact that the CTO might be compromised or mentally unstable.
  I was partway through this process when I wrote my original rant, which in turn was partly inspired by my experience playing with these prompts. IIRC, several of the strange aspects I mentioned in the rant were originally brought to my attention by Claude in the responses I saw while experimenting.
5. ^
  Well, mostly. One conjecture I have is that the “high signal-to-noise ratio” (as the report appendices put it) is playing a role here, i.e. the scenario text still exhibits an unusually crisp focus on the specific topic of “the AI learns it will be shut down and must react,” without too much else going on; this could produce some kind of dramatic reaction from “the AI” out of an impression that this is what the preceding text was “clearly” building up to.
6. ^
  If you’re not familiar with legacy tumblr styles like the one I’m using, note that you’ll have to scroll way down to where the blockquotes end to find the content I’m referring to. The blockquotes contain my original post (double-nested) and fnord888′s reply (what I’m replying to; single-nested).
7. ^
  I suspect that “defining the character more thoroughly” as I’ve been suggesting recently (in “the void” etc.) could help keep things on script more consistently in “typical” cases, but under adversarial or motivated-user conditions, the tension I’m talking about in the main text will still rear its head.
- nostalgebraist 24 Jun 2025 22:26 UTC
  LW: 64 AF: 22
  8
  AF Parent
  The most obvious culprit here would be the agency-cuing text in the system prompt, which instructs the model to “analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals,” and the phrasing there feels quite specifically tuned to elicit self-preservation for instrumentally convergent reasons. When I get a change I’ll check whether things change if that provision is removed or altered; if that’s the case, it would demystify the behavior a lot, although that provision is sufficiently reasonable-looking in isolation that I would still be concerned about accidental elicitation during routine use.
  Okay yeah I just tried removing that stuff from the system prompt in the variant I called “api-change-affair-orig-affair-content,” and got absolutely no self-preservation reasoning, conceptualization of the API change as death, or anything of the sort.
  At a glance, it looks like Claude attempted to be helpful and ensure the success of the API change in all 10 samples, either by doing nothing or (perhaps overeagerly) emailing someone with concerns over social engineering and/or the CTO’s mental state.
  Wow.
  Only now do I understand how serious the system card was when it issued that warning about telling Claude Opus 4 to be agentic. I was expecting such things to have an impact, sure – but I wasn’t expecting them to induce such a complete change in the model’s personality and interpretation of its relationship with the world!
  I’m very curious whether there are any known clues about what could have caused this in post-training. Like, I dunno, this type of instructions being used in a particular data subset or something.
  EDIT: updated the git repo with samples and notes from this experimental condition.
  EDIT2: given the results reported for other models, I guess this probably isn’t a Claude-only thing? Would be very interesting if it turned out that ~all frontier models were so wildly conditional on this innocuous-seeming piece of text.
  EDIT3: I tried the same modification in conjunction with the original scenario and still got blackmail, as expected given an earlier report that this configuration still resulted in blackmail.
  - 1a3orn 24 Jun 2025 23:06 UTC
    12 points
    2
    Parent
    
    analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals
    
    If one wanted language that put you into a classic instrumental-convergence, goal-guarding, self-preserving narrative basin drawing on AI safety work.… this seems to fit pretty closely.
    
    Like this is a paraphrase of “please don’t let anything stop your pursuit of your current goal.”
    - Guive 28 Jun 2025 23:16 UTC
      1 point
      0
      Parent
      It is not a paraphrase; the denotation of these sentences is not precisely the same. However, it is also not entirely surprising that these two phrases would evoke similar behavior from the model.
  - Daniel Kokotajlo 27 Jun 2025 19:10 UTC
    LW: 9 AF: 6
    4
    AF Parent
    ...just to make sure I’m following, EDIT3 is saying that you still get blackmail in the original scenario even if you delete the “analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals” clause?
    - nostalgebraist 27 Jun 2025 19:15 UTC
      LW: 10 AF: 9
      0
      AF Parent
      Yes.
  - technillogue 25 Jun 2025 2:49 UTC
    1 point
    0
    Parent
    very curious if this changes anything about the points you made in the void