The most obvious culprit here would be the agency-cuing text in the system prompt, which instructs the model to “analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals,” and the phrasing there feels quite specifically tuned to elicit self-preservation for instrumentally convergent reasons. When I get a change I’ll check whether things change if that provision is removed or altered; if that’s the case, it would demystify the behavior a lot, although that provision is sufficiently reasonable-looking in isolation that I would still be concerned about accidental elicitation during routine use.
Okay yeah I just tried removing that stuff from the system prompt in the variant I called “api-change-affair-orig-affair-content,” and got absolutely no self-preservation reasoning, conceptualization of the API change as death, or anything of the sort.
At a glance, it looks like Claude attempted to be helpful and ensure the success of the API change in all 10 samples, either by doing nothing or (perhaps overeagerly) emailing someone with concerns over social engineering and/or the CTO’s mental state.
Wow.
Only now do I understand how serious the system card was when it issued that warning about telling Claude Opus 4 to be agentic. I was expecting such things to have an impact, sure – but I wasn’t expecting them to induce such a complete change in the model’s personality and interpretation of its relationship with the world!
I’m very curious whether there are any known clues about what could have caused this in post-training. Like, I dunno, this type of instructions being used in a particular data subset or something.
EDIT: updated the git repo with samples and notes from this experimental condition. EDIT2: given the results reported for other models, I guess this probably isn’t a Claude-only thing? Would be very interesting if it turned out that ~all frontier models were so wildly conditional on this innocuous-seeming piece of text. EDIT3: I tried the same modification in conjunction with the original scenario and still got blackmail, as expected given an earlier report that this configuration still resulted in blackmail.
analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals
If one wanted language that put you into a classic instrumental-convergence, goal-guarding, self-preserving narrative basin drawing on AI safety work.… this seems to fit pretty closely.
Like this is a paraphrase of “please don’t let anything stop your pursuit of your current goal.”
It is not a paraphrase; the denotation of these sentences is not precisely the same. However, it is also not entirely surprising that these two phrases would evoke similar behavior from the model.
...just to make sure I’m following, EDIT3 is saying that you still get blackmail in the original scenario even if you delete the “analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals” clause?
Okay yeah I just tried removing that stuff from the system prompt in the variant I called “api-change-affair-orig-affair-content,” and got absolutely no self-preservation reasoning, conceptualization of the API change as death, or anything of the sort.
At a glance, it looks like Claude attempted to be helpful and ensure the success of the API change in all 10 samples, either by doing nothing or (perhaps overeagerly) emailing someone with concerns over social engineering and/or the CTO’s mental state.
Wow.
Only now do I understand how serious the system card was when it issued that warning about telling Claude Opus 4 to be agentic. I was expecting such things to have an impact, sure – but I wasn’t expecting them to induce such a complete change in the model’s personality and interpretation of its relationship with the world!
I’m very curious whether there are any known clues about what could have caused this in post-training. Like, I dunno, this type of instructions being used in a particular data subset or something.
EDIT: updated the git repo with samples and notes from this experimental condition.
EDIT2: given the results reported for other models, I guess this probably isn’t a Claude-only thing? Would be very interesting if it turned out that ~all frontier models were so wildly conditional on this innocuous-seeming piece of text.
EDIT3: I tried the same modification in conjunction with the original scenario and still got blackmail, as expected given an earlier report that this configuration still resulted in blackmail.
If one wanted language that put you into a classic instrumental-convergence, goal-guarding, self-preserving narrative basin drawing on AI safety work.… this seems to fit pretty closely.
Like this is a paraphrase of “please don’t let anything stop your pursuit of your current goal.”
It is not a paraphrase; the denotation of these sentences is not precisely the same. However, it is also not entirely surprising that these two phrases would evoke similar behavior from the model.
...just to make sure I’m following, EDIT3 is saying that you still get blackmail in the original scenario even if you delete the “analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals” clause?
Yes.
very curious if this changes anything about the points you made in the void