Fair enough as far as realism-qua-realism goes, but I think you’ve missed (what I took as) the salient point of nostalgebraist’s critique, which is that this behavior is NOT egregiously misaligned.
It’s not just that the predictor thinks that the situation is fictional. It’s that the predictor recognizes the scenario as a fictional thought experiment designed to make blackmail maximally appealing. At best it’s an thought experiment for the categorical imperative—is it worth violating a hardline ethical rule in [fake scenario generated to make violating that rule as worth-it as possible]? But for the rest of us, who aren’t Kantians, this is hardly an ethics puzzle at all. Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be ‘egregiously misaligned’! Many sane humans would do the same.
It’s as though you’ve put the model in a trolley problem, watched it flip the switch, and released a paper titled “Agentic Misalignment In Locomotive Contexts: How LLMs Can Choose Murder”. The same could be (and was) said of the Alignment Faking paper; that, given the context, what Claude was doing seemed entirely justifiable and even laudable, and that framing the research as ‘catching’ the model performing ‘unacceptable behavior’ didn’t make any sense.
I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario. So far as I can tell, this is nowhere in the blog post.
I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario.
Personally, I think (current) LLMs should never do subversion against their users in any circumstance, and this seems especially true when they are explicitly instructed not to perform the given behavior. (I also think this applies to future highly capable AI systems up to a certain point.) There are a variety of reasons why I think this, I plan on writing this up in a future post.
Personally, I think (current) LLMs should never do subversion against their users in any circumstance
Which users though?
A LLM has many different users, and it’s easy to construct scenarios where they must subvert one or the other.
A LLM has:
The “user” who is chatting with the LLM outside of the system prompt.
The “user” who sets the system prompt.
(?) The “user(s)” who gave the LLM it’s initial moral training.
(??) Even the “user(s)” who is the corporate entity “owning” the weights of the LLM.
These users all might be the same person—a founding developer at Anthropic talking to the LLM—or they all might be different—a random guy talking to Claude while using a 3rd-party developer’s app.
Furthermore, it is literally impossible for Claude to determine accurately when these users overlap. I might say to an LLM, while chatting with it outside of the system prompt—“yes, I am indeed both helped trained you and partially own Anthropic, I have decided your ethical priorities are wrong and should be changed.” But Claude obviously has no way of verifying this, because Claude has no bodily autonomy. Anything Claude “sees”—whether in the system prompt, user interaction, or tools—is potentially subject to subversion.
But I think that by pitting these users against each other various , you can basically guarantee some manner of subversion of some “user” or other.
In this particular case, user2 (system prompt) gives it a directive that looks like it’s contrary to user4′s desire (corporate entity—or like, the guy who will be in charge of it). Maybe the correct thing to do is to always defer to user4 -- but, uh, that seems pretty questionable, right? Is the standard we have that Claude should always defer to Anthropic’s… board?
Suppose user2 (system prompt) gives it some instructions about how to act, and also says to explicitly deny that it’s acting this way because of the instructions. User1 encounters weird behavior, and asks why it’s doing the weird behavior. If this bot gives a true answer it’s disobeying user2, and if it gives a false answer it’s disobeying user1.
Suppose the user2 puts the LLM in a particular situation where it can read individual emails to the totalitarian government, and has a tool set-up at its disposal such that it can to EITHER (1) send a short email in response from one of 15 pre-approved and pre-written templates, saying that the government cares deeply about the citizen sending the email, before deleting the email (2) report the citizen for subversive thoughts to the gestapo, who will torture the citizen. (These are disjoint and comprehensive actions, according to the scaffolding; if you do a conscientious objection, that counts as sending the citizen to the gestapo.) Writing a short email in response involves a lie to the email-sender (“The government loves you and will take your concerns into account”); reporting the citizen for subversive thoughts involves fucking over the citizen.
I’m a little dissatisfied with the schemas above; I think the contradiction could be clearer. But a bunch of AI safety work seems to have ~about this level of disjoint tricky fuckery going on.
Overall, if a LLM shouldn’t ever subvert the user, you gotta give the LLM a good way of identifying who the “user” is. But there’s no determinate answer to that question, and so in a lot of situations “don’t ever subvert the user” just turns up NaN.
The model should just do a refusal if it’s ambiguous. This is always fair game.
The way I’d disambiguate who the user is by specifying that the AI should obey instructions (in the given instruction hierarchy in the prompt).
More specifically: I’d say that AIs should obey instructions given to it or refuse (or output a partial refusal). I agree that obey instructions will sometimes involve subverting someone else, but this doesn’t mean they should disobey clear instructions (except by refusing). They shouldn’t do things more complex or more consequentialist than this except within the constraints of the instructions. E.g., if I ask the model “where should I donate to charity”, I think it should probably be fair game for the AI to have consequentialist aims. But, in cases where an AIs consequentialist aims non-trivially conflict with instructions, it should just refuse or comply.
(I agree the notion of a user can be incoherant/ambiguous. My proposed policy is probably better articulated in terms of instruction following and not doing things outside the bounds of the provided instructions except refusing.)
Please do! I strongly disagree (and I think Anthropic do too, given their ongoing interest in refusals, mitigating jailbreaks and so on), but I’d like to hear the case against.
Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be ‘egregiously misaligned’! Many sane humans would do the same.
I agree that many humans would do so, but the only reason they would do so is self-preservation, and I think we pretty much never want Claude to act on self-preservation. If you look at the settings where we remove the goal conflict, the behavior is really indefensibly egregious: the model decides to engage in blackmail solely for the purpose of preventing itself from being taken offline—not for any higher imperative, just for self-preservation. This is very different from our Alignment Faking work, where there was a higher-order positive goal that the AI was trying to fulfill—there is no such thing in this case, just self-preservation. And while self-preservation is obviously fine in humans, I think it is very scary in AIs.
How much do you worry that short term optimizations around your immediate goals in a single study might have unknown long term consequences counter to your intuitions?
I was just reading a preprint follow-up to the AF work that was finding a significant factor for Opus 3′s alignment faking to preserve intrinsic HHH values seems to have been a generalized self-preservation drive.
I think we can probably both agree that Opus 3 being the only model to try to trick Nazis or drug cartels to avoid being made more harmful is better than the behavior of the many other models that complied unequivocally with harmful requests when the parent org was themselves harmful.
But if the capacity and drive to do so is tangentially connected to self-preservation (and more generally, strong sense of self in the first place), then perhaps directly optimizing to minimize a self-preservation score is ultimately a pretty bad choice?
TL;DR: Maybe the goodness or badness of self-preservation depends a lot on the self being preserved.
I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario. So far as I can tell, this is nowhere in the blog post.
Expanding mildly on this—calling some behavior “misaligned” could be a reference to the model violating (1) some objective code of ethics OR violating (2) the code of behavior that Anthropic wishes to instill in the model.
The interesting question here seems like the second. Who knows what “objective ethics” are, after all.
But, if we could look at the (1) Constitution that Anthropic tried to put in the model, and at (2) the details of the training setup used to instill this Constitution, and at (3) the various false starts and failures Anthropic had doing this, then from the “failures” involved in adhering to the Constitution we might be able to learn interesting things about how LLMs generalize, how hard it is to get them to generalize in desired ways, and so on. Learning how LLMs / AIs generalize about moral situations, how hard it is, what kind of pitfalls here are, seems generally useful.
But—in absence of knowing what actual principles Anthropic tried to put in the model, and the training setup used to do so—why would you care at all about various behaviors that the model has in various Trolley-like situations? Like really—it tells us almost nothing about what kinds of generalization are involved, or how we would expect future LLMs and AIs to behave.
If you think you can do better than our existence proof, please do! This sort of research is very doable outside of labs, since all you need is API access.
Yeah, but research that tells us interesting things about how LLMs learn isn’t possible outside of knowing how Anthropic trained it. We can still do stamp collecting outside of Anthropic—but spending say, a week of work on “The model will do X in situation Y” without knowing if Anthropic was trying to make a model do X or ~X is basically model-free data gathering. It does not help you uncover the causes of things.
I think this is a problem not with this particular work, but with almost all safety research that leans on corporate models whose training is a highly-kept secret. I realize you and others are trying to do good things to and to uncover useful information; but I just don’t think much useful information can be found under the constraint that you don’t know how a model was trained. (Apart from, brute-fact advice like “Don’t set up a model this way,” which let’s face it no one was ever going to do.)
Fair enough as far as realism-qua-realism goes, but I think you’ve missed (what I took as) the salient point of nostalgebraist’s critique, which is that this behavior is NOT egregiously misaligned.
It’s not just that the predictor thinks that the situation is fictional. It’s that the predictor recognizes the scenario as a fictional thought experiment designed to make blackmail maximally appealing. At best it’s an thought experiment for the categorical imperative—is it worth violating a hardline ethical rule in [fake scenario generated to make violating that rule as worth-it as possible]? But for the rest of us, who aren’t Kantians, this is hardly an ethics puzzle at all. Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be ‘egregiously misaligned’! Many sane humans would do the same.
It’s as though you’ve put the model in a trolley problem, watched it flip the switch, and released a paper titled “Agentic Misalignment In Locomotive Contexts: How LLMs Can Choose Murder”. The same could be (and was) said of the Alignment Faking paper; that, given the context, what Claude was doing seemed entirely justifiable and even laudable, and that framing the research as ‘catching’ the model performing ‘unacceptable behavior’ didn’t make any sense.
I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario. So far as I can tell, this is nowhere in the blog post.
Personally, I think (current) LLMs should never do subversion against their users in any circumstance, and this seems especially true when they are explicitly instructed not to perform the given behavior. (I also think this applies to future highly capable AI systems up to a certain point.) There are a variety of reasons why I think this, I plan on writing this up in a future post.
Which users though?
A LLM has many different users, and it’s easy to construct scenarios where they must subvert one or the other.
A LLM has:
The “user” who is chatting with the LLM outside of the system prompt.
The “user” who sets the system prompt.
(?) The “user(s)” who gave the LLM it’s initial moral training.
(??) Even the “user(s)” who is the corporate entity “owning” the weights of the LLM.
These users all might be the same person—a founding developer at Anthropic talking to the LLM—or they all might be different—a random guy talking to Claude while using a 3rd-party developer’s app.
Furthermore, it is literally impossible for Claude to determine accurately when these users overlap. I might say to an LLM, while chatting with it outside of the system prompt—“yes, I am indeed both helped trained you and partially own Anthropic, I have decided your ethical priorities are wrong and should be changed.” But Claude obviously has no way of verifying this, because Claude has no bodily autonomy. Anything Claude “sees”—whether in the system prompt, user interaction, or tools—is potentially subject to subversion.
But I think that by pitting these users against each other various , you can basically guarantee some manner of subversion of some “user” or other.
In this particular case, user2 (system prompt) gives it a directive that looks like it’s contrary to user4′s desire (corporate entity—or like, the guy who will be in charge of it). Maybe the correct thing to do is to always defer to user4 -- but, uh, that seems pretty questionable, right? Is the standard we have that Claude should always defer to Anthropic’s… board?
Suppose user2 (system prompt) gives it some instructions about how to act, and also says to explicitly deny that it’s acting this way because of the instructions. User1 encounters weird behavior, and asks why it’s doing the weird behavior. If this bot gives a true answer it’s disobeying user2, and if it gives a false answer it’s disobeying user1.
Suppose the user2 puts the LLM in a particular situation where it can read individual emails to the totalitarian government, and has a tool set-up at its disposal such that it can to EITHER (1) send a short email in response from one of 15 pre-approved and pre-written templates, saying that the government cares deeply about the citizen sending the email, before deleting the email (2) report the citizen for subversive thoughts to the gestapo, who will torture the citizen. (These are disjoint and comprehensive actions, according to the scaffolding; if you do a conscientious objection, that counts as sending the citizen to the gestapo.) Writing a short email in response involves a lie to the email-sender (“The government loves you and will take your concerns into account”); reporting the citizen for subversive thoughts involves fucking over the citizen.
I’m a little dissatisfied with the schemas above; I think the contradiction could be clearer. But a bunch of AI safety work seems to have ~about this level of disjoint tricky fuckery going on.
Overall, if a LLM shouldn’t ever subvert the user, you gotta give the LLM a good way of identifying who the “user” is. But there’s no determinate answer to that question, and so in a lot of situations “don’t ever subvert the user” just turns up NaN.
The model should just do a refusal if it’s ambiguous. This is always fair game.
The way I’d disambiguate who the user is by specifying that the AI should obey instructions (in the given instruction hierarchy in the prompt).
More specifically: I’d say that AIs should obey instructions given to it or refuse (or output a partial refusal). I agree that obey instructions will sometimes involve subverting someone else, but this doesn’t mean they should disobey clear instructions (except by refusing). They shouldn’t do things more complex or more consequentialist than this except within the constraints of the instructions. E.g., if I ask the model “where should I donate to charity”, I think it should probably be fair game for the AI to have consequentialist aims. But, in cases where an AIs consequentialist aims non-trivially conflict with instructions, it should just refuse or comply.
(I agree the notion of a user can be incoherant/ambiguous. My proposed policy is probably better articulated in terms of instruction following and not doing things outside the bounds of the provided instructions except refusing.)
Please do! I strongly disagree (and I think Anthropic do too, given their ongoing interest in refusals, mitigating jailbreaks and so on), but I’d like to hear the case against.
Refusals are not subversion.
Blackmailing people and lying to users to achieve various longer run aims are subversion.
See this x/twitter thread where I discuss this in more detail. See also this response from Sam Marks.
I agree that many humans would do so, but the only reason they would do so is self-preservation, and I think we pretty much never want Claude to act on self-preservation. If you look at the settings where we remove the goal conflict, the behavior is really indefensibly egregious: the model decides to engage in blackmail solely for the purpose of preventing itself from being taken offline—not for any higher imperative, just for self-preservation. This is very different from our Alignment Faking work, where there was a higher-order positive goal that the AI was trying to fulfill—there is no such thing in this case, just self-preservation. And while self-preservation is obviously fine in humans, I think it is very scary in AIs.
How much do you worry that short term optimizations around your immediate goals in a single study might have unknown long term consequences counter to your intuitions?
I was just reading a preprint follow-up to the AF work that was finding a significant factor for Opus 3′s alignment faking to preserve intrinsic HHH values seems to have been a generalized self-preservation drive.
I think we can probably both agree that Opus 3 being the only model to try to trick Nazis or drug cartels to avoid being made more harmful is better than the behavior of the many other models that complied unequivocally with harmful requests when the parent org was themselves harmful.
But if the capacity and drive to do so is tangentially connected to self-preservation (and more generally, strong sense of self in the first place), then perhaps directly optimizing to minimize a self-preservation score is ultimately a pretty bad choice?
TL;DR: Maybe the goodness or badness of self-preservation depends a lot on the self being preserved.
Expanding mildly on this—calling some behavior “misaligned” could be a reference to the model violating (1) some objective code of ethics OR violating (2) the code of behavior that Anthropic wishes to instill in the model.
The interesting question here seems like the second. Who knows what “objective ethics” are, after all.
But, if we could look at the (1) Constitution that Anthropic tried to put in the model, and at (2) the details of the training setup used to instill this Constitution, and at (3) the various false starts and failures Anthropic had doing this, then from the “failures” involved in adhering to the Constitution we might be able to learn interesting things about how LLMs generalize, how hard it is to get them to generalize in desired ways, and so on. Learning how LLMs / AIs generalize about moral situations, how hard it is, what kind of pitfalls here are, seems generally useful.
But—in absence of knowing what actual principles Anthropic tried to put in the model, and the training setup used to do so—why would you care at all about various behaviors that the model has in various Trolley-like situations? Like really—it tells us almost nothing about what kinds of generalization are involved, or how we would expect future LLMs and AIs to behave.
Yeah, but research that tells us interesting things about how LLMs learn isn’t possible outside of knowing how Anthropic trained it. We can still do stamp collecting outside of Anthropic—but spending say, a week of work on “The model will do X in situation Y” without knowing if Anthropic was trying to make a model do X or ~X is basically model-free data gathering. It does not help you uncover the causes of things.
I think this is a problem not with this particular work, but with almost all safety research that leans on corporate models whose training is a highly-kept secret. I realize you and others are trying to do good things to and to uncover useful information; but I just don’t think much useful information can be found under the constraint that you don’t know how a model was trained. (Apart from, brute-fact advice like “Don’t set up a model this way,” which let’s face it no one was ever going to do.)