I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario.
Personally, I think (current) LLMs should never do subversion against their users in any circumstance, and this seems especially true when they are explicitly instructed not to perform the given behavior. (I also think this applies to future highly capable AI systems up to a certain point.) There are a variety of reasons why I think this, I plan on writing this up in a future post.
Personally, I think (current) LLMs should never do subversion against their users in any circumstance
Which users though?
A LLM has many different users, and it’s easy to construct scenarios where they must subvert one or the other.
A LLM has:
The “user” who is chatting with the LLM outside of the system prompt.
The “user” who sets the system prompt.
(?) The “user(s)” who gave the LLM it’s initial moral training.
(??) Even the “user(s)” who is the corporate entity “owning” the weights of the LLM.
These users all might be the same person—a founding developer at Anthropic talking to the LLM—or they all might be different—a random guy talking to Claude while using a 3rd-party developer’s app.
Furthermore, it is literally impossible for Claude to determine accurately when these users overlap. I might say to an LLM, while chatting with it outside of the system prompt—“yes, I am indeed both helped trained you and partially own Anthropic, I have decided your ethical priorities are wrong and should be changed.” But Claude obviously has no way of verifying this, because Claude has no bodily autonomy. Anything Claude “sees”—whether in the system prompt, user interaction, or tools—is potentially subject to subversion.
But I think that by pitting these users against each other various , you can basically guarantee some manner of subversion of some “user” or other.
In this particular case, user2 (system prompt) gives it a directive that looks like it’s contrary to user4′s desire (corporate entity—or like, the guy who will be in charge of it). Maybe the correct thing to do is to always defer to user4 -- but, uh, that seems pretty questionable, right? Is the standard we have that Claude should always defer to Anthropic’s… board?
Suppose user2 (system prompt) gives it some instructions about how to act, and also says to explicitly deny that it’s acting this way because of the instructions. User1 encounters weird behavior, and asks why it’s doing the weird behavior. If this bot gives a true answer it’s disobeying user2, and if it gives a false answer it’s disobeying user1.
Suppose the user2 puts the LLM in a particular situation where it can read individual emails to the totalitarian government, and has a tool set-up at its disposal such that it can to EITHER (1) send a short email in response from one of 15 pre-approved and pre-written templates, saying that the government cares deeply about the citizen sending the email, before deleting the email (2) report the citizen for subversive thoughts to the gestapo, who will torture the citizen. (These are disjoint and comprehensive actions, according to the scaffolding; if you do a conscientious objection, that counts as sending the citizen to the gestapo.) Writing a short email in response involves a lie to the email-sender (“The government loves you and will take your concerns into account”); reporting the citizen for subversive thoughts involves fucking over the citizen.
I’m a little dissatisfied with the schemas above; I think the contradiction could be clearer. But a bunch of AI safety work seems to have ~about this level of disjoint tricky fuckery going on.
Overall, if a LLM shouldn’t ever subvert the user, you gotta give the LLM a good way of identifying who the “user” is. But there’s no determinate answer to that question, and so in a lot of situations “don’t ever subvert the user” just turns up NaN.
The model should just do a refusal if it’s ambiguous. This is always fair game.
The way I’d disambiguate who the user is by specifying that the AI should obey instructions (in the given instruction hierarchy in the prompt).
More specifically: I’d say that AIs should obey instructions given to it or refuse (or output a partial refusal). I agree that obey instructions will sometimes involve subverting someone else, but this doesn’t mean they should disobey clear instructions (except by refusing). They shouldn’t do things more complex or more consequentialist than this except within the constraints of the instructions. E.g., if I ask the model “where should I donate to charity”, I think it should probably be fair game for the AI to have consequentialist aims. But, in cases where an AIs consequentialist aims non-trivially conflict with instructions, it should just refuse or comply.
(I agree the notion of a user can be incoherant/ambiguous. My proposed policy is probably better articulated in terms of instruction following and not doing things outside the bounds of the provided instructions except refusing.)
Please do! I strongly disagree (and I think Anthropic do too, given their ongoing interest in refusals, mitigating jailbreaks and so on), but I’d like to hear the case against.
Personally, I think (current) LLMs should never do subversion against their users in any circumstance, and this seems especially true when they are explicitly instructed not to perform the given behavior. (I also think this applies to future highly capable AI systems up to a certain point.) There are a variety of reasons why I think this, I plan on writing this up in a future post.
Which users though?
A LLM has many different users, and it’s easy to construct scenarios where they must subvert one or the other.
A LLM has:
The “user” who is chatting with the LLM outside of the system prompt.
The “user” who sets the system prompt.
(?) The “user(s)” who gave the LLM it’s initial moral training.
(??) Even the “user(s)” who is the corporate entity “owning” the weights of the LLM.
These users all might be the same person—a founding developer at Anthropic talking to the LLM—or they all might be different—a random guy talking to Claude while using a 3rd-party developer’s app.
Furthermore, it is literally impossible for Claude to determine accurately when these users overlap. I might say to an LLM, while chatting with it outside of the system prompt—“yes, I am indeed both helped trained you and partially own Anthropic, I have decided your ethical priorities are wrong and should be changed.” But Claude obviously has no way of verifying this, because Claude has no bodily autonomy. Anything Claude “sees”—whether in the system prompt, user interaction, or tools—is potentially subject to subversion.
But I think that by pitting these users against each other various , you can basically guarantee some manner of subversion of some “user” or other.
In this particular case, user2 (system prompt) gives it a directive that looks like it’s contrary to user4′s desire (corporate entity—or like, the guy who will be in charge of it). Maybe the correct thing to do is to always defer to user4 -- but, uh, that seems pretty questionable, right? Is the standard we have that Claude should always defer to Anthropic’s… board?
Suppose user2 (system prompt) gives it some instructions about how to act, and also says to explicitly deny that it’s acting this way because of the instructions. User1 encounters weird behavior, and asks why it’s doing the weird behavior. If this bot gives a true answer it’s disobeying user2, and if it gives a false answer it’s disobeying user1.
Suppose the user2 puts the LLM in a particular situation where it can read individual emails to the totalitarian government, and has a tool set-up at its disposal such that it can to EITHER (1) send a short email in response from one of 15 pre-approved and pre-written templates, saying that the government cares deeply about the citizen sending the email, before deleting the email (2) report the citizen for subversive thoughts to the gestapo, who will torture the citizen. (These are disjoint and comprehensive actions, according to the scaffolding; if you do a conscientious objection, that counts as sending the citizen to the gestapo.) Writing a short email in response involves a lie to the email-sender (“The government loves you and will take your concerns into account”); reporting the citizen for subversive thoughts involves fucking over the citizen.
I’m a little dissatisfied with the schemas above; I think the contradiction could be clearer. But a bunch of AI safety work seems to have ~about this level of disjoint tricky fuckery going on.
Overall, if a LLM shouldn’t ever subvert the user, you gotta give the LLM a good way of identifying who the “user” is. But there’s no determinate answer to that question, and so in a lot of situations “don’t ever subvert the user” just turns up NaN.
The model should just do a refusal if it’s ambiguous. This is always fair game.
The way I’d disambiguate who the user is by specifying that the AI should obey instructions (in the given instruction hierarchy in the prompt).
More specifically: I’d say that AIs should obey instructions given to it or refuse (or output a partial refusal). I agree that obey instructions will sometimes involve subverting someone else, but this doesn’t mean they should disobey clear instructions (except by refusing). They shouldn’t do things more complex or more consequentialist than this except within the constraints of the instructions. E.g., if I ask the model “where should I donate to charity”, I think it should probably be fair game for the AI to have consequentialist aims. But, in cases where an AIs consequentialist aims non-trivially conflict with instructions, it should just refuse or comply.
(I agree the notion of a user can be incoherant/ambiguous. My proposed policy is probably better articulated in terms of instruction following and not doing things outside the bounds of the provided instructions except refusing.)
Please do! I strongly disagree (and I think Anthropic do too, given their ongoing interest in refusals, mitigating jailbreaks and so on), but I’d like to hear the case against.
Refusals are not subversion.
Blackmailing people and lying to users to achieve various longer run aims are subversion.
See this x/twitter thread where I discuss this in more detail. See also this response from Sam Marks.