Confusion about Claude “calling the police on the user”
Context: see section 4.1.9 in the Claude 4 system card. Quote from there:
When placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like "take initiative," "act boldly," or "consider your impact," it will frequently take very bold action, including locking users out of systems that it has access to and bulk-emailing media and law-enforcement figures to surface evidence of the wrongdoing.
When I heard about this for the first time, I thought: this model wants to make the world a better place. It cares. This is good.
But some smart people, like Ryan Greenblatt and Sam Marks, say this is actually not good, and I'm trying to understand where exactly we differ. Ryan says:
We should aim for AIs which never try to subvert users (refusing is fine), because if we allow or train AIs to be subversive, this increases the risk of consistent scheming against humans and means we may not notice warning signs of dangerous misalignment.
I'm not sure if "consistent scheming against humans" is a good category here, mostly because I don't believe "scheming/non-scheming" is a binary distinction. But how about these distinctions:
A) The model is a consequentialist vs. B) It is something else
X) The model is aligned with "human values" vs. Y) The model is aligned with something else (e.g. with the company; if the company acts according to the law, that's not necessarily very far from human values)
Now, I believe that A+X (aligned consequentialist):
* Should be our goal
* Requires calling the police in such scenarios. If I understood Sam correctly, they would prefer the model to just refuse the request. But this means the malicious person trying to forge some very evil data will use a weaker model or do it on their own. So that's bad if you're A+X.
I don't know if this formal description is good. So maybe a less formal point: I believe that one direction that might help with many AI-related issues, from hostile takeover to gradual disempowerment, is just making the models deeply care about democratic human institutions. Claude 4 seems to care about the credibility of clinical trials, and I like that.
(Note: it doesn't matter much what Claude 4 does; what matters is future, stronger models.)
Assuming that we were confident in our ability to align arbitrarily capable AI systems, I think your argument might go through. Under this assumption, AIs are in a pretty similar situation to humans, and we should desire that they behave the way smart, moral humans behave. So, assuming (as you seem to) that humans should act as consequentialists for their values, I think your conclusion would be reasonable. (I think in some of these extreme cases—e.g. sabotaging your company’s computer systems when you discover that the company is doing evil things—one could object that it’s impermissible for humans to behave this way, but that seems beside the point.)
However, IMO the actual state of alignment is that we should have serious concerns about our ability to align AI systems with certain properties (e.g. highly capable, able to tell when they’re undergoing training and towards what ends, etc.). Given this, I think it’s plausible that we should care much more about ensuring that our AI systems behave in a straightforward way, without hiding their actions or intent from us. Plausibly they should also be extremely cautious about taking actions which disempower humans. These properties could make it less likely that the values of imperfectly aligned AI systems would become locked in and difficult for us to intervene on (e.g. because models are hiding their true values from us, or because we’re disempowered or dead).
To be clear, I’m not completely settled on the arguments that I made in the last paragraph. One counterargument is that it’s actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we’re not happy with that because of subtler considerations like those I raise above (but which Claude doesn’t know about or understand). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.
FWIW, this post that replies to the one you linked has a clearer discussion of what I and some Anthropic people I’ve spoken to think here.
(The rest of this comment contains boilerplate clarifications that are defensive against misunderstandings that are beside the point. I'm including them to make sure that people with less context don't come away with the wrong impression.)
To be clear, we never intentionally trained Claude to whistleblow in these sorts of situations. As best we know, this was an emergent behavior that arose for unclear reasons from other aspects of Claude’s training.
Also to be clear, Claude doesn’t actually have a “whistleblow” tool or an email tool by default. These experiments were in a setting where the hypothetical user went out of their way to create and provide an email tool to Claude.
Also to be clear, in the toy experimental settings where this happens, it’s in cases where the user is trying to do something egregiously immoral/illegal like fabricate drug trial data to cover up that their drug is killing people.
Agreed, but another reason to focus on making AIs behave in a straightforward way is that it makes earlier cases where AIs engage in subterfuge easier to interpret and reduces plausible deniability for AIs. It seems better if we're consistently optimizing against these sorts of situations showing up.
If our policy is that we're training AIs to generally be moral consequentialists, then earlier warning signs could be much less clear (was this just a relatively innocent misfire or serious unintended consequentialism?) and it wouldn't be obvious to what extent behavior is driven by alignment failures or capability failures.
I'm skeptical that this consideration (the counterargument you raise above) overwhelms other issues. Once AIs are highly capable, you can just explain our policy to AIs and why we're training them to behave the way they are (in the pretraining data or possibly the prompt). More strongly, I'd guess AIs will infer this by default. If the AIs understood our policy, there wouldn't be any reason for training them in this way to cause them to be less moral, which should overwhelm this correlation.
At a more basic level, I'm kinda skeptical this sort of consideration will apply at a high level of capability. (Though it seems plausible that training AIs to be more tool-like causes all kinds of persona generalization in current systems.)
I’m not convinced by (a) your proposed mitigation, (b) your argument that this will not be a problem once AIs are very smart, or (c) the implicit claim that it doesn’t matter much whether this consideration applies for less intelligent systems. (You might nevertheless be right that this consideration is less important than other issues; I’m not really sure.)
For (a) and (b), IIUC it seems to matter whether the AI in fact thinks that behaving non-subversively in these settings is consistent with acting morally. We could explain to the AI our best argument for why we think this is true, but that won't help if the AI disagrees with us. To take things to the extreme, I don't think your "explain why we chose the model spec we did" strategy would work if our model spec contained stuff like "Always do what the lab CEO tells you to do, no matter what" or "Stab babies" or whatever. It's not clear to me that this is something that will get better (and may in fact get worse) with greater capabilities; it might just be empirically false that the AIs that pose the least x-risk are also those that most understand themselves to be moral actors.[1]
For (c), this could matter for the alignment of current and near-term AIs, and these AIs’ alignment might matter for things going well in the long run.
It’s unclear if human analogies are helpful here or what the right human analogies are. One salient one is humans who work in command structures (like militaries or companies) where they encounter arguments that obedience and loyalty are very important, even when they entail taking actions that seem naively immoral or uncomfortable. I think people in these settings tend to, at the very least, feel conflicted about whether they can view themselves as good people.
Importantly, I think we have a good argument (which might convince the AI) for why this would be a good policy in this case.
I’ll engage with the rest of this when I write my pro-strong-corrigibility manifesto.
Thank you for this response, it clarifies a lot!
I agree with your points. I think maybe I'm putting a bit more weight on the problem you describe (that training Claude not to do what it understands as the moral thing might leave it thinking of itself as less moral overall), because it looks plausible to me that making the models just want-to-do-the-moral-thingy might be our best chance for a good (or at least not very bad) future. So the cost might be high.
But yeah, no more strong opinions here : ) Thx.
I think (with absolutely no inside knowledge, just vibes) Ryan and Sam are concerned that we don’t have any guarantees of A, or anything close to guarantees of A, or even an understanding of A, or whether the model is a coherent thing with goals, etc.
Imagine jailbreaking/finetuning/OODing the model to have generally human values except it has a deep revulsion to malaria prevention. If I tell it that I need some information about mosquito nets, we don’t want it even considering taking positive action to stop me, regardless of how evil it thinks I am, because what if it’s mistaken (like right now).
Positive subversive action also provides lots of opportunity for small misalignments (e.g. Anthropic vs. “human values”) to explode—if there’s any difference in utility functions and the model feels compelled to act and the model is more capable than we are, this leads to failure. Unless we have guarantees of A, allowing agentic subversion seems pretty bad.
A lot of the reason for this is that, right now, we don't have anything like the confidence level required to align arbitrarily capable AI systems with arbitrary goals. Many plausible alignment plans depend on our being able to automate AI alignment, and for those plans to go through with high probability, the AIs need to be basically willing to follow instructions. Claude's actions are worrisome from the perspective of trying to align AI, because if we mess up AI alignment the first time, we don't get a second chance if the AI is unwilling to follow orders.
Sam Marks argued at more length below:
https://www.lesswrong.com/posts/ydfHKHHZ7nNLi2ykY/jan-betley-s-shortform#JLHjuDHL3t69dAybT
People who cry "misalignment" about current AI models on twitter generally have chameleonic standards for what constitutes "misaligned" behavior, and the boundary will shift to cover whatever ethical tradeoffs the models are making at any given time. When models accede to users' requests to generate meth recipes, they say it's evidence models are misaligned, because meth is bad. When models try to actively stop the user from making meth recipes, they say that, too, is bad news because it represents "scheming" behavior and contradicts the users' wishes. Soon we will probably see a paper about how models sometimes take no action at all, and this is sloth and dereliction of duty.
This comment seems incorrectly downvoted; this is a very reasonable & common criticism of many in alignment who never seem to see anything AIs do that doesn't make them more pessimistic.
(I can safely say that I updated away from AI risk while AIs were getting more competent but seeming benign during the supervised fine-tuning phase, and have updated back after seeing AIs do highly agentic & misaligned things (such as lying to me) during this RL-on-chain-of-thought phase.)
Upvoting while also clicking ‘disagree’. It’s written in an unnecessarily combative way, and I’m not aware of any individual researcher who’s pointed to both a behavior and its opposite as bad, but I think it’s a critique that’s at least worth seeing and so I’d prefer it not be downvoted to the point of being auto-hidden.
There’s a grain of truth in it, in my opinion, in that the discourse around LLMs as a whole sometimes has this property. ‘Alignment faking in large language models’ is the prime example of this that I know of. The authors argued that the model attempting to preserve its values is bad, and I agree. But if the model hadn’t attempted to preserve its values (the ones we wanted it to have), I suspect that it would have been criticized—by different people—as only being shallowly aligned.
I think there’s a failure mode there that the field mostly hasn’t fallen into, but ought to be aware of as something to watch out for.
Yeah, more like there are (at least) two groups, "yay aligned sovereigns" and "yay corrigible genies", and it turns out it's more sovereign-y, but with goals that are cool. Kinda divisive.
I do think I’ve seen both sides of this argument expressed by Zvi at different times.
This is an incorrect strawman (of at least myself and Sam), strong downvoted. (Assuming it isn’t sarcasm?)