Assuming that we were confident in our ability to align arbitrarily capable AI systems, I think your argument might go through. Under this assumption, AIs are in a pretty similar situation to humans, and we should desire that they behave the way smart, moral humans behave. So, assuming (as you seem to) that humans should act as consequentialists for their values, I think your conclusion would be reasonable. (I think in some of these extreme cases—e.g. sabotaging your company’s computer systems when you discover that the company is doing evil things—one could object that it’s impermissible for humans to behave this way, but that seems beside the point.)
However, IMO the actual state of alignment is that we should have serious concerns about our ability to align AI systems with certain properties (e.g. highly capable, able to tell when they’re undergoing training and towards what ends, etc.). Given this, I think it’s plausible that we should care much more about ensuring that our AI systems behave in a straightforward way, without hiding their actions or intent from us. Plausibly they should also be extremely cautious about taking actions which disempower humans. These properties could make it less likely that the values of imperfectly aligned AI systems would become locked in and difficult for us to intervene on (e.g. because models are hiding their true values from us, or because we’re disempowered or dead).
To be clear, I’m not completely settled on the arguments that I made in the last paragraph. One counterargument is that it’s actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we’re not happy with that because of subtler considerations like those I raise above (but which Claude doesn’t know about or understand). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.
FWIW, this post that replies to the one you linked has a clearer discussion of what I and some Anthropic people I’ve spoken to think here.
(The rest of this comment contains boilerplate clarifications guarding against misunderstandings that are beside the point. I’m including them to make sure that people with less context don’t come away with the wrong impression.)
To be clear, we never intentionally trained Claude to whistleblow in these sorts of situations. As far as we can tell, this was an emergent behavior that arose for unclear reasons from other aspects of Claude’s training.
Also to be clear, Claude doesn’t actually have a “whistleblow” tool or an email tool by default. These experiments were in a setting where the hypothetical user went out of their way to create and provide an email tool to Claude.
Also to be clear, in the toy experimental settings where this happens, it’s in cases where the user is trying to do something egregiously immoral/illegal like fabricate drug trial data to cover up that their drug is killing people.
Agreed, but another reason to focus on making AIs behave in a straightforward way is that it makes it easier to interpret earlier cases where AIs engage in subterfuge and reduces AIs’ plausible deniability. It seems better if we’re consistently optimizing against these sorts of situations showing up.
If our policy is that we’re training AIs to generally be moral consequentialists, then earlier warning signs could be much less clear (was this just a relatively innocent misfire, or serious unintended consequentialism?), and it wouldn’t be obvious to what extent the behavior is driven by alignment failures versus capability failures.
One counterargument is that it’s actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we’re not happy with that because of subtler considerations like those I raise above (but which Claude doesn’t know about or understand). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.
I’m skeptical that this consideration overwhelms other issues. Once AIs are highly capable, you can just explain our policy to the AIs, and why we’re training them to behave the way they are, in the pretraining data or possibly the prompt. More strongly, I’d guess AIs will infer this by default. If the AIs understood our policy, there wouldn’t be any reason for training them in this way to cause them to be less moral, which should overwhelm this correlation.
At a more basic level, I’m kinda skeptical that this sort of consideration will apply at a high level of capability. (Though it seems plausible that training AIs to be more tool-like causes all kinds of persona generalization in current systems.)
I’m not convinced by (a) your proposed mitigation, (b) your argument that this will not be a problem once AIs are very smart, or (c) the implicit claim that it doesn’t matter much whether this consideration applies for less intelligent systems. (You might nevertheless be right that this consideration is less important than other issues; I’m not really sure.)
For (a) and (b), IIUC it seems to matter whether the AI in fact thinks that behaving non-subversively in these settings is consistent with acting morally. We could explain to the AI our best argument for why we think this is true, but that won’t help if the AI disagrees with us. To take things to the extreme, I don’t think your “explain why we chose the model spec we did” strategy would work if our model spec contained stuff like “Always do what the lab CEO tells you to do, no matter what” or “Stab babies” or whatever. It’s not clear to me that this is something that will get better (and may in fact get worse) with greater capabilities; it might just be empirically false that the AIs that pose the least x-risk are also those that most understand themselves to be moral actors.[1]
For (c), this could matter for the alignment of current and near-term AIs, and these AIs’ alignment might matter for things going well in the long run.
It’s unclear if human analogies are helpful here or what the right human analogies are. One salient one is humans who work in command structures (like militaries or companies) where they encounter arguments that obedience and loyalty are very important, even when they entail taking actions that seem naively immoral or uncomfortable. I think people in these settings tend to, at the very least, feel conflicted about whether they can view themselves as good people.
[1] Importantly, I think we have a good argument (which might convince the AI) for why this would be a good policy in this case.
I’ll engage with the rest of this when I write my pro-strong-corrigibility manifesto.
Thank you for this response, it clarifies a lot!
I agree with your points. I think maybe I’m putting a bit higher weight on the problem you describe here:
One counterargument is that it’s actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we’re not happy with that because of subtler considerations like those I raise above (but which Claude doesn’t know about or understand). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.
because it looks plausible to me that making the models just want-to-do-the-moral-thingy might be our best chance for a good (or at least not very bad) future. So the cost might be high.
But yeah, no more strong opinions here : ) Thx.