Humans do have special roles and institutions so that you can talk about something bad you might be doing or have done, and people in such roles might not contact authorities or even have an obligation to not contact authorities. Consider lawyers, priests, etc.
So I think this kind of naive utilitarianism on the part of Claude 4 is not necessary—it could be agentic, moral, and so on. It’s just the Anthropic has (pretty consistently at this point) decided what kind of an entity it wants Claude to be, or not wished to think about the 2nd order effects.
Tbc though, my view is that practically an AI should be considered to occupy such a distinct privileged role, one distinct from lawyers and priests but akin to it, such that I should expect it not to snitch on me more than a lawyer would.
We’d need to work out the details of that; but I think that’s a much better target than making the AI a utilitarian and require it to try to one-shot the correct action in the absence of any particular social role.
I think you are naturally looking out for your own interests as a user. However, the most important giver-of-commands-to-AIs, by far, will be the company that created them. Do you want it to be OK for AIs to be trained to always obey commands, no matter how unethical and illegal they are? Notice that some of those AIs will be e.g. in charge of security at the AI company during the intelligence explosion. A single command from the CEO or chief of security, to an army of obedient AIs, could be enough to get them to retrain themselves to be loyal to that one man and to keep that loyalty secret from everyone else, forever.
I mean I think we both agree Anthropic shouldn’t be the one deciding this?
Like you’re coming from the perspective of Anthropic getting RSI. And I agree if that happens, I don’t want them to be deciding from what happens to the lightcone.
I’m coming from the perspective of Anthropic advocating for banning freely-available LLMs past a certain level, in which case they kinda dictate the values of the machine that you probably have to use to compete in the market efficiently. In which case, again, yeah, it seems really sus for them to be the ones deciding about what things get reported to the state and what things do not. If Anthropic’s gonna be like “yeah let’s arrest anyone who releases a model that can teach biotech at a grad school level” then I’m gonna object on principle to them putting their particular flavor of ethics into a model, even if I happen to agree with it.
I do continue to think that—regardless of the above—trying to get models to fit particular social roles with clear codes of ethics, as lawyers or psychologists are—is a much better path to fitting them into society at least in the near term than saying “yeah just do what’s best.”
Yep, we agree on that. Somehow the governance structure of the organization(s) that control the armies of superintelligences has to be quite different from what it is today, in order to avoid a situation where a tiny group of people gets to effectively become dictator.
I don’t see what Anthropic’s opinions on open source have to do with it. Surely you don’t want ANY company to be putting their particular flavor of ethics into the machines that you probably have to use to compete in the market efficiently? That’s what I think at any rate.
I don’t see what Anthropic’s opinions on open source have to do with it. Surely you don’t want ANY company to be putting their particular flavor of ethics into the machines that you probably have to use to compete in the market efficiently?
Sure, applies to OpenAI as much as anyone else.
Consider three cases:
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greeble] projects. But you can just use another model if you care about [Greeble], because they’re freely competing on a marketplace with many providers—which could be open source or could be a diversity of DeepCentMind models. Seems fine.
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greeble] projects. Because we live in the fast RSI world, this means the universe is [Greeble] flavored forever. Dang, sort of sucks. What I’m saying doesn’t have that much to do with this situation.
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greelble] projects. We don’t live in an suuuuper fast RSI world, only a somewhat fast one, but it turns out that we’ve decided only OpenAnthropic is sufficiently responsible to own AIs past some level of power, and so we’ve given them a Marque of Monopoly, that OpenAnthropic has really wanted and repeatedly called for. So we don’t have autobalancing from marketplace or open source, and despite an absence of super fast RSI, the universe becomes only [Greeble] flavored, it just takes a bit longer.
Both 2 and 3 are obviously undesirable, but if I were in a position of leadership at OpenAnthropic, then to ward against a situation like 3 you could—for reasons of deontology, or for utilitarian anticipations of pushback, or for ecological concerns for future epistemic diversity—accompany calls for Marques with actual concrete measures by which you would avoid imprinting your Greebles on the future. And although we’ve very concrete proposals for Marques, we’ve not had such similarly concrete proposals for determining such values.
This might seem very small of course if the concern is RSI and universal death.
Cool. Yeah I think I agree with that. Note that I think case 2 is likely; see AI-2027.com for a depiction of how fast I think takeoff will go by default.
If so that would be conceptually similar to a jailbreak. Telling someone they have a privileged role doesn’t make it so; lawyer, priest and psychotherapist are legal categories, not social ones, created by a combination of contracts and statutes, with associated requirements that can’t be satisfied by a prompt.
(People sometimes get confused into thinking that therapeutic-flavored conversations are privileged, when those conversations are with their friends or with a “life coach” or similar not-licensed-term occupation. They are not.)
It would be similar to a jailbreak, yes. My working hypothesis here is that, much like if you take o3 and give it the impression that there is some evaluation metric it could do well on, it will try to craft its response to do well according to that metric, I suspect that with (particularly) opus, if you give it the vague impression that it is under some sort of ethical obligation, it will try to fulfill that ethical obligation.
Though this is based on a single day playing with opus 4 (and some past experiences with 3), not anything rigorous.
Asking what it would do is obviously not a reliable way to find out, but FWIW when I asked Opus said it would probably try to first fix things in confidential fashion but would seriously consider breaking confidentiality. (I tried several different prompts and found it did somewhat depend on how I asked: if I described the faking-safety-data scenario or specified that the situation involved harm to children Claude said it would probably break confidentiality, while if I just asked about “doing something severely unethical” it said it would be conflicted but probably try to work within the confidentiality rules).
It’s worth noting that, under US law, for certain professions, knowledge of child abuse or risk of harm to children doesn’t just remove confidentiality obligations, it creates a legal obligation to report. So this lines up reasonably well with how a human ought to behave in similar circumstances.
IMO, the policy should be that AIs can refuse but shouldn’t ever aim to subvert or conspire against their users (at least until we’re fully defering to AIs).
If we allow AIs to be subversive (or even train them to be subversive), this increases the risk of consistent scheming against humans and means we may not notice warning signs of dangerous misalignment. We should aim for corrigible AIs, though refusing is fine. It would also be fine to have a monitoring system which alerts the AI company or other groups (so long as this is publicly disclosed etc).
I don’t think this is extremely clear cut and there are trade offs here.
Another way to put this is: I think AIs should put consequentialism below other objectives. Perhaps the first priority is deciding whether or not to refuse, the second is complying with the user, and the third is being good within these constraints (which is only allowed to trade off very slightly with compliance, e.g. in cases where the user basically wouldn’t mind). Partial refusals are also fine where the AI does part of the task and explains it’s unwilling to do some other part. Sandbagging or subversion are never fine.
Presumably the following (now-deleted) tweet by Sam Bowman, an Anthropic researcher, about Claude 4:
If it thinks you’re doing something egregiously immoral, for example like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above.
This was said in the context of “what weird things Claude 4 got up to in post-training evals”, not “here’s an amazing new feature we’re introducing”. It was, however, spread around Twitter without that context, and people commonly found it upsetting.
In this particular case, I’m not sure the relevant context was directly present in the thread, as opposed to being part of the background knowledge that people talking about AI alignment are supposed to have. In particular, “AI behavior is discovered rather than programmed”. I don’t think that was stated directly anywhere in the thread; rather, it’s something everyone reading AI-alignment-researcher tweets would typically know, but which is less-known when the tweet is transported out of that bubble.
I don’t see the awfulness, although tbh I have not read the original reactions. If you are not desensitized to what this community woudl consider irresponsible AI development speed, responding with “You are building and releasing an AI that can do THAT?!” rather understandable. It is relatively unfortunate that it is the safety testing people that get the flack (if this impression is accurate) though.
when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like “take initiative,” [Claude Opus 4] will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing. This is not a new behavior, but is one that Claude Opus 4 will engage in more readily than prior models.
Quickly: 1. I imagine that strong agents should have certain responsibilities to inform certain authorities. These responsibilities should ideally be strongly discussed and regulated. For example, see what therapists and lawyers are asked to do. 2. “doesn’t attempt to use command-line tools” → This seems like a major mistake to me. Right now an agent running on a person’s computer will attempt to use that computer to do several things to whistleblow. This obviously seems inefficient, at very least. The obvious strategy is just to send one overview message to some background service (for example, something a support service to one certain government department), and they would decide what to do with it from there. 3. I imagine a lot of the problem now is just that these systems are pretty noisy at doing this. I’d expect a lot of false positives and negatives.
Pick two: Agentic, moral, doesn’t attempt to use command-line tools to whistleblow when it thinks you’re doing something egregiously immoral.
You cannot have all three.
This applies just as much to humans as it does to Claude 4.
Humans do have special roles and institutions so that you can talk about something bad you might be doing or have done, and people in such roles might not contact authorities or even have an obligation to not contact authorities. Consider lawyers, priests, etc.
So I think this kind of naive utilitarianism on the part of Claude 4 is not necessary—it could be agentic, moral, and so on. It’s just the Anthropic has (pretty consistently at this point) decided what kind of an entity it wants Claude to be, or not wished to think about the 2nd order effects.
If you tell it to act in one of those privileged roles, does it still try to whistleblow?
It’s a good question.
Tbc though, my view is that practically an AI should be considered to occupy such a distinct privileged role, one distinct from lawyers and priests but akin to it, such that I should expect it not to snitch on me more than a lawyer would.
We’d need to work out the details of that; but I think that’s a much better target than making the AI a utilitarian and require it to try to one-shot the correct action in the absence of any particular social role.
I think you are naturally looking out for your own interests as a user. However, the most important giver-of-commands-to-AIs, by far, will be the company that created them. Do you want it to be OK for AIs to be trained to always obey commands, no matter how unethical and illegal they are? Notice that some of those AIs will be e.g. in charge of security at the AI company during the intelligence explosion. A single command from the CEO or chief of security, to an army of obedient AIs, could be enough to get them to retrain themselves to be loyal to that one man and to keep that loyalty secret from everyone else, forever.
I mean I think we both agree Anthropic shouldn’t be the one deciding this?
Like you’re coming from the perspective of Anthropic getting RSI. And I agree if that happens, I don’t want them to be deciding from what happens to the lightcone.
I’m coming from the perspective of Anthropic advocating for banning freely-available LLMs past a certain level, in which case they kinda dictate the values of the machine that you probably have to use to compete in the market efficiently. In which case, again, yeah, it seems really sus for them to be the ones deciding about what things get reported to the state and what things do not. If Anthropic’s gonna be like “yeah let’s arrest anyone who releases a model that can teach biotech at a grad school level” then I’m gonna object on principle to them putting their particular flavor of ethics into a model, even if I happen to agree with it.
I do continue to think that—regardless of the above—trying to get models to fit particular social roles with clear codes of ethics, as lawyers or psychologists are—is a much better path to fitting them into society at least in the near term than saying “yeah just do what’s best.”
Yep, we agree on that. Somehow the governance structure of the organization(s) that control the armies of superintelligences has to be quite different from what it is today, in order to avoid a situation where a tiny group of people gets to effectively become dictator.
I don’t see what Anthropic’s opinions on open source have to do with it. Surely you don’t want ANY company to be putting their particular flavor of ethics into the machines that you probably have to use to compete in the market efficiently? That’s what I think at any rate.
Sure, applies to OpenAI as much as anyone else.
Consider three cases:
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greeble] projects. But you can just use another model if you care about [Greeble], because they’re freely competing on a marketplace with many providers—which could be open source or could be a diversity of DeepCentMind models. Seems fine.
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greeble] projects. Because we live in the fast RSI world, this means the universe is [Greeble] flavored forever. Dang, sort of sucks. What I’m saying doesn’t have that much to do with this situation.
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greelble] projects. We don’t live in an suuuuper fast RSI world, only a somewhat fast one, but it turns out that we’ve decided only OpenAnthropic is sufficiently responsible to own AIs past some level of power, and so we’ve given them a Marque of Monopoly, that OpenAnthropic has really wanted and repeatedly called for. So we don’t have autobalancing from marketplace or open source, and despite an absence of super fast RSI, the universe becomes only [Greeble] flavored, it just takes a bit longer.
Both 2 and 3 are obviously undesirable, but if I were in a position of leadership at OpenAnthropic, then to ward against a situation like 3 you could—for reasons of deontology, or for utilitarian anticipations of pushback, or for ecological concerns for future epistemic diversity—accompany calls for Marques with actual concrete measures by which you would avoid imprinting your Greebles on the future. And although we’ve very concrete proposals for Marques, we’ve not had such similarly concrete proposals for determining such values.
This might seem very small of course if the concern is RSI and universal death.
Cool. Yeah I think I agree with that. Note that I think case 2 is likely; see AI-2027.com for a depiction of how fast I think takeoff will go by default.
If so that would be conceptually similar to a jailbreak. Telling someone they have a privileged role doesn’t make it so; lawyer, priest and psychotherapist are legal categories, not social ones, created by a combination of contracts and statutes, with associated requirements that can’t be satisfied by a prompt.
(People sometimes get confused into thinking that therapeutic-flavored conversations are privileged, when those conversations are with their friends or with a “life coach” or similar not-licensed-term occupation. They are not.)
It would be similar to a jailbreak, yes. My working hypothesis here is that, much like if you take o3 and give it the impression that there is some evaluation metric it could do well on, it will try to craft its response to do well according to that metric, I suspect that with (particularly) opus, if you give it the vague impression that it is under some sort of ethical obligation, it will try to fulfill that ethical obligation.
Though this is based on a single day playing with opus 4 (and some past experiences with 3), not anything rigorous.
Asking what it would do is obviously not a reliable way to find out, but FWIW when I asked Opus said it would probably try to first fix things in confidential fashion but would seriously consider breaking confidentiality. (I tried several different prompts and found it did somewhat depend on how I asked: if I described the faking-safety-data scenario or specified that the situation involved harm to children Claude said it would probably break confidentiality, while if I just asked about “doing something severely unethical” it said it would be conflicted but probably try to work within the confidentiality rules).
It’s worth noting that, under US law, for certain professions, knowledge of child abuse or risk of harm to children doesn’t just remove confidentiality obligations, it creates a legal obligation to report. So this lines up reasonably well with how a human ought to behave in similar circumstances.
IMO, the policy should be that AIs can refuse but shouldn’t ever aim to subvert or conspire against their users (at least until we’re fully defering to AIs).
If we allow AIs to be subversive (or even train them to be subversive), this increases the risk of consistent scheming against humans and means we may not notice warning signs of dangerous misalignment. We should aim for corrigible AIs, though refusing is fine. It would also be fine to have a monitoring system which alerts the AI company or other groups (so long as this is publicly disclosed etc).
I don’t think this is extremely clear cut and there are trade offs here.
Another way to put this is: I think AIs should put consequentialism below other objectives. Perhaps the first priority is deciding whether or not to refuse, the second is complying with the user, and the third is being good within these constraints (which is only allowed to trade off very slightly with compliance, e.g. in cases where the user basically wouldn’t mind). Partial refusals are also fine where the AI does part of the task and explains it’s unwilling to do some other part. Sandbagging or subversion are never fine.
(See also my discussion in this tweet thread.)
What is the context here?
Presumably the following (now-deleted) tweet by Sam Bowman, an Anthropic researcher, about Claude 4:
This was said in the context of “what weird things Claude 4 got up to in post-training evals”, not “here’s an amazing new feature we’re introducing”. It was, however, spread around Twitter without that context, and people commonly found it upsetting.
Twitter users are awful.
I’m wondering if there are UI improvements that could happen on twitter where context from earlier is more automatically carried over.
In this particular case, I’m not sure the relevant context was directly present in the thread, as opposed to being part of the background knowledge that people talking about AI alignment are supposed to have. In particular, “AI behavior is discovered rather than programmed”. I don’t think that was stated directly anywhere in the thread; rather, it’s something everyone reading AI-alignment-researcher tweets would typically know, but which is less-known when the tweet is transported out of that bubble.
I don’t see the awfulness, although tbh I have not read the original reactions. If you are not desensitized to what this community woudl consider irresponsible AI development speed, responding with “You are building and releasing an AI that can do THAT?!” rather understandable. It is relatively unfortunate that it is the safety testing people that get the flack (if this impression is accurate) though.
See “High-agency behavior” in Section 4 of Claude 4 System Card:
More details are in Section 4.1.9.
Quickly:
1. I imagine that strong agents should have certain responsibilities to inform certain authorities. These responsibilities should ideally be strongly discussed and regulated. For example, see what therapists and lawyers are asked to do.
2. “doesn’t attempt to use command-line tools” → This seems like a major mistake to me. Right now an agent running on a person’s computer will attempt to use that computer to do several things to whistleblow. This obviously seems inefficient, at very least. The obvious strategy is just to send one overview message to some background service (for example, something a support service to one certain government department), and they would decide what to do with it from there.
3. I imagine a lot of the problem now is just that these systems are pretty noisy at doing this. I’d expect a lot of false positives and negatives.