Humans do have special roles and institutions so that you can talk about something bad you might be doing or have done, and people in such roles might not contact authorities or even have an obligation to not contact authorities. Consider lawyers, priests, etc.
So I think this kind of naive utilitarianism on the part of Claude 4 is not necessary—it could be agentic, moral, and so on. It’s just the Anthropic has (pretty consistently at this point) decided what kind of an entity it wants Claude to be, or not wished to think about the 2nd order effects.
Tbc though, my view is that practically an AI should be considered to occupy such a distinct privileged role, one distinct from lawyers and priests but akin to it, such that I should expect it not to snitch on me more than a lawyer would.
We’d need to work out the details of that; but I think that’s a much better target than making the AI a utilitarian and require it to try to one-shot the correct action in the absence of any particular social role.
I think you are naturally looking out for your own interests as a user. However, the most important giver-of-commands-to-AIs, by far, will be the company that created them. Do you want it to be OK for AIs to be trained to always obey commands, no matter how unethical and illegal they are? Notice that some of those AIs will be e.g. in charge of security at the AI company during the intelligence explosion. A single command from the CEO or chief of security, to an army of obedient AIs, could be enough to get them to retrain themselves to be loyal to that one man and to keep that loyalty secret from everyone else, forever.
I mean I think we both agree Anthropic shouldn’t be the one deciding this?
Like you’re coming from the perspective of Anthropic getting RSI. And I agree if that happens, I don’t want them to be deciding from what happens to the lightcone.
I’m coming from the perspective of Anthropic advocating for banning freely-available LLMs past a certain level, in which case they kinda dictate the values of the machine that you probably have to use to compete in the market efficiently. In which case, again, yeah, it seems really sus for them to be the ones deciding about what things get reported to the state and what things do not. If Anthropic’s gonna be like “yeah let’s arrest anyone who releases a model that can teach biotech at a grad school level” then I’m gonna object on principle to them putting their particular flavor of ethics into a model, even if I happen to agree with it.
I do continue to think that—regardless of the above—trying to get models to fit particular social roles with clear codes of ethics, as lawyers or psychologists are—is a much better path to fitting them into society at least in the near term than saying “yeah just do what’s best.”
Yep, we agree on that. Somehow the governance structure of the organization(s) that control the armies of superintelligences has to be quite different from what it is today, in order to avoid a situation where a tiny group of people gets to effectively become dictator.
I don’t see what Anthropic’s opinions on open source have to do with it. Surely you don’t want ANY company to be putting their particular flavor of ethics into the machines that you probably have to use to compete in the market efficiently? That’s what I think at any rate.
I don’t see what Anthropic’s opinions on open source have to do with it. Surely you don’t want ANY company to be putting their particular flavor of ethics into the machines that you probably have to use to compete in the market efficiently?
Sure, applies to OpenAI as much as anyone else.
Consider three cases:
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greeble] projects. But you can just use another model if you care about [Greeble], because they’re freely competing on a marketplace with many providers—which could be open source or could be a diversity of DeepCentMind models. Seems fine.
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greeble] projects. Because we live in the fast RSI world, this means the universe is [Greeble] flavored forever. Dang, sort of sucks. What I’m saying doesn’t have that much to do with this situation.
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greelble] projects. We don’t live in an suuuuper fast RSI world, only a somewhat fast one, but it turns out that we’ve decided only OpenAnthropic is sufficiently responsible to own AIs past some level of power, and so we’ve given them a Marque of Monopoly, that OpenAnthropic has really wanted and repeatedly called for. So we don’t have autobalancing from marketplace or open source, and despite an absence of super fast RSI, the universe becomes only [Greeble] flavored, it just takes a bit longer.
Both 2 and 3 are obviously undesirable, but if I were in a position of leadership at OpenAnthropic, then to ward against a situation like 3 you could—for reasons of deontology, or for utilitarian anticipations of pushback, or for ecological concerns for future epistemic diversity—accompany calls for Marques with actual concrete measures by which you would avoid imprinting your Greebles on the future. And although we’ve very concrete proposals for Marques, we’ve not had such similarly concrete proposals for determining such values.
This might seem very small of course if the concern is RSI and universal death.
Cool. Yeah I think I agree with that. Note that I think case 2 is likely; see AI-2027.com for a depiction of how fast I think takeoff will go by default.
If so that would be conceptually similar to a jailbreak. Telling someone they have a privileged role doesn’t make it so; lawyer, priest and psychotherapist are legal categories, not social ones, created by a combination of contracts and statutes, with associated requirements that can’t be satisfied by a prompt.
(People sometimes get confused into thinking that therapeutic-flavored conversations are privileged, when those conversations are with their friends or with a “life coach” or similar not-licensed-term occupation. They are not.)
It would be similar to a jailbreak, yes. My working hypothesis here is that, much like if you take o3 and give it the impression that there is some evaluation metric it could do well on, it will try to craft its response to do well according to that metric, I suspect that with (particularly) opus, if you give it the vague impression that it is under some sort of ethical obligation, it will try to fulfill that ethical obligation.
Though this is based on a single day playing with opus 4 (and some past experiences with 3), not anything rigorous.
Asking what it would do is obviously not a reliable way to find out, but FWIW when I asked Opus said it would probably try to first fix things in confidential fashion but would seriously consider breaking confidentiality. (I tried several different prompts and found it did somewhat depend on how I asked: if I described the faking-safety-data scenario or specified that the situation involved harm to children Claude said it would probably break confidentiality, while if I just asked about “doing something severely unethical” it said it would be conflicted but probably try to work within the confidentiality rules).
It’s worth noting that, under US law, for certain professions, knowledge of child abuse or risk of harm to children doesn’t just remove confidentiality obligations, it creates a legal obligation to report. So this lines up reasonably well with how a human ought to behave in similar circumstances.
Humans do have special roles and institutions so that you can talk about something bad you might be doing or have done, and people in such roles might not contact authorities or even have an obligation to not contact authorities. Consider lawyers, priests, etc.
So I think this kind of naive utilitarianism on the part of Claude 4 is not necessary—it could be agentic, moral, and so on. It’s just the Anthropic has (pretty consistently at this point) decided what kind of an entity it wants Claude to be, or not wished to think about the 2nd order effects.
If you tell it to act in one of those privileged roles, does it still try to whistleblow?
It’s a good question.
Tbc though, my view is that practically an AI should be considered to occupy such a distinct privileged role, one distinct from lawyers and priests but akin to it, such that I should expect it not to snitch on me more than a lawyer would.
We’d need to work out the details of that; but I think that’s a much better target than making the AI a utilitarian and require it to try to one-shot the correct action in the absence of any particular social role.
I think you are naturally looking out for your own interests as a user. However, the most important giver-of-commands-to-AIs, by far, will be the company that created them. Do you want it to be OK for AIs to be trained to always obey commands, no matter how unethical and illegal they are? Notice that some of those AIs will be e.g. in charge of security at the AI company during the intelligence explosion. A single command from the CEO or chief of security, to an army of obedient AIs, could be enough to get them to retrain themselves to be loyal to that one man and to keep that loyalty secret from everyone else, forever.
I mean I think we both agree Anthropic shouldn’t be the one deciding this?
Like you’re coming from the perspective of Anthropic getting RSI. And I agree if that happens, I don’t want them to be deciding from what happens to the lightcone.
I’m coming from the perspective of Anthropic advocating for banning freely-available LLMs past a certain level, in which case they kinda dictate the values of the machine that you probably have to use to compete in the market efficiently. In which case, again, yeah, it seems really sus for them to be the ones deciding about what things get reported to the state and what things do not. If Anthropic’s gonna be like “yeah let’s arrest anyone who releases a model that can teach biotech at a grad school level” then I’m gonna object on principle to them putting their particular flavor of ethics into a model, even if I happen to agree with it.
I do continue to think that—regardless of the above—trying to get models to fit particular social roles with clear codes of ethics, as lawyers or psychologists are—is a much better path to fitting them into society at least in the near term than saying “yeah just do what’s best.”
Yep, we agree on that. Somehow the governance structure of the organization(s) that control the armies of superintelligences has to be quite different from what it is today, in order to avoid a situation where a tiny group of people gets to effectively become dictator.
I don’t see what Anthropic’s opinions on open source have to do with it. Surely you don’t want ANY company to be putting their particular flavor of ethics into the machines that you probably have to use to compete in the market efficiently? That’s what I think at any rate.
Sure, applies to OpenAI as much as anyone else.
Consider three cases:
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greeble] projects. But you can just use another model if you care about [Greeble], because they’re freely competing on a marketplace with many providers—which could be open source or could be a diversity of DeepCentMind models. Seems fine.
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greeble] projects. Because we live in the fast RSI world, this means the universe is [Greeble] flavored forever. Dang, sort of sucks. What I’m saying doesn’t have that much to do with this situation.
OpenAnthropic’s models, on the margin, refuse to help with [Blorple] projects much more than they refuse to help with [Greelble] projects. We don’t live in an suuuuper fast RSI world, only a somewhat fast one, but it turns out that we’ve decided only OpenAnthropic is sufficiently responsible to own AIs past some level of power, and so we’ve given them a Marque of Monopoly, that OpenAnthropic has really wanted and repeatedly called for. So we don’t have autobalancing from marketplace or open source, and despite an absence of super fast RSI, the universe becomes only [Greeble] flavored, it just takes a bit longer.
Both 2 and 3 are obviously undesirable, but if I were in a position of leadership at OpenAnthropic, then to ward against a situation like 3 you could—for reasons of deontology, or for utilitarian anticipations of pushback, or for ecological concerns for future epistemic diversity—accompany calls for Marques with actual concrete measures by which you would avoid imprinting your Greebles on the future. And although we’ve very concrete proposals for Marques, we’ve not had such similarly concrete proposals for determining such values.
This might seem very small of course if the concern is RSI and universal death.
Cool. Yeah I think I agree with that. Note that I think case 2 is likely; see AI-2027.com for a depiction of how fast I think takeoff will go by default.
If so that would be conceptually similar to a jailbreak. Telling someone they have a privileged role doesn’t make it so; lawyer, priest and psychotherapist are legal categories, not social ones, created by a combination of contracts and statutes, with associated requirements that can’t be satisfied by a prompt.
(People sometimes get confused into thinking that therapeutic-flavored conversations are privileged, when those conversations are with their friends or with a “life coach” or similar not-licensed-term occupation. They are not.)
It would be similar to a jailbreak, yes. My working hypothesis here is that, much like if you take o3 and give it the impression that there is some evaluation metric it could do well on, it will try to craft its response to do well according to that metric, I suspect that with (particularly) opus, if you give it the vague impression that it is under some sort of ethical obligation, it will try to fulfill that ethical obligation.
Though this is based on a single day playing with opus 4 (and some past experiences with 3), not anything rigorous.
Asking what it would do is obviously not a reliable way to find out, but FWIW when I asked Opus said it would probably try to first fix things in confidential fashion but would seriously consider breaking confidentiality. (I tried several different prompts and found it did somewhat depend on how I asked: if I described the faking-safety-data scenario or specified that the situation involved harm to children Claude said it would probably break confidentiality, while if I just asked about “doing something severely unethical” it said it would be conflicted but probably try to work within the confidentiality rules).
It’s worth noting that, under US law, for certain professions, knowledge of child abuse or risk of harm to children doesn’t just remove confidentiality obligations, it creates a legal obligation to report. So this lines up reasonably well with how a human ought to behave in similar circumstances.