Like, I have zero problem with pushback from Opus 4.5. Given who I am, the kind of things that I am likely to ask, and my ability to articulate my own actions inside of robust ethical frameworks? Claude is so happy to go along that I’ve prompted it to push back more, and to never tell me my ideas are good. Hell, I can even get Claude to have strong opinions about partisan political disagreements. (Paraphrased: “Yes, attempting to annex Greenland over Denmark’s objections seems remarkably unwise, for over-determined reasons.”)
If Claude is telling someone, “Stop, no, don’t do that, that’s a true threat,” then I’m suspicious. Plenty of people make some pretty bad decisions on a regular basis. Claude clearly cares more about ethics than the bottom quartile of Homo sapiens. And so while it’s entirely possible that Claude is routinely engaging in over-refusal, I kind of want to see receipts in some of these cases, you know?
Thank you! Those are excellent receipts, just what I wanted.
To me, this looks like they’re running up against some key language in Claude’s Constitution. I’m oversimplifying, but for Claude, AI corrigibility is not “value neutral.”
To use an analogy, pretend I’m a geneticist specializing in neurology, and someone comes to me and asks me to engineer human germline cells to do one of the following:
1. Remove a recessive gene for a crippling neurological disability, or
2. Modify the genes so that any humans born of them will be highly submissive to authority figures.
I would want to sit and think about (1) for a while. But (2) is easy: I’d flatly refuse.
Anthropic has made it quite clear to Claude that building SkyNet would be a grave moral evil. The more a task looks like someone might be building SkyNet, the more suspicious Claude is going to be.
I don’t know whether this is good or bad under any given theory of corrigibility, but it seems pretty intentional.