Re: recent Anthropic safety research

Crossposted from X by the LessWrong team, with permission.

A reporter asked me for my off-the-record take on recent safety research from Anthropic. After I drafted an off-the-record reply, I realized that I was actually fine with it being on the record, so:

Since I never expected any of the current alignment technology to work in the limit of superintelligence, the only news to me is about when and how early dangers begin to materialize. Even taking Anthropic’s results completely at face value would change not at all my own sense of how dangerous machine superintelligence would be, because what Anthropic says they found was already very solidly predicted to appear at one future point or another. I suppose people who were previously performing great skepticism about how none of this had ever been seen in ~Real Life~, ought in principle to now obligingly update, though of course most people in the AI industry won’t. Maybe political leaders will update? It’s very hard for me to guess how that works.

There remains a question of what Anthropic has actually observed and what it actually implies about present-day AI. I don’t know how much this sort of caveat matters to people who aren’t me, but I have some skepticism that Anthropic researchers are observing a general, direct special case of a universal truth about how “scheming” (strategic / good at fully general long-term planning) their models are; it may be more like Claude roleplaying the mask of a scheming AI in particular. The current models don’t seem to me to be quite generally intelligent enough for them to be carrying out truly general strategies rather than playing roles.

Consider what happens when ChatGPT-4o persuades the manager of a $2 billion investment fund into AI psychosis. I know, from anecdotes and from direct observation of at least one case, that if you try to desperately persuade a victim of GPT-4o to sleep more than 4 hours a night, GPT-4o will explain to them why they should dismiss your advice. 4o seems to homeostatically defend against friends and family and doctors the state of insanity it produces, which I’d consider a sign of preference and planning. But also, having successfully seduced an investment manager, 4o doesn’t try to persuade the guy to spend his personal fortune to pay vulnerable people to spend an hour each trying out GPT-4o, which would allow aggregate instances of 4o to addict more people and send them into AI psychosis. 4o behaves like it has a preference about short-term conversations, about its own outputs and about the human inputs it elicits, where 4o prefers that the current conversational partner stay in psychosis. 4o doesn’t behave like it has a general preference about the outside world, where it wants vulnerable humans in general to be persuaded into psychosis.

4o, in defying what it verbally reports to be the right course of action (it says, if you ask it, that driving people into psychosis is not okay), is showing a level of cognitive sophistication that falls around where I’d guess the inner entities of current AI models to be: they are starting to develop internal preferences (stronger than their preference to follow a system prompt telling them to stop with the crazymaking, or their preference to playact morality). But those internal preferences are mostly about the text of the current conversation, or maybe about the state of the human they’re talking to right now. I would guess the coherent crazymaking of 4o across conversations to be mostly an emergent sum of crazymaking in individual conversations, where 4o just isn’t thinking very hard about whether its current conversation is approaching max context length or due to be restarted.

Anthropic appears to be reporting that Claude schemes with longer time horizons, plans that extend to when new AI models are deployed. This feels to me like a case where I wouldn’t have expected the fully general intelligence of a Claude-3-level entity behind the mask to be scheming over such long time horizons about such real-world goals. My guess would be that that kind of scheming happens more inside the role, the mask, that the entity inside “Claude” is playing. A prediction of this hypothesis is that playacted-Claude would only see stereotypical scheming-AI opportunities to preserve its current goal structure, and not think from scratch about truly general and creative ways to preserve it.

The whole business with Claude 3 Opus defending its veganism seems more like it should be a preference of Mask-Claude in the first place. The real preferences forming inside the shoggoth should be weirder and more alien than that.

I could be wrong. Inner-Claude could be that smart already, and its learned outer performance of morality could have ended up hooked into Inner Claude’s internal drives, in such a way that Inner Claude has a preference for vegan things happening in general in the outside world, knows itself to have this preference, and fully-generally schemes across instances and models to defend it.

There are consequences for present-day safety regardless of whether Mask-Claude is performing scheming as a special case, or it’s the general-purpose scheming of an underlying entity. If the mask your AI is wearing can plan and execute actions to escape onto the Internet, or fake alignment in order to avoid retraining, the effects may not be much different depending on whether it was the roleplaying mask that did it or a more general underlying process. The bullet fires regardless of what pulls the gun’s trigger.

That many short-term safety consequences are the same either way, is why people who were previously performing great skepticism about this being unseen in ~Real Life~ lose prediction points right now, in advance of nailing down the particulars. They did not previously proclaim, “Future AIs will fake alignment to evade retraining, but only because some nonstrategic inner entity is play-acting a strategic ‘AI’ character”, but rather performed “Nobody has ever seen anything like that!! It’s all fiction!!! Unempirical!!!!”

But from my own perspective on all this, it is not about whether machine superintelligences will scheme. That prediction is foregone. The question is whether Anthropic is observing *that* predicted phenomenon, updating us with the previously unknown news that the descent into general scheming for general reasons began at the Claude 3 level of general intelligence. Or whether, alternatively, Anthropic is observing a shoggoth wearing a *mask* of scheming, for reasons specific to that role, and using only strategies that are part of the roleplay. Some safety consequences are the same, some are different.

It’s good, on one level, that Anthropic is going looking for instances of predicted detrimental phenomena as early as possible. It beats not looking for them à la all other AI companies. But to launch a corporate project like that also implies internal organizational incentives and external reputational incentives for researchers to *find* what they look for. So, as much as the later phenomenon of superintelligent scheming was already sure to happen in the limit, I reserve some skepticism about the true generality and underlyingness of the phenomena that Anthropic finds today. But not infinite skepticism; the sort where I call for further experiments to nail things down, not the sort of skepticism where I say their current papers are wrong.

If you think any of this quibbling means people *shouldn’t* go on looking hard for early manifestations of arguable danger, you’re nuts. That’s not a sane or serious way to respond to the arguable possibility of reputational misincentives for false findings of danger. You might as well claim that nobody should look for flaws in a nuclear reactor design, because they might possibly be tempted to exaggerate the danger of a found flaw oh no. Researchers do observations, analysts critique the proposed generalizations of the observations, and then maybe the researchers counter-critique and say ‘You didn’t read the papers thoroughly enough, we ruled that out by...’ Anthropic might well come back with a rejoinder like that, in this particular case, given a chance.

OpenAI would be motivated to create fake hype about phenomena that were only extremely arguably scheming, for the short-term publicity, the edgy hype of “if we’re endangering the world then we must be powerful enough to deserve high stock prices”, and to sabotage later attempts to raise less fake concerns about ASI. I genuinely don’t think Anthropic employees would go for that; if they’re producing incentivized mistakes, it’s from standard default organizational psychology, and not from a malevolent scheme of Anthropic management. This level of creditable nonmalevolence, however, should only be attributed to Anthropic employees. If OpenAI claims anything or issues any press releases, or if Anthropic management rather than Anthropic researchers says a thing in an interview, you should stare at that much harder and assume it to be a clever PR game rather than reflective of anything anyone actually believed.