To clarify the claim I’m making: I’m not trying to throw EA under the bus. This thread spun off from a discussion where I said I thought EA’s net impact on AI x-risk was probably positive, but I was highly uncertain.
Somebody asked what the bad components of EA’s impact were, and I went off on Anthropic, and on EA’s (and especially OpenPhil’s) entanglement with the company and their support for Anthropic’s operations. (To the extent that a lot of x-risk-adjacent EA seems to function, in practice, as a talent pipeline for Anthropic.)
I also said that I think OpenPhil’s bet on OpenAI was a disaster. And I said that there’s a culture of caginess, soft-pedaling, and trying-to-sound-reassuringly-mundane that I think has damaged AI risk discourse a fair amount, and that various people in and around OpenPhil have contributed to.
I’m restating this partly to be clear about what my exact claims are. E.g., I’m not claiming that items 1+2+3 are things OpenPhil and Anthropic leadership would happily endorse as stated. I deliberately phrased them in ways that highlight what I see as the flaws in these views and memes, in the hope that this could help wake up some people in and around OpenPhil+Anthropic to the road they’re walking.
This may have been the wrong conversational tack, but my vague sense is that there have been a lot of milder conversations about these topics over the years, and they don’t seem to have produced a serious reckoning, retrospective, or course change of the kind I would have expected.
I hoped it was obvious from the phrasing that 1-3 were attempting to embed the obvious critiques into the view summary, rather than attempting to phrase things in a way that would make the proponent go “Hell yeah, I love that view, what a great view it is!” If this confused anyone, I apologize for that.
I wasn’t centrally thinking of Holden’s public communication in the OP, though I think if he were consistently solid at this, Aysja Johnson wouldn’t have needed to write this in response to Holden’s defense of Anthropic ditching its core safety commitments.
“Dario said there’s a 25% chance ‘things go really, really badly’”
I feel like this is a case in point. Like, sure, counting up from 0 (“the average corporation building the average product doesn’t try to warn the public about their product, except in ways mandated by law!”), Anthropic’s doing great. Or if the baseline is “is Anthropic doing better than pathological liar Sam Altman?”, then sure, Anthropic is doing better than OpenAI on candor.
If we’re instead anchoring to “trying to build a product that massively endangers everyone in the world is an incredibly evil sort of thing to do by default, and to even begin to justify it you need to be doing a truly excellent job of raising the loudest possible alarm bells alongside dozens of other things”, then I don’t think Anthropic is coming close to clearing that bar.
“Things go really, really badly”? Nobody outside the x-risk ecosystem has any idea what that means. And this is not the kind of claim Anthropic or Dario has ever tried to spotlight. You won’t find a big urgent-looking banner on the front page of Anthropic loudly warning the public, in plain terms, about this technology, and asking them to write their congressman about it. You won’t even find it tucked away in a press release somewhere. Dario gave a number when explicitly asked, in an on-stage interview.
If we’re setting the bar at 0, then maybe we want to call this an amazing act of courage, when he could have ducked the question entirely. But why on earth would we set the bar at 0? Is the social embarrassment of talking about AI risk in 2025 so great that we should be amazed when Dario doesn’t totally dodge the topic, while running one of the main companies building the tech?
“Meanwhile, you seem to be treating all these people as basically equivalent to Gary Marcus.”
I think Dario has been more reasonable on this issue than Gary Marcus. I also don’t think “clearing Gary Marcus” is the criterion we should be using to judge the CEO of Anthropic.
“I think this ‘debate’ isn’t about OpenPhil or Anthropic failing to say they’re extremely worried”
Specifically, this debate (from my perspective) isn’t about whether Anthropic or others have ever said anything scary-sounding, if an x-risk person goes digging for cherry-picked quotes to signal-boost. The question is whether the average statement from Anthropic, weighted by how visible Anthropic tries to make that statement, is adequate for informing the uninformed about the insane situation we’re in.
Is the average statement from Dario or Anthropic communicating, “Holy shit, the technology we and our competitors are building has a high chance of killing us all or otherwise devastating the world, on a timescale of years, not decades. This is terrifying, and we urgently call on policymakers and researchers to help find a solution right now”? Or is it communicating, “Mythos is our most aligned model yet! ☺️ Powerful AI could have benefits, but it could have costs too. AI is a big deal, and it could have impacts and pose challenges! We are taking these very seriously! Also, unlike our competitors, Claude will always be ad-free! We’re a normal company talking about the importance of safety and responsibility in this transformative period. ☺️”
(Case in point: https://x.com/HumanHarlan/status/2031981447377273273)
If Anthropic’s messaging were awful, but Dario’s personal communications were reliably great, then I’d at least give partial credit. But Dario’s messaging is often even worse than that. Dario has been the AI CEO agitating the earliest and loudest for racing against China. He’s the one who’s been loudest about there being no point in trying to coordinate with China on this issue. “The Adolescence of Technology” opens with a tirade full of strawmen of what seems to be Yudkowsky/Soares’ position (https://x.com/robbensinger/status/2016607060591595924), and per Ryan Greenblatt, the essay sends a super misleading message about whether Anthropic “has things covered” on the technical alignment side (https://x.com/RyanPGreenblatt/status/2016553987861000238):
“Dario strongly implies that Anthropic ‘has this covered’ and wouldn’t be imposing a massively unreasonable amount of risk if Anthropic proceeded as the leading AI company with a small buffer to spend on building powerful AI more carefully. I do not think Anthropic has this covered[....] I think it’s unhealthy and bad for AI companies to give off a ‘we have this covered and will do a good job’ vibe if they actually believe that even if they were in the lead, risk would be very high. At the very least, I expect many employees at Anthropic working on alignment, safety, and security don’t believe Anthropic has the situation covered.”
I also strongly agree with Ryan re:
“I think it’s important to emphasize the severity of outcomes and I think people skimming the essay may not realize exactly what Dario thinks is at stake. A substantial possibility of the majority of humans being killed should be jarring.”
“I wish Dario more clearly distinguished between what he thinks a reasonable government should do given his understanding of the situation and what he thinks should happen given limited political will. I’d guess Dario thinks that very strong government action would be justified without further evidence of risk (but perhaps with evidence of capabilities) if there was high political will for action (reducing backlash risks).”
(And I claim that Anthropic leadership has been doing this for years; “The Adolescence of Technology” is not a one-off.)
In podcast interviews, Dario sometimes lets slip an unusually candid and striking statement about how insane and dangerous the situation is, without couching it in caveats about how Everything Is Uncertain and More Evidence Is Needed and It’s Premature For Governments To Do Much About This. Sometimes, he even says it in a way that non-insiders are likely to understand. But when he talks to lawmakers, he says things like:
“However, the abstract and distant nature of long-term risks makes them hard to approach from a policy perspective: our view is that it may be best to approach them indirectly by addressing more imminent risks that serve as practice for them.”
Never mind the merits of “the policy world should totally ignore superintelligence”. Even if you agree with that (IMO extreme and false) claim, there is no justifying calling these risks “long-term”, “abstract”, and “distant” when you have timelines a fraction as aggressive as Dario’s!!
See also Jack Clark’s communication on this issue, and my criticism at the time (https://x.com/robbensinger/status/1834325868032012296). This was in 2024. I don’t think it’s great for Dario to be systematically making the same incredibly misleading elisions two years after this pretty major issue was pointed out to his co-founder.
“It’s about OpenPhil in particular being pretty careful how they phrase things for public consumption. And I think any attempt to attack them for this should start with an acknowledgement that MIRI is directly responsible for all of our current problems”
I’m not criticizing Anthropic or OpenPhil for being “careful how they phrase things”. I’m criticizing them for being careful in exactly the wrong direction. Any communication they send out that sends a “we have things covered, this is business-as-usual, no need to worry” signal is potentially not just factually misleading, but destructive of society’s ability to orient to what’s happening and course-correct. Anthropic is the “Machines of Loving Grace” company; it’s exactly the company that has put way more effort, early and often, into communicating how powerful and cool this technology is, while being consistently nervous and hedged about alerting others to the hazards.
This is exactly the opposite of what “being careful how you phrase things” should look like. Anthropic should have internal processes for catching any tweet that risks implicitly sending a “this is business-as-normal” or “we have everything handled” message, to either filter those out or flag them for evaluation. Sending that kind of message is much more dangerous than any ordinary reputational risk a company faces.
Re ‘MIRI is saying strategy is bad, but if MIRI had been strategic then they might not have started the deep learning revolution’: I think that this just didn’t happen. Per the https://x.com/allTheYud/status/2042362484976468053 thread, I think this is just a myth that propagates because it’s funny. (And because Sam Altman is good at spreading narratives that help him out.)
I don’t think MIRI accelerated timelines on net, and if it did, I don’t think the effect was large. I’d also say that if this happened, it was in spite of one of MIRI’s top obsessions for the last 20+ years being “be ultra cautious around messaging that could shorten AI timelines”.
(Like, as someone who’s been at MIRI for 13 years, this is literally one of the top annoying things constraining everything I’ve written and all the major projects I’ve seen my colleagues work on. Not because we think we’re geniuses sitting on a trove of capabilities insights, but just because we take the responsibility of not-accidentally-contributing-to-the-race extraordinarily seriously.)
But whatever, sure. If you want to accuse MIRI of hypocrisy and say that we’re just as culpable as the AI labs, go for it. You can think MIRI is terrible in every way and also think that the Anthropic cluster is not handling AI risk in a remotely responsible way.
Set aside the years of Anthropic poisoning the commons with its public messaging, poisoning efforts at international coordination by being the top lab preemptively shitting on the possibility of US-China coordination, and poisoning the US government’s ability to orient to what’s happening by selling half-truths and absurd frames to Senate committees.
Even without looking at their broad public communications, and without critiquing what passes for a superintelligence alignment or deployment plan in Anthropic’s public communications, Anthropic has behaved absurdly irresponsibly, lying to the public about their RSP being a binding commitment, lying to their investors re ‘we’re not going to accelerate capabilities progress’, and specifically targeting the most dangerous and difficult-to-control AI capabilities (recursive self-improvement) in a way that may burn years off of the remaining timeline.
“What they haven’t said is ‘the situation is totally hopeless and every strategy except pausing has literally no chance of working’, but that isn’t a comms problem, that’s because they genuinely believe something different from you.”
Just to be clear: nowhere in this thread, or anywhere else, have I asked Anthropic to say something like that. Everything I’ve said above is compatible with thinking that Anthropic has a chance at solving superintelligence alignment. “I think I have a chance at solving superintelligence alignment!” is not an excuse for Anthropic or Dario’s behavior.
“Your claim that ‘governments are incredibly trigger-happy about banning things...there’s a long history of governments successfully coordinating to ban things dramatically less dangerous than superintelligent AI’ is too glib”
I agree it’s too glib as an argument for “international coordination to ban superintelligence is easy”. It isn’t easy. In the context of a conversation where most people are seriously underweighting the possibility, “governments have been known to ban scary or weird tech” and “governments have been known to enact policies that cost them money” are useful correctives, but they should be correctives pointing toward “this seems hard but maybe doable”, not “this seems easy”.
“But my impression is that the rest of the field is executing this portfolio plan admirably, but MIRI and a few other PauseAI people are trying to sabotage every other strategy in the portfolio in the hope of forcing people into theirs.”
How are we doing that, exactly?
Like, this is one of the most foregrounded claims in Dario’s essay. He repeats a bunch of easily-checked falsehoods about the MIRI argument, at the very start of the essay, while warning that this view’s skepticism about alignment tractability is a “self-fulfilling belief”. He then proceeds to shit on the possibility of the US coordinating with China to avoid building superintelligence, which seems like a much more classic example of “belief that could easily be self-fulfilling”.
What is the mechanism whereby Dario criticizing MIRI is “cooperating” (is it that he didn’t mention us by name, preventing people from fact-checking any of his claims?), and MIRI staff criticizing Dario is “defecting”? What, specifically, is the wrench I’m throwing in Anthropic’s plans by tweeting about this? Is a key researcher on Chris Olah’s team going to get depressed and stop doing interpretability research unless I contribute to the “Anthropic is the Good Guys and OpenAI is the Bad Guys” narrative? Is Anthropic at risk of losing its lead in the race if MIRI people are open about their view that all the labs are behaving atrociously? Should I have dropped in a claim that everyone who disagrees with me is “quasi-religious”, the same way Dario’s cooperative essay begins?
If you think I’m factually mistaken, as you said at the start of your reply, then that makes sense. But surely that would be an equally valid criticism whether I were saying pro-Anthropic stuff or anti-Anthropic stuff. Why this separate “MIRI is defecting” idea?
“I worry that any support or oxygen you guys get will be spent knifing other safety advocates, while Sam Altman happily builds AGI regardless.”
Yeah. And when MIRI voiced early skepticism of OpenAI in private conversation, we were told that it was crucial to support Sam and Elon’s effort because Demis was untrustworthy. Counting up from zero, OpenAI could be framed as amazing progress: a nonprofit! Run by people vocally alarmed about x-risk! And they’re struggling for cash in the near term (in spite of verbal promises of funding from Musk), which gives us an opportunity to buy seats on the board!
Anthropic may or may not be slightly better than OpenAI. OpenAI may or may not be slightly better than DeepMind. I don’t think the lesson of history is that OpenPhil-cluster people are good at telling the difference between “this is marginally better than what the other guys are doing” and “this is good enough to actually succeed”.
But nothing I’ve said above depends on that claim. You can disagree with me about how likely Anthropic is to save the world, and still think there’s an egregious candor gap between the average Anthropic public statement and the scariest paragraphs buried in “The Adolescence of Technology”, and a further egregious candor gap between “The Adolescence of Technology” and e.g. Ryan Greenblatt’s post or https://x.com/MaskedTorah/status/2040270860846768203.
I don’t think the “circle-the-wagons” approach has served EA well throughout its history, and I don’t think people self-censoring to that degree is good for governments’ or labs’ ability to orient to reality.
Some helpful points, thanks. I responded in more depth on Twitter, but I don’t want to duplicate every conversation there here, so I’m just signposting that people should check the thread there for most of my opinions.
I wrote a reply to Scott on Twitter, before seeing the discussion here; I think it’s a lot clearer than my original (IMO sloppy) tweet.
I’ve copied that reply above; see also my reply to Buck.