Thanks for writing this, Buck. I’m not going to try to reply to your whole post, because I think some of it is stuff I should chew on for longer and see whether I agree with it. But going through some of your points:
I definitely apologize for making it sound like I was making a harsher criticism of (the relevant parts of) EA than I intended. My tweet was originally written as a quick follow-up comment to someone who asked why I thought EA’s impact on AI x-risk was only ~55% likely to be positive. I turned it into a top-level tweet because I didn’t want to hide it deep in an existing discussion, but this was an error given I didn’t add extra context.
I also apologize for anything I said that made it sound like I was universally criticizing past or present Open Phil / cG staff (or centrally basing my views on first-hand conversations, for that matter). I already believed that tons of past and present rank-and-file OP/cG staff have very reasonable views, and I happily further update in that direction based on your and Oliver’s statements to that effect (e.g., Ollie’s “I have since updated that more people who are a level below Alexander, Dustin and Dario have more reasonable beliefs”).
I agree that my characterization of “Dario and a cluster of Open-Phil-ish people” was phrased in a needlessly confusing and sloppy way. I wanted to talk about a mix of ‘present-day views that seem to be endorsed by Dario and some other key figures’ and ‘general tendencies and memes that seem pretty widespread and that seem suspiciously related to choices EA leadership made many years ago’, but blurring these together is unnecessarily confusing. Also, it didn’t help that I was sarcastically embedding my criticisms into my summaries of the views.

Insofar as my broad criticism of EA cultural trends/memes is correct (which I think it substantially is), I still feel a fair bit of uncertainty about how to divvy up responsibility between more Open-Phil-ish people, more Oxford-ish people, MIRI / the rats, etc. And of course, some of the problem may stem from broader social or demographic factors that no EA leaders tried to engineer, and that even run counter to how leadership has tried to optimize. (I too remember the early speeches themed around “Keep EA Weird”, the early EA-leader conversations fretting about overly naive EA consequentialism, etc.)
When Freeman Dyson originally said “Dyson sphere”, I believe he had a Dyson swarm in mind, so it strikes me as oddly unfair to Dyson to treat “spheres” and “swarms” as disjoint. But “swarm” might be better language, just to avoid the misconception that a Dyson sphere is supposed to be a single solid structure.
Quoting from a follow-up conversation I had with Buck after this exchange:
__________________________________________________________
Buck: So following up on your Will post: It sounds like you genuinely didn’t understand that Will is worried about AI takeover risk and thinks we should try to avert it, including by regulation. Is that right?
I’m just so confused here. I thought your description of his views was a ridiculous straw man, and at first I thought you were just being some combination of dishonest and rhetorically sloppy, but now my guess is you’re genuinely confused about what he thinks?
(Happy to call briefly if that would be easier. I’m interested in talking about this a bit because I was shocked by your post and want to prevent similar things happening in the future if it’s easy to do so.)
Rob: I was mostly just going off Will’s mini-review; I saw that he briefly mentioned “governance agendas” but otherwise everything he said seemed to me to fit ‘has some worries that AI could go poorly, but isn’t too worried, and sees the current status quo as basically good—alignment is going great, the front-running labs are sensible, capabilities and alignment will by default advance in a way that lets us ratchet the two up safely without needing to do anything special or novel’
so I assumed if he was worried, it was mainly about things that might disrupt that status quo
Buck: what about his line “I think the risk of misaligned AI takeover is enormously important.”
alignment is going great, the front-running labs are sensible
This is not my understanding of what Will thinks.
[added by Buck later: And also I don’t think it’s an accurate reading of the text.]
Rob: 🙏
that’s helpful to know!
Buck: I am not confident I know exactly what Will thinks here. But my understanding is that his position is something like: The situation is pretty scary (hence him saying “I think the risk of misaligned AI takeover is enormously important.”). There is maybe a 5% overall chance of AI takeover, which is a bad and overly large number. The AI companies are reckless and incompetent with respect to these risks, compared to what you’d hope given the stakes. Rushing through superintelligence would be extremely dangerous, for AI takeover and other reasons.
[added/edited by Buck later: I interpret the review as saying:
He thinks the probability of AI takeover and of human extinction due to AI takeover is substantially lower than you do.
This is not because he thinks “AI companies/humanity are very thoughtful about mitigating risk from misaligned superintelligence, and they are clearly on track to develop techniques that will give developers justified confidence that AIs powerful enough that their misalignment poses risk of AI takeover are aligned”. It’s because he is more optimistic about what will happen if AI companies and humanity are not very thoughtful and competent.
He thinks that the arguments given in the book have important weaknesses.
He disagrees with the strategic implications of the worldview described in the book.
For context, I am less optimistic than he is, but I directionally agree with him on both points.]
In general, MIRI people often misunderstand someone saying, “I think X will probably be fine because of consideration Y” to mean “I think that plan Y guarantees that X will be fine”. And often, Y is not a plan at all, it’s just some purported feature of the world.
Another case is people saying “I think that argument A for why X will go badly fails to engage with counterargument Y”, which MIRI people round off to “X is guaranteed to go fine because of my plan Y”
Rob: my current guess is that my error is downstream of (a) not having enough context from talking to Will or seeing enough other AI Will-writing, and (b) Will playing down some of his worries in the review
I think I was overconfident in my main guess, but I don’t think it would have been easy for me to have Reality as my main guess instead
Buck: When I asked the AIs, they thought that your summary of Will’s review was inaccurate and unfair, based just on his review.
It might be helpful to try checking this way in the future.
I’m still interested in how you interpreted his line “I think the risk of misaligned AI takeover is enormously important.”
Rob: I think that line didn’t stick out to me at all / it seemed open to different interpretations, and mainly trying to tell the reader ‘mentally associate me with some team other than the Full Takeover Skeptics (eg I’m not LeCun), to give extra force to my claim that the book’s not good’.
like, I still associate Will to some degree with the past version of himself who was mostly unconcerned about near-term catastrophes and thought EA’s mission should be to slowly nudge long-term social trends. “enormously important” from my perspective might have been a polite way of saying ‘it’s 1 / 10,000 likely to happen, but that’s still one of the most serious risks we face as a society’
it sounds like Will’s views have changed a lot, but insofar as I was anchored to ‘this is someone who is known to have oddly optimistic views and everything-will-be-pretty-OK views about the world’ it was harder for me to see what it sounds like you saw in the mini-review
(I say this mainly as autobiography since you seemed interested in debugging how this happened; not as ‘therefore I was justified/right’)
Buck: Ok that makes sense
Man, how bizarre
Claude had basically the same impression of your summary as I did
Which makes me feel like this isn’t just me having more context as a result of knowing Will and talking to him about this stuff.
Rob: I mean, I still expect most people who read Will’s review to directionally update the way I did—I don’t expect them to infer things like
“The situation is pretty scary.”
“The AI companies are reckless and incompetent wrt these risks.”
“Rushing through superintelligence would be extremely dangerous, for AI takeover and other reasons.”
or ‘a lot of MIRI-ish proposals like compute governance are a great idea’ (if he thinks that)
or ‘if the political tractability looked 10-20x better then it would likely be worth seriously looking into a global shutdown immediately’ (if he thinks something like that??)
I think it was reasonable for me to be confused about what he thinks on those fronts and to press him on it, since I expect his review to directionally make people waaaaaaay more misinformed and confused about the state of the world
and I think some of his statements don’t make sense / have big unresolved tensions, and a lot of his arguments were bad and misinformed. (not that him strawmanning MIRI a dozen different ways excuses me misrepresenting his view; but I still find it funny how uninterested people apparently are in the ‘strawmanning MIRI’ side of things? maybe they see no need to back me up on the places where my post was correct, because they assume the Light of Truth will shine through and persuade people in those cases, so the only important intervention is to correct errors in the post?)
but I should have drawn out those tensions by posing a bunch of dilemmas and saying stuff like ‘seems like if you believe W, then bad consequence X; and if you believe Y, then bad consequence Z. which horn of the dilemma do you choose, so I know what to argue against?’, rather than setting up a best-guess interpretation of what Will was saying (even one with a bunch of ‘this is my best guess’ caveats)
I think Will was being unvirtuously cagey or spin-y about his views, and this doesn’t absolve me of responsibility for trying to read the tea leaves and figure out what he actually thinks about ‘should government ever slow down or halt the race to ASI?’, but it would have been a very easy misinterpretation for him to prevent (if his views are as you suggest)
it sounds like he mostly agrees about the parts of MIRI’s view that we care the most about, eg ‘would a slowdown/halt be good in principle’, ‘is the situation crazy’, ‘are the labs wildly irresponsible’, ‘might we actually want a slowdown/halt at some point’, ‘should govs wake up to this and get very involved’, ‘is a serious part of the risk rogue AI and not just misuse’, ‘should we do extensive compute monitoring’, etc.
it’s not 100% of what we’re pushing but it’s overwhelmingly more important to us than whether the risk is more like 20-50% or more like ‘oh no’
I think most readers wouldn’t come away from Will’s review thinking we agree on any of those points, much less all of them
Buck:
I expect his review to directionally make people waaaaaaay more misinformed and confused about the state of the world
I disagree
and I think some of his statements don’t make sense / have big unresolved tensions, and a lot of his arguments were bad and misinformed.
I think some of his arguments are dubious, but I don’t overall agree with you.
I think Will was being unvirtuously cagey or spin-y about his views, and this doesn’t absolve me of responsibility for trying to read the tea leaves and figure out what he actually thinks about ‘should government ever slow down or halt the race to ASI?’, but it would have been a very easy misinterpretation for him to prevent (if his views are as you suggest)
I disagree for what it’s worth.
it sounds like he mostly agrees about the parts of MIRI’s view that we care the most about, eg ‘would a slowdown/halt be good in principle’, ‘is the situation crazy’, ‘are the labs wildly irresponsible’, ‘might we actually want a slowdown/halt at some point’, ‘should govs wake up to this and get very involved’, ‘is a serious part of the risk rogue AI and not just misuse’, ‘should we do extensive compute monitoring’, etc.
it’s not 100% of what we’re pushing but it’s overwhelmingly more important to us than whether the risk is more like 20-50% or more like ‘oh no’
I think that the book made the choice to center a claim that people like Will and me disagree with: specifically, “With the current trends in AI progress, building superintelligence is overwhelmingly likely to lead to misaligned AIs that kill everyone.”
It’s true that much weaker claims (e.g. all the stuff you have in quotes in your message here) are the main decision-relevant points. But the book chooses to not emphasize them and instead emphasize a much stronger claim that in my opinion and Will’s opinion it fails to justify.
I think it’s reasonable for Will to substantially respond to the claim that you emphasize, rather than different claims that you could have chosen to emphasize.
I think a general issue here is that MIRI people seem to me to be responding at a higher simulacrum level than the one at which criticisms of the book are operating. Here you did that partly because you interpreted Will as himself operating at a higher simulacrum level than the plain reading of the text.
I think it’s a difficult situation when someone makes criticisms that, on the surface level, look like straightforward object level criticisms, but that you suspect are motivated by a desire to signal disagreement. I think it is good to default to responding just on the object level most of the time, but I agree there are costs to that strategy.
And if you want to talk about the higher simulacra levels, I think it’s often best to do so very carefully and in a centralized place, rather than in a response to a particular person.
I also agree with Habryka’s comment that Will chose a poor phrasing of his position on regulation.
Rob: If we agree about most of the decision-relevant claims (and we agree about which claims are decision-relevant), then I think it’s 100% reasonable for you and Will to critique less-decision-relevant claims that Eliezer and Nate foregrounded; and I also think it would be smart to emphasize those decision-relevant claims a lot more, so that the world is likely to make better decisions. (And so people’s models are better in general; I think the claims I mentioned are very important for understanding the world too, not just action-relevant.)
I especially think this is a good idea for reviews sent to a hundred thousand people on Twitter. I want a fair bit more of this on LessWrong too, though I can see an argument for different norms on LW; LW is also a place where misunderstandings are less likely, because a lot more people here have context.
Re simulacra levels: I agree that those are good heuristics. For what it’s worth, I still have a much easier time mentally generating a review like Will’s when I imagine the author as someone who disagrees with that long list of claims; I have a harder time understanding how none of those points of agreement came up in the ensuing paragraphs if Will tacitly agreed with me about most of the things I care about.
Possibly it’s just a personality or culture difference; if I wrote “This is a shame, because I think the risk of misaligned AI takeover is enormously important” (especially in the larger context of the post it occurred in) I might not mean something all that strong (a lot of things in life can be called “enormously important” from one perspective or another); but maybe that’s the Oxford-philosopher way of saying something closer to “This situation is insane, we’re playing Russian roulette with the world, this is an almost unprecedented emergency.”
(Flagging that this is all still speculative because Will hasn’t personally confirmed what his views are someplace I can see it. I’ve been mostly deferring to you, Oliver, etc. about what kinds of positions Will is likely to endorse, but my actual view is a bit more uncertain than it may sound above.)
(I also would have felt dramatically more positive about Will’s review if he’d kept everything else unchanged but just added the sentence “I definitely think it will be extremely valuable to have the option to slow down AI development in the future.” anywhere in his review. XP If he agrees with that sentence, anyway!)
I definitely think it will be extremely valuable to have the option to slow down AI development in the future.
What are the mechanisms you find promising for causing this to occur? If we all agree on “it will be extremely valuable to have the option to slow down AI development in the future”, then I feel silly for arguing about other things; it seems like the first priority should be to talk about ways to achieve that shared goal, whatever else we disagree about.
(Unless there’s a fast/easy way to resolve those disagreements, of course.)
banning anyone from having more than 8 GPUs
I assume you know this, but I’ll say out loud that this is a straw man, since I expect this to be a common misunderstanding. The book suggests “[more than] eight of the most advanced GPUs from 2024” as a possible threshold at which international monitoring efforts come online and the world starts caring whether you’re using those GPUs to push the world closer to superintelligence (insofar as that’s possible).
“More than 8 GPUs” is also potentially confusing because people are likely to anchor to consumer hardware. From the book’s online appendices:
The most advanced AI chips are also quite specialized, so tracking and monitoring them would have few spillover effects. NVIDIA’s H100 chip, one of the most common AI chips as of mid-2025, costs around $30,000 per chip and is designed to be run in a datacenter due to its cooling and power requirements. These chips are optimized for doing the numerical operations involved in training and running AIs, and they’re typically tens to thousands of times more performant at AI workloads than standard computers (consumer CPUs).
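To put rough numbers on that “tens to thousands of times” figure, here’s a quick back-of-the-envelope sketch. The FLOP/s values are ballpark assumptions I’m plugging in for illustration; they aren’t numbers from the book:

```python
# Back-of-the-envelope check on the "tens to thousands of times" claim.
# Both throughput numbers below are rough ballpark assumptions, not
# authoritative specs.

h100_flops = 1e15  # assumed: ~1,000 TFLOP/s dense FP16 for an H100 (order of magnitude)
cpu_flops = 5e11   # assumed: ~0.5 TFLOP/s for a typical consumer CPU

ratio = h100_flops / cpu_flops
print(f"H100 vs. consumer CPU at AI workloads: roughly {ratio:,.0f}x")
# -> roughly 2,000x, which lands comfortably inside "tens to thousands of times"
```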
I wasn’t exclusively looking at that line; I was also assuming that if Will liked some of the book’s core policy proposals but disliked others, then he probably wouldn’t have expressed such a strong blanket rejection. And I was looking at Will’s proposal here:
[IABIED skips over] what I see as the crucial period, where we move from the human-ish range to strong superintelligence[1]. This is crucial because it’s both the period where we can harness potentially vast quantities of AI labour to help us with the alignment of the next generation of models, and because it’s the point at which we’ll get a much better insight into what the first superintelligent systems will be like. The right picture to have is not “can humans align strong superintelligence”, it’s “can humans align or control AGI-”, then “can {humans and AGI-} align or control AGI” then “can {humans and AGI- and AGI} align AGI+” and so on.
This certainly sounds like a proposal that we advance AI as fast as possible, so that we can reach the point where productive alignment research is possible sooner.
The next paragraph then talks about “a gradual ramp-up to superintelligence”, which makes it sound like Will at least wants us to race to the level of superintelligence as quickly as possible, i.e., he wants the chain of humans-and-AIs-aligning-stronger-AIs to go at least that far:
Elsewhere, EY argues that the discontinuity question doesn’t matter, because preventing AI takeover is still a ‘first try or die’ dynamic, so having a gradual ramp-up to superintelligence is of little or no value. I think that’s misguided.
… Unless he thinks this “gradual ramp-up” should be achieved via switching over at some point from the natural continuous trendlines he expects from industry, to top-down government-mandated ratcheting up of a capability limit? But I’d be surprised if that’s what he had in mind, given the rest of his comment.
Wanting the world to race to build superintelligence as soon as possible also seems like it would be a not-that-surprising implication of his labs-have-alignment-in-the-bag claims.
And although it’s not totally clear to me how seriously he’s taking this hypothetical (versus whether he mainly intends it as a proof of concept), he does propose that we could build a superintelligent paperclip maximizer and plausibly be totally fine (because it’s risk averse and promise-keeping), and his response to “Maybe we won’t be able to make deals with AIs?” is:
I agree that’s a worry; but then the right response is to make sure that we can.
Not “in that case maybe we shouldn’t build a misaligned superintelligence”, but “well then we’d sure better solve the honesty problem!”.
All of this together makes me extremely confused if his real view is basically just “I agree with most of MIRI’s policy proposals but I think we shouldn’t rush to enact a halt or slowdown tomorrow”.
If his view is closer to that, then that’s great news from my perspective, and I apologize for the misunderstanding. I was expecting Will to just straightforwardly accept the premises I listed, and for the discussion to proceed from there.
I’ll add a link to your comment at the top of the post so folks can see your response, and if Will clarifies his view I’ll link to that as well.
Twitter says that Will’s tweet has had over a hundred thousand views, so if he’s a lot more pro-compute-governance, pro-slowdown, and/or pro-halt than he sounded in that message, I hope he says loud stuff in the near future to clarify his views to folks!
yeah, I left off this part but Nate also said
[people having trouble separating them] does maybe enhance my sense that the whole community is desperately lacking in nate!courage, if so many people have such trouble distinguishing between “try naming your real worry” and “try being brazen/rude”. (tho ofc part of the phenomenon is me being bad at anticipating reader confusions; the illusion of transparency continues to be a doozy.)
Nate messaged me a thing in chat and I found it helpful and asked if I could copy it over:
fwiw a thing that people seem to me to be consistently missing is the distinction between what i was trying to talk about, namely the advice “have you tried saying what you actually think is the important problem, plainly, even once? ideally without broadcasting signals of how it’s a socially shameful belief to hold?”, and the alternative advice that i was not advocating, namely “have you considered speaking to people in a way that might be described as ‘brazen’ or ‘rude’ depending on who’s doing the describing?”.
for instance, in personal conversation, i’m pretty happy to directly contradict others’ views—and that has nothing to do with this ‘courage’ thing i’m trying to describe. nate!courage is completely compatible with saying “you don’t have to agree with me, mr. senator, but my best understanding of the evidence is [thing i believe]. if ever you’re interested in discussing the reasons in detail, i’d be happy to. and until then, we can work together in areas where our interests overlap.” there are plenty of ways to name your real worry while being especially respectful and polite! nate!courage and politeness are nearly orthogonal axes, on my view.
FWIW, as someone who’s been working pretty closely with Nate for the past ten years (and as someone whose preferred conversational dynamic is pretty warm-and-squishy), I actively enjoy working with the guy and feel positive about our interactions.
(Considering how little cherry-picking they did.)
From my perspective, FWIW, the endorsements we got would have been surprising even if they had been maximally cherry-picked. You usually just can’t find cherries like those.
(That was indeed my first thought when Bernanke said he liked the book; no dice, though.)
Yep. And equally, the blurbs would be a lot less effective if the title were more timid and less stark.
Hearing that a wide range of respected figures endorse a book called If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All is a potential “holy shit” moment. If the same figures were endorsing a book with a vaguely inoffensive title like Smarter Than Us or The AI Crucible, it would spark a lot less interest (and concern).
Yeah, I think people usually ignore blurbs, but sometimes blurbs are helpful. I think strong blurbs are unusually likely to be helpful when your book has a title like If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All.
Aside from the usual suspects (people like Tegmark), we mostly sent the book to people following the heuristic “would an endorsement from this person be helpful?”, much more so than “do we know that this person would like the book?”. If you’d asked me individually about Church, Schneier, Bernanke, Shanahan, or Spaulding in advance, I’d have put most of my probability on “this person won’t be persuaded by the book (if they read it at all) and will come away strongly disagreeing and not wanting to endorse”. They seemed worth sharing the book with anyway, and then they ended up liking it (at least enough to blurb it) and some very excited MIRI slack messages ensued.
(I’d have expected Eddy to agree with the book, though I wouldn’t have expected him to give a blurb; and I didn’t know Wolfsthal well enough to have an opinion.)
Nate has a blog post coming out in the next few days that will say a bit more about “How filtered is this evidence?” (along with other topics), but my short answer is that we haven’t sent the book to that many people, we’ve mostly sent it to people whose AI opinions we didn’t know much about (and who we’d guess on priors would be skeptical to some degree), and we haven’t gotten many negative reactions at all. (Though we’ve gotten people who just didn’t answer our inquiries, and some of those might have read the book and disliked it enough to not reply.)
Now, how much is that evidence about the correctness of the book? Extremely little!
It might not be much evidence for LWers, who are already steeped in arguments and evidence about AI risk. It should be a lot of evidence for people newer to this topic who start with a skeptical prior. Most books making extreme-sounding (conditional) claims about the future don’t have endorsements from Nobel-winning economists, former White House officials, retired generals, computer security experts, etc. on the back cover.
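As a toy illustration of why the same blurbs can be weak evidence for one audience and strong evidence for another, here’s a minimal Bayes calculation. The prior and likelihood-ratio numbers are invented purely for the example:

```python
# Toy Bayesian update: the same evidence (credible blurbs) produces very
# different updates depending on the reader's starting credence.
# All numbers here are invented for illustration.

def posterior(prior: float, likelihood_ratio: float) -> float:
    """Posterior probability after evidence with the given likelihood ratio."""
    odds = prior / (1 - prior)
    post_odds = odds * likelihood_ratio
    return post_odds / (1 + post_odds)

LR = 5.0  # assumed: these blurbs are 5x likelier if the book's thesis is credible

for prior in (0.01, 0.30, 0.80):
    print(f"prior {prior:.0%} -> posterior {posterior(prior, LR):.0%}")
# prior 1%  -> ~5%   (a fivefold jump for the skeptical newcomer)
# prior 30% -> ~68%
# prior 80% -> ~95%  (little left to learn for someone already mostly persuaded)
```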
We’re still working out some details on the preorder events; we’ll have an announcement with more info on LessWrong, the MIRI Newsletter, and our Twitter in the next few weeks.
You don’t have to do anything special to get invited to preorder-only events. :) In the case of Nate’s LessOnline Q&A, it was a relatively small in-person event for LessOnline attendees who had preordered the book; the main events we have planned for the future will be larger and online, so more people can participate without needing to be in the Bay Area.
(Though we’re considering hosting one or more in-person events at some point in the future; if so, those would be advertised more widely as well.)
“Inventor” is correct!
Hopefully a German pre-order from a local bookstore will make a difference.
Yep, this counts! :)
I wrote a reply to Scott on Twitter, before seeing the discussion here; I think it’s a lot clearer than my original (IMO sloppy) tweet.
I’ve copied the reply below; see also my reply to Buck.
_____________________________________________________
To clarify the claim I’m making: I’m not trying to throw EA under a bus. This thread spun off from a discussion where I said I thought EA’s net impact on AI x-risk was probably positive, but I was highly uncertain.
Somebody asked what the bad components of EA’s impact were, and I went off on Anthropic, and on EA’s (and especially OpenPhil’s) entanglement with the company and their support for Anthropic’s operations. (To the extent that a lot of x-risk-adjacent EA seems to function, in practice, as a talent pipeline for Anthropic.)
I also said that I think OpenPhil’s bet on OpenAI was a disaster. And I said that there’s a culture of caginess, soft-pedaling, and trying-to-sound-reassuringly-mundane that I think has damaged AI risk discourse a fair amount, and that various people in and around OpenPhil have contributed to.
I’m restating this partly to be clear about what my exact claims are. E.g., I’m not claiming that items 1+2+3 are things OpenPhil and Anthropic leadership would happily endorse as stated. I deliberately phrased them in ways that highlight what I see as the flaws in these views and memes, in the hope that this could help wake up some people in and around OpenPhil+Anthropic to the road they’re walking.
This may have been the wrong conversational tack, but my vague sense is that there have been a lot of milder conversations about these topics over the years, and they don’t seem to have produced a serious reckoning, retrospective, or course change of the kind I would have expected.
I hoped it was obvious from the phrasing that 1-3 were attempting to embed the obvious critiques into the view summary, rather than attempting to phrase things in a way that would make the proponent go “Hell yeah, I love that view, what a great view it is!” If this confused anyone, I apologize for that.
I wasn’t centrally thinking of Holden’s public communication in the OP, though I think if he were consistently solid at this, Aysja Johnson wouldn’t have needed to write this in response to Holden’s defense of Anthropic ditching its core safety commitments.
I feel like this is a case in point. Like, sure, counting up from 0 (“the average corporation building the average product doesn’t try to warn the public about their product, except in ways mandated by law!”), Anthropic’s doing great. Or if the baseline is “is Anthropic doing better than pathological liar Sam Altman?”, then sure, Anthropic is doing better than OpenAI on candor.
If we’re instead anchoring to “trying to build a product that massively endangers everyone in the world is an incredibly evil sort of thing to do by default, and to even begin to justify it you need to be doing a truly excellent job of raising the loudest possible alarm bells alongside dozens of other things”, then I don’t think Anthropic is coming close to clearing that bar.
“Things go really, really badly”? Nobody outside the x-risk ecosystem has any idea what that means. And this is not the kind of claim Anthropic or Dario has ever tried to spotlight. You won’t find a big urgent-looking banner on the front page of Anthropic loudly warning the public, in plain terms, about this technology, and asking them to write their congressman about it. You won’t even find it tucked away in a press release somewhere. Dario gave a number when explicitly asked, in an on-stage interview.
If we’re setting the bar at 0, then maybe we want to call this an amazing act of courage, when he could have ducked the question entirely. But why on earth would we set the bar at 0? Is the social embarrassment of talking about AI risk in 2025 so great that we should be amazed when Dario doesn’t totally dodge the topic, while running one of the main companies building the tech?
I think Dario has been more reasonable on this issue than Gary Marcus. I also don’t think “clearing Gary Marcus” is the criterion we should be using to judge the CEO of Anthropic.
Specifically, this debate (from my perspective) isn’t about whether Anthropic or others have ever said anything scary-sounding, if an x-risk person goes digging for cherry-picked quotes to signal-boost. The question is whether the average statement from Anthropic, weighted by how visible Anthropic tries to make that statement, is adequate for informing the uninformed about the insane situation we’re in.
Is the average statement from Dario or Anthropic communicating, “Holy shit, the technology we and our competitors are building has a high chance of killing us all or otherwise devastating the world, on a timescale of years, not decades. This is terrifying, and we urgently call on policymakers and researchers to help find a solution right now”? Or is it communicating, “Mythos is our most aligned model yet! ☺️ Powerful AI could have benefits, but it could have costs too. AI is a big deal, and it could have impacts and pose challenges! We are taking these very seriously! Also, unlike our competitors, Claude will always be ad-free! We’re a normal company talking about the importance of safety and responsibility in this transformative period. ☺️”
(Case in point: https://x.com/HumanHarlan/status/2031981447377273273)
If Anthropic’s messaging were awful, but Dario’s personal communications were reliably great, then I’d at least give partial credit. But Dario’s messaging is often even worse than that. Dario has been the AI CEO agitating the earliest and loudest for racing against China. He’s the one who’s been loudest about there being no point in trying to coordinate with China on this issue. “The Adolescence of Technology” opens with a tirade full of strawmen of what seems to be Yudkowsky/Soares’ position (https://x.com/robbensinger/status/2016607060591595924), and per Ryan Greenblatt, the essay sends a super misleading message about whether Anthropic “has things covered” on the technical alignment side (https://x.com/RyanPGreenblatt/status/2016553987861000238):
I also strongly agree with Ryan re:
“I think it’s important to emphasize the severity of outcomes and I think people skimming the essay may not realize exactly what Dario thinks is at stake. A substantial possibility of the majority of humans being killed should be jarring.”
“I wish Dario more clearly distinguished between what he thinks a reasonable government should do given his understanding of the situation and what he thinks should happen given limited political will. I’d guess Dario thinks that very strong government action would be justified without further evidence of risk (but perhaps with evidence of capabilities) if there was high political will for action (reducing backlash risks).”
(And I claim that Anthropic leadership has been doing this for years; “The Adolescence of Technology” is not a one-off.)
On podcast interviews, Dario sometimes lets slip an unusually candid and striking statement about how insane and dangerous the situation is, without couching it in caveats about how Everything Is Uncertain and More Evidence Is Needed and It’s Premature For Governments To Do Much About This. Sometimes, he even says it in a way that non-insiders are likely to understand. But when he talks to lawmakers, he says things like:
Never mind the merits of “the policy world should totally ignore superintelligence”. Even if you agree with that (IMO extreme and false) claim, there is no justifying calling these risks “long-term”, “abstract”, and “distant” when you have timelines a fraction as aggressive as Dario’s!!
See also Jack Clark’s communication on this issue, and my criticism at the time (https://x.com/robbensinger/status/1834325868032012296). This was in 2024. I don’t think it’s great for Dario to be systematically making the same incredibly misleading elisions two years after this pretty major issue was pointed out to his co-founder.
I’m not criticizing Anthropic or Open Phil for being “careful how they phrase things”. I’m criticizing them for being careful in exactly the wrong direction. Any communication they send out that sends a “we have things covered, this is business-as-usual, no need to worry” signal is potentially not just factually misleading, but destructive of society’s ability to orient to what’s happening and course-correct. Anthropic is the “Machines of Loving Grace” company; it’s exactly the company that has put way more effort, early and often, into communicating how powerful and cool this technology is, while being consistently nervous and hedged about alerting others to the hazards.
This is exactly the opposite of what “being careful how you phrase things” should look like. Anthropic should have internal processes for catching any tweet that risks implicitly sending a “this is business-as-normal” or “we have everything handled” message, to either filter those out or flag them for evaluation. Sending that kind of message is much more dangerous than any ordinary reputational risk a company faces.
Re ‘MIRI is saying strategy is bad, but if MIRI had been strategic then they might not have started the deep learning revolution’: I think that this just didn’t happen. Per the https://x.com/allTheYud/status/2042362484976468053 thread, I think this is just a myth that propagates because it’s funny. (And because Sam Altman is good at spreading narratives that help him out.)
I don’t think MIRI accelerated timelines on net, and if it did, I don’t think the effect was large. I’d also say that if this happened, it was in spite of one of MIRI’s top obsessions for the last 20+ years being “be ultra cautious around messaging that could shorten AI timelines”.
(Like, as someone who’s been at MIRI for 13 years, this is literally one of the top annoying things constraining everything I’ve written and all the major projects I’ve seen my colleagues work on. Not because we think we’re geniuses sitting on a trove of capabilities insights, but just because we take the responsibility of not-accidentally-contributing-to-the-race extraordinarily seriously.)
But whatever, sure. If you want to accuse MIRI of hypocrisy and say that we’re just as culpable as the AI labs, go for it. You can think MIRI is terrible in every way and also think that the Anthropic cluster is not handling AI risk in a remotely responsible way.
Set aside the years of Anthropic poisoning the commons with its public messaging, poisoning efforts at international coordination by being the top lab preemptively shitting on the possibility of US-China coordination, and poisoning the US government’s ability to orient to what’s happening by selling half-truths and absurd frames to Senate committees.
Even without looking at their broad public communications, and without critiquing what passes for a superintelligence alignment or deployment plan in Anthropic’s public communications, Anthropic has behaved absurdly irresponsibly, lying to the public about their RSP being a binding commitment, lying to their investors re ‘we’re not going to accelerate capabilities progress’, and specifically targeting the most dangerous and difficult-to-control AI capabilities (recursive self-improvement) in a way that may burn years off of the remaining timeline.
Just to be clear: nowhere in this thread, or anywhere else, have I asked Anthropic to say something like that. Everything I’ve said above is compatible with thinking that Anthropic has a chance at solving superintelligence alignment. “I think I have a chance at solving superintelligence alignment!” is not an excuse for Anthropic or Dario’s behavior.
I agree it’s too glib as an argument for “international coordination to ban superintelligence is easy”. It isn’t easy. In the context of a conversation where most people are seriously underweighting the possibility, “governments have been known to ban scary or weird tech” and “governments have been known to enact policies that cost them money” are useful correctives, but they should be correctives pointing toward “this seems hard but maybe doable”, not “this seems easy”.
How are we doing that, exactly?
Like, this is one of the most foregrounded claims in Dario’s essay. He repeats a bunch of easily-checked falsehoods about the MIRI argument, at the very start of the essay, while warning that this view’s skepticism about alignment tractability is a “self-fulfilling belief”. He then proceeds to shit on the possibility of the US coordinating with China to avoid building superintelligence, which seems like a much more classic example of “belief that could easily be self-fulfilling”.
What is the mechanism whereby Dario criticizing MIRI is “cooperating” (is it that he didn’t mention us by name, preventing people from fact-checking any of his claims?), and MIRI staff criticizing Dario is “defecting”? What, specifically, is the wrench I’m throwing in Anthropic’s plans by tweeting about this? Is a key researcher on Chris Olah’s team going to get depressed and stop doing interpretability research unless I contribute to the “Anthropic is the Good Guys and OpenAI is the Bad Guys” narrative? Is Anthropic at risk of losing its lead in the race if MIRI people are open about their view that all the labs are behaving atrociously? Should I have dropped in a claim that everyone who disagrees with me is “quasi-religious”, the same way Dario’s cooperative essay begins?
If you think I’m factually mistaken, as you said at the start of your reply, then that makes sense. But surely that would be an equally valid criticism whether I were saying pro-Anthropic stuff or anti-Anthropic stuff. Why this separate “MIRI is defecting” idea?
Yeah. And when MIRI voiced early skepticism of OpenAI in private conversation, we were told that it was crucial to support Sam and Elon’s effort because Demis was untrustworthy. Counting up from zero, OpenAI could be framed as amazing progress: a nonprofit! Run by people vocally alarmed about x-risk! And they’re struggling for cash in the near term (in spite of verbal promises of funding from Musk), which gives us an opportunity to buy seats on the board!
Anthropic may or may not be slightly better than OpenAI. OpenAI may or may not be slightly better than DeepMind. I don’t think the lesson of history is that OpenPhil-cluster people are good at telling the difference between “this is marginally better than what the other guys are doing” and “this is good enough to actually succeed”.
But nothing I’ve said above depends on that claim. You can disagree with me about how likely Anthropic is to save the world, and still think there’s an egregious candor gap between the average Anthropic public statement and the scariest paragraphs buried in “The Adolescence of Technology”, and a further egregious candor gap between “The Adolescence of Technology” and e.g. Ryan Greenblatt’s post or https://x.com/MaskedTorah/status/2040270860846768203.
I don’t think the “circle-the-wagon” approach has served EA well throughout its history, and I don’t think people self-censoring to that degree is good for governments’ or labs’ ability to orient to reality.