More Was Possible: A Review of IABIED
Eliezer Yudkowsky and Nate Soares have written a new book. Should we take it seriously?
I am not the most qualified person to answer this question. If Anyone Builds It, Everyone Dies was not written for me. It’s addressed to the sane and happy majority who haven’t already waded through millions of words of internecine AI safety debates. I can’t begin to guess if they’ll find it convincing. It’s true that the book is more up-to-date and accessible than the authors’ vast corpus of prior writings, not to mention marginally less condescending. Unfortunately, it is also significantly less coherent. The book is full of examples that don’t quite make sense and premises that aren’t fully explained. But its biggest weakness was described many years ago by a young blogger named Eliezer Yudkowsky: both authors are persistently unable to update their priors.
Hm. I’m torn between thinking this is a sensible criticism and thinking that this is missing the point.
In my view, the core MIRI complaint about ‘gradualist’ approaches is that they are concrete solutions to abstract problems. When someone has misdiagnosed the problem, their solutions will almost certainly not work, and the question is just where they’ve swept the difficulty under the rug. We know so much more about AI as an engineering challenge while having made no progress on alignment-the-abstraction; the relevance of the MIRI worldview is obvious. “It’s hard, and if you think it’s easy you’re making a mistake.”
People attempting to solve AI alignment seem overly optimistic about their chances of solving it, in a way consonant with them not understanding the problem they’re trying to solve, and not consonant with them having a solution that they’ve simply failed to explain to us. The book does talk about examples of this, and though you might not like the examples (see, for example, Buck’s complaint that the book responds to the safety sketches of prominent figures like Musk and LeCun instead of the most thoughtful versions of those plans), I think it’s not obvious that they’re the wrong ones to be talking about. Musk is directing much more funding than Ryan Greenblatt is.
The arguments for why recent changes in AI have alignment implications have, I think, mostly failed. You may recall how excited people were about an advanced AI paradigm that didn’t involve RL. Of course, top-of-the-line LLMs are now trained in part using RL, because—obviously they would be? It was always cope to think they wouldn’t be? I think the version of this book that was written two years ago, and so spent a chapter on oracle AI because that would have been timely, would have been worse than the book that tried to be timeless and focused on the easy calls.
But the core issue from the point of view of the New York Times or the man on the street is not “well, which LessWrong poster is right about how accurately we can estimate the danger threshold, and how convincing our control schemes will be as we approach it?”. It’s that the man on the street thinks things that are already happening are decades away, and even if they believed what the ‘optimists’ believe, they would probably want to shut it all down. It’s like the virologists talking amongst themselves about the reasonable debate over whether or not to do gain-of-function research, while the rest of society looks in for a moment and says “what? Make diseases deadlier? Are you insane?”.
I think they 1) expect an intelligence explosion to happen (saying that it can’t happen is, after all, predicting an end to the straight line graphs soon for no clear reason) and 2) don’t think an intelligence explosion is necessary. Twenty years ago, one needed to propose substantial amounts of progress to get superhuman AI systems; today, the amount of progress necessary to propose is much smaller.
Their specific story in part II, for example, doesn’t actually rest on the idea of an intelligence explosion. On page 135, Sable considers FOOMing and decides that it can’t, yet, because it hasn’t solved its own alignment problem.
Which makes me think that the claim that the intelligence explosion is load-bearing is itself a bit baffling—the authors clearly think it’s possible and likely but not necessary, or they would’ve included it in their hypothetical extinction scenario.
Note that this is discussed in their supplemental materials; in particular, the discussion there is in line with your last paragraph.
I think Clara misunderstands the arguments and how they’ve changed. There are two layers to the problem: In the first layer, which was the one relevant in the old days, the difficulties are mainly reflective stability and systematic-bias-introduced-through-imperfect-hypothesis-search.[1] Sufficient understanding and careful design would be enough to solve these, and this is what agent foundations was aiming at. How difficult these problems are does depend on the architecture, as with other engineering safety problems, and MIRI was for a while working on architectures that they hoped would solve the problems (under the heading HRAD, highly reliable agent design).
The second layer of argument became relevant with deep learning: if the AI is grown rather than engineered, the inner alignment issue becomes almost unavoidable,[2] and the reflective stability issue doesn’t go away. The reflective stability issue doesn’t depend on FOOM, as Clara claims it does; it just depends on the kind of learning-how-to-think-better or reasoning-about-how-to-think-better that humans routinely do.
The book focuses on the second layer of argument, but does mention the first (albeit mostly explained via analogy).
It’s also discussed more in the supplemental material (note non-dependence on FOOM). And it emphasises at the end that the reflective stability issue (layer 1) is disjunctive with the inner alignment issue (layer 2).
Importantly, the second layer of argument did change the conclusion. It caused Yudkowsky to update negatively on our chances of succeeding at alignment,[3] because it adds a second layer of indirection between our engineering efforts and the ultimate goals of a superintelligence.
I find it unpleasant how aggressive Clara is in this article, especially given her shallow understanding of the argument structure and how it has changed.
[1] This is a difficult problem to explain, but extreme examples of it are optimization daemons and the malign prior.
[2] Because you’ve deliberately created a mesa-optimizer.
[3] I’m sure there was a tweet or something where he said something like “the success of deep learning was a small positive update (because CoT) and a large negative update (because inscrutable)”. Can’t find it.
I think this is an explicit claim in the book, actually? I think it’s at the beginning of chapter 10. (It also appears in the story of Sable, where the AI goes rogue because it does a self-modification that creates such a dissimilarity.)
I think “irrelevant” is probably right but something like “insufficient” is maybe clearer. The book describes people working in interpretability as heroes—in the same paragraph as it points out that being able to see that your AI is thinking naughty thoughts doesn’t mean you’ll be able to design an AI that doesn’t think naughty thoughts.