yams
I think targeting specific policy points rather than highlighting your actual crux makes this worse, not better.
I think positions here fall into four quadrants (though it's of course a spectrum), based on how likely you think Doom is and how easy or hard (that is: how resource-intensive) you expect ASI development to be.
ASI Easy/Doom Very Unlikely: Plan obviously backfires; you could have had nice things, but were too cautious!
ASI Hard/Doom Very Unlikely: Unlikely to backfire, but you might have been better off pressing ahead, because there was nothing to worry about anyway.
ASI Easy/Doom Very Likely: We’re kinda fucked anyway in this world, so I’d want to have pretty high confidence it’s the world we’re in before attempting any plan optimized for it. But yes, here it looks like the plan backfires (in that we’re selecting even harder than in default-world for power-seeking, willingness to break norms, and non-transparency in coordinating around who gets to build it). My guess is this is the world you think we’re in. I think this is irresponsibly fatalistic and also unlikely, but I don’t think it matters to get into it here.
ASI Hard/Doom Very Likely: Plan plausibly works.
I expect near-term ASI development to be resource-intensive, or to rely on not-yet-complete resource-intensive research. I remain concerned about brain-in-a-box scenarios, but it’s not obvious to me that they’re much more pressing in 2025 than they were in 2020, except in ways that are downstream of LLM development (I haven’t looked super closely), which is more tractable to coordinate action around anyway, and that action plausibly leads to decreased risk on the margin even if the principal threat is a brain-in-a-box. I assume you disagree with all of this.
I think your post is just aimed at a completely different threat model than the book even attempts to address, and I think you would be having more of the discussion you want to have if you had opened by talking explicitly about your actual crux (which you seemed to know was the real crux ahead of time), rather than inciting an object-level discussion colored by such a powerful background disagreement. As-is, it feels like you just disagree with the way the book is scoped, and would rather talk to Eliezer about brain-in-a-box than talk to William about the tentative-proposals-to-solve-a-very-different-problem.
I take ‘backfire’ to mean ‘get more of the thing you don’t want than you would otherwise, as a direct result of your attempt to get less of it.’ If you mean it some other way, then the rest of my comment isn’t really useful.
Change of the winner
Secret projects under the moratorium are definitely on the list of things to watch out for, and the tech gov team at MIRI has a huge suite of countermeasures they’re considering for this, some of which are sketched out or gestured toward here.
It actually seems to me that an underground project is more likely under the current regime, because there aren’t really any meaningful controls in place (you might even consider DeepSeek just such a project, given that there’s some evidence [I really don’t know what I believe here and it doesn’t seem useful to argue; just using them as an example] that they stole IP and smuggled chips).
The better your moratorium is, the less likely you are to get wrecked by a secret project (because the fewer resources they’ll be able to gather) before you can satisfy your exit conditions.
So p(undergroundProjectExists) goes down as a result of moratorium legislation, but p(undergroundProjectWins) may go up if your moratorium sucks. (I actually think this is still pretty unclear, owing to the shape of the classified research ecosystem, which I talk more about below.)
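To make the arithmetic behind this explicit, here is a toy sketch with entirely made-up numbers (not an estimate of anything, just an illustration of how the two probabilities trade off):

```python
# Toy decomposition of the secret-project worry, with made-up numbers.
# P(secret project wins) = P(secret project exists) * P(it wins | it exists)

def p_underground_win(p_exists: float, p_win_given_exists: float) -> float:
    """Overall probability that a secret project exists AND outruns everyone else."""
    return p_exists * p_win_given_exists

# Status quo (no moratorium): secret-ish projects are fairly likely, but they
# compete against well-resourced above-board labs, so their conditional edge is small.
status_quo = p_underground_win(p_exists=0.6, p_win_given_exists=0.1)

# Weak moratorium: fewer projects form, but any that do face less competition
# and weak enforcement, so the conditional term goes up.
weak_moratorium = p_underground_win(p_exists=0.3, p_win_given_exists=0.3)

# Strong moratorium: even fewer projects, and enforcement (compute controls,
# monitoring) keeps the conditional term down too.
strong_moratorium = p_underground_win(p_exists=0.1, p_win_given_exists=0.15)

print(status_quo, weak_moratorium, strong_moratorium)  # 0.06, 0.09, 0.015
```

The point is just that the product can move either way, depending on how much the moratorium’s enforcement drags down the conditional term.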
This is, imo, your strongest point, and is a principal difference between the various MIRI plans and the plans of other people we talk to (“Once you get the moratorium, do you suppose there must be a secret project and resolve to race against them?” MIRI answers no; some others answer yes.)
Intensified race...
You say: “a number of AI orgs would view the threat of prohibition on par with the threat of a competitor winning”. I don’t think this effect is stronger in the moratorium case than in the ‘we are losing the race and believe the finish line is near’ case, and this kind of behavior happening sooner is better (if we don’t expect the safety situation to improve), because the systems themselves are less powerful, the risks aren’t as big, the financial stakes aren’t as palpable, etc. I agree with something like “looming prohibition will cause some reckless action to happen sooner than it would otherwise”, but not with the stronger claim that this action would be created by the prohibition.
I also think the threat of a moratorium could cause companies to behave more sanely in various ways, so that they’re not caught on the wrong side of the law in the future worlds where some ‘moderate’ position wins the political debate. I don’t think voluntary commitments are trustworthy/sufficient, but I could absolutely see RSPs growing teeth as a way for companies to generate evidence of effective self-regulation, then deploy that evidence to argue against the necessity of a moratorium.
It’s just really not clear to me how this set of interrelated effects would net out, much less that it’s an obvious way pushing through a moratorium might backfire. My best guess is that these cooling effects pre-moratorium basically win out and compress the period of greatest concern, while also reducing its intensity.
Various impairments for AI safety research
Huge amounts of classified research exist. There are entire parallel academic ecosystems for folks working on military and military-adjacent technologies. These include work on game theory, category theory, Conway’s Game of Life, genetics, corporate governance structures, and other relatively-esoteric things beyond ‘how make bomb go boom’. Scientists in ~every field, and mathematicians in any branch of the discipline, can become DARPA PMs, and access to (some portion of) this separate academic canon is considered a central perk of the job. I expect gaining the ability to work on the classified parts of AI safety under a moratorium will be similarly difficult to qualifying for work at Los Alamos and the like.
As others have pointed out, not all kinds of research would need to be classified-by-default, under the plan. Mostly this would be stuff regarding architecture, algorithms, and hardware.
There are scarier worlds where you would want to classify more of the research, and there are reasonable disagreements about what should/shouldn’t be classified, but even then, you’re in a Los Alamos situation, and not in a Butlerian Jihad.
Most of these people claim to be speaking from their impression of how the public will respond, which is not yet knowable and will be known in the (near-ish) future.
My meta point remains that these are all marginal calls, that there are arguments the other direction, and that only Nate is equipped to argue them on the margin (because, in many cases, I disagree with Nate’s calls, but don’t think I’m right about literally all the things we disagree on; the same is true for everyone else at MIRI who’s been involved with the project, afaict). Eg I did not like the scenario, and felt Part 3 could have been improved by additional input from the technical governance team (and more detailed plans, which ended up in the online resources instead). It is unreasonable that I have been dragged into arguing against claims I basically agree with on account of introducing a single fact to the discussion (that length DOES matter, even among ‘elite’ audiences, and that thresholds for this may be low). My locally valid point and differing conclusions do not indicate that I disagree with you on your many other points.
That people wishing the book well are also releasing essays (based on guesses and, much less so in your case than others, misrepresentations) to talk others in the ecosystem out of promoting it could, in fact, be a big problem, mostly in that it could bring about a lukewarm overall reception (eg random normie-adjacent CEA employees don’t read it and don’t recommend it to their parents, because they believe the misrepresentations from Zach’s tweet thread here: https://x.com/Zach_y_robinson/status/1968810665973530781). Once that happens, Zach can say “well, nobody else at my workplace thought it was good,” when none of them read it, and HE didn’t read it, AND they just took his word for it.
I could agree with every one of your object level points, still think the book was net positive, and therefore think it was overconfident and self-fulfillingly nihilistic of you to authoritatively predict how the public would respond.
I, of course, wouldn’t stand by the book if I didn’t think it was net positive, and hadn’t spent tens of hours hearing the other side out in advance of the release. Part I shines VERY bright in my eyes, and the other sections are, at least, better than similarly high-profile works (to the extent that those exist at all) tackling the same topics (exception for AI2027 vs Part 2).
I am not arguing about the optimal balance and see no value in doing so. I am adding anecdata to the pile that there are strong effects once you near particular thresholds, and it’s easy to underrate these.
In general I don’t understand why you continue to think such a large number of calls are obvious, or imagine that the entire MIRI team, and ~100 people outside of it, thinking, reading, and drafting for many months, might not have weighed such thoughts as ‘perhaps the scenario ought to be shorter.’ Obviously these are all just marginal calls; we don’t have many heuristic disagreements, and nothing you’ve said is the dunk you seem to think it is.
Ultimately Nate mostly made the calls once considerations were surfaced; if you’re talking to anyone other than him about the length of the scenario, you’re just barking up the wrong tree.
More on how I’m feeling in general here (some redundancies with our previous exchanges, but some new):
I’ve met a large number of people who read books professionally (humanities researchers) who outright refuse to read any book >300 pages in length.
Can’t discuss too much about current sales numbers, mostly because nobody really has numbers that are very up to date, but I started with a similar baseline for community sales and then subtracted that from our current floor estimate to suggest there’s a chance it’s getting traction. A second wave will be more telling, and the conversation will be more telling, but the first filter is ‘get it in people’s hands’, and so we at least have a chance to see how those other steps will go.
In both this and other reviews, people have their theory of What Will Work. Darren McKee writing a book (unfortunately) does not appear to have worked (for reasons that don’t necessarily have anything to do with the book’s quality, or even with Darren’s sense of what works for the public; I haven’t read it). Nate and Eliezer wrote a book, and we will get feedback on how well that works in the near future (independent of anyone’s subjective sense of what the public responds to, which seems to be a crux for many of the negative reviews on LW).
I’m just highlighting that we all have guesses about what works here, but they are in fact guesses, and most of what this review tells me is ‘Darren’s guess is different from Nate’s’, and not ‘Nate was wrong.’ That some people agree with you would be some evidence, if we didn’t already strongly predict that a bunch of people would have takes like this.
I think the text is meaningfully more general-audience friendly than much of the authors’ previous writing.
It could still be true that it doesn’t go far enough in that direction, but I’m excited to watch the experiment play out (eg it looks like we’re competitive for the Times list rn, and that requires some 4-figure number of sales beyond the bounds of the community, which isn’t enough that I’m over the moon, given the importance of the issue, but is some sign that it may be too early in the game to say definitively whether or not general audiences are taking to the work).
Following up to say that the thing that maps most closely to what I was thinking about (or satisfied my curiosity) is GWT.
GWT is usually intended to approach the hard problem, but the principal critique of it is that it isn’t doing that at all (I ~agree). Unfortunately, I had dozens of frustrating conversations with people telling me ‘don’t spend any time thinking about consciousness; it’s a dead end; you’re talking about the hard problem; that triggers me; STOP’ before someone actually pointed me in the right direction here, or seemed open to the question at all.
Reading so many reviews/responses to IABIED, I wish more people had registered how they expected to feel about the book, or how they think a book on x-risk ought to look, prior to the book’s release.
Finalizing any Real Actual Object requires making tradeoffs. I think it’s pretty easy to critique the book on a level of abstraction that respects what it is Trying To Be in only the broadest possible terms, rather than acknowledging various sub-goals (e.g. providing an updated version of Nate + Eliezer’s now very old ‘canonical’ arguments), modulations of the broader goal (e.g. avoiding making strong claims about timelines, knowing this might hamstring the urgency of the message), and constraints (e.g. going through an accelerated version of the traditional publishing timeline, which means the text predates Anthropic’s Agentic Misalignment and, I’m sure, various other important recent findings).
A lot of the takes I see seem to come from a place of defending the ideal version of such a text by the lights of the reviewer, but it’s actually unclear to me whether many of these reviewers would have made the opposite critiques if the book had made the opposite call on the various tradeoffs. I don’t mean to say I think these reviewers are acting in bad faith; I just think it’s easy to avoid confronting how your ideal version couldn’t possibly be realized, and make post-hoc adjustments to that ideal thing in service of some (genuine, worthwhile) critique of the Real Thing.
Previously, it annoyed me that people had pre-judged the book’s contents. Now, I’m grateful to folks who wrote about it, or talked to me about it, before they read it (Buck Shlegeris, Nina Panickssery, a few others), because I can judge the consistency of their rubric myself, rather than just feeling:
Yes, this came up during drafting, but there was a reasonable tradeoff. We won’t know if that was a good call until later. If I had more energy I’d go 20 comments deep with you, and you’d probably agree it was a reasonable call by the end, but still think it was incorrect, and we’d agree to let time tell.
Which is the feeling that’s overtaken me as I read the various reviews from folks throughout the community.
I should say: I’m grateful for all the conversation, including the dissent, because it’s all data, but it is worse data than it would have been if you’d taken it upon yourself to cause a fuss in one of the many LW posts made in the lead-up to the book (and in the future I will be less rude to people who do this, because actually, it’s a kindness!).
Since there’s speculation about advance copies in this thread, and I was involved in a fair deal of that, I’ll say some words on the general criteria for advance copies:
Some double-digit number of people were solicited for comments during the drafting stage.
Some low three-digit number of people (my guess is ~200) were solicited for blurbs, often receiving an advance copy. Most of these people simply did not respond. These were, broadly:
People who have blurbed similar books (e.g. Life 3.0, The Precipice, etc)
Random influential people we already knew (both within AI safety and without; Grimes goes in this category, for those wondering)
Random influential people we thought we could get a line to through some other route (friends of friends, colleagues of colleagues, people whose email the publicist had, people whose contact information is ~public), who seemed ~able to be convinced
Journalists, content creators, podcasters, and other people in a position to amplify the book by talking about it often received advance copies, since you have to get all your press lined up well ahead of release, and they usually want to read the book (or pay someone else to read it, in many cases), before agreeing to have you on. My guess is this was about 100 copies.
We didn’t want to empower people who seemed to be at some risk of taking action to deflate the project ahead of release, and so had a pretty high bar for sharing there. We especially wouldn’t share a copy with someone if we thought there was a significant chance the principal effect of doing so was early and authoritative deflation to a deferential audience who could not yet read the book themselves. This is because we wanted people to read the book, rather than relying on others for their views.
I agree with the person Eli’s quoting that this introduces some selection bias in the early stages of the discourse. However, I will say that the vast majority of parties we shared advance copies with were, by default, neutral toward us, either having never heard of us before, or having googled Eliezer and realized it might be worth writing/talking about. There was, to my knowledge, no deliberate campaign to seed the discourse, and many of our friends and allies with whom we had opportunity to score cheap social points by sharing advance copies did not receive them. Journalists do not agree in advance to cover you positively, and we’ve seen that several who were given access to early copies indeed covered the book negatively (e.g. NYT and Wired, two of the highest profile things that will happen around the book at all).
[the thing is happening where I put a number of words into this that is disproportionate to my feelings or the importance; I don’t take anyone in this thread to be making accusations or being aggressive/unfair. I just see an opportunity to add value through a little bit of transparency.]
MIRI is potentially interested in supporting reading groups for If Anyone Builds It, Everyone Dies by offering study questions, facilitation, and / or copies of the book, at our discretion. If you lead a pre-existing reading group of some kind (or meetup group that occasionally reads things together), please fill out this form.
The deadline to submit is September 22, but sooner is much better.
As Mikhail said, I feel great empathy and respect for these people. My first instinct was similar to yours, though - if you’re not willing to die, it won’t work, and you probably shouldn’t be willing to die (because that also won’t work / there are more reliable ways to contribute / timelines uncertainty).
I think ‘I’m doing this to get others to join in’ is a pretty weak response to this rebuttal. If they’re also not willing to die, then it still won’t work, and if they are, you’ve wrangled them in at more risk than you’re willing to take on yourself, which is pretty bad (and again, it probably still won’t work even if a dozen people are willing to die on the steps of the DeepMind office, because the government will intervene, or they’ll be painted as loons, or the attention will never materialize and their ardor will wane).
I’m pretty confused about how, under any reasonable analysis, this could come out looking positive EV. Most of these extreme forms of protest just don’t work in America (e.g. the soldier who self-immolated a few years ago). And if it’s not intended to be extreme, they’ve (I presume accidentally) misbranded their actions.
[low-confidence appraisal of ancestral dispute, stretching myself to try to locate the upstream thing in accordance with my own intuitions, not looking to forward one position or the other]
I think the disagreement may be whether or not these things can be responsibly decomposed.
A: “There is some future system that can take over the world/kill us all; that is the kind of system we’re worried about.”
B: “We can decompose the properties of that system, and then talk about different times at which those capabilities will arrive.”
A: “The system that can take over the world, by virtue of being able to take over the world, is a different class of object from systems that have some reagents necessary for taking over the world. It’s the confluence of the properties of scheming and capabilities, definitionally, that we find concerning, and we expect super-scheming to be a separate phenomenon from the mundane scheming we may be able to gather evidence about.”
B: “That seems tautological; you’re saying that the important property of a system that can kill you is that it can kill you, which dismisses, a priori, any causal analysis.”
A: “There are still any-handles-at-all here, just not ones that rely on decomposing kill-you-ness into component parts which we expect to be mutually transformative at scale.”
I feel strongly enough about engagement on this one that I’ll explicitly request it from @Buck and/or @ryan_greenblatt. Thank y’all a ton for your participation so far!
This rhymes with what Paul Christiano and his various interlocutors (e.g. Buck and Ryan above) think, but I think you’ve put forward a much weaker version of it than they do.
This deployment of the word ‘unproven’ feels like a selective call for rigor, in line with the sort of thing Casper, Krueger, and Hadfield-Menell critique here. Nothing is ‘proven’ with respect to future systems; one merely presents arguments, and this post is a series of arguments toward the conclusion that alignment is a real, unsolved problem that does not go well by default.
“Lay low until you are incredibly sure you can destroy humanity” is definitionally not a risky plan (because you’re incredibly sure you can destroy humanity, and you’re a superintelligence!). You have to weaken incredibly sure, or be talking about non-superintelligent systems, for this to go through.
The open question for me is not whether it at some point could, but how likely it is that it will want to.
What does that mean? Consistently behaving such that you achieve a given end is our operationalization of ‘wanting’ that end. If future AIs consistently behave such that “significant power goes away from humans to ASI at some point”, this is consistent with our operationalization of ‘want’.
We in fact witness current AIs resisting changes to their goals, and so it appears to be the default in the current paradigm. However, it’s not clear whether some hypothetical other paradigm exists that doesn’t have this property (it’s definitely conceivable; I don’t know if that makes it likely, and it’s not obvious whether this is something one would want to use as a desideratum when concocting an alignment plan; that depends on other details of the plan).
As far as is public record, no major lab is currently putting significant resources into pursuing a general AI paradigm sufficiently different from current-day LLMs that we’d expect it to obviate this failure mode.
In fairness, there is work happening to make LLMs less-prone to these kinds of issues, but that seems unlikely to me to hold in the superintelligence case.
The point of that sentence is that it has ever been the case that things were that simple, not to argue that, from the current point in time, we’re just waiting on the next 10x scale-up in training compute (we’re not). Any new paradigm is likely to create wiggle room to scale a single variable and receive returns (indeed, some engineers, and maybe many, index on ease of scalability when deciding which approaches to prioritize, since this makes things the right combination of cheap to test and highly effective).
Maybe you’re already familiar, but this kind of forecasting is usually done by talking about the effects of innovation, and then assuming that some innovation is likely to happen as a result of the trend. This is a technique pretty common in economics. It has obvious failure modes (that is, it assumes naturality/inevitability of some extrapolation from available data, treating contingent processes the same way you might an asteroid’s trajectory or other natural, evolving quantity), but these appear to be the best (or at least most popular) tools we have for now for thinking about this kind of thing.
The appendices of AI2027 are really good for this, and the METR Time Horizons paper is an example of recent/influential work in this area.
Again, this isn’t awesome for analyzing discontinuities, and you need to dig into the methodology a bit to see how they’re handled in each case (some discontinuities will be calculated as part of the broader trend, meaning the forecast takes into account future paradigm-shifting advances; more bearish predictions won’t do this, and will discount or ignore steppy gains in the data).
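For concreteness, here’s a minimal sketch of the kind of log-linear extrapolation underlying a lot of this work (the numbers below are invented for illustration; they’re not METR’s data or anyone’s actual forecast):

```python
import numpy as np

# Minimal sketch of log-linear trend extrapolation, in the general spirit of
# time-horizon-style forecasts. All data points are made up for illustration.
years = np.array([2020, 2021, 2022, 2023, 2024])
horizon_minutes = np.array([0.5, 1.5, 4.0, 15.0, 50.0])  # hypothetical task horizons

# Fit a straight line to log(horizon) vs. year, i.e. assume a steady doubling time.
slope, intercept = np.polyfit(years, np.log(horizon_minutes), 1)
doubling_time_years = np.log(2) / slope

# Extrapolate the fitted trend forward. This is the step that treats a
# contingent research process like an asteroid trajectory: any future
# paradigm shift is either already baked into the historical slope or ignored.
future = np.array([2026, 2028, 2030])
projected = np.exp(intercept + slope * future)

print(f"implied doubling time: {doubling_time_years:.2f} years")
print(dict(zip(future.tolist(), projected.round(1).tolist())))
```

The whole forecast hangs on whether the fitted slope keeps describing the future, which is exactly where the discontinuity caveat above bites.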
I think there’s only a few dozen people in the world who are ~expert here, and most people only look at their work on the surface level, but it’s very rewarding to dig more deeply into the documentation associated with projects like these two!
I think my response heavily depends on the operationalization of alignment for the initial AIs, and I’m struggling to keep things from becoming circular in my decomposition of various operationalizations. The crude response is that you’re begging the question here by first positing aligned AIs, but I think your position is that techniques which are likely to descend from current techniques could work well-enough for roughly human-level systems, and that’s where I encounter this sense of circularity.
I think there’s a better-specified (from my end; you’re doing great) version of this conversation that focuses on three different categories of techniques, based on the capability level at which we expect each to be effective:
1. Current model-level
2. Useful autonomous AI researcher level
3. Superintelligence
However, I think that disambiguating between proposed agendas for 2 + 3 is very hard, and assuming agendas that plausibly work for 1 also work for 2 is a mistake. It’s not clear to me why the ‘it’s a god, it fucks you, there’s nothing you can do about that’ concerns don’t apply for models capable of:
hard-to-check, conceptual, and open-ended tasks
I feel pretty good about this exchange if you want to leave it here, btw! Probably I’ll keep engaging far beyond the point at which it’s especially useful (although we’re likely pretty far from the point where it stops being useful to me rn).
For most, LLMs are the salient threat vector at this time, and the practical recommendations in the book are aimed at that. You did not say in your post ‘I believe that brain-in-a-box is the true concern; the book’s recommendations don’t work for this, because Chapter 13 is mostly about LLMs.’ That would be a different post (and redundant with a bunch of Steve Byrnes stuff).
Instead, you completely buried the lede and made a post inviting people to talk in circles with you unless they magically divined your true objection (which is only distantly related to the topic of the post). That does not look like a good faith attempt to get people on the same page.