Eliezer and I wrote a book: If Anyone Builds It, Everyone Dies
Eliezer and I wrote a book. It’s titled If Anyone Builds It, Everyone Dies. Unlike a lot of other writing either of us has done, it’s being professionally published. It’s hitting shelves on September 16th.
It’s a concise (~60k-word) book aimed at a broad audience. It’s been well received by advance readers, with endorsements including:
The most important book I’ve read for years: I want to bring it to every political and corporate leader in the world and stand over them until they’ve read it. Yudkowsky and Soares, who have studied AI and its possible trajectories for decades, sound a loud trumpet call to humanity to awaken us as we sleepwalk into disaster. Their brilliant gift for analogy, metaphor and parable clarifies for the general reader the tangled complexities of AI engineering, cognition and neuroscience better than any book on the subject I’ve ever read, and I’ve waded through scores of them. We really must rub our eyes and wake the **** up!
- Stephen Fry, actor, broadcaster, and writer
If Anyone Builds It, Everyone Dies may prove to be the most important book of our time. Yudkowsky and Soares believe we are nowhere near ready to make the transition to superintelligence safely, leaving us on the fast track to extinction. Through the use of parables and crystal-clear explainers, they convey their reasoning, in an urgent plea for us to save ourselves while we still can.
- Tim Urban, co-founder, Wait But Why
This is the best no-nonsense, simple explanation of the AI risk problem I’ve ever read.
- Yishan Wong, former CEO of Reddit
Lots of people are alarmed about AI, and many of them are worried about sounding alarmist. With our book, we’re trying to break that logjam, and bring this conversation into the mainstream.
This is our big push to get the world onto a different track. We’ve been working on it for over a year. The time feels ripe to me. I don’t know how many more chances we’ll get. MIRI’s dedicating a lot of resources towards making this push go well. If you share any of my hope, I’d be honored by you doing whatever you can to help the book make a huge splash, once it hits shelves.
One thing that our publishers tell us would help is preorders. Preorders count towards first-week sales, which determine a book’s ranking on the best-seller list, which has a big effect on how many people read it. And, inconveniently, early preorders affect the number of copies that get printed, which affects how much stock publishers and retailers wind up having on hand, which affects how much they promote the book and display it prominently. So preorders are valuable,[1][2] and they’re especially valuable before the first print run (mid-June) and the second print run (mid-July). We’re told that 10,000 preorders gives a book a good chance of making the best-seller list (depending on the competition), and that 20,000 would be a big deal. Those numbers seem to me like they’re inside the range of possibility, and they’re small enough that each individual preorder makes a difference.
If you’ve been putting off sharing your views on AI with your friends and family, this summer might be a good time for it. Especially if your friends and family are the sort of people who’d pre-order a book in June even if it won’t hit shelves until September.
Another thing that I expect to help is discussing the book once it comes out, to generate buzz and increase its impact. Especially if you have a social media platform. If you’ve got a big or interesting platform, I’d be happy to coordinate about what timings are most impactful (according to the publicists) and perhaps even provide an advance copy (if you want to have content queued up), though we can’t offer that to everyone.
Some of you have famous friends that might provide endorsements to match or exceed the ones above. Extra endorsements would be especially valuable if they come in before May 30, in which case they could be printed in or on the book; but they’re still valuable later for use on the website and in promotional material. If you have an idea, I invite you to DM me and we might be able to share an advance copy with your contact.
(And, of course, maybe you don’t share my hope that this book can bring the conversation to the mainstream, or are reserving judgement until you’ve read the dang thing. To state the obvious, that’d make sense too.)
I’ve been positively surprised by the reception the book has gotten thus far. If you’re a LessWrong regular, you might wonder whether the book contains anything new for you personally. The content won’t come as a shock to folks who have read or listened to a bunch of what Eliezer and I have to say, but it nevertheless contains some new articulations of our arguments that I think are better than any we’ve managed before. For example, Rob Bensinger (of MIRI) read a draft and said:
I’ve spent years trying to make these same arguments and I was frequently floored while reading the book at how much better you did at arguing for things I’ve been struggling to communicate this whole time. XD YOU SOLVED THE PROBLEM. This is how we should be making these arguments!
Other MIRI staff report that the book helped them fit the whole argument in their head better, or that it made sharp some intuitions they had that were previously vague. So you might get something out of it even if you’ve been around a while. And between these sorts of reactions among MIRI employees and the reactions from others quoted at the top of this post, you might consider that this book really does have a chance of blowing the Overton window wide open.
As Rob said in the MIRI newsletter recently:
the two proximate reasons humanity is currently racing to destroy itself with AI are that (1) not enough people are aware of the danger, and (2) some of the people aware of the danger are worried about looking silly by speaking out about it.
These are completely insane reasons for the human experiment to end.
[...] If public figures woke up tomorrow and just started talking about this issue, that would make it stop feeling like an issue that has to be discussed in hushed tones.
That’s what we’re going for. And seeing the reception of early drafts, I have a glimmer of hope. Perhaps humanity can yet jolt into action and change our course before it’s too late. If you, too, see that glimmer of hope, I’d be honored by your aid.
Also we have a stellar website made by LessWrong’s very own Oliver Habryka, where you can preorder the book today: IfAnyoneBuildsIt.com.
I have preordered, and am looking forward to reading my copy when it arrives. Seems like a way to buy lightcone-control-in-expectation-points very cheaply.
I admit I’m worried about the cover design. It looks a bit… slap-dash; at first I thought the book was self-published. I’m not sure how much control you and Eliezer have over this, but I think improving it would go a long way toward convincing people to spread it and its ideas as mainstream, reasonable, and inside the Overton window.
+1 on the cover looking outright terrible. To make this feedback more specific and actionable:
If you care about the bestseller lists, why doesn’t this book cover look like previous bestsellers? To get a sense of what those look like, here is an “interactive map of over 5,000 book covers” from the NYT “Best Selling” and “Also Selling” lists between 2008 and 2019.
In particular, making all words the same font size seems very bad, and making title and author names the same size and color is a baffling choice.
Why is the subtitle in the same font size as the title?
And why are your author names so large, anyway? Is this book called “If Anyone Builds It, Everyone Dies”, or is it called “Eliezer Yudkowsky & Nate Soares”?
Plus someone with a 17-character name like “Eliezer Yudkowsky” simply can’t have such a large author font. You’re spending three lines of text on the author names!
Plus I would understand making the author names so large if you had a humungous pre-existing readership (when you’re Stephen King or J. K. Rowling, the title of your book is irrelevant). But even Yudkowsky doesn’t have that, and Nate certainly doesn’t. So why not make the author names smaller, and let the title speak for itself?
I understand the artistic desire to have the irrecoverable red event horizon of superintelligence underscore the “would kill us all” part, but since it makes the words “kill us all” harder to read, I’m not sure whether the current design underscores or obscures that.
Surely it would’ve been possible to somehow make title and subtitle less than four lines of text each?
And overall, the cover just looks very cheap and low-effort.
EDIT: More fundamentally: in all media, title & cover art are almost as important as content, because you can’t get people to click on your video, or to pick up your book in a bookstore, if title & cover aren’t good. Then it doesn’t matter how great the content is, if nobody ever sees it.
Anyway, the title is good, the cover is bad, and I can’t assess the content yet. You say this book has been in the works for over a year, and that you spent lots of effort on polishing its content. If it’s a good fraction of MIRI’s output for that time, and a cover is responsible for (say) 20% of a book’s impact, wouldn’t that justify spending >>>$100k on the cover design? This one looks more like a cover purchased on Fiverr.
Also see this discussion on the need to spend a significant fraction of the effort that goes into a piece of content on things like the title, cover, thumbnail, and book blurb.
The “lightcone-eating” effect on the website is quite cool. The immediate obvious idea is to have that as a background and write the title inside the black area.
If you wanted to be cute, you could even make the expansion vaguely skull-shaped; perhaps like so?
Most of those are fiction or biographies/memoirs (which often have a picture of the subject/author on the cover), which seem to have a different cover style than other books. Skimming through some lists of NYT bestsellers, some books with the most comparable “Really Big Thing!” topics are “Fascism: A Warning” (Madeleine Albright, cover has large red-on-black lettering, no imagery), “How to Avoid a Climate Disaster” (Bill Gates, cover has large gradiented blue-to-red text on white background, author above, subtitle below, no imagery), “Germs” (title in centered large black lettering, subtitle “Biological Weapons and America’s Secret War” in smaller text above, authors beneath; background is a white surface with a diagonally-oriented glass slide on it), and “A Warning—Anonymous” (plain black text on white background, subtitle “A Senior Trump Administration Official” in small red lettering below, no imagery). Neither cover version of IABIED looks that different from that pattern, I think.
Given that the book is being published by a major publisher, it can safely be assumed that the cover design was made by a professional cover designer, who knew what they were doing.
Contrary to what you wrote, the title has a bigger font size than both the subtitle and the authors’ names (this is true of both the American and UK covers; I am primarily talking about the American cover, which I presume is the one you are referencing). Even if the author names were the same size as the title, it would be immediately obvious which one is the title and which one isn’t. Putting the subtitle in a dark grey, much closer to the background color (black) than the title’s white is, also does a lot to move emphasis towards the title of the book and away from the subtitle.
Most importantly, the title is plenty big. If it was small, then I would feel there is something to what you are saying; but the title is quite large and readable from a distance, and clearly delineated from the rest of the text on the cover.
In this case, part of the point of publishing a book (including writing it in the first place) is presumably to promote the identity of the authors, to make them a known name / Schelling point for discussion about AI safety. That would indicate making the names quite prominent on the cover.
I see that the numbers indicate people disagree with this post. Since it makes several claims, it’s hard to know which of them specifically (or whether all of them) are being disagreed with.
The second paragraph (beginning “Contrary to what you wrote...”) is a list of factual statements, which as far as I can tell are all correct.
The third paragraph (“Most importantly, the title is plenty big...”) is more subjective, but I’m currently not imagining that anyone is disagreeing with that paragraph (that is, that anyone thinks “actually, the title is too small”).
The fourth paragraph (“In this case, part of the point...”) is more speculative, and I could easily imagine someone reading it and thinking “that’s not the point of publishing / writing a book”. There’s certainly a reason I put a “presumably” in there. I do still feel that there’s something to what I’m saying in that paragraph. My surprise would be of a limited extent if Soares and Yudkowsky said “that was not a consideration in our decision to do this”—but I would be somewhat surprised.
I can see someone disagreeing with the first paragraph (“Given that the book...”), but my current state of mind is that such people would be simply wrong. The book is not being self-published, but is being published by Little, Brown and Company. Some excerpts from Wikipedia’s article on Little, Brown and Company:
and
The point being: the company that is publishing Soares and Yudkowsky’s book is an established company that has sold important and/or bestselling works for two centuries. The people there know what they are doing, and that includes the people who design covers, as well as the bosses of the people who design the covers.
I imagine most disagreement comes from the first paragraph.
The problem with assuming that a famous publisher’s design is necessarily good is that even huge companies make far worse and more baffling design decisions all the time, and in this case one can directly see the design and know that it’s not great – the weak outside-view evidence that prestigious companies usually do good work doesn’t move this very much.
Yes, my disagreement was mostly with the first paragraph, which read to me like “who are you going to believe, the expert or your own lying eyes?” I’m not an expert, but I do have a sense of aesthetics; that sense of aesthetics says the cover looks bad, and many others agree. I don’t care if the cover was designed by a professional; to shift my opinion as a layperson, I would need evidence that the cover is well-received by many more people than dislike it, plus A/B tests of alternative covers showing it can’t be easily improved upon.
That said, I also disagreed somewhat with the fourth paragraph, because when it comes to AI safety, MIRI really needs no introduction or promotion of their authors. They’re well-known; the labs just ignore their claim that “if anyone builds it, everyone dies”.
I used to do graphic design professionally, and I definitely agree the cover needs some work.
I put together a few quick concepts, just to explore some possible alternate directions they could take it:
https://i.imgur.com/zhnVELh.png
https://i.imgur.com/OqouN9V.png
https://i.imgur.com/Shyezh1.png
These aren’t really finished quality either, but the authors should feel free to borrow and expand on any ideas they like if they decide to do a redesign.
It’s important that the cover not make the book look like fiction, which I think these do. The difference in style is good to keep in mind.
Those are definitely all improvements on the current cover!
I only like the first one more than the current cover, and I think then not by all that much. I do think this is the sort of thing that’s relatively easy to focus group / get data on, and the right strategy is probably something that appeals to airport book buyers instead of LessWrongers.
Finally created a LW account (after years of lurking) to upvote and agree on the cover design issue.
This is a topic where I read whatever I can get my hands on, and if I saw this book in a store (and did not know EY or Nate), even I would be a bit put off from giving it a read.
Given the stated goal of trying to make this a bestseller, I feel like the cover is a pretty big impediment.
Can confirm! I’ve followed this stuff for forever, but always felt at the edge of my technical depth when it came to alignment. It wasn’t until I read an early draft of this book a year ago that I felt like I could trace a continuous, solid line from “superintelligence grown by a blind process...” to “...develops weird internal drives we could not have anticipated”. Before, I was like, “We don’t have justifiable confidence that we can make something that reflects our values, especially over the long haul,” and now I’m like, “Oh, you can’t get there from here. Clear as day.”
As for why this spells disaster if anyone builds it, I didn’t need any new lessons, but they are here, and they are chilling—even for someone who was already convinced we were in trouble.
Having played some small part in helping this book come together, I would like to attest to the sheer amount of iteration it has gone through over the last year. Nate and co. have been relentlessly paring and grinding this text ever closer to the kind of accessibility that won’t just help individuals understand why we must act, but will make them feel like their neighbors and political leaders can understand it, too. I think that last part counts for a lot.
The book is also pretty engaging.
The pitch I suggested we share with our friends and allies is this:
If you’ve been waiting for a book that can explain the technical roots of the problem in terms your representative and your mother can both understand, this is the one. This is the grounded, no-nonsense primer on why superintelligence built blindly via gradient descent will predictably develop human-incompatible drives; on why humanity cannot hope to endure if the fatal, invisible threshold is crossed; and on what it will take to survive the coming years and decades.
You convinced me to pre-order it. In particular, these lines:
> It wasn’t until I read an early draft of this book a year ago that I felt like I could trace a continuous, solid line from “superintelligence grown by a blind process...” to “...develops weird internal drives we could not have anticipated”. Before, I was like, “We don’t have justifiable confidence that we can make something that reflects our values, especially over the long haul,” and now I’m like, “Oh, you can’t get there from here. Clear as day.”
I read an advance copy of the book; I liked it a lot. I think it’s worth reading even if you’re well familiar with the overall argument.
I think there’s often been a problem, in discussing something for ~20 years, that the material is all ‘out there somewhere’ but unless you’ve been reading thru all of it, it’s hard to have it in one spot. I think this book is good at presenting a unified story, and not getting bogged down in handling too many objections to not read smoothly or quickly. (Hopefully, the linked online discussions will manage to cover the remaining space in a more appropriately non-sequential fashion.)
Would this book be helpful to read even if I am already familiar with the broad arguments for AI safety/alignment? I’m not in a role where I need to professionally communicate to the public about AI safety, so I’m wondering if I will learn anything new, or if this is basically a convenient packaging of a bunch of stuff that I’m already familiar with from reading Less Wrong and similar AI safety content for a few years.
Several people who worked at MIRI thought the book had new and interesting content for them; I don’t remember having the “learned something new” experience myself, but I nevertheless enjoyed reading it.
I’m relatively OOTL on AI since GPT-3. My friend is terrified and thinks we need to halt it urgently; I couldn’t understand his point of view, and he mentioned this book to me. I see a number of pre-readers saying the version they read is well-suited exactly for convincing people like me. At which point: if you believe the threat is imminent, why delay the book four months? I’ll read a digital copy today if you point me to it.
I think they are delaying so people can preorder early, which affects how many books the publisher prints and distributes, which affects how many people ultimately read it and how much it breaks into the Overton window. Getting this conversation mainstream is an important instrumental goal.
If you are looking for info in the meantime, you could look at PauseAI:
https://pauseai.info/
Or if you want fewer facts and quotes and more discussion, I recall that Yudkowsky’s Coming of Age is what changed my view from “orthogonality kinda makes sense” to “orthogonality is almost certainly correct and the implication is that alignment needs more care than humanity is currently giving it”.
You may also be better off discussing this more with your friend or with the various online communities.
You can also preorder. I’m hopeful that none of the AI labs will destroy the world before the book’s release : )
Yeah, I think the book is going to be (by a very large margin) the best resource in the world for this sort of use case. (Though I’m potentially biased as a MIRI employee.) We’re not delaying; this is basically as fast as the publishing industry goes, and we expected the audience to be a lot smaller if we self-published. (A more typical timeline would have put the book another 3-20 months out.)
If Eliezer and Nate could release it sooner than September while still gaining the benefits of working with a top publishing house, doing a conventional media tour, etc., then we’d definitely be releasing it immediately. As is, our publisher has done a ton of great work already and has been extremely enthusiastic about this project, in a way that makes me feel way better about this approach. “We have to wait till September” is a real cost of this option, but I think it’s a pretty unavoidable cost given that we need this book to reach a lot of people, not just the sort of people who would hear about it from a friend on LessWrong.
I do think there are a lot of good resources already online, like MIRI’s recently released intro resource, “The Problem”. It’s a very different beast from If Anyone Builds It, Everyone Dies (mainly written by different people, and independent of the whole book-writing process), and once the book comes out I’ll consider the book strictly better for anyone willing to read something longer. But I think “The Problem” is a really good overview in its own right, and I expect to continue citing it regularly, because having something shorter and free-to-read does matter a lot.
Some other resources I especially like include:
Gabriel Alfour’s Preventing Extinction from Superintelligence, for a quick and to-the-point overview of the situation.
Ian Hogarth’s We Must Slow Down the Race to God-Like AI (requires Financial Times access), for an overview with a bit more discussion of recent AI progress.
The AI Futures Project’s AI 2027, for a discussion focused on very near-term disaster scenarios. (See also a response from Max Harms, who works at MIRI.)
MIRI’s AGI Ruin, for people who want a more thorough and (semi)technical “why does AGI alignment look hard?” argument. This is a tweaked version of the LW AGI Ruin post, with edits aimed at making the essay more useful to share around widely. (The original post kinda assumed you were vaguely in the LW/EA ecosystem.)
Help me figure out how to recommend this to normie friends.
In my experience, “normal” folks are often surprisingly open to these arguments, and I think the book is remarkably normal-person-friendly given its topic. I’d mainly recommend telling your friends what you actually think, and using practice to get better at it.
Context: One of the biggest bottlenecks on the world surviving, IMO, is the amount (and quality!) of society-wide discourse about ASI. As a consequence, I already thought one of the most useful things most people can do nowadays is to just raise the alarm with more people, and raise the bar on the quality of discourse about this topic. I’m treating the book as an important lever in that regard (and an important lever for other big bottlenecks, like informing the national security community in particular). Whether you have a large audience or just a network of friends you’re talking to, this is how snowballs get started.
If you’re just looking for text you can quote to get people interested, I’ve been using:
Stephen Fry’s blurb from Nate’s post above might also be helpful here:
If your friends are looking for additional social proof that this is a serious issue, you could cite things like the Secretary-General of the United Nations:
(This is me spitballing ideas; if a bunch of LWers take a crack at figuring out useful things to say, I expect at least some people to have better ideas.)
You could also try sending your friends an online AI risk explainer, e.g., MIRI’s The Problem or Ian Hogarth’s We Must Slow Down the Race to God-Like AI (requires Financial Times access) or Gabriel Alfour’s Preventing Extinction from Superintelligence.
There’s also AIsafety.info’s short-er form and long form explainers.
One notable difficulty with talking to ordinary people about this stuff is that often, you lay out the basic case and people go “That’s neat. Hey, how about that weather?” There’s a missing mood, a sense that the person listening didn’t grok the implications of what they’re hearing. Now, of course, maybe they just don’t believe you or think you’re spouting nonsense. But in those cases, I’d expect more resistance to the claims, more objections or even claims like “that’s crazy”. Not bland acceptance.
I kinda think that people are correct to do this, given the normal epistemic environment. My model is this: Everyone is pretty frequently bombarded with wild arguments and beliefs that have crazy implications. Like conspiracy theories, political claims, spiritual claims, get-rich-quick schemes, scientific discoveries, news headlines, mental health and wellness claims, alternative medicine, claims about which lifestyles are better. We don’t often have the time (nor expertise or skill or sometimes intelligence) to evaluate them properly. So we usually keep track of a bunch of these beliefs and arguments, and talk about them, but usually require nearby social proof in order to attach the arguments/beliefs to actions and emotions. Rationalists (and the more culty religions and many activist groups, etc.) are extreme in how much they change their everyday lives based on their beliefs.
I think it’s probably okay to let people maintain this detachment? Maybe even better, because it avoids activating antibodies. It’s (usefully) something that’s hard to change with argument. It will plausibly fix itself later, if there ever comes a time when their friends are voting or protesting or something.
I recently told my dad that I wasn’t trying to save for retirement. This horrified him far more than when I had previously told him that I didn’t expect anyone to survive the next couple of decades. The contrast was funny.
I had a similar experience where my dad seemed unbothered by me saying AI might take over the world. Then some other day I mentioned in passing that I don’t know in how many years we’ll have AI that is a better software engineer than humans, but that 5-10 years doesn’t sound strictly impossible. My father, being a software engineer, found that claim more interesting (he was visibly upset about his job security). I notice I’ve kinda downplayed the retirement thing to my parents, because implicitly I sensed they might call me insane; but thinking about it explicitly, it might be more effective to communicate what is at stake.
I don’t know if you’ll find it helpful, but you inspired me to write up and share a post I plan to make on Facebook.
I preordered my copy.
Something about the tone of this announcement feels very wrong, though. You cite Rob Bensinger and other MIRI staff being impressed. But obviously, those people are highly selected for already agreeing with you! How much did you engage with skeptical and informed prereaders? (I’m imagining people in the x-risk-reduction social network who are knowledgeable about AI, acknowledge the obvious bare-bones case for extinction risk, but aren’t sold on the literal stated-with-certainty headline claim, “If anyone builds it, everyone dies.”)
If you haven’t already done so, is there still time to solicit feedback from such people and revise the text? (Sorry if the question sounds condescending, but the tone of the announcement really worries me. It would be insane not to commission red team prereaders, but if you did, then the announcement should be talking about the red team’s reaction, not Rob’s!)
We’re targeting a broad audience, and so our focus groups have been more like completely uninformed folks than like informed skeptics. (We’ve spent plenty of time honing arguments with informed skeptics, but that sort of content will appear in the accompanying online resources, rather than in the book itself.) I think that the quotes the post leads with speak to our ability to engage with our intended audience.
I put in the quote from Rob solely for the purpose of answering the question of whether regular LW readers would have anything to gain personally from the book—and I think that they probably would, given that even MIRI employees expressed surprise at how much they got out of it :-)
(I have now edited the post to make my intent more clear.)
I’m very glad you’ve used focus groups! Based solely on the title, the results are excellent. I’m idly curious how you assembled the participants.
Do you have a way to get feedback from Chinese nationalists? (“America Hawks” in China?).
This strikes me as straightforwardly not the purpose of the book. This is a general-audience book that makes the case, as Nate and Eliezer see it, for both the claim in the title and the need for a halt. This isn’t inside baseball on the exact probability of doom, whether the risks are acceptable given the benefits, whether someone should work at a lab, or any of the other favorite in-group arguments. This is For The Out Group.
Many people (like >100 is my guess), with many different viewpoints, have read the book and offered comments. Some of those comments can be shared publicly and some can’t, as is normal in the publishing industry. Some of those comments shaped the end result, some didn’t.
OK, but is there a version of the MIRI position, more recent than 2022, that’s not written for the outgroup?
I’m guessing MIRI’s answer is probably something like, “No, and that’s fine, because there hasn’t been any relevant new evidence since 2022”?
But if you’re trying to make the strongest case, I don’t think the state of debate in 2022 ever got its four layers.
Take, say, Paul Christiano’s 2022 “Where I Agree and Disagree With Eliezer”, disagreement #18:
If Christiano is right, that seems like a huge blow to the argumentative structure of If Anyone Builds It. You have a whole chapter in your book denying this.
What is MIRI’s response to the “but what about selective breeding” objection? I still don’t know! (Yudkowsky affirmed in the comment section that Christiano’s post as a whole was a solid contribution.) Is there just no response? I’m not seeing anything in the chapter 4 resources.
If there’s no response, then why not? Did you just not get around to it, and this will be addressed now that I’ve left this comment bringing it to your attention?
I’m replying in an awkward superposition, here:
- MIRI staff member, modestly senior (but not a technical researcher), this conversation flagged to my attention in a work Slack msg
- The take I’m about to offer is my own, and iirc has not been seen or commented on by either Nate or Eliezer, and my shoulder-copies of them are lukewarm about it at best
- Nevertheless I think it is essentially true and correct, and likely at least mostly representative of “the MIRI position” insofar as any single coherent one exists; I would expect most arguments about what I’m about to say to be more along the lines of “eh, this is misleading in X or Y way, or will likely imply A or B to most readers that I don’t think is true, or puts its emphasis on M when the true problem is N” as opposed to “what? Wrong.”
But all things considered, it still seems better to try to speak a little bit on MIRI’s behalf, here, rather than pretending that I think this is “just my take” or giving back nothing but radio silence. Grains of salt all around.
The main reason why the selective breeding objection seems to me to be false is something like “tiger fur coloration” + “behavior once outside of the training environment.” I’ll be drawing heavily on this previous essay, which also says a bunch of other stuff.
Tigers are bright orange. They’re not quite as incredibly visible in the jungle as one might naively think if one considers orange-on-green-foliage; in fact their striping does a lot of work even for human color vision.
But nevertheless, the main selection pressure behind their coloration came from prey animals who do a poor job of distinguishing oranges and reds from greens and browns. The detection algorithms they needed to evade don’t care about how visible bright orange is to humans, just about whether the overall gestalt works on deer/gazelles/antelopes/etc.
In other words: the selective breeding (in this case accomplished by natural selection rather than intentional selection, but still) produced a result that “squirted sideways” on an axis that was not being monitored and not itself under selection pressure.
Analogously: We should expect the evolution of AI systems to be responsive to selection pressures imposed upon them by researchers, but we should also expect them to be responsive only on the axes we actually know to enpressure. We do not have the benefit that a breeding program done on dogs or humans has, of having already “pinned down” a core creature with known core traits and variation being laid down in a fairly predictable manner. There’s only so far you can “stretch,” if you’re taking single steps at a time from the starting point of “dog” or “human.”
Modern AIs are much more blue-sky. They’re much more unconstrained. We already have no fucking idea what’s going on under the hood or how they are doing what they are doing, cf. that time on twitter that some rando was like “the interpretability guys are on it” and the actual interpretability guys showed up to say “no, we are hopelessly behind.”
We have systems that are doing who knows what, in a manner that is way more multidimensional and unconstrained than dogs or humans, and we’re applying constrictions to those systems, and there is very little (no?) reason to anticipate that those constrictions are sufficient/hit all of the relevant axes of variance. We don’t know where our processes are “colorblind,” and leaving tigers with bright orange fur, and we don’t know when we’ll hit scenarios in which the orange-ness of the tigers’ fur suddenly becomes strategically relevant.
To say the same thing in a different way:
An objection I didn’t have time for in the above piece is something like “but what about Occam, though, and k-complexity? Won’t you most likely get the simple, boring, black shape, if you constrain it as in the above?”
To which my answer is “tiger stripes.” You will indeed find the simple, boring black shape more easily on the two-dimensional axis of the screen. But the possibility space is vast, and there are many many more dimensions in play. Imagine, if it helps, that instead of the middle-schooler-accessible version above, what I actually drew was a bunch of examples whose cross-sections are all the simple, boring, black shape, and whose weirdnesses all lie in the unconstrained third dimension.
We can selectively “breed” AIs to behave a certain way, and succeed at that to a limited degree, but the AIs are way weirder than dogs or humans, and thus should be expected to at least have the capacity to behave “properly” in certain ways while actually being dangerously not-at-all-the-thing-that-behavior-would-imply, given the constraint of also being an already-known creature whose properties are at least pinned down to a reasonably finite space. A gazelle would not predict that a tiger’s fur is actually an insanely visible bright color, and a gazelle would be wrong.
EDIT: Perhaps a better intuition pump: We selectively bred bulldogs and pugs but couldn’t get them to be bulldogs and pugs as we wanted them without the breathing issues; imagine something like that except with many more axes of surprising variance. Even selective breeding starting with known, constrained critters and clearly defined targets doesn’t actually go well, and this is neither constrained nor clearly targeted.
This is why I’m concerned about deleterious effects of writing for the outgroup: I’m worried you end up optimizing your thinking for coming up with eloquent allegories to convey your intuitions to a mass audience, and end up not having time for the actual, non-allegorical explanation that would convince subject-matter experts (whose support would be awfully helpful in the desperate push for a Pause treaty).
I think we have a lot of intriguing theory and evidence pointing to a story where the reason neural networks generalize is that the parameter-to-function mapping is not a one-to-one correspondence, and is biased towards simple functions (as Occam and Solomonoff demand): to a first approximation, SGD is going to find the simplest function that fits the training data (because simple functions correspond to large “basins” of approximately equal loss, which are easy for SGD to find because they use fewer parameters or are more robust to some parameters being wrong), even though the network architecture is capable of representing astronomically many other functions that also fit the training data but have more complicated behavior elsewhere.
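To make that flavor of claim concrete, here’s a minimal toy sketch (overparameterized linear regression rather than a deep net, with made-up random data, so only suggestive): among the infinitely many weight vectors that fit the data exactly, plain gradient descent from zero initialization converges to the minimum-norm interpolant rather than to an arbitrary one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized setting: 5 data points, 50 parameters, so infinitely
# many weight vectors w satisfy X @ w == y exactly.
X = rng.normal(size=(5, 50))
y = rng.normal(size=5)

# Plain gradient descent on squared error, starting from w = 0.
w = np.zeros(50)
lr = 0.01
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y)

# The minimum-norm interpolant, computed directly via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("training error:", np.linalg.norm(X @ w - y))                      # ~0: w fits the data
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~0
```

The linear case is only suggestive, of course; the analogous inductive-bias story for deep networks is what the theory and evidence I mentioned above are about.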
But if that story is correct, then “But what about Occam” isn’t something you can offhandedly address as an afterthought to an allegory about how misalignment is the default because there are astronomically many functions that fit the training data. Whether the simplest function is misaligned (as posited by List of Lethalities #20) is the thing you have to explain!
But you must realize that this sounds remarkably like the safety case for the current AI paradigm of LLMs + RLHF/RLAIF/RLVR! That is, the reason some people think that current-paradigm AI looks relatively safe is that they think the capabilities of LLMs come from approximating the pretraining distribution, and RLHF/RLAIF/RLVR merely better elicits those capabilities by upweighting the rewarded trajectories (as evidenced by base models outperforming RL-trained models in pass@k evaluations for k in the hundreds or thousands) rather than discovering new “alien” capabilities from scratch.
If anything, the alignment case for SGD looks a lot better than that for selective breeding, because we get to specify as many billions and billions of input–output pairs for our network to approximate as we want (with the misalignment risk being that, as you say, if we don’t know how to choose the right data, the network might not generalize the way we want). Imagine trying to breed a dog to speak perfect English the way LLMs do!
LW is giving me issues and I’m having a hard time getting to and staying on the page to reply; I don’t know how good my further engagement will be, as a result.
I want to be clear that I think the only sane prior is on “we don’t know how to choose the right data.” Like, I don’t think this is reasonably an “if.” I think the burden of proof is on “we’ve created a complete picture and constrained all the necessary axes,” à la cybersecurity, and that the present state of affairs with regards to LLM misalignment (and all the various ways that it keeps persisting/that things keep squirting sideways) bears this out. The claim is not “impossible/hopeless,” but “they haven’t even begun to make a case that would be compelling to someone actually paying attention.”
(iiuc, people like Paul Christiano, who are far more expert than me and definitely qualify as “actually paying attention,” find the case more plausible/promising, not compelling. I don’t know of an intellectual with grounded expertise whom I respect who is like “we’re definitely good, here, and I can tell you why in concrete specifics.” The people who are confident are clearly hand-waving, and the people who are not hand-waving are at best tentatively optimistic. re: but your position is hand-wavey, too, Duncan—I think a) much less so, and b) burden of proof should be on “we know how to do this safely” not “exhaustively demonstrate that it’s not safe.”)
I am interested in an answer to Joe’s reply, which seems to me like the live conversational head.
To be clear, I agree that the situation is objectively terrifying and it’s quite probable that everyone dies. I gave a copy of If Anyone Builds It to two math professors of my acquaintance at San Francisco State University (and gave $1K to MIRI) because, in that context, conveying the fact that we’re in danger was all I had bandwidth for (and I didn’t have a better book on hand for that).
But in the context of my own writing, everyone who’s paying attention to me already knows about existential risk; I want my words to be focused on being rigorous and correct, not scaring policymakers and the public (notwithstanding that policymakers and the public should in fact be scared).
To the end of being rigorous and correct, I’m claiming that the “each of these black shapes is basically just as good at passing that particular test” story isn’t a good explanation of why alignment is hard (notwithstanding that alignment is in fact hard), because of the story about deep net architectures being biased towards simple functions.
I don’t think “well, I’m pitching to middle schoolers” saves it. If the actual problem is that we don’t know what training data would imply the behavior we want, rather than the outcomes of deep learning being intrinsically super-chaotic—which would be an entirely reasonable thing to suspect if it’s 2005 and you’re reasoning abstractly about optimization without having any empirical results to learn from—then you should be talking about how we don’t know what teal shape to draw, not that we might get a really complicated black shape for all we know.
I am of course aware that in the political arena, the thing I’m doing here would mark me as “not a team player”. If I agree with the conclusion that superintelligence is terrifying, why would I critique an argument with that conclusion? That’s shooting my own side’s soldiers! I think it would be patronizing for me to explain what the problem with that is; you already know.
I do not see you as failing to be a team player re: existential risk from AI.
I do see you as something like … making a much larger update on the bias toward simple functions than I do. Like, it feels vaguely akin to … when someone quotes Ursula K. LeGuin’s opinion as if that settles some argument with finality?
I think the bias toward simple functions matters, and is real, and is cause for marginal hope and optimism, but “bias toward” feels insufficiently strong for me to be like “ah, okay, then the problem outlined above isn’t actually a problem.”
I do not, to be clear, believe that my essay contains falsehoods that become permissible because they help idiots or children make inferential leaps. I in fact thought the things that I said in my essay were true (with decently high confidence), and I still think that they are true (with slightly reduced confidence downstream of stuff like the link above).
(You will never ever ever ever ever see me telling someone a thing I know to be false because I believe that it will result in them outputting a correct belief or a correct behavior; if I do anything remotely like that I will headline explicitly that that’s what I’m doing, with words like “The following is a lie, but if you pretend it’s true for a minute you might have a true insight downstream of it.”)
(That link should take you to the subheading “Written April 2, 2022.”)
I think that we don’t know what teal shape to draw, and that drawing the teal shape perfectly would not be sufficient on its own. In future writing I’ll try to twitch those two threads a little further apart.
You’re right; Steven Byrnes wrote me a really educational comment today about what the correct goal-counting argument looks like, which I need to think more about; I just think it’s really crucial that this is fundamentally an argument about generalization and inductive biases, which I think is being obscured in the black-shape metaphor when you write that “each of these black shapes is basically just as good at passing that particular test” as if it didn’t matter how complex the shape is.
(I don’t think talking to middle schoolers about inductive biases is necessarily hopeless; consider a box behind a tree.)
I think the temptation to frame technical discussions in terms of pessimism vs. optimism is itself a political distortion that I’m trying to avoid. (Apparently not successfully, if I’m coming off as a voice of marginal hope and optimism.)
You wrote an analogy that attempts to explain a reason why it’s hard to make neural networks do what we want; I’m arguing that the analogy is misleading. That disagreement isn’t about whether the humans survive. It’s about what’s going on with neural networks, and the pedagogy of how to explain it. Even if I’m right, that doesn’t mean the humans survive: we could just be dead for other reasons. But as you know, what matters in rationality is the arguments, not the conclusions; not only are bad arguments for a true conclusion still bad, even suboptimal pedagogy for a true lesson is still suboptimal.
This is good, but I think not saying false things turns out to be a surprisingly low bar, because the selection of which true things you communicate (and which true things you even notice) can have a large distortionary effect if the audience isn’t correcting for it.
I want to first acknowledge strongly that yep, we are mostly on the same side about getting a much better future than everyone dying to AIs.
I note this is not necessarily true for MIRI; we are trying very hard on purpose to reach and inform more people.
The two can be compatible!
I perceive at least two separate critiques, and I want to address them both without cross-contamination. (Please correct me if these miss the mark.)
Hypothesis 1: Maybe MIRI folks have wrong world-models (possibly due to insufficient engagement with sophisticated disagreement).
Hypothesis 2: Maybe MIRI folks are prioritizing their arguments badly for actually stopping the AI race.
Regarding Hypothesis 1, there’s a tradeoff between refining and polishing one’s world-model, and acting upon that world-model to try to accomplish things.
Speaking only for myself, there are many possible things I could be writing or saying, and only finite time to write or say them in. For the moment, I mostly want my words to be focused on (productively) scaring policymakers and the public, because they should in fact be scared.
This obviously does not preclude writing for and talking with the ingroup, nor continuing to refine and polish my own world-model.
But...well, I feel like I’ve mostly hit diminishing returns on that, both when it comes to updating my own models and when it comes to updating those of others like me. So the balance of time-spent naturally tips towards outreach.
To borrow from your comment below, in regards to Hypothesis 2...
...for one thing, I’m not sure how true this is? Policymakers and the public can sometimes both be swayed by appeals to intuition. Skeptical experts can be really hard to convince. Especially after the Nth iteration of debate has passed and a lot of ideas have congealed.
Again, there’s a tradeoff here, a matter of how much time one spends making cases to audiences of various levels of informed or uninformed skepticism. I’m not sure what the right balance is, but for myself at least, it’s probably not a primary focus on convincing Paul Christiano of things. Tactical priorities can differ from person to person, of course.
Caveat 1: Again, I speak for myself here. I admittedly have much less context on the decades-long back-and-forth than some of my colleagues.
Caveat 2: No matter who I’m trying to convince, I do want my arguments to rest on a solid foundation. If an interlocutor digs deep, the argument-hole they unearth should hold water. To, uh, rather butcher a metaphor.
But this just rounds back to my response to Hypothesis 1 - thanks to the magic of the Internet (and Lightcone Infrastructure in particular) you can always find someone with a sophisticated critique to level at your supposedly solid foundation. At some point you do have to take your best guess about what’s true and robust and correct according to your current world-model, then go and try to share it outside the crucible of LessWrong forums.
With all that being said, sure, let’s talk world-models. (With, again, the caveat that this is all my own limited take as someone who spent most of the 2010s doing reliability engineering and not alignment research.)
I think I follow your argument that one might say “we don’t know how to draw the teal thing” instead. But this seems more quibble than crux. I don’t think you addressed Duncan’s core point, which is that “we don’t know how to draw the teal thing” is the correct prior? (i.e. we don’t know how to select training data in a way that constrains an AI to learn to explicitly, primarily, and robustly value human flourishing.)
And if in fact we don’t know how to draw the right metaphorical teal thing, then the metaphorical black thing could take on various shapes that appear weird and complicated to us, but that actually reflect an underlying simplicity of which we are unaware. So it doesn’t seem wrong to claim that the black thing could take some (apparently) weird and (apparently) complex shape, given the assumption that we can’t draw a sufficiently constraining teal thing.
More broadly, I think I’m missing some important context, or just failing to follow your logic. I don’t see how a bias towards simple functions implies a convergence towards nonlethal aims. We don’t know what would be the simplest functions that approximate current or future training data. Why believe they would converge on something conveniently safe for us? [1]
From the papers you cite, I can see how one would conclude that AIs will be efficient, but I don’t see how they imply that AIs will be nice.
In the aforementioned spirit of rigor, I’m trying to avoid saying “human values” because those might not be good enough either. Many humans do not prioritize conscious flourishing! An ASI that doesn’t hold conscious wellbeing as its highest priority likely kills everyone as a side effect of optimizing for other ends, etc. etc.
I mean, before concluding that you’ve hit diminishing returns, have you looked at one of the standard textbooks on deep learning, like Prince 2023 or Bishop and Bishop 2024? I don’t think I’m suggesting this out of pointless gatekeeping. I actually unironically think if you’re devoting your life to a desperate campaign to get world powers to ban a technology, it’s helpful to have read a standard undergraduate textbook about the thing you’re trying to ban.
I mean, you can get a pretty good idea what the simplest function that approximates the data is like by, you know, looking at the data. (In slogan form, the model is the dataset.) Thus, language models—not hypothetical future superintelligences which don’t exist yet, but the actual technology that people are working on today—seem pretty safe for basically the same reason that text from the internet is safe: you’re sampling from the webtext distribution in a customized way.
(In more detail: you use gradient descent to approximate a “next token prediction” function of internet text. To make it more useful, we want to customize it away from the plain webtext distribution. To help automate that work, we train a “reward model”: basically, you start with a language model, but instead of the unembedding matrix which translates the residual stream to token probabilities, you tack on a layer that you train to predict human thumbs-up/thumbs-down ratings. Then you generate more samples from your base model, and use the output of your reward model to decide what gradient updates to do on them—with a Kullback–Leibler constraint to make sure you don’t update so far as to do something that it would be wildly unlikely for the original base model to do. It’s the same gradients you would get from adding more data to the pretraining set, except that the data is coming from the model itself rather than webtext, and the reward model puts a “multiplier” on the gradient: high reward is like training on that completion a bunch of times, and negative reward is issuing gradient updates in the opposite direction, to do less of that.)
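If a toy numerical sketch of the shape of that update helps (a single categorical distribution standing in for the language model, a made-up reward vector standing in for the reward model; this is not any lab’s actual training code): the update ascends expected reward minus a KL penalty that anchors the tuned policy to the base model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy stand-in for a language model: one softmax over a 6-"token" vocabulary.
base_logits = rng.normal(size=6)   # frozen "base model" (the webtext approximation)
logits = base_logits.copy()        # the policy being fine-tuned

# Made-up reward model: pretend the thumbs-up predictor likes token 2, dislikes token 5.
reward = np.array([0.0, 0.0, 1.0, 0.0, 0.0, -1.0])

beta = 0.1  # strength of the KL penalty keeping the policy near the base model
lr = 0.5

for _ in range(200):
    p = softmax(logits)
    p_base = softmax(base_logits)
    # Per-token "advantage": reward, minus a penalty for drifting away from
    # what the base model would have said.
    adv = reward - beta * np.log(p / p_base)
    # Exact gradient (w.r.t. the logits) of E_p[reward] - beta * KL(p || p_base)
    # for a softmax policy; the sampled REINFORCE version would estimate this.
    grad = p * (adv - p @ adv)
    logits += lr * grad

print("base policy: ", np.round(softmax(base_logits), 3))
print("tuned policy:", np.round(softmax(logits), 3))  # shifted toward token 2, away from token 5
```

The real thing operates on sampled completions with stochastic gradient estimates rather than this exact expectation, but the qualitative picture is the one described above: the reward acts as a multiplier on the update, and the KL term keeps the tuned model from straying far from the base distribution.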
That doesn’t mean future systems will be safe. At some point in the future, when you have AIs training other AIs on AI-generated data too fast for humans to monitor, you can’t just eyeball the data and feel confident that it’s not doing something you don’t want to happen. If your reward model accidentally reinforces the wrong things, then you get more of the wrong things. Importantly, this is a different threat model than “you don’t get what you train for”. In order to react to that threat in a dignified way, I want people to have read the standard undergraduate textbooks and be thinking about how to do better safety engineering in a way that’s oriented around the empirical details. Maybe we die either way, but I intend to die as a computer scientist.
I am in favor of learning more programming! During the two years I spent pivoting from reliability engineering, I did in fact attempt some hands-on machine learning code. My brain isn’t shaped in such a way that reading textbooks confers meaningful coding skills—I have to Actually Do the Thing—but I did try Actually Doing the Thing, reading and all.
I later facilitated BlueDot’s alignment and governance courses, and went through their reading material several times over in the process.
I now face a tradeoff between learning more ML, which is doable but extremely time-consuming, and efforts to convince policymakers not to let labs build ASI. It seems overwhelmingly overdetermined that my (marginal) time is best spent on the second thing. I see my primary comparative advantage as attempting to buy more time for developing solutions that might actually save us.
...which does unfortunately mean it’s going to take me a while to properly digest your argument-from-dataset-approximation. Doesn’t mean I won’t try.
Even attempting to take it as given, though, I’m confused by your conclusion, because you seem to be simultaneously saying “[language models approximating known datasets we can squint at] is a reason we know current systems are safe” and “this reason will not generalize to ASI” and “this answers the quoted question of why [ASI] would converge on something conveniently safe for us”.
Isn’t this the default path? Don’t most labs’ plans to build ASI run through massive use of AI-generated data? Even if I accept the premise that you can confidently assure safety by eyeballing data today, this doesn’t do much to reassure me if you then agree that it doesn’t generalize.
So I’m still not seeing how this supports the crux that “implement everyone’s CEV” (or, whichever alternative goalset you consider safe) is likely the simplest [function that approximates the datasets that will be used to create ASI].[1]
(Also, at this point I kind of want to taboo ‘dataset’ because it feels like a very overloaded term.)
Brackets, because I’m not even sure this is representing you right. Possibly it should be [function reflected by the dataset] or some other thing.
There has been a miscommunication. I’m not saying CEV or ASI alignment is easy. This thread started because I was critiquing the analogy about teal and black shapes in the article “Deadly By Default”, because the analogy taken at face value lends itself to a naïve counting argument of the form, “There are any number of AIs that could perform well in training, so who knows which one we’d end up with?!” I’m claiming that that argument as stated is wrong (although some more sophisticated counting argument could go through), because inductive biases are really important.
Maybe if you’re just trying to scare politicians and the public, “inductive biases are really important” doesn’t come up on your radar, but it’s pretty fundamental for, um, actually understanding the AI alignment problem humanity is facing!
I notice I’m still confused about your argument.
It seems to me that the question of whether [safe or intended goalset] is [the simplest function] is extremely relevant to the question of whether the argument as stated is wrong.
As I understand things right now, we seem to generally agree that:
An entity chooses data (teal thing) they think represents the shape they want an AI to be (black thing).
Gradient descent seeks simple functions that approximate the data.
The [test / selected data] are (probably) insufficient to constrain the resulting shape to what the makers intend.
The AI (probably) grows into a shape the maker(s) did not intend.
You said (emphasis added):
You seem to be saying “because deep learning privileges simple functions, the claim that many different AIs could pass our test is false.” I don’t see how that follows, because:
It is still the case that many different [AIs / simple functions] could [pass the test / approximate the dataset]. When we move the argument one level deeper, the original claim still holds true. Maybe I’m still just misunderstanding, though.
...or maybe you are only saying that the explanation as written is bad at taking readers from (1) to (4) because it does not explicitly mention (2), i.e. not technically wrong but still a bad explanation. In that case it seems we’d agree that (2) seems like a relevant wrinkle, and that writing (3) with “selected data” instead of “test” adds useful and correct nuance. But I don’t see how it makes Duncan’s summary either untrue or misleading, because eliding it doesn’t change (1) or (4).
...or maybe you are saying it was a bad explanation for you, and for readers with your level of sophistication and familiarity with the arguments, and thus a bad answer to your original question. Which is...kinda fair? In that case I suppose you’d be saying “Ah, I notice you are making an assumption there, and I agree with the assumption, but failing to address it is bad form and I’m worried about what that failure implies.” (I’ll hold off on addressing this argument until I know whether you’d actually endorse it.)
...or maybe you also flatly disagree with (3)? Like, you disagree with Duncan’s
...and in that case, excellent! We have surfaced a true crux. And it makes perfect sense from that perspective to say “the metaphor is wrong” because, from that perspective, one of its key assumptions is false.
Importantly, though, that looks to me like an object-level disagreement, and not one that reflects bad epistemics, except insofar as one believes that any disagreement must be the result of bad epistemics.
But bad explanations are wrong, untrue, and misleading.
Suppose the one comes to you and says, “All squares are quadrilaterals; all rectangles are quadrilaterals; therefore, all squares are rectangles.” That argument is wrong—”technically” wrong, if you prefer. It doesn’t matter that the conclusion is true. It doesn’t even matter that the premises are also true. It’s just wrong.
Okay, but why is it wrong though? I still haven’t seen a convincing case for that! It sure looks to me like, given an assumption which I still feel confused about whether you share, the conclusion does in fact follow from the premises, even in metaphor form.
I am open to the case that it’s a bad argument. If it is in fact a bad argument then that’s a legitimate criticism. But from my perspective you have not adequately spelled out how “deep nets favor simple functions” implies it’s a bad argument.
You said, “I don’t see how [not mentioning inductive biases] makes Duncan’s summary either untrue or misleading, because eliding it doesn’t change (1) [we choose “teal shape” data to grow the “black shape” AI] or (4) [we don’t get the AI we want].” But the point of the broken syllogism in the grandparent is that it’s not enough for the premise to be true and the conclusion to be true; the conclusion has to follow from the premise.
The context of the teal/black shape analogy in the article is an explanation of how “modern AIs aren’t really designed so much as grown or evolved” with the putative consequence that “there are many, many, many different complex architectures that are consistent with behaving ‘properly’ in the training environment, and most of them don’t resemble the thing the programmers had in mind”.
Set aside the question of superintelligence for the moment. Is this true as a description of “modern AIs”, e.g., image classifiers? That’s not actually clear to me.
It is true that adversarially robust image classification isn’t a solved problem, despite efforts: it’s usually possible (using the same kind of gradient-based optimization used to train the classifiers themselves) to successfully search for “adversarial examples” that machines classify differently than humans, which isn’t what the programmers had in mind.
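For concreteness, here is a minimal sketch of the single-step (FGSM-style) version of that gradient-based search; `model` is any differentiable image classifier, and the step size is illustrative rather than anyone’s canonical attack:

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, image, label, epsilon=8 / 255):
    """Nudge a batched `image` tensor (values in [0, 1]) within an L-infinity
    ball of radius `epsilon` in the direction that increases the classifier's
    loss on the true label -- the same gradients training itself relies on."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```

Stronger searches just iterate this step (projected gradient descent), but the basic point stands: the same optimization machinery that finds the classifier is also good at finding inputs the classifier gets wrong.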
But Ilyas et al. 2019 famously showed that adversarial examples are often due to “non-robust” features that are doing predictive work, but which are counterintuitive to humans. That would be an example of our data pointing at, as you say, an “underlying simplicity of which we are unaware”.
I’m saying that’s a different problem than a counting argument over putative “many, many, many different complex architectures that are consistent with behaving ‘properly’ in the training environment”, which is what the black/teal shape analogy seems to be getting at. (There are many, many, many different parametrizations that are consistent with behaving properly in training, but I’m claiming that the singular learning theory story explains why that might not be a problem, if they all compute similar functions.)
Is this in fact a crux for you? If you were largely convinced that the simplest functions found by gradient descent in the current paradigm would not remotely approximate human values, to what extent would this shift your odds of the current paradigm getting everyone killed?
I mean, yes. What else could I possibly say? Of course, yes.
In the spirit of not trying to solve the entire alignment problem at once, I find it hard to say how much my odds would shift without a more specific question. (I think LLMs are doing a pretty good job of knowing and doing what I mean, which implies some form of knowledge of “human values”, but an LLM is only a natural-language instruction-follower; it’s not supposed to be a sovereign superintelligence, which looks vastly harder, and I would rather people not attempt that for a long time.) Show me the arXiv paper about inductive biases that I’m supposed to be updating on, and I’ll tell you how much more terrified I am (above my baseline of “already pretty terrified, actually”).
unless it was
gazelliezer
(i’m so sorry)
I really like this explanation. I think your exposition on the problem (from your post) is amongst the best that I’ve seen, and I like the tiger example as an additional supplement.
But I want to push back on it, a bit.
In the example of the tiger’s coloration, natural selection was applying pressure to the tiger’s visibility to prey animals. This did produce a “weird” outcome along some other dimension that pressure was not being applied to.
But crucially, that weirdness didn’t interfere with the metric that natural selection was pushing on. Natural selection totally succeeded at making tigers that are camouflaged to the sight of prey animals.
Shouldn’t this analogy, naively interpreted, suggest that human developers are going to optimize along the dimensions that we care about, and totally succeed, and also the AI will end up pushed in weird unexpected directions along other dimensions that we don’t care about, and maybe can’t even notice?
I expect you to respond with something along the lines of “well, yeah, AI developers will maybe succeed at shaping the AI along a bunch of specific dimensions. But they will not succeed at exhaustively shaping the AI along all dimensions that turn out to matter.”
(Or perhaps you’d say “yeah, but don’t take the analogy too seriously.”)
well, yeah, AI developers will maybe succeed at shaping the AI along a bunch of specific dimensions. But they will not succeed at exhaustively shaping the AI along all dimensions that turn out to matter.
now what say you to this clever rejoinder?
All dimensions that turn out to matter for what? Current AI is already implicitly optimizing people to use the word “delve” more often than they otherwise would, which is weird and unexpected, but not that bad in the grand scheme of things. Further arguments are needed to distinguish whether this ends in “humans dead, all value lost” or “transhuman utopia, but with some weird and unexpected features, which would also be true of the human-intelligence-augmentation trajectory.” (I’m not saying I believe in the utopia, but if we want that Pause treaty, we need to find the ironclad arguments that convince skeptical experts, not just appeal to intuition.)
Those developing powerful technologies should treat exotic failure scenarios as major bugs.
Right, but I think a big part of how safety team earns its dignity points is by being as specific as possible about exactly how capabilities team is being suicidal, not just with metaphors and intuition pumps, but state-of-the-art knowledge: you want to be winning arguments with people who know the topic, not just policymakers and the public. My post on adversarial examples (currently up for 2024 Review voting) is an example of what I think this should look like. I’m not just saying “AI did something weird, therefore AI bad”, I’m reviewing the literature and trying to explain why the weird thing would go wrong.
I agree directionally and denotationally with this, but I feel the need to caution that “winning arguments” is itself a very dangerous epistemic frame to inhabit for long.
Also...
We do that too! There’s a lot of ground to cover.
Fwiw the post also includes Stephen Fry, Tim Urban, and Yishan Wong (although I agree these people don’t have deep AI x-risk experience)
Tim Urban has written about AI X-risk before, in a way that indicates that he’s spent a good bit of time thinking about the problem. But, the point of the book seems to be to speak to people who don’t have a deep knowledge of AI risk.
Not sure what he’s done on AI since, but Tim Urban’s 2015 AI blog post series mentions how he was new to AI or AI risk and spent a little under a month studying and writing those posts. I re-read them a few months ago and immediately recommended them to some other people with no prior AI knowledge, because they have held up remarkably well.
Unfortunately, the graphic below doesn’t include the simple case of just flatly stating something, but I’m interested in people’s interpretation of the confidence level a title like this conveys. I think a reasonable starting point is interpreting it as 90% confidence. I couldn’t quickly find what percentage of AI safety researchers have 90% confidence in extinction (not just catastrophe or disempowerment), but it’s less than 1% in the AI Impacts survey, which covered both safety and capabilities researchers. I couldn’t find a comparable figure for the public. Still, I think almost everyone will just bounce off this title. But I understand that’s what the authors believe, and perhaps it could have some influence on the relatively few extreme doomers among the public?
Edited to add: After writing this, I asked Perplexity what P(doom) someone should have to be called an extreme doomer, and it said 90%+ and mentioned Yud. Of course, extreme doesn’t necessarily mean wrong. And since only about 10,000 copies need to sell in a week to make the NYT bestseller list, that could very well happen even if 99% of people bounce off the title.
You think people don’t read books if they confidently disagree with the title? (Not rhetorical; I read books I confidently disagree with but I’m not an average book reader.)
What about people who aren’t coming in with a strong opinion either way? Isn’t that most potential readers, and the main target audience?
E.g. “The Myth of the Rational Voter” book title implies a strong claim that voters are not rational. If I had walked by that book on a bookshelf 15 years ago (before I knew anything about the topic or author), I imagine that I would have been intrigued and maybe bought it, not because I already confidently believed that voters are not rational but because, I dunno, it might have seemed interesting and fun to read, on a topic I didn’t already know much about, so maybe I’d learn something.
Yes, I was thinking of adding that it could appeal to contrarians who may be attracted to a book with a title they disagreed with. As for people who don’t have a strong opinion coming in, I can see some people being attracted to an extreme title. And I get that titles need to be simple. I think a title like “If anyone builds it, we lose control” would be more defensible. But I think the probability distributions from Paul Christiano are more reasonable.
Note that IFP (a DC-based think tank) recently had someone deliver copies of their new book to all 535 US Congressional offices.
Note also that my impression is that DC people (even staffers) are much less “online” than tech audiences. Whether or not you copy IFP, I would suggest thinking about in-person distribution opportunities for DC.
I would note that this is, indeed, a very common move in DC. I would also note that many of these copies end up in, e.g., Little Free Libraries and at the Goodwill. (For example, I currently have downstairs a copy of the book by the President of Microsoft’s Board, with the letter literally still inside saying “Dear Congressman XYZ, I hope you enjoy my book...”)
I am not opposed to MIRI doing this, but just want to flag that this is a regular move in DC. (Which might mean you should absolutely do it, since it has survived long enough to be a good Lindy idea! Just saying it ain’t, like, a brand new strat.)
Would be nice if you can get a warm intro for the book to someone high up in the Vatican too, as well as other potentially influential groups.
Love the title
Agree. There’s no way to present the alarming idea of AI doom without sounding alarmist. So it seems to me the next-best thing is to communicate it clearly in plain English without complicated words, and which can’t be misunderstood (in contrast to co-opted terms like “AI Safety”). That’s what this title does, so I like it.
it’s the title-based impact optimization for me
It’s a great book: it’s simple, memorable, and unusually convincing.
Are there any plans for Russian translation? If not, I’m interested in creating it (or even in organizing a truly professional translation, if someone gives me money for it).
There’s a professional Russian translator lined up for the book already, though we may need volunteer help with translating the online supplements. I’ll keep you (and others who have offered) in mind for that—thanks, Tapatakt. :)
How’s the Simplified Chinese translation coming along?
We’re still in the final proofreading stages for the English version, so the translators haven’t started translating yet. But they’re queued up.
Given the potentially massive importance of a Chinese version, it may be worth burning $8,000 to start the translation before proofreading is done, particularly if your translators come back with questions that are better clarified in the English text. I’d pay money to help speed this up if that’s the bottleneck[1]. When I was in China I didn’t have a good way of explaining what I was doing and why.
I’m working mostly off savings and wouldn’t especially want to, but I would, to make it happen.
Something I’ve done in the past is to run text I intended to have translated through machine translation and then back, with low latency, to gain confidence in the semantic stability of the process.
Rewrite english, click, click.
Rewrite english, click, click.
Rewrite english… click, click… oh! Now it round trips with high fidelity. Excellent. Ship that!
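In case anyone wants to automate that loop, here is a minimal sketch, assuming a hypothetical `translate(text, src, dst)` wrapper around whatever machine-translation service you use; the similarity check is deliberately crude:

```python
from difflib import SequenceMatcher

def round_trips_ok(text, translate, target_lang="zh", threshold=0.85):
    """Send `text` through machine translation and back, and report whether
    the English that comes back is close enough to what went in.
    `translate(text, src, dst)` is a stand-in for your MT service of choice."""
    there = translate(text, src="en", dst=target_lang)
    back = translate(there, src=target_lang, dst="en")
    similarity = SequenceMatcher(None, text.lower(), back.lower()).ratio()
    return back, similarity >= threshold

# Rewrite the English, rerun, and repeat until the check passes; then ship that.
```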
Excellent. I cannot convey how pleased I am that I did not have to explain myself.
Hi, I’ve pre-ordered it on the UK Amazon; I hope that works for you. Let me know if I should do something different.
I have a number of reasonably well-respected friends at the University of Cambridge and in its associated tech-sphere; I can try to get some of them to give endorsements, if you think that would help and can send me a PDF.
This will be a huge help when talking to political representatives. Reaching out to politicians as an AI safety volunteer over the past six months, I got a range of reactions:
They’re aware of this issue but can’t get traction with fellow politicians; it needs visible public support first
They’re aware but this issue is too complex for the public to understand
They’re unaware but also the public is focused on immediate issues like housing and cost of living
They’re unaware, sounds important, but they lack the resources to look into it
Having a professionally published book will help with all those responses. I am preordering!
I preordered it.
I just pre-ordered.
I agree that the cover art seems notably bad. The white text on black background in that font looks like some sort of autogenerated placeholder. I understand you feel over-constrained—this is just another nudge to think creatively about how to overcome your constraints, e.g. route around your publisher and hire various artists on your own, then poll your friends on the best design.
I would encourage you to send free review copies to prominent nontechnical people who are publicly complaining about AI, if you’re not already doing so. Here are some examples I saw in the past few days; I’m sure a dedicated search could turn up lots more (and I encourage people to reply to this comment with more examples):
I would offer both Ted Gioia and the new Pope advance copies. Edit: Pope John XXIII’s letter sent during the Cuban Missile Crisis could be an interesting case study here.
This was retweeted by Emma Ashford
Edit: Come to think of it, perhaps there is no reason to preferentially send copies to those who are more inclined to agree? Engaging skeptics of AI risk like e.g. Tyler Cowen might be a good opportunity to show them a better argument and leave them a “line of retreat” to change their mind without losing face? (“I previously believed X, but reading book B convinced me of Y.”)
Also, as long as we’re talking about public engagement on AI, I’m going to plug this comment I wrote a few days ago, which deserves more attention IMO. Maybe the launch of this book could serve as a platform for a high-profile bet or adversarial collaboration with a prominent AI doom skeptic?
I think this is an interesting idea and should be done!
I have pre-ordered it! Hopefully a German pre-order from a local bookstore will make a difference. :-)
For those who can’t wait, and most people here probably already know, here is Eliezer’s latest interview on that topic: https://www.youtube.com/watch?v=0QmDcQIvSDc. I’m halfway through it and I really like how clearly he thinks and makes his argument; it’s still deeply disturbing though.
If you want to hear a younger, more optimistic Eliezer, here’s the recording of his Hard AI Future Salon talk, way back in 2006. :-)
https://archive.org/details/FutureSalon_02_2006
He starts his talk at minute 12. There are excellent questions from the audience as well.
I don’t know anyone who has thought about and tried to steer us in the right direction on this problem more deeply or for longer than him.
Yep, this counts! :)
Any info on what counts as “bulk”? I share an Amazon Prime account with my family, so if we each want to buy copies, does it need to be separate orders, separate shipping/billing addresses, separate accounts, or separate websites to not count as “bulk”?
Is an audiobook version also planned, perchance? Could preordering that one also help?
Judging from Stephen Fry’s endorsement and his apparently longstanding interest in the topic, perhaps a delightful and maybe even eager deal could be made where he narrates? Unless some other choice would be better for either party, of course. I also understand if negotiations or existing agreements prevent anyone from confirming anything on this front; either way, I’d be happy to hear whether an audio version is planned or intended to begin with, and when, if that can be known.
There is indeed an audiobook version; the site links to https://www.audible.com/pd/If-Anyone-Builds-It-Everyone-Dies-Audiobook/B0F2B8J9H5 (where it says it’ll be available September 30) and https://libro.fm/audiobooks/9781668652657-if-anyone-builds-it-everyone-dies (available September 16).
Any updates on the cover? It seems to matter quite a bit; this market has a trading volume of 11k mana and 57 different traders:
https://manifold.markets/ms/yudkowsky-soares-change-the-book-co?r=YWRlbGU
This market uses play money.
The standard arguments for why traders are incentivized to put their money where their mouth is, and therefore push the market value closer to the “correct” probabilities, do not apply.
Writing a book is an excellent idea! I found other AI books like Superintelligence much more convenient and thorough than navigating blog posts. I’ve pre-ordered the book and I’m looking forward to reading it when it comes out.
I just pre-ordered 10 copies. Seems like the most cost effective way to help that I’ve seen in a long time. (Though yes I’m also going to try to distribute my copies.)
I think that’s what they meant you should not do when they said [edit to add: directly quoting a now-modified part of the footnote] “Bulk preorders don’t count, and in fact hurt.”
Oops ok then i guess i will cancel my order
My guess is that “I’m excited and want a few for my friends and family!” is fine if it’s happening naturally, and that “I’ll buy a large number to pump up the sales” just gets filtered out. But it’s hard to say; the people who compile best-seller lists are presumably intentionally opaque about this. I wouldn’t sweat it too much as long as you’re not trying to game it.
Edit: this comment seems to be incorrect; see comments.
They did not say that, on my reading? The footnote about bulk preorders says
Which I read as semantically equivalent to “bulk preorders neither help nor hurt” with an implicit “so don’t do a bulk preorder solely for the sake of making the numbers better, but other reasons are acceptable.”
They edited the text. It was an exact quote from the earlier text.
Online advertising can be used to promote books. Unlike many authors, you are not trying to make a profit, so you can pay for advertising beyond the point where the publisher’s marginal cost equals marginal revenue. Do you:
Have online advertising campaigns set up by your publisher and can absorb donations to spend on more advertising (an LLM doubts Little, Brown and Company lets authors spend more money)
Have $$$ to spend on an advertising campaign but don’t have the managerial bandwidth to set one up. You’d need logistics support to set up an effective advertising campaign.
Need both money and logistics for an advertising campaign.
Alphabet and Meta employees get several hundred dollars per month to spend on advertising (as an incentive to dogfood their product). If LessWrongers employed at those companies set up many $300/month advertising campaigns, that sounds like a worthwhile investment
Need neither help setting up an advertising campaign nor funds for more advertising (though donations to MIRI are of course always welcome)
We have an advertising campaign planned, and we’ll be working with professional publicists. We have a healthy budget for it already :-)
Are there planned translations in general, or is that something that is discussed only after actual success?
A variety of translations are lined up.
Quick note that I can’t open the webpage via my institution (same issue on multiple browsers). Their restrictions can be quite annoying and get triggered a lot. I can view it myself easily enough on my phone, but if you want this to get out, beware trivial inconveniences and all that...
Firefox message is below.
Secure Connection Failed
An error occurred during a connection to ifanyonebuildsit.com. Cannot communicate securely with peer: no common encryption algorithm(s).
Error code: SSL_ERROR_NO_CYPHER_OVERLAP
The page you are trying to view cannot be shown because the authenticity of the received data could not be verified.
Please contact the web site owners to inform them of this problem.
Huh, that sure is weird. Looking into it, it seems that this would only happen if the institution’s network is forcing outdated SSL protocols, which really isn’t great, since SSL exploits seem reasonably common and very bad.
https://vercel.com/guides/resolve-err-ssl-protocol-error-with-vercel
Not much I can do for now. I hope not many networks do this. If they do, I might think about doing something complicated to work around it.
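(For anyone else hitting this who wants to check whether the problem is their network or the site, here’s a quick probe, my own sketch using only the Python standard library, that reports what a modern client negotiates with the server.)

```python
import socket
import ssl

def probe_tls(host="ifanyonebuildsit.com", port=443):
    """Report the TLS version and cipher negotiated with a modern default
    client context. If this succeeds from home but fails on an institutional
    network, a middlebox is likely downgrading or intercepting the connection."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version(), tls.cipher()

print(probe_tls())  # e.g. ('TLSv1.3', ('TLS_AES_128_GCM_SHA256', 'TLSv1.3', 128))
```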
I initially read the title of the post as
Quite intimidating!
Insider trading by anyone who can help on the Yes side is welcome :)
Very exciting; thanks for writing!
I know this is minor, but the image at the bottom of the website looks distractingly wrong to me: the lighting doesn’t match where real population centers are. It would be a lot better with something either clearly adapted from the real world or something clearly created, but this is pretty uncanny valley.
Here in Australia I can only buy the paperback/hardcover versions. Any chance you can convince your publisher/publishers to release the e-book here too?
I’m told that Australians will be able to purchase the UK e-book, and that it’ll be ready to go in a week or so.
Why do I see two different versions on amazon.co.uk? Both hardcover, same title but different subtitles, different publication dates. The second one has the same cover and title as the one on amazon.com, so presumably that is the one to go for.
The US and UK versions will have different covers and subtitles. I’m not sure why the US version shows up on the .co.uk website. We’ve asked the publishers to take a look.
The US version is currently #396 in Books on .co.uk
Why the different subtitles? The US subtitle seems much more direct, while the UK subtitle is a breath of stale air. What is the “esc” key intended to convey, when the point is that there would be no escape?
There’s not a short answer; subtitles and cover art are over-constrained and the choices have many stakeholders (and authors rarely have final say over artwork). The differences reflect different input from different publishing-houses in different territories, who hopefully have decent intuitions about their markets.
Are you American? Because as a British person I would say that the first version looks a lot better to me, and certainly fits the standards for British non-fiction books better.
Though I do agree that the subtitle isn’t quite optimal.
I am British. I’m not much impressed by either graphic design, but I’m not a graphic designer and can’t articulate why.
I’m also not a graphic designer. But I agree that both designs give me the ick. I think it’s something about how lazy they both look. They give early 2000s self help book.
To be clear, I’m quite excited for this book, and have preordered! I am just surprised by the covers.
I think the main problem is that the second cover looks really rushed.
Huh, I’m also British and I thought the first version looked like a placeholder, as in “no one’s uploaded an actual cover yet so the system auto generates one”. The only thing making me think not-that was that the esc key is mildly relevant. I bought the second one partly because I was a lot more confident I was actually buying a real book.
I guess part of what’s going on here is it’s the same grey as the background (or very close?), so looks transparent. But even without that I think I’d have had a similar reaction.
I had the same reaction to the first version
Have preordered on Amazon DE, as both hardcover and Kindle.
Excited to hear this! Preordered.
Does hardcover vs. ebook matter here?
Bestseller algorithms are secret and shifty, but hardcover is generally believed to count a little more. And as for overall impact, if either format is good for you, a hardcover preorder helps more because it encourages the publisher to print a bigger initial run of physical copies, which can get pumped into stores and onto shelves where people will see them.
I would assume e-book orders will also play a role in encouraging the publisher to print more physical copies, because it indicates that more people are interested in reading the book.
Yep! This is the first time I’m hearing the claim that hardcover matters more for bestseller lists; but I do believe hardcover preorders matter a bit more than audiobook preorders (which matter a bit more than ebook preorders). I was assuming the mechanism for this is that they provide different amounts of evidence about print demand, and thereby influence the print run a bit differently. AFAIK all the options are solidly great, though; mostly I’d pick the one(s) that you actually want the most.
I’d assume so, but the post didn’t mention that as a consideration. @So8res, edit the post to point that out?
Online, I’m seeing several sources say that pre-orders actually hurt on Amazon, because the Amazon algorithm cares about sales and reviews after launch and doesn’t count pre-orders. Anyone know about this? If I am buying on Amazon should I wait til launch, or conversely if I’m pre-ordering should I buy elsewhere?
It’s a bit complicated, but after looking into this and weighing this against other factors, MIRI and our publisher both think that the best option is for people to just buy it when they think to buy it—the sooner, the better.
Whether you’re buying on Amazon or elsewhere, on net I think it’s a fair bit better to buy now than to wait.
Preordered ebook version on Amazon. I am also interested in doing Korean translation.
If I want to pre-order but don’t use Internet marketplaces and don’t have a credit card, are there options for that (e.g. going to a physical store and asking them to pre-order)?
I made a Manifold market for how many pre-orders there will be!
How come B&N can ship to a ton of different countries including San Marino and the Vatican but not Italy???
On the German bookstore website, I can order either the American or the UK version. I assume it does not make a difference for the whole preorder argument? The American epub is cheaper and is published two days earlier.
Amazon’s best-seller standings. I wouldn’t make too much of this; their categorization is wonky. (I also have no clue what the lookback window is, what they make of preorders, etc.)
#5 in “Technology”
#4,537 in all books
#11 in engineering
#14 in semantics and AI (how is this so much lower than “Technology”?)
In short: showing up! It could be grabbing someone’s eye right now. Still drowned out by Yuval Noah Harari, Ethan Mollick, Ray Kurzweil, et al.
It was briefly in the 300s overall, and 1 or 2 in a few subcategory thingies.
Sad to see that the ebook version is DRM-protected. On your website you list far fewer retailers than Hachette does.
I was looking for that information. Sad indeed.
@So8res, is there any chance of a DRM-free version that’s not a hardcopy, or did that ship sail when you signed your deal?
I would love to read your book, but this leaves me torn between “Reading Nate or Eliezer has always been enlightening” and “No DRM, never again.”
From the MIRI announcement:
Uh… this is debatably a lot to ask of the world right now.
A German bookseller claims there is a softcover (Taschenbuch) version available for preorder: https://www.thalia.de/shop/home/artikeldetails/A1075128502
Is that correct? It does not seem to be available on any US website.
Can you please make it available at a cheaper rate in India?🙏 I’m very interested in alignment, but I’m a student, and the USD price is too expensive :(