David Karsten suggests you read the book, while noting he is biased. He reminds us that, as with any other book, most conversations you have about the book will be with people who did not read the book.
David Manheim: “An invaluable primer on one side of the critical and underappreciated current debate between those dedicated to building this new technology, and those who have long warned against it. If you haven’t already bought the book, you really should.”
Steven Byrnes reviews the book and says he agrees with ~90% of it, disagreeing with some of the reasoning steps but emphasizing that the conclusions are overdetermined.
Michael Nielsen gives the book five stars on Goodreads and recommends it, as it offers a large and important set of largely correct arguments, with the caveat that he sees significant holes in the central argument. His actual top objection is that even if we do manage to get a controlled and compliant ASI, that is still extremely destabilizing at best and fatal at worst. I agree with him this is not a minor quibble, and I worry about that scenario a lot whereas the book’s authors seem to consider it a happy problem they’d love to have. If anything it makes the book’s objections to ‘building it’ even stronger. He notes he does not have a good solution.
Nomads Vagabonds (in effect) endorses the first 2 parts of the book, but strongly opposes the policy asks in part 3 as unreasonable, demanding that if you are alarmed you need to make compromises to ‘help you win,’ that if you think the apocalypse is coming you should only propose things inside the Overton Window, and that ‘even a 1% reduction in risk is worth pursuing.’
That makes sense if and only if solutions inside the Overton Window substantially improve your chances or outlook, and thus the opportunity outweighs the opportunity cost. The reason (as Nomads quotes) Ezra Klein says Democrats should compromise on various issues if facing what they see as a crisis is that this greatly raises the value of ‘win the election with a compromise candidate’ relative to losing. There’s tons of value to protect even if you give up ground on some issues.
Whereas the perspective of Yudkowsky and Soares is that measures ‘within the Overton Window’ matter very little in terms of prospective outcomes. So you’d much rather take a chance on convincing people to do what would actually work. It’s a math problem, including the possibility that busting the Overton Window helps the world achieve other actions you aren’t yourself advocating for.
The central metaphor Nomads uses is protecting against dragons, where the watchman insists upon only very strong measures and rejects cheaper ones entirely. Well, that depends on whether the cheaper solutions actually protect you from dragons.
If you believe – as I do! – that lesser changes can make a bigger difference, then you should say so, and try to achieve those lesser changes while also noting you would support the larger ask. If you believe – as the authors do – that the lesser changes can’t make much difference, then you say that instead.
I would also note that yes, there are indeed many central historical examples of asking for things outside the Overton Window being the ultimately successful approach to creating massive social change. This ranges from the very good (such as abolishing slavery) to the very bad (such as imposing Communism or Fascism).
Guarded Positive Reactions
People disagree a lot on whether the book is well or poorly written, in various ways, at various different points in the book. Kelsey Piper thinks the writing in the first half is weak and the second half is stronger, whereas Timothy Lee (a skeptic of the book’s central arguments) thought the book was surprisingly well-written and the first few chapters were his favorites.
Peter Wildeford goes over the book’s central claims, finding them overconfident and warning about the downsides and costs of shutting AI development down. He’s a lot ‘less doomy’ than the book on many fronts. I’d consider this his central takeaway:
Peter Wildeford: Personally, I’m very optimistic about AGI/ASI and the future we can create with it, but if you’re not at least a little ‘doomer’ about this, you’re not getting it. You need profound optimism to build a future, but also a healthy dose of paranoia to make sure we survive it. I’m worried we haven’t built enough of this paranoia yet, and while Yudkowsky’s and Soares’s book is very depressing, I find it to be a much-needed missing piece.
Buck is a strong fan of the first two sections and liked the book far more than he expected, calling them the best available explanation of the basic AI misalignment risk case for a general audience. He caveats that the book does not address counterarguments Buck thinks are critical, and that he would tell people to skip section 3. Buck’s caveats come down to expecting important changes between the world today and the world where ASI is developed, changes that potentially alter whether everyone would probably die. I don’t see him say what he expects these differences to be? And the book does hold out the potential for us to change course, and indeed find such changes; it simply says this won’t happen soon or by default, which seems probably correct to me.
Nostream Argues For Lower Confidence
Nostream addresses the book’s arguments, agreeing that existential risk is present but making several arguments that the probability is much lower (they estimate 10%-20%), with the claim that if you buy any of the ‘things are different than the book says’ arguments, that would be sufficient to lower your risk estimate. This feels like a form of the conjunction fallacy fallacy, where there is a particular dangerous chain and thus breaking the chain at any point breaks a lot of the danger and returns us to things probably being fine.
The post focuses on the threat model of ‘an AI turns on humanity per se,’ treating that as load bearing, which it isn’t, and treats the ‘alignment’ of current models as meaningful in ways I think are clearly wrong, and in general tries to draw too many conclusions from the nature of current LLMs. I consider all the arguments here, in the forms expressed, to be neither new nor unanswered; the book’s underlying arguments address them, although not always in a form as easy to pick up on as one might hope in the book text alone (the book is non-technical and length-limited, and people understandably have cached thoughts and anchor on examples).
So overall I wished the post was better and made stronger arguments, including stronger forms of its arguments, but this is the right approach to take of laying out specific arguments and objections, including laying out up front that their own view includes unacceptably high levels of existential risk and also dystopia risk or catastrophic risk short of that. If I believed what Nostream believed I’d be in favor of not building the damn thing for a while, if there was a way to not build it for a while.
Gary Marcus Reviews The Book
Gary Marcus offered his (gated) review, which he kindly pointed out I had missed in Wednesday’s roundup. He calls the book deeply flawed, but with much that is instructive and worth heeding.
Despite the review’s flaws and incorporation of several common misunderstandings, this was quite a good review, because Gary Marcus focuses on what is important, and his reaction to the book is similar to mine to his review. He notices that his disagreements with the book, while important and frustrating, should not be allowed to interfere with the central premise, which is the important thing to consider.
He starts off with this list of key points where he agrees with the authors:
Rogue AI is a possibility that we should not ignore. We don’t know for sure what future AI will do and we cannot rule out the possibility that it will go rogue.
We currently have no solution to the “alignment problem” of making sure that machines behave in human-compatible ways.
Figuring out solutions to the alignment problem is really, really important.
Figuring out a solution to the alignment problem is really, really hard.
Superintelligence might come relatively soon, and it could be dangerous.
Superintelligence could be more consequential than any other trend.
Governments should be more concerned.
The short-term benefits of AI (eg in terms of economics and productivity) may not be worth the long-term risks.
Noteworthy, none of this means that the title of the book — If Anyone Builds It, Everyone Dies — literally means everyone dies. Things are worrying, but not nearly as worrying as the authors would have you believe.
It is, however, important to understand that Yudkowsky and Soares are not kidding about the title.
Gary Marcus does a good job here of laying out the part of the argument that should be uncontroversial. I think all of these points are clearly true as stated.
Gary Marcus: Specifically, the central argument of the book is as follows: Premise 1. Superintelligence will inevitably come — and when it does, it will inevitably be smarter than us. Premise 2. Any AI that is smarter than any human will inevitably seek to eliminate all humans. Premise 3. There is nothing that humans can do to stop this threat, aside from not building superintelligence in the first place.
The good news [is that the second and third] premisses are not nearly as firm as the authors would have it.
I’d quibble here, most importantly on the second premise, although it’s good to be precise everywhere. My version of the book’s argument would be:
Superintelligence is a real possibility that could happen soon.
If built with anything like current techniques, any sufficiently superintelligent AI will effectively maximize some goal and seek to rearrange the atoms to that end, in ways unlikely to result in the continued existence of humans.
Once such a superintelligent AI is built, if we messed up, it will be too late to stop this from happening. So for now we have to not build such an AI.
He mostly agrees with the first premise and has a conservative but reasonable estimation of how soon it might arrive.
For premise two, the quibble matters because Gary’s argument focuses on whether the AI will have malice, pointing out existing AIs have not shown malice. Whereas the book does not see this as requiring the AI to have malice, nor does it expect malice. It expects merely indifference, or simply a more important priority, the same way humans destroy many things we are indifferent to, or that we actively like but are in the way. For any given specified optimization target, given sufficient optimization power, the chance the best solution involves humans is very low. This has not been an issue with past AIs because they lacked sufficient optimization power, so such solutions were not viable.
This is a very common misinterpretation, both of the authors and the underlying problems. The classic form of the explanation is ‘The AI does not love you, the AI does not hate you, but you are composed of atoms it can use for something else.’
His objection to the third premise is that in a conflict, the ASI’s victory over humans seems possible but not inevitable. He brings a standard form of this objection, including quoting Moltke’s famous ‘no plan survives contact with the enemy’ and modeling this as the AI getting to make some sort of attack, which might not work, after which we can fight back. I’ve covered this style of objection many times, comprising a substantial percentage of words written in weekly updates, and am very confident it is wrong. The book also addresses it at length.
I will simply note that in context, even if on this point you disagree with me and the book, and agree with Gary Marcus, it does not change the conclusion, so long as you agree (as Marcus does) that such an ASI has a large chance of winning. That seems sufficient to say ‘well then we definitely should not build it.’ Similarly, while I disagree with his assumption we would unite and suddenly act ruthlessly once we knew the ASI was against us, I don’t see the need to argue that point.
Marcus does not care for the story in Part 2, or the way the authors use colloquial language to describe the AI in that story.
I’d highlight another common misunderstanding, which is important enough that the response bears repeating:
Gary Marcus: A separate error is statistical. Over and again, the authors describe scenarios that multiply out improbabilities: AIs decide to build biological weapons; they blackmail everyone in their way; everyone accepts that blackmail; the AIs do all this somehow undetected by the authorities; and so on. But when you string together a bunch of improbabilities, you wind up with really long odds.
The authors are very much not doing this. This is not ‘steps in the chain’ where the only danger is that the AI succeeds at every step, whereas if it ever fails we are safe. They bend over backwards, in many places, to describe how the AI uses forking paths, gaming out and attempting different actions, having backup options, planning for many moves not succeeding and so on.
The scenario also does not presuppose that no one realizes or suspects, in broad terms, what is happening. They are careful to say this is one way things might play out, and any given particular story you tell must necessarily involve a lot of distinct things happening.
If you think that at the first sign of trouble all the server farms get turned off? I do not believe you are paying enough attention to the world in 2025. Sorry, no, not even if there wasn’t an active AI effort to prevent this, or an AI predicting what actions would cause what reactions, and choosing the path that is most likely to work, which is one of the many benefits of superintelligence.
Several people have pointed out that the actual absurdity is that the AI in the story has to work hard and use the virus to justify it taking over. In the real world, we will probably put it in charge ourselves, without the AI even having to push us to do so. But in order to be more convincing, the story makes the AI’s life harder here and in many other places than it actually would be.
Gary Marcus closes with discussion of alternative solutions to building AIs that he wished the book explored more, including alternatives to LLMs. I’d say that this was beyond the scope of the book, and that ‘stop everyone from racing ahead with LLMs’ would be a likely necessary first step before we can pursue such avenues in earnest. Thus I won’t go into details here, other than to note that ‘axioms such as ‘avoid causing harm to humans’’ are known to not work, which was indeed consistently the entire point of Asimov’s robot novels and other stories, where he explores some but not all of the reasons why this is the case.
Again, I very much appreciated this review, which focused on what actually matters and clearly states what Marcus actually believes. More like this, please.
John Pressman Agrees With Most Claims But Pushes Back On Big Picture
John Pressman agrees with most of the book’s individual statements and choices on how to present arguments, but disagrees with the thesis and many editorial and rhetorical choices. Ultimately his verdict is that the book is ‘okay.’ He states up top as ‘obvious’ many ‘huge if true’ statements about the world that I do not think are correct and definitely are not obvious. There’s also a lot of personal animus going on here, has been for years. I see two central objections to the book from Pressman:
If you use parables then You Are Not Serious People, as in ‘Bluntly: A real urgent threat that demands attention does not begin with Once Upon a Time.’
Which he agrees is a style issue.
There are trade-offs here. It is hard for someone like Pressman to appreciate the challenges in engaging with average people who know nothing about these issues, or in facing off against the various stupid objections those people have (that John mostly agrees are deeply stupid).
He thinks the alignment problems involved are much easier than Yudkowsky and Soares believe, including that we have ‘solved the human values loading problem,’ although he agrees that we are still very much on track to fail bigly.
I think (both here and elsewhere where he goes into more detail) he both greatly overstates his case and also deliberately presents the case in a hostile and esoteric manner. That makes engagement unnecessarily difficult.
I also think there are important real things here and in related clusters of arguments, and my point estimate of difficulty is lower than the book’s, largely for reasons that are related to what Pressman is attempting to say.
As Pressman points out, even if he’s fully right, it looks pretty grim anyway.
Meta Level Reactions Pushing Back On Pushback
Aella (to be fair a biased source here) offers a kind of meta-review.
Aella: man writes book to warn the public about asteroid heading towards earth. his fellow scientists publish thinkpieces with stuff like ‘well his book wasn’t very good. I’m not arguing about the trajectory but he never addresses my objection that the asteroid is actually 20% slower’
I think I’d feel a lot better if more reviews started with “first off, I am also very worried about the asteroid and am glad someone is trying to get this to the public. I recognize I, a niche enthusiast, am not the target of this book, which is the general public and policy makers. Regardless of how well I think the book accomplishes its mission, it’s important we all band together and take this very seriously. That being said, here’s my thoughts/criticisms/etc.”
There is even an advertising campaign whose slogan is ‘we wish we were exaggerating,’ and I verify they do wish this.
It is highly reasonable to describe the claims as overstated or false, indeed in some places I agree with you. But as Kelsey says, focus on: What is true?
Claims the authors are overconfident, including claims this is bad for credibility.
I agree that the authors are indeed overconfident. I hope I’ve made that clear.
However I think these are reasonable mistakes in context, that the epistemic standards here are much higher than those of most critics, and also that people should say what they believe, and that the book is careful not to rely upon this overconfidence in its arguments.
Fears about credibility or respectability, I believe, motivate many attacks on the book. I would urge everyone attacking the book for this reason (beyond noting the specific worry) to stop and take a long, hard look in the mirror.
Complaints that it sucks that ‘extreme’ or eye-catching claims will be more visible. Whereas claims one thinks are more true are less visible and discussed.
Yeah, sorry, world works the way it does.
The eye-catching claims open up the discussion, where you can then make clear you disagree with the full claim but endorse a related other claim.
As Raymond says, this is a good thing to do: if you think the book is wrong about something, write and say why and how you think about it.
He provides extensive responses to the arguments and complaints he has seen, using his own framings, in the extended thread.
Then he fleshes out his full thoughts in a longer LessWrong post, The Title Is Reasonable, that lays out these questions at length. It also contains some good comment discussion, including by Nate Soares.
I agree with him that it is a reasonable strategy to have those who make big asks outside the Overton Window because they believe those asks are necessary, and also those who make more modest asks because they feel those are valuable and achievable. I also agree with him that this is a classic often successful strategy.
Yes, people will try to tar all opposition with the most extreme view and extreme ask. You see it all the time in anything political, and there are plenty of people doing this already. But it is not obvious this works, and in any case it is priced in. One can always find such a target, or simply lie about one or make one up. And indeed, this is the path a16z and other similar bad faith actors have chosen time and again.
James Miller: I’m an academic who has written a lot of journalistic articles. If Anyone Builds It has a writing style designed for the general public, not LessWrong, and that is a good thing and a source of complaints against the book.
David Manheim notes that the book is non-technical by design and does not attempt to bring readers up to speed on the last decade of literature. As David notes, the book reflects on the new information but is not in any position to cover or discuss all of that in detail.
I would turn this back at Clara here, and say she seems to have failed to update her priors on what the authors are saying. I found her response very disappointing.
I don’t agree with Rob Bensinger’s view that this was one of the lowest-quality reviews. It’s more that it was the most disappointing given the usual quality and epistemic standards of the source. The worst reviews were, as one would expect, from sources that are reliably terrible in similar spots, but that is easy to shrug off.
The book very explicitly does not depend on the foom, there only being one AI, or other elements she says it depends on. Indeed in the book’s example story the AI does not foom. It feels like arguing with a phantom.
He also points out that there being multiple similarly powerful AIs by default makes the situation more dangerous rather than less dangerous, including in AI 2027, because it reduces margin for error and ability to invest in safety. If you’ve done a tabletop exercise based on AI 2027, this becomes even clearer.
Her first charge is that a lot of AI safety people disagree with the MIRI perspective on the problem, so the subtext of the book is that the authors must think all those other people are idiots. This seems like a mix of an argument for epistemic modesty and a claim that Yudkowsky and Soares haven’t considered these other opinions properly and are disrespecting those who hold them? I would push back strongly on all of that.
She raises the objection that the book has an extreme ask that ‘distracts’ from other asks, a term also used by MacAskill. Yes, the book asks for an extreme thing that I don’t endorse, but that’s what they think is necessary, so they should say so.
She asks why the book doesn’t spend more time explaining why an intelligence explosion is likely to occur. The answer is the book is explicitly arguing a conditional, what happens if it does occur, and acknowledges that it may or may not occur, or occur on any given time frame. She also raises the ‘but progress has been continuous which argues against an intelligence explosion’ argument, except continuous progress does not argue against a future explosion. Extend curves.
She objects that the authors reach the same conclusions about LLMs that they previously reached about other AI systems. Yes, they do. But, she says, these things are different. Yes, they are, but not in ways that change the answer, and the book (and supplementary material, and their other writings) explain why they believe this. Reasonable people can disagree. I don’t think, from reading Clara’s objections along these lines, that she understands the central arguments being made by the authors (over the last 20 years) on these points.
She says if progress will be ‘slow and continuous’ we will have more than one shot on the goal. For sufficiently generous definitions of slow and continuous this might be true, but there has been a lot of confusion about ‘slow’ versus ‘fast.’ Current observed levels of ‘slow’ are still remarkably fast, and effectively mean we only get one shot, although we are getting tons of very salient and clear warning shots and signs before we take that one shot. Of course, we’re ignoring the signs.
She objects that ‘a future full of flourishing people is not the best, most efficient way to fulfill strange alien purposes’ is stated as a priori obvious. But it is very much a priori obvious; I will bite this bullet and die on this hill.
There are additional places one can quibble; another good response is the full post from Max Harms, which includes many additional substantive criticisms. One key additional clarification is that the book is not claiming that we will fail to learn a lot about AI from studying current AI, only that this ‘a lot’ will be nothing like sufficient. It can be difficult to reconcile ‘we will learn a lot of useful things’ with ‘we will predictably learn nothing like enough of the necessary things.’
Clara responded in the comments, and stands by the claim that the book’s claims require a major discontinuity in capabilities, and that gradualism would imply multiple meaningfully distinct shots on goal.
Jeffrey Ladish: I do think you’re nearby to a criticism that I would make about Eliezer’s views / potential failures to update: Which is the idea that the gap between a village idiot and Einstein is small and we’ll blow through it quite fast. I think this was an understandable view at the time and has turned out to be quite wrong. And an implication of this is that we might be able to use agents that are not yet strongly superhuman to help us with interpretability / alignment research / other useful stuff to help us survive.
Anyway, I appreciate you publishing your thoughts here, but I wanted to comment because I didn’t feel like you passed the authors ITT, and that surprised me.
Peter Wildeford: I agree that the focus on FOOM in this review felt like a large distraction and missed the point of the book.
The fact that we do get a meaningful amount of time with AIs one could think of as between village idiots and Einsteins is indeed a major source of hope, although I understand why Soares and Yudkowsky do not see as much hope here, and that would be a good place to poke. The gap argument was still more correct than incorrect, in the sense that we will likely only get a period of years in that window rather than decades, and most people are making various forms of the opposite mistake and not understanding that above-Einstein levels of intelligence are Coming Soon. But years or even months can do a lot for you if you use them well.
Will MacAskill Offers Disappointing Arguments
Will MacAskill offers a negative review, criticizing the arguments, and especially the parallels to evolution, as quite bad, although he praises the book for plainly saying what its authors actually believe, for laying out ways in which the tech we saw was surprising, and for the quality of the analogies.
I found Will’s review quite disappointing but unsurprising. Here are my responses:
The ones about evolution parallels seem robustly answered by the book and also repeatedly elsewhere by many.
Will claims the book is relying on a discontinuity of capability. It isn’t. They are very clear that it isn’t. The example story containing a very mild form of [X] does not mean one’s argument relies on [X], although it seems impossible for there not to be some amount of jumps in capability, and we have indeed seen such jumps.
The ones about ‘types of misalignment’ seem at best deeply confused; I think he’s saying humans will stay in control because AIs will be risk averse, so they’ll be happy to settle for a salary rather than take over, and thus make this overwhelmingly pitiful deal with us? Whereas the fact that imperfect alignment is indeed catastrophic misalignment in the context of a superhuman AI is the entire thesis of the book, covered extensively, in ways Will doesn’t engage with here?
Intuition pump that might help: Are humans risk averse?
Criticism of the proposal as unlikely to happen. Well, not with that attitude. If one thinks that this is what it takes, one should say so. If not, not.
Criticism of the proposal as unnecessary or even unhelpful, and ‘distracting’ from other things we could do. Standard arguments here. Not much to say, other than to note that the piecemeal ban proposals he suggests don’t technically work.
Criticism of the use of fiction and parables ‘as a distraction.’ Okie dokie.
Claim that the authors should have updated more in light of developments in ML. This is a reasonable argument one can make, but people (including Clara and also Tyler below) who say that the book’s arguments are ‘outdated’ are simply incorrect. The arguments and authors do take such information into account, and have updated in some ways, but do not believe the new developments change the central arguments or likely future ultimate path. And they explain why. You are encouraged to disagree with their reasoning if you find it wanting.
Zack Robinson Raises Alarm About Anthropic’s Long Term Benefit Trust
Zack Robinson of Anthropic’s Long Term Benefit Trust has a quite poor Twitter thread attacking the book, following the MacAskill-style principle of attacking those whose tactics and messages are not cooperating with the EA-brand-approved messaging designed to seek movement growth, respectability, power and donations, which in my culture we call a strategy of instrumental convergence.
The particular arguments in the thread are quite bad and mischaracterize the book. He says the book presents doom as a foregone conclusion, which is not how contingent predictions work. He uses pure modesty and respectability arguments for ‘accepting uncertainty’ in order to ‘leave room for AI’s transformative benefits,’ which is simply a non-sequitur. He warns this ‘blinds us to other serious risks,’ the ‘your cause of us all not dying is a distraction from the real risks’ argument, which again simply is not true, there is no conflict here. His statement in this linked Tweet about the policy debate being only a binary false choice is itself clearly false, and he must know this.
The whole thing is in what reads as workshopped corporate speak.
The responses to the thread are very on point, and Rob Bensinger in particular pulls this very important point of emphasis:
Like Buck and others who responded to express their disappointment, I find this thread unbecoming of someone on the LTBT, and as long as he remains there I can no longer consider him a meaningful voice for existential risk worries there, which substantially lowers my estimate of Anthropic’s likely behaviors.
I join several respondents in questioning whether Zack has read the book, a question he did not respond to. One can even ask if he has logically parsed its title.
Similarly, given he is the CEO of the Center for Effective Altruism, I must adjust there, as well, especially as this is part of a pattern of similar strategies.
I think this line is telling in multiple ways:
Zack Robinson: I grew up participating in debate, so I know the importance of confidence. But there are two types: epistemic (based on evidence) and social (based on delivery). Despite being expressed in self-assured language, the evidence for imminent existential risk is far from airtight.
Zack is debating in this thread, as in trying to make his position win and get ahead via his arguments, rather than trying to improve our epistemics.
Contrary to the literal text, Zack is looking at the social implications of appearing confident, and disapproving, rather than primarily challenging the epistemics. Indeed, the only arguments he makes against confidence here are from modesty:
Argument from consequences: Thinking confidently causes you to get dismissed (maybe) or to ignore other risks (no) and whatnot.
Argument from consensus: Others believe the risk is lower. Okie dokie. Noted.
As you would expect, many others not named in this post are trotting out all the usual Obvious Nonsense arguments about why superintelligence would definitely turn out fine for the humans, such as that ASIs would ‘freely choose to love humans.’ Do not be distracted by noise.
John Pressman responded to my statement that he ‘both greatly overstates his case and also deliberately presents the case in a hostile and esoteric manner,’ and that this ‘makes engagement unnecessarily difficult’:
I notice the style of this section of your summary of other people’s reviews is angrier and more openly emotive than the others. I take this to mean I’ve offended or upset you somehow. This is odd to me because I think my review was a lot nicer than most people expected from me (including myself). You don’t seem nearly as frustrated by other people making much dumber and more bad faith arguments, so I’m curious what it is that I’ve done to upset you.
In any case I do not think that I make my case in a “hostile and esoteric manner”. If anything I think I’ve kind of done worse than that by mostly not writing down my case at all because I have very high intellectual standards and don’t feel comfortable articulating my intuitions until the articulation is relatively rigorous.
That having been said I don’t think what I have written so far is “hostile and esoteric”.
There’s me attempting to explain during our podcast which I admit it took me longer than I’d like to get to the point.
https://gist.githubusercontent.com/JD-P/34e597cef5e99f8afa6304b3df5a4386/raw/db55f1b24bd437566392ac58f2160ccaefd5631f/outer-alignment-and-deception-with-jdp-and-zvi.vtt
Then there’s various pieces of writing I have which express part of my thoughts
https://gist.github.com/JD-P/56eaadc7f3a08026418ceb7bf4808aee
https://minihf.com/posts/2024-12-20-weave-agent-dev-log-3/
I have writing at various levels of quality and endorsement in my Twitter archive which you can find by searching for keywords like “Goodhart” and “Reward model”.
https://jdpressman.com/tweets.html
You can also just ask me. Honestly I think people ask me things about alignment somewhere between 10-100x less often than they should.
Bluntly, I don’t really understand what you take issue with in my review. From my perspective the structure of my review goes like this:
We are currently in an alignment winter. (This is bad)
Alignment is not solved yet but people widely believe it is. (This is bad)
I was expecting to hate the book but it actually retreats on most of the rhetoric I blame for contributing to the alignment winter. (This is good)
The style of the book is bad, but I won’t dwell on it and in fact spend a paragraph on the issue and then move on.
I actually disagree with the overall thesis, but think it’s virtuous to focus on the points of agreement when someone points out an important issue so I don’t dwell on that either and instead
“Emphatically agree” (literal words) that AI labs are not serious about the alignment problem.
State a short version of what the alignment problem actually is. (Important because it’s usually conflated with or confused with simpler problems that sound a lot easier to solve.)
I signal boost Eliezer’s other and better writing because I think my audience is disproportionately made up of people who might be able to contribute to the alignment problem if they’re not deeply confused about it and I think Eliezer’s earlier work is under-read.
I reiterate that I think the book is kinda bad, since I need a concluding paragraph.
I continue to think this is a basically fair review.
I had to reread part 7 from your review to fully understand what you were trying to say. It’s not easy to parse on a quick read, so I’m guessing Zvi didn’t interpret the context and content correctly, like I didn’t on my first pass. On first skim, I thought it was a technical argument about how you disagreed with the overall thesis, which makes things pretty confusing.
If that’s your reaction to my reaction, then it was a miss in at least some ways, which is on me.
I did not feel angry (more like frustrated?) when I wrote it nor did I intend to express anger, but I did read your review itself as expressing anger and hostility in various forms—you’re doing your best to fight through that and play fair with the ideas as you see them, which is appreciated - and have generally read your statements about Yudkowsky and related issues consistently as being something in the vicinity of angry, also as part of a consistent campaign, and perhaps some of this was reflected in my response. It’s also true that I have a cached memory of you often responding as if things said are more hostile than I felt they were or were intended, although I do not recall examples at this point.
And I hereby report that, despite at points in the past putting in considerable effort trying to parse your statements, at some point I found it too difficult, frustrating and aversive in some combination, and mostly stopped attempting to do so when my initial attempt on a given statement bounced (which sometimes it doesn’t).
(Part of what is ‘esoteric’ is perhaps that the perfect-enemy-of-good thing means a lot of load-bearing stuff is probably unsaid by you, and you may not realize that you haven’t said it?)
But also, frankly, when people write much dumber reviews with much dumber things in them, I mostly can’t even bring myself to be mad, because I mean what else can one expect from such sources—there’s only one such review that actually did make me angry, because it was someone where I expected better. It’s something I’ve worked a lot on, and I think made progress on—I don’t actually e.g. get mad at David Sacks anymore as a person, although I still sometimes get mad that I have to once again write about David Sacks.
To the extent I was actually having a reaction to you here it was a sign that I respect you enough to care, that I sense opportunity in some form, and that you’re saying actual things that matter rather than just spouting gibberish or standard nonsense.
Similarly, with the one exception, if those people had complained about my reaction to their reaction in the ways I’d expect them to do so, I would have ignored them.
Versus your summary of your review, I would say I read it more as:
We are currently in an alignment winter. (This is bad). This is asserted as ‘obvious’ and then causes are cited, all in what I read as a hostile manner, along with an assertion of ‘facts not in evidence’ that I indeed disagree with, including various forms of derision that read in-context as status attacks and accusations of bad epistemic action, and the claim that the value loading problem has been solved. All of this is offered in a fashion that implies you think it is clearly true if not rather obvious, and it is all loaded up front despite not being especially relevant to the book, echoing things you talk about a lot. This sets the whole thing up as an adversarial exercise. You can notice that in my reaction I treated these details as central, in a way you don’t seem to think they are, or at least I think the central thing boils down to this?
Alignment is not solved yet but people widely believe it is. (This is bad). It’s weird because you say ‘we solved [X] and people think [X] solves alignment but it doesn’t’ where I don’t think it’s true we solved [X].
I was expecting to hate the book but it actually retreats on most of the rhetoric I blame for contributing to the alignment winter. (This is good) Yes.
The style of the book is bad, but I won’t dwell on it and in fact spend a paragraph on the issue and then move on. ‘Truly appalling’ editorial choices, weird and often condescending, etc. Yes it’s condensed but you come on very strong here (which is fine, you clearly believe it, but I wouldn’t minimize its role). Also your summary skips over the ‘contempt for LLMs’ paragraph.
I actually disagree with the overall thesis, but think it’s virtuous to focus on the points of agreement when someone points out an important issue so I don’t dwell on that either and instead.
“Emphatically agree” (literal words) that AI labs are not serious about the alignment problem.
State a short version of what the alignment problem actually is. (Important because it’s usually conflated with or confused with simpler problems that sound a lot easier to solve.)
I signal boost Eliezer’s other and better writing because I think my audience is disproportionately made up of people who might be able to contribute to the alignment problem if they’re not deeply confused about it and I think Eliezer’s earlier work is under-read.
I reiterate that I think the book is kinda bad, since I need a concluding paragraph.
I read ‘ok’ in this context as better than ‘kinda bad’ fwiw.
As for ‘I should just ask you,’ I notice this instinctively feels aversive as likely opening up a very painful and time consuming and highly frustrating interaction or set of interactions and I notice I have the strong urge not to do it. I forget the details of the interactions with you in particular or close others that caused this instinct, and it could be a mistake. I could be persuaded to try again.
I do know that when I see the interactions of the entire Janus-style crowd on almost anything, I have the same feeling I had with early LW, where I expect to get lectured to and yelled at and essentially downvoted a lot, including in ‘get a load of this idiot’ style ways, if I engage directly in most ways and it puts me off interacting. Essentially it doesn’t feel like a safe space for views outside a certain window. This makes me sad because I have a lot of curiosity there, and it is entirely possible this is deeply stupid and if either side braved mild social awkwardness we’d all get big gains from trade and sharing info. I don’t know.
I realize it is frustrating to report things in my head where I can’t recall many of the sources of the things, but I am guessing that you would want me to do that given that this is the situation.
I dunno, man, this is definitely a ‘write the long letter’ situation and I’m calling it here.
(If you want to engage further, my reading of LW comments even on my own posts is highly unreliable, but I would get a PM or Twitter DM or email etc pretty reliably).
I’m pretty sure it is the same post I wrote my comment in response to.
It is hard for someone like Pressman to appreciate the challenges in engaging with average people
And similar comments seem like a level of snark and condescension that the other descriptions don’t have.
It’s also wrong, I do appreciate the challenges with that and think they are in a sense the fundamental difficulty of writing a book like this. What I was originally going to say in my review, before deciding it was going to get lambasted by 50 parties anyway and I didn’t need to be one of them when there’s praiseworthy elements I can focus on instead, is that the book as written is barely structured like an argument at all. It’s more like a series of disconnected vignettes arranged in roughly chronological order. This is because the authors don’t actually feel they can make an argument. They don’t feel they can just make an argument because normally when they try doing that their audience will interject with some ridiculous cope thing, and the ridiculous cope is different for different people and it’s difficult to know in advance which ridiculous cope thing the audience will want you to respond to. So what the book does is interrupt itself constantly to try and head off some aside or point that the imagined reader might make there, and I guess I can only hope that the chosen structure is actually the product of constant beta readers objecting at various points and then deciding to put the interruptions at the most common points of objection. I hope this but do not really expect it because the book honestly comes across as the product of an incestuous editing process where it wasn’t really shown to any critical perspectives who would point out flaws as basic as “if I skim the beginning of each part of the book I will pick up on it always starting with an example sufficiently whimsical and not-real that the phrase ‘once upon a time’ is warranted”.
Nevertheless my review was primarily written from the perspective of “You are an AI (alignment) researcher and you want to know if there’s anything important in this book for you to spend your limited time reading.” so I didn’t go into that much detail about the style flaws besides noting that they exist and what I feel the biggest flaws are. Part of why I didn’t want to do an extended takedown of the style is that ultimately it’s an empirical question, the book will either be successful or it won’t be and going into a long dissection of what I think is wrong stylistically in a book that isn’t even written with me as the intended audience doesn’t seem like a very self aware thing to do.
They don’t feel they can just make an argument because normally when they try doing that their audience will interject with some ridiculous cope thing, and the ridiculous cope is different for different people and it’s difficult to know in advance which ridiculous cope thing the audience will want you to respond to. So what the book does is interrupt itself constantly to try and head off some aside or point that the imagined reader might make there
I don’t see why that’s a bad thing, or fundamentally different to making an argument: making a case and refuting objections as you go along is a stronger form of argumentation than just making a case.
Well it’s a bad thing because it makes it harder to follow what they’re trying to say. It’s also a necessary thing for this subject and kind of the core question that the book has to answer is “how do we anticipate people’s objections and get past them?” and the book represents one hypothesis for how to do that. I am doubtful that it is a correct hypothesis, the reception to it doesn’t seem very positive BUT the reviews we’re getting are not from the target audience and as Scott Alexander said in his review Yudkowsky is a genius who has a history of seeing memetic opportunities that other people do not. So I feel comfortable noting I am doubtful and otherwise letting the reaction tell the story.
Well it’s a bad thing because it makes it harder to follow what they’re trying to say
If you don’t, you end up in a position where the audience can read a short book, and know what your claim is, and disagree with it because of one of the unanswered objections.
and the book represents one hypothesis for how to do that. I am doubtful that it is a correct hypothesis
What’s the right way?
She asks why the book doesn’t spend more time explaining why an intelligence explosion is likely to occur. The answer is the book is explicitly arguing a conditional, what happens if it does occur, and acknowledges that it may or may not occur, or occur on any given time frame.
Is it your claim here that the book is arguing the conditional: “If there’s an intelligence explosion, then everyone dies?” If so, then it seems completely valid to counterargue: “Well, an intelligence explosion is unlikely to occur, so who cares?”