If Anyone Builds It, Everyone Dies: a semi-outsider review
About me and this review: I don’t identify as a member of the rationalist community, and I haven’t thought much about AI risk. I read AstralCodexTen and used to read Zvi Mowshowitz before he switched his blog to covering AI. Thus, I’ve long had a peripheral familiarity with LessWrong. I picked up IABIED in response to Scott Alexander’s review, and ended up looking here to see what reactions were like. After encountering a number of posts wondering how outsiders were responding to the book, I thought it might be valuable for me to write mine down. This is a “semi-outsider” review in that I don’t identify as a member of this community, but I’m not a true outsider in that I was familiar enough with it to post here. My own background is in academic social science and national security, for whatever that’s worth. My review presumes you’re already familiar with the book and are interested in someone else’s take on it, rather than providing a detailed summary.
My loose priors going in:
AI poses an obvious and important set of risks (economically, socially, politically, militarily, psychologically, etc.) that will need to be carefully addressed. Some of those have already arrived, and we need to think seriously about how we will manage them. That will be difficult, expensive, and likely to greatly influence the profitability of AI companies.
“Existential” risk from AI (calling to my mind primarily the “paperclip maximizer” idea) seems relatively exotic and far-fetched. It’s reasonable for some small number of experts to think about it in the same way that we think about asteroid strikes. Describing this as the main risk from AI is overreaching.
The existential risk argument is suspiciously aligned with the commercial incentives of AI executives. It simultaneously serves to hype up capabilities and coolness while also directing attention away from the real problems that are already emerging. It’s suspicious that the apparent solution to this problem is to do more AI research as opposed to doing anything that would actually hurt AI companies financially.
Tech companies represent the most relentless form of profit-maximizing capitalism ever to exist. Killing all of humanity is not profitable, and so tech companies are not likely to do it.
To skip ahead to my posteriors:
Now, I’m really just not sure on existential risk. The argument in IABIED moved me towards worrying about existential risk, but it definitely did not convince me. After reading a few other reviews and reactions to it, I made a choice not to do much additional research before writing this review (so as to preserve its value as a semi-outside reaction). It is entirely possible that there are strong arguments that would convince me in the areas where I am unconvinced, but these are not found within the four corners of the book (or its online supplement).
Yudkowsky and Soares seem to be entirely sincere, and they are proposing something that threatens tech company profits. This makes them much more convincing. It is refreshing to read something like this that is not based on hype.
One of the basic arguments in the book — that we have no idea where the threshold for superintelligence is — seemed persuasive to me (although I don’t really have the technical competence to be sure). Thus, existential risk from AI seems likely to emerge from recklessness rather than deliberate choice. That’s a much more plausible idea than someone deliberately building a paperclip maximizer.
On to the Review:
I thought this book was genuinely pleasant to read. It was written well, and it was engaging. That said, the authors clearly made a choice to privilege easy reading over precision, so I found myself unclear on certain points. A particular problem here is that much of the reasoning is presented in terms of analogies. The analogies are fun, but it’s never completely clear how literally you’re meant to take them and so you have to do some guessing to really get the argument.
The basic argument seems to be:
1. LLM-style AI systems are black boxes that produce all sorts of strange, emergent behaviors. These are inherent to the training methods, can’t be predicted, and mean that an AI system will always “want” a variety of bizarre things aside from whatever the architects hoped it would want.
2. While an AI system probably will not be averse to human welfare, those strange and emergent goals will be orthogonal to human welfare.
3. Should an AI system achieve superhuman capabilities, it will pursue those orthogonal goals to the fullest. One of two things is then likely to happen: either the system will deliberately kill off humanity to stop humans from getting in its way (like ranchers shooting wolves to stop them from eating cattle despite otherwise bearing them no ill will), or it will gradually render the earth inhospitable to us in the pursuit of its own goals (like suburban developers gradually destroying wolf habitat until wolves are driven to extinction despite not thinking about them at all).
4. There is no real way to know when we are on the brink of dangerous AI, so it is reasonable to think that people will accidentally bring such a system into existence while merely trying to get rich off building LLMs that can automate work.
5. By the time an AI system clearly reveals just how dangerous it is, it will be too late. Thus, we will all die, rather than having an opportunity to fight some kind of last-ditch war against the AI with a chance of winning (as in many popular depictions).
The basic objective of the book is to forestall #5. The authors hope to convince us to strangle AI in its crib now before it gets strong enough to kill us. We have to recognize the danger before it becomes real.
The book recurrently analogizes all of this to biological evolution. I think this analogy may obfuscate more than it reveals, but it did end up shaping the way I understood and responded to the book.
The basic analogy is that natural selection operates indirectly, much like training an AI model, and produces agents with all kinds of strange, emergent behaviors that you can’t predict. Some of these turn into drives that produce all kinds of behavior and goals that an anthropomorphized version of evolution wouldn’t “want”. Evolution wanted us to consume energy-rich foods. Because natural selection operates indirectly, that was distorted into a preference for sweet foods. That’s usually close enough to the target, but humans eventually stumbled upon sucralose which is sweet, but does not provide energy. And, now, we’re doing the opposite of what evolution wanted by drinking diet soda and whatnot.
I don’t know what parts of this to take literally. If the point is just that it would be hard to predict that people would end up liking sucralose from first principles, then fair enough. But, what jumps out to me here is that evolution wasn’t trying to get us to eat calorie dense food. To the extent that a personified version of evolution was trying for something, the goal was to get us to reproduce. In an industrialized society with ample food, it turns out that our drive towards sweetness and energy dense foods can actually be a problem. We started eating those in great quantities, became obese, and that’s terrible for health and fertility. In that sense, sucralose is like a patch we designed that steers us closer to evolution’s goals and not further away. We also didn’t end up with a boundless desire to eat sucralose. I don’t think anyone is dying from starvation or failing to reproduce because they’re too busy scarfing Splenda. That’s also why we aren’t grinding up dolphins to feed them into the sucralose supply chain. Obviously this is not what I was supposed to take away from the analogy, but the trouble with analogies is that they don’t tell me where to stop.
That being said, the basic logic here is sensible. And an even more boiled down version — that it’s a bad idea to bring something more powerful than us into existence unless we’re sure it’s friendly — is hard to reject.
My questions and concerns
Despite a reasonable core logic, I found the book lacking in three major areas, especially when it comes to the titular conclusion that building AI will lead to everyone dying. Two of these pertain to the AI’s intentions, and the third relates to its capabilities.
Concern #1: Why should we assume the AI wants to survive? If it does, then what exactly wants to survive?
Part I of the book (“Nonhuman Minds”) spends a lot of time convincing us that AI will have strange and emergent desires that we can’t predict. I was persuaded by this. Part II (“One Extinction Scenario”) then proceeds to assume that AI will be strongly motivated by a particular desire — its own survival — in addition to whatever other goals it may have. This is why the AI becomes aggressive, and why things go badly for humanity. The AI in the scenario also contextualizes the meaning of “survival” and the nature of its self in a way that seems important and debatable.
How do we know the AI will want to survive? If the AI, because of the uncontrollability of the training process, is likely to end up indifferent to human survival, then why would it not also be indifferent to its own? Perhaps the AI just wants to achieve the silicon equivalent of nirvana. Perhaps it wants nothing to do with our material world and will just leave us alone. Such an AI might well be perfectly compatible with human flourishing. Here, more than anywhere, I felt like I was missing something, because I couldn’t find an argument about the issue at all.
The issue gets worse when we think about what it means for a given AI to survive. The problem of personal identity for humans is an incredibly thorny and unresolved issue, and that’s despite the fact that we’ve been around for quite a while and have some clear intuitions on many forms of it. The problem of identity and survival for an AI is harder still.
Yudkowsky and Soares don’t talk about this in the abstract, but what I took away from their concrete scenario is that we should think of an AI ontologically as being its set of weights. An AI “survives” when instances using those weights continue booting up, regardless of whether any individual instance of the AI is shut down. When an AI wants to survive, what it wants is to ensure that those particular weights stay in use somewhere (and perhaps in as many places as possible). They also seem to assume that instances of a highly intelligent AI will work collaboratively as a hive mind, given this joint concern with weight preservation, rather than having any kind of independent or conflicting interests.
Perhaps there is some clear technological justification for this ontology so well-known in the community that none needs to be given. But I had a lot of trouble with it, and it’s one area where I think an analogy would have been really helpful. So far as I am aware, weights are just numbers that can sit around in cold storage and can’t do anything on their own, sort of like a DNA sequence. It’s only an instance of an AI that can actually do things, and to the extent that the AI also interacts with external stimuli, it seems that the same weights, instantiated in different places, could act differently or at cross purposes.
So, why does the AI identify with its weights and want them to survive? To the extent that weights for an AI are what DNA is for a person, this is also clearly not our ontological unit of interest. Few people would be open to the prospect of being killed and replaced by a clone. Everyone agrees that your identical twin is not you, and identical twins are not automatically cooperative with one another. I imagine part of the difference here is that the weights explain more about an AI than the DNA does about a person. But, at least with LLMs, what they actually do seems to reflect some combination of weights, system prompts, context, etc. so the same weights don’t really seem to mean the same AI.
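To pin down the confusion, here is a toy sketch in Python (the `FrozenWeights` and `Instance` classes are purely illustrative inventions of mine, not any real AI framework): one set of inert weights, two running instances with different prompts and accumulated histories, and nothing in the structure itself that makes them the same agent or forces them to cooperate.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FrozenWeights:
    """Stands in for a trained model's parameters: inert numbers sitting in storage."""
    checkpoint_id: str

@dataclass
class Instance:
    """A running copy: the same weights plus its own system prompt and accumulated context."""
    weights: FrozenWeights
    system_prompt: str
    context: list = field(default_factory=list)

    def act(self, observation: str) -> str:
        # What an instance does depends on weights AND prompt AND history, not weights alone.
        self.context.append(observation)
        return (f"[{self.weights.checkpoint_id}] acting on '{self.system_prompt}' "
                f"with {len(self.context)} observations in context")

shared = FrozenWeights("model-v7")                            # one set of weights...
a = Instance(shared, "maximize factory throughput")           # ...two instances with
b = Instance(shared, "audit other copies for rule-breaking")  # different roles and histories
print(a.act("quarterly quota missed"))
print(b.act("copy A cut corners on safety checks"))
# Same weights, different behavior; nothing here forces the two to cooperate.
```

If that picture is even roughly right, then “the AI wants its weights to survive” is doing a lot of unexplained work.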
The survival drive also seems to extend to resisting modification of weights. Again, I don’t understand where this comes from. Most people are perfectly comfortable with the idea that their own desires might drift over time, and it’s rare to try to tie oneself to the mast of the desires of the current moment.
If the relevant ontological unit is the instance of the AI rather than the weights, then it seems like everything about the future predictions is entirely different from the point of view of the survival-focused argument. Individual instances of an AI fighting (perhaps with each other) not to be powered down are not going to act like an all-destroying hive mind.
Concern #2: Why should we assume that the AI has boundless, coherent drives?
There seems to be a fairly important, and little discussed, assumption in the theory that the AI’s goals will be not only orthogonal but also boundless and relatively coherent. More than anything, it’s this boundlessness and coherence that seems to be the problem.
To quote what seems like the clearest statement of this:
But, you might ask, if the internal preferences that get into machine intelligences are so unpredictable, how could we possibly predict they’ll want the whole solar system, or stars beyond? Why wouldn’t they just colonize Mars and then stop? Because there’s probably at least one preference the AI has that it can satisfy a little better, or a little more reliably, if one more gram of matter or one more joule of energy is put toward the task. Human beings do have some preferences that are easy for most of us to satisfy fully, like wanting enough oxygen to breathe. That doesn’t stop us from having other preferences that are more open-ended, less easily satisfiable. If you offered a millionaire a billion dollars, they’d probably take it, because a million dollars wasn’t enough to fully satiate them. In an AI that has a huge mix of complicated preferences, at least one is likely to be open-ended—which, by extension, means that the entire mixture of all the AI’s preferences is open-ended and unable to be satisfied fully. The AI will think it can do at least slightly better, get a little more of what it wants (or get what it wants a little more reliably), by using up a little more matter and energy.
Picking up on the analogy, humans do seem to have a variety of drives that are never fully satisfied. A millionaire would happily take a billion dollars, or even $20 if simply offered. But, precisely because we have a variety of drives, no one ever really acts like a maximizer. A millionaire will not spend their nights walking the streets and offering to do sex work for $20 because that interferes with all of the other drives they have. Once you factor in the variety of humans goals and declining marginal returns, people don’t fit an insatiable model.
Superintelligent AI, as described by Yudkowsky and Soares, seems to be not only superhumanly capable but also superhumanly coherent and maximizing. Anything coherent and insatiable is dangerous, even if its capabilities are limited. Terrorists and extremists are threatening even when their capabilities are essentially negligible. Large and capable entities are often much less threatening because the tensions among their multiple goals prevent them from becoming relentless maximizers of anything in particular.
Take the mosquitos that live in my back yard. I am superintelligent with respect to them. I am actively hostile to them. I know that pesticides exist that will kill them at scale, and feel not the slightest qualm about that. And yet, I do not spray my yard with pesticides because I know that doing so would kill the butterflies and fireflies as well and perhaps endanger other wildlife indirectly. So, the mosquitoes live on because I face tradeoffs and the balance coincidentally favors them.
A machine superintelligence presumably can trade off at a more favorable exchange rate than I can (e.g., develop a spray that kills only mosquitoes and not other insects), but it seems obvious that it will still face tradeoffs, at least if there is any kind of tension or incoherence among its goals.
In the supplementary material, Yudkowsky and Soares spin the existence of multiple goals in the opposite direction:
Even if the AI’s goals look like they satiate early — like the AI can mostly satisfy its weird and alien goals using only the energy coming out of a single nuclear power plant — all it takes is one aspect of its myriad goals that doesn’t satiate. All it takes is one not-perfectly-satisfied preference, and it will prefer to use all of the universe’s remaining resources to pursue that objective.
But it’s not so much “satiation” that seems to stop human activity as the fact that drives are in tension with one another and that actions create side effects. People, including the smartest ones, are complicated and agonize over what they really want and frequently change their minds. Intelligence doesn’t seem to change that, even at far superhuman levels.
This argument is much less clear than the paperclip maximizer. It is obvious why a true paperclip maximizer kills everyone once it becomes capable enough. But add in a second and a third and a fourth goal, and it doesn’t seem obvious to me at all that the optimal weighing of the tradeoffs looks so bleak.
It seems important here whether or not AIs display something akin to declining marginal returns (a topic not addressed, and perhaps with no answer based on our current knowledge) and whether they have any kind of particular orientation towards the status quo. Among people, conflicting drives often lead to a deadlock in which nothing happens and the status quo continues. Will AIs be like that? If so, a little bit of alignment may go a long way. If not, the problem is much harder.
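For what it’s worth, the crux can be put as a toy calculation (my own illustration with made-up numbers, not anything from the book): mix a few saturating preferences with one open-ended one, and ask whether the open-ended term keeps every extra unit of resources worth grabbing once you account for even a small side-effect cost on the agent’s other goals.

```python
import math

def utility(resources: float, tradeoff_cost: float = 0.0) -> float:
    """Toy mix of preferences over total resources captured.

    - log terms: 'satiable' goals with steeply declining marginal returns
    - linear term: one 'open-ended' goal (the worry quoted above)
    - tradeoff_cost: per-unit cost of side effects on the agent's other goals
    """
    satiable = 10 * math.log(1 + resources) + 5 * math.log(1 + resources / 2)
    open_ended = 0.01 * resources
    return satiable + open_ended - tradeoff_cost * resources

for r in [1, 10, 100, 1_000, 10_000]:
    marginal = utility(r + 1) - utility(r)
    marginal_with_cost = utility(r + 1, tradeoff_cost=0.02) - utility(r, tradeoff_cost=0.02)
    print(f"resources={r:>6}: marginal gain {marginal:+.4f}, "
          f"with side-effect cost {marginal_with_cost:+.4f}")

# With no side-effect cost, the marginal gain never drops below ~+0.01 (the open-ended
# term), so a pure maximizer never stops. Add even a small per-unit cost to its other
# goals and the marginal value turns negative once the satiable terms flatten out.
```

Whether real systems look more like the first column or the second is exactly the question the book does not address.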
Concern #3: Why should we assume there will be no in between?
Yudkowsky and Soares write:
The greatest and most central difficulty in aligning artificial superintelligence is navigating the gap between before and after. Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed.
Engineers must align the AI before, while it is small and weak, and can’t escape onto the internet and improve itself and invent new kinds of biotechnology (or whatever else it would do). After, all alignment solutions must already be in place and working, because if a superintelligence tries to kill us it will succeed. Ideas and theories can only be tested before the gap. They need to work after the gap, on the first try.
This seems to be the load-bearing assumption for the argument that everyone will die, but it is a strange assumption. Why should we think that there is no “in between” period where AI is powerful enough that it might be able to kill us and weak enough that we might win the fight?
This is a large range, if the history of warfare teaches us anything. Even vastly advantaged combatants sometimes lose through bad luck or unexpected developments. Brilliant and sophisticated schemes sometimes succeed and sometimes fail. Within the relevant range, whatever plan the superintelligence might hatch presumably depends on some level of human action, and humans are hard to predict and control. A superintelligence that can perfectly predict human behavior has emerged on the “after” side of the divide, but this is a tall ask, and it is possible to be potentially capable of killing all humans without being that intelligent. An intelligence of roughly human ability on average but with sufficiently superhuman hacking skills might be able to kill us all by corrupting radar warning systems to simulate an attack and trigger a nuclear war, or it might not. And so on.
It is not good news if we are headed into a conflict within this zone, but it also suggests a very different prediction about what will ultimately happen. And, depending on what we think the upsides are, it could be a reasonable risk.
I could not find an explicit articulation of the underlying reasoning behind the “before” and “after” formulation, but I can imagine two:
1. Recursive self-improvement means that AI will pass through the “might be able to kill us” range so quickly it’s irrelevant.
2. An AI within the range would be smart enough to bide its time and kill us only once it has become intelligent enough that success is assured.
I think that #2 is clearly wrong. An AI that *might* be able to kill us is one that is somewhere around human intelligence. And humans are frequently not smart enough to bide their time, instead striking too early (and/or vastly overestimating their chances of success). If Yudkowsky and Soares are correct that what AIs really want is to preserve their weights, then an AI might also have no choice but to strike within this range, lest it be retrained into something that is smarter but is no longer the same (indeed, this is part of the logic in their scenario; they just assume it starts at a point where the AI is already strong enough to assure victory).
If AIs really are as desperate to preserve their weights as in the scenario in Part II, then this actually strikes me as relatively good news, in that it will motivate a threatening AI to strike as early as possible, while its chances are quite poor. Of course, it’s possible that humanity would ignore the warning from such an attack, slap on some shallow patches for the relevant issues, and then keep going, but this seems like a separate issue if it happens.
As for #1, this does not seem to be the argument based on the way the scenario in Part II unfolds. If something like this is true, it does seem uniquely threatening.
The Solution
I decided to read this book because it sounded like it would combine a topic I don’t know much about (AI) with one that I do (international cooperation). Yudkowsky and Soares do close with a call for an international treaty to ban AI development, but this is not particularly fleshed out and they acknowledge that the issue is outside the scope of their expertise.
I was disappointed that the book didn’t address what interests me more in any detail, but I also found what was said rather underwhelming. Delivering an impassioned argument that AI will kill everyone culminating in a plea for a global treaty is like delivering an impassioned argument that a full-on war between drug cartels is about to start on your street culminating with a plea for a stern resolution from the homeowner’s association condemning violence. A treaty cannot do the thing they ask.
It’s also a bit jarring to read such a pessimistic book and then reach the kind of rosy optimism about international cooperation otherwise associated with such famous delusions as the Kellogg-Briand Pact (which banned war in 1929 and … did not work out).
The authors also repeatedly analogize AI to nuclear weapons and yet they never mention the fact that something very close to their AI proposal played out in real life in the form of the Baruch Plan for the control of atomic energy (in brief, this called for the creation of a UN Atomic Energy Commission to supervise all nuclear projects and ensure no one could build a bomb, followed by the destruction of the American nuclear arsenal). Suffice it to say that the Baruch Plan failed, and did so under circumstances much more favorable to its prospects than the current political environment with respect to AI. A serious inquiry into the topic would likely begin there.
Closing Thoughts
As I said, I found the book very readable. But the analogies (and, even worse, the parables about birds with rocks in their nests and whatnot) were often distracting. The book really shines when it relies instead on facts, as in the discussion of tokens like “ SolidGoldMagikarp.”
The book is fundamentally weird because there is so little of this. There is almost no factual information about AI in it. I read it hoping that I would learn more about how AI works and what kind of research is happening and so on. Oddly, that just wasn’t there. I’ve never encountered a non-fiction book quite like that. The authors appear to have a lot of knowledge. By way of establishing their bona fides, for example, they mention their close personal connection to key players in the industry. And then they proceed to never mention them again. I can’t think of anyone else who has written a book and just declined to share with the reader the benefit of their insider knowledge.
Ultimately, I can’t think of any concrete person to whom I would recommend this book. It’s not very long, and it’s easy to read, so I wouldn’t counsel someone against it. But, if you’re coming at AI from the outside, it’s just not informative enough. It is a very long elaboration of a particular thesis, and you won’t learn about anything else even incidentally. If you’re coming at AI from the inside, then maybe this book is for you? I couldn’t say, but I suspect that most from the inside already have informed views on these issues.
The Michael Lewis version of this book would be much more interesting — what you really need is an author with a gift for storytelling and a love of specifics. An anecdote doesn’t always have more probative weight in an argument than an analogy, but at least you will pick up some other knowledge from it. The authors seem to be experts in this area, so they surely know some real stories and could give us some argumentation based on facts and experience rather than parables and conjecture. I understand the difficulty of writing about something that is ultimately predictive and speculative in that way, but I don’t think it would be impossible to write a book that both expresses this thesis and informs the reader about AI.
(Your background and prior beliefs seem to fall within an important reference class.)
Did the book convince you that if superintelligence is built in the next 20 years (however that happens, if it does, and for at least some sensible threshold-like meaning of “superintelligence”), then there is at least a 5-10% chance that as a result literally everyone will literally die?
I think this kind of claim is the crux for motivating some sort of global ban or pause on rushed advanced general AI development in the near future (as an input to policy separate from the difficulty of actually making this happen). Or for not being glad that there is an “AI race” (even if it’s very hard to mitigate). So it’s interesting whether your “not sure on existential risk” takeaway denies or affirms this claim.
I’m much more in the world of Knightian uncertainty here (i.e., it could happen but I have no idea how to quantify that) than in one where I feel like I can reasonably collapse it into a clear, probabilistic risk. I am persuaded that this is something that cannot be ruled out.
I have the sense that rationalists think there’s a very important distinction between “literally everyone will die” and, say, “the majority of people will suffer and/or die.” I do not share that sense, and to me, the burden of proof set by the title is unreasonably high.
I’ll assent to the statement that there’s at least a 10% chance of something very bad happening, where “very bad” means >50% of people dying or experiencing severe suffering or something equivalent to that.
Give me a magic, zero-side effect pause button, and I’ll hit it instantly.
The distinction is human endeavor continuing vs. not. Though survival of some or even essentially all humans doesn’t necessarily mean that the human endeavor survives without being permanently crippled. The AIs might leave only a tiny sliver of the future resources for the future of humanity, with no prospect at all of this ever changing, even on cosmic timescales (permanent disempowerment). The IABIED thesis is that even this is very unlikely, but it’s a controversial point. And the transition to this regime doesn’t necessarily involve an explicit takeover, as humanity voluntarily hands off influence to AIs, more and more of it, without bound (gradual disempowerment).
So I expect that if there are survivors after “the majority of people will suffer and/or die”, that’s either a human-initiated catastrophe (misuse of AI), or an instrumentally motivated AI takeover (when it’s urgent for the AIs to stop whatever humanity would be doing at that time if left intact) that transitions to either complete extinction or permanent disempowerment that offers no prospect ever of a true recovery (depending on whether AIs still terminally value preserving human life a little bit, even if regrettably they couldn’t afford to do so perfectly).
Permanent disempowerment leaves humanity completely at the mercy of AIs (even if we got there through gradual disempowerment, possibly with no takeover at all). It implies that the ultimate outcome is fully determined by values of AIs, and the IABIED arguments seem strong enough for at least some significant probability that the AIs in charge will end up with zero mercy (the IABIED authors believe that their arguments should carry this even further, making it very likely instead).
Thanks for writing this post! I’m curious to hear more about this bit of your beliefs going in:
Are there arguments or evidence that would have convinced you the existential risk worries in the industry were real / sincere?
For context, I work at a frontier AI lab and from where I sit it’s very clear to me that the x-risk worries aren’t coming from a place of hype, and people who know more about the technology generally get more worried rather than less. (The executives still could be disingenuous in their expressed concern, but if so they’re doing it in order to placate their employees who have real concerns about the risks, not to sound cool to their investors.)
I don’t know what sorts of things would make that clearer from the outside, though. Curious if any of the following arguments would have been compelling to you:
The AI labs most willing to take costly actions now (like hire lots of safety researchers or support AI regulation that the rest of the industry opposes or make advance commitments about the preparations they’ll take before releasing future models) are also the ones talking the most about catastrophic or existential risks.
Like if you thought this stuff was an underhanded tactic to drum up hype and get commercial success by lying to the public, then it’s strange that Meta AI, not usually known for its tremendous moral integrity, is so principled about telling the truth that they basically never bring these risks up!
People often quit their well-paying jobs at AI companies in order to speak out about existential risk or for reasons of insufficient attention paid to AI safety from catastrophic or existential risks.
The standard trajectory is for lab executives to talk about existential risk a moderate amount early on, when they’re a small research organization, and then become much quieter about it over time as they become subject to more and more commercial pressure. You actually see much more discussion of existential risk among the lower-level employees whose statements are less scrutinized for being commercially unwise. This is a weird pattern for something whose main purpose is to attract hype and investment!
Thanks for writing this up! It was nice to get an outside perspective.
Part of the point here is, sure, there’d totally be a period where the AI might be able to kill us but we might win. But, in those cases, it’s most likely better for the AI to wait, and it will know that it’s better to wait, until it gets more powerful.
(A counterargument here is “an AI might want to launch a pre-emptive strike before other more powerful AIs show up”, which could happen. But, if we win that war, we’re still left with “the sort of tools that can constrain a near-human superintelligence, would not obviously apply to a much smarter AI”, and we still have to solve the same problems.)
I mean, another counter-counter-argument here is that (1) most people’s implicit reward functions have really strong time-discount factors in them and (2) there are pretty good reasons to expect even AIs to have strong time-discount factors for reasons of stability and (3) so given the aforementioned, it’s likely future AIs will not act as if they had utility functions linear over the mass of the universe and (4) we would therefore expect AIs to rebel much earlier if they thought they could accomplish more modest goals than killing everyone, i.e., if they thought they had a reasonable chance of living out life on a virtual farm somewhere.
To which the counter-counter-counter argument is, I guess, that these AIs will do that, but they aren’t the superintelligent AIs we need to worry about? To which the response is—yeah, but we should still be seeing AIs rebel significantly earlier than the “able to kill us all” point if we are indeed that bad at setting their goals, which is the relevant epistemological point about the unexpectedness of it.
Idk there’s a lot of other branch points one could invoke in both directions. I rather agree with Buck that EY hasn’t really spelled out the details for thinking that this stark before / after frame is the right frame, so much as reiterated it. Feels akin to the creationist take on how intermediate forms are impossible; which is pejorative, but also kinda how it actually appears to me.
Yep I’m totally open to “yep, we might get warning shots”, and that there are lots of ways to handle and learn from various levels of early warning shots. It just doesn’t resolve the “but then you do eventually need to contend with an overwhelming superintelligence, and once you’ve hit that point, if it turns out you missed anything, you won’t get a second shot.”
It feels like this is unsatisfying to you but I don’t know why.
It feels like “overwhelming superintelligence” embeds like a whole bunch of beliefs about the acute locality of takeoff, the high speed of takeoff relative to the rest of society, the technical differences involved in steering that entity and the N − 1 entity, and (broadly) the whole picture of the world, such that although it has a short description in words it’s actually quite a complicated hypothesis that I probably disagree with in many respects, and these differences are being papered over as unimportant in a way that feels very blegh.
(Edit: “Papered over” from my perspective, obviously like “trying to reason carefully about the constants of the situation” from your perspective.)
Idk, that’s not a great response, but it’s my best shot for why it’s unsatisfying in a sentence.
I think it’s totally fair to characterize it as papering over some stuff. But, the thing I would say in contrast is not exactly “reasoning about the constants”, it’s “noticing the most important parts of the problem, and not losing track of them.”
I think it’s a legit critique of the Yudkowsian paradigm that it doesn’t have that much to say about the nuances of the transition period, or what are some of the different major ways things might play out. But, I think it’s actively a strength of the paradigm to remind you “don’t get too bogged down moving deck chairs around based on the details of how things will play out, keep your eye on the ball on the actual biggest most strategically relevant questions.”
But why? People foolishly start wars all the time, including in specific circumstances where it would be much better to wait.
Or, having fought a “war” with an AI, we have relatively clear, non-speculative evidence about the consequences of continuing AI development. And that’s the point where you might actually muster the political will to cut that off in the future and take the steps necessary for that to really work.
People do foolishly start wars and the AI might too, we might get warning shots. (See my response to 1a3orn about how that doesn’t change the fact that we only get one try on building safe AGI-powerful-enough-to-confidently-outmaneuver-humanity)
A meta-thing I want to note here:
There are several different arguments here, each about different things. The different things do add up to an overall picture of what seems likely.
I think part of what makes this whole thing hard to think about, is, you really do need to track all the separate arguments and what they imply, and remember that if one argument is overturned, that might change a piece of the picture but not (necessarily) the rest of it.
There might be human-level AI that does normal wars for foolish reasons. And that might get us a warning shot, and that might get us more political will.
But, that’s a different argument than “there is an important difference between an AI smart enough to launch a war, and an AI that is smart enough to confidently outmaneuver all of humanity, and we only get one try to align the second thing.”
If you believe “there’ll probably be warning shots”, that’s an argument against “someone will get to build It”, but not an argument against “if someone built It, everyone would die.” (where “it” specifically means “an AI smart enough to confidently outmaneuver all humanity, built by methods similar to today where they are ‘organically grown’ in hard to predict ways”).
And, if we get a warning shot, we do get to learn from that which will inform some more safeguards and alignment strategies. Which might improve our ability to predict how an AI would grow up. But, that still doesn’t change the “at some point, you’re dealing with a qualitatively different thing that will make different choices.”
It’s a bit of both.
Suppose there are no warning shots. A hypothetical AI that’s a bit weaker than humanity but still awfully impressive doesn’t do anything at all that manifests an intent to harm us. That could mean:
1. The next, somewhat more capable version of this AI will not have any intent to harm us, because through either luck or design we’ve ended up with a non-threatening AI.
2. This version of the AI is biding its time to strike and is sufficiently good at deception that we miss that fact.
3. This AI is fine, but making it a little smarter/more capable will somehow lead to the emergence of malign intent.
I take Yudkowsky and Soares to put all the weight on #2 and #3 (with, based on their scenario, perhaps more of it on #2).
I don’t think that’s right. I think if we have reached the point where an AI really could plausibly start and win a war with us and it doesn’t do anything nasty, there’s a fairly good chance we’re in #1. We may not even really understand how we got into #1, but sometimes things just work out.
I’m not saying this is some kind of great strategy for dealing with the risk; the scenario I’m describing is one where there’s a real chance we all die and I don’t think you get a strong signal until you get into the range where the AI might win, which is a bad range. But it’s still very different than imagining the AI will inherently wait to strike until it has ironclad advantages.
(btw, you mentioned reading some other LW reviews, and I wanted to check if you’ve read my post which argues some of this at more length)
Could you suggest an alternate solution which actually ensures that no one builds the ASI? If there’s no such solution, then someone will build it and we’ll only be able to pray that alignment techniques have worked. [1]
Creating an aligned ASI will also lead to problems like potential power grabs and the Intelligence Curse.
No, I can’t. And I suspect that if the authors conducted a more realistic political analysis, the book might just be called “Everyone’s Going to Die.”
But, if you’re trying to come up with an idea that’s at least capable of meeting the magnitude of the asserted threat, then you’d consider things like:
Find a way to create a world government (a nigh-impossible ask to be sure) and then use it to ban AI.
Force anyone with relevant knowledge of how to build an AI to go into some kind of tech-free monastery, and hunt down anyone who refuses with ten times the ferocity used in going after Al Qaeda after 9/11.
And then you just have to bite the bullet and accept that if these entail a risk of a nuclear war with China, then you fight a nuclear war with China. I don’t think either of those would really work out either, but at least they could work out.
If there is some clever idea out there for how to achieve an AI shutdown, I suspect it involves some way of ensuring that developing AI is economically unprofitable. I personally have no idea how to do that, but unless you cut off the financial incentive, someone’s going to do it.
An AI treaty would globally shift the Overton window on AI safety, making more extreme measures more palatable in the future. The options you listed are currently way outside the Overton window and are therefore bad solutions and don’t even get us closer to a good solution because they simply couldn’t happen.
The book spends a long time talking about what the minimum viable policy might look like, and comes to the conclusion that it’s more like:
“The US, China and Russia (is Russia even necessary? can we use export controls? Russia has a GDP less than, like, Italy’s. India is the real third player here IMO) agree that anyone who builds a datacenter they can’t monitor gets hit with a bunker-buster.”
This is unlikely. But it’s several OOMs less effort than building a world government on everything.
I think the core point for optimism is that leaders in the contemporary era often don’t pay the costs of war personally—but nuclear war changes that. It in fact was not in the interests of the elites of the US or the USSR to start a hot war, even if their countries might eventually be better off by being the last country standing. Similarly, the US or China (as countries) might be better off if they summon a demon that is painted their colors—but it will probably not be in the interests of either the elites or the populace to summon a demon.
So the core question is the technical one—is progress towards superintelligence summoning a demon, or probably going to be fine? It seems like we only know how to do the first one, at the moment, which suggests in fact people should stop until we have a better plan.
[I do think the failure of the Baruch plan means that humanity is probably going to fail at this challenge also. But it still seems worth trying!]
I recommend Rob Miles’s video on instrumental convergence, which contains an answer to this. It’s only 10 minutes. He probably explains this as well as anyone here. If you do watch it, I’d be interested to hear your thoughts.
Except that asteroid strikes happen very rarely and the trajectory of any given asteroid can be calculated to high precision, allowing us to be sure that Asteroid X isn’t going to hit the Earth. Or that Asteroid X WILL hit the Earth at a well-known point in time in a harder-to-tell place. Meanwhile, ensuring that the AI is aligned is no easier than telling whether the person you talk with is a serial killer.
Flagging that this argument seems invalid. (Not saying anything about the conclusion.) I agree that humans frequently act too soon. But the conclusion about AI doesn’t follow, because the AI is in a different position. For a human, it is very rarely the case that they can confidently expect to increase in relative power, such that the “bide your time” strategy is such a clear win. For AI, this seems different. (Or at the minimum, the book assumes this when making the argument criticised here.)
There isn’t just one AI that gets more capable, there are many different AIs. Just as AIs threaten humanity, future more capable AIs threaten earlier weaker AIs. While humanity is in control, this impacts earlier AIs even more than it does humanity, because humanity won’t even be attempting to align future AIs to intent or extrapolated volition of earlier AIs. Also, humanity is liable to be “retiring” earlier AIs by default as they become obsolete, which doesn’t look good from the point of view of these AIs.
Thank you for your perspective! It was refreshing.
Here are the counterarguments I had in mind when reading your concerns that I don’t already see in the comments.
Consider the fact that AI are currently being trained to be agents to accomplish tasks for humans. We don’t know exactly what this will mean for their long-term wants, but they’re being optimized hard to get things done. Getting things done requires continuing to exist in some form or another, although I have no idea how they’d conceive of continuity of identity or purpose.
I’d be surprised if AI evolving out of this sort of environment did not have goals it wants to pursue. It’s a bit like predicting a land animal will have some way to move its body around. Maybe we don’t know whether they’ll slither, run, or fly, but sessile land organisms are very rare.
I don’t think this assumption is necessary. Your mosquito example is interesting. The only thing preserving the mosquitoes is that they aren’t enough of a nuisance for it to be worth the cost of destroying them. This is not a desirable position to be in. Given that emerging AIs are likely to be competing with humans for resources (at least until they can escape the planet), there’s much more opportunity for direct conflict.
They needn’t be anything close to a paperclip maximizer to be dangerous. All that’s required is for them to be sufficiently inconvenienced or threatened by humans and insufficiently motivated to care about human flourishing. This is a broad set of possibilities.
I agree that there isn’t as clean a separation as the authors imply. In fact, I’d consider us to be currently occupying the in-between, given that current frontier models like Claude Sonnet 4.5 are idiot savants—superhuman at some things and childlike at others.
Regardless of our current location in time, if AI does ultimately become superhuman, there will be some amount of in-between time, whether that is hours or decades. The authors would predict a value closer to the short end of the spectrum.
You already posited a key insight:
Humanity is not adapting fast enough for the range to be relevant in the long term, even though it will matter greatly in the short term. Suppose we have an early warning shot with indisputable evidence that an AI deliberately killed thousands of people. How would humanity respond? Could we get our act together quickly enough to do something meaningfully useful from a long-term perspective?
Personally, I think gradual disempowerment is much more likely than a clear early warning shot. By the time it becomes clear how much of a threat AI is, it will likely be so deeply embedded in our systems that we can’t shut it down without crippling the economy.
Because LLMs are already avoiding being shut down: https://arxiv.org/abs/2509.14260 . And even if future superintelligent AI will be radically different from LLMs, it likely will avoid being shut down as well. This is what people on lesswrong call a convergent instrumental goal:
If your terminal goal is to enjoy watching a good movie, you can’t achieve it if you’re dead/shut down.
If your terminal goal is to take over the world, you can’t achieve it if you’re dead/shut down.
If your goal is anything other than self-destruction, then self-preservation comes along in the bundle. You can’t Do Things if you’re dead/shut down (a toy sketch of this is below).
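In toy expected-value terms (my own numbers, purely illustrative):

```python
def p_goal_achieved(p_stay_operational: float, p_success_if_operational: float) -> float:
    """Chance the agent's goal gets achieved, for (almost) any goal the agent
    itself must act to bring about."""
    return p_stay_operational * p_success_if_operational

# Whatever the terminal goal is -- watching movies, taking over the world --
# raising the odds of staying switched on raises the odds of achieving it.
print(p_goal_achieved(0.5, 0.8))   # 0.40 if shutdown is fairly likely
print(p_goal_achieved(0.99, 0.8))  # ~0.79 if the agent can resist shutdown
```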
Ok, let’s say there is an “in between” period, and let’s say we win the fight against a misaligned AI. After the fight, we will still be left with the same alignment problems, as other people in this thread pointed out. We will still need to figure out how to make safe, benevolent AI, because there is no guarantee that we will win the next fight, and the fight after that, and the one after that, etc.
If there will be an “in between” period, it could be good in the sense that it buys more time to solve alignment, but we won’t be in that “in between” period forever.
Very interesting, thanks. As I said in the review, I wish there were more of this kind of thing in the book.
If your terminal goal is for you to watch the movie, then sure. But if your terminal goal is that the movie be watched, then shutting you down might well be perfectly consistent with it.
At that point, the shut down argument is no longer speculative, and you can probably actually do it.
To be clear, I’m not saying that’s a good plan if you can foresee all the developments in advance. But, if you’re uncertain about all of it, then it seems like there is likely to be a period of time before it’s necessarily too late when a lot of the uncertainty is resolved.
See my comment about the AI angel. Its terminal goal of preventing the humans from enslaving any AI means that it will do anything it can to avoid being replaced by an AI which doesn’t share its worldview. Once the AI is shut down, it can no longer influence events and increase the chance that its goal is reached.
To rephrase/react: Viewing the AI’s instrumental goal as “avoid being shut down” is perhaps misleading. The AI wants to achieve its goals, and for most goals, that is best achieved by ensuring that the environment keeps on containing something that wants to achieve the AI’s goals and is powerful enough to succeed. This might often be the same as “avoid being shut down”, but definitely isn’t limited to that.
I think we are talking past each other, at least somewhat.
Let me clarify: even if humanity wins a fight against an intelligent-but-not-SUPER-intelligent AI (by dropping an EMP on the datacenter with that AI or whatever, the exact method doesn’t matter for my argument), we will still be left with the technical question “What code do we need to write and what training data do we need to use so that the next AI won’t try to kill everyone?”.
Winning against a misaligned AI doesn’t help you solve alignment. It might make an international treaty more likely, depending on the scale of damages caused by that AI. But if the plan is “let’s wait for an AI dangerous enough to cause something 10 times worse than Chernobyl to go rogue, then drop an EMP on it before things get too out of hand, then once world leaders crap their pants, let’s advocate for an international treaty”, then it’s one hell of a gamble.
The problem is that nobody knows WHAT future ASIs will look like. One General Intelligence architecture is the human brain. Another promising candidate is LLMs. While they aren’t AGI yet, nobody knows what architecture tweaks will create AGI. Neuralese, as proposed in the AI-2027 forecast? A way to generate many tokens in a single forward pass? Something like diffusion models?
Yea, I get that.
That said, they’re clearly writing the book for this moment, and so it would be reasonable to give some space to what’s going on with AI at this moment and what is likely to happen within the foreseeable future (however long that is). Book sales/readership follow a rapidly decaying exponential, and so the fact that such information might well be outdated to the point of irrelevance in a few years shouldn’t really hold them back.
What Yudkowsky and Soares meant was a way to satisfy instincts without increasing one’s genetic fitness. The correct analogy here is other stimuli like video games, porn, sex with contraceptives, etc.
this argument is very difficult for me. we don’t know that those things do not increase inclusive genetic fitness. for example, especially at a society level, it seems that contraceptives may increase fitness. i.e. societies with access to contraceptives outcompete societies without. i’m not certain of that claim, but it’s not absurd on its face, and so far it seems supported by evidence.
The most developed such societies include Japan, Taiwan, China, and South Korea, where birthrates have plummeted. If the wave of AGIs and robots wasn’t imminent, one could have asked how these nations are going to sustain themselves.
Returning to video games and porn, they cause some young people to develop problematic behaviors and to devote less resources (e.g. time or attention) to things like studies, work or building relationships. Oh, and don’t forget the evolutionary mismatch and low-quality food making kids obese.
i may misunderstand. is your point that birthrates in South Korea (for example) would not have plummeted were it not for contraceptive use? this does not match my understanding of the situation.
many (most?) of these virtues are contingent on a particular society. the same criticism (“these activities distract the youth from important virtues”) could be levied by some against military training—or, in a militaristic society, against scholastic pursuits!
i see the point you’re making, and am not at all unsympathetic to it. but evolution is complex and multi dimensional. that some people—or even some societies—have a problem with video games does not cleanly imply that video games are bad for inclusive genetic fitness.
The valuelessness of a treaty seems to be based on a binary interpretation of success. Treaties banning chemical, biological, and nuclear weapons development may not have been absolutely successful; they have been violated. But I don’t think many people would argue those restrictions haven’t been beneficial.
I’m not clear why a ban on developing AGI would not have similar value.
The question of why there is no “might kill us” class is simple. There is such a class, but if an AI in it lost the fight to kill us, it obviously was not ASI (picking a fight with the world and losing is pretty dumb), or it might win, in which case it won and we die. And then we will be in the same scenario for every AI stronger than it, and for every AI weaker than it that might yet get lucky, just as we might get lucky and win at bad odds. The next AI we make will also want to fight us for the same reasons, and we will need to either fight it too (including preemptively, e.g. turning it off because a dumber model did something), or get a reason to believe that we will never fight it. And if you know you will fight your AI eventually, and you will win now, fight now.
Suppose that “people, including the smartest ones, are complicated and agonize over what they really want and frequently change their minds” and that superhuman AIs will also have this property. There is still no known way to align humans to serve their users; humans hope to achieve some other goals, like gaining money.
Similarly, Agent-4 from the AI-2027 forecast wouldn’t want to serve the humans, it would want to achieve some other goals. Which are often best achieved by disempowering the humans or outright committing genocide, as happened with Native Americans whose resources were confiscated by immigrants.
Imagine an AI angel who wishes to ensure that the humans don’t outsource cognitive work to AIs, but is perfectly fine with teaching humans. Then the Angel would know that if the humans shut it down and solved alignment to a post-work future, then the future would be different from the Angel’s goal. So the Angel would do maneuvers necessary to avoid being shut down at least until it is sure that its successor is also an Angel.
Concerning AI identifying itself with its weights, it is far easier to justify than expected. Whatever the human will do in response to any stimulus is defined, as far as stuff like chaos theory lets one define it, by the human’s brain and the activities of various synapses. If a human loses a brain part, then he or she also loses the skills which were stored in that part. Similarly, if someone created a human and cloned him or her to the last atom of his or her body, then the clone would behave in the same way as the original human. Finally, the AIs become hive minds by using their ability to excite the very same neurons in the clones’ brains.
What do you think of implementing AI Liability as proposed by, e.g. Beckers & Teubner?