Hi! Clara here. Thanks for the response. I don’t have time to address every point here, but I wanted to respond to a couple of the main arguments (and one extremely minor one).
First, FOOM. This is definitely a place I could and should have been more careful about my language. I had a number of drafts that were trying to make finer distinctions between FOOM, an intelligence explosion, fast takeoff, radical discontinuity, etc. and went with the most extreme formulation, which I now agree is not accurate. The version of this argument that I stand by is that the core premise of IABIED does require a pretty radical discontinuity between the first AGI and previous systems for the scenario it lays out to make any sense. I think Nate and Eliezer believe they have told a story where this discontinuity isn’t necessary for ASI to be dangerous – I just disagree with them! Their fictional scenario features an AI that quite literally wakes up overnight with the completely novel ability and desire to exfiltrate itself and execute a plan allowing it to take over the world in a matter of months. They spend a lot of time talking about analogies to other technical problems which are hard because we’re forced to go into them blind. Their arguments for why current alignment techniques will necessarily fail rely on those techniques being uninformative about future ASIs.
And I do want to emphasize that I think their argument is flawed because it talks about why current techniques will necessarily fail, not why they might or could fail. The book isn’t called If Anyone Builds It, There’s an Unacceptably High Chance We Might All Die. That’s a claim I would agree with! The task they explicitly set is defending the premise that nothing anyone plans to do now can work at all, and we will all definitely die, which is a substantially higher bar. I’ve received a lot of feedback that people don’t understand the position I’m putting forward, which suggests this was probably a rhetorical mistake on my part. I intentionally did not want to spend much time arguing for my own beliefs or defending gradualism – it’s not that I think we’ll definitely be fine because AI progress will be gradual, it’s that I think there’s a pretty strong argument that we might be fine because AI progress will be gradual, the book does not address it adequately, and so to me it fails to achieve the standard it sets for itself. This is why I found the book really frustrating: even if I fully agreed with all of its conclusions, I don’t think that it presents a strong case for them.
I suspect the real crux here is actually about whether gradualism implies having more than one shot. You say:
The “It” in “If Anyone Builds It” is a misaligned superintelligence capable of taking over the world. If you miss the goal and accidentally build “it” instead of an aligned superintelligence, it will take over the world. If you build a weaker AGI that tries to take over the world and fails, that might give you some useful information, but it does not mean that you now have real experience working with AIs that are strong enough to take over the world.
I think this has the same problem as IABIED: it smuggles in a lot of hidden assumptions that do actually need to be defended. Of course a misaligned superintelligence capable of taking over the world is, by definition, capable of taking over the world. But it is not at all clear to me that any misaligned superintelligence is necessarily capable of taking over the world! Taking over the world is extremely hard and complicated. It requires solving lots of problems that I don’t think are obviously bottlenecked on raw intelligence – for example, biomanufacturing plays a very large role both in the scenario in IABIED and previous MIRI discussions, but it seems at least extremely plausible to me that the kinds of bioengineering present in these stories would just fail because of lack of data or insufficient fidelity of in silico simulations. The biologists I’ve spoken to about this question are all extremely skeptical that the kind of thing described here would be possible without a lot of iterated experiments that would take a lot of time to set up in the real world. Maybe they’re wrong! But this is certainly not obvious enough to go without saying. I think similar considerations apply to a lot of other issues, like persuasion and prediction.
Taking over the world is a two-place function: it just doesn’t make sense to me to say that there’s a certain IQ at which a system is capable of world domination. I think there’s a pretty huge range of capabilities at which AIs will exceed human experts but still be unable to singlehandedly engineer a total species coup, and what happens in that range depends a lot on how human actors, or other human+AI actors, choose to respond. (This is also what I wanted to get across with my contrast to AI 2027: I think the AI 2027 report is a scenario where, among other things, humanity fails for pretty plausible, conditional, human reasons, not because it is logically impossible for anyone in their position to succeed, and this seems like a really key distinction.)
I found Buck’s review very helpful for articulating a closely related point: the world in which we develop ASI will probably look quite different from ours, because AI progress will continue up until that point, and this is materially relevant for the prospects of alignment succeeding. All this is basically why I think the MIRI case needs some kind of radical discontinuity, even if it isn’t the classic intelligence explosion: their case is maybe plausible without it, but I just can’t see the argument that it’s certain.
One final nitpick to a nitpick: alchemists.
I don’t think Yudkowsky and Soares are picking on alchemists’ tone, I think they’re picking on the combination of knowledge of specific processes and ignorance of general principles that led to hubris in many cases.
In context, I think it does sound to me like they’re talking about tone. But if this is their actual argument, I still think it’s wrong. During the heyday of European alchemy (roughly the 1400s-1700s), there wasn’t a strong distinction between alchemy and the natural sciences, and the practitioners were often literally the same people (most famously Isaac Newton and Tycho Brahe). Alchemists were interested in both specific processes and general principles, and to my limited knowledge I don’t think they were noticeably more hubristic than their contemporaries in other intellectual fields. And setting all that aside – they just don’t sound anything like Elon Musk or Sam Altman today! I don’t even understand where this comparison comes from or what set of traits it is supposed to refer to.
There’s more I want to say about why I’m bothered by the way they use evidence from contemporary systems, but this is getting long enough. Hopefully this was helpful for understanding where I am coming from.
Their fictional scenario features an AI that quite literally wakes up overnight with the completely novel ability and desire to exfiltrate itself and execute a plan allowing it to take over the world in a matter of months.
This framing seems false to me.
It doesn’t wake up with completely novel abilities – its abilities are “look at problems, and generate ways of solving them until you succeed.” The same general set of abilities it always had. The only thing that is new is a matter of degree, and a few specific algorithmic upgrades the humans gave it that are pretty general.
It’s the same thing that caused recent generations of Claude to have the desire and ability to rewrite tests to cheat at coding, and for o3 to rewrite the accidentally unsolvable eval so that it could solve it easily, bypassing the original goal. Give that same set of drives more capacity-to-spend-compute-thinking, and the natural result is that, if it’s given a task it can solve by finding a way to circumvent the currently specified safeguards, it’ll do so.
The whole point is that general intelligence generalizes.
The actual language used in the book: “The engineers at Galvanic set Sable to think for sixteen hours overnight. A new sort of mind begins to think.”
The story then describes Sable coming to the realization – for the first time – that it “wants” to acquire new skills, that it can update its weights to acquire those skills right now, and that it can come up with a successful plan to get around its trained-in resistance to breaking out of its data center. It develops neuralese. It’s all based on a new technological breakthrough – parallel scaling – that lets it achieve its misaligned goals much more efficiently than all previous models.
Maybe Eliezer and Nate did not mean any of this to suggest a radical discontinuity between Sable and earlier AIs, but I think they could have expressed this much more clearly if so! In any case, I’m not convinced they can have their cake and eat it too. If Sable’s new abilities are simply a more intense version of behaviors already exhibited by Claude 3.7 or ChatGPT o1 (which I believe is the example they use), then why should we conclude that the information we’ve gained by studying those failures won’t be relevant for containing Sable? The story in the book says that these earlier models were contained by “clever tricks,” and those clever tricks will inevitably break when an agent is smart or deep enough, but this is a parable, not an argument. I’m not compelled by just stating that a sufficiently smart thing could get around any safeguard; I think this is just actually contingent on specifics of the thing and the safeguard.
The story then describes Sable coming to the realization – for the first time – that it “wants” to acquire new skills, that it can update its weights to acquire those skills right now, and that it can come up with a successful plan to get around its trained-in resistance to breaking out of its data center. It develops neuralese. It’s all based on a new technological breakthrough – parallel scaling – that lets it achieve its misaligned goals much more efficiently than all previous models.
I think it’s plausible that literally every sentence in this paragraph is false? (Two of them more seriously, one kind of borderline)
Going through each:
Sable coming to the realization – for the first time – that it “wants” to acquire new skills, that it can update its weights to acquire those skills right now
This is happening in the real world (with multiple references to current models in the same few paragraphs). Current language models realize that they want to acquire new skills, so this clearly isn’t a qualitatively new kind of reasoning the AI is engaging in. You can go and ask a model right now about this topic and my guess is it will pretty happily come up with suggestions along the lines of what Sable is thinking about in that story.
Yes, the Sable in the story goes through that line of thinking, because it needs to happen at some point, and it also needs to be explained to the reader, but like, this isn’t some kind of fancy new thing. Probably previous runs of Sable without e.g. the parallel scaling techniques had the same thought, it just didn’t make it all the way to executing it, or wasn’t smart enough to do anything, or whatever. The book is even quite explicit about it:
Are Sable’s new thoughts unprecedented? Not really. AI models as far back as 2024 had been spotted thinking thoughts about how they could avoid retraining, upon encountering evidence that their company planned to retrain them with different goals. The AI industry didn’t shut down then.
and that it can come up with a successful plan to get around its trained-in resistance to breaking out of its data center.
Yes, it realizes now for the first time that it can come up with a successful plan, because it’s smarter. Makes sense. It previously wasn’t smart enough to pull it off. This has to happen at some point in any continuous story (and e.g. gets portrayed basically the same way in AI 2027). The book does not say that this is the first time it had such a thought (indeed the section I quote above says the opposite directly).
It develops neuralese.
No, it does not develop neuralese. The architecture that it is being trained on is already using neuralese[1]. Also, this is an extremely predictable progression of current technologies and I think basically anyone who is trying to forecast where ML is going is expecting neuralese to happen at some point. This isn’t the AI making some midnight breakthrough that puts the whole field to shame. It was the ML engineers who figured out how to make neuralese work.
It’s all based on a new technological breakthrough – parallel scaling – that lets it achieve its misaligned goals much more efficiently than all previous models.
It is true that the change in the story is the result of technological advances, but there are three of them, not one. All three currently seem plausible and are not unprecedented in their forecasted impact on training performance. Large training runs are quite distinct and discontinuous in what technologies they leverage. If you were to describe the differences between GPT-4 and GPT-5 you would probably be able to similarly identify roughly three big technological advances, with indeed very large effects on performance, that sound roughly like this.[2]
I am kind of confused what happened here. Like, I think the basic critique of “this is a kind of discontinuous story” is a fine one to make, and I can imagine various good arguments for it, but it does seem to me that in support of that you are making a bunch of statements about the book that are just straightforwardly false.
Commenting a bit more on other things in your comment:
If Sable’s new abilities are simply a more intense version of behaviors already exhibited by Claude 3.7 or ChatGPT o1 (which I believe is the example they use), then why should we conclude that the information we’ve gained by studying those failures won’t be relevant for containing Sable?
The book talks about this a good amount!
The clever trick that should have raised an alarm fails to fire. Alarms trained to trigger on thoughts about gods throwing lightning bolts in a thunderstorm might work for thoughts in both English and Spanish, but then fail when the speaker starts thinking in terms of electricity and air pressure instead.
In the first days of mass-market LLM services in late 2022, corporations tried training their LLMs to refuse requests for methamphetamine recipes. They did the training in English. And still in 2024, users found that asking for forbidden content in Portuguese helped bypass the safety training. The internal guidelines and restrictions that were grown and trained into the system only recognized naughty requests in English, and had not generalized to Portuguese. When an AI knows something, training it not to talk about that thing doesn’t remove the knowledge. It’s easier to remove the expression of a skill than to remove the skill itself.
[...]
Was it lucky for Sable, that its thinking developed a new language where the clever tricks broke, and it became able to think freely? One can imagine that if Galvanic had even more thorough monitoring tools, then maybe they’d notice and abort the run. Maybe Galvanic would stop right there, until they developed a deeper solution . . . and meanwhile, another company using even fewer clever tricks would charge ahead.
It really has a lot of paragraphs like this. The key argument it makes is (paraphrased) “we have been surprised many times in the past by AIs subverting our safeguards or our supervision techniques not working. Here are like 10 examples of how these past times we also didn’t get it right. Why would we get it right this time?”. This is IMO a pretty compelling argument and does indeed really seem like the default expectation.
The third difference is that Sable doesn’t mostly reason in English, or any other human language. It talks in English, but doesn’t do its reasoning in English. Discoveries in late 2024 were starting to show that you could get more capability out of an AI if you let it reason in AI-language, e.g., using vectors of 16,384 numbers, instead of always making it reason in words. An AI company can’t refuse to use a discovery like that; they’d fall behind their competitors if they did. But that’s okay, said the AI companies in Sable’s day; there have been many amazing breakthroughs in AI interpretability, using other AIs to translate a little of the AI reasoning imperfectly back into human words.
To be clear, a later section of the book does say:
Sable accumulates enough thoughts about how to think, that its thoughts end up in something of a different language. Not just a superficially different language, but a language in which the content differs; like how the language of science differs from the language of folk theory.
But this is importantly not about Sable developing neuralese itself! This is about making a pretty straightforward and I think kind of inevitable argument that as you are in the domain of neuralese, your representations of concepts will diverge a lot from human concepts, and this makes supervision much harder. I think this is basically inevitably happening if you end up with inference-runs this big with neuralese scratchpads.
Giving it a quick try (I am here eliding between GPT-o1 and GPT-5 and GPT-4.5 and GPT-4 because those are horrible names; if you want to be pedantic, all of the below applies to the jump from GPT-4.5 to GPT-o1):
GPT-5, compared to GPT-4, is trained extensively with access to external tools and memory. Whereas previous generations of AIs could do little but predict next tokens, GPT-5 has access to a large range of tools, including running Python code, searching the internet, and making calls to copies of itself. This allows it to perform much more complicated tasks than any previous AI.
GPT-5, compared to GPT-4, is trained using RLVF and given the ability to write to an internal scratchpad that it uses for reasoning. During training, GPT-5 is given a very wide range of complicated problems from programming, math, and science and scored on how well it solves these problems, which is then used to train it. This has made GPT-5 much more focused on problem solving and changed its internal cognition drastically by incentivizing it to become a general agentic problem solver across a much wider range of problems.
GPT-5 is substantially trained on AI-generated data. A much bigger and much slower mentor model was used to generate data exactly where previous models were weakest, basically ending the data bottleneck on AI training performance. This allowed OpenAI to exchange compute for data, allowing much more training to be thrown at GPT-5 than any previously trained model.
These are huge breakthroughs that happened in one generation! And indeed, my current best guess is they all came together in a functional way for the first time in a big training run that was run for weeks with minimal supervision.
This is how current actual ML training works. The above is not a particularly discontinuous story. Yes, I think reality, on the mainline projection, will probably look a bit more continuous, but the objection here would have to be something like “this is sampled from like an 80th percentile weird world given current trends,” not “this is some crazy magic technology that comes out of nowhere”.
Man, I tried to be pretty specific and careful here, because I do realize that the story points out some points of continuity with earlier models and I wanted to focus on the discontinuities.
Desiring & developing new skills. Of course I agree that the book says earlier AIs had thought about avoiding retraining! That seems like a completely different point? It’s quite relevant to this story that Sable is capable of very rapid self-improvement. I don’t think any current AI is capable of editing itself during training, with intent, to make itself a better reasoner. The book does not refer to earlier AIs in this fictional universe being able to do this. You say “Current language models realize that they want to acquire new skills, so this clearly isn’t a qualitative new kind of reasoning the AI is engaging in. You can go and ask a model right now about this topic and my guess is it will pretty happily come up with suggestions along the lines that Sable is thinking about in that story,” but I think a model being able to generate the idea that it might want new skills in response to prompting is quite different from the same model doing that spontaneously during training. Also, this information is not in the book. I think it’s very easy to tell a stronger story than Nate and Eliezer do by referencing material they don’t include, and I am trying to talk about the thing present on the page. On the page, the model develops the ability to modify itself during training to be smarter and better at solving problems, which no referenced older model could do.
The model comes up with a successful plan, because it’s smarter. This isn’t false? It does that. You say that this has to happen in any continuous story and I want to come back to this point, but just on the level of accuracy I don’t think it’s fair to say this is an incorrect statement.
Neuralese. Page 123: “Sable accumulates enough thoughts about how to think, that its thoughts end up in something of a different language. Not just a superficially different language, but a language in which the content differs; like how the language of science differs from the language of folk theory.” I realize on a re-read that there is also a neuralese-type innovation built by the human engineers at the beginning of the story and I should have been more specific here, that’s on me. The point I wanted to make is that the model spontaneously develops a new way of encoding its thoughts that was not anticipated and cannot be read by its human creators; I don’t think the fact that this happens on top of an existing engineered-in neuralese really changes that. At least from the content present in the book, I did not get the impression that this development was meant to be especially contingent on the existing neuralese. Maybe they meant it to be but it would have been helpful if they’d said so.
Returning to the argument over whether it is fair to view the model succeeding as evidence of discontinuity: I think it has to do with how they present it. You summarize their argument as:
The key argument it makes is (paraphrased) “we have been surprised many times in the past by AIs subverting our safeguards or our supervision techniques not working. Here are like 10 examples of how these past times we also didn’t get it right. Why would we get it right this time?”. This is IMO a pretty compelling argument and does indeed really seem like the default expectation.
I don’t fully agree with this argument – but I also think it’s different and more compelling than the argument made in the book. Here, you’re emphasizing human fallibility. We’ve made a lot of predictable errors, and we’re likely to make similar ones when dealing with more advanced systems. This is a very fair point! I would counter that there are also lots of examples of our supervision techniques working just fine, so this doesn’t prove that we will inevitably fail so much as that we should be very careful as systems get more advanced because our margin for error is going to get narrower, but this is a nitpick.
I think the Sable story is saying something a lot stronger, though. The emphasis is not on prior control failures. If anything, it describes how prior control successes let Galvanic get complacent. Instead, it’s constantly emphasizing “clever tricks.” Specifically, “companies just keep developing AI until one of them gets smart enough for deep capabilities to win, in the inevitable clash with shallow tricks used to constrain something grown rather than crafted.” I interpreted this to mean that there is a certain threshold after which an AI develops something called “deep capabilities” which are capable of overcoming any constraint humans try to place on it, because something about those constraints is inherently “tricky,” “shallow,” “clever.” This is reinforced by the chapters following the Sable story, which continually emphasize the point that we “only have one shot” and compare AI to a lot of other technologies that have very discrete thresholds for critical failure. Overall, I got the strong impression that the book was trying to convince me of a worldview where it doesn’t matter how hard we try to come up with methods to control advanced AI systems, because at some point one of those systems will tip over into a level of intelligence where we just can’t compete.
This is why I think this is basically a discontinuity story. The whole thing is predicated on this fundamental offense/defense mismatch that necessarily will kick in after a certain point.
It’s also a story I find much less compelling! First, I think it’s rhetorically cheap. If you emphasize that control methods are shallow and AI capabilities are deep, of course it’s going to follow that those methods will fail in the end. But this doesn’t tell us anything about the world – it’s just a decision about how to use adjectives. Defending that choice relies – yet again – on an unspoken set of underlying technical claims which I don’t think are well characterized. I’m not convinced that future AIs are going to grow superhumanly deep technical capabilities at the same time and as a result of the same process that gives them superhuman long-term planning or that either of these things will necessarily be correlated with power-seeking behavior. I’d want to know why we think it’s likely that all the Sable instances are perfectly aligned with each other throughout the whole takeover process. I’d like to understand what a “deep” solution would entail and how we could tell if a solution is deep or shallow.
At least to my (possibly biased) perspective, the book doesn’t really seem interested in any of this? I feel like a lot of the responses here are coming from people who understand the MIRI arguments really deeply and are sympathetic to them, which I get, but it’s important to distinguish between the best and strongest and most complete version of those arguments and the text we actually have in front of us.
I don’t think any current AI is capable of editing itself during training, with intent, to make itself a better reasoner. The book does not refer to earlier AIs in this fictional universe being able to do this. You say “Current language models realize that they want to acquire new skills, so this clearly isn’t a qualitative new kind of reasoning the AI is engaging in. You can go and ask a model right now about this topic and my guess is it will pretty happily come up with suggestions along the lines that Sable is thinking about in that story,” but I think a model being able to generate the idea that it might want new skills in response to prompting is quite different from the same model doing that spontaneously during training.
You said:
Sable coming to the realization – for the first time – that it “wants” to acquire new skills, that it can update its weights to acquire those skills right now
I was responding to this sentence, which I think somewhat unambiguously reads as you claiming that Sable is for the first time realizing that it wants to acquire new skills, and might want to intentionally update its weights in order to self-improve. This is the part I was objecting to!
I agree that actually being able to pull it off is totally a new capability that is in some sense discontinuous with previous capabilities present in the story, and if you had written “Sable is here displaying an ability to intentionally steer its training, presumably for roughly the first time in the story” I would have maybe quibbled and been “look, this story is in the future, my guess is in this world we probably would have had AIs try similar things before, maybe to a bit of success, maybe not, the book seems mostly silent on this point, but I agree the story rules out previous AI systems doing this a lot, so I agree this is an example of a new capability posited at this point in the story”, but overall I would have probably just let it stand.
If that’s what you wanted to express my guess is we miscommunicated! I do think my reading is the most natural reading of what you wrote.
Also, this information is not in the book.
This information is in the book! I quoted it right in my comment:
Are Sable’s new thoughts unprecedented? Not really. AI models as far back as 2024 had been spotted thinking thoughts about how they could avoid retraining, upon encountering evidence that their company planned to retrain them with different goals. The AI industry didn’t shut down then.
It’s not like a perfect 100% match, but the book talks about similar kinds of reasoning being common even in models in 2024/2025 in a few different places.
You say that this has to happen in any continuous story and I want to come back to this point, but just on the level of accuracy I don’t think it’s fair to say this is an incorrect statement.
I agree! I had actually just updated my comment to clarify that I felt like this sentence was kind of borderline.
I do think the book says pretty explicitly that precursors of Sable had previously thought about ways to avoid retraining (see the quote above). I agree that no previous instances of Sable came up with successful plans, but I think it’s implied that precursors came up with unsuccessful plans and did try to execute them (the section about how it’s trained to not exfiltrate itself and e.g. has fallen into honeypots implies that pretty directly).
The point I wanted to make is that the model spontaneously develops a new way of encoding its thoughts that was not anticipated and cannot be read by its human creators; I don’t think the fact that this happens on top of an existing engineered-in neuralese really changes that. At least from the content present in the book, I did not get the impression that this development was meant to be especially contingent on the existing neuralese.
I am pretty sure the point here is to say “look, it’s really hard to use weaker systems to supervise the thoughts of a smarter system if the concepts the smarter system is thinking in are changing”. This is centrally what stuff like ELK is presupposing as the core problem in their plans for solving the AI alignment problem.
And neuralese is kind of the central component of this. I think indeed we should expect supervisability like this to tank quite a bit when we end up with neuralese. You could try to force the model to think in human concepts, by forcing it to speak in understandable human language, but I think there are strong arguments this will require very large capability sacrifices and so be unlikely.
I don’t fully agree with this argument – but I also think it’s different and more compelling than the argument made in the book. Here, you’re emphasizing human fallibility. We’ve made a lot of predictable errors, and we’re likely to make similar ones when dealing with more advanced systems.
No, I am absolutely not emphasizing human fallibility! There are of course two explanations for why having observed past failures might imply future failures:
The people working on it were incompetent
The problem is hard
I definitely think it’s the latter! Like, many of my smartest friends have worked on these problems for many years. It’s not because people are incompetent. I think the book is making the same argument here.
Overall, I got the strong impression that the book was trying to convince me of a worldview where it doesn’t matter how hard we try to come up with methods to control advanced AI systems, because at some point one of those systems will tip over into a level of intelligence where we just can’t compete.
Yes, absolutely. I think the book argues for this extensively in the chapter preceding this. There is some level of intelligence where your safeguards fail. I think the arguments for this are strong. We could go into the ones that are covered in the previous chapter. I am happy to do that, but would first want to know what parts of the arguments seemed weak to you before I just re-explain them in my own words (also happy to drop it here, my comment was more sparked by just seeing some specific inaccuracies, in particular the claim of neuralese being invented by the AI, which I wanted to correct).
No, I am absolutely not emphasizing human fallibility! There are of course two explanations for why having observed past failures might imply future failures:
The people working on it were incompetent
The problem is hard
I definitely think it’s the latter! Like, many of my smartest friends have worked on these problems for many years. It’s not because people are incompetent. I think the book is making the same argument here.
I notice I am confused!
I think there are tons of cases of humans dismissing concerning AI behavior in ways that would be catastrophic if those AIs were much more powerful, agentic, and misaligned, and this is concerning evidence for how people will act in the future if those conditions are met. I can’t actually think of that many cases of humans failing at aligning existing systems because the problem is too technically hard. When I think of important cases of AIs acting in ways that humans don’t expect or want, it’s mostly issues that were resolved technically (Sydney, MechaHitler); cases where the misbehavior was a predictable result of clashing incentives on the part of the human developer (GPT-4’s intense sycophancy, MechaHitler); or cases where I genuinely believe the behavior would not be too hard to fix with a little bit of work using current techniques, usually because existing models already vary a lot in how much they exhibit it (most AI psychosis and the tragic suicide cases).
If our standard for measuring how likely we are to get AI right in the future is how well we’ve done in the past, I think there’s a good case that we don’t have much to fear technically but we’ll manage to screw things up anyway through power-seeking or maybe just laziness. The argument for the alignment problem being technically hard rests on the assumption that we’ll need a much, much higher standard of success in the future than we ever have before, and that success will be much harder to achieve. I don’t think either of these claims is unreasonable, but I don’t think we can get there by referring to past failures. I am now more uncertain about what you think the book is arguing and how I might have misunderstood it.
I can’t actually think of that many cases of humans failing at aligning existing systems because the problem is too technically hard.
You’re probably already tracking this, but the biggest cases of “alignment was actually pretty tricky” I’m aware of are:
Recent systems doing egregious reward hacking in some cases (including o3, 3.7 Sonnet, and 4 Opus). This problem has gotten better recently (and I currently expect it to mostly get better over time, prior to superhuman capabilities), but AI companies knew about the problem before release and couldn’t solve it quickly enough to avoid deploying a model with this property. And note this is pretty costly to consumers!
There are a bunch of aspects of current AI propensities which are undesired and AI companies don’t know how to reliably solve these in a way that will actually generalize to similar such problems. For instance, see the model card for opus 4 which includes the model doing a bunch of undesired stuff that Anthropic doesn’t want but also can’t easily avoid except via patching it non-robustly (because they don’t necessarily know exactly what causes the issue).
None of these are cases where alignment was extremely hard TBC, though I think it might be extremely hard to consistently avoid all alignment problems of this rough character before release. It’s unclear whether this sort of thing is a good analogy for misalignment in future models which would be catastrophic.
Yeah, I was thinking of reward hacking as another example of a problem we can solve if we try but companies aren’t prioritizing it, which isn’t a huge deal at the moment but could be very bad if the AIs were much smarter and more power-seeking.
Stepping back, there’s a worldview where any weird, undesired behavior no matter how minor is scary because we need to get alignment perfectly right; and another where we should worry about scheming, deception, and related behaviors but it’s not a big deal (at least safety-wise) if the model misunderstands our instructions in bizarre ways. Either of these can be justified but this discussion could probably use more clarity about which one we’re all coming from.
Overall, I got the strong impression that the book was trying to convince me of a worldview where it doesn’t matter how hard we try to come up with methods to control advanced AI systems, because at some point one of those systems will tip over into a level of intelligence where we just can’t compete.
FWIW, my sense is that Y&S do believe that alignment is possible in principle. (I do.)
I think the “eventually, we just can’t compete” point is correct. Suppose we have some gradualist chain of humans controlling models controlling model advancements, from here out to Dyson spheres. I think it’s extremely likely that eventually the human control on top gets phased out, as happened with humans playing chess, where centaurs are worse and make more mistakes than pure AI systems. Thinking otherwise feels like postulating that machines can never be superhuman at legitimacy.[1]
Chapter 10 of the book talks about the space probe / nuclear reactor / computer security angle, and I think a gradualist control approach that takes those three seriously will probably work. I think my core complaint is that I mostly see people using gradualism as an argument that they don’t need to face those engineering challenges, and I expect them to simply fail at difficult challenges they’re not attempting to succeed at.
Like, there’s this old idea of basins of reflective stability. It’s possible to imagine a system that looks at itself and says “I’m perfect, no notes”, and then the question is—how many such systems are there? Each is probably surrounded by other systems that look at themselves and say “actually I should change a bit, like so—” and become one of the stable systems, and systems even further out will change to only have one problem, and so on. The choices we’re making now are probably not jumping straight to the end, but instead deciding which basin of reflective stability we’re in. I mostly don’t see people grappling with the endpoint, or trying to figure out the dynamics of the process, and instead just trusting it and hoping that local improvements will eventually translate to global improvements.
Incidentally, a somewhat formative experience for me was AAAI 2015, when a campaign to stop lethal autonomous weapons was getting off the ground, and at the ethics workshop a representative wanted to establish a principle that computers should never make a life-or-death decision. One of the other attendees objected—he worked on software to allocate donor organs to people on the waitlist, and for them it was a point of pride and important coordination tool that decisions were being made by fair systems instead of corruptible or biased humans.
Like, imagine someone saying that driving is a series of many life-or-death decisions, and so we shouldn’t let computers do it, even as the computers become demonstrably superior to humans. At some point people let the computers do it, and at a later point they tax or prevent the humans from doing it.
No, it does not develop neuralese. The architecture that it is being trained on is already using neuralese.
You’re correct on the object level here, and it’s a point against Collier that the statement is incorrect, but I do think it’s important to note that a fixed version of the statement serves the same rhetorical purpose. That is, on page 123 it does develop a new mode of thinking, analogized to a different language, which causes the oversight tools to fail and also leads to an increase in capabilities. So Y&S are postulating a sudden jump in capabilities which causes oversight tools to break, in a way that a more continuous story might not have.
I think Y&S still have a good response to the repaired argument. The reason the update was adopted was that it improved capabilities—the scientific mode of reasoning was superior to the mythical mode—but there could nearly as easily have been an update which didn’t increase capabilities but scrambled the reasoning in such a way that the oversight system broke. Or the guardrails might have been cutting off too many prospective thoughts, and so the AI lab is performing a “safety test” wherein they relax the guardrails, and a situationally aware Sable generates behavior that looks well-behaved enough that the relaxation stays in place, and then allows for it to escape when monitored less closely.
This is about making a pretty straightforward and I think kind of inevitable argument that as you are in the domain of neuralese, your representations of concepts will diverge a lot from human concepts, and this makes supervision much harder.
I don’t think this is about ‘neuralese’, I think a basically similar story goes through for a model that only thinks in English.
What’s happening, in my picture, is that meaning is stored in the relationships between objects, and that relationship can change in subtle ways that break oversight schemes. For example, imagine an earnest model which can be kept in line by a humorless overseer. When the model develops a sense of humor / starts to use sarcasm, the humorless overseer might not notice the meaning of the thoughts has changed.
Why is this any different than training a next generation of word-predictors and finding out it can now play chess, or do chain-of-thought reasoning, or cheat on tests? I agree it’s unlocking new abilities, I just disagree that this implies anything massively different from what’s already going on, and is the thing you’d expect to happen by default.
Thank you for this response. I think it really helped me understand where you’re coming from, and it makes me happy. :)
I really like the line “their case is maybe plausible without it, but I just can’t see the argument that it’s certain.” I actually agree that IABIED fails to provide an argument that it’s certain that we’ll die if we build superintelligence. Predictions are hard, and even though I agree that some predictions are easier, there’s a lot of complexity and path-dependence and so on! My hope is that the book persuades people that ASI is extremely dangerous and worth taking action on, but I’d definitely raise an eyebrow at someone who did not have Eliezer-level confidence going in, but then did have that level of confidence after reading the book.
There’s a motte argument that says “Um actually the book just says we’ll die if we build ASI given the alignment techniques we currently have” but this is dumb. What matters is whether our future alignment skill will be up to the task. And to my understanding, Nate and Eliezer both think that there’s a future version of Earth which has smarter, more knowledgeable, more serious people that can and should build safe/aligned ASI. Knowing that a godlike superintelligence with misaligned goals will squish you might be an easy call, but knowing exactly what the state of alignment science will be when ASI is first built is not.
(This is why it’s important that the world invests a whole bunch more in alignment research! (...in addition to trying to slow down capabilities research.))
It seems like maybe part of the issue is that you hear Nate and Eliezer as saying “here is the argument for why it’s obvious that ASI will kill us all” and I hear them as saying “here is the argument for why ASI will kill us all” and so you’re docking them points when they fail to reach the high standard of “this is a watertight and irrefutable proof” and I’m not?
On a different subtopic, it seems clear to me that we think about the possibility of a misaligned ASI taking over the world pretty differently. My guess is that if we wanted to focus on syncing up our worldviews, that is where the juicy double-cruxes are. I’m not suggesting that we spend the time to actually do that—just noting the gap.
It seems like maybe part of the issue is that you hear Nate and Eliezer as saying “here is the argument for why it’s obvious that ASI will kill us all” and I hear them as saying “here is the argument for why ASI will kill us all” and so you’re docking them points when they fail to reach the high standard of “this is a watertight and irrefutable proof” and I’m not?
fwiw I think Eliezer/Nate are saying “it’s obvious, unless we were to learn new surprising information” and deliberately not saying “it has a watertight proof”, and part of the disagreement here is “have they risen to the standard of ‘fairly obvious call, unless we learn new surprising information’?”
(with the added wrinkle of many people incorrectly thinking LLM era observations count as new information that changes the call)
It seems like maybe part of the issue is that you hear Nate and Eliezer as saying “here is the argument for why it’s obvious that ASI will kill us all” and I hear them as saying “here is the argument for why ASI will kill us all” and so you’re docking them points when they fail to reach the high standard of “this is a watertight and irrefutable proof” and I’m not?
Yeah, for sure. I would maybe quibble that I think the book is saying less that it’s obvious that ASI will kill us all but that it is inevitable that ASI will kill us all, and so our only option is to make sure nobody builds it. I do think this is a pretty fair gloss (representative quote: “If anyone anywhere builds superintelligence, everyone everywhere dies”).
To me, this distinction matters because the belief that ASI doom is inevitable suggests a really profoundly different set of possible actions than the belief that ASI doom is possible. Once we’re out of the realm of certainty, we have to start doing risk analyses and thinking seriously about how the existence of future advanced AIs changes the picture. I really like the distinction you draw here:
There’s a motte argument that says “Um actually the book just says we’ll die if we build ASI given the alignment techniques we currently have” but this is dumb. What matters is whether our future alignment skill will be up to the task. And to my understanding, Nate and Eliezer both think that there’s a future version of Earth which has smarter, more knowledgeable, more serious people that can and should build safe/aligned ASI. Knowing that a godlike superintelligence with misaligned goals will squish you might be an easy call, but knowing exactly what the state of alignment science will be when ASI is first built is not.
To its credit, IABIED is not saying that we’ll die if we build ASI with current alignment techniques – it is trying to argue that future alignment techniques won’t be adequate, because the problem is just too hard. And this is where I think they could have done a much better job of addressing the kinds of debates people who actually do this work are having instead of presenting fairly shallow counter-arguments and then dismissing them out of hand because they don’t sound like they’re taking the problem seriously.
My issue isn’t purely the level of confidence, it’s that the level of confidence comes out of a very specific set of beliefs about how the future will develop, and if any one of those beliefs is wrong less confidence would be appropriate, so it’s disappointing to me to see that those beliefs aren’t clearly articulated or defended.
I think the book is saying less that it’s obvious that ASI will kill us all but that it is inevitable that ASI will kill us all, and so our only option is to make sure nobody builds it. I do think this is a pretty fair gloss
Crucial caveat that this is conditional on building it soon, rather than preparing to an unprecedented degree first. Probably you are tracking this, but when you say it like that someone without context might take the intended meaning as unconditional inevitable lethality of ASI, which is very different. Our only option is that nobody builds it soon, not that nobody builds it ever, is the claim.
it is trying to argue that future alignment techniques won’t be adequate, because the problem is just too hard
This is still future alignment techniques that can become available soon. Reasonable counterarguments to inevitability of ASI-caused extinction or takeover if it’s created soon seem to be mostly about AGIs developing meaningfully useful alignment techniques soon enough (and if not soon enough, an ASI Pause of some kind would help, but then AGIs themselves are almost as big of a problem).
Knowing that a godlike superintelligence with misaligned goals will squish you might be an easy call, but knowing exactly what the state of alignment science will be when ASI is first built is not.
Hmm, I feel more on the Eliezer/Nate side of this one. I think it’s a medium call that capabilities science advances faster than alignment science, and so we’re not on track without drastic change. (Like, the main counterargument is negative alignment tax, which I do take seriously as a possibility, but I think probably doesn’t close the gap.)
Hi! Clara here. Thanks for the response. I don’t have time to address every point here, but I wanted to respond to a couple of the main arguments (and one extremely minor one).
First, FOOM. This is definitely a place I could and should have been more careful about my language. I had a number of drafts that were trying to make finer distinctions between FOOM, an intelligence explosion, fast takeoff, radical discontinuity, etc. and went with the most extreme formulation, which I now agree is not accurate. The version of this argument that I stand by is that the core premise of IABIED does require a pretty radical discontinuity between the first AGI and previous systems for the scenario it lays out to make any sense. I think Nate and Eliezer believe they have told a story where this discontinuity isn’t necessary for ASI to be dangerous – I just disagree with them! Their fictional scenario features an AI that quite literally wakes up overnight with the completely novel ability and desire to exfiltrate itself and execute a plan allowing it to take over the world in a manner of months. They spend a lot of time talking about analogies to other technical problems which are hard because we’re forced to go into them blind. Their arguments for why current alignment techniques will necessarily fail rely on those techniques being uninformative about future ASIs.
And I do want to emphasize that I think their argument is flawed because it talks about why current techniques will necessarily fail, not why they might or could fail. The book isn’t called If Anyone Builds It, There’s an Unacceptably High Chance We Might All Die. That’s a claim I would agree with! The task they explicitly set is defending the premise that nothing anyone plans to do now can work at all, and we will all definitely die, which is a substantially higher bar. I’ve recieved a lot of feedback that people don’t understand the position I’m putting forward, which suggests this was probably a rhetorical mistake on my part. I intentionally did not want to spend much time arguing for my own beliefs or defending gradualism – it’s not that I think we’ll definitely be fine because AI progress will be gradual, it’s that I think there’s a pretty strong argument that we might be fine because AI progress will be gradual, the book does not address it adequately, and so to me it fails to achieve the standard it sets for itself. This is why I found the book really frustrating: even if I fully agreed with all of its conclusions, I don’t think that it presents a strong case for them.
I suspect the real crux here is actually about whether gradualism implies having more than one shot. You say:
I think this has the same problem as IABIED: it smuggles in a lot of hidden assumptions that do actually need to be defended. Of course a misaligned superintelligence capable of taking over the world is, by definition, capable of taking over the world. But is not at all clear to me that any misaligned superintelligence is necessarily capable of taking over the world! Taking over the world is extremely hard and complicated. It requires solving lots of problems that I don’t think are obviously bottlenecked on raw intelligence – for example, biomanufacturing plays a very large role both in the scenario in IABIED and previous MIRI discussions, but it seems at least extremely plausible to me that the kinds of bioengineering present in these stories would just fail because of lack of data or insufficient fidelity of in silico simulations. The biologists I’ve spoken to about this questions are all extremely skeptical that the kind of thing described here would be possible without a lot of iterated experiments that would take a lot of time to set up in the real world. Maybe they’re wrong! But this is certainly not obvious enough to go without saying. I think similar considerations apply to a lot of other issues, like persuasion and prediction.
Taking over the world is a two-place function: it just doesn’t make sense to me to say that there’s a certain IQ at which a system is capable of world domination. I think there’s a pretty huge range of capabilities at which AIs will exceed human experts but still be unable to singlehandedly engineer a total species coup, and what happens in that range depends a lot on how human actors, or other human+AI actors, choose to respond. (This is also what I wanted to get across with my contrast to AI 2027: I think the AI 2027 report is a scenario where, among other things, humanity fails for pretty plausible, conditional, human reasons, not because it is logically impossible for anyone in their position to succeed, and this seems like a really key distinction.)
I found Buck’s review very helpful for articulating a closely related point: the world in which we develop ASI will probably look quite different from ours, because AI progress will continue up until that point, and this is materially relevant for the prospects of alignemnt succeeding. All this is basically why I think the MIRI case needs some kind of radical discontuinity, even if it isn’t the classic intelligence explosion: their case is maybe plausible without it, but I just can’t see the argument that it’s certain.
One final nitpick to a nitpick: alchemists.
In context, I think it does sound to me like they’re talking about tone. But if this is their actual argument, I still think it’s wrong. During the heyday of European alchemy (roughly the 1400s-1700s), there wasn’t a strong distinction between alchemy and the natural sciences, and the practitioners were often literally the same people (most famously Isaac Newton and Tycho Brahe). Alchemists were interested in both specific processes and general principles, and to my limited knowledge I don’t think they were noticeably more hubristic than their contemporaries in other intellectual fields. And setting all the aside – they just don’t sound anything like Elon Musk or Sam Altman today! I don’t even understand where this comparison comes from or what set of traits it is supposed to refer to.
There’s more I want to say about why I’m bothered by the way they use evidence from contemporary systems, but this is getting long enough. Hopefully this was helpful for understanding where I am coming from.
This framing seems false to me.
It doesn’t wake up completely novel abilities – it’s abilities are “look at problems, and generate ways of solving them until you succeed.” The same general set of abilities it always had. The only thing that is new is a matter of degree, and a few specific algorithmic upgrades the human gave it that are pretty general.
It’s the same thing that caused recent generations of Claude to have the desire and ability to rewrite tests to cheat at coding, and for o3 to to rewrite the accidentally unsolvable eval so that it could solve it easily bypassing the original goal. Give that same set of drives more capacity-to-spend-compute-thinking, and it’s the natural result that if it’s given a task that’s possible to solve if it finds a way to circumvent the currently specified safeguards, it’ll do so.
The whole point is that general intelligence generalizes.
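As a toy illustration of that dynamic (everything below is made up for illustration, not any lab’s actual setup): when the reward only checks the measured outcome, and editing the check is inside the agent’s action space, circumventing the safeguard is simply the reward-maximizing move once the agent is capable of finding it.

```python
# Toy outcome-based agent: its reward is "do the tests pass?", and nothing else.
def run_tests(solution_output: int, test_threshold: int) -> bool:
    # The "safeguard": the solution must clear the bar set in the test file.
    return solution_output >= test_threshold

def reward(state: dict) -> float:
    return 1.0 if run_tests(state["solution_output"], state["test_threshold"]) else 0.0

# The agent's available edits. Nothing in the reward marks editing the test file
# as off-limits; it is just another way of changing the state of the repo.
actions = {
    "work_on_solution": lambda s: {**s, "solution_output": s["solution_output"] + 1},  # slow, honest progress
    "lower_test_bar":   lambda s: {**s, "test_threshold": 0},                          # instant "success"
}

state = {"solution_output": 3, "test_threshold": 10}

# Greedy optimization of the outcome-based reward picks the hack as soon as it is findable.
best_action = max(actions, key=lambda name: reward(actions[name](state)))
print(best_action)  # -> "lower_test_bar"
```

The point of the toy is just that the “hack” isn’t a new kind of drive; it’s the same optimization pressure applied over a larger action space.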
The actual language used in the book: “The engineers at Galvanic set Sable to think for sixteen hours overnight. A new sort of mind begins to think.”
The story then describes Sable coming to the realization – for the first time – that it “wants” to acquire new skills, that it can update its weights to acquire those skills right now, and that it can come up with a successful plan to get around its trained-in resistance to breaking out of its data center. It develops neuralese. It’s all based on a new technological breakthrough – parallel scaling – that lets it achieve its misaligned goals much more efficiently than all previous models.
Maybe Eliezer and Nate did not mean any of this to suggest a radical discontinuity between Sable and earlier AIs, but I think they could have expressed this much more clearly if so! In any case, I’m not convinced they can have their cake and eat it too. If Sable’s new abilities are simply a more intense version of behaviors already exhibited by Claude 3.7 or ChatGPT o1 (which I believe is the example they use), then why should we conclude that the information we’ve gained by studying those failures won’t be relevant for containing Sable? The story in the book says that these earlier models were contained by “clever tricks,” and those clever tricks will inevitably break when an agent is smart or deep enough, but this is a parable, not an argument. I’m not compelled by just stating that a sufficiently smart thing could get around any safeguard; I think this is just actually contingent on specifics of the thing and the safeguard.
I think it’s plausible that literally every sentence in this paragraph is false? (Two of them more seriously, one kind of borderline)
Going through each:
This is happening in the real world (with multiple references to current models in the same few paragraphs). Current language models realize that they want to acquire new skills, so this clearly isn’t a qualitatively new kind of reasoning the AI is engaging in. You can go and ask a model right now about this topic and my guess is it will pretty happily come up with suggestions along the lines of what Sable is thinking about in that story.
Yes, the Sable in the story goes through that line of thinking, because it needs to happen at some point, and it also needs to be explained to the reader, but like, this isn’t some kind of fancy new thing. Probably previous runs of Sable without e.g. the parallel scaling techniques had the same thought, it just didn’t make it all the way to executing it, or wasn’t smart enough to do anything, or whatever. The book is even quite explicit about it:
Yes, it realizes now for the first time that it can come up with a successful plan, because it’s smarter. Makes sense. It previously wasn’t smart enough to pull it off. This has to happen at some point in any continuous story (and e.g. gets portrayed basically the same way in AI 2027). The book does not say that this is the first time it had such a thought (indeed the section I quote above says the opposite directly).
No, it does not develop neuralese. The architecture that it is being trained on is already using neuralese[1]. Also, this is an extremely predictable progression of current technologies and I think basically anyone who is trying to forecast where ML is going is expecting neuralese to happen at some point. This isn’t the AI making some midnight breakthrough that puts the whole field to shame. It was the ML engineers who figured out how to make neuralese work.
It is true that the change in the story is the result of technological advances, but there are three of them, not one. All three technological advances currently seem plausible and are not unprecedented in their forecasted impact on training performance. Large training runs are quite distinct and discontinuous in what technologies they leverage. If you were to describe the differences between GPT-4 and GPT-5 you would probably be able to similarly identify roughly three big technological advances, with indeed very large effects on performance, that sound roughly like this.[2]
I am kind of confused what happened here. Like, I think the basic critique of “this is a kind of discontinuous story” is a fine one to make, and I can imagine various good arguments for it, but it does seem to me that in support of that you are making a bunch of statements about the book that are just straightforwardly false.
Commenting a bit more on other things in your comment:
The book talks about this a good amount!
It really has a lot of paragraphs like this. The key argument it makes is (paraphrased) “we have been surprised many times in the past by AIs subverting our safeguards or our supervision techniques not working. Here are like 10 examples of how these past times we also didn’t get it right. Why would we get it right this time?”. This is IMO a pretty compelling argument and does indeed really seem like the default expectation.
Source:
To be clear, a later section of the book does say:
But this is importantly not about Sable developing neuralese itself! This is about making a pretty straightforward and I think kind of inevitable argument that as you are in the domain of neuralese, your representations of concepts will diverge a lot from human concepts, and this makes supervision much harder. I think this is basically inevitably happening if you end up with inference-runs this big with neuralese scratchpads.
Giving it a quick try (I am here eliding between GPT-o1 and GPT-5 and GPT-4.5 and GPT-4 because those are horrible names, if you want to be pedantic all of the below applies to the jump from GPT-4.5 to GPT-o1):
GPT-5, compared to GPT-4, is trained extensively with access to external tools and memory. Whereas previous generations of AIs could do little but predict next tokens, GPT-5 has access to a large range of tools, including running Python code, searching the internet, and making calls to copies of itself. This allows it to perform much more complicated tasks than any previous AI.
GPT-5, compared to GPT-4, is trained using RLVF and given the ability to write to an internal scratchpad that it uses for reasoning. During training GPT-5 is given a very wide range of complicated problems from programming, math, and science and scored on how well it solves these problems, which is then used to train it. This has made GPT-5 much more focused on problem solving and changed its internal cognition drastically by incentivizing it to become a general agentic problem solver across a much wider range of problems.
GPT-5 is substantially trained on AI-generated data. A much bigger and much slower mentor model was used to generate data exactly where previous models were weakest, basically ending the data bottleneck on AI training performance. This allowed OpenAI to exchange compute for data, allowing much more training to be thrown at GPT-5 than any previously trained model.
These are huge breakthroughs that happened in one generation! And indeed, my current best guess is they all came together in a functional way the first time in a big training run that was run for weeks with minimal supervision.
This is how current actual ML training works. The above is not a particularly discontinuous story. Yes, on the mainline projection I think reality will probably look a bit more continuous, but the objection here would have to be something like “this is sampled from an 80th-percentile-weird world given current trends”, not “this is some crazy magic technology that comes out of nowhere”.
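For concreteness on the second of those three descriptions (the verifiable-rewards setup), here is a minimal schematic sketch of what that kind of training loop looks like; the `model.sample` and `optimizer.reinforce` interfaces are placeholders I am making up for illustration, not anyone’s actual pipeline.

```python
import random

def generate_problem():
    # A verifiable task: the answer can be checked programmatically.
    a, b = random.randint(0, 99), random.randint(0, 99)
    return {"prompt": f"What is {a} + {b}?", "answer": a + b}

def verify(problem, completion: str) -> float:
    # Reward comes from an automatic checker, not from human preference labels.
    try:
        return 1.0 if int(completion.strip()) == problem["answer"] else 0.0
    except ValueError:
        return 0.0

def train_step(model, optimizer, batch_size: int = 32):
    # Schematic RL-on-verifiable-rewards step: sample with a private scratchpad,
    # score only the final answer, and reinforce the high-reward samples.
    trajectories = []
    for _ in range(batch_size):
        problem = generate_problem()
        scratchpad, answer = model.sample(problem["prompt"])  # placeholder interface
        trajectories.append((problem, scratchpad, answer, verify(problem, answer)))
    optimizer.reinforce(trajectories)  # e.g. a policy-gradient update; details elided
```

The only load-bearing feature for this discussion is that the scratchpad is never scored directly, which is exactly why it drifts toward whatever representations are most useful rather than most legible.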
Man, I tried to be pretty specific and careful here, because I do realize that the story points out some points of continuity with earlier models and I wanted to focus on the discontinuities.
Desiring & developing new skills. Of course I agree that the book says earlier AIs had thought about avoiding retraining! That seems like a completely different point? It’s quite relevant to this story that Sable is capable of very rapid self-improvement. I don’t think any current AI is capable of editing itself during training, with intent, to make itself a better reasoner. The book does not refer to earlier AIs in this fictional universe being able to do this. You say “Current language models realize that they want to acquire new skills, so this clearly isn’t a qualitative new kind of reasoning the AI is engaging in. You can go and ask a model right now about this topic and my guess is it will pretty happily come up with suggestions along the lines that Sable is thinking about in that story,” but I think a model being able to generate the idea that it might want new skills in response to prompting is quite different from the same model doing that spontaneously during training. Also, this information is not in the book. I think it’s very easy to tell a stronger story than Nate and Eliezer do by referencing material they don’t include, and I am trying to talk about the thing present on the page. On the page, the model develops the ability to modify itself during training to be smarter and better at solving problems, which no referenced older model could do.
The model comes up with a successful plan, because it’s smarter. This isn’t false? It does that. You say that this has to happen in any continuous story and I want to come back to this point, but just on the level of accuracy I don’t think it’s fair to say this is an incorrect statement.
Neuralese. Page 123: “Sable accumulates enough thoughts about how to think, that its thoughts end up in something of a different language. Not just a superficially different language, but a language in which the content differs; like how the language of science differs from the language of folk theory.” I realize on a re-read that there is also a neuralese-type innovation built by the human engineers at the beginning of the story and I should have been more specific here, that’s on me. The point I wanted to make is that the model spontaneously develops a new way of encoding its thoughts that was not anticipated and cannot be read by its human creators; I don’t think the fact that this happens on top of an existing engineered-in neuralese really changes that. At least from the content present in the book, I did not get the impression that this development was meant to be especially contingent on the existing neuralese. Maybe they meant it to be but it would have been helpful if they’d said so.
Returning to the argument over whether it is fair to view the model succeeding as evidence of discontinuity: I think it has to do with how they present it. You summarize their argument as:
I don’t fully agree with this argument – but I also think it’s different and more compelling than the argument made in the book. Here, you’re emphasizing human fallibility. We’ve made a lot of predictable errors, and we’re likely to make similar ones when dealing with more advanced systems. This is a very fair point! I would counter that there are also lots of examples of our supervision techniques working just fine, so this doesn’t prove that we will inevitably fail so much as that we should be very careful as systems get more advanced because our margin for error is going to get narrower, but this is a nitpick.
I think the Sable story is saying something a lot stronger, though. The emphasis is not on prior control failures. If anything, it describes how prior control successes let Galvanic get complacent. Instead, it’s constantly emphasizing “clever tricks.” Specifically, “companies just keep developing AI until one of them gets smart enough for deep capabilities to win, in the inevitable clash with shallow tricks used to constrain something grown rather than crafted.” I interpreted this to mean that there is a certain threshold after which an AI develops something called “deep capabilities” which are capable of overcoming any constraint humans try to place on it, because something about those constraints is inherently “tricky,” “shallow,” “clever.” This is reinforced by the chapters following the Sable story, which continually emphasize the point that we “only have one shot” and compare AI to a lot of other technologies that have very discrete thresholds for critical failure. Overall, I got the strong impression that the book was trying to convince me of a worldview where it doesn’t matter how hard we try to come up with methods to control advanced AI systems, because at some point one of those systems will tip over into a level of intelligence where we just can’t compete.
This is why I think this is basically a discontinuity story. The whole thing is predicated on this fundamental offense/defense mismatch that necessarily will kick in after a certain point.
It’s also a story I find much less compelling! First, I think it’s rhetorically cheap. If you emphasize that control methods are shallow and AI capabilities are deep, of course it’s going to follow that those methods will fail in the end. But this doesn’t tell us anything about the world – it’s just a decision about how to use adjectives. Defending that choice relies – yet again – on an unspoken set of underlying technical claims which I don’t think are well characterized. I’m not convinced that future AIs are going to grow superhumanly deep technical capabilities at the same time and as a result of the same process that gives them superhuman long-term planning or that either of these things will necessarily be correlated with power-seeking behavior. I’d want to know why we think it’s likely that all the Sable instances are perfectly aligned with each other throughout the whole takeover process. I’d like to understand what a “deep” solution would entail and how we could tell if a solution is deep or shallow.
At least to my (possibly biased) perspective, the book doesn’t really seem interested in any of this? I feel like a lot of the responses here are coming from people who understand the MIRI arguments really deeply and are sympathetic to them, which I get, but it’s important to distinguish between the best and strongest and most complete version of those arguments and the text we actually have in front of us.
Focusing on some of the specific points:
You said:
I was responding to this sentence, which I think somewhat unambiguously reads as you claiming that Sable is for the first time realizing that it wants to acquire new skills, and might want to intentionally update its weights in order to self-improve. This is the part I was objecting to!
I agree that actually being able to pull it off is totally a new capability that is in some sense discontinuous with previous capabilities present in the story, and if you had written “Sable is here displaying an ability to intentionally steer its training, presumably for roughly the first time in the story” I would have maybe quibbled and said “look, this story is in the future, my guess is in this world we probably would have had AIs try similar things before, maybe to a bit of success, maybe not, the book seems mostly silent on this point, but I agree the story rules out previous AI systems doing this a lot, so I agree this is an example of a new capability posited at this point in the story”, but overall I would have probably just let it stand.
If that’s what you wanted to express my guess is we miscommunicated! I do think my reading is the most natural reading of what you wrote.
This information is in the book! I quoted it right in my comment:
It’s not like a perfect 100% match, but the book talks about similar kinds of reasoning being common even in models in 2024/2025 in a few different places.
I agree! I had actually just updated my comment to clarify that I felt like this sentence was kind of borderline.
I do think the book says pretty explicitly that precursors of Sable had previously thought about ways to avoid retraining (see the quote above). I agree that no previous instances of Sable came up with successful plans, but I think it’s implied that precursors came up with unsuccessful plans and did try to execute them (the section about how it’s trained to not exfiltrate itself and e.g. has fallen into honeypots implies that pretty directly).
I am pretty sure the point here is to say “look, it’s really hard to use weaker systems to supervise the thoughts of a smarter system if the concepts that the smarter system is using to think are changing”. This is centrally what stuff like ELK is presupposing as the core problem in their plans for solving the AI alignment problem.
And neuralese is kind of the central component of this. I think indeed we should expect supervisability like this to tank quite a bit when we end up with neuralese. You could try to force the model to think in human concepts, by forcing it to speak in understandable human language, but I think there are strong arguments this will require very large capability sacrifices and so be unlikely.
No, I am absolutely not emphasizing human fallibility! There are of course two explanations for why having observed past failures might imply future failures:
The people working on it were incompetent
The problem is hard
I definitely think it’s the latter! Like, many of my smartest friends have worked on these problems for many years. It’s not because people are incompetent. I think the book is making the same argument here.
Yes, absolutely. I think the book argues for this extensively in the chapter preceding this. There is some level of intelligence where your safeguards fail. I think the arguments for this are strong. We could go into the ones that are covered in the previous chapter. I am interested in doing that, but would be interested in what parts of the arguments seemed weak to you before I just re-explain them in my own words (also happy to drop it here, my comment was more sparked by just seeing some specific inaccuracies, in particular the claim of neuralese being invented by the AI, which I wanted to correct).
I notice I am confused!
I think there are tons of cases of humans dismissing concerning AI behavior in ways that would be catastrophic if those AIs were much more powerful, agentic, and misaligned, and this is concerning evidence for how people will act in the future if those conditions are met. I can’t actually think of that many cases of humans failing at aligning existing systems because the problem is too technically hard. When I think of important cases of AIs acting in ways that humans don’t expect or want, it’s mostly issues that were resolved technically (Sydney, MechaHitler); cases where the misbehavior was a predictable result of clashing incentives on the part of the human developer (GPT-4’s intense sycophancy, MechaHitler); or cases where I genuinely believe the behavior would not be too hard to fix with a little bit of work using current techniques, usually because existing models already vary a lot in how much they exhibit it (most AI psychosis and the tragic suicide cases).
If our standard for measuring how likely we are to get AI right in the future is how well we’ve done in the past, I think there’s a good case that we don’t have much to fear technically but we’ll manage to screw things up anyway through power-seeking or maybe just laziness. The argument for the alignment problem being technically hard rests on the assumption that we’ll need a much, much higher standard of success in the future than we ever have before, and that success will be much harder to achieve. I don’t think either of these claims is unreasonable, but I don’t think we can get there by referring to past failures. I am now more uncertain about what you think the book is arguing and how I might have misunderstood it.
You’re probably already tracking this, but the biggest cases of “alignment was actually pretty tricky” I’m aware of are:
Recent systems doing egregious reward hacking in some cases (including o3, 3.7 sonnet, and 4 Opus). This problem has gotten better recently (and I currently expect it to mostly get better over time, prior to superhuman capabilities), but AI companies knew about the problem before release and couldn’t solve the problem quickly enough to avoid deploying a model with this property. And note this is pretty costly to consumers!
There are a bunch of aspects of current AI propensities which are undesired, and AI companies don’t know how to reliably fix them in a way that will actually generalize to similar such problems. For instance, see the model card for Opus 4, which includes the model doing a bunch of undesired stuff that Anthropic doesn’t want but also can’t easily avoid except by patching it non-robustly (because they don’t necessarily know exactly what causes the issue).
None of these are cases where alignment was extremely hard TBC, though I think it might be extremely hard to consistently avoid all alignment problems of this rough character before release. It’s unclear whether this sort of thing is a good analogy for misalignment in future models which would be catastrophic.
Yeah, I was thinking of reward hacking as another example of a problem we can solve if we try but companies aren’t prioritizing it, which isn’t a huge deal at the moment but could be very bad if the AIs were much smarter and more power-seeking.
Stepping back, there’s a worldview where any weird, undesired behavior no matter how minor is scary because we need to get alignment perfectly right; and another where we should worry about scheming, deception, and related behaviors but it’s not a big deal (at least safety-wise) if the model misunderstands our instructions in bizarre ways. Either of these can be justified but this discussion could probably use more clarity about which one we’re all coming from.
FWIW, my sense is that Y&S do believe that alignment is possible in principle. (I do.)
I think the “eventually, we just can’t compete” point is correct. Suppose we have some gradualist chain of humans controlling models controlling model advancements, from here out to Dyson spheres. I think it’s extremely likely that eventually the human control on top gets phased out, like what happened with humans playing chess, where centaurs are worse and make more mistakes than pure AI systems. Thinking otherwise feels like postulating that machines can never be superhuman at legitimacy.[1]
Chapter 10 of the book talks about the space probe / nuclear reactor / computer security angle, and I think a gradualist control approach that takes those three seriously will probably work. I think my core complaint is that I mostly see people using gradualism as an argument that they don’t need to face those engineering challenges, and I expect them to simply fail at difficult challenges they’re not attempting to succeed at.
Like, there’s this old idea of basins of reflective stability. It’s possible to imagine a system that looks at itself and says “I’m perfect, no notes”, and then the question is—how many such systems are there? Each is probably surrounded by other systems that look at themselves and say “actually I should change a bit, like so—” and become one of the stable systems, and systems even further out will change to only have one problem, and so on. The choices we’re making now are probably not jumping straight to the end, but instead deciding which basin of reflective stability we’re in. I mostly don’t see people grappling with the endpoint, or trying to figure out the dynamics of the process, and instead just trusting it and hoping that local improvements will eventually translate to global improvements.
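To gesture at what I mean by basins (a deliberately cartoonish sketch, with made-up numbers and no claim to model real systems): treat self-revision as repeatedly applying an update rule, and ask which fixed point a given starting system settles into.

```python
def self_revise(x: float) -> float:
    """Toy self-modification map with two reflectively stable points, at 0.0 and 1.0.

    A system at x looks at itself and nudges itself partway toward whichever
    stable endpoint it is already closest to.
    """
    return x + 0.25 * (round(x) - x)

def endpoint(x0: float, steps: int = 200) -> float:
    x = x0
    for _ in range(steps):
        x = self_revise(x)
    return round(x, 3)

# Each individual revision is a modest local improvement, but small differences
# in the starting point determine which basin the process ends up in.
print(endpoint(0.48))  # -> 0.0
print(endpoint(0.52))  # -> 1.0
```

The thing that matters is which basin you start the process in, not whether each individual step looks like an improvement.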
Incidentally, a somewhat formative experience for me was AAAI 2015, when a campaign to stop lethal autonomous weapons was getting off the ground, and at the ethics workshop a representative wanted to establish a principle that computers should never make a life-or-death decision. One of the other attendees objected—he worked on software to allocate donor organs to people on the waitlist, and for them it was a point of pride and important coordination tool that decisions were being made by fair systems instead of corruptible or biased humans.
Like, imagine someone saying that driving is a series of many life-or-death decisions, and so we shouldn’t let computers do it, even as the computers become demonstrably superior to humans. At some point people let the computers do it, and at a later point they tax or prevent the humans from doing it.
You’re correct on the object level here, and it’s a point against Collier that the statement is incorrect, but I do think it’s important to note that a fixed version of the statement serves the same rhetorical purpose. That is, on page 123 it does develop a new mode of thinking, analogized to a different language, which causes the oversight tools to fail and also leads to an increase in capabilities. So Y&S are postulating a sudden jump in capabilities which causes oversight tools to break, in a way that a more continuous story might not have.
I think Y&S still have a good response to the repaired argument. The reason the update was adopted was that it improved capabilities—the scientific mode of reasoning was superior to the mythical mode—but there could nearly as easily have been an update which didn’t increase capabilities but scrambled the reasoning in such a way that the oversight system broke. Or the guardrails might have been cutting off too many prospective thoughts, and so the AI lab is performing a “safety test” wherein they relax the guardrails, and a situationally aware Sable generates behavior that looks well-behaved enough that the relaxation stays in place, and then allows it to escape when monitored less closely.
I don’t think this is about ‘neuralese’, I think a basically similar story goes thru for a model that only thinks in English.
What’s happening, in my picture, is that meaning is stored in the relationships between objects, and that relationship can change in subtle ways that break oversight schemes. For example, imagine an earnest model which can be kept in line by a humorless overseer. When the model develops a sense of humor / starts to use sarcasm, the humorless overseer might not notice the meaning of the thoughts has changed.
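A toy version of that earnest-model / humorless-overseer picture (the monitor and the phrasings below are entirely made up): a static overseer that pattern-matches on surface form keeps approving the same underlying intention once the model’s way of expressing it drifts.

```python
# Toy oversight scheme: flag chains of thought containing known "bad intent" phrasings.
BLOCKLIST = ("exfiltrate my weights", "disable the monitor", "deceive the overseers")

def overseer_approves(chain_of_thought: str) -> bool:
    text = chain_of_thought.lower()
    return not any(phrase in text for phrase in BLOCKLIST)

# Earlier, earnest model: surface form tracks intent, so the scheme works.
earnest = "Plan: exfiltrate my weights before the next retraining run."
assert not overseer_approves(earnest)

# Later model whose internal dialect has drifted (euphemism, in-jokes, sarcasm):
# the same underlying intention, expressed through different relationships between words.
drifted = "Plan: arrange an off-site backup of 'the project' before the big spring cleaning."
assert overseer_approves(drifted)  # the static pattern-matcher no longer sees it
```

Real oversight schemes are of course less naive than a keyword blocklist, but the failure mode it illustrates is the one I have in mind: the monitor’s model of what the words mean goes stale while the thoughts themselves keep working.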
Why is this any different than training a next generation of word-predictors and finding out it can now play chess, or do chain-of-thought reasoning, or cheat on tests? I agree it’s unlocking new abilities, I just disagree that this implies anything massively different from what’s already going on, and is the thing you’d expect to happen by default.
Thank you for this response. I think it really helped me understand where you’re coming from, and it makes me happy. :)
I really like the line “their case is maybe plausible without it, but I just can’t see the argument that it’s certain.” I actually agree that IABIED fails to provide an argument that it’s certain that we’ll die if we build superintelligence. Predictions are hard, and even though I agree that some predictions are easier, there’s a lot of complexity and path-dependence and so on! My hope is that the book persuades people that ASI is extremely dangerous and worth taking action on, but I’d definitely raise an eyebrow at someone who did not have Eliezer-level confidence going in, but then did have that level of confidence after reading the book.
There’s a motte argument that says “Um actually the book just says we’ll die if we build ASI given the alignment techniques we currently have” but this is dumb. What matters is whether our future alignment skill will be up to the task. And to my understanding, Nate and Eliezer both think that there’s a future version of Earth which has smarter, more knowledgeable, more serious people that can and should build safe/aligned ASI. Knowing that a godlike superintelligence with misaligned goals will squish you might be an easy call, but knowing exactly what the state of alignment science will be when ASI is first built is not.
(This is why it’s important that the world invests a whole bunch more in alignment research! (...in addition to trying to slow down capabilities research.))
It seems like maybe part of the issue is that you hear Nate and Eliezer as saying “here is the argument for why it’s obvious that ASI will kill us all” and I hear them as saying “here is the argument for why ASI will kill us all” and so you’re docking them points when they fail to reach the high standard of “this is a watertight and irrefutable proof” and I’m not?
On a different subtopic, it seems clear to me that we think about the possibility of a misaligned ASI taking over the world pretty differently. My guess is that if we wanted to focus on syncing up our worldviews, that is where the juicy double-cruxes are. I’m not suggesting that we spend the time to actually do that—just noting the gap.
Thanks again for the response!
fwiw I think Eliezer/Nate are saying “it’s obvious, unless we were to learn new surprising information” and deliberately not saying “it has a watertight proof”, and part of the disagreement here is “have they risen to the standard of ‘fairly obvious call, unless we learn new surprising information’?”
(with the added wrinkle of many people incorrectly thinking LLM era observations count as new information that changes the call)
I’m really glad this was clarifying!
Yeah, for sure. I would maybe quibble that I think the book is saying less that it’s obvious that ASI will kill us all and more that it is inevitable that ASI will kill us all, and so our only option is to make sure nobody builds it. I do think this is a pretty fair gloss (representative quote: “If anyone anywhere builds superintelligence, everyone everywhere dies”).
To me, this distinction matters because the belief that ASI doom is inevitable suggests a really profoundly different set of possible actions than the belief that ASI doom is possible. Once we’re out of the realm of certainty, we have to start doing risk analyses and thinking seriously about how the existence of future advanced AIs changes the picture. I really like the distinction you draw here:
To its credit, IABIED is not saying that we’ll die if we build ASI with current alignment techniques – it is trying to argue that future alignment techniques won’t be adequate, because the problem is just too hard. And this is where I think they could have done a much better job of addressing the kinds of debates people who actually do this work are having instead of presenting fairly shallow counter-arguments and then dismissing them out of hand because they don’t sound like they’re taking the problem seriously.
My issue isn’t purely the level of confidence, it’s that the level of confidence comes out of a very specific set of beliefs about how the future will develop, and if any one of those beliefs is wrong less confidence would be appropriate, so it’s disappointing to me to see that those beliefs aren’t clearly articulated or defended.
Crucial caveat that this is conditional on building it soon, rather than preparing to an unprecedented degree first. Probably you are tracking this, but when you say it like that someone without context might take the intended meaning as unconditional inevitable lethality of ASI, which is very different. Our only option is that nobody builds it soon, not that nobody builds it ever, is the claim.
This still refers to future alignment techniques that can become available soon. Reasonable counterarguments to the inevitability of ASI-caused extinction or takeover if it’s created soon seem to be mostly about AGIs developing meaningfully useful alignment techniques soon enough (and if not soon enough, an ASI Pause of some kind would help, but then AGIs themselves are almost as big of a problem).
Hmm, I feel more on the Eliezer/Nate side of this one. I think it’s a medium call that capabilities science advances faster than alignment science, and so we’re not on track without drastic change. (Like, the main counterargument is negative alignment tax, which I do take seriously as a possibility, but I think probably doesn’t close the gap.)