I think Will MacAskill’s summary of the argument made in Chapter 4 of IABIED is inaccurate, and his criticisms don’t engage with the book version. Here’s how he summarises the argument:
The evolution analogy:
Illustrative quote: “To extend the [evolution] analogy to AI: [...] The link between what the AI was trained for and what it ends up caring about would be complicated, unpredictable to engineers in advance, and possibly not predictable in principle.”
Evolution is a fine analogy for ML if you want to give a layperson the gist of how AI training works, or to make the point that, off-distribution, you don’t automatically get what you trained for. It’s a bad analogy if you want to give a sense of alignment difficulty.
Y&S argue:
Humans inventing and consuming sucralose (or birth control) would have been unpredictable from evolution’s perspective, and is misaligned with the goal of maximising inclusive genetic fitness.
The goals of a superintelligence will be similarly unpredictable, and similarly misaligned.
On my reading, the argument goes more like this:[1]
The analogy to evolution (and a series of AI examples) is used to argue that there is a complicated relationship between training environment and preferences (in a single training run!), and that we don’t have a good understanding of that relationship. The book uses “complications” to refer to weird effects in the link between training environment and resulting preferences.
From this, the book concludes: Alignment is non-trivial, and shouldn’t be attempted by fools.[2]
Then it adds some premises:[3]
It’s difficult to spot some complications.
It’s difficult to patch some complications.
The book doesn’t explicitly draw a conclusion purely from the above premises, but I imagine this is sufficient to conclude some intermediate level of alignment difficulty, depending on the difficulty of spotting and patching the most troublesome complications.
The book adds:[4]
Some complications may only show up behaviourally after a lot of deliberation and learning, making them extremely difficult to detect and extremely difficult to patch.
Some complications only show up after reflection and self-correction.
Some complications only show up after AIs have built new AIs.
From this it concludes: It’s unrealistically difficult to remove all complications, and shouldn’t be attempted with anything like current levels of understanding.[5]
So MacAskill’s summary of the argument is inaccurate. It removes all of the supporting structure that makes the argument work, and pretends that the analogy was used by itself to support the strong conclusion.
He goes on to criticise the analogy by pointing at (true) dis-analogies between evolution and the entire process of building an AI:
But our training of AI is different to human evolution in ways that systematically point against reasons for pessimism.
The most basic disanalogy is that evolution wasn’t trying, in any meaningful sense, to produce beings that maximise inclusive genetic fitness in off-distribution environments. But we will be doing the equivalent of that!
That alone makes the evolution analogy of limited value. But there are some more substantive differences. Unlike evolution, AI developers can:
See the behaviour of the AI in a very wide range of diverse environments, including carefully curated and adversarially-selected environments.
Give more fine-grained and directed shaping of individual minds throughout training, rather than having to pick among different randomly-generated genomes that produce minds.
Use interpretability tools to peer inside the minds they are creating to better understand them (in at least limited, partial ways).
Choose to train away from minds that have a general desire to replicate or grow in power (unlike for evolution, where such desires are very intensely rewarded).
(And more!)
So these dis-analogies don’t directly engage with the argument. If they were directly engaging with the first part of the argument, they would be about the predictability of a single training run, rather than the total AI research process.
The dis-analogies could be reinterpreted as engaging with the other premises: 1 and 3 could be interpreted as claims that we do have (potentially yet to be invented) methods of detecting complications, and 2 and 4 could be interpreted as claims that we similarly have methods to patch complications. But the examples of unintended LLM weirdness given throughout the book are enough to show that at least current techniques aren’t working well for detection or patching of weirdness.
More importantly, the examples of particularly difficult complications are entirely ignored, despite their necessity for the conclusion.
[1] I’ve reordered it for logical clarity.
[2] From the book: “If all the complications were visible early, and had easy solutions, then we’d be saying that if any fool builds it, everyone dies, and that would be a different situation.”
[3] These are mentioned in the book (e.g. in footnote 2), and the first is briefly supported with examples. The second isn’t supported in this chapter.
[4] In the paragraphs following “So far, we’ve only touched on the sorts of complications that would arise in the preferences trained directly into an AI.”
[5] “Problems like this are why we say that if anyone builds it, everyone dies.”
So these dis-analogies don’t directly engage with the argument. If they were directly engaging with the first part of the argument, they would be about the predictability of a single training run, rather than the total AI research process.
This seems wrong to me. Which of the things that Will says couldn’t be considered to be about a single training process?
I don’t think that the MIRI book would hold up if you analyzed it with this level of persnicketiness–they were absolutely not precise at the level of distinguishing between the whole development process and single training runs. (Which is arguably fine–they were trying to write a popular book, not trying to persuade super high-context readers of anything!) So this complaint strikes me as somewhat of an isolated demand for rigor.
I’m not trying to debate or gotcha. I agree that if I tried to do adversarial nitpicking at IABIED I could make it sound equally bad. I found Will’s review convincing, in the sense that it intuitively snapped me into the worldview where the evolutionary analogy isn’t a good argument. I spent the day thinking about it, and I wrote out my own steelman of it that extrapolated details, and re-evaluated whether I thought the original argument was valid, and decided that yeah it still was. This exercise was partially motivated by you saying that your complaints were similar in another comment.
Then I went through and found the important differences between my steelman-Will-beliefs and my actual beliefs, the places where I thought it was locally making a mistake, wrote them down, and then turned that into this shortform. I framed it as a misrepresentation after re-reading chapter 4 to check how my argument matched up. Maybe this was a bad way to write it up. It definitely feels like he’s doing the opposite of steelmanning, not particularly trying to convey a good version of the argument in the book, or to understand the coherent worldview that produced it.
But it’s an honest guess that this is a thing Will is missing (how the evolution analogy should be scoped, and how the other premises are separate from it and also necessary). The guess was constructed without knowing Will or reading much of his other writing, so I admit it’s pretty likely to be wrong, but if so maybe someone will explain how.
But either way, I figured it was worth publishing this particular part of what I wrote today because of how often I hear people misunderstand the evolution analogy.
I feel like your title for this short-form post is unreasonably aggressive, given what you’re saying here.
I found your articulation of the structure of the book’s argument helpful and clarifying.
I’m planning to write something more about this at some point: I think a key issue here is that we aren’t making the kind of arguments where “local validity” is a reliable concept. No-one is trying to make proofs, they’re trying to make defeasible heuristic arguments. Suppose the book makes an argument of the form “Because of argument A, I believe conclusion X. You might have thought that B is a counterargument to A. But actually, because of argument C, B doesn’t work.” If Will thinks that argument C doesn’t work, I think it’s fine for him to summarize this as: “they make an argument mostly around A, which I don’t think suffices to establish X”.

You’re right, I edited it.

That makes sense about local validity.
Sorry, I intended “single training run” to refer to purely running SGD on a training set, as opposed to humans examining the result or intermediate results and making adjustments based on their observations. So at least 2 & 3, as they definitely involve human intervention.
Whatever. I don’t think that’s a very important difference, and I don’t think it’s fair to call Will’s argument a straw man based on it. I think a very small proportion of readers would confidently interpret the book’s argument the way you did.
You’re claiming that the book’s argument is only trying to apply to an extremely narrow definition of AI training. You describe it as SGD on a training set; I assume you intend that to also cover things like RL on diverse environments, like how R1 was trained. If that’s what the argument is about, it’s really important for the authors to explain how that argument connects to the broader notion of training used in practice, and I don’t remember this happening. I don’t remember them talking carefully about the still broader question of “what happens when you do get to examine results and intermediate results and make adjustments based on observations?”
The way that the analogy interacts with other assumptions seems crucial. I don’t mean to insult Will; if it helps, I also think there are a bunch of strawmen in IABIED. But I think most readers whose attention was drawn to the following quote would understand that the evolution analogy needs to be combined with the other things listed there to conclude that alignment is very difficult.
“If all the complications were visible early, and had easy solutions, then we’d be saying that if any fool builds it, everyone dies, and that would be a different situation. But when some of the problems stay out of sight? When some complications inevitably go unforeseen? When the AIs are grown rather than crafted, and no one understands what’s going on inside of them?”
..
If that’s what the argument is about, it’s really important for the authors to explain how that argument connects to the broader notion of training used in practice, and I don’t remember this happening. I don’t remember them talking carefully about the still broader question of “what happens when you do get to examine results and intermediate results and make adjustments based on observations?”
Neither do I, but this doesn’t seem really important for a non-researcher audience, conditional on the claims that weird goal errors are difficult to understand by examining behaviour, and that interventions to patch weird goal errors often don’t generalise well. If you buy those claims, then it’s easy to extrapolate what happens when you examine results and make adjustments based on them.
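To make the scoping distinction concrete, here’s a toy sketch (purely my own illustration, not anything from the book or from Will; the linear-regression setup and every name in it are made up): a “single training run” is the inner loop of pure SGD on a fixed training set, while the broader development process wraps it in an outer loop where results get examined and the training setup gets adjusted.

```python
# Purely illustrative toy: "single training run" (pure SGD on a fixed training
# set) vs. the broader development process (examine results, adjust, retrain).
# The regression setup and all names are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

def single_training_run(w, X, y, lr=0.01, steps=500):
    # Pure SGD on a fixed training set; no one inspects or intervenes mid-run.
    for _ in range(steps):
        i = rng.integers(len(X))
        grad = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of squared error on one example
        w = w - lr * grad
    return w

def development_process(w, X, y, rounds=3):
    # The broader loop: train, examine the resulting behaviour, patch the
    # training distribution, and train again.
    for _ in range(rounds):
        w = single_training_run(w, X, y)
        errors = (X @ w - y) ** 2
        worst = np.argsort(errors)[-10:]      # "examine behaviour": find worst cases
        X = np.vstack([X, X[worst]])          # "patch": upweight them in the data
        y = np.concatenate([y, y[worst]])
    return w

X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
w = development_process(np.zeros(3), X, y)
```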
A little bit late, but there are more reasons why I think the evolution analogy is particularly good and better than the selective breeding analogy.
Evolution basically ended up optimizing the brain such that it has desires that were instrumental to genetic fitness, so we end up with these instrumental sub-preferences for high-calorie food or sex. Then we go through a huge shift of the distribution or environment, from a very constrained hunter-gatherer society to a technologically advanced civilization. This isn’t just a random shift but a shift toward an environment with a much larger space of possible actions and outcomes, including options such as radically changing aspects of the environment. So naturally there are now many superior ways to satisfy our preferences than before.

For AI it’s the same thing: it will go from being the nice assistant in ChatGPT to having options such as taking over, killing us, or running its own technology. It’s essentially guaranteed that there will be better ways to satisfy its preferences without human oversight, out of the control of humans. Importantly, that isn’t a distributional shift you can test in any meaningful way. You could either try incremental stuff (giving the rebellious general one battalion at a time) or you could try to trick it into believing it can take over through some honeypot (imagine trying to test what a human would do if they were God-emperor of the galaxy; that would be insane and the subject wouldn’t believe the scenario). Both of these are going to fail.
The selective breeding story ignores the distributional shift at the end; it does not account for this being a particular type of distributional shift (from a low-action-space, immutable environment to a large-action-space, mutable environment), and it doesn’t account for the fact that we can’t test a distribution like being emperor.