A Review of Nina Panickssery’s Review of Scott Alexander’s Review of “If Anyone Builds It, Everyone Dies”

A review of Nina Panickssery’s review of Scott Alexander’s review of the book “If Anyone Builds It, Everyone Dies” (IABIED).

This essay is not my best work, but I just couldn’t resist. Thanks to Nina and others for comments/feedback.

I confess I mostly wrote this because I think a review of a review of a review is funny. But I also have a lot of disagreements with both Nina and the authors of IABIED. Nina’s review is the first time a skeptic has substantively argued against the book (because the book isn’t out yet and the authors haven’t given advance copies to many skeptics). I want the discourse around critiques of the book to be good. I want people to understand the real limitations of the authors’ arguments, not straw-man versions of them.

Although she frames her writing as a review of Scott’s review, Nina is clearly trying to provide an in-depth critique of the book itself. Unfortunately, it’s hard for her to avoid straw-manning, since she hasn’t read the book and can only go off of Scott’s review. I happen to have read IABIED, which gives me a leg up on writing about it.

It’s of course not Nina’s fault she hasn’t been given an advance copy of the book. I feel torn about whether or not it’s bad form to review a book she hasn’t read.[1]

(I may publish my own review which goes into my views when IABIED comes out.)

On Fictional Scenarios and Concrete Examples

Nina complains that part of the book gives a sci-fi story about how everything could go wrong, because she doesn’t like the idea of generalizing from fictional evidence. But I think it’s clearly very valuable to lay out a concrete scenario of AI disaster. The authors themselves spend several pages lamenting how fraught writing a scenario like this is. If they had refused to give a concrete scenario, it would have seemed like a cop-out, or an admission that their arguments rely on hand-waving and magic. I really like concrete scenarios and wish there were more of them, because they head off this obvious and common critique. They’re labor-intensive to write and hard to pull off, so I appreciate it whenever people try.

I also think talking through potential future scenarios is a good way to concretize and understand things. If someone says AI takeover could never happen, laying out even one story for how it could happen, out of a sea of possible futures, can build intuitions and help people locate where their understandings diverge. I think the scenario makes it much easier for me to talk about where I disagree with Nate and Eliezer.

Of course, concrete scenarios can also be misleading. You can hide a lot of sleight-of-hand by saying, “Well, naturally, this exact scenario is unlikely, but I had to pick a specific scenario, and any specific scenario is unlikely.” But I don’t really think the scenario in the book is guilty of doing this intentionally and on net I’m very glad it’s in the book.

I also think Nate and Eliezer do a good job communicating that their scenario is just one possibility among many, and that things are almost certainly not going to go exactly as they describe. They also go out of their way not to craft the story to be entertaining. Eliezer’s main issue with generalizing from fictional evidence is that fiction is designed to be a fun read, which comes at the cost of realism, and Nate and Eliezer try dutifully to avoid this.

On AI Goals and Alignment

Nina says the book lacks an explanation for why our inability to give AI specific goals will cause problems, but it seems pretty straightforward to me. If we can’t instill a specific goal, then the AI will end up with a goal that is not exactly what we intended, and a capable agent pursuing a subtly different goal will eventually be at odds with ours.

The authors do a fine job providing the standard arguments for why an AI that is not aligned with humanity’s goals might kill everyone (although I think they don’t spend enough time on why it would definitely kill literally everyone as opposed to only most people). Their arguments include:

  • Humans might destroy the AI, and the AI doesn’t want that

  • The AI is going to consume vast resources for its own purposes

I agree with Nina’s complaint that the authors don’t do a great job representing the counterarguments. There are stronger counterarguments out there, although they are not the ones that most people reading the book will have in mind.

On the Evolution Analogy

This is probably my biggest disagreement with Nina. She says humans are a successful species, but I think she’s conflating success in human terms with success in evolutionary terms (especially compared to other species; the bacteria are crushing us). I think that if humanity really set its sights on evolutionary success, we could be way, way more evolutionarily successful than we currently are (setting aside that evolution isn’t exactly optimizing for one coherent thing). If what we wanted was lots of copies of our literal DNA everywhere, we could probably start sticking it into bacteria or something and having it replicate.[2]

Human values don’t seem totally orthogonal to what evolution was optimizing for, and the outcomes so far are still fairly good from an evolutionary perspective. I wish Nate and Eliezer touched on that more. But we are still clearly somewhat misaligned.[3]

Nina also says, “Evolution is much less efficient and slower than any method used to train AI models and hasn’t finished. If you take a neural network in the middle of training it sure won’t be good at maximizing its objective function.” This doesn’t seem persuasive to me. Evolution will not succeed anywhere near as well as it would have if humanity were truly trying to maximize evolution’s objectives. And even if humanity went extinct, evolution would probably still not get what it “wants,” because it would eventually evolve another human-like species that would be misaligned in turn.

To extend Nina’s analogy, if training your AI causes a superintelligent model to arise that is misaligned with your values, then maybe your training is never going to finish.

Nina also talks about how the environment is constantly changing and humans are off-distribution from the ancestral environment. But I think this is part of the point the authors are trying to make. The future, once shaped by AIs, will at the very least be very off-distribution from where the AIs were trained. The authors argue a stronger claim: that there’s a distributional shift once the model is capable of taking over the world. Personally, I don’t think they defend this claim very well.

On Reward Functions and Model Behavior

Next, Nina argues that the fact that LLMs don’t directly encode your reward function makes them less likely to be misaligned, not more, the way IABIED implies. I think maybe she’s straw-manning the concerns here. She asks, “What would it mean for models to encode their reward functions without the context of training examples?” But nobody’s arguing against examples; they’re just saying it might be more reassuring if the architecture directly included the reward function in the model itself: for instance, if the model were at every turn searching over possible actions and choosing the one that maximizes that reward function (I sketch this below). Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.
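
To make that architecture concrete, here’s a minimal toy sketch of my own (it’s not from the book or from Nina’s review): an agent whose decision procedure literally contains its reward function and does an explicit argmax over candidate actions, rather than having the reward distilled into learned weights. The reward function and the action set here are made-up placeholders.

```python
from typing import Callable, Iterable

def reward(state: str, action: str) -> float:
    # Hypothetical hand-written reward function (a placeholder);
    # actually writing one down is itself an unsolved problem.
    return 1.0 if "helpful" in action else 0.0

def argmax_policy(state: str,
                  actions: Iterable[str],
                  reward_fn: Callable[[str, str], float]) -> str:
    # At every turn, explicitly search the candidate actions and pick
    # the one the reward function scores highest. The reward function
    # is a literal, inspectable component of the agent, not something
    # encoded opaquely in weights.
    return max(actions, key=lambda a: reward_fn(state, a))

# Toy usage with made-up state and actions:
print(argmax_policy("user asks a question",
                    ["ignore the user", "give a helpful answer"],
                    reward))
```

The point of the sketch is just that here the optimization target is a transparent part of the system you can read off directly; with LLMs, the training objective shapes the weights, and whatever objectives the resulting model actually pursues are only indirectly related to it.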

Nina claims it’s good that AIs learn by generalizing from examples to really understand what we want, but I think this misses part of the point. The main concern outlined in IABIED is not an outer alignment failure; it’s not that we won’t be able to articulate what we want. The concern is that even if we articulate what we want perfectly, and all of our examples demonstrate it well, there is no guarantee that the AI we get will actually optimize for what we specified.
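
As a loose, low-stakes analogy (my own construction, not an argument from the book): even in ordinary supervised learning, a model trained on perfectly labeled data can latch onto a shortcut feature that happens to correlate with the intended one during training, and then diverge once that correlation breaks. All the numbers below are made up for the toy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)             # labels: a "perfect" specification of what we want

intended = y + rng.normal(0, 1.0, n)  # the feature we *want* the model to use (noisy)
proxy = y.astype(float)               # a shortcut feature, perfectly correlated during training
X_train = np.column_stack([intended, proxy])

model = LogisticRegression().fit(X_train, y)  # the model leans on the clean shortcut

# Off-distribution: the shortcut no longer tracks the intended signal.
y_test = rng.integers(0, 2, n)
X_test = np.column_stack([y_test + rng.normal(0, 1.0, n),
                          rng.integers(0, 2, n).astype(float)])

print("train accuracy:", model.score(X_train, y))                 # ~1.0
print("off-distribution accuracy:", model.score(X_test, y_test))  # much worse, near chance
```

The labels here were a flawless specification, yet the learned model ends up relying on a different feature than the one we intended (the “shortcut learning” phenomenon). Perfect examples don’t, by themselves, rule out this flavor of failure.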

Other Tidbits

Next, Nina complains that the Mink story is simplistic in how Mink “perfectly internalizes this goal of maximizing user chat engagement”. The authors call the Mink scenario a deliberately simple “fairytale”: they argue that even in this simple world things go poorly for humanity, and that adding more complexity won’t increase safety.

I think Nina has a different kind of complexity in mind, one the authors don’t touch on: it seems like she thinks real models won’t be so perfectly coherent and goal-directed. But Nina doesn’t really spell out why she believes this, and there are a lot of counterarguments. The main counterargument in the book is that there are strong incentives to train for that kind of coherence.

I agree with Nina and Scott about sharp left turns seeming less likely than Nate and Eliezer think.

I think that covers Nina’s important substantive claims where I disagree.

  1. ^

    In particular, it seems like it might be a little unfair/bad if people can give copies of a book only to sympathetic people before it comes out, drum up support and praise from the people who received copies, and then not face criticism until much later, when skeptical people can finally get hold of the book. But also maybe that’s just the norm in book circles? And I don’t have an amazing alternative.

  2. ^

    Is that really what evolution is optimizing for? I’m not sure, but I think a lot of the point is that it’s under-constrained, and the thing that the AI is “really” optimizing for is also under-constrained. Is it reward? And if it’s reward, is it the number in some register going up, or the kind of scenario where the number in the register would go up if the register wasn’t hacked? I’m not sure.

  3. ^

    I’ve heard a lot of arguments about these points, but one thing I haven’t heard brought up before is that evolution might have an easier time than humanity at getting its values preserved. Replicating yourself is really useful. Evolution’s “goal” was picked to be something genuinely useful, not something particularly complex and glorious. So it’s kind of natural, in some ways, that humanity wouldn’t diverge from it very much.