Thoughts on “AI is easy to control” by Pope & Belrose

Quintin Pope & Nora Belrose have a new “AI Optimists” website, along with a new essay “AI is easy to control”, arguing that the risk of human extinction due to future AI (“AI x-risk”[1]) is a mere 1% (“a tail risk worth considering, but not the dominant source of risk in the world”). (I’m much more pessimistic.) It makes lots of interesting arguments, and I’m happy that the authors are engaging in substantive and productive discourse, unlike the ad hominem vibes-based drivel which is growing increasingly common on both sides of the AI x-risk issue in recent months.

This is not a comprehensive rebuttal or anything; rather, I’m picking up on a few threads that seem important for where we disagree, or where I have something I want to say.

(Note: Nora has a reply here.)

Summary / table-of-contents:

Note: I think Sections 1 & 4 are the main reasons that I’m much more pessimistic about AI x-risk than Pope & Belrose, whereas Sections 2 & 3 are more nitpicky.

  • Section 1 argues that even if controllable AI has an “easy” technical solution, there are still good reasons to be concerned about AI takeover, because of things like competition and coordination issues, and in fact I would still be overall pessimistic about our prospects.

  • Section 2 talks about the terms “black box” versus “white box”.

  • Section 3 talks about what if anything we learn from “human alignment”, including some background on how I think about human innate drives.

  • Section 4 argues that pretty much the whole essay would need to be thrown out if future AI is trained in a substantially different way from current LLMs. If this strikes you as a bizarre unthinkable hypothetical, yes I am here to tell you that other types of AI do actually exist, and I specifically discuss the example of “brain-like AGI” (a version of actor-critic model-based RL), spelling out a bunch of areas where the essay makes claims that wouldn’t apply to that type of AI, and more generally how it would differ from LLMs in safety-relevant ways.

1. Even if controllable AI has an “easy” technical solution, I’d still be pessimistic about AI takeover

Most of Pope & Belrose’s essay is on the narrow question of whether the AI control problem has an easy technical solution. That’s great! I’m strongly in favor of arguing about narrow questions. And after this section I’ll be talking about that narrow question as well. But the authors do also bring up the broader question of whether AI takeover is likely to happen, all things considered. These are not the same question; for example, there could be an easy technical solution that people nevertheless don’t use.

So, for this section only, I will assume for the sake of argument that there is in fact an easy technical solution to the AI control and/or alignment problem. Unfortunately, in this world, I would still think future catastrophic takeover by out-of-control AI is not only plausible but likely.

Suppose someone makes an AI that really really wants something in the world to happen, in the same way a person might really really want to get out of debt, or Elon Musk really really wants for there to be a Mars colony—including via means-end reasoning, out-of-the-box solutions, inventing new tools to solve problems, and so on. If that “want” is stronger than the AI’s desires and habits for obedience and norm-following (if any), and if the AI is sufficiently capable, then the natural result would be an AI that irreversibly escapes human control—see instrumental convergence.

But before we get to that, why might we suppose that someone might make an AI that really really wants something in the world to happen? Well, lots of reasons:

  • People have been trying to do that since the dawn of AI.

  • Humans often really really want something in the world to happen (e.g., for there to be more efficient solar cells, for my country to win the war, to make lots of money, to do a certain very impressive thing that will win fame and investors and NeurIPS papers, etc.), and one presumes that some of those humans will reason “Well, the best way to make X happen is to build an AI that really really wants X to happen”. You and I might declare that these people are being stupid, but boy, people do stupid things every day.

  • As AI advances, more and more people are likely to have an intuition that it’s unethical to exclusively make AIs that have no rights and are constitutionally subservient with no aspirations of their own. This is already starting to happen. I’ll put aside the question of whether or not that intuition is justified.

  • Some people think that irreversible AGI catastrophe cannot possibly happen regardless of the AGI’s motivations and capabilities, because of [insert stupid reason that doesn’t stand up to scrutiny], or will be prevented by [poorly-thought-through “guardrail” that won’t actually work]. One hopes that the number of such people will go down with time, but I don’t expect it to go to zero.

  • Some people want to make AGI as capable and independent as possible, even if it means that humanity will go extinct, because “AI is the next step of evolution” or whatever. Mercifully few people think that! But they do exist.

  • Sometimes people do things just to see what would happen (cf. ChaosGPT).

So now I wind up with a strong default assumption that the future world will have both AIs under close human supervision and out-of-control consequentialist AIs ruthlessly seeking power. So, what should we expect to happen at the end of the day? It depends on offense-defense balance, and regulation, and a host of other issues. This is a complicated topic with lots of uncertainties and considerations on both sides. As it happens, I lean pessimistic that humanity will survive; see my post What does it take to defend the world against out-of-control AGIs? for details. Again, I think there’s a lot of uncertainty, and scope for reasonable people to disagree—but I don’t think one can think carefully and fairly about this topic and wind up with a probability as low as 1% that there will ever be a catastrophic AI takeover.

2. Black-and-white (box) thinking

The authors repeat over and over that AIs are “white boxes” unlike human brains which are “black boxes”. I was arguing with Nora about this a couple months ago, and Charles Foster also chimed in with a helpful perspective on twitter, arguing convincingly that the terms “white box” and “black box” are used differently in different fields. My takeaway is: I’m sick of arguing about this and I really wish everybody would just taboo the words “black box” and “white box”—i.e., say whatever you want to say without using those particular words.

So, here are two things that I hope everybody can agree with:

  • (A) Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer. In this respect, LLMs are different from many other engineered artifacts[2] such as bridges and airplanes. For example, if an airplane reliably exhibits a certain behavior (let’s say, it tends to pitch left in unusually low air pressure), and you ask me “why does it exhibit that behavior?” then it’s a safe bet that the airplane designers could figure out a satisfying intuitive answer pretty quickly (maybe immediately, maybe not, but certainly not decades). Likewise, if a non-ML computer program, like the Linux kernel, reliably exhibits a certain behavior, then it’s a safe bet that there’s a satisfying intuitive answer to “why does the program do that”, and that the people who have been writing and working with the source code could generate that answer pretty quickly, often in minutes. (There are such things as buggy behaviors that take many person-years to understand, but they make good stories partly because they are so rare.)

  • (B) Hopefully everyone on all sides can agree that if you train an LLM, then you can view any or all the billions of weights and activations, and you can also perform gradient descent on the weights. In this respect, LLMs are different from biological intelligence, because biological neurons are far harder to measure and manipulate experimentally. Even mice have orders of magnitude too many neurons to measure and manipulate the activity of all of them in real time, and even if you could, you certainly couldn’t perform gradient descent on an entire mouse brain.
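
To make point (B) concrete, here is a minimal sketch (in PyTorch, using a tiny stand-in network rather than a real LLM) of the kind of access I mean: every weight and every intermediate activation can be read out directly, and a single backward pass yields gradients of a loss with respect to all of the weights at once. None of this is remotely possible for a biological brain.

```python
import torch
import torch.nn as nn

# Tiny stand-in network (not a real LLM), just to illustrate point (B):
# full read-out of weights and activations, plus gradients on all weights.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))

activations = {}
def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level container
        module.register_forward_hook(save_activation(name))

x = torch.randn(1, 8)
loss = model(x).square().mean()
loss.backward()  # every parameter p now has p.grad; a gradient step is just p -= lr * p.grad

print({name: tuple(p.shape) for name, p in model.named_parameters()})  # every weight, inspectable
print({name: tuple(a.shape) for name, a in activations.items()})       # every activation, inspectable
```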

Again, I hope we can agree on those two things (and similar), even if we disagree about what those facts imply about AI x-risk. For the record, I don’t think either of the above bullet points by itself should be sufficient to make someone feel optimistic or pessimistic about AI x-risk. But they can be an ingredient in a larger argument. So can we all stop arguing about whether LLMs are “black box” or “white box”, and move on to the meaningful stuff, please?

3. What lessons do we learn from “human alignment” (such as it is)?

The authors write:

If we could observe and modify everything that’s going on in a human brain, we’d be able to use optimization algorithms to calculate the precise modifications to the synaptic weights which would cause a desired change in behavior. Since we can’t do this, we are forced to resort to crude and error-prone tools for shaping young humans into kind and productive adults. We provide role models for children to imitate, along with rewards and punishments that are tailored to their innate, evolved drives. In essence, we are poking and prodding at the human brain’s learning algorithms from the outside, instead of directly engineering those learning algorithms.

It’s striking how well these black box alignment methods work: most people do assimilate the values of their culture pretty well, and most people are reasonably pro-social. But human alignment is also highly imperfect. Lots of people are selfish and anti-social when they can get away with it, and cultural norms do change over time, for better or worse. Black box alignment is unreliable because there is no guarantee that an intervention intended to change behavior in a certain direction will in fact change behavior in that direction. Children often do the exact opposite of what their parents tell them to do, just to be rebellious.

I think there are two quite different stories here which are confusingly tangled together.

  • Story 1: “Humans have innate, evolved drives that lead to them wanting to be prosocial, fit into their culture, imitate role models, etc., at least to some extent.”

  • Story 2: “Human children are gradually sculpted into kind and productive adults by parents and society providing rewards and punishments, and controlling their life experience in other ways.”

I basically lean towards Story 1 for reasons in my post Heritability, Behaviorism, and Within-Lifetime RL.

There are some caveats—e.g. parents can obviously “sculpt” arbitrary children into unkind and unproductive adults by malnourishing them, or by isolating them from all human contact, or by exposing them to lead dust, etc. But generally, the sentence “we are forced to resort to crude and error-prone tools for shaping young humans into kind and productive adults” sounds almost as absurd to my ears as if the authors had written “we are forced to resort to crude and error-prone tools for shaping young humans into adults that have four-chambered hearts”. The credit goes to evolution, not “us”.

So what does that imply for AI x-risk? I don’t know, this is a few steps removed. But it brings us to the subject of “human innate drives”, a subject close to my (four-chambered) heart. I think the AGI-safety-relevant part of human innate drives—the part related to compassion and so on—is the equivalent of probably hundreds of lines of pseudocode, and nobody knows what they are. I think it would be nice if we did, and that happens to be a major research interest of mine. If memory serves, Quintin has kindly wished me luck in figuring this out. But the article here seems to strongly imply that it hardly matters, as we can easily get AI alignment and control regardless.

Instead, the authors make a big deal out of the fact that human innate drives are relatively simple (I think they mean “simple compared to a modern big trained ML model”, which I would agree with). I’m confused why that matters. Who cares if there’s a simple solution, when we don’t know what it is?

I think maybe the idea is that we’re approximating human innate drives via the RLHF reward function, so the fact that human innate drives are simple should give us confidence that the RLHF reward function (with its comparatively abundant free parameters and training data) will accurately capture human innate drives? If so, I strongly disagree with the premise: The RLHF reward function is not approximating human innate drives. Instead it’s approximating the preferences of human adults, which are not totally unrelated to human innate drives, but sure aren’t the same thing. For example, here’s what an innate drive might vaguely look like for laughter—it’s this weird thing involving certain hormone levels and various poorly-studied innate signaling pathways in the hypothalamus (if I’m right). Compare that to a human adult’s sense of humor. The RLHF reward function is approximately capturing the latter (among many other things), but it has little relation to the former.
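
To be concrete about what the RLHF reward function is fit to, here’s a minimal sketch (toy dimensions, made-up data, and a stand-in scoring head rather than a real LLM backbone) of the standard reward-model training objective: the only training signal is which of two responses an adult human rater said they preferred. Nothing in this loss refers to innate drives.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Stand-in for 'LLM backbone + scalar reward head'."""
    def __init__(self, dim=32):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, response_features):
        return self.score(response_features).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical batch: features of the rater-preferred and rater-rejected responses.
preferred = torch.randn(16, 32)
rejected = torch.randn(16, 32)

# Bradley-Terry-style objective: push the preferred response's score above the rejected one's.
loss = -torch.log(torch.sigmoid(reward_model(preferred) - reward_model(rejected))).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```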

Again, so what? Is this an argument for AI doom? No. I’m making a narrow argument against some points raised in this post. If you want to argue that the RLHF reward function does a good job of capturing the preferences of human adults, then by all means, make that argument directly. I might even agree. But let’s leave human innate drives out of it.

4. Can we all agree in advance to disavow this whole “AI is easy to control” essay if future powerful AI is trained in a meaningfully different way from current LLMs?

My understanding is that the authors expect the most powerful future AI training approaches to be basically similar to what’s used in today’s Large Language Models—autoregressive prediction of human-created text and/or other data, followed by RLHF fine-tuning or similar.

As it happens, I disagree. But if the authors are right, then … I don’t know. “AI x-risk in the scenario that future transformative AI is trained in a similar way as current LLMs” is not really my area of expertise or interest. I don’t expect that scenario to actualize, so I have difficulty thinking clearly about it—like if someone says to me “Imagine a square circle, and now answer the following questions about it…”. Anyway, if we’re specifically talking about future AI whose training is basically the same as modern LLMs, then a lot of the optimistic takes in the essay would seem pretty plausible to me. But I also often read more pessimistic narratives, and those takes sound pretty plausible to me too!! I don’t really know how I feel. I’ll step aside and leave that debate to others.

So anyway, if the authors think that future transformative AI will be trained much like modern LLMs, then that’s a fine thing for them to believe—even if I happen to disagree. Lots of reasonable people believe that. And I think these authors in particular believe it for interesting and well-considered reasons, not just “ooh, chatGPT is cool!”. I don’t want to argue about that—we’ll find out one way or the other, sooner or later.

But it means that the post is full of claims and assumptions that are valid for current LLMs (or for future AI which is trained in basically the same way as current LLMs) but not for other kinds of AI. And I think this is not made sufficiently clear. In fact, it’s not even mentioned.

Why is this a problem? Because there are people right now trying to build transformative AI using architectures and training approaches that are quite different from LLMs, in safety-relevant ways. And they are reading this essay, and they are treating it as further confirmation that what they’re doing is fine and (practically) risk-free. But they shouldn’t! This essay just plain doesn’t apply to what those people are doing!! (For a real-life example of such a person, see here & here.)

So I propose that the authors should state clearly and repeatedly that, if the most powerful future AI is trained in a meaningfully different way from current LLMs, then they disavow their essay (and, I expect, much of the rest of their website). If the authors are super-confident that that will never happen, because LLM-like approaches are the future, then such a statement would be unimportant—they’re really not conceding anything, from their own perspective. But it would be really important from my perspective!

4.1 Examples where the essay is making points that don’t apply to “brain-like AGI” (≈ actor-critic model-based RL)

I’ll leave aside the many obvious examples throughout the essay where the authors use properties of current LLMs as direct evidence about the properties of future powerful AI. Here are some slightly-less-obvious examples:

Since AIs are white boxes, we have full control over their “sensory environment” (whether that consists of text, images, or other modalities).

As a human, I can be sitting in bed, staring into space, and I can think a specific abstruse thought about string theory, and now I’ve figured out something important. If a future AI can do that kind of thing, as I expect, then it’s not so clear that “controlling the AI’s sensory environment” is really all that much control.

If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.

A human can harbor a secret desire for years, never acting on it, and their brain won’t necessarily overwrite that desire, even as they think millions of thoughts in the meantime. So evidently, the argument above is inapplicable to human brains. An interesting question is, where does it go wrong? My current guess is that the main problem is that the “desires” of actor-critic RL agents are (by necessity) mainly edited by TD learning, which I think of as generally a much cruder tool than gradient descent.
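
For concreteness, here’s a minimal sketch (tabular TD(0), toy numbers) of the kind of TD update I have in mind: the learned evaluation of a state gets nudged toward the reward plus the discounted evaluation of the next state. It’s a local bump based on what just happened, not an end-to-end gradient through whatever computation produced the behavior, which is part of why I call it a cruder tool.

```python
import numpy as np

n_states = 5
values = np.zeros(n_states)   # the critic's learned evaluations ("desires"/valence, in my framing)
alpha, gamma = 0.1, 0.95      # learning rate, discount factor

def td0_update(state, reward, next_state):
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error   # a local nudge; nothing else in the agent is touched
    return td_error

# One hypothetical transition: state 2 -> state 3, with reward 1.0
td0_update(2, 1.0, 3)
```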

We can run large numbers of experiments to find the most effective interventions, and we can also run it in a variety of simulated environments and test whether it behaves as expected both with and without the cognitive intervention. Each time the AI’s “memories” can be reset, making the experiments perfectly reproducible and preventing the AI from adapting to our actions, very much unlike experiments in psychology and social science.

That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before (different ANN weights, not just different environment / different prompt). Humans are like that too, but LLMs are not.
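
Here’s a minimal sketch (toy models, hypothetical interface) of the distinction I’m pointing at: querying a frozen model leaves its weights untouched, so the experiment can be rerun on an identical system, whereas an online-learning agent updates its weights on every interaction, so the “same” experiment run twice is actually run on two different agents.

```python
import torch
import torch.nn as nn

frozen_llm = nn.Linear(8, 8)   # stand-in for a pretrained, frozen model
agent = nn.Linear(8, 8)        # stand-in for an online-learning (brain-like) agent
opt = torch.optim.SGD(agent.parameters(), lr=0.1)

def query_frozen(x):
    with torch.no_grad():
        return frozen_llm(x)   # weights untouched; perfectly repeatable

def query_online(x, target):
    out = agent(x)
    loss = (out - target).square().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()                 # the agent's weights are now permanently different
    return out.detach()

x = torch.randn(1, 8)
a = query_frozen(x); b = query_frozen(x)     # identical outputs
c = query_online(x, torch.zeros(1, 8))
d = query_online(x, torch.zeros(1, 8))       # generally differs from c
```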

When it comes to AIs, we are the innate reward system.

I have no idea how I’m supposed to interpret this sentence for brain-like AGI, such that it makes any sense at all. Actually, I’m not quite sure what it means even for LLMs!

4.2 No, “brain-like AGI” is not trained similarly to LLMs

This seems really obvious to me, but evidently it’s controversial, so let’s walk through some example differences. None of these are trying to prove some point about AI alignment and control being easy or hard; instead I am making the narrow point that the safety/danger of future LLMs is a different technical question than the safety/danger of hypothetical future brain-like AGI.

  • Brains can imitate, but do so in a fundamentally different way from LLM pretraining. Specifically, after self-supervised pretraining, an LLM outputs exactly the thing that it expects to see. (After RLHF, that is no longer strictly true, but RLHF is just a fine-tuning step; most of the behavioral inclinations come from pretraining, IMO.) That just doesn’t make sense in a human. When I take actions, I am sending motor commands to my own arms and my own mouth etc. Whereas when I observe another human and do self-supervised learning, my brain is internally computing predictions of upcoming sounds and images etc. These are different, and there isn’t any straightforward way to translate between them. (Cf. here where Owain Evans & Jacob Steinhardt show a picture of a movie frame and ask “what actions are being performed?”) Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. A pretrained LLM, by contrast, will imitate human text with no RL or “wanting to imitate” at all; that’s just mechanically what it does.

  • Relatedly, brains have a distinction between expectations and desires, cleanly baked into the algorithms. I think this is obvious common sense, leaving aside galaxy-brain Free-Energy-Principle takes which try to deny it. By contrast, there isn’t any distinction between “the LLM expects the next token to be ‘a’” and “the LLM wants the next token to be ‘a’”. (Or if there is, it’s complicated and emergent and controversial, rather than directly and straightforwardly reflected in the source code, as I claim it would be in brain-like AGI.) So this is another disanalogy, and one with obvious relevance to technical arguments about safety (see the sketch after this list).

  • In brains, online learning (editing weights, not just context window) is part of problem-solving. If I ask a smart human a hard science question, their brain may chug along from time t=0 to t=10 minutes, as they stare into space, and then out comes an answer. After that 10 minutes, their brain is permanently different than it was before (i.e., different weights)—they’ve figured things out about science that they didn’t previously know. Not only that, but the online-learning (weight editing) that they did during time 0<t<5 minutes is absolutely critical for the further processing that happens during time 5<t<10 minutes. This is not how today’s LLMs work—LLMs don’t edit weights in the course of “thinking”. I think this is safety-relevant for a number of reasons, including whether we can expect future AI to get rapidly more capable in an open-ended way without new human-provided training data (related discussion).
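
Here’s a minimal sketch (toy modules, not real systems) of the structural point in the second bullet above, which also illustrates the first: an actor-critic agent has separate components for “what do I want to do?” (actor) and “what do I expect?” (critic), visible right in the source code, whereas a plain language model has a single next-token distribution that serves simultaneously as its prediction and as the thing it samples from to produce output.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Toy actor-critic agent: desires and expectations are separate modules."""
    def __init__(self, obs_dim=16, n_actions=4):
        super().__init__()
        self.actor = nn.Linear(obs_dim, n_actions)  # action preferences ("wants")
        self.critic = nn.Linear(obs_dim, 1)         # predicted value ("expects")

    def forward(self, obs):
        return self.actor(obs), self.critic(obs)    # two structurally distinct outputs

class TinyLM(nn.Module):
    """Toy language model: one distribution, used both to predict and to generate."""
    def __init__(self, vocab=100, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, token_ids):
        logits = self.head(self.embed(token_ids)).mean(dim=0)
        return logits   # the prediction *is* what gets sampled as the next output token

agent = ActorCritic()
wants, expects = agent(torch.randn(16))   # separate "desire" and "expectation" outputs

lm = TinyLM()
next_token = torch.multinomial(torch.softmax(lm(torch.tensor([1, 2, 3])), dim=-1), 1)
```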

I want to reiterate that I’m delighted that people are contingency-planning for the possibility that future transformative AI will be LLM-like. We should definitely be doing that. But we should be very clear that that’s what we’re doing.

  1. ^

    “x-risk” is not quite synonymous with “extinction risk”, but they’re close enough in the context of the stuff I’m talking about here.

  2. ^

    Note that I said “many other engineered artifacts”, not “all”. The other examples I can think of tend to be in biology. For example, if I selectively breed a cultivar of cabbage that has lower irrigation needs, and I notice that its stalks are all a weird color, I may have no idea why, and it may take a decades-long research project to figure it out. Or as another example, there are many pharmaceutical drugs that are effective, but where nobody knows why they are effective, even after extraordinary efforts to figure it out.