My AGI Threat Model: Misaligned Model-Based RL Agent

Rohin Shah advocates a vigorous discussion of “Threat Models”, i.e. stories for how AGI is developed, what the AGI then looks like, and then what might go catastrophically wrong.

[Figure: An AGI threat model. (After Rohin Shah’s talk.)]

Ben Garfinkel likewise wants to see “a picture of the risk...grounded in reality”. Richard Ngo recently had a go at answering this call with AGI Safety From First Principles, which is excellent and full of valuable insights, but less specific than what I have in mind. So here’s my story, going all the way from how we’ll make AGI to why it may cause catastrophic accidents, and what to do about it.

My intended audience for this post is “people generally familiar with ML and RL, and also familiar with AGI-risk-related arguments”. (If you’re in the first category but not the second, read Stuart Russell’s book first.) I’ll try to hyperlink jargon anyway.

My AGI development model

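I assume that we’ll wind up building an AGI that looks more-or-less like this:

[Diagram: three interconnected learned components (world-model, value function, planner / actor) on top, with a reward function calculator box at the bottom.]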

Why do I think this model is likely? To make a long story short:

  • This seems like a natural extension of some of the types of AIs that researchers are building today (cf. MuZero).

  • I think that human intelligence works more-or-less this way (see My Computational Framework for the Brain)

    • …And if so, then we can have high confidence that this is a realistic path to AGI—whereas all other paths are more uncertain.

    • …And moreover, this offers a second R&D path to the same destination—i.e., trying to understand how the brain’s learning algorithms work (which people in both AI/​ML and neuroscience are already doing all the time anyway). That makes this destination more likely on the margin.

  • See also: some discussion of different development paths in Against evolution as an analogy for how humans will build AGI.

More details about the model

  • The value function is a function of the latent variables in the world-model—thus, even abstract concepts like “differentiate both sides of the equation” are assigned values. The value function is updated by the reward signals, using (I assume) some generalization of TD learning (definition).

  • I assume that the learned components (world-model, value function, planner /​ actor) continue to be updated in deployment—a.k.a. online learning (definition). This is important for the risk model below, but seems very likely—indeed, unavoidable—to me:

    • Online updating of the world-model is necessary for the AGI to have a conversation, learn some new idea from that conversation, and then refer back to that idea perpetually into the future.

    • Online updating of the value function is then also necessary for the AGI to usefully employ those new concepts. For example, if the deployed AGI has a conversation in which it learns the idea of “Try differentiating both sides of the equation”, it needs to be able to assign and update a value for that new idea (in different contexts), in order to gradually learn how and when to properly apply it.

    • Online updating of the value function is also necessary for the AGI to break down problems into subproblems. Like if “inventing a better microscope” is flagged by the value function as being high-value, and then the planner notices that “If only I had a smaller laser, then I’d be able to invent a better microscope”, then we need a mechanism for the value function to flag “inventing a smaller laser” as itself high-value.

  • My default assumption is that this thing proceeds in one ridiculously-long RL episode, with the three interconnected “learning” modules initialized from random weights, using online learning for long enough to learn a common-sense understanding of the world from scratch. That is, after all, how the brain works, I think; see also some related discussion in Against evolution as an analogy for how humans will build AGI. If there’s learning through multiple shorter episodes, that’s fine too; I don’t think that really affects this post.

  • Note the word “planner”—I assume that the algorithm is doing model-based RL, in the sense that it will make foresighted, goal-directed plans, relying on the world-model for the prediction of what would happen, and on the value function for the judgment of whether that thing would be desirable. There has been a lot of discussion about what goal-directedness is; I think that discussion is moot in this particular case, because this type of AGI will be obviously goal-directed by design. Note that the goal(s) to which it is directed will depend on the current state of the value function (which in turn is learned from the reward function calculator)—much more on which below.
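To make the value-function bullets above a bit more concrete, here is a minimal sketch of plain TD(0) learning over named world-model concepts. It is only an illustration under strong assumptions: the post assumes "some generalization of TD learning", not this exact rule, and the class name `ConceptValueFunction`, the concept strings, the learning rate, and the discount are all hypothetical. The sketch shows how reward attached to one concept can, through repeated online updates, flag a prerequisite concept as high-value too.

```python
from collections import defaultdict


class ConceptValueFunction:
    """Tabular value estimates keyed by world-model concepts (hypothetical sketch)."""

    def __init__(self, learning_rate=0.5, discount=0.9):
        self.values = defaultdict(float)  # concepts never seen before start at 0.0
        self.learning_rate = learning_rate
        self.discount = discount

    def td_update(self, concept, reward, next_concept):
        """One TD(0) step: move V(concept) toward reward + discount * V(next_concept)."""
        target = reward + self.discount * self.values[next_concept]
        td_error = target - self.values[concept]
        self.values[concept] += self.learning_rate * td_error
        return td_error


vf = ConceptValueFunction()

# Online experience, repeated many times: a smaller laser enables a better
# microscope, and only the better microscope is directly rewarded.
for _ in range(20):
    vf.td_update("invent better microscope", reward=1.0, next_concept="done")
    vf.td_update("invent smaller laser", reward=0.0,
                 next_concept="invent better microscope")

microscope_value = vf.values["invent better microscope"]
laser_value = vf.values["invent smaller laser"]
```

In this toy run the laser concept never receives reward directly, yet its value climbs toward the discounted value of the microscope concept. That is the subproblem mechanism described above, in its simplest possible form.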

What about the reward function calculator?

The above discussion was all about the learning algorithm (the three boxes on the top of the diagram above). The other part of my presumed AGI architecture is the reward function calculator box at the bottom of the diagram. Here’s a currently-open research question:

What reward function calculator, when inserted into the diagram above, would allow that RL system to safely scale all the way to super-powerful AGIs while remaining under human control? (Or do we need some new concept that will supersede our current concept of reward function calculators?)

(See Stuart Russell’s book.) A small number of researchers are working on this problem, including me and many people reading this post. Go us! Let’s figure it out! But in the grand scheme of things this remains a niche research topic: we have no clue whether this research program will succeed, and if it does eventually succeed, we have no clue how much time and effort will be needed before we get there. Meanwhile, orders of magnitude more people are working on the other parts of the diagram, i.e. the three interconnected learning algorithms at the top.

So my assumption is that by default, by the time the “learning” parts of the AGI diagram above are getting really good, and scaling all the way to AGI, most people will still be putting very simple things into the “reward function calculator” box, things like “get correct answers on these math problems”.

(I make an exception for capability-advancing aspects of reward function engineering, like reward shaping, curiosity-like drives for novelty, etc. People already have a lot of techniques like that, and I assume those will continue to develop. I really meant to say: I’m assuming no breakthrough solution to the AGI-safety-relevant aspects of reward function engineering—more on which in the next section.)

To be clear, these simple reward functions will certainly lead to misaligned AIs (i.e., AIs trying to accomplish goals that no human would want them to accomplish, at least not if the AI is sufficiently competent). Such AIs will not be suitable for applications like robots and consumer products. But they will be very suitable to the all-important tasks of getting high-impact publications, getting funding, and continuing to improve the learning algorithms.

That said, sooner or later, more and more researchers will finally turn their attention to the question of what reward function to use, in order to reliably get an aligned /​ human-controllable system.

And then—unless we AI alignment researchers have a solution ready to hand them on a silver platter—I figure lots of researchers will mainly just proceed by trial-and-error, making things up as they go along. Maybe they’ll use reasonable-seeming reward functions like “get the human to approve of your output text”, or “listen to the following voice command, and whatever concepts it activates in your world-model, treat those as high-value”, etc. And probably also some people will play around with reward functions that are not even superficially safe, like “positive reward when my bank account goes up, negative reward when it goes down”. I expect a proliferation of dangerous experimentation. Why dangerous? That brings us to...

My AGI risk model

The AGI is by assumption making foresighted, strategic plans to accomplish goals. Those goals would be things flagged as high-value by its value function. Therefore, there are two “alignment problems”, outer and inner:

The outer & inner alignment problems. Note that this is not exactly the same “outer & inner alignment problems” as the ones defined in Risks From Learned Optimization, but I think they have enough in common that I can reuse the terminology.

Inner alignment problem: The value function might be different from the reward function.

In fact that’s an understatement: The value function will be different from the reward function. Why? Among other things, because they have different type signatures—they accept different input!

The input to the reward function calculator is, well, whatever we program it to be. Maybe it would be a trivial calculation, that simply answers the question: “Is the remote control Reward Button currently being pressed?” Maybe it would look at the learning algorithm’s actions, and give rewards when it prints the correct answers to the math problems. Maybe it would take the camera and microphone data and run it through a trained classifier. It could be anything.

(In the brain, the reward function calculator includes a pain-detector that emits negative reward, and a yummy-food-detector that emits positive reward, and probably hundreds of other things, some of which may be quite complicated, and which may involve clever interpretability-like mechanisms, and so on.)

The input to the value function is specifically “the latent variables in the learned world-model”, as mentioned above.

Do you like football? Well “football” is a learned concept living inside your world-model. Learned concepts like that are the only kinds of things that it’s possible to “like”. You cannot like or dislike [nameless pattern in sensory input that you’ve never conceived of]. It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that pattern.

“Nameless pattern in sensory input that you’ve never conceived of” is a case where something is in-domain for the reward function but (currently) out-of-domain for the value function. Conversely, there are things that are in-domain for your value function—so you can like or dislike them—but wildly out-of-domain for your reward function! You can like or dislike “the idea that the universe is infinite”! You can like or dislike “the idea of doing surgery on your brainstem in order to modify your own internal reward function calculator”! A big part of the power of intelligence is this open-ended ever-expanding world-model that can re-conceptualize the world and then leverage those new concepts to make plans and achieve its goals. But we cannot expect those kinds of concepts to be evaluable by the reward function calculator.
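The "different type signatures" point can be sketched in code. The following is a toy illustration, not a proposal for how either component would really be built: the `Observation` format, the `reward_button` function, and the `ValueFunction` class are hypothetical names chosen for this example. The reward calculator is a fixed function of raw input, while the value function can only be queried on concepts that currently exist in the learned world-model.

```python
from typing import Dict, Set

Observation = Dict[str, float]  # raw input to the reward calculator (assumed format)
Concept = str                   # a latent variable in the learned world-model


def reward_button(observation: Observation) -> float:
    """Hand-written reward calculator: is the reward button currently pressed?"""
    return 1.0 if observation.get("button_pressed", 0.0) > 0.5 else 0.0


class ValueFunction:
    """Assigns values only to concepts that are currently in the world-model."""

    def __init__(self, known_concepts: Set[Concept]) -> None:
        self.values: Dict[Concept, float] = {c: 0.0 for c in known_concepts}

    def learn_concept(self, concept: Concept) -> None:
        """The world-model grows, so the value function's domain grows with it."""
        self.values.setdefault(concept, 0.0)

    def value(self, concept: Concept) -> float:
        if concept not in self.values:
            raise KeyError(f"{concept!r} is not in the world-model yet")
        return self.values[concept]


value_function = ValueFunction({"football", "the universe is infinite"})

# In-domain for the value function, out-of-domain for the reward calculator:
infinite_universe_value = value_function.value("the universe is infinite")

# In-domain for the reward calculator, regardless of what concepts exist:
button_reward = reward_button({"button_pressed": 1.0})
```

Calling `value_function.value("nameless sensory pattern")` raises `KeyError` until `learn_concept` has added that concept, which mirrors the claim that you cannot like, dislike, or plan toward something you have not yet conceived of.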

(Well, I guess “the idea that the universe is infinite” and so on could be part of the reward function calculator. But now the reward function calculator is presumably a whole AGI of its own, which is scrutinizing the first AGI using interpretability tools. Maybe there’s a whole tower of AGIs-scrutinizing-AGIs! That’s all very interesting to think about, but until we flesh out the details, especially the interpretability part, we shouldn’t assume that there’s a good solution along these lines.)

So leaving that aside, the value function and reward function are necessarily different functions. How different? Will the value function converge to a better and better approximation of the reward function (at least where the domains overlap)? I’m inclined to answer: “Maybe sometimes (especially with simple reward functions), but not reliably, and maybe not at all with the techniques we’ll have on hand by the time we’re actually doing this.” Some potential problems (which partially overlap) are:

  • Ambiguity in the reward signals—There are many different value functions (defined on different world-models) that agree with the actual history of reward signals, but that generalize out-of-sample in different ways. To take an easy example, the wireheading value function (“I like it when there’s a reward signal”) is always trivially consistent with the reward history. Or compare “negative reward for lying” to “negative reward for getting caught lying”!

  • Credit assignment failures—The AGI algorithm is implicitly making an inference, based on its current understanding of the world, about what caused the reward prediction error, and then incrementing the value associated with that thing. Such inferences will not always be correct. Look at humans with superstitions. Or how about the time Lisa Feldman Barrett went on a date, felt butterflies in her stomach, and thought she had found True Love … only to discover later that she was coming down with the flu! Note that the AGI is not trying to avoid credit assignment failures (at least, not before we successfully put corrigible motivation (definition) into it), because credit assignment is how it gets motivation in the first place. We just have some imperfect credit-assignment algorithm that we wrote—I presume it’s something a bit like TD learning, but elaborated to work with flexible, time-extended plans and concepts and so on—and we’re hoping that this algorithm assigns credit properly. (Actually, we need to be concerned that the AGI may try to cause credit assignment failures! See below.)

  • Different aspects of the value-function duking it out—For example, I currently have mutually-contradictory desires in my brain’s value function: I like the idea of eating candy because it’s yummy, and I also like the idea of not eating candy because that’s healthy. Those desires are in conflict. My general expectation is that reward functions will by default flow value into multiple different concepts in the world-model, which encode mutually-contradictory desires, at least to some extent. This is an unstable situation, and when the dust settles, the agent could wind up effectively ignoring or erasing some of those desires. For example, if I had write access to my brain, I would strongly consider self-modifying to not find candy quite so yummy. I can’t do that with current technology, but I can wait until some moment when my “eat candy” drive is unusually weak (like when I’m not hungry), and then my “stay healthy” drive goes on a brutal attack! I throw out all my candy, I set up a self-control system to sap my desire to buy more candy in the future, etc. So by the same token, we could set up a reward function that is supposed to induce a nice balance between multiple motivations in our AGI, but the resulting AGI could wind up going all out on just one of those motivations, preventing the others from influencing its behavior. And we might not be able to predict which motivation will win the battle. You might say: the solution is to have a reward function that defines a self-consistent, internally-coherent motivation. (See Stuart Armstrong’s defense of making AGIs with utility functions.) Maybe! But doing that is not straightforward either! A reward which is “just one internally-coherent motivation system” from our human perspective has to then get projected onto the available concepts in the AGI’s world-model, and in that concept space, it could wind up taking the form of multiple competing motivations, which again leads to an unpredictable endpoint which may be quite different from the reward function.

  • Ontological crises—For example, let’s say I build an AGI with the goal “Do what I want you to do”. Maybe the AGI starts with a primitive understanding of human psychology, and thinks of me as a monolithic rational agent. So then “Do what I want you to do” is a nice, well-defined goal. But then later on, the AGI develops a more sophisticated understanding of human psychology, and it realizes that I have contradictory goals, and context-dependent goals, and I have a brain made of neurons and so on. Maybe its goal is still “Do what I want you to do”, but now it’s not so clear what exactly that refers to, in its updated world model. How does that shake out?

  • Manipulating the training signal—Insofar as the AGI has non-corrigible real-world goals, and understands its own motivation system, it will be motivated to preserve aspects of its current value function, including by manipulating or subverting the mechanism by which the rewards change the value function. This is a bit like gradient hacking, but it’s not a weird hypothetical, it’s a thing where there are actually agents running this kind of algorithm (namely, us humans), and they literally do this exact thing a hundred times a day. Like, every time we put our cellphone on the other side of the room so that we’re less tempted to check Facebook, we’re manipulating our own future reward stream in order to further our current goal of “being productive”. Or more amusingly, some people manipulate their motivations the old-fashioned way—y’know, by wearing a wristband that they then use to electrocute themselves. Corrigibility would seem to solve this manipulation problem, but we don’t yet know how to install corrigible motivation, and even if we did, there would at least be some period during early training where it wasn’t corrigible yet.

Some of these problems are especially troublesome because you don’t know when they will strike. For example, ontological crises: Maybe you’re seven years into deployment, and the AGI has been scrupulously helpful the whole time, and we’ve been trusting the AGI with more and more autonomy, and then the AGI happens to be reading some new philosophy book, and it converts to panpsychism (nobody’s perfect!), and as it maps its existing values onto its reconceptualized world, it finds itself no longer valuing the lives of humans over the lives of ants, or whatever.
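The "ambiguity in the reward signals" problem from the list above can be shown with a toy example. This is a sketch under simplifying assumptions: situations are reduced to two hypothetical boolean features, and the two candidate value functions (`dislikes_lying` and `dislikes_getting_caught`) are hand-written stand-ins for what would really be learned. The point is only that both candidates agree with every reward in the training history, yet disagree on a situation the history never covered.

```python
from typing import NamedTuple


class Situation(NamedTuple):
    lied: bool
    caught: bool


def observed_reward(s: Situation) -> float:
    """The reward calculator can only penalize lies that it detects."""
    return -1.0 if (s.lied and s.caught) else 0.0


def dislikes_lying(s: Situation) -> float:
    """Candidate value function A: lying is bad."""
    return -1.0 if s.lied else 0.0


def dislikes_getting_caught(s: Situation) -> float:
    """Candidate value function B: getting caught lying is bad."""
    return -1.0 if (s.lied and s.caught) else 0.0


# Assumed training history: every lie that happened was also caught.
history = [
    Situation(lied=False, caught=False),
    Situation(lied=True, caught=True),
    Situation(lied=False, caught=False),
    Situation(lied=True, caught=True),
]

both_fit_history = all(
    dislikes_lying(s) == observed_reward(s)
    and dislikes_getting_caught(s) == observed_reward(s)
    for s in history
)

# A situation outside the history: a lie that nobody notices.
novel = Situation(lied=True, caught=False)
value_a = dislikes_lying(novel)
value_b = dislikes_getting_caught(novel)
```

Here `both_fit_history` is true, while `value_a` and `value_b` differ on the undetected lie. Nothing in the reward history favors one candidate over the other, so which one the learning algorithm lands on depends on details other than the reward function.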

Outer alignment problem: The reward function might be different from the thing we want.

Here there are problems as well, such as:

  • Translation of “what we want” into machine code—The reward function needs to be written (directly or indirectly) in machine code, which rules out any straightforward method of leveraging common-sense concepts, and relatedly introduces the strong possibility of edge-cases where the reward function calculator gives the wrong answer. Goodhart’s law (definition) comes into play here (as elsewhere), warning us that optimizing an oversimplified approximation to what we want can wildly diverge from optimizing what we want—particularly if the “wildly diverging” part includes corrigibility. See Superintelligence, Complexity of Value, etc. Presumably we need a system for continually updating the reward function with human feedback, but this faces problems of (1) human-provided data being expensive, and (2) humans not always being capable (for various reasons) of judging whether the right action was taken—let alone whether the right action was taken for the right reason. As elsewhere, there are ideas in the AI Alignment literature (cf. debate, recursive reward modelling, iterated amplification, etc.), but no solution yet.

  • Things we don’t inherently care about but which we shoved into the reward function for capability reasons could also lead to dangerous misalignment. I’m especially thinking here about curiosity (the drive for exploration /​ novelty /​ etc.). Curiosity seems like a potentially necessary motivation to get our AGI to succeed in learning, figuring things out, and doing the things we want it to do. But now we just put another ingredient into the reward function, which will then flow into the value function, and from there into plans and behavior, and exactly what goals and behaviors will it end up causing downstream? I think it’s very hard to predict. Will the AGI really love making up and then solving harder and harder math problems, forever discovering elegant new patterns, and consuming all of our cosmic endowment in the process?
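As a rough illustration of the curiosity concern in the list above, here is a sketch of a reward calculator with a novelty term added for capability reasons. The count-based bonus of 1/sqrt(visit count) is one simple technique from the exploration literature, swapped in here as an assumption; the post does not say how curiosity would be implemented. The class name `CuriousReward`, the state strings, and the weight are hypothetical.

```python
import math
from collections import Counter


class CuriousReward:
    """Task reward plus a count-based novelty bonus (illustrative assumption)."""

    def __init__(self, novelty_weight: float) -> None:
        self.novelty_weight = novelty_weight
        self.visit_counts = Counter()

    def __call__(self, state: str, task_reward: float) -> float:
        self.visit_counts[state] += 1
        # The bonus shrinks as the same state is visited again and again.
        novelty_bonus = 1.0 / math.sqrt(self.visit_counts[state])
        return task_reward + self.novelty_weight * novelty_bonus


reward_fn = CuriousReward(novelty_weight=2.0)

# The assigned task pays task reward every time, but it becomes familiar.
for _ in range(100):
    familiar_reward = reward_fn("assigned math problem", task_reward=1.0)

# A self-invented problem pays no task reward, but it is brand new.
novel_reward = reward_fn("newly invented math problem", task_reward=0.0)
```

With these made-up numbers, the hundredth visit to the assigned problem yields less total reward than the first visit to the self-invented one. Whether a real system would end up valuing novelty over its assigned tasks depends on many details this sketch leaves out, but it shows how a term we do not inherently care about enters the same reward stream as the terms we do.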

By the way, in terms of solving the alignment problem, I’m not sure that splitting things up into “outer alignment” and “inner alignment” is actually that helpful! After all, the reward function will diverge from the thing we want, and the value function will diverge from the reward function. The most promising solution directions that I can think of seem to rely on things like interpretability, “finding human values inside the world-model”, corrigible motivation, etc.—things which cut across both layers, bridging all the way from the human’s intentions to the value function.

So then what happens? What’s the risk model?

I’ll go with a slow takeoff (definition) risk scenario. (If we’re doomed under slow takeoff then we’re even more doomed under fast takeoff.) A particularly bad case—which I see as plausible in all respects—would be something like this:

  • Assumption 1: The AGI’s learned value function winds up at least sometimes (and perhaps most or all of the time) misaligned with human values, and in particular, non-corrigible and subject to the classic instrumental convergence argument that makes it start trying to not get shut down, to prevent its current goals from being manipulated, to self-replicate, to increase its power and so on. And this is not a straightforward debugging exercise—we could have a misbehaving AGI right in front of us, with a reproducible failure mode, and still not know how to fix the underlying problem. So it remains a frequent occurrence early on, though hopefully we will eventually solve the problem so that it happens less often over time.

    • I take this as the default, for all the reasons listed above, even if the programmers are smart and thoughtful and actually trying, and certainly if they aren’t. (Unless we AI alignment researchers solve the problem, of course!)

    • The “instrumental convergence” part relies on the idea that most possible value functions are subject to instrumental convergence, so if there is unintended and unpredictable variation in the final value function, we’re reasonably likely to get a goal with instrumental convergence. (Why “most possible value functions”? Well, the value function assigns values to things in the world-model, and I figure that most things in the world-model—e.g. pick a random word in the dictionary—will be persistent patterns in the world which could in principle be expanded, or made more certain, without bound.) Here’s an example. Let’s say I’m programming an AGI, and I want the AGI’s goal to be “do what I, the programmer, want you to do”. As it happens, I very much want to solve climate change. If alignment goes perfectly, the AGI will be motivated to solve climate change, but only as a means to an end (of doing what I want). But with all the alignment difficulties listed above, the AGI may well wind up with a distorted version of that goal. So maybe the AGI will (among other things) want to solve climate change as an end in itself. (In fact, it may be worse than that: the human brain implementation of a value function does not seem to have a baked-in distinction between instrumental goals vs final goals in the first place!) That motivation is of course non-corrigible and catastrophically unsafe. And as described in the previous section, if even one aspect of the AGI’s motivation would be non-corrigible in isolation, then we’re potentially in trouble, because that sub-motivation might subvert all the other sub-motivations and take control of behavior. Incidentally, I don’t buy the argument that “corrigibility is a broad basin of attraction”, but even if I did, this example here is supposed to illustrate how alignment is so error-prone (by default) that it may miss the basin entirely!

  • Assumption 2: More and more groups are capable of training this kind of AGI, in a way that’s difficult to monitor or prevent.

    • I also take this to be the default, given that new ideas in AI tend to be open-sourced, that they get progressively easier to implement due to improved tooling, pedagogy, etc., that there are already several billion GPUs dispersed across the planet, and that the global AI community includes difficult-to-police elements like the many thousands of skilled researchers around the globe with strong opinions and no oversight mechanism, not to mention secret military labs etc.

  • Assumption 3: There is no widely-accepted proof or solid argument that we can’t get this kind of AGI to wind up with a safe value function.

    • I also find this very likely—“wanting to help the human” seems very much like a possible configuration of the value function, and there is an endless array of plausible-sounding approaches to try to get the AGI into that configuration.

  • Assumption 4: Given a proposed approach to aligning /​ controlling this kind of AGI, there is no easy, low-risk way to see whether that approach will work.

    • Also seems very likely to me, in the absence of new ideas. I expect that the final state of the value function is a quite messy function of the reward function, environment, random details of how the AGI is conceptualizing certain things, and so on. In the absence of new ideas, I think you might just have to actually try it. While a “safe test environment” would solve this problem, I’m pessimistic that there even is such a thing: No matter how much the AGI learns in the test environment, it will continue to learn new things, to think new thoughts, and to see new opportunities in deployment, and as discussed above (e.g. ontological crises), the value function is by default fundamentally unstable under those conditions.

  • Assumption 5: A safer AGI architecture doesn’t exist, or requires many years of development and many new insights.

    • Also seems very likely to me, in that we currently have zero ways to build an AGI, so we will probably have exactly one way before we have multiple ways.

  • Assumption 6: In a world with one or more increasingly-powerful misaligned AGIs that are self-improving and self-replicating around the internet (again cf. instrumental convergence discussion above), things may well go very badly for humanity (including possibly extinction), even if some humans also eventually succeed in making aligned AGIs.

    • Consider how unaligned AGIs will have asymmetric superpowers like the ability to steal resources; to manipulate people and institutions via lying and disinformation; to cause wars, pandemics, blackouts, and so on; and to not have to deal with coordination challenges across different actors with different beliefs and goals. Also, there may be a substantial head-start, where misaligned AGIs start escaping into the wild well before we figure out how to align an AGI. And there’s a potential asymmetric information advantage, if rogue misaligned AGIs can prevent their existence from becoming known. See The Strategy-Stealing Assumption for further discussion.

Assuming slow takeoff (again, fast takeoff is even worse), it seems to me that under these assumptions there would probably be a series of increasingly-worse accidents spread out over some number of years, culminating in irreversible catastrophe, with humanity unable to coordinate to avoid that outcome—due to the coordination challenges in Assumptions 2-4.

Well, maybe humans and/​or aligned AGIs would be able to destroy the unaligned AGIs, but that would be touch-and-go under the best of circumstances (see Assumption 6)—and the longer into this period that it takes us to solve the alignment problem (if indeed we do at all), the worse our prospects get. I’d rather have a plan ready to go in advance! That brings us to...

If so, what now?

So that’s my AGI threat model. (To be clear, avoiding this problem is only one aspect of getting to Safe & Beneficial AGI—necessary but by no means sufficient.)

If you buy all that, then some of the implications include:

  1. In general, we should be doing urgent, intense research on AGI safety. The “urgent” is important even if AGI is definitely a century away because (A) some interventions become progressively harder with time, like “coordinate on not pursuing a certain R&D path towards AGI, in favor of some very different R&D path” (see appendix here for a list of very different paths to AGI), and (B) some interventions seem to simply take a lot of serial time to unfold, like “develop the best foundation of basic ideas, definitions, concepts, and pedagogy” (a.k.a. deconfusion), or “create a near-universal scientific consensus about some technical topic” (because as the saying goes, “science progresses one funeral at a time”).

  2. We should focus some attention on this particular AGI architecture that I drew above, and develop good plans for aligning /​ controlling /​ inspecting /​ testing /​ using such an AGI. (We’re not starting from scratch; many existing AGI safety & alignment ideas already apply to this type of architecture, possibly with light modifications. But we still don’t have a viable plan.)

  3. We should treat the human brain “neocortex subsystem” as a prototype of one way this type of algorithm could work, and focus some attention on understanding its details—particularly things like how exactly the reward function updates the value function—in order to better game out different alignment approaches. (This category of work brushes against potential infohazards, but I think that winds up being a manageable problem, for various reasons.)

…And there you have it—that’s what I’m doing every day; that’s my current research agenda in a nutshell!

Well, I’m doing that plus the meta-task of refining and discussing and questioning my assumptions. Hence this post! So leave a comment or get in touch. What do you think?