(Thanks to Ajeya Cotra, Nick Beckstead, and Jared Kaplan for helpful comments on a draft of this post.)
I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this post).
Most possible ML techniques for avoiding egregious misalignment depend on detailed facts about the space of possible models: what kind of thing do neural networks learn? how do they generalize? how do they change as we scale them up?
But I feel like we should be possible to avoid egregious misalignment regardless of how the empirical facts shake out—it should be possible to get a model we build to do at least roughly what we want. So I’m interested in trying to solve the problem in the worst case, i.e. to develop competitive ML algorithms for which we can’t tell any plausible story about how they lead to egregious misalignment.
This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it’s possible, there are several ways in which it could actually be easier:
We can potentially iterate much faster, since it’s often easier to think of a single story about how an algorithm can fail than it is to characterize its behavior in practice.
We can spend a lot of our time working with simple or extreme toy cases that are easier to reason about, since our algorithm is supposed to work even in these cases.
We can find algorithms that have a good chance of working in the future even if we don’t know what AI will look like or how quickly it will advance, since we’ve been thinking about a very wide range of possible failure cases.
I’d guess there’s a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can’t come up with a plausible story about how it leads to egregious misalignment. That’s a high enough probability that I’m very excited to gamble on it. Moreover, if it fails I think we’re likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable.
What this looks like (3 examples)
My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.”
Example 1: human feedback
In an unaligned benchmark I describe a simple AI training algorithm:
Our AI observes the world through a bunch of cameras and outputs motor actions. We train a generative model that predicts these camera observations given the motor actions.
We ask humans to evaluate possible futures by looking at the predicted videos output by the model. We then train a model to predict these human evaluations.
At test time the AI searches for plans that lead to trajectories that look good to humans.
In the same post, I describe a plausible story about how this algorithm leads to egregious misalignment:
Our generative model understands reality better than human evaluators.
There are plans that acquire influence in ways that are obvious to the generative model but completely incomprehensible and invisible to humans.
It’s possible to use that influence to “hack” the cameras, in the sense of creating a fiction that looks convincing to a human looking at predicted videos.
The fiction can look much better than the actual possible futures. So our planning process finds an action that covertly gathers resources and uses them to create a fiction.
I don’t know if or when this kind of reward hacking would happen — I think it’s pretty likely eventually, but it’s far from certain and it might take a long time.
But from my perspective this failure mode is at least plausible — I don’t see any contradictions between this sequence of events and anything I know about the real world. So this is enough for me to conclude that human feedback can’t handle the worst plausible situation, and to keep looking for an algorithm that can.
To better understand whether this story is really plausible, we can spend time refining it into something more and more concrete to see if it still seems to make sense. There are lots of directions in which we could add detail:
Specify how our model could fool humans — maybe it sets some configuration files in ways that introduce security vulnerabilities we can’t understand, and sends an email that looks innocuous to us but is actually enlisting aid from an external attacker.
Specify how an attacker could ultimately corrupt cameras and how that would lead to a higher reward — maybe we were trying to train an agent to design our new computing cluster, and it is able to corrupt the benchmarks we were using to estimate cluster performance.
Specify how the algorithm works in more detail — maybe it uses MCTS with a learned value function for planning against the learned models, and we spend 10% of our budget on human evaluations of the policies output by the current planner, and so on.
Filling more and more details lets us notice if our abstract story was actually incoherent in important ways, or to notice weird things the story implies about the world that we might want to rule out by assumption.
Example 2: iterated amplification
To avoid the problems with raw human feedback, we could train additional ML assistants that help us evaluate outcomes. For example, assistants could point out possible consequences of a plan that we didn’t notice. Various variants of this idea are explored in benign model-free RL, supervising strong learners by amplifying weak experts, AI safety via debate, and recursive reward modeling.
In inaccessible information I tried to explore a story about how this entire family of algorithms could fail:
Our AI learns to model the world by looking at a lot of videos, doing gradient descent over a bunch of possible physical models, and selecting a good one. It might model physics with a complicated mix of low-level physics and high-level abstractions.
This model lets it predict data like camera observations, but also predict “off-screen” events like what code is running on my computer (even if the computation be directly observed on video). These predictions are economically useful and don’t have any close substitute.
The process of finding this model of the world is much more complex than anything our AI can do at inference time, and the intermediate results are too complex and numerous to be “memorized” in the weights of our trained AI. So there doesn’t seem to be any way to break the model-finding work into pieces that can be delegated to an ML assistant (in amplification) or a debater (in debate).
If we decide that we don’t trust this physical model and so don’t use it, then we aren’t economically competitive. But if we decide that we do trust the model, then we are back in the unfortunate situation from the last time: our AI might find a plan in which the “off-screen” information (computers outside of our view) are taking actions that ultimately hack our cameras and create a compelling fiction that we will rate highly. Because amplification gives us no way to access that off-screen data, we appear to have no way to tell if the plan is bad.
Example 3: imitative generalization
Imitative generalization is intended to address this problem with iterated amplification.
To briefly summarize: instead of using gradient descent to search over a space of human-incomprehensible models that predict some data (e.g. autoregressive models of videos), we try to search over space of models that a human can “understand” (perhaps with the help of aligned assistants as in amplification or debate), and optimize for a model that both looks plausible to the human and allows the human to successfully predict the same data (i.e. to predict the next pixel of a video). We hope that this allows us to find a human-comprehensible model that allows the human to both predict the data and figure out if the camera is being hacked.
(This algorithm is quite vague, so you could think of it as a whole family of algorithms based on how you parametrize the space of “human-comprehensible” models, how you search over that space, and how you define the prior. I’m going to try to tell a story about the limitations of this whole approach.)
Here’s an exotic situation where I think the naive version of this approach wouldn’t work:
Our AI learns an extremely low-level model of the world defined in terms of concepts like “atoms.” It makes predictions about high-level concepts (like human behavior) by simulating the lower-level parts of which they are made.
There is an easy way to read our dataset itself off of the low level simulation — -perhaps you can read bits off of our computer easily when expressed in the low-level simulation. So our AI learns how to use that extraction procedure without learning anything else explicit about the mapping between high-level and low-level concepts.
Any model which allows the human to predict the video must include all of the low-level facts. But that means that any further high-level facts (e.g. about how the low-level structure relates to the high-level things we care about) can only decrease the prior probability without any compensating increase in the ability to explain data.
There are many obvious ways to try to address this problem, but I think it does break the most obvious implementations of imitative generalization. So now I have two questions:
Is this story really plausible? It looks bad, but maybe as we flesh it out we’ll discover that it’s implausible.
Refine imitative generalization so that it handles the situation in this situation.
After a little bit of inspection it turns out that the original story is inconsistent: it’s literally impossible to run a detailed low-level simulation of physics in situations where the computer itself needs to be part of the simulation. So the story as I told it is inconsistent, and we can breathe a temporary sigh of relief.
Unfortunately, the basic problem persists even when we make the story more complicated and plausible. Our AI inevitably needs to reason about some parts of the world in a heuristic and high-level way, but it could still use a model that is lower-level than what humans are familiar with (or more realistically just alien but simpler). And at that point we have the same difficulty.
It’s possible that further refinements of the story would reveal other inconsistencies or contradictions with what we know about ML. But I’ve thought enough about this that I think this failure story is probably something that could actually happen, and so I’m back to the step of improving or replacing imitative generalization.
This story is even more exotic than the ones in the previous sections. I’m including it in part to illustrate how much I’m willing to push the bounds of “plausible.” I think it’s extremely difficult to tell completely concrete and realistic stories, so as we make our stories more concrete they are likely to start feeling a bit strange. But I think that’s OK if we are trying to think about the worst case, until the story starts contradicting some clear assumptions about reality that we might want to rely on for alignment. When that happens, I think it’s really valuable to talk concretely about what those assumptions are, and be more precise about why the unrealistic nature of the story excuses egregious misalignment.
More general process
We start with some unaligned “benchmark”. We rule out a proposed alignment algorithm if we can come up with any story about how it can be either egregiously misaligned or uncompetitive.
I’m always thinking about a stable of possible alignment strategies and possible stories about how each strategy can fail. Depending on the current state of play, there are a bunch of different things to do:
If there’s a class of algorithms (like imitative generalization) for which I can’t yet tell any failure story, I try to tell a story about how whole the class of algorithms would fail.
If I can’t come up with any failure story, then I try to fill in more details about the algorithm. As the algorithm gets more and more concrete it becomes easier and easier to tell a failure story.
The best case is that we end up with a precise algorithm for which we still can’t tell any failure story. In that case we should implement it (in some sense this is just the final step of making it precise) and see how it works in practice.
More likely I’ll end up feeling like all of our current algorithms are doomed in the worst case. At that point I try to think of a new algorithm. For this step, it’s really helpful to look at the stories about how existing algorithms fail and try to design an algorithm that handles those difficulties.
If all of my algorithms look doomed and I can’t think of anything new, then I try to really dig in on the existing failure stories by filling in details more concretely and exploring the implications. Are those stories actually inconsistent after all? Do they turn out to contradict anything I know about the world? If so, I may add another assumption about the world that I think makes alignment possible (e.g. the strategy stealing assumption), and throw out any stories that violate that assumption or which I now realize are inconsistent.
If I have a bunch of stories about how particular algorithms fail, and I can’t think of any new algorithms, then I try to unify and generalize them to tell a story about why alignment could turn out to be impossible. This is a second kind of “victory condition” for my work, and I hope it would shed light on what the fundamental difficulties are in alignment (e.g. by highlighting additional empirical assumptions that would be necessary for any working approach to alignment).
Objections and responses
Can you really come up with a working algorithm on paper? Empirical work seems important
My goal from theoretical work is to find a credible alignment proposal. Even from that point I think it will take a lot of practical work to get it to the point where it works well and we feel confident about it in practice:
I expect most alignment schemes are likely to depend on some empirical parameters that need to be estimated from experiment, especially to argue that they are competitive. For example, we may need to show that models are able to perform some tasks, like modeling some aspects of human preferences, “easily enough.” (This seems like an unusually easy claim to validate empirically — -if we show that our 2021 models can do a task, then it’s likely that future models can as well.) Or maybe we’ve argued that the aligned optimization problem is only harder by a bounded amount, but it really matters whether it’s 1.01 or 101 as expensive, so we need to measure this overhead and how it scales empirically. I’ve simplified my methodology a bit in this blog post, and I’d be thrilled if our alignment scheme ended up depending on some clearly defined and measurable quantities for which we can start talking about scaling laws.
I don’t expect to literally have a proof-of-safety. I think at best we’re going to have some convincing arguments and some years of trying-and-failing to find a plausible failure story. That means that empirical research can still turn up failures we didn’t anticipate, or (more realistically) places where reality doesn’t quite match our on-paper picture and so we need to dig in to make sure there isn’t a failure lurking somewhere.
Even if we’ve correctly argued that our scheme is workable, it’s still going to take a ton of effort to make it actually work. We need to write a bunch of code and debug it. We need to cope with the divergences between our conceptual “ML benchmark” and the messier ML training loops used in practice, even if those divergences are small enough that the theoretical algorithm still works. We need to collect the relevant datasets, even if we’ve argued that they won’t be prohibitively costly. And so on.
My view is that working with pen and paper is an important first step that allows you to move quickly until you have something that looks good on paper. After that point I think you are mostly in applied world, and I think that applied investments are likely to ultimately dwarf the theoretical investments by orders of magnitude even if it turns out that we found a really good algorithm on paper.
That’s why I’m personally excited about “starting with theory,” but I think we should do theoretical and applied work in parallel for a bunch of reasons:
We need to eventually be able to make alignment techniques in the real world, and so we want to get as much practice as we can. Similarly, we want to build and grow capable teams and communities with good applied track records.
There’s a good chance (50%?) that no big theoretical insights are forthcoming and empirical work is all that matters. So we really can’t wait on theoretical progress.
I think there’s a reasonable chance of empirical work turning up unknown unknowns that change how we think about alignment, or to find empirical facts that make alignment easier. We want to get those sooner rather than later.
Why think this task is possible? 50% seems way too optimistic
When I describe this methodology, many people feel that I’ve set myself an impossible task. Surely any algorithm will be egregiously misaligned under some conditions?
My “50% probability of possibility” is coming largely from a soup of optimistic intuitions. I think it would be crazy to be confident on the basis of this kind of intuition, but I do think it’s enough to justify 50%:
10 years ago this project seemed much harder to me and my probability would have been much lower. Since then I feel like I’ve made a lot of progress in my own thinking about this problem (I think that a lot of this was a personal journey of rediscovering things that other people already knew or answering questions in a way that was only salient to me because of the way I think about the domain). I went from feeling kind of hopeless, to feeling like indirect normativity formalized the goal, to thinking about evaluating actions rather than outcomes, to believing that we can bootstrap superhuman judgments using AI assistants, to understanding the role of epistemic competitiveness, to seeing that all of these theoretical ideas appear to be practical for ML alignment, to seeing imitative generalization as a plausible approach to the big remaining limitation of iterated amplification.
There is a class of theoretical problems for which I feel like it’s surprisingly often possible to either solve the problem or develop a clear picture of why you can’t. I don’t really know how to pin down this category but it contains almost all of theoretical computer science and mathematics. I feel like the “real” alignment problem is a messy practical problem, but that the worst-case alignment problem is more like a theory problem. Some theory problems turn out to be hard, e.g. it could be that worst-case alignment is as hard as P vs NP, but it seems surprisingly rare and even being as hard as P vs NP wouldn’t make it worthless to work on (and even for P vs NP we get various consolation prizes showing us why it’s hard to argue that it’s hard). And even for messy domains like engineering there’s something similar that often feels true, where given enough time we either understand how to build+improve a machine (like an engine or rocket) or we understand the fundamental limits that make it hard to improve further.
So if it’s not possible to find any alignment algorithm that works in the worst case, I think there’s a good chance that we can say something about why, e.g. by identifying a particular hard case where we don’t know how to solve alignment and where we can say something about what causes misalignment in that case. This is important for two reasons: (i) I think that would be a really great consolation prize, (ii) I don’t yet see any good reason that alignment is impossible, so that’s a reason to be a bit more optimistic for now.
I think one big reason to be more skeptical about alignment than about other theoretical problems is that the problem statement is incredibly imprecise. What constitutes a “plausible story,” and what are the assumptions about reality that an alignment algorithm can leverage? My feeling is that full precision isn’t actually essential to why theoretical problems tend to be soluble. But even more importantly, I feel like there is some precise problem here that we are groping towards, and that makes me feel more optimistic. (I discuss this more in the section “Are there any examples of this methodology working?”)
Egregious misalignment still feels weird to me and I have a strong intuitive sense that we should be able to avoid it, at least in the case of a particular known technique like ML, if only we knew what we were doing. So I feel way more optimistic about being able to avoid egregious misalignment in the worst case than I do about most other theoretical or practical problems for which I have no strong feasibility intuition. This feasibility intuition also often does useful work for us since we can keep asking “Does this intermediate problem still feel like it should obviously be soluble?” and I don’t feel like this approach has yet led me into a dead end.
Modern ML is largely based on simple algorithms that look good on paper and scale well in practice. I think this makes it much more plausible that alignment can also be based on simple algorithms that look good on paper and scale well in practice. Some people think of Sutton’s “bitter lesson” as bad news for the difficulty of alignment, and perhaps it is in general, but I think it’s great news if you’re looking for something really simple.
Despite having lots of optimistic words to say, feasibility is one of my biggest concerns with my methodology.
These failure stories involve very unrealistic learned models
My failure stories involve neural networks learning something like “simulate physics at a low level” or “perform logical deductions from the following set of axioms.” This is not the kind of thing that a neural network would learn in practice. I think this leads many people to be skeptical that thinking about such simplified stories could really be useful.
I feel a lot more optimistic:
I don’t think neural network cognition will be simple, but I think it will involve lots of the features that come up in simple cognition: powerful models will likely make cognitive steps similar to logical deduction, bayesian updating, modeling physics at some level of abstraction, and so on.
If our alignment techniques don’t work for simple cognition, I’m skeptical that they will work for complex cognition. I haven’t seen any alignment schemes that leverage complexity per se in order to work. A bigger and messier model is more likely to have some piece of its cognition that satisfies any given desirable property — -for example it’s more likely to have particular neurons that whose behavior can be easily understood — -but seems less likely to have every piece of its cognition satisfy any given desirable property.
I think it’s very reasonable to focus on capable models — -we don’t need to solve alignment for models that can’t speak natural language or understand roughly what humans want. I think that’s OK: we should imagine simple models being very capable, and we can rule out a failure story as implausible if it involves the model being too weak.
I think it’s more plausible for an alignment scheme to work well for simple cognition but fail for complex cognition. But in that case my methodology will just start with the simple cognition and move on to the more complex cognition, and I think that’s OK.
Are there any examples of a similar research methodology working well? This is different from traditional theoretical work
When theorists design algorithms they often focus on the worst case. But for them the “worst case” is e.g. a particular graph on which their algorithm runs slowly, not a “plausible” story about how a model is “egregiously misaligned.”
I think this is a real, big divergence that’s going to make it way harder to get traditional theorists on board with this approach. But there are a few ways in which I think the situation is less disanalogous than it looks:
Although the majority of computer science theorists work in closed, precisely defined domains, the field also has some experience with fuzzier domains where the definitions themselves need to be refined. For example, at the beginning of modern cryptography you could describe the methodology as “Tell a story about how someone learns something about your secret” and that only gradually crystallized into definitions like semantic security (and still people sometimes retreat to this informal process in order to define and clarify new security notions). Or while defining interactive and zero knowledge proofs people would work with more intuitive notions of “cheating” or “learning” before they were able to capture them with formal definitions.
I think the biggest difference is that most parts of theoretical CS move quickly past this stage and spend most of their time working with precise definitions. That said, (i) part of this is due to the taste of the field and the increasing unwillingness to engage in hard-to-formalize activities, rather than a principled take that you need to avoid spending long in this stage, (ii) although many people are working on alignment only very few are taking the kind of approach I’m advocating here, so it’s not actually clear that we’ve spent so much more time than is typically needed in theoretical CS to formalize a new area (especially given that people in academia typically pick problems based on tractability).
Both traditional theorists and I will typically start with a vague “hard case,” e.g. “What if the graph consists of two densely connected clusters with two edges in between them?” They then tell a story about how the algorithm would fail in that case, and think about how to fix the problem. In both cases, the point is that you could make the hard case more precise if you wanted to — -you can specify more details about the graph or you can fill in more details about the story. And in both cases, we learn how to tell vague stories by repeatedly going through the exercise of making them more precise and building intuitions about what the more precise story would look like. The big difference is that you can make a graph fully precise — -you can exactly specify the set of vertices and edges — -but you can never make a story about the world fully precise because there is just too much stuff happening. I think this really does mean that the traditional theorist’s intuition about what “counts” as a hard case is better grounded. But in practice I think it’s usually a difference in degree rather than kind. E.g., you very rarely need to actually write out the full graph in order to compute exactly how an algorithm behaves.
Although the definition of a “plausible failure story” is pretty vague, most of the concrete stories we are working with can be made very specific in the ways that I think matter. For example, we may be able to specify completely precisely how a learned deduction process works (specifying the formal language L, specifying the “proof search order” it uses to loop over inferences, and so on) and why it leads to misalignment in a toy scenario.