The Plan

This is a high-level overview of the reasoning behind my research priorities, written as a Q&A.

What’s your plan for AI alignment?

Step 1: sort out our fundamental confusions about agency

Step 2: ambitious value learning (i.e. build an AI which correctly learns human values and optimizes for them)

Step 3: …

Step 4: profit!

… and do all that before AGI kills us all.

That sounds… awfully optimistic. Do you actually think that’s viable?

Better than a 50/50 chance of working in time.

Do you just have really long timelines?

No. My median is maybe 10-15 years, though that’s more a gut estimate based on how surprised I was over the past decade rather than a carefully-considered analysis. (I wouldn’t be shocked by another AI winter, especially on an inside view, but on an outside view the models generating that prediction have lost an awful lot of Bayes Points over the past few years.)

Mostly timelines just aren’t that relevant; they’d have to get down to around 18-24 months before I think it’s time to shift strategy a lot.

… Wat. Not relevant until we’re down to two years?!?

To be clear, I don’t expect to solve the whole problem in the next two years. Rather, I expect that even the incremental gains from partial progress on fundamental understanding will be worth far more than marginal time/​effort on anything else, at least given our current state.

At this point, I think we’re mostly just fundamentally confused about agency and alignment. I expect approximately-all of the gains-to-be-had come from becoming less confused. So the optimal strategy is basically to spend as much time as possible sorting out as much of that general confusion as possible, and if the timer starts to run out, then slap something together based on the best understanding we have.

18-24 months is about how long I expect it to take to slap something together based on the best understanding we have. (Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)

But iterative engineering is important!

In order for iterative engineering to be useful, we first need to have a strong enough understanding of what we even want to achieve in order to recognize when an iteration has brought us a step closer to the goal. No amount of A/​B testing changes to our website will make our company profitable if we’re measuring the wrong metrics. I claim that, for alignment, we do not yet have a strong enough understanding for iteration to produce meaningful progress.

When I say “we’re just fundamentally confused about agency and alignment”, that’s the sort of thing I’m talking about.

To be clear: we can absolutely come up with proxy measures of alignment. The problem is that I don’t expect iteration under those proxy measures to get us meaningfully closer to aligned AGI. No reasonable amount of iterating on gliders’ flight-range will get one to the moon.

But engineering is important for advancing understanding too!

I do still expect some amount of engineering to be central for making progress on fundamental confusion. Engineering is one of the major drivers of science; failed attempts to build amplifiers drove our first decent understanding of semiconductors, for instance. But this is a very different path-to-impact than directly iterating on “alignment”, and it makes sense to optimize our efforts differently if the path-to-impact is through fundamental understanding. Just take some confusing concept which is fundamental to agency and alignment (like abstraction, or optimization, or knowledge, or …) and try to engineer anything which can robustly do something with that concept. For instance, a lot of my own work is driven by the vision of a “thermometer of abstraction”, a device capable of robustly and generalizably measuring abstractions and presenting them in a standard legible format. It’s not about directly iterating on some alignment scheme, it’s about an engineering goal which drives and grounds the theorizing and can be independently useful for something of value.

Also, the theory-practice gap is a thing, and I generally expect the majority of “understanding” work to go into crossing that gap. I consider such work a fundamental part of sorting out confusions; if the theory doesn’t work in practice, then we’re still confused. But I also expect that the theory-practice gap is only very hard to cross the first few times; once a few applications work, it gets much easier. Once the first field-effect transistor works, it’s a lot easier to come up with more neat solid-state devices, without needing to further update the theory much. That’s why it makes sense to consider the theory-practice gap a part of fundamental understanding in its own right: once we understand it well enough for a few applications, we usually understand it well enough to implement many more with much lower marginal effort.

An analogy: to go from medieval castles to skyscrapers, we don’t just iterate on stone towers; we leverage fundamental scientific advances in both materials and structural engineering. My strategy for building the tallest possible metaphorical skyscraper is to put all my effort into fundamental materials and structural science. That includes testing out structures as-needed to check that the theory actually works, but the goal there is understanding, not just making tall test-towers; tall towers might provide useful data, but they’re probably not the most useful investment until we’re near the end-goal. Most of the iteration is on e.g. metallurgy, not on tower-height directly. Most of the experimentation is on e.g. column or beam loading under controlled conditions, again not on tower-height directly. If the deadline is suddenly 18-24 months, then it’s time to slap together a building with whatever understanding is available, but hopefully we figure things out fast enough that the deadline isn’t that limiting of a constraint.

What do you mean by “fundamentally confused”?

My current best explanation of “fundamental confusion” is that we don’t have the right frames. When thinking about agency or alignment, we do not know:

  • What are the most important questions to ask?

  • What approximations work?

  • What do we need to pay attention to, and what can we safely ignore?

  • How can we break the problem/​system up into subproblems/​subsystems?

For all of these, we can certainly make up some answers. The problem is that we don’t have answers to these questions which seem likely to generalize well. Indeed, for most current answers to these questions, I think there are strong arguments that they will not generalize well. Maybe we have an approximation which works well for a particular class of neural networks, but we wouldn’t expect it to generalize to other kinds of agenty systems (e.g. a bacterium), and it’s debatable whether it will even apply to future ML architectures. Maybe we know of some possible failure modes for alignment, but we don’t know which of them we need to pay attention to vs which will mostly sort themselves out, especially in future regimes/architectures which we currently can’t test. (Even more important: there’s only so much we can pay attention to at all, and we don’t know what details are safe to ignore.) Maybe we have a factorization of alignment which helps highlight some particular problems, but the factorization is known to be leaky; there are other problems which it obscures.

By contrast, consider putting new satellites into orbit. At this point, we generally know what the key subproblems are, what approximations we can make, what to pay attention to, what questions to ask. Most importantly, we are fairly confident that our framing for satellite delivery will generalize to new missions and applications, at least in the near-to-medium-term future. When someone needs to put a new satellite in orbit, it’s not like the whole field needs to worry about their frames failing to generalize.

(Note: there are probably aspects of “fundamental confusion” which this explanation doesn’t capture, but I don’t have a better explanation right now.)

What are we fundamentally confused about?

We’ve already talked about one example: I think we currently do not understand alignment well enough for iterative engineering to get us meaningfully closer to solving the real problem, in the same way that iterating on glider range will not get one meaningfully closer to going to the moon. When iterating, we don’t currently know which questions to ask, we don’t know which things to pay attention to, we don’t know which subproblems are bottlenecks.

Here’s a bunch of other foundational problems/​questions where I think we currently don’t know the right framing to answer them in a generalizable way:

  • Is an E. coli an agent? Does it have a world-model, and if so, what is it? Does it have a utility function, and if so, what is it? Does it have some other kind of “goal”?

  • What even are “human values”? What’s the type signature of human values?

  • Given two agents (with potentially completely different world models), how can I tell whether one is “trying to help” the other? What does that even mean?

  • Given a trained neural network, does it contain any subagents? What are their world-models, and what do they want?

  • Given an atomically-precise scan of a whole human brain, body, and local environment, and unlimited compute, calculate the human’s goals/​wants/​values, in a manner legible to an automated optimizer.

  • Given some physical system, identify any agents in it, and what they’re optimizing for.

  • Back out the learned objective of a trained neural net, and compare it to the training objective.

What kinds of “incremental progress” do you have in mind here?

As an example, I’ve spent the last couple years better understanding abstraction (and I’m currently working to push that across the theory-practice gap). It’s a necessary component for the sorts of questions I want to answer about agency in general (like those above), but in the nearer term I also expect it to provide very strong ML interpretability tools. (This is a technical thing, but if you want to see the rough idea, take a look at the Telephone Theorem post and imagine that the causal models are computational circuits for neural nets. There are still some nontrivial steps after that to adapt the theorem to neural nets, but it should convey the general idea, and it’s a very simple theorem.) If I found out today that AGI was two years away, I’d probably spend a few more months making the algorithms for abstraction-extraction as efficient as I could get them, then focus mainly on applying it to interpretability.

(What I actually expect/​hope is that I’ll have efficient algorithms demo-ready in the first half of next year, and then some engineers will come along and apply them to interpretability while I work on other things.)

Another example: the next major thing to sort out after abstraction will be when and why large optimized systems (e.g. neural nets or biological organisms) are so modular, and how the trained/​evolved modularity corresponds to modular structures in the environment. I expect that will yield additional actionable insights into ML interpretability, and especially into what environmental/​training features lead to more transparent ML models.

Ok, the incremental progress makes sense, but the full plan still sounds ridiculously optimistic with 10-15 year timelines. Given how slow progress has been on the foundational theory of agency (especially at MIRI), why do you expect it to go so much faster?

Mostly I think MIRI has been asking not-quite-the-right-questions, in not-quite-the-right-ways.

Not-quite-the-right-questions: when I look at MIRI’s past work on agent foundations, it’s clear that the motivating questions were about how to build AGI which satisfies various desiderata (e.g. stable values under self-modification, corrigibility, etc.). Trying to understand agency-in-general was mostly secondary, and was not the primary goal guiding choice of research directions. One clear example of this is MIRI’s work on proof-based decision theories: absolutely nobody would choose this as the most-promising research direction for understanding the decision theory used by, say, an E. coli. But plenty of researchers over the years have thought about designing AGI using proof-based internals.

I’m not directly thinking about how to design an AGI with useful properties. I’m trying to understand agenty systems in general, be it humans, ML systems, E. coli, cats, organizations, markets, what have you. My impression is that MIRI’s agent foundations team has started to think more along these lines over time (especially since Embedded Agency came out), but I think they’re still carrying a lot of baggage.

… which brings us to MIRI tackling questions in not-quite-the-right-ways. The work on Tiling Agents is a central example here: the problem is to come up with models for agents which copy themselves, so copies of the agents “tile” across the environment. When I look at that problem through an “understand agency in general” lens, my immediate thought is “ah, this is a baseline model for evolution”. Once we have a good model for agents which “reproduce” (i.e. tile), we can talk about agents which approximately-reproduce with small perturbations (i.e. mutations) and the resulting evolutionary process. Then we can go look at how evolution actually behaves to empirically check our models.

When MIRI looks at the Tiling Agents problem, on the other hand, they set it up in terms of proof systems proving things about “successor” proof systems. Absolutely nobody would choose this as the most natural setup to talk about evolution. It’s a setup which is narrowly chosen for a particular kind of “agent” (i.e. AI with some provable guarantees) and a particular use-case (i.e. maintaining the guarantees when the AI self-modifies).

Main point: it does not look like MIRI has primarily been trying to sort out fundamental confusions about agency-in-general, at least not for very long; that’s not what they were optimizing for. Their work was much more narrow than that. And this is one of those cases where I expect the more-general theory to be both easier to find (because we can use lots of data from existing agenty systems in biology, economics and ML) and more useful (because it will more likely generalize to many use-cases and many kinds of agenty systems).

Side note: contrary to popular perception, MIRI is an extremely heterogeneous org, and the criticisms above apply to different people at different times to very different degrees. That said, I think it’s a reasonable representation of the median past work done at MIRI. Also, MIRI is still the best org at this sort of thing, which is why I’m criticizing them in particular.

What’s the roadmap?

Abstraction is the main foundational piece (more on that below). After that, the next big piece will be selection theorems, and I expect to ride that train most of the way to the destination.

Regarding selection theorems: I think most of the gap between aspects of agency which we understand in theory, and aspects of agenty systems which seem to occur consistently in practice, comes from broad and robust optima. Real search systems (like gradient descent or evolution) don’t find just any optima. They find optima which are “broad”: optima whose basins fill a lot of parameter/genome space. And they find optima which are robust: small changes in the distribution of the environment don’t break them. There are informal arguments that this leads to a lot of key properties:

  • Modularity of the trained/​evolved system (which we do indeed see in practice)

  • Good generalization properties

  • Information compression

  • Goal-directedness

… but we don’t have good formalizations of those arguments, and we’ll need the formalizations in order to properly leverage these properties for engineering.
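To make the “broad optima” idea a bit more tangible, here is a minimal toy sketch. Everything in it is an assumption made up for illustration: a 1-D loss built from two quadratics (a narrow, deeper basin and a broad, shallower one), plain gradient descent, and uniformly random starting points. It is not a formalization of any selection theorem; it only shows that random-start local search mostly lands in whichever basin covers more of parameter space, even when the other optimum has lower loss.

```python
import random

# Hypothetical 1-D loss landscape, chosen only for illustration.
# Each basin is a quadratic: (curvature, center, depth at the minimum).
NARROW = (20.0, 0.5, -2.0)  # sharp and deeper, but covers little of [0, 10]
BROAD = (0.5, 6.0, -1.0)    # shallow-sloped and less deep, covers most of [0, 10]

def _quad(params, x):
    a, c, d = params
    return a * (x - c) ** 2 + d

def loss(x):
    # The landscape is the lower of the two quadratics at each point.
    return min(_quad(NARROW, x), _quad(BROAD, x))

def grad(x):
    # Gradient of whichever quadratic is currently the active (lower) one.
    a, c, _ = NARROW if _quad(NARROW, x) <= _quad(BROAD, x) else BROAD
    return 2.0 * a * (x - c)

def descend(x, lr=0.02, steps=500):
    # Plain gradient descent with a fixed step size.
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def fraction_in_broad_basin(n_starts=2000, seed=0):
    # Start from uniformly random points and count how often we end near x = 6.
    rng = random.Random(seed)
    finals = [descend(rng.uniform(0.0, 10.0)) for _ in range(n_starts)]
    return sum(abs(x - 6.0) < 0.1 for x in finals) / n_starts

if __name__ == "__main__":
    print("loss at narrow optimum:", loss(0.5))
    print("loss at broad optimum: ", loss(6.0))
    print("fraction of runs ending in the broad basin:", fraction_in_broad_basin())
```

In this toy setup the narrow optimum has the lower loss, yet the large majority of runs end in the broad basin, simply because its basin of attraction fills most of the interval. The real claims about modularity, generalization and so on are much stronger than this, which is exactly why the formalizations are needed.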

Besides that, there’s also some cruft to clean up in existing theorems around agency. For instance, coherence theorems (i.e. the justifications for Bayesian expected utility maximization) have some important shortcomings, and are incomplete in important ways. And of course there’s also work to be done on the theoretical support structure for all this—for instance, sorting out good models of what optimization even means.

Why do we need formalizations for engineering?

It’s not that we need formalizations per se; it’s that we need gears-level understanding. We need to have some understanding of why e.g. modularity shows up in trained/​evolved systems, what precisely makes that happen. The need for gears-level understanding, in turn, stems from the need for generalizability.

Let’s get a bit more concrete with the modularity example. We could try to build some non-gears-level (i.e. black-box) model of modularity in neural networks by training some different architectures in different regimes on different tasks and with different parameters, empirically computing some proxy measure of “modularity” for each trained network, and then fitting a curve to it. This will probably work great right up until somebody tries something well outside of the distribution on which this black-box model was fit. (Those crazy engineers are constantly pushing the damn boundaries; that’s largely why they’re so useful for driving fundamental understanding efforts.)
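A toy sketch of that failure mode, under loudly stated assumptions: the “true” relationship below (a saturating curve between some training knob `s` and a modularity proxy) is invented purely for illustration, as are the names `true_modularity` and `black_box`. The black-box model is just a least-squares line fit on a narrow range of `s`.

```python
# Hypothetical "ground truth": a saturating relationship between a training
# knob s and some modularity proxy. Made up for illustration only.
def true_modularity(s):
    return s / (1.0 + s)

def fit_line(xs, ys):
    # Ordinary least-squares fit of y = slope * x + intercept.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# "Training distribution": the knob only ever takes values in [0, 1].
train_s = [i / 10 for i in range(11)]
slope, intercept = fit_line(train_s, [true_modularity(s) for s in train_s])

def black_box(s):
    # The fitted black-box model: knows the curve, not the mechanism.
    return slope * s + intercept

# Worst-case error inside the range the model was fit on.
in_dist_error = max(abs(black_box(s) - true_modularity(s)) for s in train_s)
# Error when somebody pushes the knob well outside that range.
out_dist_error = abs(black_box(10.0) - true_modularity(10.0))

if __name__ == "__main__":
    print("max error in distribution:", in_dist_error)
    print("error at s = 10:          ", out_dist_error)
```

Inside the fitted range the line tracks the curve closely; at `s = 10` it keeps climbing while the hypothetical true curve has saturated, so the prediction is off by a large margin. A model which captured why the curve saturates would not have this problem.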

On the other hand, if we understand why modularity occurs in trained/​evolved systems, then we can follow the gears of our reasoning even on new kinds of systems. More importantly, we can design new systems to leverage those gears without having to guess and check.

Now, gears-level understanding need not involve formal mathematics in general. But for the sorts of things I’m talking about here (like modularity or good generalization or information compression in evolved/​trained systems), gears-level understanding mostly looks like mathematical proofs, or at least informal mathematical arguments. A gears-level answer to the question “Why does modularity show up in evolved systems?”, for instance, should have the same rough shape as a proof that modularity shows up in some broad class of evolved systems (for some reasonably-general formalization of “modularity” and “evolution”). It should tell us what the necessary conditions are, and explain why those conditions are necessary in such a way that we can modify the argument to handle different kinds of conditions without restarting from scratch.

Why so much focus on abstraction?

Abstraction is a common bottleneck to a whole bunch of problems in agency and alignment. Questions like:

  • If I have some system, what’s the right way to carve out a subsystem (which might be an “agent”, or a “world model”, or an “optimizer”, etc)? This should be robust/​general enough to let us confidently say things like e.g. “there are no agents embedded in this trained neural net”.

  • What kinds-of-things show up in world models? For instance, is an AI likely to have internal notions of “tree” or “rock” or “car” which map to the corresponding human notions, and how closely?

  • How can we empirically measure high-level abstract things (like trees or agents) in the real world, in robustly generalizable ways?

  • To the extent that humans care about high-level abstract things like trees or cars, rather than quantum fields, how can we formalize that?

  • How can we translate the internal concepts used by trained ML systems into human-legible concepts, robustly enough that we won’t miss anything important (or at least can tell if we do)?

… and so forth. The important point isn’t any one of these questions; the important point is that understanding abstraction is a blocker for a whole bunch of different things. That’s what makes it an ideal target to focus on. Once it’s worked out, I expect to be unblocked not just on the above questions, but also on other important questions I haven’t even thought of yet—if it’s a blocker for many things already, it’s probably also a blocker for other things which I haven’t noticed.

If I had to pick one central reason why abstraction matters so much, it’s that we don’t currently have a robust, generalizable and legible way to measure high-level abstractions. Once we can do that, it will open up a lot of tricky conceptual questions to empirical investigation, in the same way that robust, generalizable and legible measurement tools usually open up scientific investigation of new conceptual areas.

But, like, 10-15 years?!?

A crucial load-bearing part of my model here is that agency/alignment work will undergo a phase transition in the next ~5 years. We’ll go from a basically-preparadigmatic state, where we don’t even know what questions to ask or what tools to use to answer them, to a basically-paradigmatic state, where we have a general roadmap and toolset. Or at the very least I expect to have a workable paradigm; whether anyone else jumps on board is a more open question.

There’s more than one possible path here, more than one possible future paradigm. My estimate of “~5 years” comes from eyeballing the current rate of progress, plus a gut feel for how close the frames are to where they need to be for progress to take off.

As an example of one path which I currently consider reasonably likely: abstraction provides the key tool for the phase transition. Once we can take a simulated environment or a trained model or the like, and efficiently extract all the natural abstractions from it, that changes everything. It’ll be like introducing the thermometer to the study of thermodynamics. We’ll be able to directly, empirically answer questions like “does this model know what a tree is?” or “does this model have a notion of human values?” or “is ‘human’ a natural abstraction?” or “are the agenty things in this simulation natural abstractions?” or …. (These won’t be yes/​no answers, but they’ll be quantifiable in a standardized and robustly-generalizable way.) This isn’t a possibility I expect to be legibly plausible to other people right now, but it’s one I’m working towards.

Another path: once a few big selection theorems are sorted out (like modularity of evolved systems, for instance) and empirically verified, we’ll have a new class of tools for empirical study of agenty systems. Like abstraction measurement, this has the potential to open up a whole class of tricky conceptual questions to empirical investigation. Things like “what is this bacterium’s world model?” or “are there any subagents in this trained neural network?”. Again, I don’t necessarily expect this possibility to be legibly plausible to other people right now.

To be clear: not all of my “better than 50/50 chance of working in time” comes from just these two paths. I’ve sketched a fair amount of burdensome detail here, and there are a lot of variations which lead to similar outcomes with different details, as well as entirely different paths. But the general theme is that I don’t think it will take too much longer to get to a point where we can start empirically investigating key questions in robustly-generalizable ways (rather than the ad-hoc methods used for empirical work today), and get proper feedback loops going for improving understanding.

Why ambitious value learning?

It’s the best-case outcome. I mean, c’mon, it’s got “ambitious” right there in the name.

… but why not aim for some easier strategy?

The main possibly-easier strategy for which I don’t know of any probably-fatal failure mode is to emulate/​simulate humans working on the alignment problem for a long time, i.e. a Simulated Long Reflection. The main selling point of this strategy is that, assuming the emulation/​simulation is accurate, it probably performs at least as well as we would actually do if we tackled the problem directly.

This is really a whole class of strategies, with many variations, most of which involve training ML systems to mimic humans. (Yes, that implies we’re already at the point where it can probably FOOM.) In general, the further the variations get from just directly simulating humans working on alignment basically the way we do now (but for longer), the more possibly-fatal failure modes show up. HCH is a central example here: for some reason a structure whose most obvious name is The Infinite Bureaucracy was originally suggested as an approximation of a Long Reflection. Look, guys, there is no way in hell that The Infinite Bureaucracy is even remotely a good approximation of a Long Reflection. Naming it “HCH” does not make it any less of an infinite bureaucracy, and yes it is going to fail in basically the same ways as real bureaucracies and for basically the same underlying reasons (except even worse, because it’s infinite).

… but the failure of variations does not necessarily mean that the basic idea is doomed. The basic idea seems basically-sound to me; the problem is implementing it in such a way that the output accurately mimics a real long reflection, while also making it happen before unfriendly AGI kills us all.

Personally, I’m still not working on that strategy, for a few main reasons:

  • I expect my current strategy to be more competitive. One big advantage of understanding agency in general is that we can apply that understanding to whatever ML/​AI progress comes along, even if it ends up looking very different from e.g. GPT-3.

  • The Simulated Long Reflection strategy gets more likely to work when we have people for it to mimic who are already far down the road to solving alignment. The further, the better.

  • On a gut level, I just don’t expect ML to emulate humans accurately enough for a Simulated Long Reflection to work until we’ve already passed doomsday. (This is probably the cruxiest issue.)

I am generally happy that other people are working on strategies in the Simulated Long Reflection family, and hope that such work continues.