Paul’s research agenda FAQ

I think Paul Christiano’s research agenda for the alignment of superintelligent AGIs presents one of the most exciting and promising approaches to AI safety. After being very confused about Paul’s agenda, chatting with others about similar confusions, and clarifying with Paul many times over, I’ve decided to write a FAQ addressing common confusions around his agenda.

This FAQ is not intended to provide an introduction to Paul’s agenda, nor is it intended to provide an airtight defense. This FAQ only aims to clarify commonly misunderstood aspects of the agenda. Unless otherwise stated, all views are my own views of Paul’s views. (ETA: Paul does not have major disagreements with anything expressed in this FAQ. There are many small points he might have expressed differently, but he endorses this as a reasonable representation of his views. This is in contrast with previous drafts of this FAQ, which did contain serious errors he asked to have corrected.)

For an introduction to Paul’s agenda, I’d recommend Ajeya Cotra’s summary. For good prior discussion of his agenda, I’d recommend Eliezer’s thoughts, Jessica Taylor’s thoughts (here and here), some posts and discussions on LessWrong, and Wei Dai’s comments on Paul’s blog. For most of Paul’s writings about his agenda, visit

0. Goals and non-goals

0.1: What is this agenda trying to accomplish?

Enable humans to build arbitrarily powerful AGI assistants that are competitive with unaligned AGI alternatives, and only try to help their operators (and in particular, never attempt to kill or manipulate them).

People often conceive of safe AGIs as silver bullets that will robustly solve every problem that humans care about. This agenda is not about building a silver bullet; it’s about building a tool that will safely and substantially assist its operators. For example, this agenda does not aim to create assistants that can do any of the following:

  • They can prevent nuclear wars from happening

  • They can prevent evil dictatorships

  • They can make centuries’ worth of philosophical progress

  • They can effectively negotiate with distant superintelligences

  • They can solve the value specification problem

On the other hand, to the extent that humans care about these things and could make them happen, this agenda lets us build AGI assistants that can substantially assist humans in achieving these things. For example, a team of 1,000 competent humans working together for 10 years could make substantial progress on preventing nuclear wars or solving metaphilosophy. Unfortunately, it’s slow and expensive to assemble a team like this, but an AGI assistant might enable us to reap similar benefits in far less time and at much lower cost.

(See Clarifying “AI Alignment” and Directions and desiderata for AI alignment.)

0.2: What are examples of ways in which you imagine these AGI assistants getting used?

Two countries end up in an AGI arms race. Both countries are aware of the existential threats that AGIs pose, but also don’t want to limit the power of their AIs. They build AGIs according to this agenda, which stay under the operators’ control. These AGIs then help the operators broker an international treaty, which ushers in an era of peace and stability. During this era, foundational AI safety problems (e.g. those in MIRI’s research agenda) are solved in earnest, and a provably safe recursively self-improving AI is built.

A more pessimistic scenario is that the countries wage war, and the side with the more powerful AGI achieves a decisive victory and establishes a world government. This scenario isn’t as good, but it at least leaves humans in control (instead of extinct).

The most pressing problem in AI strategy is how to stop an AGI race to the bottom from killing us all. Paul’s agenda aims to solve this specific aspect of the problem. That isn’t an existential win, but it does represent a substantial improvement over the status quo.

(See section “2. Competitive” in Directions and desiderata for AI alignment.)

0.3: But this might lead to a world dictatorship! Or a world run by philosophically incompetent humans who fail to capture most of the possible value in our universe! Or some other dystopia!

Sure, maybe. But that’s still better than a paperclip maximizer killing us all.

There is a social/​political/​philosophical question about how to get humans in a post-AGI world to claim a majority of our cosmic endowment (including, among other things, not establishing a tyrannical dictatorship under which intellectual progress halts). While technical AI safety does make progress on this question, it’s a broader question overall that invites fairly different angles of attack (e.g. policy interventions and social influence). And, while this question is extremely important, it is a separate question from how you can build arbitrarily powerful AGIs that stay under their operators’ control, which is the only question this agenda is trying to answer.

1. Alignment

1.1 How do we get alignment at all?

(“Alignment” is an imprecise term meaning “nice” / “not subversive” / “trying to actually help its operator”. See Clarifying “AI alignment” for Paul’s description.)

1.1.1: Isn’t it really hard to give an AI our values? Value learning is really hard, and the default is for it to encounter instrumental incentives to manipulate you or prevent itself from getting shut down.

The AI isn’t learning our values, it’s learning to optimize for our short-term approval—in other words, for each action it takes, it optimizes for something like what rating we’d give it on a scale from 1 to 5 if we just saw it act.
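As a toy illustration of what “optimizing for short-term approval” means (a minimal sketch, not Paul’s actual training setup): the agent scores each candidate action by the rating it predicts we would give, and takes the top-rated one. The function name, the candidate actions, and the hard-coded ratings are all hypothetical stand-ins for a learned approval model.

```python
from typing import Callable, Sequence


def choose_action(actions: Sequence[str],
                  predicted_approval: Callable[[str], float]) -> str:
    """Return the action with the highest predicted short-term approval.

    `predicted_approval` is a hypothetical stand-in for a learned model of
    the rating (1 to 5) the operator would give after seeing the action.
    """
    if not actions:
        raise ValueError("need at least one candidate action")
    return max(actions, key=predicted_approval)


# Hard-coded toy ratings; a real system would learn these from feedback.
TOY_RATINGS = {
    "ask the operator what they want": 4.5,
    "guess and proceed silently": 2.5,
    "manipulate the operator": 1.0,
}

best = choose_action(list(TOY_RATINGS), TOY_RATINGS.__getitem__)
print(best)
```

Note that nothing here encodes the operator’s values directly; the only signal is the predicted rating for each individual action.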

It’s hard to learn the nuances of human values. But from a machine learning perspective, it’s very easy to learn that humans would strongly disapprove of attempts to kill or manipulate us. Paul expresses this well on his blog:

You need only the vaguest understanding of humans to guess that killing the user is: (1) not something they would approve of, (2) not something they would do, (3) not in line with their instrumental preferences.

So in order to get bad outcomes here you have to really mess up your model of what humans want (or more likely mess up the underlying framework in an important way). If we imagine a landscape of possible interpretations of human preferences, there is a “right” interpretation that we are shooting for. But if you start with a wrong answer that is anywhere in the neighborhood, you will do things like “ask the user what to do, and don’t manipulate them.” And these behaviors will eventually get you where you want to go.

(See: Approval-directed agents and Act-based agents.)

1.1.2: OK, but doesn’t this only incentivize it to appear like it’s doing what the operator wants? Couldn’t it optimize for hijacking its reward signal, while seeming to act in ways that humans are happy with?

We’re not just training the agent to take good actions. We’re also training it to comprehensibly answer questions about why it took the actions it took, to arbitrary levels of detail. (Imagine a meticulous boss grilling an employee about a report he put together, or a tax auditor grilling a corporation about the minutiae of its expenses.) We ensure alignment by randomly performing thorough evaluations of its justifications for its actions, and punishing it severely if any of those justifications seem subversive. To the extent we trust these justifications to accurately reflect the agent’s cognition, we can trust the agent to not act subversively (and thus be aligned).

(See: The informed oversight problem.)

1.1.3: How do we train it to answer questions comprehensively?

Reward it for doing so, and punish it for failing to do so.

Imagine being a meticulous boss who asks his employee to put together a report. Imagine grilling him about the report, and punishing him every time he fails to answer your questions clearly enough or at a satisfactory level of detail, in addition to punishing him for low-quality reports. If your employee is capable enough, he’ll eventually learn to produce high-quality reports and answer questions satisfactorily when you grill him.

Note that the ability to generate comprehensible descriptions of arbitrary cognition is a major open problem in Paul’s agenda, and also a major problem in AI safety broadly. It’s the part of his agenda that Paul is currently focusing most on.

(See The informed oversight problem, Learning representations, and Approval-maximising representations.)

1.1.4: Why should we expect the agent’s answers to correspond to its cognition at all?

We don’t actually have any guarantees that it does, but giving honest answers is probably the easiest way for the agent to maximize its reward. The only feasible alternative is giving thorough dishonest explanations for its actions. But if an employee lies about how he arrived at his conclusions, his explanations won’t check out, and he might get punished severely. And if a corporation lies about how it spent its money, the numbers won’t add up, and its auditor might punish it severely.

It’s critically important for this scheme that the overseer is capable of evaluating the assistant’s justifications extremely thoroughly, and does so often enough. Corporations cook their books in proportion to how thorough their auditors are, how common audits are, and how bad it is to get caught by their auditors. If we audit thoroughly enough and frequently enough, and punish our assistants severely enough for subversion, we should expect them to answer honestly.

1.1.5: Wouldn’t this incentivize the assistant to produce justifications that seem fine and innocent to you, but may actually be harmful?

We should expect the AI not to adversarially manipulate us—by assumption, we’re evaluating its justifications thoroughly enough that we should be able to catch any subversive cognition.

It’s possible that there are free variables in what sorts of cognition the operator deems acceptable, and that a non-adversarial optimization process might be able to persuade the operator of essentially arbitrary conclusions by manipulating these free variables in ways the operator approves of. For example, an AGI assistant might accidentally persuade you to become an ISIS suicide bomber, while only thinking in ways that you approve of.

I do think this is a potentially severe problem. But I don’t consider it a dealbreaker, for a number of reasons:

  • An AGI assistant “accidentally” manipulating you is no different from a very smart and capable human assistant who, in the process of assisting you, causes you to believe drastic and surprising conclusions. Even if this might lead to bad outcomes, Paul isn’t aiming for his agenda to prevent this class of bad outcomes.

  • The more rational you are, the smaller the space of conclusions you can be non-adversarially led into believing. (For example, it’s very hard for me to imagine myself getting persuaded into becoming an ISIS suicide bomber by a process whose cognition I approve of.) It might be that some humans have passed a rationality threshold, such that they only end up believing correct conclusions after thinking for a long time without adversarial pressures.

1.2 Amplifying and distilling alignment

1.2.1: OK, you propose that to amplify some aligned agent, you just run it for a lot longer, or run way more of them and have them work together. I can buy that our initial agent is aligned; why should I trust their aggregate to be aligned?

When aligned agents work together, there’s often emergent behavior that can be described as non-aligned. For example, if the operator is pursuing a goal (like increasing YouTube’s revenue), one group of agents proposes a subgoal (like increasing YouTube views), and another group competently pursues that subgoal without understanding how it relates to the top-level goal (e.g. by triple-counting all the views), you end up with misaligned optimization. As another example, there might be some input (e.g. some weirdly compelling argument) that causes some group of aligned agents to “go insane” and behave unpredictably, or optimize for something against the operator’s wishes.

Two approaches that Paul considers important for preserving alignment:

  • Reliability amplification: aggregating agents that each answer a question correctly only some of the time (say, 80% of the time) in such a way that the aggregate answers questions correctly with arbitrarily high probability.

  • Security amplification: winnowing down the set of queries that, when fed to the aggregate, cause the aggregate to “go insane”.
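To get a feel for why aggregation can push reliability arbitrarily high, here is a minimal sketch using plain majority voting. This is my own illustration, not necessarily the scheme Paul has in mind, and it assumes the agents’ errors are independent, which is a strong assumption that real agents may not satisfy.

```python
from math import comb


def majority_correct_probability(n_agents: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent agents is correct.

    Assumes each agent is right independently with probability p_correct
    and that n_agents is odd, so there are no ties.
    """
    if n_agents < 1 or n_agents % 2 == 0:
        raise ValueError("n_agents must be a positive odd number")
    needed = n_agents // 2 + 1
    return sum(
        comb(n_agents, k) * p_correct**k * (1 - p_correct) ** (n_agents - k)
        for k in range(needed, n_agents + 1)
    )


# Reliability of the aggregate as the number of 80%-reliable agents grows.
for n in (1, 5, 21, 101):
    print(n, round(majority_correct_probability(n, 0.8), 6))
```

If the agents’ mistakes are correlated (for example, they all fall for the same bad argument), voting helps much less, which is part of why security amplification is treated as a separate problem.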

It remains an open question in Paul’s agenda how alignment can be robustly preserved through capability amplification—in other words, how to increase the capabilities of aligned agents without introducing misaligned behavior.

(See: Capability amplification, Reliability amplification, Security amplification, Universality and security amplification, and Two guarantees.)

1.2.2: OK, so given this amplified aligned agent, how do you get the distilled agent?

Train a new agent via some combination of imitation learning (predicting the actions of the amplified aligned agent), semi-supervised reinforcement learning (where the amplified aligned agent helps specify the reward), and techniques for optimizing robustness (e.g. creating red teams that generate scenarios that incentivize subversion).

(See: RL+Imitation, Benign model-free RL, Semi-supervised reinforcement learning, and Techniques for optimizing worst-case performance.)

1.2.3: It seems like imitation learning might cause a lot of minutiae to get lost, and would create something that’s “mostly aligned” but actually not aligned in a bunch of subtle ways. Maybe this is tolerable for one round of iteration, but after 100 rounds, I wouldn’t feel very good about the alignment of the resulting agent...

Indeed, which is why this new agent is also trained with semi-supervised reinforcement learning, and optimized for robustness. The imitation learning is more about getting this new agent off the ground than about ensuring alignment. The bulk of the alignment guarantee comes from the semi-supervised reinforcement learning, where we train it to work on a wide range of tasks and answer questions about its cognition. We trust it to be aligned because we train it the same way we trained our original agent to be aligned—namely, by rewarding it for maximizing our approval, and punishing it for thinking subversive thoughts.

(There’s a common misconception that Paul’s agenda proceeds entirely by taking a single aligned assistant, running it for a lot longer, using imitation learning to distill that into a new assistant, running that new assistant a lot longer, using imitation learning to distill that, etc., much like how AlphaGo Zero improves its policy network. While this does capture a key idea behind how the assistant’s capabilities get amplified, it omits critical components of his agenda that ensure alignment, like inspecting cognition and amplifying robustness.)

1.2.4: This distilled agent’s cognition might be much more complex than that of our previous agent. How can we reasonably determine whether we’d approve of its cognition?

This is where our assistants come in. While the human operator does inspect some aspects of the distilled agent’s cognition, most of the inspection is done by our assistants, whose judgments we trust because we trust that they’re aligned.

(See The reward engineering problem.)

1.2.5: Given that this distilled agent is way more powerful than you and your aligned agents, couldn’t it manipulate you and your team of aligned agents?

If we’re amplifying and distilling agent A, we’re actually training our newly distilled agent to be stronger than A, but much weaker than the amplified version of A. The amplification/​distillation procedure can be thought of as repeatedly taking two steps forward (in amplifying A) and one step back (in distilling a new agent).
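The “two steps forward, one step back” picture can be sketched as a toy loop. Treating capability as a single number is a drastic simplification, and the gain and loss values are made-up parameters chosen only to match the metaphor.

```python
from typing import List, Tuple


def iterate_amplify_distill(capability: float, rounds: int,
                            amplify_gain: float = 2.0,
                            distill_loss: float = 1.0) -> List[Tuple[float, float]]:
    """Return (amplified, distilled) capability for each round.

    Hypothetical model: amplification adds `amplify_gain`, and distillation
    gives back `distill_loss` of it, so each round nets a smaller gain.
    """
    if not 0.0 <= distill_loss < amplify_gain:
        raise ValueError("distillation must lose less than amplification gains")
    history = []
    for _ in range(rounds):
        amplified = capability + amplify_gain
        capability = amplified - distill_loss
        history.append((amplified, capability))
    return history


for amplified, distilled in iterate_amplify_distill(0.0, rounds=3):
    print(f"amplified={amplified}, distilled={distilled}")
```

In every round the distilled agent ends up stronger than the agent it started from but weaker than the amplified overseer, which is the property the oversight argument relies on.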

Any subversive plan the distilled agent makes could get recognized as subversive by the amplified agent, because the amplified agent has enough power to thoroughly evaluate every aspect of the cognition that went into the plan. The amplified agent rarely performs thorough evaluations of the distilled agent’s cognition, because it’s extremely expensive, but does so randomly and punishes the distilled agent extremely harshly if it detects subversion. The distilled agent, knowing that it will have to thoroughly justify its actions to a competent supervisor, is thus strongly incentivized against acting subversively.

In other words, the amplified agent randomly “audits” the distilled agent, and punishes the distilled agent very harshly if it fails the audit. Though the distilled agent knows that it might be able to deceive its supervisor when it isn’t audited, it’s so scared of the outcome where it tries to do that and gets audited that it doesn’t even want to try. (Even if you were 99% confident that you could get away with tax evasion, you wouldn’t want to try if you knew the government tortures and murders the families of the tax evaders they catch.)
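The incentive argument is an expected-value calculation, which can be sketched as follows. This assumes a risk-neutral agent, an audit that always detects subversion, and made-up reward units; none of these assumptions come from Paul’s writing.

```python
def expected_payoff_of_subversion(audit_probability: float,
                                  gain_if_undetected: float,
                                  penalty_if_caught: float) -> float:
    """Expected payoff of one subversive act under random audits.

    Assumes an audit always detects subversion; all quantities are
    hypothetical units of reward, relative to behaving honestly.
    """
    if not 0.0 <= audit_probability <= 1.0:
        raise ValueError("audit_probability must be in [0, 1]")
    return ((1 - audit_probability) * gain_if_undetected
            - audit_probability * penalty_if_caught)


def min_deterrent_penalty(audit_probability: float,
                          gain_if_undetected: float) -> float:
    """Smallest penalty at which subversion stops paying in expectation."""
    if not 0.0 < audit_probability <= 1.0:
        raise ValueError("audit_probability must be in (0, 1]")
    return (1 - audit_probability) * gain_if_undetected / audit_probability


# A 1% audit rate, as in the tax evasion example: 99% chance of getting away.
print(expected_payoff_of_subversion(0.01, 1.0, 1000.0))
print(min_deterrent_penalty(0.01, 1.0))
```

The rarer the audits, the harsher the punishment has to be to keep the expected payoff of subversion negative.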

(See Benign model-free RL, the comments in The informed oversight problem, and William Saunders’ post Improbable Oversight, An Attempt at Informed Oversight.)

1.3 Robust alignment / corrigibility

1.3.1: If you delegate oversight to a well-meaning assistant (“hey, can you make sure things are going as planned?”), who delegates oversight to a well-meaning assistant, who delegates oversight to a well-meaning assistant, etc., then the default is for the delegatee 100 layers down to end up with some extremely distorted version of what you care about. Shouldn’t we expect this distortion to happen with the agents we produce?

Yes, which is why alignment isn’t the property we’re ultimately aiming to inductively guarantee. The property we’re trying to guarantee is something closer to “alignment + extreme caution about whether it’s aligned and cautious enough”. Paul usually refers to this as corrigibility.

This way, every time we’re training a distilled agent, we train it to want to clarify with its overseer (i.e., us assisted by a team of corrigible assistants) whenever it’s uncertain about what we would approve of. Our corrigible assistants either answer the question confidently, or clarify with us if they’re uncertain about their answer.
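The clarify-when-uncertain pattern can be sketched as an escalation chain. The `Overseer` class, the confidence threshold, and the toy judges are hypothetical; in particular, I’m assuming each overseer can report a meaningful confidence, which is itself a hard problem.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Overseer:
    """A hypothetical overseer whose judge returns (answer, confidence)."""
    name: str
    judge: Callable[[str], Tuple[str, float]]


def resolve(question: str, chain: List[Overseer],
            threshold: float = 0.9) -> Tuple[str, str]:
    """Return (who answered, answer), escalating while confidence is low.

    The last overseer in the chain (the human) always gives the final word.
    """
    if not chain:
        raise ValueError("chain must contain at least one overseer")
    for overseer in chain[:-1]:
        answer, confidence = overseer.judge(question)
        if confidence >= threshold:
            return overseer.name, answer
    answer, _ = chain[-1].judge(question)
    return chain[-1].name, answer


def assistant_judge(question: str) -> Tuple[str, float]:
    # Toy assistant: confident about one routine question, unsure otherwise.
    known = {"delete temp files?": ("yes", 0.95)}
    return known.get(question, ("unsure", 0.4))


def human_judge(question: str) -> Tuple[str, float]:
    return ("yes", 1.0) if question == "delete temp files?" else ("no", 1.0)


chain = [Overseer("assistant", assistant_judge), Overseer("human", human_judge)]
print(resolve("delete temp files?", chain))
print(resolve("delete the backups?", chain))
```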

1.3.2: OK, but doesn’t this just pass the buck onto corrigibility? In the same way that alignment can get distorted, couldn’t corrigibility also get distorted?

Paul has two strong intuitions about corrigibility:

  1. corrigibility has a simple core that should be easy for humans to impart into powerful AI systems

  2. corrigibility is robust—agents that are somewhat corrigible want to become more corrigible.

On this view, even if human approval is high-complexity and hard to get right, corrigibility is straightforward to learn and most likely to improve over distillation rather than worsen.

You also need robustness amplification to ensure that corrigibility is maintained. For example, a corrigible supervisor needs to be able to thoroughly evaluate a distilled agent’s cognition, but if its evaluations aren’t reliable, it can’t be trusted to ensure that the distilled agent is corrigible (and not a daemon). And a corrigible agent that “goes insane” doesn’t stay corrigible.

(See: Corrigibility.)

1.3.3: I don’t share those intuitions around corrigibility. Do you have any intuition pumps?

One intuition pump: corrigibility can be thought of as extreme caution about whether you’re actually being helpful, and extreme caution is robust—if you’re extremely cautious about how things can go wrong, you want to know more ways things can go wrong and you want to improve your ability to spot how things are going wrong, which will lead you to become more cautious.

Another intuition pump: I have some intuitive concept of “epistemically corrigible humans”. Some things that gesture at this concept:

  • They care deeply about finding the truth, and improving their skill at finding the truth.

  • They’re aware that they’re flawed reasoners, with biases and blind spots, and actively seek out ways to notice and remove these flaws. They try to take ideas seriously, no matter how weird they seem.

  • Their beliefs tend to become more true over time.

  • Their skill at having true beliefs improves over time.

  • They tend to reach similar conclusions in the limit (namely, the correct ones), even if they’re extremely weird and not broadly accepted.

I think of corrigible assistants as being corrigible in the above way, except optimizing for helping their operators instead of finding the truth. Importantly, so long as an agent crosses some threshold of corrigibility, it will want to become more and more cautious about whether it’s helpful, which is where robustness comes from.

Given that corrigibility seems like a property that any reasoner could have (and not just humans), it’s probably not too complicated a concept for a powerful AI system to learn, especially given that many humans seem able to learn some version of it.

1.3.4: This corrigibility thing still seems really fishy. It feels like you just gave some clever arguments about something very fuzzy and handwavy, and I don’t feel comfortable trusting that.

While Paul thinks there’s a good intuitive case for something like corrigibility, he also considers getting a deeper conceptual understanding of corrigibility one of the most important research directions for his agenda. He agrees it’s possible that corrigibility may not be safely learnable, or not actually robust, in which case he’d feel way more pessimistic about his entire agenda.

2. Usefulness

2.1. Can the system be both safe and useful?

2.1.1: A lot of my values and knowledge are implicit. Why should I trust my assistant to be able to learn my values well enough to assist me?

Imagine a question-answering system trained on all the data on Wikipedia, that ends up with comprehensive, gears-level world-models, which it can use to synthesize existing information to answer novel questions about social interactions or what our physical world is like. (Think Wolfram|Alpha, but much better.)

This system is something like a proto-AGI. We can easily restrict it (for example by limiting how long it gets to reflect when it answers questions) so that we can train it to be corrigible while trusting that it’s too limited to do anything dangerous that the overseer couldn’t recognize as dangerous. We use such a restricted system to start off the iterated distillation and amplification process, and bootstrap it to get systems of arbitrarily high capabilities.

(See: Automated assistants)

2.1.2: OK, sure, but it’ll essentially still be an alien and get lots of minutiae about our values wrong.

How bad is it really if it gets minutiae wrong, as long as it doesn’t cause major catastrophes? Major catastrophes (like nuclear wars) are pretty obvious, and we would obviously disapprove of actions that lead us to catastrophe. So long as it learns to avoid those (which it will, if we give it the right training data), we’re fine.

Also keep in mind that we’re training it to be corrigible, which means it’ll be very cautious about what sorts of things we’d consider catastrophic, and try very hard to avoid them.

2.1.3: But it might make lots of subtle mistakes that add up to something catastrophic!

And so might we. Maybe there are some classes of subtle mistakes the AI will be more prone to than we are, but there are probably also classes of subtle mistakes we’ll be more prone to than the AI. We’re only shooting for our assistant to avoid trying to lead us to a catastrophic outcome.

(See: Techniques for optimizing worst-case performance.)

2.1.4: I’m really not sold that training it to avoid catastrophes and training it to be corrigible will be good enough.

This is actually more a capabilities question (is our system good enough at trying very hard to avoid catastrophes to actually avoid a catastrophe?) than an alignment question. A major open question in Paul’s agenda is how we can formalize performance guarantees well enough to state actual worst-case guarantees.

(See: Two guarantees and Techniques for optimizing worst-case performance)

2.2. Universality

2.2.1. What sorts of cognition will our assistants be able to perform?

We should roughly expect them to think in ways that would be approved by an HCH (short for “humans consulting HCH”). To describe HCHs, let me start by describing a weak HCH:

Consider a human Hugh who has access to a question-answering machine. Suppose the machine answers question Q by perfectly imitating how Hugh would answer question Q, if Hugh had access to the question-answering machine.

That is, Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh…

I sometimes picture this as an infinite tree of humans-in-boxes, who can break down questions and pass them to other humans-in-boxes (who can break down those questions and pass them along to other humans-in-boxes, etc.) and get back answers instantaneously. A few remarks:

  • This formalism tries to capture some notion of “what would H think about some topic if H thought about it for arbitrarily long amounts of time”? For example, H might make partial progress on some question, and then share this progress with some other H and ask it to make more progress, who might do the same.

  • A weak HCH could simulate the cognitive labor of an economy the size of the US economy. After all, a weak HCH can emulate a single human thinking for a long time, so it can emulate teams of humans thinking for a long time, and thus teams of teams of humans thinking for a long time, etc. If you imagine a corporation as teams of teams of teams of humans performing cognitive labor, you get that a weak HCH can emulate the output of an arbitrary corporation, and thus collections of arbitrary corporations communicating with one another.

  • Many tasks that don’t intuitively seem like they can be broken down, can in fact be fairly substantially broken down. For example, making progress on difficult math problems seems difficult to break down. But you could break down progress on a math problem into something like (think for a while about possible angles of attack) + (try each angle of attack, and recurse on the new math problem). And (think for a while about possible angles of attack) can be reduced into (look at features of this problem and see if you’ve solved anything similar), which can be reduced into focusing on specific features, and so on.
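The recursive structure of a weak HCH can be sketched with a toy task. Here “Hugh” can only answer trivially small questions himself and otherwise splits the question in two and consults copies of himself. Summing a list is a stand-in I chose for illustration; the real formalism is an idealized tree of humans, not a terminating program.

```python
from typing import Callable, List


def hugh(numbers: List[int], consult: Callable[[List[int]], int]) -> int:
    """One 'human in a box': handles tiny inputs directly, delegates the rest.

    `consult` is the question-answering machine, i.e. another copy of Hugh.
    """
    if len(numbers) == 0:
        return 0
    if len(numbers) == 1:
        return numbers[0]
    mid = len(numbers) // 2
    # Break the question into two smaller questions and combine the answers.
    return consult(numbers[:mid]) + consult(numbers[mid:])


def weak_hch(numbers: List[int]) -> int:
    """Answer by imitating Hugh, who may consult further copies of weak_hch."""
    return hugh(numbers, consult=weak_hch)


print(weak_hch(list(range(1, 101))))  # prints 5050
```

No single copy of Hugh ever adds more than two numbers, yet the tree as a whole answers a question none of them could answer alone in one step.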

Strong HCH, or just HCH, is a variant of weak HCHs where the agents-in-boxes are able to communicate with each other directly, and read and write to some shared external memory, in addition to being able to ask, answer, and break down questions. Note that they would be able to implement arbitrary Turing machines this way, and thus avoid any limits on cognition imposed by the structure of weak HCH.

(Note: most people think “HCH” refers to “weak HCH”, but whenever Paul mentions HCHs, he now refers to strong HCHs.)

The exact relationship between HCH and the agents produced through iterated amplification and distillation is confusing and very commonly misunderstood:

  • HCHs should not be visualized as having humans in the box. They should be thought of as having some corrigible assistant inside the box, much like the question-answering system described in 2.1.1.

  • Throughout the iterated amplification and distillation process, there is never any agent whose cognition resembles an HCH of the corrigible assistant. In particular, agents produced via distillation are general RL agents with no HCH-like constraints on their cognition. The closest resemblance to HCH appears during amplification, during which a superagent (formed out of copies of the agent getting amplified) performs tasks by breaking them down and distributing them among the agent copies.

(As of the time of this writing, I am still confused about the sense in which the agent’s cognition is approved by an HCH, and what that means about the agent’s capabilities.)

(See: Humans consulting HCH and Strong HCH.)

2.2.2. Why should I think the HCH of some simple question-answering AI assistant can perform arbitrarily complex cognition?

All difficult and creative insights stem from chains of smaller and easier insights. So long as our first AI assistant is a universal reasoner (i.e., it can implement arbitrary Turing machines via reflection), it should be able to realize arbitrarily complex things if it reflects for long enough. For illustration, Paul thinks that chimps aren’t universal reasoners, and that most humans past some intelligence threshold are universal.

If this seems counterintuitive, I’d claim it’s because we have poor intuitions around what’s achievable with 2,000,000,000 years of reflection. For example, it might seem that an IQ 120 person, knowing no math beyond arithmetic, would simply be unable to prove Fermat’s last theorem given arbitrary amounts of time. But if you buy that:

  • An IQ 180 person could, in 2,000 years, prove Fermat’s last theorem knowing nothing but arithmetic (which seems feasible, given that most mathematical progress was made by people with IQs under 180)

  • An IQ 160 person could, in 100 years, make the intellectual progress an IQ 180 person could in 1 year

  • An IQ 140 person could, in 100 years, make the intellectual progress an IQ 160 person could in 1 year

  • An IQ 120 person could, in 100 years, make the intellectual progress an IQ 140 person could in 1 year

Then it follows that an IQ 120 person could prove Fermat’s last theorem in 2,000*100*100*100 = 2,000,000,000 years’ worth of reflection.
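Spelling out the arithmetic, with variable names of my own choosing: there are three 20-point steps down from IQ 180 to IQ 120, and each step is assumed to cost a factor of 100 in time.

```python
years_for_iq_180 = 2_000
slowdown_per_20_iq_points = 100
steps_down = 3  # 180 -> 160 -> 140 -> 120

years_for_iq_120 = years_for_iq_180 * slowdown_per_20_iq_points ** steps_down
print(f"{years_for_iq_120:,}")  # prints 2,000,000,000
```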

(See: Of humans and universality thresholds.)

2.2.3. Different reasoners can reason in very different ways and reach very different conclusions. Why should I expect my amplified assistant to reason anything like me, or reach conclusions that I’d have reached?

You shouldn’t expect it to reason anything like you, you shouldn’t expect it to reach the conclusions you’d reach, and you shouldn’t expect it to realize everything you’d consider obvious (just like you wouldn’t realize everything it would consider obvious). You should expect it to reason in ways you approve of, which should constrain its reasoning to be sensible and competent, as far as you can tell.

The goal isn’t to have an assistant that can think like you or realize everything you’d realize. The goal is to have an assistant who can think in ways that you consider safe and substantially helpful.

2.2.4. HCH seems to depend critically on being able to break down arbitrary tasks into subtasks. I don’t understand how you can break down tasks that are largely intuitive or perceptual, like playing Go very well, or recognizing images.

Go is actually fairly straightforward: an HCH can just perform an exponential tree search. Iterated amplification and distillation applied to Go is not actually that different from how AlphaZero trains to play Go.
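To make “just perform a tree search” concrete, here is a minimal sketch on a toy game rather than Go, whose tree is far too large to search exhaustively in practice. The game (remove 1, 2, or 3 stones; whoever takes the last stone wins) is my own choice of example. Each recursive call corresponds to a subquestion that could be handed to another copy in the HCH tree.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def current_player_wins(stones: int) -> bool:
    """Exhaustively search the game tree of a toy subtraction game.

    Players alternate removing 1, 2, or 3 stones; whoever takes the last
    stone wins. Each recursive call is a subquestion: "does the opponent
    win from the resulting position?"
    """
    return any(
        not current_player_wins(stones - take)
        for take in (1, 2, 3)
        if take <= stones
    )


print([n for n in range(13) if not current_player_wins(n)])
```

The cache is only an optimization; without it the same function performs the exponential search directly.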

Image recognition is harder, but to the extent that humans have clear concepts of visual features they can reference within images, the HCH should be able to focus on those features. The cat vs. dog debate in Geoffrey Irving’s approach to AI safety via debate gives some illustration of this.

Things get particularly tricky when humans are faced with a task they have little explicit knowledge about, like translating sentences between languages. Paul did mention something like “at some point, you’ll probably just have to stick with relying on some brute statistical regularity, and just use the heuristic that X commonly leads to Y, without being able to break it down further”.

(See: Wei Dai’s comment on Can Corrigibility be Learned Safely, and Paul’s response to a different comment by Wei Dai on the topic.)

2.2.5: What about tasks that require significant accumulation of knowledge? For example, how would the HCH of a human who doesn’t know calculus figure out how to build a rocket?

This sounds difficult for weak HCHs on their own to overcome, but possible for strong HCHs to overcome. The accumulated knowledge would be represented in the strong HCH’s shared external memory, and the humans essentially act as “workers” implementing a higher-level cognitive system, much like ants in an ant colony. (I’m still somewhat confused about what the details of this would entail, and am interested in seeing a more fleshed out implementation.)

2.2.6: It seems like this capacity to break tasks into subtasks is pretty subtle. How does the AI learn to do this? And how do we find human operators (besides Paul) who are capable of doing this?

Ought is gathering empirical data about task decomposition. If that proves successful, Ought will have numerous publicly available examples of humans breaking down tasks.

3. State of the agenda

3.1: What are the current major open problems in Paul’s agenda?

The most important open problems in Paul’s agenda, according to Paul:

  • Worst-case guarantees: How can we make worst-case guarantees about the reliability and security of our assistants? For example, how can we ensure our oversight is reliable enough to prevent the creation of subversive subagents (a.k.a. daemons) in the distillation process that cause our overall agent to be subversive?

  • Transparent cognition: How can we extract useful information from ML systems’ cognition? (E.g. what concepts are represented in them, what logical facts are embedded in them, and what statistical regularities about the data they capture.)

  • Formalizing corrigibility: Can we formalize corrigibility to the point that we can create agents that are knowably robustly corrigible? For example, could we formalize corrigibility, use that formalization to prove the existence of a broad basin of corrigibility, and then prove that ML systems past some low threshold will land and stay in this basin?

  • Aligned capability amplification: Can we perform amplification in a way that doesn’t introduce alignment failures? In particular, can we safely decompose every task we care about without effectively implementing an aligned AGI built out of human transistors?

(See: Two guarantees, The informed oversight problem, Corrigibility, and the “Low Bandwidth Overseer” section of William Saunders’ post Understanding Iterated Distillation and Amplification: Claims and Oversight.)

3.2: How close to completion is Paul’s research agenda?

Not very close. For all we know, these problems might be extraordinarily difficult. For example, a subproblem of “transparent cognition” is “how can humans understand what goes on inside neural nets”, which is a broad open question in ML. Subproblems of “worst-case guarantees” include ensuring that ML systems are robust to distributional shift and adversarial inputs, which are also broad open questions in ML, and which might require substantial progress on MIRI-style research to articulate and prove formal bounds. And getting a formalization of corrigibility might require formalizing aspects of good reasoning (like calibration about uncertainty), which might in turn require substantial progress on MIRI-style research.

I think people commonly conflate “Paul has a safety agenda he feels optimistic about” with “Paul thinks he has a solution to AI alignment”. Paul in fact feels optimistic about these problems getting solved well enough for his agenda to work, but does not consider his research agenda anything close to complete.

(See: Universality and security amplification, search “MIRI”)

Thanks to Paul Christiano, Ryan Carey, David Krueger, Rohin Shah, Eric Rogstad, and Eli Tyre for helpful suggestions and feedback.