On motivations for MIRI’s highly reliable agent design research

(this post came out of a conversation between me and Owen Cotton-Barratt, plus a follow-up conversation with Nate)

I want to clarify my understanding of some of the motivations for MIRI’s highly reliable agent design (HRAD) research (e.g. logical uncertainty, decision theory, multi-level models).

Top-level vs. subsystem reasoning

I’ll distinguish between an AI system’s top-level reasoning and its subsystem reasoning. Top-level reasoning is reasoning the system does in ways its designers understand (e.g. using well-understood algorithms); subsystem reasoning is reasoning that emerges from the top-level reasoning (e.g. inside learned models) and that the designers, by default, don’t understand at an algorithmic level.

Here are a few examples:

AlphaGo

Top-level reasoning: Monte Carlo tree search (MCTS), self-play, gradient descent, …

Subsystem reasoning: whatever reasoning the policy network is doing, which might involve some sort of “prediction of consequences of moves”

Deep Q learning

Top-level reasoning: the Q-learning algorithm, gradient descent, random exploration, …

Subsystem reasoning: whatever reasoning the Q network is doing, which might involve some sort of “prediction of future score”
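
To make the top-level/subsystem split concrete for this example, here is a minimal sketch of the same shape of system, with a linear value function over a made-up 5-state toy environment standing in for the deep Q network (the environment, constants, and names are all illustrative):

```python
import numpy as np

# Minimal sketch: semi-gradient Q-learning with a linear value function on a toy
# 5-state chain (a stand-in for a deep Q network).

N_STATES, N_ACTIONS = 5, 2            # actions: 0 = left, 1 = right
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1
rng = np.random.default_rng(0)

def features(s, a):
    """One-hot features over (state, action) pairs; the learned 'Q network' is linear here."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

w = np.zeros(N_STATES * N_ACTIONS)    # parameters of the learned value function

def q(s, a):
    return w @ features(s, a)

def step(s, a):
    """Toy dynamics: move along the chain; reward for pushing right at the right end."""
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    r = 1.0 if (a == 1 and s == N_STATES - 1) else 0.0
    return s2, r

# Top-level reasoning -- the parts the designers understand at an algorithmic level:
# random (epsilon-greedy) exploration, the Q-learning target, the gradient step.
s = 0
for t in range(5000):
    if rng.random() < EPSILON:
        a = int(rng.integers(N_ACTIONS))                         # random exploration
    else:
        a = int(np.argmax([q(s, b) for b in range(N_ACTIONS)]))  # act greedily
    s2, r = step(s, a)
    target = r + GAMMA * max(q(s2, b) for b in range(N_ACTIONS)) # Q-learning target
    w += ALPHA * (target - q(s, a)) * features(s, a)             # (semi-)gradient step
    s = s2

# Subsystem reasoning -- whatever the learned function q ends up computing internally.
# With a deep network in place of `features`, the designers know it was *trained* to
# predict future score, but not *how* it does so.
print([round(float(q(s, 1)), 2) for s in range(N_STATES)])
```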

Solomonoff induction

Top-level reasoning: selecting (Cartesian) hypotheses by seeing which make the best predictions

Subsystem reasoning: the reasoning of the consequentialist reasoners who come to dominate Solomonoff induction, who will use something like naturalized induction and updateless decision theory
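
For reference, the top-level rule here has a standard formulation (the Solomonoff semimeasure over a universal prefix machine $U$; this is the textbook definition, nothing MIRI-specific): each hypothesis is a program $p$, weighted by its length, and prediction is done by conditioning on the data seen so far:

$$M(x) \;=\; \sum_{p \,:\, U(p)\ \text{outputs a string beginning with}\ x} 2^{-|p|}, \qquad M(x_{t+1} \mid x_{1:t}) \;=\; \frac{M(x_{1:t}\,x_{t+1})}{M(x_{1:t})}.$$

The subsystem concern is about what the programs $p$ that come to dominate this sum are doing internally, which the formula itself says nothing about.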

Genetic selection

It is possible to imagine a system that learns to play video games by finding (encodings of) policies that get high scores on training games, and combining encodings of policies that do well to produce new policies.

Top-level reasoning: genetic selection

Subsystem reasoning: whatever reasoning the policies are doing (which is something like “predicting the consequences of different actions”)

In both the Solomonoff induction case and the genetic selection case, if the algorithm is run with enough computation, the subsystem reasoning is likely to overwhelm the top-level reasoning (i.e. the system running Solomonoff induction or genetic selection will eventually come to be dominated by opaque consequentialist reasoners).
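
To make the genetic selection example concrete, here is a toy sketch in which the fitness evaluation, selection, crossover, and mutation are the designer-understood top level, and whatever strategy the evolved encodings implement is the subsystem; the genome format, fitness function, and constants are invented purely for illustration:

```python
import random

# Toy genetic selection over policy encodings. Everything here -- the bit-string
# genome, the fitness function, the constants -- is invented for illustration.

GENOME_LEN, POP_SIZE, GENERATIONS = 64, 50, 200

def score(genome):
    """Stand-in for 'run the encoded policy on the training games and return its score'."""
    return sum(genome)  # trivial placeholder fitness

def crossover(a, b):
    """Combine two high-scoring encodings to produce a new candidate policy."""
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.01):
    return [bit ^ 1 if random.random() < rate else bit for bit in genome]

# Top-level reasoning: score, select, and recombine policy encodings.
# These steps are fully understood by the designer.
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=score, reverse=True)
    parents = population[: POP_SIZE // 2]                 # keep the highest-scoring encodings
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]  # combine them into new policies
    population = parents + children

# Subsystem reasoning: whatever the surviving encodings compute when executed as policies
# (in a real video-game version, something like "predicting the consequences of actions").
print(max(score(g) for g in population))
```

In this toy version the evolved “policies” are trivial; the concern above is about what they end up computing once the encoding is expressive and the compute budget is large.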

Good consequentialist reasoning

Humans are capable of good consequentialist reasoning (at least in comparison to current AI systems). Humans can:

  • make medium-term predictions about complex systems containing other humans

  • make plans that take months or years to execute

  • learn and optimize proxies for long-term success (e.g. learning skills, gaining money)

  • reason about how to write a computer program

and so on. Current AI systems are not capable of good consequentialist reasoning. Superintelligent AGI systems would be capable of good consequentialist reasoning (though superintelligent narrow AI systems might not in full generality).

The concern

Using these concepts, MIRI’s main concern motivating HRAD research can be stated as something like:

  1. The first AI systems capable of pivotal acts will use good consequentialist reasoning.

  2. The default AI development path will not produce good consequentialist reasoning at the top level.

  3. Therefore, on the default AI development path, the first AI systems capable of pivotal acts will have good consequentialist subsystem reasoning but not good consequentialist top-level reasoning.

  4. Consequentialist subsystem reasoning will likely come “packaged with a random goal” in some sense, and this goal will not be aligned with human interests.

  5. Therefore, the default AI development path will produce, as the first AI systems capable of pivotal acts, AI systems with goals not aligned with human interests, causing catastrophe.

Note that, even if the AI system is doing good consequentialist reasoning at the top level rather than in subsystems, this top-level reasoning must still be directed towards the correct objective for the system to be aligned. So HRAD research does not address the entire AI alignment problem.

Possible paths

Given this concern, a number of possible paths to aligned AI emerge:

Limited/tool AI

One might reject premise 1 and attempt to accomplish pivotal acts using AI systems that do not use good consequentialist reasoning. Roughly, the proposal is to have humans do the good consequentialist reasoning, and to use AI systems as tools.

The main concern with this proposal is that a system of humans and limited AIs might be much less effective (for a given level of computing resources) than an AI system capable of good consequentialist reasoning. In particular, (a) a limited AI might require a lot of human labor to do the good consequentialist reasoning, and (b) human consequentialist reasoning is likely to be less effective than superintelligent AI consequentialist reasoning.

The main hope, despite this concern, is that either “general consequentialist reasoning” is not particularly important for the kinds of tasks people will want to use AI systems for (including pivotal acts), or that some sort of global coordination will make the efficiency disadvantage less relevant.

Example research topics:

Hope that top-level reasoning stays dominant on the default AI development path

Currently, it seems like most AI systems’ consequentialist reasoning is explainable in terms of top-level algorithms. For example, AlphaGo’s performance is mostly explained by MCTS and the way it’s trained through self-play. The subsystem reasoning is subsumed by the top-level reasoning and does not overwhelm it.

One could hope that the algorithms developed by default in the future (e.g. model-based reinforcement learning) will continue to be powerful enough that the top-level consequentialist reasoning remains more powerful than any subsystem consequentialist reasoning.

The biggest indication that this might not happen by default is that we currently don’t have an in-principle theory for good reasoning (e.g. we’re currently confused about logical uncertainty and multi-level models), and it doesn’t look like these theories will be developed without a concerted effort. Usually, theory lags behind common practice.

Despite this, a possible reason for hope is that AI researchers may be able to develop enough tacit understanding of these theories for practical purposes. Currently, algorithms such as MCTS implicitly handle some subproblem of “logical uncertainty” without a full formal theory, and this does not seem problematic yet. It’s conceivable that future algorithms will be similar to MCTS, implicitly handling larger parts of these theories in ways that are as well understood as MCTS is, such that good consequentialist reasoning in subsystems does not overwhelm the top-level consequentialist reasoning.
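
To illustrate what “implicitly handling a subproblem of logical uncertainty” looks like, here is the standard UCT selection rule used inside MCTS (a textbook sketch, not MIRI’s framing; the data layout is assumed for illustration). The algorithm keeps explicit statistics about position values it cannot compute exactly, and trades off acting on its current estimates against reducing its uncertainty about them, without any general theory of logical uncertainty:

```python
import math

def uct_select(children, c=math.sqrt(2)):
    """Pick the index of the child maximizing the UCT score: an empirical value
    estimate plus an exploration bonus that shrinks as that child is sampled more.

    `children` is a list of (total_value, visit_count) pairs for a node's expanded
    children; this layout is assumed for illustration, not taken from any library.
    """
    parent_visits = sum(n for _, n in children)

    def uct_score(child):
        value_sum, visits = child
        if visits == 0:
            return float("inf")                       # always try unvisited moves first
        mean_value = value_sum / visits               # current estimate of the move's value
        uncertainty_bonus = c * math.sqrt(math.log(parent_visits) / visits)
        return mean_value + uncertainty_bonus         # exploit vs. reduce uncertainty

    return max(range(len(children)), key=lambda i: uct_score(children[i]))

# Example: three candidate moves with (total value, visit count) statistics.
print(uct_select([(6.0, 10), (3.0, 4), (0.0, 0)]))    # -> 2, the unvisited move
```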

“MIRI” has a strong intuition that this hope won’t pan out, and personally I’m somewhat confused about the specifics; see Nate’s comments below for details.

On this path, the most important research topics are those that relate to directing top-level consequentialist reasoning (implemented using algorithms on the default AI development path) towards useful objectives. (Note that these research problems are also important on other paths; goals have to be specified at some point in all cases).

Example research topics:

  • Avoiding negative side effects

  • Avoiding reward hacking

  • Robustness to distribution shift

  • Safe exploration

  • Inverse reinforcement learning

  • Neural network transparency

(research topics like these are discussed in Concrete Problems in AI Safety and Alignment for Advanced Machine Learning Systems)

Figure out some core of good consequentialist reasoning and ensure that AI is developed through this paradigm

This is the main purpose of MIRI’s research in HRAD. The main hope is that there is some simple core of good reasoning that can be discovered through theoretical research.

On this pathway, it isn’t currently cleanly argued that the right way to research good consequentialist reasoning is to study the particular MIRI research topics such as decision theory. One could imagine other approaches to studying good consequentialist reasoning (e.g. thinking about how to train model-based reinforcement learners). I think the focus on problems like decision theory is mostly based on intuitions that are (currently) hard to explicitly argue for.

Example research topics:

  • Logical uncertainty

  • Decision theory

  • Multi-level models

  • Vingean reflection

(see the agent foundations technical agenda paper for details)

Figure out how to align a “messy” AI whose good consequentialist reasoning is in a subsystem

This is the main part of Paul Christiano’s research program. Disagreements about the viability of this approach are quite technical; I have previously written about some aspects of this disagreement here.

Example research topics:

Interaction with task AGI

Given this concern, it isn’t immediately clear how task AGI fits into the picture. I think the main motivation for task AGI is that it alleviates some aspects of the concern but not others: ideally, a task AGI requires knowing fewer aspects of good consequentialist reasoning to build (e.g. perhaps some decision-theoretic problems can be dodged), and has subsystems “small” enough that they will not develop good consequentialist reasoning independently.

Conclusion

I hope I have clarified what the main argument motivating HRAD research is, and what positions it is possible to take on this argument. There seem to be significant opportunities for further clarification of arguments and disagreements, especially around the MIRI intuition that there is a small core of good consequentialist reasoning that is important for AI capabilities and can be discovered through theoretical research.