On motivations for MIRI’s highly reliable agent design research

(this post came out of a conversation between me and Owen Cotton-Barratt, plus a follow-up conversation with Nate)

I want to clarify my understanding of some of the motivations of MIRI’s highly reliable agent design (HRAD) research (e.g. logical uncertainty, decision theory, multi-level models).

Top-level vs. subsystem reasoning

I’ll distinguish between an AI system’s top-level reasoning and subsystem reasoning. Top-level reasoning is the reasoning the system is doing in a way its designers understand (e.g. using well-understood algorithms); subsystem reasoning is reasoning produced by the top-level reasoning that its designers (by default) don’t understand at an algorithmic level.

Here are a few examples:

AlphaGo

Top-level reasoning: MCTS, self-play, gradient descent, …

Subsystem reasoning: whatever reasoning the policy network is doing, which might involve some sort of “prediction of consequences of moves”
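To make the top-level/subsystem split concrete, here is a minimal sketch (invented for illustration, not AlphaGo’s actual algorithm) of purely top-level reasoning: UCB1-based Monte Carlo tree search on the simple game of Nim. Every line is something the designer understands; in AlphaGo, a learned policy network would additionally bias this search, and that network’s internal reasoning is the opaque part.

```python
# Toy sketch of designer-understood top-level reasoning: UCB1-based Monte Carlo
# tree search for Nim (take 1-3 stones; whoever takes the last stone wins).
# This is NOT AlphaGo: in AlphaGo a learned policy network would bias this
# search, and that network's internal reasoning is the opaque subsystem.
import math

def moves(n):
    return [m for m in (1, 2, 3) if m <= n]

def mcts(n, iters=2000):
    N, W = {}, {}          # visit counts and win counts per (stones, move)

    def simulate(stones):
        path, player = [], 0
        while stones > 0:
            ms = moves(stones)
            total = sum(N.get((stones, m), 0) for m in ms) + 1
            # UCB1: balance observed win rate against trying rarely-used moves
            m = max(ms, key=lambda m: W.get((stones, m), 0.0)
                    / max(N.get((stones, m), 0), 1)
                    + math.sqrt(2 * math.log(total) / (N.get((stones, m), 0) + 1)))
            path.append((player, (stones, m)))
            stones -= m
            player ^= 1
        winner = path[-1][0]              # whoever took the last stone
        for p, key in path:               # back up the result through the path
            N[key] = N.get(key, 0) + 1
            W[key] = W.get(key, 0) + (p == winner)

    for _ in range(iters):
        simulate(n)
    return max(moves(n), key=lambda m: N.get((n, m), 0))

print(mcts(3))   # taking all 3 stones wins immediately
```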

Deep Q learning

Top-level reasoning: the Q-learning algorithm, gradient descent, random exploration, …

Subsystem reasoning: whatever reasoning the Q network is doing, which might involve some sort of “prediction of future score”
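As a sketch of how the top-level machinery here is designer-legible, the toy below runs the tabular version of the Q-learning update with ε-greedy exploration on a 5-state chain (the environment and hyperparameters are invented for illustration). Deep Q-learning replaces the table with a neural network; whatever “prediction of future score” that network computes internally is the subsystem reasoning.

```python
# Toy sketch of the top-level reasoning in Q-learning: the update rule and
# epsilon-greedy exploration, on a 5-state chain with reward at the right end.
# Deep Q-learning swaps this table for a neural network; the network's internal
# reasoning is the opaque subsystem.
import random

random.seed(0)
n_states, actions = 5, (-1, +1)           # step left or right along the chain
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, eps = 0.5, 0.9, 0.3

def step(s, a):
    s2 = max(0, min(n_states - 1, s + a))
    return s2, float(s2 == n_states - 1)  # reward 1 at the rightmost state

for _ in range(500):                       # short training episodes
    s = random.randrange(n_states)
    for _ in range(20):
        # epsilon-greedy: mostly exploit current Q estimates, sometimes explore
        a = random.choice(actions) if random.random() < eps \
            else max(actions, key=lambda a: Q[(s, a)])
        s2, r = step(s, a)
        # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

greedy = [max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states)]
print(greedy)   # learned policy heads right, toward the reward
```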

Solomonoff induction

Top-level reasoning: selecting (Cartesian) hypotheses by seeing which make the best predictions

Subsystem reasoning: the reasoning of the consequentialist reasoners who come to dominate Solomonoff induction, who will use something like naturalized induction and updateless decision theory
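Actual Solomonoff induction is uncomputable, but its top-level selection rule can be caricatured in a few lines: weight each hypothesis by 2^-length, discard those falsified by the data, and predict with the surviving mixture. The hypothesis class below (periodic bit patterns) is a stand-in invented for illustration; the point is that this top-level rule is transparent even when the hypotheses that come to dominate it are not.

```python
# Toy caricature of Solomonoff induction's top-level rule: prior-weight each
# "program" by 2^-length, keep the ones consistent with the data so far, and
# predict with the surviving mixture. (Real Solomonoff induction ranges over
# all programs and is uncomputable; this hypothesis class of periodic bit
# patterns is invented purely for illustration.)
from fractions import Fraction

def hypotheses(max_len=4):
    # each hypothesis: (description length, periodic bit pattern)
    for n in range(1, max_len + 1):
        for code in range(2 ** n):
            yield n, [(code >> i) & 1 for i in range(n)]

def predict(data):
    t = len(data)
    # survivors: hypotheses that reproduce the observed bits exactly
    survivors = [(Fraction(1, 2 ** length), pat)
                 for length, pat in hypotheses()
                 if all(pat[i % len(pat)] == b for i, b in enumerate(data))]
    total = sum(w for w, _ in survivors)
    p_one = sum(w for w, pat in survivors if pat[t % len(pat)] == 1)
    return p_one / total   # mixture probability that the next bit is 1

print(predict([0, 1, 0]))           # 5/8: shorter patterns dominate the posterior
print(predict([0, 1, 0, 1, 0, 1]))  # 0: every surviving hypothesis predicts a 0
```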

Genetic selection

It is possible to imagine a system that learns to play video games by finding (encodings of) policies that get high scores on training games, and combining encodings of policies that do well to produce new policies.

Top-level reasoning: genetic selection

Subsystem reasoning: whatever reasoning the policies are doing (which is something like “predicting the consequences of different actions”)
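A minimal sketch of that top-level loop, with a fixed target bit-string standing in for “scoring well on training games” (all names and parameters are invented for illustration). Rank, keep, recombine, mutate: each step is designer-understood, and nothing in the loop constrains what computation the evolved encodings themselves end up performing.

```python
# Toy sketch of the top-level loop of genetic selection over encoded policies.
# The "game" here just rewards matching a target bit-string; in the scenario
# above, the evolved policy itself would be the opaque subsystem.
import random

random.seed(1)
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]           # stand-in for a real game score

def score(policy):
    return sum(p == t for p, t in zip(policy, TARGET))

def evolve(pop_size=30, generations=60):
    pop = [[random.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]       # keep the highest-scoring policies
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(TARGET))
            child = a[:cut] + b[cut:]        # combine two good encodings
            if random.random() < 0.2:        # occasional mutation
                i = random.randrange(len(TARGET))
                child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=score)

best = evolve()
print(best, score(best))
```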

In the Solomonoff induction and genetic selection cases, if the algorithm is run with enough computation, the subsystem reasoning is likely to overwhelm the top-level reasoning (i.e. the system running Solomonoff induction or genetic selection will eventually come to be dominated by opaque consequentialist reasoners).

Good consequentialist reasoning

Humans are capable of good consequentialist reasoning (at least in comparison to current AI systems). Humans can:

  • make medium-term predictions about complex systems containing other humans

  • make plans that take months or years to execute

  • learn and optimize proxies for long-term success (e.g. learning skills, gaining money)

  • reason about how to write a computer program

and so on. Current AI systems are not capable of good consequentialist reasoning. Superintelligent AGI systems would be capable of good consequentialist reasoning (though superintelligent narrow AI systems might not be in full generality).

The concern

Using these concepts, MIRI’s main concern motivating HRAD research can be stated as something like:

  1. The first AI systems capable of pivotal acts will use good consequentialist reasoning.

  2. The default AI development path will not produce good consequentialist reasoning at the top level.

  3. Therefore, on the default AI development path, the first AI systems capable of pivotal acts will have good consequentialist subsystem reasoning but not good consequentialist top-level reasoning.

  4. Consequentialist subsystem reasoning will likely come “packaged with a random goal” in some sense, and this goal will not be aligned with human interests.

  5. Therefore, the default AI development path will produce, as the first AI systems capable of pivotal acts, AI systems with goals not aligned with human interests, causing catastrophe.

Note that, even if the AI system is doing good consequentialist reasoning at the top level rather than in subsystems, this top-level reasoning must still be directed towards the correct objective for the system to be aligned. So HRAD research does not address the entire AI alignment problem.

Possible paths

Given this concern, a number of possible paths to aligned AI emerge:

Limited/tool AI

One might reject premise 1 and attempt to accomplish pivotal acts using AI systems that do not use good consequentialist reasoning. Roughly, the proposal is to have humans do the good consequentialist reasoning, and to use AI systems as tools.

The main concern with this proposal is that a system of humans and limited AIs might be much less effective (for a given level of computing resources) than an AI system capable of good consequentialist reasoning. In particular, (a) a limited AI might require a lot of human labor to do the good consequentialist reasoning, and (b) human consequentialist reasoning is likely to be less effective than superintelligent AI consequentialist reasoning.

The main hope, despite this concern, is that either “general consequentialist reasoning” is not particularly important for the kinds of tasks people will want to use AI systems for (including pivotal acts), or that some sort of global coordination will make the efficiency disadvantage less relevant.

Example research topics:

Hope that top-level reasoning stays dominant on the default AI development path

Currently, it seems like most AI systems’ consequentialist reasoning is explainable in terms of top-level algorithms. For example, AlphaGo’s performance is mostly explained by MCTS and the way it’s trained through self-play. The subsystem reasoning is subsumed by the top-level reasoning and does not overwhelm it.

One could hope that algorithms likely to be developed in the future by default (e.g. model-based reinforcement learning) continue to be powerful enough that the top-level consequentialist reasoning is more powerful than subsystem consequentialist reasoning.

The biggest indication that this might not happen by default is that we currently don’t have an in-principle theory for good reasoning (e.g. we’re currently confused about logical uncertainty and multi-level models), and it doesn’t look like these theories will be developed without a concerted effort. Usually, theory lags behind common practice.

Despite this, a possible reason for hope is that perhaps it’s possible for AI researchers to develop enough tacit understanding of these theories for practical purposes. Currently, algorithms such as MCTS implicitly handle some subproblem of “logical uncertainty” without a full formal theory, and this does not seem problematic yet. It’s conceivable that future algorithms will be similar to MCTS, implicitly handling larger parts of these theories in an equally well-understood way, such that good consequentialist reasoning in subsystems does not overwhelm the top-level consequentialist reasoning.

“MIRI” has a strong intuition that this won’t be the case; personally I’m somewhat confused about the details. See Nate’s comments below for more.

On this path, the most important research topics are those that relate to directing top-level consequentialist reasoning (implemented using algorithms on the default AI development path) towards useful objectives. (Note that these research problems are also important on other paths; goals have to be specified at some point in all cases.)

Example research topics:

  • Avoiding negative side effects

  • Avoiding reward hacking

  • Robustness to distribution shift

  • Safe exploration

  • Inverse reinforcement learning

  • Neural network transparency

(research topics like these are discussed in Concrete Problems in AI Safety and Alignment for Advanced Machine Learning Systems)

Figure out some core of good consequentialist reasoning and ensure that AI is developed through this paradigm

This is the main purpose of MIRI’s research in HRAD. The main hope is that there is some simple core of good reasoning that can be discovered through theoretical research.

On this pathway, it isn’t currently cleanly argued that the right way to research good consequentialist reasoning is to study the particular MIRI research topics such as decision theory. One could imagine other approaches to studying good consequentialist reasoning (e.g. thinking about how to train model-based reinforcement learners). I think the focus on problems like decision theory is mostly based on intuitions that are (currently) hard to explicitly argue for.

Example research topics:

  • Logical uncertainty

  • Decision theory

  • Multi-level models

  • Vingean reflection

(see the agent foundations technical agenda paper for details)

Figure out how to align a “messy” AI whose good consequentialist reasoning is in a subsystem

This is the main part of Paul Christiano’s research program. Disagreements about the viability of this approach are quite technical; I have previously written about some aspects of this disagreement here.

Example research topics:

Interaction with task AGI

Given this concern, it isn’t immediately clear how task AGI fits into the picture. I think the main motivation for task AGI is that it alleviates some aspects of this concern but not others: ideally it requires knowing fewer aspects of good consequentialist reasoning (e.g. perhaps some decision-theoretic problems can be dodged), and has subsystems “small” enough that they will not develop good consequentialist reasoning independently.


I hope I have clarified what the main argument motivating HRAD research is, and what positions it is possible to take on this argument. There seem to be significant opportunities for further clarification of arguments and disagreements, especially the MIRI intuition that there is a small core of good consequentialist reasoning that is important for AI capabilities and that can be discovered through theoretical research.