Plausible cases for HRAD work, and locating the crux in the “realism about rationality” debate

This post is my attempt to summarize and distill the major public debates about MIRI’s highly reliable agent designs (HRAD) work (which includes work on decision theory), including the discussions in Realism about rationality and Daniel Dewey’s My current thoughts on MIRI’s “highly reliable agent design” work. Part of the difficulty with discussing the value of HRAD work is that it’s not even clear what the disagreement is about, so my summary takes the form of multiple possible “worlds” we might be in; each world consists of a positive case for doing HRAD work, along with the potential objections to that case, which results in one or more cruxes.

I will talk about “being in a world” throughout this post. What I mean by this is the following: If we are “in world X”, that means that the case for HRAD work outlined in world X is the one that most resonates with MIRI people as their motivation for doing HRAD work; and that when people disagree about the value of HRAD work, this is what the disagreement is about. When I say that “I think we are in this world”, I don’t mean that I agree with this case for HRAD work; it just means that this is what I think MIRI people think.

In this post, the pro-HRAD stance is something like “HRAD work is the most important kind of technical research in AI alignment; it is the overwhelming priority and we’re pretty much screwed if we under-invest in this kind of research” and the anti-HRAD stance is something like “HRAD work seems significantly less promising than other technical AI alignment agendas, such as the approaches to directly align machine learning systems (e.g. iterated amplification)”. There is a much weaker pro-HRAD stance, which is something like “HRAD work is interesting and doing more of it adds value, but it’s not necessarily the most important kind of technical AI alignment research to be working on”; this post is not about this weaker stance.

Clarifying some terms

Before describing the various worlds, I want to present some distinctions that have come up in discussions about HRAD, which will be relevant when distinguishing between the worlds.

Levels of abstraction vs levels of indirection

The idea of levels of abstraction was introduced in the context of debate about HRAD work by Rohin Shah, and is described in this comment (start from “When groups of humans try to build complicated stuff”). For more background, see these articles on Wikipedia.

Later on, in this comment Rohin gave a somewhat different “levels” idea, which I’ve decided to call “levels of indirection”. The idea is that there might not be a hierarchy of abstraction, but there are still multiple intermediate layers between the theory you have and the end result you want. The relevant “levels of indirection” here form the sequence HRAD → machine learning → AGI. Even though levels of indirection are different from levels of abstraction, the idea is that the same principle applies: the more levels there are, the harder it becomes for a theory to apply to the final level.

Precise vs imprecise theory

A precise theory is one which can scale to 2+ levels of abstraction/indirection.

An imprecise theory is one which can scale to at most 1 level of abstraction/indirection.

More intuitively, a precise theory is more mathy, rigorous, and exact, like pure math and physics, and an imprecise theory is less mathy, like economics and psychology.

Building agents from the ground up vs understanding the behavior of rational agents and predicting roughly what they will do

This distinction comes from Abram Demski’s comment. However, I’m not confident I’ve understood this distinction in the way that Abram intended it, so what I describe below may be a slightly different distinction.

Building agents from the ground up means having a precise theory of rationality that allows us to build an AGI in a satisfying way, e.g. where someone with security mindset can be confident that it is aligned. Importantly, we allow the AGI to be built in whatever way is safest or most theoretically satisfying, rather than requiring that the AGI be built using whatever methods are mainstream (e.g. current machine learning methods).

Understanding the behavior of rational agents and predicting roughly what they will do means being handed an arbitrary agent implemented in some way (e.g. via black-box ML) and then being able to predict roughly how it will act.

I think of the difference between these two as the difference between existential and universal quantification: “there exists x such that P(x)” versus “for all x we have P(x)”, where P(x) is something like “we can understand and predict how x will act in a satisfying way”. The former only says that we can build some AGI using the precise theory that we understand well, whereas the latter says we have to deal with whatever kind of AGI ends up being developed, using methods we might not understand well.
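Purely as a restatement of the previous paragraph in logical notation (the glosses under the braces are my own paraphrase, not anything from the original discussion):

  \underbrace{\exists x.\ P(x)}_{\text{we can build some agent we understand}}
  \qquad \text{vs.} \qquad
  \underbrace{\forall x.\ P(x)}_{\text{we can understand whatever agent gets built}}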

World 1

Case for HRAD

The goal of HRAD research is to generally become less confused about things like counterfactual reasoning and logical uncertainty. Becoming less confused about these things will: help AGI builders avoid, detect, and fix safety issues; help AGI builders predict or explain safety issues; help to conceptually clarify the AI alignment problem; and help us be satisfied that the AGI is doing what we want. Moreover, unless we become less confused about these things, we are likely to screw up alignment because we won’t deeply understand how our AI systems are reasoning. There are other ways to gain clarity on alignment, such as by working on iterated amplification, but these approaches don’t decompose cognitive work enough.

For this case, it is not important for the final product of HRAD to be a precise theory. Even if the final theory of embedded agency is imprecise, or even if there is no “final say” on the topic, if we are merely much less confused than we are now, that is still good enough to help us ensure AI systems are aligned.

Why I think we might be in this world

The main reason I think we might be in this world (i.e. that the above case is the motivating reason for MIRI prioritizing HRAD work) is that people at MIRI frequently seem to be saying something like the case above. However, they also seem to be saying different things in other places, so I’m not confident this is actually their case. Here are some examples:

  • Eliezer Yudkowsky: “Techniques you can actually adapt in a safe AI, come the day, will probably have very simple cores — the sort of core concept that takes up three paragraphs, where any reviewer who didn’t spend five years struggling on the problem themselves will think, “Oh I could have thought of that.” Someday there may be a book full of clever and difficult things to say about the simple core — contrast the simplicity of the core concept of causal models, versus the complexity of proving all the clever things Judea Pearl had to say about causal models. But the planetary benefit is mainly from posing understandable problems crisply enough so that people can see they are open, and then from the simpler abstract properties of a found solution — complicated aspects will not carry over to real AIs later.”

  • Rob Bensinger: “We’re working on decision theory because there’s a cluster of confusing issues here (e.g., counterfactuals, updatelessness, coordination) that represent a lot of holes or anomalies in our current best understanding of what high-quality reasoning is and how it works.” He also uses phrases like “developing an understanding of roughly what counterfactuals are and how they work” and “very roughly how/why it works”. The post doesn’t really specify whether or not the final output is expected to be precise. (The analogy with probability theory and rockets gestures at precise theories, but the post doesn’t come out and say it.)

  • Abram Demski: “I don’t think there’s a true rationality out there in the world, or a true decision theory out there in the world, or even a true notion of intelligence out there in the world. I work on agent foundations because there’s still something I’m confused about even after that, and furthermore, AI safety work seems fairly hopeless while still so radically confused about the-phenomena-which-we-use-intelligence-and-rationality-and-agency-and-decision-theory-to-describe.”

  • Nate Soares: “The main case for HRAD problems is that we expect them to help in a gestalt way with many different known failure modes (and, plausibly, unknown ones). E.g., ‘developing a basic understanding of counterfactual reasoning improves our ability to understand the first AGI systems in a general way, and if we understand AGI better it’s likelier we can build systems to address deception, edge instantiation, goal instability, and a number of other problems’.”

  • In the deconfusion section of MIRI’s 2018 update, some of the examples of deconfusion are not precise/mathematical in nature (e.g. see the paragraph starting with “In 1998, conversations about AI risk and technological singularity scenarios often went in circles in a funny sort of way” and the list after “Among the bits of conceptual progress that MIRI contributed to are”). There are more mathematical examples in the post, but the fact that there are also non-mathematical examples suggests that having a precise theory of rationality is not important to the case for HRAD work. There’s also the quote “As AI researchers explore the space of optimizers, what will it take to ensure that the first highly capable optimizers that researchers find are optimizers they know how to aim at chosen tasks? I’m not sure, because I’m still in some sense confused about the question.”

The crux

One way to reject this case for HRAD work is by saying that imprecise theories of rationality are insufficient for helping to align AI systems. This is what Rohin does in this comment, where he says imprecise theories cannot build things “2+ levels above”.

There is a separate potential rejection, which is to say either that HRAD work will never result in precise theories, or that even a precise theory is insufficient for helping to align AI systems. However, these objections apply to the more restricted worlds where the goal of HRAD work is specifically to come up with a precise theory, so they will be covered in the other worlds below.

There is a third rejection, which is to argue that other approaches (such as iterated amplification) are more promising for gaining clarity on alignment. In this case, the main disagreement may instead be about other agendas rather than about HRAD.

World 2

Case for HRAD

The goal of HRAD research is to come up with a theory of rationality that is so precise that it allows one to build an agent from the ground up. Deconfusion is still important, as with world 1, but in this case we don’t merely want any kind of deconfusion, but specifically deconfusion which is accompanied by a precise theory of rationality.

For this case, HRAD research isn’t intended to produce a precise theory about how to predict ML systems, or to be able to make precise predictions about what ML systems will do. Instead, the idea is that the precise theory of rationality will help AGI builders avoid, detect, and fix safety issues; predict or explain safety issues; help to conceptually clarify the AI alignment problem; and help us be satisfied that the AGI is doing what we want. In other words, instead of directly using a precise theory about understanding/predicting rational agents in general, we use the precise theory about rationality to help us roughly predict what rational agents will do in general (including ML systems).

As with world 1, unless we become less confused, we are likely to screw up alignment because we won’t deeply understand how our AI systems are reasoning. There are other ways to gain clarity on alignment, such as by working on iterated amplification, but these approaches don’t decompose cognitive work enough.

Why I think we might be in this world

This seems to be what Abram is saying in this comment (see especially the part after “I guess there’s a tricky interpretational issue here”).

It also seems to match what Rohin is saying in these two comments.

The examples MIRI people sometimes give as precedents for HRAD-ish work, such as the work done by Turing, Shannon, and Maxwell, are precise mathematical theories.

The crux

There seem to be two possible rejections of this case:

  • We can reject the existence of the precise theory of rationality. This is what Rohin does in this comment and this comment, where he says “MIRI’s theories will always be the relatively-imprecise theories that can’t scale to ‘2+ levels above’.” Paul Christiano seems to also do this, as summarized by Jessica Taylor in this post: intuition 18 is “There are reasons to expect the details of reasoning well to be ‘messy’.”

  • We can argue that even a precise theory of rationality is insufficient for helping to align AI systems. This seems to be what Daniel Dewey is doing in this post when he says things like “AIXI and Solomonoff induction are particularly strong examples of work that is very close to HRAD, but don’t seem to have been applicable to real AI systems” and “It seems plausible that the kinds of axiomatic descriptions that HRAD work could produce would be too taxing to be usefully applied to any practical AI system”.

World 3

Case for HRAD

The goal of HRAD research is to directly come up with a precise theory for understanding the behavior of rational agents and predicting what they will do. Deconfusion is still important, as with worlds 1 and 2, but in this case we don’t merely want any kind of deconfusion, but specifically deconfusion which is accompanied by a precise theory that allows us to predict agents’ behavior in general. And a precise theory is important, but we don’t merely want a precise theory that lets us build an agent; we want our theory to act like a box that takes in an arbitrary agent (such as one built using ML and other black boxes) and allows us to analyze its behavior.

This theory can then be used to help AGI builders avoid, detect, and fix safety issues; predict or explain safety issues; help to conceptually clarify the AI alignment problem; and help us be satisfied that the AGI is doing what we want.

As with worlds 1 and 2, unless we become less confused, we are likely to screw up alignment because we won’t deeply understand how our AI systems are reasoning. There are other ways to gain clarity on alignment, such as by working on iterated amplification, but these approaches don’t decompose cognitive work enough.

Why I think we might be in this world

I mostly don’t think we’re in this world, but some critics might think we are.

For example, Abram says in this comment: “I can see how Ricraz would read statements of the first type [i.e. having precise understanding of rationality] as suggesting very strong claims of the second type [i.e. being able to understand the behavior of agents in general].”

Daniel Dewey might also think we are in this world; it’s hard for me to tell based on his post about HRAD.

The crux

The crux in this world is basically the same as the first rejection for world 2: we can reject the existence of a precise theory for understanding the behavior of arbitrary rational agents.

Conclusion, and moving forward

To summarize the above, combining all of the possible worlds, the pro-HRAD stance becomes:

(ML safety agenda not promising) and (
  (even an imprecise theory of rationality helps to align AGI) or
  ((a precise theory of rationality can be found) and
   (a precise theory of rationality can be used to help align AGI)) or
  (a precise theory to predict the behavior of arbitrary agents can be found)
)

and the anti-HRAD stance is the negation of the above:

(ML safety agenda promising) or (
  (an imprecise theory of rationality cannot be used to help align AGI) and
  ((a precise theory of rationality cannot be found) or
   (even a precise theory of rationality cannot be used to help align AGI)) and
  (a precise theory to predict the behavior of arbitrary agents cannot be found)
)

How does this fit under the Double Crux framework? The current “overall crux” is a messy proposition consisting of multiple conjunctions and disjunctions, and fully resolving the disagreement can in the worst case require assigning truth values to all five parts: the statement “A and (B or (C and D) or E)”, with disagreements resolved in the order A=True, B=False, C=True, D=False, can still be true or false depending on the value of E. From an efficiency perspective, if some of the conjuncts/disjuncts don’t matter, we want to get rid of them in order to simplify the structure of the overall crux (this corresponds to identifying which “world” we are in, using the terminology of this post), and we also might want to pick an ordering of which parts to resolve first (for example, with A=True and B=True, we already know the overall proposition is true).
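To make the structure concrete, here is a small illustrative Python sketch (the function names and the labels A through E are my own shorthand for the five sub-claims above, not anything from the Double Crux framework itself); it checks that, with A, B, C, and D resolved as in the example, the overall crux still hinges on E, and that the anti-HRAD statement is exactly the negation of the pro-HRAD statement:

from itertools import product

def pro_hrad(a, b, c, d, e):
    # A = ML safety agenda not promising
    # B = even an imprecise theory of rationality helps to align AGI
    # C = a precise theory of rationality can be found
    # D = a precise theory of rationality can be used to help align AGI
    # E = a precise theory to predict the behavior of arbitrary agents can be found
    return a and (b or (c and d) or e)

def anti_hrad(a, b, c, d, e):
    # The negation written out explicitly, mirroring the anti-HRAD statement above.
    return (not a) or ((not b) and ((not c) or (not d)) and (not e))

# With A=True, B=False, C=True, D=False resolved, the outcome depends entirely on E:
print(pro_hrad(True, False, True, False, False))  # prints False
print(pro_hrad(True, False, True, False, True))   # prints True

# Checking all 32 assignments confirms anti-HRAD is exactly the negation of pro-HRAD:
assert all(anti_hrad(*v) == (not pro_hrad(*v)) for v in product((False, True), repeat=5))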

So some steps for moving the discussion forward:

  • I think it would be great to get HRAD proponents/opponents to be like “we’re definitely in world X, and not any of the other worlds” or even be like “actually, the case for HRAD really is disjunctive, so both of the cases in worlds X and Y apply”.

  • If I missed any additional possible worlds, or if I described one of the worlds incorrectly, I am interested in hearing about it.

  • If it becomes clear which world we are in, then the next step is to drill down on the crux(es) in that world.

Thanks to Ben Cottier, Rohin Shah, and Joe Bernstein for feedback on this post.