My current take on the Paul-MIRI disagreement on alignability of messy AI

Paul Christiano and “MIRI” have disagreed on an important research question for a long time: should we focus research on aligning “messy” AGI (e.g. one found through gradient descent or brute force search) with human values, or on developing “principled” AGI (based on theories similar to Bayesian probability theory)? I’m going to present my current model of this disagreement and additional thoughts about it.


I put “MIRI” in quotes because MIRI is an organization composed of people who have differing views. I’m going to use the term “MIRI view” to refer to some combination of the views of Eliezer, Benya, and Nate. I think these three researchers have quite similar views, such that it is appropriate in some contexts to attribute a view to all of them collectively; and that these researchers’ views constitute what most people think of as the “MIRI view”.

(KANSI AI complicates this disagreement somewhat; the story here is that we can use “messy” components in a KANSI AI, but these components have to have their capabilities restricted significantly. Such restriction isn’t necessary if we think messy AGI can be aligned in general.)

Intuitions and research approaches

I’m generally going to take the perspective of looking for the intuitions motivating a particular research approach or produced by a particular research approach, rather than looking at the research approaches themselves. I expect it is easier to reach agreement about how compelling a particular intuition is (at least when other intuitions are temporarily ignored) than to reach agreement on particular research approaches.

In general, it’s quite possible for a research approach to be inefficient while still being based on, or giving rise to, useful intuitions. So a criticism of a particular research approach is not necessarily a criticism of the intuitions behind it.

Terminology

  • A learning problem is a task for which the AI is supposed to output some information, and if we wanted, we could give the information a score measuring how good it is at the task, using less than ~2 weeks of labor. In other words, there’s an inexpensive “ground truth” we have access to. This definition looks a little weird, but I think it is a natural category, and some of the intuitions relate to the distinction between learning and non-learning problems. Paul has written about learning and non-learning problems here.

  • An AI system is aligned if it is pursuing some combination of different humans’ values and not significantly pursuing other values that could impact the long-term future of humanity. If it is pursuing other values significantly, it is unaligned.

  • An AI system is competitive if it is nearly as efficient as other AI systems (aligned or unaligned) that people could build.

Listing out intuitions

I’m going to list out a bunch of relevant intuitions. Usually I can’t actually convey the intuition through text; at best I can write “what someone who has this intuition would feel like saying” and “how someone might go about gaining this intuition”. Perhaps the text will make “logical” sense to you without feeling compelling; this could be a sign that you don’t have the underlying intuition.

Background AI safety intuitions

These background intuitions are ones that I think are shared by both Paul and MIRI.

1. Weak orthogonality. It is possible to build highly intelligent agents with silly goals such as maximizing paperclips. Random “minds from mindspace” (e.g. found through brute force search) will have values that significantly diverge from human values.

2. Instrumental convergence. Highly advanced agents will by default pursue strategies such as gaining resources and deceiving their operators (performing a “treacherous turn”).

3. Edge instantiation. For most objective functions that naively seem useful, the maximum is quite “weird” in a way that is bad for human values.

4. Patch resistance. Most AI alignment problems (e.g. edge instantiation) are very difficult to “patch”; adding a patch that deals with a specific failure will fail to fix the underlying problem and instead lead to further unintended solutions.

Intuitions motivating the agent foundations approach

I think the following intuitions are sufficient to motivate the agent foundations approach to AI safety (thinking about idealized models of advanced agents to become less confused), and something similar to the agent foundations agenda, at least if one ignores contradictory intuitions for a moment. In particular, when considering these intuitions at once, I feel compelled to become less confused about advanced agents through research questions similar to those in the agent foundations agenda.

I’ve confirmed with Nate that these are similar to some of his main intuitions motivating the agent foundations approach.

5. Cognitive reductions are great. When we feel confused about something, there is often a way out of this confusion, by figuring out which algorithm would have generated that confusion. Often, this works even when the original problem seemed “messy” or “subjective”; something that looks messy can have simple principles behind it that haven’t been discovered yet.

6. If you don’t do cognitive reductions, you will put your confusion in boxes and hide the actual problem. By default, a lot of people studying a problem will fail to take the perspective of cognitive reductions and thereby not actually become less confused. The free will debate is a good example of this: most discussion of free will contains confusions that could be resolved using Daniel Dennett’s cognitive reduction of free will. (This is essentially the same as the cognitive reduction discussed in the sequences.)

7. We should expect mainstream AGI research to be inefficient at learning much about the confusing aspects of intelligence, for this reason. It’s pretty easy to look at most AI research and see where it’s hiding fundamental confusions such as logical uncertainty without actually resolving them. E.g. if neural networks are used to predict math, then the confusion about how to do logical uncertainty is placed in the black box of “what this neural net learns to do”. This isn’t that helpful for actually understanding logical uncertainty in a “cognitive reduction” sense; such an understanding could lead to much more principled algorithms.

8. If we apply cognitive reductions to intelligence, we can design agents we expect to be aligned. Suppose we are able to observe “how intelligence feels from the inside” and distill these observations into an idealized cognitive algorithm for intelligence (similar to the idealized algorithm Daniel Dennett discusses to resolve free will). The minimax algorithm is one example of this: it’s an idealized version of planning that in principle could have been derived by observing the mental motions humans go through when playing games. If we implement an AI system that approximates this idealized algorithm, then we have a story for why the AI is doing what it is doing: it is taking action X for the same reason that an “idealized human” would take action X. That is, it “goes through mental motions” that we can imagine going through (or approximates doing so), if we were solving the task we programmed the AI to do. If we’re programming the AI to assist us, we could imagine the mental motions we would take if we were assisting aliens.
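
As a concrete illustration of the minimax example, here is a minimal, self-contained sketch (my own, not something from the original discussion) of minimax applied to a toy game where players alternately take 1 or 2 tokens and whoever takes the last token wins. The point is only that the recursion mirrors the mental motions a human planner goes through: “if I take 1, they will take 2, then I can take the last one…”

```python
# Minimal sketch: minimax on a toy token-taking game.
# Players alternately remove 1 or 2 tokens; whoever takes the last token wins.

def minimax(tokens, maximizing):
    """Return +1 if the maximizing player wins with optimal play, else -1."""
    if tokens == 0:
        # The player to move has nothing to take: the previous player took
        # the last token and won.
        return -1 if maximizing else +1
    values = [minimax(tokens - take, not maximizing)
              for take in (1, 2) if take <= tokens]
    return max(values) if maximizing else min(values)

print(minimax(3, maximizing=True))  # -1: 3 tokens is a losing position for the player to move
print(minimax(4, maximizing=True))  # +1: take 1 token, leaving the opponent the losing 3-token position
```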

9. If we don’t resolve our confusions about intelligence, then we don’t have this story, and this is suspicious. Suppose we haven’t actually resolved our confusions about intelligence. Then we don’t have the story in the previous point, so it’s pretty weird to think our AI is aligned. We must have a pretty different story, and it’s hard to imagine different stories that could allow us to conclude that an AI is aligned.

10. Simple reasoning rules will correctly generalize even for non-learning problems. That is, there’s some way that agents can learn rules for making good judgments that generalize to tasks they can’t get fast feedback on. Humans seem to be an existence proof that simple reasoning rules can generalize; science can make predictions about far-away galaxies even when there isn’t an observable ground truth for the state of the galaxy (only indirect observations). Plausibly, it is possible to use “brute force” to find agents using these reasoning rules by searching for agents that perform well on small tasks and then hoping that they generalize to large tasks, but this can result in misalignment. For example, Solomonoff induction is controlled by malign consequentialists who have learned good rules for how to reason; approximating Solomonoff induction is one way to make an unaligned AI. If an aligned AI is to be roughly competitive with these “brute force” unaligned AIs, we should have some story for why the aligned AI system is also able to acquire simple reasoning rules that generalize well. Note that Paul mostly agrees with this intuition and is in favor of agent foundations approaches to solving this problem, although his research approach would significantly differ from the current agent foundations agenda. (This point is somewhat confusing; see my other post for clarification.)

Intuitions motivating act-based agents

I think the following are all intuitions that Paul has, and that motivate his current research approach.

11. Almost all technical problems are either tractable to solve or are intractable/impossible for a good reason. This is based on Paul’s experience in technical research. For example, consider a statistical learning problem where we are trying to predict a Y value from an X value using some model. It’s possible to get good statistical guarantees on problems where the training distribution of X values is the same as the test distribution of X values, but when those distributions are distinguishable (i.e. there’s a classifier that can separate them pretty well), there’s a fundamental obstruction to getting the same guarantees: given the information available, there is no way to distinguish a model that will generalize from one that won’t, since they could behave in arbitrary ways on test data that is distinctly different from training data. An exception to the rule is NP-complete problems; we don’t have a good argument yet for why they can’t be solved in polynomial time. However, even in this case, NP-hardness forms a useful boundary between tractable and intractable problems.
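
To make the train/test point concrete, here is a rough sketch (my own illustration using made-up Gaussian data, not anything from Paul’s writing) of the standard check for whether the two distributions are distinguishable: train a classifier to tell training inputs apart from test inputs, and see whether it does better than chance.

```python
# Sketch: if a classifier can separate training X values from test X values,
# the i.i.d.-style generalization guarantees no longer apply, because a model
# could behave arbitrarily on the region where the two distributions differ.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 5))  # hypothetical training inputs
X_test = rng.normal(loc=1.5, scale=1.0, size=(500, 5))   # shifted test inputs

# Label each point by which split it came from, then measure how well a
# classifier separates the splits.
X = np.vstack([X_train, X_test])
y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="roc_auc").mean()

# AUC near 0.5: the splits look alike, so the usual guarantees are plausible.
# AUC near 1.0: the splits are distinguishable, which is the fundamental
# obstruction to getting the same guarantees.
print(f"train-vs-test AUC: {auc:.2f}")
```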

12. If the previous intuition is true, we should search for solutions and fundamental obstructions. If there is either a solution or a fundamental obstruction to a problem, then an obvious way to make progress on the problem is to alternate between generating obvious solutions and finding good reasons why a class of solutions (or all solutions) won’t work. In the case of AI alignment, we should try to get a very good solution (e.g. one that allows the aligned AI to be competitive with unprincipled AI systems, such as ones based on deep learning, by exploiting the same techniques) until we have a fundamental obstruction to this. Such a fundamental obstruction would tell us which relaxations of the “full problem” we should consider, and would be useful for convincing others that coordination is required to ensure that aligned AI can prevail even if it is not competitive with unaligned AI. (Paul’s research approach looks quite optimistic partially because he is pursuing this strategy.)

13. We should be looking for ways of turning arbitrary AI capabilities into equally powerful aligned AI capabilities. On priors, we should expect it to be hard for AI safety researchers to make capabilities advances; AI safety researchers make up only a small percentage of AI researchers. If this is the case, then aligned AI will be quite uncompetitive unless it takes advantage of the most effective AI technology that’s already around. It would be really great if we could take an arbitrary AI technology (e.g. deep learning), do a bit of thinking, and come up with a way to direct that technology towards human values. There isn’t a crisp fundamental obstruction to doing this yet, so it is the natural first place to look. To be more specific about what this research strategy entails, suppose it is possible to build an unaligned AI system. We expect it to be competent; say it is competent for reason X. We ought to be able to either build an aligned AI system that also works for reason X, or else find a fundamental obstruction. For example, reason X could be “it does gradient descent to find weights optimizing a proxy for competence”; then we’d seek to build a system that works because it does gradient descent to find weights optimizing a proxy for competence and alignment.
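
As a deliberately toy sketch of that last sentence (my own construction; both proxies below are invented placeholders, not anything Paul or MIRI has proposed), the aligned analogue would run gradient descent on a combined proxy rather than on the competence proxy alone:

```python
# Toy sketch: the "unaligned" system's reason X is gradient descent on a proxy
# for competence; the aligned analogue works for the same reason, but the proxy
# also tracks alignment. Both proxies are placeholders for illustration only.
import torch

torch.manual_seed(0)
x = torch.randn(64, 10)                # hypothetical task inputs
task_targets = torch.randn(64, 1)      # stand-in for "what performs well"
overseer_targets = torch.randn(64, 1)  # stand-in for "what the overseer approves of"

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(200):
    opt.zero_grad()
    out = model(x)
    competence_proxy = torch.nn.functional.mse_loss(out, task_targets)
    alignment_proxy = torch.nn.functional.mse_loss(out, overseer_targets)
    loss = competence_proxy + alignment_proxy  # competence *and* alignment
    loss.backward()
    opt.step()
```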

14. Pursuing human narrow values presents a much more optimistic picture of AI alignment. See Paul’s posts on narrow value learning, act-based agents, and abstract approval direction. The agent foundations agenda often considers problems of the form “let’s use Bayesian VNM agents as our starting point and look for relaxations appropriate to realistic agents, which are naturalized”. This leads to problems such as decision theory, naturalized induction, and ontology identification. However, there isn’t a clear argument for why these are subproblems of the problem we actually care about (which is close to something like “pursuing human narrow values”). For example, perhaps we can understand how to have an AI pursue human narrow values without solving decision theory, since maybe humans don’t actually have a utility function or a decision theory yet (though we might upon long-term reflection; pursuing narrow values should preserve the conditions for such long-term reflection). These research questions might be useful threads to pull on if solving them would tell us more about the problems we actually care about. But I think Paul has a strong intuition that working on these problems isn’t the right way to make progress on pursuing human narrow values.

15. There are important considerations in favor of focusing on alignment for foreseeable AI technologies. See posts here and here. In particular, this motivates work related to alignment for systems solving learning problems.

16. It is, in principle, possible to automate a large fraction of human labor using robust learning. That is, a human can use a small amount of labor to oversee an AI doing a much larger amount of labor in a robust fashion. KWIK learning is a particularly clean (though impractical) demonstration of this. This enables the human to spend much more time overseeing a particular decision than the AI takes to make it (e.g. spending 1 day to oversee a decision made in 1 second), since only a small fraction of decisions are overseen.
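
Here is a minimal back-of-the-envelope sketch (my own toy model with made-up numbers, not anything from the post) of why sparse oversight makes this affordable: if only a small fraction of decisions is audited, the human can spend a day on each audited decision while the total oversight cost stays comparable to the AI’s own cost.

```python
# Toy arithmetic: sparse auditing lets a human spend ~1 day reviewing a
# decision the AI made in ~1 second, without the oversight cost blowing up.
import random

random.seed(0)
SECONDS_PER_DAY = 86_400

ai_seconds_per_decision = 1
human_seconds_per_audit = SECONDS_PER_DAY  # one full day per audited decision
audit_probability = 1 / 100_000            # made-up auditing rate

n_decisions = 1_000_000
audited = sum(random.random() < audit_probability for _ in range(n_decisions))

ai_cost = n_decisions * ai_seconds_per_decision
human_cost = audited * human_seconds_per_audit
print(f"audited {audited} of {n_decisions} decisions")
print(f"human oversight cost is ~{human_cost / ai_cost:.2f}x the AI's total cost")
```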

17. The above is quite powerful, due to bootstrapping. “Automating a large fraction of human labor” is significantly more impressive than it first seems, since the human can use other AI systems in the course of evaluating a specific decision. See ALBA. We don’t yet have a fundamental obstruction to any of ALBA’s subproblems, and we have an argument that solving these subproblems is sufficient to create an aligned learning system.

18. There are reasons to expect the details of reasoning well to be “messy”. That is, there are reasons why we might expect cognition to be as messy and hard to formalize as biology is. While biology has some important large-scale features (e.g. evolution), overall it is quite hard to capture using simple rules. We can take the history of AI as evidence for this: AI research often does consist of people trying to figure out how humans do something at an idealized level and formalize it (roughly similar to the agent foundations approach), and this kind of AI research does not always lead to the most capable AI systems. The success of deep learning is evidence that the most effective way for AI systems to acquire good rules of reasoning is usually to learn them, rather than having them be hardcoded.

What to do from here?

I find all the intuitions above at least somewhat compelling. Given this, I have made some tentative conclusions:

  • I think intuition 10 (“simple reasoning rules generalize for non-learning problems”) is particularly important. I don’t quite understand Paul’s research approach for this question, but it seems that there is convergence on the idea that this intuition is useful and that we should take an agent foundations approach to solve the problem. I think this convergence represents a great deal of progress on the overall disagreement.

  • If we can resolve the above problem by creating intractable algorithms for finding simple reasoning rules that generalize, then plausibly something like ALBA could “distill” these algorithms into a competitive aligned agent making use of e.g. deep learning technology. My picture of this is vague, but if it is correct, then the agent foundations approach and ALBA are quite synergistic. Paul has written a bit about the relation between ALBA and non-learning problems here.

  • I’m still somewhat optimistic about Paul’s approach of “turn arbitrary capabilities into aligned capabilities” and pessimistic about the alternatives to this approach. If this approach is ultimately doomed, I think it’s likely because it’s far easier to find a single good AI system than to turn arbitrary unaligned AI systems into competitive aligned AI systems; there’s a kind of “universal quantifier” implicit in the second approach. However, I don’t see this as a good reason not to use this research approach. It seems like if it is doomed, we will likely find some kind of fundamental obstruction somewhere along the way, and I expect a crisply stated fundamental obstruction to be quite useful for knowing exactly which relaxation of the “competitive aligned AI” problem to pursue. That said, this does argue for pursuing, in parallel, other approaches that are motivated by this particular difficulty.

  • I think intuition 14 (“pursuing human narrow values presents a much more optimistic picture of AI alignment”) is quite important, and would strongly inform research I do using the agent foundations approach. I think the main reason “MIRI” is wary of this is that it seems quite vague and confusing, and maybe fundamental confusions like decision theory and ontology identification will re-emerge if we try to make it more precise. Personally, I expect that, though narrow value learning is confusing, it really ought to dodge decision theory and ontology identification. One way of testing this expectation would be for me to think about narrow value learning by creating toy models of agents that have narrow values but not proper utility functions. Unfortunately, I wouldn’t be too surprised if this turns out to be super messy and hard to formalize.

Acknowledgements

Thanks to Paul, Nate, Eliezer, and Benya for a lot of conversations on this topic. Thanks to John Salvatier for helping me to think about intuitions and teaching me skills for learning intuitions from other people.