Clarifying some key hypotheses in AI alignment

We’ve cre­ated a di­a­gram map­ping out im­por­tant and con­tro­ver­sial hy­pothe­ses for AI al­ign­ment. We hope that this will help re­searchers iden­tify and more pro­duc­tively dis­cuss their dis­agree­ments.


[Figure: preview of a part of the diagram. The full version is linked from the preview image.]


  1. This does not decompose arguments exhaustively. It does not include every reason to favour or disfavour ideas. Rather, it is a set of key hypotheses and relationships with other hypotheses, problems, solutions, models, etc. Examples of important but apparently uncontroversial premises within the AI safety community include orthogonality, complexity of value, Goodhart’s Curse, and AI being deployed in a catastrophe-sensitive context.

  2. This is not a comprehensive collection of key hypotheses across the whole space of AI alignment. It focuses on a subspace that we find interesting and that is relevant to more recent discussions we have encountered, but where key hypotheses seem relatively less illuminated. This includes rational agency and goal-directedness, CAIS, corrigibility, and the rationale of foundational and practical research. In hindsight, the selection criteria were something like:

    1. The idea is closely con­nected to the prob­lem of ar­tifi­cial sys­tems op­ti­miz­ing ad­ver­sar­i­ally against hu­mans.

    2. The idea must be ex­plained suffi­ciently well that we be­lieve it is plau­si­ble.

  3. Ar­rows in the di­a­gram in­di­cate flows of ev­i­dence or soft re­la­tions, not ab­solute log­i­cal im­pli­ca­tions — please read the “in­ter­pre­ta­tion” box in the di­a­gram. Also pay at­ten­tion to any rea­son­ing writ­ten next to a Yes/​No/​Defer ar­row — you may dis­agree with it, so don’t blindly fol­low the ar­row!


Much has been writ­ten in the way of ar­gu­ments for AI risk. Re­cently there have been some talks and posts that clar­ify differ­ent ar­gu­ments, point to open ques­tions, and high­light the need for fur­ther clar­ifi­ca­tion and anal­y­sis. We largely share their as­sess­ments and echo their recom­men­da­tions.

One aspect of the discourse that seems to lack clarification and analysis is the set of reasons to favour one argument over another: in particular, the key hypotheses or cruxes that underlie the different arguments. Understanding this better will make discourse more productive and help people reason about their beliefs.

This work aims to col­late and clar­ify hy­pothe­ses that seem key to AI al­ign­ment in par­tic­u­lar (by “al­ign­ment” we mean the prob­lem of get­ting an AI sys­tem to re­li­ably do what an over­seer in­tends, or try to do so, de­pend­ing on which part of the di­a­gram you are in). We point to which hy­pothe­ses, ar­gu­ments, ap­proaches, and sce­nar­ios are favoured and dis­favoured by each other. It is nei­ther com­pre­hen­sive nor suffi­ciently nu­anced to cap­ture ev­ery­one’s views, but we ex­pect it to re­duce con­fu­sion and en­courage fur­ther anal­y­sis.

You can digest this post through the diagram or the supplementary information, which have their respective strengths and limitations. We recommend starting with the diagram. Then, if you are interested in related reading or our comments about a particular hypothesis, you can click the link on the box title in the diagram, or look it up below.

Sup­ple­men­tary information

The sections here list the hypotheses in the diagram, along with related readings and our more opinion-based comments. They are collected here for lack of software to neatly embed this information in the diagram, although boxes in the diagram do link back to the headings here. Note that the diagram is the best way to understand relationships and high-level meaning, while this section offers more depth and resources for each hypothesis. Phrases in italics with the first letter capitalised refer to a box in the diagram.


  • AGI: a sys­tem (not nec­es­sar­ily agen­tive) that, for al­most all eco­nom­i­cally rele­vant cog­ni­tive tasks, at least matches any hu­man’s abil­ity at the task. Here, “agen­tive AGI” is es­sen­tially what peo­ple in the AI safety com­mu­nity usu­ally mean when they say AGI. Refer­ences to be­fore and af­ter AGI are to be in­ter­preted as fuzzy, since this defi­ni­tion is fuzzy.

  • CAIS: com­pre­hen­sive AI ser­vices. See Refram­ing Su­per­in­tel­li­gence.

  • Goal-di­rected: de­scribes a type of be­havi­our, cur­rently not for­mal­ised, but char­ac­ter­ised by gen­er­al­i­sa­tion to novel cir­cum­stances and the ac­qui­si­tion of power and re­sources. See In­tu­itions about goal-di­rected be­havi­our.

Agen­tive AGI?

Will the first AGI be most effectively modelled as a unitary, unbounded, goal-directed agent?

  • Re­lated read­ing: Refram­ing Su­per­in­tel­li­gence, Com­ments on CAIS, Sum­mary and opinions on CAIS, em­bed­ded agency se­quence, In­tu­itions about goal-di­rected behaviour

  • Comment: This is consistent with some classical AI theory, and agency continues to be a relevant concept in capability-focused research, e.g. reinforcement learning. However, it has been argued that the way AI systems are taking shape today, and the way humans have historically done engineering, give reason to believe that superintelligent capabilities will be achieved by different means. Some grant that a CAIS-like scenario is probable, but maintain that there will still be Incentive for agentive AGI. Others argue that the current understanding of agency is problematic (perhaps just for being vague, or specifically in relation to embeddedness), so we should defer on this hypothesis until we better understand what we are talking about. This appears to be a strong crux between, on one side, the problem of Incorrigible goal-directed superintelligence and the general aim of (Near) proof-level assurance of alignment and, on the other, approaches that reject the view that alignment is such a hard, one-false-move kind of problem. To advance this debate, however, it does seem important to clarify notions of goal-directedness and agency.

In­cen­tive for agen­tive AGI?

Are there fea­tures of sys­tems built like uni­tary goal-di­rected agents that offer a worth­while ad­van­tage over other broadly su­per­in­tel­li­gent sys­tems?

Mo­du­lar­ity over in­te­gra­tion?

In gen­eral and hold­ing re­sources con­stant, is a col­lec­tion of mod­u­lar AI sys­tems with dis­tinct in­ter­faces more com­pe­tent than a sin­gle in­te­grated AI sys­tem?

  • Re­lated read­ing: Refram­ing Su­per­in­tel­li­gence Ch. 12, 13, AGI will dras­ti­cally in­crease economies of scale

  • Comment: An almost equivalent trade-off here is generality vs. specialisation. Modular systems would benefit from specialisation, but would likely bear greater costs from principal-agent problems and from sharing information (see this comment thread). One case that might be relevant to think about is human roles in the economy: although humans have a general learning capacity, they have tended towards specialising their competencies as part of the economy, with almost no one being truly self-sufficient. However, this may be explained merely by limited brain size. The recent success of end-to-end learning systems has been cited as an argument for integration, as has the evolutionary precedent of humans (since human minds appear to be more integrated than modular).

Cur­rent AI R&D ex­trap­o­lates to AI ser­vices?

AI systems so far generally lack some key qualities that are traditionally attributed to AGI, namely: pursuing cross-domain long-term goals, having broad capabilities, and being persistent and unitary. Will this lack persist as AI R&D becomes increasingly automated and a broad collection of superintelligent services emerges?

In­ci­den­tal agen­tive AGI?

Will sys­tems built like uni­tary goal-di­rected agents de­velop in­ci­den­tally from some­thing hu­mans or other AI sys­tems build?

Con­ver­gent ra­tio­nal­ity?

Given suffi­cient ca­pac­ity, does an AI sys­tem con­verge on ra­tio­nal agency and con­se­quen­tial­ism to achieve its ob­jec­tive?

  • Re­lated read­ing: Let’s talk about “Con­ver­gent Ra­tion­al­ity”

  • Comment: As far as we know, “convergent rationality” has only been named recently by David Krueger, and while it is not well fleshed out yet, it seems to point at an important and commonly held assumption. There is some confusion about whether the convergence could be a theoretical property, or is merely a matter of human framing, or merely a matter of Incentive for agentive AGI.


Mesa-optimisation?

Will there be optimisation processes that, in turn, develop considerably powerful optimisers to achieve their objective? A historical example is natural selection optimising for reproductive fitness to make humans. Humans may have good reproductive fitness, but optimise for other things such as pleasure even when this diverges from fitness.

Dis­con­ti­nu­ity to AGI?

Will there be discontinuous, explosive growth in AI capabilities to reach the first agentive AGI? A discontinuity reduces the opportunity to correct course. Before AGI, a discontinuity seems most likely to result from a qualitative change in the learning curve, due to an algorithmic insight, an architectural change, or a scale-up in resource utilisation.

Re­cur­sive self im­prove­ment?

Is an AI sys­tem that im­proves through its own AI R&D and self-mod­ifi­ca­tion ca­pa­bil­ities more likely than dis­tributed AI R&D au­toma­tion? Re­cur­sive im­prove­ment would give some form of ex­plo­sive growth, and so could re­sult in un­prece­dented gains in in­tel­li­gence.
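To make the contrast concrete, here is a minimal toy sketch of why compounding self-improvement is associated with explosive growth. Everything in it is an assumption made for illustration: the function name `grow`, the scalar notion of "capability", the rates, and especially the constant-rate baseline, which is only a point of comparison and not a claim about how distributed AI R&D automation would actually behave.

```python
def grow(capability, steps, external_rate, self_rate):
    """Return a toy capability trajectory (all quantities are hypothetical).

    Each step adds:
      - external_rate: improvement from outside R&D, independent of the system
      - self_rate * capability: improvement the system makes to itself,
        proportional to how capable it already is
    """
    history = [capability]
    for _ in range(steps):
        capability = capability + external_rate + self_rate * capability
        history.append(capability)
    return history


# Constant-rate baseline: capability rises by the same amount every step.
baseline = grow(1.0, steps=20, external_rate=0.5, self_rate=0.0)

# Compounding case: each gain feeds into the size of the next gain.
recursive = grow(1.0, steps=20, external_rate=0.0, self_rate=0.5)

print(f"baseline after 20 steps:  {baseline[-1]:.1f}")
print(f"recursive after 20 steps: {recursive[-1]:.1f}")
```

In this toy, the baseline grows linearly while the compounding case grows geometrically, so the gap widens at every step. Whether real AI R&D looks more like either curve is exactly what the hypothesis leaves open.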

Dis­con­ti­nu­ity from AGI?

Will there be discontinuous, explosive growth in AI capabilities after agentive AGI? A discontinuity reduces the opportunity to correct course. After AGI, a discontinuity seems most likely to result from a recursive improvement capability.

  • Re­lated read­ing: see Dis­con­ti­nu­ity to AGI

  • Com­ment: see Dis­con­ti­nu­ity to AGI

ML scales to AGI?

Do con­tem­po­rary ma­chine learn­ing tech­niques scale to gen­eral hu­man level (and be­yond)? The state-of-the-art ex­per­i­men­tal re­search aiming to­wards AGI is char­ac­ter­ised by a set of the­o­ret­i­cal as­sump­tions, such as re­in­force­ment learn­ing and prob­a­bil­is­tic in­fer­ence. Does this paradigm read­ily scale to gen­eral hu­man-level ca­pa­bil­ities with­out fun­da­men­tal changes in the as­sump­tions or meth­ods?

  • Re­lated read­ing: Pro­saic AI al­ign­ment, A pos­si­ble stance for al­ign­ment re­search, Con­cep­tual is­sues in AI safety: the paradig­matic gap, Dis­cus­sion on the ma­chine learn­ing ap­proach to AI safety

  • Comment: One might wonder how much change in assumptions or methods constitutes a paradigm shift, but the more important question is how relevant current ML safety work can be to the most high-stakes problems, and that seems to depend strongly on this hypothesis. Proponents of the ML safety approach admit that much of the work could turn out to be irrelevant, especially with a paradigm shift, but argue that there is nonetheless a worthwhile chance that it proves relevant. ML is a fairly broad field, so people taking this approach should think more specifically about what aspects are relevant and scalable. Anyone who proposes to build safe AGI by scaling up contemporary ML techniques should clearly believe the hypothesis, but there is also a feedback loop: the more feasible approaches one comes up with, the more evidence there is for the hypothesis. You may opt for Foundational or “deconfusion” research if (1) you don’t feel confident enough about this to commit to working on ML, or (2) you think that, whether or not ML scales in terms of capability, we need deep insights about intelligence to get a satisfactory solution to alignment. The latter view implies Alignment is much harder than, or does not overlap much with, capability gain.

Deep in­sights needed?

Do we need a much deeper un­der­stand­ing of in­tel­li­gence to build an al­igned AI?

Broad basin for cor­rigi­bil­ity?

Do cor­rigible AI sys­tems have a broad basin of at­trac­tion to in­tent al­ign­ment? Cor­rigible AI tries to help an over­seer. It acts to im­prove its model of the over­seer’s prefer­ences, and is in­cen­tivised to make sure any sub­sys­tems it cre­ates are al­igned — per­haps even more so than it­self. In this way, per­tur­ba­tions or er­rors in al­ign­ment tend to be cor­rected, and it takes a large per­tur­ba­tion to move out of this “basin” of cor­rigi­bil­ity.

  • Re­lated read­ing: Cor­rigi­bil­ity, dis­cus­sion on the need for a grounded defi­ni­tion of prefer­ences (com­ment thread), Two Ne­glected Prob­lems in Hu­man-AI Safety (prob­lem 1 poses a challenge for cor­rigi­bil­ity)

  • Comment: This definition of corrigibility is still vague, and although one can explain how it would work in a desirable way, it is not clear how practically feasible it is. It seems that proponents of corrigible AI accept that greater theoretical understanding and clarification are needed; how much is a key source of disagreement. On a practical extreme, one would iterate experiments with tight feedback loops to figure it out, and correct errors on the go. This assumes ample opportunity for trial and error, rejecting Discontinuity to/from AGI. On a theoretical extreme, some argue that one would need to develop a new mathematical theory of preferences to be confident enough that this approach will work, or that such a theory would provide the necessary insights to make it work at all. If you find this hypothesis weak, you probably put more weight on threat models based on Goodhart’s Curse, e.g. Incorrigible goal-directed superintelligence, and on the general aim of (Near) proof-level assurance of alignment.
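The “basin” metaphor used in this section can be illustrated with a very small toy dynamical system. This is only a sketch under loud assumptions: the scalar “misalignment error”, the update rule, and the names `step`, `simulate`, `radius`, `rate` and `cap` are all hypothetical, chosen to show what a basin of attraction looks like, and they do not model any real AI system or any proposal from the related reading.

```python
def step(error, radius=1.0, rate=0.5):
    """Apply one round of (toy) correction to a scalar misalignment error.

    Inside the basin (|error| < radius) the multiplier is below 1, so the
    error shrinks towards zero. Outside it the multiplier is above 1, so the
    error grows instead of being corrected.
    """
    multiplier = 1.0 + rate * (abs(error) / radius - 1.0)
    return error * multiplier


def simulate(error, steps=30, cap=100.0):
    """Iterate `step`, stopping early once the error has clearly escaped."""
    for _ in range(steps):
        error = step(error)
        if abs(error) >= cap:
            return cap if error > 0 else -cap
    return error


small = simulate(0.5)   # starts inside the basin
large = simulate(1.2)   # starts just outside the basin

print(f"small perturbation ends at {small:.2e}")
print(f"large perturbation ends at {large:.1f}")
```

In this toy, a perturbation that starts inside the radius decays towards zero, while one that starts outside it runs away until it hits the cap. The hypothesis in question is, roughly, whether corrigible systems really have dynamics of the first kind over a wide range of starting points.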

In­con­spicu­ous failure?

Will a con­crete, catas­trophic AI failure be over­whelm­ingly hard to recog­nise or an­ti­ci­pate? For cer­tain kinds of ad­vanced AI sys­tems (namely the goal-di­rected type), it seems that short of near proof-level as­surances, all safe­guards are thwarted by the near­est un­blocked strat­egy. Such AI may also be in­cen­tivised for de­cep­tion and ma­nipu­la­tion to­wards a treach­er­ous turn. Or, in a ma­chine learn­ing fram­ing, it would be very difficult to make such AI ro­bust to dis­tri­bu­tional shift.

Creep­ing failure?

Would grad­ual gains in the in­fluence of AI al­low small prob­lems to ac­cu­mu­late to catas­tro­phe? The grad­ual as­pect af­fords op­por­tu­nity to recog­nise failures and think about solu­tions. Yet for any given in­cre­men­tal change in the use of AI, the eco­nomic in­cen­tives could out­weigh the prob­lems, such that we be­come more en­tan­gled in, and re­li­ant on, a com­plex sys­tem that can col­lapse sud­denly or drift from our val­ues.

Thanks to Stu­art Arm­strong, Wei Dai, Daniel Dewey, Eric Drexler, Scott Em­mons, Ben Garfinkel, Richard Ngo and Cody Wild for helpful feed­back on drafts of this work. Ben es­pe­cially thanks Ro­hin for his gen­er­ous feed­back and as­sis­tance through­out its de­vel­op­ment.