Sections 1 & 2: Introduction, Strategy and Governance

This post is part of the sequence version of the Effective Altruism Foundation's research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence.

1 Introduction

Transformative artificial intelligence (TAI) may be a key factor in the long-run trajectory of civilization. A growing interdisciplinary community has begun to study how the development of TAI can be made safe and beneficial to sentient life (Bostrom 2014; Russell et al., 2015; OpenAI, 2018; Ortega and Maini, 2018; Dafoe, 2018). We present a research agenda for advancing a critical component of this effort: preventing catastrophic failures of cooperation among TAI systems. By cooperation failures we refer to a broad class of potentially catastrophic inefficiencies in interactions among TAI-enabled actors. These include destructive conflict; coercion; and social dilemmas (Kollock, 1998; Macy and Flache, 2002) which destroy value over extended periods of time. We introduce cooperation failures at greater length in Section 1.1.

Karnofsky (2016) defines TAI as "AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution". Such systems range from the unified, agent-like systems which are the focus of, e.g., Yudkowsky (2013) and Bostrom (2014), to the "comprehensive AI services" envisioned by Drexler (2019), in which humans are assisted by an array of powerful domain-specific AI tools. In our view, the potential consequences of such technology are enough to motivate research into mitigating risks today, despite considerable uncertainty about the timeline to TAI (Grace et al., 2018) and the nature of TAI development. Given these uncertainties, we will often discuss "cooperation failures" in fairly abstract terms and focus on questions relevant to a wide range of potential modes of interaction between AI systems. Much of our discussion will pertain to powerful agent-like systems, with general capabilities and expansive goals. But whereas the scenarios that concern much of the existing long-term-focused AI safety research involve agent-like systems, an important feature of catastrophic cooperation failures is that they may also occur among human actors assisted by narrow-but-powerful AI tools.

Cooperation has long been studied in many fields: political theory, economics, game theory, psychology, evolutionary biology, multi-agent systems, and so on. But TAI is likely to present unprecedented challenges and opportunities arising from interactions between powerful actors. The size of losses from bargaining inefficiencies may massively increase with the capabilities of the actors involved. Moreover, features of machine intelligence may lead to qualitative changes in the nature of multi-agent systems. These include changes in:

  1. the ability to make credible commitments;

  2. the ability to self-modify (Omohundro, 2007; Everitt et al., 2016) or otherwise create successor agents;

  3. the ability to model other agents.

These changes call for the development of new conceptual tools, building on and modifying the many relevant literatures which have studied cooperation among humans and human societies.

1.1 Cooperation failure: models and examples

Many of the cooperation failures in which we are interested can be understood as mutual defection in a social dilemma. Informally, a social dilemma is a game in which everyone is better off if everyone cooperates, yet individual rationality may lead to defection. Formally, following Macy and Flache (2002), we will say that a two-player normal-form game with payoffs denoted as in Table 1 (R for mutual cooperation, P for mutual defection, S for cooperating against a defector, and T for defecting against a cooperator) is a social dilemma if the payoffs satisfy these criteria:

  • R > P (mutual cooperation is better than mutual defection);

  • R > S (mutual cooperation is better than cooperating while your counterpart defects);

  • 2R > T + S (mutual cooperation is better than randomizing between cooperation and defection);

  • For greed g = T - R and fear f = P - S, the payoffs satisfy g > 0 or f > 0.
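
As a concrete illustration of these conditions, here is a minimal sketch in Python; the function name and the specific payoff values are illustrative choices of ours, not taken from the agenda's Table 1:

```python
def is_social_dilemma(T, R, P, S):
    """Check the Macy and Flache (2002) social dilemma conditions for a
    symmetric 2x2 game with temptation T, reward R, punishment P, and
    sucker's payoff S."""
    coop_beats_defection = R > P          # mutual cooperation beats mutual defection
    coop_beats_exploitation = R > S       # ...and beats cooperating against a defector
    coop_beats_randomizing = 2 * R > T + S
    greed = T - R                         # incentive to defect against a cooperator
    fear = P - S                          # incentive to defect against a defector
    return (coop_beats_defection and coop_beats_exploitation
            and coop_beats_randomizing and (greed > 0 or fear > 0))

# Illustrative payoffs:
print(is_social_dilemma(T=5, R=3, P=1, S=0))  # PD-style: greed and fear both positive -> True
print(is_social_dilemma(T=4, R=3, P=0, S=1))  # Chicken-style: greed but no fear -> True
print(is_social_dilemma(T=3, R=4, P=1, S=0))  # Stag Hunt-style: fear but no greed -> True
```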

Nash equilibrium (i.e., a choice of strategy by each player such that no player can benefit from unilaterally deviating) has been used to analyze failures of cooperation in social dilemmas. In the Prisoner's Dilemma (PD), the unique Nash equilibrium is mutual defection. In Stag Hunt, there is a cooperative equilibrium which requires agents to coordinate, and a defecting equilibrium which does not. In Chicken, there are two pure-strategy Nash equilibria (Player 1 plays C while Player 2 plays D, and vice versa) as well as an equilibrium in which players independently randomize between C and D. The mixed-strategy equilibrium or uncoordinated equilibrium selection may therefore result in a crash (i.e., mutual defection).
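
To make the last point concrete, the sketch below computes the symmetric mixed-strategy equilibrium of Chicken and the resulting probability of a crash; the payoff values are illustrative assumptions of ours, not taken from Table 1:

```python
def chicken_mixed_equilibrium(T, R, S, P):
    """For Chicken payoffs T > R > S > P, each player defects (D) with the
    probability q that makes the other indifferent between C and D:
    q*S + (1-q)*R = q*P + (1-q)*T."""
    q = (T - R) / ((T - R) + (S - P))
    return q, q * q  # (probability of playing D, probability of mutual defection)

# Illustrative payoffs: T=4, R=3, S=1, P=0
q, crash = chicken_mixed_equilibrium(T=4, R=3, S=1, P=0)
print(q, crash)  # 0.5 0.25 -- uncoordinated play crashes a quarter of the time
```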

Social dilemmas have been used to model cooperation failures in international politics; Snyder (1971) reviews applications of PD and Chicken, and Jervis (1978) discusses each of the classic social dilemmas in his influential treatment of the security dilemma.[1] Among the most prominent examples is the model of arms races as a PD: both players build up arms (defect) despite the fact that disarmament (cooperation) is mutually beneficial, as neither wants to be the party who disarms while their counterpart builds up. Social dilemmas have likewise been applied to a number of collective action problems, such as use of a common resource (cf. the famous "tragedy of the commons" (Hardin, 1968; Perolat et al., 2017)) and pollution. See Dawes (1980) for a review focusing on such cases.

Many interactions are not adequately modeled by simple games like those in Table 1. For instance, states facing the prospect of military conflict have incomplete information. That is, each party has private information about the costs and benefits of conflict, their military strength, and so on. They also have the opportunity to negotiate over extended periods; to monitor one another's activities to some extent; and so on. The literature on bargaining models of war (or "crisis bargaining") is a source of more complex analyses (e.g., Powell 2002; Kydd 2003; Powell 2006; Smith and Stam 2004; Fey and Ramsay 2007, 2011; Kydd 2010). In a classic article from this literature, Fearon (1995) defends three now-standard hypotheses as the most plausible explanations for why rational agents would go to war:

  • Credibility: The agents cannot credibly commit to the terms of a peaceful settlement;

  • Incomplete information: The agents have differing private information related to their chances of winning a conflict, and incentives to misrepresent that information (see Sanchez-Pages (2012) for a review of the literature on bargaining and conflict under incomplete information);

  • Indivisible stakes: Conflict cannot be resolved by dividing the stakes, side payments, etc.

Another example of potentially disastrous cooperation failure is extortion (and other compellent threats), and the execution of such threats by powerful agents. In addition to threats being harmful to their target, the execution of threats seems to constitute an inefficiency: much like going to war, threateners face the direct costs of causing harm, and in some cases, risks from retaliation or legal action.

The literature on crisis bargaining between rational agents may also help us to understand the circumstances under which compellent threats are made and carried out, and point to mechanisms for avoiding these scenarios. Countering the hypothesis that war between rational agents A and B can occur as a result of indivisible stakes (for example, a territory), Powell (2006, p. 178) presents a case similar to that in Example 1.1.1, which shows that allocating the full stakes to one agent or the other, with probabilities corresponding to their chances of winning a war, Pareto-dominates fighting.

Example 1.1.1 (Simulated conflict).

Consider two countries disputing a territory which has value v for each of them. Suppose that the row country has probability p of winning a conflict, and that conflict costs c for each country, so that their payoffs for Surrendering and Fighting are as in the top matrix in Table 2. However, suppose the countries agree on the probability p that the row player wins; perhaps they have access to a mutually trusted war-simulator which has the row player winning in a fraction p of simulations. Then, instead of engaging in real conflict, they could allocate the territory based on a draw from the simulator. Playing this game is preferable, as it saves each country the cost c of actual conflict.
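
A minimal numerical sketch of why the simulated draw Pareto-dominates fighting; the symbols follow the example above, while the specific numbers and the function name are illustrative assumptions of ours rather than values from Table 2:

```python
import random

def fight_vs_simulate(p, v, c, n_sims=100_000, seed=0):
    """Compare expected payoffs from fighting over a territory of value v
    with allocating it by a draw from a mutually trusted simulator."""
    # Fighting: the row country wins with probability p; both pay cost c.
    fight_row = p * v - c
    fight_col = (1 - p) * v - c
    # Simulated conflict: same win probabilities, but no conflict cost.
    rng = random.Random(seed)
    row_wins = sum(rng.random() < p for _ in range(n_sims))
    sim_row = (row_wins / n_sims) * v
    sim_col = ((n_sims - row_wins) / n_sims) * v
    return (fight_row, fight_col), (sim_row, sim_col)

# Illustrative numbers: territory worth 1, row wins 60% of the time,
# and fighting costs each side 0.2.
print(fight_vs_simulate(p=0.6, v=1.0, c=0.2))
# Fighting: (0.4, 0.2); simulated draw: roughly (0.6, 0.4) -- each country
# is better off by the avoided conflict cost c.
```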

If players could commit to the terms of peaceful settlements and truthfully disclose private information necessary for the construction of a settlement (for instance, information pertaining to the outcome probability p in Example 1.1.1), the allocation of indivisible stakes could often be accomplished. Thus, the most plausible of Fearon's rationalist explanations for war seem to be (1) the difficulty of credible commitment and (2) incomplete information (and incentives to misrepresent that information). Section 3 discusses credibility in TAI systems. In Section 4 we discuss several issues related to the resolution of conflict under private information.

Lastly, while game theory provides a powerful framework for modeling cooperation failure, TAI systems or their operators will not necessarily be well-modeled as rational agents. For example, systems involving humans in the loop, or black-box TAI agents trained by evolutionary methods, may be governed by a complex network of decision-making heuristics not easily captured in a utility function. We discuss research directions that are particularly relevant to cooperation failures among these kinds of agents in Sections 5.2 (Multi-agent training) and 6 (Humans in the loop).

1.2 Outline of the agenda

We list the sections of the agenda below. Different sections may appeal to readers from different backgrounds. For instance, Section 5 (Contemporary AI architectures) may be most interesting to those with some interest in machine learning, whereas Section 7 (Foundations of rational agency) will be more relevant to readers with an interest in formal epistemology or the philosophical foundations of decision theory. Tags after the description of each section indicate the fields most relevant to that section. Some sections contain Examples illustrating technical points, or explaining in greater detail a possible research direction.

  • Section 2: AI strategy and governance. The nature of losses from cooperation failures will depend on the strategic landscape at the time TAI is deployed. This includes, for instance, the extent to which the landscape is uni- or multipolar (Bostrom, 2014) and the balance between offensive and defensive capabilities (Garfinkel and Dafoe, 2019). Like others with an interest in shaping TAI for the better, we want to understand this landscape, especially insofar as it can help us to identify levers for preventing catastrophic cooperation failures. Given that much of our agenda consists of theoretical research, an important question for us to answer is whether and how such research translates into the governance of TAI.

    Public policy; International relations; Game theory; Artificial intelligence

  • Section 3: Credibility. Credibility—for instance, the credibility of commitments to honor the terms of settlements, or to carry out threats—is a crucial feature of strategic interaction. Changes in agents' ability to self-modify (or create successor agents) and to verify aspects of one another's internal workings are likely to change the nature of credible commitments. These anticipated developments call for the application of existing decision and game theory to new kinds of agents, and the development of new theory (such as that of program equilibrium (Tennenholtz, 2004)) that better accounts for relevant features of machine intelligence.

    Game theory; Behavioral economics; Artificial intelligence

  • Section 4: Peaceful bargaining mechanisms. Call a peaceful bargaining mechanism a set of strategies for each player that does not lead to destructive conflict, and which each agent prefers to playing a strategy which does lead to destructive conflict. In this section, we discuss several possible such strategies and problems which need to be addressed in order to ensure that they are implemented. These strategies include bargaining strategies taken from or inspired by the existing literature on rational crisis bargaining (see Section 1.1), as well as a little-discussed proposal for deflecting compellent threats which we call surrogate goals (Baumann, 2017, 2018).

    Game theory; International relations; Artificial intelligence

  • Section 5: Contemporary AI architectures. Multi-agent artificial intelligence is not a new field of study, and cooperation is of increasing interest to machine learning researchers (Leibo et al., 2017; Foerster et al., 2018; Lerer and Peysakhovich, 2017; Hughes et al., 2018; Wang et al., 2018). But there remain unexplored avenues for understanding cooperation failures using existing tools for artificial intelligence and machine learning. These include the implementation of approaches to improving cooperation which make better use of agents' potential transparency to one another; the implications of various multi-agent training regimes for the behavior of AI systems in multi-agent settings; and analysis of the decision-making procedures implicitly implemented by various reinforcement learning algorithms.

    Machine learning; Game theory

  • Section 6: Humans in the loop. Several TAI scenarios and proposals involve a human in the loop, either in the form of a human-controlled AI tool, or an AI agent which seeks to adhere to the preferences of human overseers. These include Christiano (2018c)'s iterated distillation and amplification (IDA; see Cotra 2018 for an accessible introduction), Drexler (2019)'s comprehensive AI services, and the reward modeling approach of Leike et al. (2018). We would like a better understanding of behavioral game theory, targeted at improving cooperation in TAI landscapes involving human-in-the-loop systems. We are particularly interested in advancing the study of the behavioral game theory of interactions between humans and AIs.

    Machine learning; Behavioral economics

  • Section 7: Foundations of rational agency. The prospect of TAI foregrounds several unresolved issues in the foundations of rational agency. While the list of open problems in decision theory, game theory, formal epistemology, and the foundations of artificial intelligence is long, our focus includes decision theory for computationally bounded agents; and prospects for the rationality and feasibility of various kinds of decision-making in which agents take into account non-causal dependences between their actions and their outcomes.

    Formal epistemology; Philosophical decision theory; Artificial intelligence

2 AI strategy and governance [2]

We would like to better understand the ways the strategic landscape among key actors (states, AI labs, and other non-state actors) might look at the time TAI systems are deployed, and to identify levers for shifting this landscape towards widely beneficial outcomes. Our interests here overlap with Dafoe (2018)'s AI governance research agenda (see especially the "Technical Landscape" section), though we are most concerned with questions relevant to risks associated with cooperation failures.

2.1 Polarity and transition scenarios

From the perspective of reducing risks from cooperation failures, it is prima facie preferable if the transition to TAI results in a unipolar rather than a distributed outcome: The greater the chances of a single dominant actor, the lower the chances of conflict (at least after that actor has achieved dominance). But the analysis is likely not so simple, if the international relations literature on the relative safety of different power distributions (e.g., Deutsch and Singer 1964; Waltz 1964; Christensen and Snyder 1990) is any indication. We are therefore especially interested in a more fine-grained analysis of possible developments in the balance of power. In particular, we would like to understand the likelihood of the various scenarios, their relative safety with respect to catastrophic risk, and the tractability of policy interventions to steer towards safer distributions of TAI-related power. Relevant questions include:

  • One might expect rapid jumps in AI capabilities, rather than gradual progress, to make unipolar outcomes more likely. Should we expect rapid jumps in capabilities, or are the capability gains likely to remain gradual (AI Impacts, 2018)?

  • Which distributions of power are, all things considered, least at risk of catastrophic failures of cooperation?

  • Suppose we had good reason to believe we ought to promote more uni- (or multi-) polar outcomes. What are the best policy levers for increasing the concentration (or spread) of AI capabilities, without severe downsides (such as contributing to arms-race dynamics)?

2.2 Commitment and transparency [3][4]

Agents' ability to make credible commitments is a critical aspect of multi-agent systems. Section 3 is dedicated to technical questions around credibility, but it is also important to consider the strategic implications of credibility and commitment.

One concerning dynamic which may arise between TAI systems is commitment races (Kokotajlo, 2019a). In the game of Chicken (Table 1), both players have reason to commit to driving ahead as soon as possible, by conspicuously throwing out their steering wheels. Likewise, AI agents (or their human overseers) may want to make certain commitments (for instance, commitments to carry through with a threat if their demands aren't met) as soon as possible, in order to improve their bargaining positions. As with Chicken, this is a dangerous situation. Thus we would like to explore possibilities for curtailing such dynamics.
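
As a rough illustration of why such races are both tempting and dangerous, the sketch below tabulates Chicken outcomes under different commitment timings; the payoff values and the function name are illustrative assumptions of ours:

```python
def commitment_race_outcomes(T=4, R=3, S=1, P=0):
    """Chicken payoffs with T > R > S > P. Committing to D before the other
    player (e.g., visibly throwing out the steering wheel) forces their best
    response to be C, since S > P, so each side wants to commit first. But if
    both commit "as soon as possible" and end up committed simultaneously,
    the result is the crash (P, P) that both wanted to avoid."""
    return {
        "row commits to D first": (T, S),
        "column commits to D first": (S, T),
        "both commit to D simultaneously": (P, P),
        "neither commits, both cooperate": (R, R),
    }

for scenario, payoffs in commitment_race_outcomes().items():
    print(f"{scenario}: {payoffs}")
```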

  • At least in some cases, greater transparency seems to limit possibilities for agents to make dangerous simultaneous commitments. For instance, if one country is carefully monitoring another, it is likely to detect the other's efforts to build doomsday devices with which to make credible commitments. On the other hand, transparency seems to promote the ability to make dangerous commitments: I have less reason to throw out my steering wheel if you can't see me do it. Under what circumstances does mutual transparency mitigate or exacerbate commitment race dynamics, and how can this be used to design safer AI governance regimes?

  • What policies can make the success of greater transparency between TAI systems more likely (to the extent that this is desirable)? Are there path dependencies which must be addressed early on in the engineering of TAI systems so that open-source interactions are feasible?

Finally, in human societies, improvements in the ability to make credible commitments (e.g., to sign contracts enforceable by law) seem to have facilitated large gains from trade through more effective coordination, longer-term cooperation, and various other mechanisms (e.g., Knack and Keefer 1995; North 1991; Greif et al. 1994; Dixit 2003).

  • Which features of increased credibility promote good outcomes? For instance, laws typically don't allow a threatener to publicly request they be locked up if they don't carry out their threat. How much would societal outcomes change given indiscriminate ability to make credible commitments? Have there been situations where laws and norms around what one can commit to were different from what we see now, and what were the consequences?

  • How have past technological advancements changed bargaining between human actors? (Nuclear weapons are one obvious example of a technological advancement which considerably changed the bargaining dynamics between powerful actors.)

  • Open-source game theory, described in Section 3.2, is concerned with an idealized form of mutual auditing. What do historical cases tell us about the factors for the success of mutual auditing schemes? For instance, the Treaty on Open Skies, in which member states agreed to allow unarmed observation overflights in order to monitor one another's military activities (Britting and Spitzer, 2002), is a notable example of such a scheme. See also the literature on "confidence-building" measures in international security, e.g., Landau and Landau (1997) and references therein.

  • What are the main costs from increased commitment ability?

2.3 AI misalignment scenarios

Christiano (2018a) defines "the alignment problem" as "the problem of building powerful AI systems that are aligned with their operators". Related problems, as discussed by Bostrom (2014), include the "value loading" (or "value alignment") problem (the problem of ensuring that AI systems have goals compatible with the goals of humans), and the "control problem" (the general problem of controlling a powerful AI agent). Despite the recent surge in attention on AI risk, there are few detailed descriptions of what a future with misaligned AI systems might look like (but see Sotala 2018; Christiano 2019; Dai 2019 for examples). Better models of the ways in which misaligned AI systems could arise and how they might behave are important for our understanding of critical interactions among powerful actors in the future.

  • Is AI misalignment more likely to constitute a "near-miss" with respect to human values, or an extreme departure from human goals (cf. Bostrom (2003)'s "paperclip maximizer")?

  • Should we expect human-aligned AI systems to be able to cooperate with misaligned systems (cf. Shulman (2010))?

  • What is the likelihood that outright-misaligned AI agents will be deployed alongside aligned systems, versus the likelihood that aligned systems eventually become misaligned by failing to preserve their original goals (cf. discussion of "goal preservation" (Omohundro, 2008))?

  • What does the landscape of possible cooperation failures look like in each of the above scenarios?

2.4 Other directions

According to the offense-defense theory, the likelihood and nature of conflict depend on the relative efficacy of offensive and defensive security strategies (Jervis, 2017, 1978; Glaser, 1997). Technological progress seems to have been a critical driver of shifts in the offense-defense balance (Garfinkel and Dafoe, 2019), and the advent of powerful AI systems in strategic domains like computer security or military technology could lead to shifts in that balance.

  • To better understand the strategy landscape at the time of AI deployment, we would like to be able to predict technology-induced changes in the offense-defense balance and how they might affect the nature of conflict. One area of interest, for instance, is cybersecurity (e.g., whether leading developers of TAI systems would be able to protect against cyberattacks; cf. Zabel and Muehlhauser 2019).

Besides forecasting future dynamics, we are curious as to what lessons can be drawn from case studies of cooperation failures, and from policies which have mitigated or exacerbated such risks. For example: cooperation failures among powerful agents representing human values may be particularly costly when threats are involved. Examples of possible case studies include nuclear deterrence, ransomware (Gazet, 2010) and its implications for computer security, the economics of hostage-taking (Atkinson et al., 1987; Shortland and Roberts, 2019), and extortion rackets (Superti, 2009). Such case studies might investigate costs to the threateners, gains for the threateners, damages to third parties, factors that make agents more or less vulnerable to threats, existing efforts to combat extortionists, etc. While it is unclear how informative such case studies will be about interactions between TAI systems, they may be particularly relevant in human-in-the-loop scenarios (Section 6).

Lastly, in addition to case studies of cooperation failures themselves, it would be helpful for the prioritization of the research directions presented in this agenda to study how other instances of formal research have influenced (or failed to influence) critical real-world decisions. Particularly relevant examples include the application of game theory to geopolitics (see Weintraub (2017) for a review of game theory and decision-making in the Cold War), of cryptography to computer security, and of formal verification to the verification of software programs.

2.5 Potential downsides of research on cooperation failures

The remainder of this agenda largely concerns technical questions related to interactions involving TAI-enabled systems. A key strategic question running throughout is: What are the potential downsides to increased technical understanding in these areas? It is possible, for instance, that technical and strategic insights related to credible commitment increase rather than decrease the efficacy and likelihood of compellent threats. Moreover, the naive application of idealized models of rationality may do more harm than good; it has been argued, for instance by Kaplan (1991), that this was the case in some applications of formal methods to Cold War strategy. Thus the exploration of the dangers and limitations of technical and strategic progress is itself a critical research direction.

The next post in the sequence, "Sections 3 & 4: Credibility, Peaceful Bargaining Mechanisms", will come out Tuesday, December 17.

Acknowledgements & References


  1. The security dilemma refers to a situation in which actions taken by one state to improve its security (e.g., increasing its military capabilities) lead other states to act similarly. This results in an increase in tensions which all parties would prefer to avoid. ↩︎

  2. Notes by Lukas Gloor contributed substantially to the content of this section. ↩︎

  3. We refer the reader to Garfinkel (2018)'s review of recent developments in cryptography and their possible long-term consequences. The sections of Garfinkel (2018) particularly relevant to issues concerning the transparency of TAI systems and implications for cooperation are sections 3.3 (non-intrusive agreement verification), 3.5 (collective action problems), 4 (limitations and skeptical views on implications of cryptographic technology), and the appendix (relevance of progress in artificial intelligence). See also Kroll et al. (2016)'s review of potential applications of computer science tools, including software verification, cryptographic commitments, and zero-knowledge proofs, to the accountability of algorithmic decisions. Regarding the problem of ensuring that automated decision systems are "accountable and governable", they write: "We challenge the dominant position in the legal literature that transparency will solve these problems. Disclosure of source code is often neither necessary (because of alternative techniques from computer science) nor sufficient (because of the issues of analyzing code) to demonstrate the fairness of a process." ↩︎

  4. Parts of this subsection were developed from notes by Anni Leskelä. ↩︎