Sections 3 & 4: Credibility, Peaceful Bargaining Mechanisms

This post is part of the sequence version of the Effective Altruism Foundation’s research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence.

3 Credibility

Credibility is a central issue in strategic interaction. By credibility, we mean whether one agent has reason to believe that another will do what they say they will do. Credibility (or lack thereof) plays a crucial role in the efficacy of contracts (Fehr et al., 1997; Bohnet et al., 2001), negotiated settlements for avoiding destructive conflict (Powell, 2006), and commitments to carry out (or refuse to give in to) threats (e.g., Kilgour and Zagare 1991; Konrad and Skaperdas 1997).

In game theory, the fact that Nash equilibria (Section 1.1) sometimes involve non-credible threats motivates a refined solution concept called subgame perfect equilibrium (SPE). An SPE is a Nash equilibrium of an extensive-form game in which a Nash equilibrium is also played at each subgame. In the threat game depicted in Figure 1, “carry out” is not played in an SPE, because the threatener has no reason to carry out the threat once the threatened party has refused to give in; that is, “carry out” is not a Nash equilibrium of the subgame played after the threatened party refuses to give in.

So in an SPE-based analysis of one-shot threat situations between rational agents, threats are never carried out because they are not credible (i.e., they violate subgame perfection).
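To make the backward-induction argument concrete, here is a minimal sketch of a one-shot threat game in the spirit of Figure 1. The payoff numbers are illustrative assumptions rather than the figure’s actual values: giving in transfers a prize worth 1 to the threatener, and carrying out the threat costs the threatener 0.5 while inflicting a harm of 10 on the target.

```python
def subgame_perfect_outcome(prize=1.0, threat_cost=0.5, harm=10.0):
    """Backward induction on a toy one-shot threat game.

    Payoffs are (threatener, target); all numbers are illustrative assumptions.
    """
    # Subgame after "not give in": the threatener picks the action that
    # maximizes their own payoff.
    carry_out = (-threat_cost, -harm)
    not_carry_out = (0.0, 0.0)
    continuation = carry_out if carry_out[0] > not_carry_out[0] else not_carry_out

    # First stage: the target compares giving in with that continuation.
    give_in = (prize, -prize)
    return give_in if give_in[1] > continuation[1] else continuation


print(subgame_perfect_outcome())
# (0.0, 0.0): carrying out is costly, so the subgame-perfect continuation
# after "not give in" is "not carry out"; anticipating this, the target
# refuses, and the threat is never executed.
```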

However, agents may establish credibility in the case of repeated interactions by repeatedly making good on their claims (Sobel, 1985). Moreover, despite the fact that carrying out a threat in the one-shot threat game violates subgame perfection, it is a well-known result from behavioral game theory that humans typically refuse unfair splits in the Ultimatum Game [1] (Güth et al., 1982; Henrich et al., 2006), which is equivalent to carrying out the threat in the one-shot threat game. So executing commitments which are irrational (by the SPE criterion) may still be a feature of human-in-the-loop systems (Section 6), or perhaps of systems which have acquired humanlike game-theoretic heuristics by being trained in multi-agent environments (Section 5.2). Lastly, threats may become credible if the threatener has credibly committed to carrying out the threat (in the case of the game in Fig. 1, this means convincing the opponent that they have removed the option to “Not carry out”, or made it costly). There is a considerable game-theoretic literature on credible commitment, both on how credibility can be achieved (Schelling, 1960) and on the analysis of games under the assumption that credible commitment is possible (Von Stackelberg, 2010; Nash, 1953; Muthoo, 1996; Bagwell, 1995).

3.1 Commitment capabilities

It is possible that TAI systems may be relatively transparent to one another; capable of self-modifying or constructing sophisticated commitment devices; and capable of making various other “computer-mediated contracts” (Varian, 2010); see also the lengthy discussions of potential implications of cryptographic technology for credibility in Garfinkel (2018) and Kroll et al. (2016), referenced in Section 2, Footnote 3.
We want to understand how plausible changes in the ability to make credible commitments affect risks from cooperation failures.

  • In what ways does artificial intelligence make credibility more difficult, rather than less so? For instance, AIs lack evolutionarily established mechanisms (like credible signs of anger; Hirshleifer 1987) for signaling their intentions to other agents.

  • The credibility of an agent’s stated commitments likely depends on how interpretable [2] that agent is to others. What are the possible ways in which interpretability may develop, and how does this affect the propensity to make commitments? For instance, in trajectories where AI agents are increasingly opaque to their overseers, will these agents be motivated to make commitments while they are still interpretable enough to overseers that these commitments are credible?

  • In the case of training regimes involving the imitation of human exemplars (see Section 6), can humans also make credible commitments on behalf of the AI system which is imitating them?

3.2 Open-source game theory

Tennenholtz (2004) introduced program games, in which players submit programs that have access to the source codes of their counterparts. Program games provide a model of interaction under mutual transparency. Tennenholtz showed that in the Prisoner’s Dilemma, both players submitting Algorithm 1 is a program equilibrium (that is, a Nash equilibrium of the corresponding program game). Thus agents may have incentive to participate in program games, as these promote more cooperative outcomes than the corresponding non-program games.
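As a minimal sketch of this result (using the usual illustrative Prisoner’s Dilemma payoffs; the exact numbers are assumptions, not taken from the paper), the program below captures the spirit of Algorithm 1: cooperate if and only if the counterpart’s source code is syntactically identical to one’s own.

```python
import inspect

PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 4),
           ("D", "C"): (4, 0), ("D", "D"): (1, 1)}

def clique_bot(my_source: str, their_source: str) -> str:
    # Cooperate iff the counterpart's program is syntactically identical.
    return "C" if my_source == their_source else "D"

def defect_bot(my_source: str, their_source: str) -> str:
    return "D"

def play(program1, program2):
    # Each submitted program receives both source codes, as in a program game.
    source1 = inspect.getsource(program1)
    source2 = inspect.getsource(program2)
    return PAYOFFS[(program1(source1, source2), program2(source2, source1))]

print(play(clique_bot, clique_bot))  # (3, 3): mutual cooperation
print(play(clique_bot, defect_bot))  # (1, 1): deviating to an unconditional
                                     # defector triggers defection, so neither
                                     # player gains by deviating in this toy setting
```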

For these reasons, program games may be helpful to our understanding of interactions among advanced AIs.

Other models of strategic interaction between agents who are transparent to one another have been studied (more on this in Section 5.1); following Critch (2019), we will call this broader area open-source game theory. Game theory with source-code transparency has been studied by Fortnow 2009; Halpern and Pass 2018; LaVictoire et al. 2014; Critch 2019; Oesterheld 2019, and models of multi-agent learning under transparency are given by Brafman and Tennenholtz (2003); Foerster et al. (2018). But open-source game theory is in its infancy and many challenges remain [3].

  • The study of program games has, for the most part, focused on the simple setting of two-player, one-shot games. How can (cooperative) program equilibrium strategies be automatically constructed in general settings?

  • Under what circumstances would agents be incentivized to enter into open-source interactions?

  • How can program equilibrium be made to promote more efficient outcomes even in cases of incomplete access to counterparts’ source codes?

    • As a toy example, consider two robots playing a single-shot program prisoner’s dilemma, in which their respective moves are indicated by a simultaneous button press. In the absence of verification that the output of the source code actually causes the agent to press the button, it is possible that the output of the program does not match the actual physical action taken. What are the prospects for closing such “credibility gaps”? The literature on (physical) zero-knowledge proofs (Fisch et al., 2014; Glaser et al., 2014) may be helpful here.

    • See also the discussion in Section 5.1 of multi-agent learning under varying degrees of transparency.

4 Peaceful bargaining mechanisms

In other sections of the agenda, we have proposed research directions for improving our general understanding of cooperation and conflict among TAI systems. In this section, on the other hand, we consider several families of strategies designed to actually avoid catastrophic cooperation failure. The idea of such “peaceful bargaining mechanisms” is, roughly speaking, to find strategies which are 1) peaceful (i.e., avoid conflict) and 2) preferred by rational agents to non-peaceful strategies [4].

We are not confident that peaceful bargaining mechanisms will be used by default. First, in human-in-the-loop scenarios, the bargaining behavior of TAI systems may be dictated by human overseers, whom we do not expect to systematically use rational bargaining strategies (Section 6.1). Even in systems whose decision-making is more independent of humans, evolution-like training methods could give rise to non-rational, human-like bargaining heuristics (Section 5.2). Even among rational agents, because there may be many cooperative equilibria, additional mechanisms for ensuring coordination may be necessary to avoid conflict arising from the selection of different equilibria (see Example 4.1.1). Finally, the examples in this section suggest that there may be path-dependencies in the engineering of TAI systems (for instance, in making certain aspects of TAI systems more transparent to their counterparts) which determine the extent to which peaceful bargaining mechanisms are available.

In the first subsection, we present some directions for identifying mechanisms which could implement peaceful settlements, drawing largely on existing ideas in the literature on rational bargaining. In the second subsection, we sketch a proposal, called surrogate goals, for how agents might mitigate downsides from threats by effectively modifying their utility functions.

4.1 Rational crisis bargaining

As discussed in Section 1.1, there are two standard explanations for war among rational agents: credibility (the agents cannot credibly commit to the terms of a peaceful settlement) and incomplete information (the agents have differing private information which makes each of them optimistic about their prospects of winning, and incentives not to disclose or to misrepresent this information).

Fey and Ramsay (2011) model crisis bargaining under incomplete information. They show that in 2-player crisis bargaining games with voluntary agreements (players are able to reject a proposed settlement if they think they will be better off going to war); mutually known costs of war; unknown types $t_1, t_2$ measuring the players’ military strength; a commonly known function $p(t_1, t_2)$ giving the probability of player 1 winning when the true types are $t_1, t_2$; and a common prior over types, a peaceful settlement exists if and only if the costs of war are sufficiently large. Such a settlement must compensate each player’s strongest possible type by the amount they expect to gain in war.
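As a toy illustration of this existence condition (all numbers below are made-up assumptions, and the prize in dispute is normalized to 1), one can check whether a settlement can simultaneously compensate each player’s strongest possible type for its expected value of war:

```python
types = [0, 1]                # 0 = militarily weak, 1 = strong
prior = {0: 0.5, 1: 0.5}      # common prior over each player's type

def p_win(t1, t2):
    # Commonly known probability that player 1 wins a war at types (t1, t2).
    return {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.8, (1, 1): 0.5}[(t1, t2)]

def peaceful_settlement_exists(c1, c2):
    # Expected war payoff of each player's *strongest possible type*,
    # with the disputed prize normalized to 1.
    v1 = max(sum(prior[t2] * p_win(t1, t2) for t2 in types) - c1 for t1 in types)
    v2 = max(sum(prior[t1] * (1 - p_win(t1, t2)) for t1 in types) - c2 for t2 in types)
    # A split of the unit prize can compensate both strongest types
    # only if their war values sum to at most 1.
    return v1 + v2 <= 1

print(peaceful_settlement_exists(c1=0.05, c2=0.05))  # False: war is too cheap
print(peaceful_settlement_exists(c1=0.40, c2=0.40))  # True: costs are large enough
```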

Potential problems facing the resolution of conflict in such cases include:

  • Reliance on a common prior and an agreed-upon win-probability model $p(t_1, t_2)$. If players disagree on these quantities it is not clear how bargaining will proceed. How can players come to an agreement on these quantities, without generating a regress of bargaining problems? One possibility is to defer to a mutually trusted party to estimate these quantities from publicly observed data. This raises its own questions. For example, what conditions must a third party satisfy so that their judgements are trusted by each player? (Cf. Kydd (2003), Rauchhaus (2006), and sources therein on mediation.)

  • The exact costs of conflict to each player are likely to be private information as well. The assumption of a common prior, or the ability to agree upon a prior, may be particularly unrealistic in the case of costs.

Recall that another form of cooperation failure is the simultaneous commitment to strategies which lead to catastrophic threats being carried out (Section 2.2). Such “commitment games” may be modeled as a game of Chicken (Table 1), where Defection corresponds to making commitments to carry out a threat if one’s demands are not met, while Cooperation corresponds to not making such commitments. Thus we are interested in bargaining strategies which avoid mutual Defection in commitment games. Such a strategy is sketched in Example 4.1.1.


Example 4.1.1 (Careful commitments).

Consider two agents with access to commitment devices. Each may decide to commit to carrying out a threat if their counterpart does not forfeit some prize (of value 1 to each party, say). As before, call this decision $D$. However, they may instead commit to carrying out their threat only if their counterpart does not agree to a certain split of the prize (say, a split in which Player 1 gets $p$ and Player 2 gets $1-p$). Call this commitment $C_p$, for “cooperating with split $p$”.

When would an agent prefer to make the more sophisticated commitment $C_p$? In order to say whether an agent expects to do better by making $C_p$, we need to be able to say how well they expect to do in the “original” commitment game, where their choice is between $C$ and $D$. This is not straightforward, as Chicken admits three Nash equilibria. However, it may be reasonable to regard the players’ expected values under the mixed-strategy Nash equilibrium as the values they expect from playing this game. Thus, the split $p$ could be chosen such that $p$ and $1-p$ exceed Player 1’s and Player 2’s respective expected payoffs under the mixed-strategy Nash equilibrium. Many such splits may exist. This calls for a selection among the candidate splits, for which we may turn to a bargaining solution concept such as Nash (Nash, 1950) or Kalai-Smorodinsky (Kalai and Smorodinsky, 1975). If each player uses the same bargaining solution, then each will prefer committing to honor the resulting split of the prize over playing the original threat game, and carried-out threats will be avoided.
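A minimal numeric sketch of this construction is below. The Chicken payoffs are assumptions chosen for illustration (a prize worth 1, with mutual carried-out threats costing each player 2); they are not taken from a table in the agenda.

```python
import numpy as np

# Rows/columns: index 0 = C (no threat commitment), 1 = D (commit to threat).
# Player 1's payoffs; the game is symmetric.
U1 = np.array([[0.5,  0.0],
               [1.0, -2.0]])

# Symmetric mixed-strategy Nash equilibrium: each player commits (plays D)
# with the probability q that makes the opponent indifferent between C and D.
q = (U1[1, 0] - U1[0, 0]) / ((U1[1, 0] - U1[0, 0]) + (U1[0, 1] - U1[1, 1]))
disagreement = (1 - q) * U1[0, 0] + q * U1[0, 1]   # each player's expected payoff
print(q, disagreement)          # 0.2 0.4

# Any split p of the prize with p > disagreement and 1 - p > disagreement is
# preferred by both players to the original commitment game. The Nash
# bargaining solution (with the mixed equilibrium as disagreement point)
# selects the split maximizing the product of the players' gains.
grid = np.linspace(0.0, 1.0, 1001)
feasible = [p for p in grid if p > disagreement and 1 - p > disagreement]
p_star = max(feasible, key=lambda p: (p - disagreement) * (1 - p - disagreement))
print(p_star)                   # 0.5: the symmetric split, as expected here
```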

Of course, this mechanism is brittle in that it relies on a single take-it-or-leave-it proposal, which will fail if the agents use different bargaining solutions or have slightly different estimates of each player’s payoffs. However, this could be generalized to a commitment to a more complex and robust bargaining procedure, such as an alternating-offers procedure (Rubinstein 1982; Binmore et al. 1986; see Muthoo (1996) for a thorough review of such models) or the sequential higher-order bargaining procedure of Van Damme (1986).

Finally, note that in the case where there is uncertainty over whether each player has a commitment device, sufficiently high stakes will mean that players with commitment devices will still have Chicken-like payoffs. So this model can be straightforwardly extended to cases where the credibility of a threat comes in degrees. An example of a simple bargaining procedure to commit to is the Bayesian version of the Nash bargaining solution (Harsanyi and Selten, 1972).


Lastly, see Kydd (2010)’s review of potential applications of the literature on rational crisis bargaining to resolving real-world conflict.

4.2 Surrogate goals [5]

In this section we introduce surrogate goals, a recent [6] proposal for limiting the downsides from cooperation failures (Baumann, 2017, 2018) [7]. We will focus on the phenomenon of coercive threats (for game-theoretic discussion see Ellsberg (1968); Harrenstein et al. (2007)), though the technique is more general. The proposal is: in order to deflect threats against the things it terminally values, an agent adopts a new (surrogate) goal [8]. This goal may still be threatened, but threats carried out against this goal are benign. Furthermore, the surrogate goal is chosen such that it incentivizes at most marginally more threats.

In Example 4.2.1, we give an example of an operationalization of surrogate goals in a threat game.


Example 4.2.1 (Surrogate goals via representatives)

Consider the game between Threatener and Target, where Threatener makes a demand of Target, such as giving up some resource. Threatener can, at some cost, commit to carrying out a threat against Target. Target can likewise commit to giving in to such threats or not. A simple model of this game is given in the payoff matrix in Table 3 (a normal-form variant of the threat game discussed in Section 3 [9]).

Unfortunately, players may sometimes play (Threaten, Not give in). For example, this may be due to uncoordinated selection among the two pure-strategy Nash equilibria, (Threaten, Give in) and (Not threaten, Not give in).

But suppose that, in the above scenario, Target is capable of certain kinds of credible commitments, or otherwise is represented by an agent, Target’s Representative, who is. Then Target or Target’s Representative may modify its goal architecture to adopt a surrogate goal whose fulfillment is not actually valuable to that player, and which is slightly cheaper for Threatener to threaten. (More generally, Target could modify itself to commit to acting as if it had a surrogate goal in threat situations.) If this modification is credible, then it is rational for Threatener to threaten the surrogate goal, obviating the risk of threats against Target’s true goals being carried out.

As a first pass at a formal analysis: adopting an additional threatenable goal adds a column to the payoff matrix, as in Table 4. This column weakly dominates the old threat column (i.e., the threat against Target’s true goals), so a rational player would never threaten Target’s true goal. Target does not itself care about the new type of threat being carried out, so for Target, the utilities are given by the blue numbers in Table 4.
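A toy version of this dominance check is below. The payoff entries are illustrative assumptions, not the actual entries of Tables 3 and 4: the demanded resource is worth 1, committing to a threat against the true goal costs Threatener 0.10, the surrogate goal is slightly cheaper to threaten (0.09), and a carried-out threat against the true goal costs Target 10 while one against the surrogate costs Target’s true goals nothing.

```python
import numpy as np

# Rows: Target's action (0 = Give in, 1 = Not give in).
# Columns: Threatener's action (0 = Not threaten, 1 = Threaten true goal,
#                               2 = Threaten surrogate goal).
U_threatener = np.array([[0.0, 1 - 0.10, 1 - 0.09],
                         [0.0,    -0.10,    -0.09]])

# Target's *true* utilities (the "blue numbers"): carried-out threats
# against the surrogate goal are benign for Target's true goals.
U_target = np.array([[0.0,  -1.0, -1.0],
                     [0.0, -10.0,  0.0]])

# The new column weakly dominates the old threat column for Threatener,
# so a rational Threatener never threatens the true goal.
print(bool(np.all(U_threatener[:, 2] >= U_threatener[:, 1])))   # True

# Target's worst-case true payoff when the true goal can be threatened,
# versus when a rational Threatener only targets the surrogate goal.
print(U_target[:, [0, 1]].min(), U_target[:, [0, 2]].min())     # -10.0 -1.0
```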


This application of surrogate goals, in which a threat game is already underway but players have the opportunity to self-modify or create representatives with surrogate goals, is only one possibility. Another is to consider the adoption of a surrogate goal as the choice of an agent (before it encounters any threat) to commit to acting according to a new utility function, rather than the one which represents their true goals. This could be modeled, for instance, as an extensive-form game of incomplete information in which the agent decides which utility function to commit to by reasoning about (among other things) what sorts of threats having that utility function might provoke. Such models have a signaling game component, as the player must successfully signal to distrustful counterparts that it will actually act according to the surrogate utility function when threatened. The game-theoretic literature on signaling (Kreps and Sobel, 1994) and the literature on inferring preferences in multi-agent settings (Yu et al., 2019; Lin et al., 2019) may suggest useful models.

The implementation of surrogate goals faces a number of obstacles. Some problems and questions include:

  • The surrogate goal must be credible, i.e., threateners must believe that the agent will act consistently with the stated surrogate goal. TAI systems are unlikely to have easily identifiable goals, and so must signal their goals to others through their actions. This raises questions both of how to signal so that the surrogate goal is at all credible, and of how to signal in a way that doesn’t interfere too much with the agent’s true goals. One possibility in the context of Example 4.2.1 is the use of zero-knowledge proofs (Goldwasser et al., 1989; Goldreich and Oren, 1994) to reveal Target’s surrogate goal (but not how they will actually respond to a threat) to Threatener.

  • How does an agent come to adopt an appropriate surrogate goal, practically speaking? For instance, how can advanced ML agents be trained to reason correctly about the choice of surrogate goal?

  • The reasoning which leads to the adoption of a surrogate goal might in fact lead to iterated surrogate goals. That is, after having adopted a surrogate goal, Target may adopt a surrogate goal to protect that surrogate goal, and so on. Given that Threatener must be incentivized to threaten a newly adopted surrogate goal rather than the previous goal, this may result in Target giving up much more of its resources than it would if only the initial surrogate goal were threatened.

  • How do surrogate goals interact with open-source game theory (Sections 3.2 and 5.1)? For instance, do open-source interactions automatically lead to the use of surrogate goals in some circumstances?

  • In order to deflect threats against the original goal, the adoption of a surrogate goal must lead to a similar distribution of outcomes as the original threat game (modulo the need to be slightly cheaper to threaten). Informally, Target should expect Target’s Representative to have the same propensity to give in as Target; how this is made precise depends on the details of the formal surrogate goals model.

A crucial step in the investigation of surrogate goals is the development of appropriate theoretical models. This will help to gain traction on the problems listed above.

The next post in the sequence, “Sections 5 & 6: Contemporary AI Architectures, Humans in the Loop”, will come out on Thursday, December 19.

Acknowledgements & References


  1. The Ultimatum Game is the 2-player game in which Player 1 proposes a split of a fixed sum of money, and Player 2 accepts or rejects the split. If Player 2 accepts, each player gets the proposed amount, whereas if they reject, neither player gets anything. The unique SPE of this game is for Player 1 to propose as little as possible, and for Player 2 to accept the offer. ↩︎

  2. See Lipton (2016); Doshi-Velez and Kim (2017) for recent discussions of interpretability in machine learning. ↩︎

  3. See also Section 5.1 for discussion of open-source game theory in the context of contemporary machine learning, and Section 2 for policy considerations surrounding the implementation of open-source interaction. ↩︎

  4. More precisely, we borrow the term “peaceful bargaining mechanisms” from Fey and Ramsay (2009). They consider mechanisms for crisis bargaining between two countries. Their mechanisms are defined by the value of the resulting settlement to each possible type for each player, and the probability of war under that mechanism for each profile of types. They call a “peaceful mechanism” one in which the probability of war is 0 for every profile of types. ↩︎

  5. This subsection is based on notes by Caspar Oesterheld. ↩︎

  6. Although the idea of modifying preferences in order to get better outcomes for each player was discussed earlier by Raub (1990), under the name “preference adaptation”, as a way of promoting cooperation in the one-shot Prisoner’s Dilemma. ↩︎

  7. See also the discussion of surrogate goals and related mechanisms in Christiano and Wiblin (2019). ↩︎

  8. Modifications of an agent’s utility function have been discussed in other contexts. Omohundro (2008) argues that “AIs will try to preserve their utility functions” and “AIs will try to prevent counterfeit utility”. Everitt et al. (2016) present a formal model of a reinforcement learning agent who is able to modify its utility function, and study conditions under which agents self-modify. ↩︎

  9. Note that the normal-form representation in Table 3 is over-simplifying; it assumes the credibility of threats, which we saw in Section 3 to be problematic. For simplicity of exposition, we will nevertheless focus on this normal-form game in this section. ↩︎