Cooperating with agents with different ideas of fairness, while resisting exploitation

There’s an idea from the latest MIRI workshop which I haven’t seen in informal theories of negotiation, and I want to know if this is a known idea.

(Old well-known ideas:)

Suppose a standard Prisoner’s Dilemma matrix where (3, 3) is the payoff for mutual cooperation, (2, 2) is the payoff for mutual defection, (0, 5) is the payoff if you cooperate and they defect, and (5, 0) is the payoff if you defect and they cooperate.

Suppose we’re going to play a PD iterated for four rounds. We have common knowledge of each other’s source code, so we can apply modal cooperation or similar means of reaching a binding ‘agreement’ without other enforcement methods.

If we mutually defect on every round, our net mutual payoff is (8, 8). This is a ‘Nash equilibrium’ because neither agent can unilaterally change its actions and thereby do better, if the opponent’s actions stay fixed. If we mutually cooperate on every round, the result is (12, 12), and this result is on the ‘Pareto boundary’ because neither agent can do better unless the other agent does worse. It would seem a desirable principle for rational agents (with common knowledge of each other’s source code / common knowledge of rationality) to find an outcome on the Pareto boundary, since otherwise they are leaving value on the table.

But (12, 12) isn’t the only possible result on the Pareto boundary. Suppose that running the opponent’s source code, you find that they’re willing to cooperate on three rounds and defect on one round, if you cooperate on every round, for a payoff of (9, 14) slanted their way. If they use their knowledge of your code to predict you refusing to accept that bargain, they will defect on every round for the mutual payoff of (8, 8).
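
For concreteness, here’s a minimal Python sketch of the payoff arithmetic above. The stage-game matrix and the four rounds come from the setup; the move strings and function name are just illustrative bookkeeping.

```python
# Stage-game payoffs from the setup: (my_move, their_move) -> (me, them).
PAYOFFS = {
    ('C', 'C'): (3, 3),
    ('D', 'D'): (2, 2),
    ('C', 'D'): (0, 5),
    ('D', 'C'): (5, 0),
}

def total(my_moves, their_moves):
    """Sum the stage-game payoffs over an iterated match."""
    me = them = 0
    for pair in zip(my_moves, their_moves):
        a, b = PAYOFFS[pair]
        me, them = me + a, them + b
    return (me, them)

print(total('CCCC', 'CCCC'))  # (12, 12): mutual cooperation, on the Pareto boundary
print(total('DDDD', 'DDDD'))  # (8, 8):   mutual defection, the Nash equilibrium
print(total('CCCC', 'CCCD'))  # (9, 14):  the bargain slanted their way
```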

I would consider it obvious that a rational agent should refuse this unfair bargain. Otherwise agents with knowledge of your source code will offer you only this bargain, instead of the (12, 12) of mutual cooperation on every round; they will exploit your willingness to accept a result on the Pareto boundary in which almost all of the gains from trade go to them.

(Newer ideas:)

Generalizing: Once you have a notion of a ‘fair’ result (in this case, (12, 12)), then an agent which accepts any outcome in which it does worse than the fair result, while the opponent does better, is ‘exploitable’ relative to this fair bargain. As with the Nash equilibrium, the only way you should do worse than ‘fair’ is if the opponent also does worse.

So we wrote down on the whiteboard an attempted definition of unexploitability in cooperative games, as follows:

“Suppose we have a [magical] definition N of a fair outcome. A rational agent should only do worse than N if its opponent does worse than N, or else [if bargaining fails] should only do worse than the Nash equilibrium if its opponent does worse than the Nash equilibrium.” (Note that this definition precludes giving in to a threat of blackmail.)
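
As a sketch, the whiteboard definition can be written as an acceptance predicate. This assumes the ‘magical’ fair point N and the Nash point are handed to us as payoff pairs; the names are illustrative, not part of the original definition.

```python
def acceptable(outcome, fair, nash):
    """Unexploitability check for a candidate bargain.

    All arguments are (my_payoff, their_payoff) pairs; `fair` is the
    magical fair point N and `nash` is the disagreement point.
    """
    me, them = outcome
    # I may do worse than N only if the opponent also does worse than N.
    if me < fair[0] and them >= fair[1]:
        return False
    # Failing that, I may do worse than Nash only if the opponent also
    # does worse than Nash -- this is what rules out giving in to blackmail.
    if me < nash[0] and them >= nash[1]:
        return False
    return True
```

On the running example, with fair = (12, 12) and nash = (8, 8), this accepts (12, 12), rejects the exploitive (9, 14), and never sinks below Nash unless the opponent lands below Nash too.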

(Key possible-innovation:)

It then occurred to me that this definition opened the possibility for other, intermediate bargains between the ‘fair’ solution on the Pareto boundary and the Nash equilibrium.

Suppose the other agent has a slightly different definition of fairness, and they think that what you consider to be a payoff of (12, 12) favors you too much; they think that you’re the one making an unfair demand. They’ll refuse (12, 12) with the same feeling of indignation that you would apply to (9, 14).

Well, if you give in to an arrangement with an expected payoff of, say, (11, 13) as you evaluate payoffs, then you’re giving other agents an incentive to skew their definitions of fairness.

But it does not create poor incentives (AFAICT) to instead accept a bargain with an expected payoff of, say, (10, 11), which the other agent thinks is ‘fair’. Though they’re sad that you refused the truly fair outcome of (as you count utilons) (11, 13), and that you couldn’t reach the Pareto boundary together, still, this is better than the Nash equilibrium of (8, 8). And though you think the bargain is unfair, you are not creating an incentive for others to exploit you: by insisting on this definition of fairness, the other agent has done worse for themselves than they would have under (12, 12). The other agent probably thinks that (10, 11) is ‘unfair’, slanted your way, but they likewise accept that this does not create bad incentives, since you did worse than their ‘fair’ outcome of (11, 13).
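
Continuing the sketch above (and reusing the `acceptable` predicate from it), both agents can run the same test with their own fair points. All payoffs below are in your utilons; swapping coordinates to stand in for the opponent’s view is an assumption of the sketch, not something the argument depends on.

```python
fair_mine   = (12, 12)   # my notion of the fair point, as (me, them)
fair_theirs = (11, 13)   # their notion of the fair point, in the same units
nash        = (8, 8)

def swap(pair):
    """View a payoff pair from the other agent's side."""
    return (pair[1], pair[0])

for outcome in [(12, 12), (11, 13), (10, 11)]:
    mine_ok   = acceptable(outcome, fair_mine, nash)
    theirs_ok = acceptable(swap(outcome), swap(fair_theirs), swap(nash))
    print(outcome, mine_ok, theirs_ok)
# (12, 12) True  False  -- they refuse what I call fair
# (11, 13) False True   -- I refuse what they call fair
# (10, 11) True  True   -- unfair by both lights, but exploits neither
```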

There could be many acceptable negotiating equilibria between what you think is the ‘fair’ point on the Pareto boundary and the Nash equilibrium, so long as each step down in what you think is ‘fairness’ reduces the total payoff to the other agent, even if it reduces your own payoff even more. This resists exploitation and avoids creating an incentive for claiming that you have a different definition of fairness, while still holding open the possibility of some degree of cooperation with agents who honestly disagree with you about what’s fair and are trying to avoid exploitation themselves.

This translates into an informal principle of negotiations: Be willing to accept unfair bargains, but only if (you make it clear) both sides are doing worse than what you consider to be a fair bargain.

I haven’t seen this advocated before, even as an informal principle of negotiations. Is it in the literature anywhere? Someone suggested Schelling might have said it, but didn’t provide a chapter number.

ADDED:

Clarification 1: Yes, utilities are invariant up to a positive affine transformation, so there’s no canonical way to split utilities evenly. Hence the part about assuming a magical definition N that gives us the fair outcome. If we knew the exact properties we wanted from this magical solution, even while still treating it as magical, that might give us some idea of what N should be, too.

Clarification 2: The way this might work is that you pick a series of increasingly unfair-to-you, increasingly worse-for-the-other-player outcomes whose first element is what you deem the fair Pareto outcome: (100, 100), (98, 99), (96, 98). Perhaps stop well short of Nash if the skew becomes too extreme, and drop to Nash as the last resort. The other agent does the same, starting with their own ideal of fairness on the Pareto boundary. Unless one of you has a completely skewed idea of fairness, you should be able to meet somewhere in the middle. Both of you will do worse against a fixed opponent’s strategy by unilaterally adopting more self-favoring ideas of fairness, and both of you will do worse in expectation against potentially exploitive opponents by unilaterally adopting looser ideas of fairness. This gives everyone an incentive to obey the Galactic Schelling Point and be fair about it. You should not pick the descending sequence in an agent-dependent way that incentivizes, at cost to you, skewed claims about fairness.
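
One way this might look in code: each agent writes down its descending ladder of offers, and the two settle on the first outcome both ladders contain. The exact-match rule is a crude stand-in for the source-code-mediated agreement, and the opponent’s ladder and the Nash numbers here are made up for illustration.

```python
my_ladder    = [(100, 100), (98, 99), (96, 98)]  # (me, them); my fair point first
their_ladder = [(99, 101), (98, 99), (97, 97)]   # hypothetical opponent ladder
nash         = (90, 90)                          # hypothetical disagreement point

def settle(mine, theirs, nash):
    """Return the first offer both ladders share; drop to Nash as the last resort."""
    for offer in mine:
        if offer in theirs:
            return offer
    return nash

print(settle(my_ladder, their_ladder, nash))  # -> (98, 99)
```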

Clarification 3: You must take into account the other agent’s costs and other opportunities when ensuring that the net outcome, in terms of final utilities, is worse for them than the reward offered for ‘fair’ cooperation. Offering them the chance to buy half as many paperclips at a lower, less fair price does no good if they can go next door, get the same offer again, and buy the same number of paperclips at a lower total price.
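
A last sketch for Clarification 3, with made-up numbers: the test has to run on the other agent’s net utility, outside options included, not just on the terms of your own offer.

```python
fair_reward_to_them = 13  # what they'd get from 'fair' cooperation with me
my_unfair_offer     = 11  # what my concession pays them directly
their_outside_gain  = 3   # what they can still collect next door

# The concession only disciplines skewed fairness claims if their total
# take falls short of the fair-cooperation reward; here 11 + 3 >= 13,
# so this concession would actually reward the skewed claim.
print(my_unfair_offer + their_outside_gain < fair_reward_to_them)  # -> False
```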