Pavlov Generalizes

Sarah Constantin recently wrote about the Pavlov strategy for iterated Prisoner’s Dilemma. She points out that it deserves to be better known, given that it has similar virtues to Tit-for-Tat (TfT). I strongly agree.

  • Much of my thinking about puzzles in decision theory such as cooperation/coordination/bargaining and logical causality (through the lens of logical time) has been shaped by an assumption that TfT-type behavior is “correct” in some sense, and needs to be appropriately generalized. These intuitions have been formed without an awareness of Pavlov, and so are worth reconsidering.

  • It turns out that Pavlov-type behavior is much easier to generalize than TfT. Pavlov-esque strategies only require seeing when things are going well or poorly for you. TfT-esque strategies require an agent to understand what cooperation vs defection means in a particular game. This requires the agent to understand the wider rules of the game (not just the payoff on a given round), locate the other agents involved (distinguishing them from the environment), infer the values of the other agents, define a set of common cooperative norms which can be coordinated on, and figure out whether other agents have followed them! Several of these elements are both difficult in practice and difficult to define in principle.

In this post, I would like to explain how to generalize Pavlov to a wide set of games (achieving some surprisingly strong, but not actually very good, coordination properties), and also convey some intuitions about Pavlov-style vs TfT-style coordination.

Generalizing Pavlov

The paper Achieving Pareto Optimality Through Distributed Learning by Marden, Young, and Pao describes a strategy which allows players in an iterated game to eventually converge to a Pareto-optimal action profile with high probability. In fact, the strategy described converges to action profiles which maximize the sum of all the utilities! That should sound too good to be true. Some big caveats are coming.

The strategy is, roughly, as follows (a code sketch follows the list below). Agents have two different states: content and discontent. When content, the agent copies its own action from the previous round, whatever it was. When discontent, the agent plays randomly.

An agent flips between content and discontent based on how well things are going for it. There is a parameter, the experimentation rate (e.r.), which determines how likely an agent is to flip states.

  • Discontent: Discontent agents look at how good the previous round was for them. They change to “content” based on the payoff of the previous round and the e.r.: the higher the payoff and the higher the e.r., the more likely they are to become content. A low e.r. makes them very picky: they remain discontent the majority of the time, and lowering the e.r. makes them much less likely to become content after seeing low payoffs, but only moderately less likely after high payoffs.

  • Content, consistent payoff: When content, the agent’s behavior depends on whether the payoff is the same as in the twice-previous round. If the previous and twice-previous rounds had the same payoffs, content agents remain content, and are very likely to take the same action as on the previous round; they take a different action at the experimentation rate.

  • Content, inconsistent payoff: If the previous round’s payoff was not consistent with the twice-previous round’s, a content agent is likely to become discontent (remaining content with probability increasing in the payoff and the e.r.).
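To make the transition rules concrete, here is a minimal Python sketch of an agent following the content/discontent strategy. It assumes payoffs normalized to [0, 1], and the particular transition probabilities (powers of the e.r.) are illustrative stand-ins rather than the exact formulas from the Marden, Young, and Pao paper.

```python
import random

class ContentDiscontentAgent:
    """Rough sketch of the content/discontent strategy described above.

    Assumes payoffs lie in [0, 1]. The transition probabilities below
    (powers of the experimentation rate) are illustrative stand-ins, not
    the exact formulas from Marden, Young, and Pao.
    """

    def __init__(self, actions, er=0.1):
        self.actions = actions       # the actions available in the game
        self.er = er                 # experimentation rate
        self.content = False
        self.last_action = None
        self.last_payoff = None

    def act(self):
        if not self.content or self.last_action is None:
            # Discontent (or first-round) agents play randomly.
            self.last_action = random.choice(self.actions)
        elif random.random() < self.er:
            # Content agents experiment at the experimentation rate.
            self.last_action = random.choice(self.actions)
        # Otherwise: copy our own action from the previous round.
        return self.last_action

    def observe(self, payoff):
        if not self.content:
            # Become content with probability increasing in payoff and e.r.
            if random.random() < self.er ** (1.0 - payoff):
                self.content = True
        elif self.last_payoff is not None and payoff != self.last_payoff:
            # Inconsistent payoff: likely to become discontent, remaining
            # content with probability increasing in payoff and e.r.
            if random.random() >= self.er ** (1.0 - payoff):
                self.content = False
        self.last_payoff = payoff
```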

The nice property this strategy has is that, for a low e.r., players end up eventually taking an efficient set of actions with high probability. It accomplishes this without agents knowing anything about each other; everyone acts on the payoffs they receive alone.

This is kind of nice, but looking at the strategy, we can see that the convergence rate will be very poor. The reason it works is that all agents have to become content at once in order to stay that way with significant probability. If only some agents become content, then the payoffs of the game will remain inconsistent, due to the agents who are still randomizing. As e.r. gets small, agents become very unlikely to become content, but comparatively very very very unlikely to become content when they’ve personally gotten a poor outcome. So, what happens overall is that everyone thrashes around for a long long time, ensuring that all the different possible combined action-states get explored. When everyone happens to become content at the exact same time, then they stay that way for a long long long time. This is an unlikely coincidence, but far more likely when the combined set of actions in a round maximizes joint utility.
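As an illustration of that dynamic, here is a hypothetical way to run two such agents against each other in a symmetric 2x2 game (the payoff numbers are arbitrary, and the exact convergence guarantee depends on the paper’s precise formulas rather than the stand-ins in the sketch above):

```python
# Two agents in a symmetric 2x2 game (arbitrary example, payoffs in [0, 1]).
# The sum-maximizing profile here is (C, C).
PAYOFFS = {("C", "C"): 0.6, ("C", "D"): 0.0,
           ("D", "C"): 1.0, ("D", "D"): 0.2}

a = ContentDiscontentAgent(["C", "D"], er=0.05)
b = ContentDiscontentAgent(["C", "D"], er=0.05)
history = []
for _ in range(100_000):
    x, y = a.act(), b.act()
    a.observe(PAYOFFS[(x, y)])
    b.observe(PAYOFFS[(y, x)])
    history.append((x, y))

# Expect long stretches of thrashing, punctuated by long stretches of joint
# contentment, mostly on the sum-maximizing profile late in the run.
print(history[-100:].count(("C", "C")))
```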

What’s Pavlov-like About This?

The basics of TfT are to copy the other player’s move from the previous round, whereas the basics of Pavlov are to copy your own move from the previous round if things went well, and otherwise change. This is captured nicely by the content/discontent idea.
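For reference, classic Pavlov (win-stay, lose-shift) in the iterated Prisoner’s Dilemma is only a few lines; the threshold below assumes the standard payoffs T=5, R=3, P=1, S=0, under which “things went well” exactly when the other player cooperated.

```python
def pavlov(my_last_move, my_last_payoff, good_enough=3):
    """Win-stay, lose-shift, assuming standard PD payoffs (T=5, R=3, P=1, S=0)."""
    if my_last_payoff >= good_enough:
        return my_last_move                       # things went well: repeat
    return "D" if my_last_move == "C" else "C"    # otherwise: switch
```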

On the other hand, the convergence behavior described (where players flail around for a very long time before they settle on a jointly efficient strategy) doesn’t sound much like Pavlov.

In Prisoner’s Dilemma, we (outside of the game) have an idea of what “good” and “bad” outcomes are, so that “copy your strategy if they cooperate; change your strategy if they defect” makes sense. The content/discontent algorithm, on the other hand, doesn’t know what payoffs are reasonable to expect. That’s why Pavlov is easily contented whereas the content/discontent algorithm isn’t.

If we know ahead of time which outcomes are reasonable (which is easy in a symmetric game, but more generally has to rely on a notion of fairness such as the Nash Bargaining Solution), we can program agents to switch to “content” as soon as they see payoffs at least as high as what they can reasonably expect to get. This makes the strategy more like simple Pavlov. However, if the agents try to do this kind of reasoning, it will tend to make them exploitable: “bully” agents can convince them to be content with low payoffs by making high payoffs seem impossible.
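As a hypothetical sketch of that modification (reusing the ContentDiscontentAgent class from earlier, with a fair_payoff threshold assumed to be computed in advance, e.g. from the Nash Bargaining Solution):

```python
class EagerContentAgent(ContentDiscontentAgent):
    """Becomes content as soon as a payoff meets a precomputed 'fair' threshold.

    This recovers simple-Pavlov-like behavior; the exploitability worry in the
    text arises when the threshold is itself estimated from experience (which
    a bully can manipulate) rather than fixed in advance as it is here.
    """

    def __init__(self, actions, fair_payoff, er=0.1):
        super().__init__(actions, er)
        self.fair_payoff = fair_payoff   # assumed known ahead of time

    def observe(self, payoff):
        if not self.content and payoff >= self.fair_payoff:
            self.content = True          # settle immediately on a fair payoff
            self.last_payoff = payoff
        else:
            super().observe(payoff)
```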

Even so, it doesn’t make sense to set e.r. extremely low, which is what’s necessary to guarantee eventual Pareto-optimality. In practice you’d rather have a moderate e.r., accepting the chance of settling on suboptimal action profiles.

There’s something of a trade-off between exploitability and convergence rate. If e.r. is high, then agents settle easily, so bullies can take advantage by only offering exploitative opportunities. If e.r. is moderate, then bullies have to be verrrry patient to pull that off. The generalized-Pavlov agent incentivises cooperation by being more likely to settle for better offers. If e.r. is low, though, the probability of becoming content is so low that there’s no incentive for other agents to care; they’ll just deal with your randomization in whatever way is optimal for them.

Cluster Comparison

Using the MTG color wheel to discuss AI alignment has been popular recently, so… according to me, TfT is a very “white” strategy, and Pavlov is a very “red” strategy. (For the rest of the post, “Pavlov” and “TfT” will refer to more general clusters rather than the specific Prisoner’s Dilemma strategies.)

  • TfT is eye-for-an-eye justice. There are rules, and there are proportional consequences. Cooperation is good, and is met with cooperation. Defection is evil, and is punished with defection in turn. TfT is strongest in an environment with many TfT agents and many less-cooperative agents; the group of TfT agents works together and provides effective punishment for the defector agents.

  • Pavlov is do-what-works-for-you freedom. There are no rules; there is no concept of good and evil. When things aren’t working well, you leave a situation; when things are working well, you stay. Pavlov doesn’t really punish others, it just randomizes; it incentivises cooperation mainly through positive reinforcement. Likewise, Pavlov responds only to positive reinforcement; negative reinforcement can only elicit randomization, so you can’t usually use it to get a Pavlov agent to do what you want. Pavlov doesn’t punish defection as well as TfT does, but it takes advantage of overly cooperative agents in a way TfT doesn’t. So, Pavlov agents thrive in environments with naive cooperators, but will tend to exploit those naive cooperators out of existence.

Here’s some discussion of what the strategy clusters look like. (I’ll leave open the question of what strategies might correspond to the other MTG colors.)

TfT Cluster

Even within the domain of iterated Prisoner’s Dilemma, there are many strategies which are still generally within the TfT cluster. Sarah discusses some of them.

Moving away from an iterated setting and into a logical one, TfT-like reasoning tries to copy the move of the other player on this round, rather than the previous round. When reasoning about the other player with logical uncertainty, a natural strategy is NicerBot, which cooperates with probability epsilon higher than its estimate of the other player’s probability of cooperation. Adding epsilon ensures that two NicerBots cooperate with each other (otherwise they could end up converging to any probability of cooperation); more generally, NicerBot matches the cooperativeness of other players.
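A minimal sketch of NicerBot’s rule, assuming the estimate of the other player’s cooperation probability comes from logical-uncertainty machinery not shown here:

```python
def nicerbot_cooperation_probability(estimated_other_coop: float,
                                     epsilon: float = 0.01) -> float:
    """Cooperate with probability epsilon above the estimated probability
    that the other player cooperates (capped at 1)."""
    return min(1.0, estimated_other_coop + epsilon)
```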

There is an important difficulty in generalizing TfT-like strategies further: you don’t want to cooperate with a rock which has “I cooperate” engraved on it. In other words, it doesn’t make sense to cooperate with non-agents. TfT is a strategy which makes sense when we’ve identified that we’re dealing with another agent, but we need to combine its behavior with more general decision-theoretic reasoning, somehow.

In open-source game theory with proof-based agents, the Löbian handshake could be considered a TfT-like concept. However, unlike with NicerBot, we naturally only cooperate with other agents; we defect against a rock with “I cooperate” written on it. So, naively, it seems like we’re in a better position to generalize TfT-like reasoning in pure logic than we are in logically-uncertain reasoning. This puzzle is discussed more in this post.

The Löbian handshake is still fragile to Agent Simulates Predictor-type problems. One way of interpreting this (which I think is the right way) is that although a Löbian-handshake-capable agent won’t cooperate with rocks, its notion of “rock” is anything with much less processing power than itself. This is problematic, because it only allows cooperation with similarly-intelligent agents.

The intuitive notion of logical causality I associate with TfT is one which combines the logically-uncertain reasoning of NicerBot and the Löbian handshake of the proof-based setting. It is very TDT-like: you reason about your decision as if you have logical control over relevantly similar agents. You coordinate with other agents by reasoning in the same way about what the optimal joint strategy would be.

Pavlov Cluster

Pavlov-type reasoning doesn’t have the same difficulties generalizing as TfT. Whereas TfT seems to require extra pieces added on to take advantage of non-agents (like PrudentBot), Pavlov does this naturally, while also naturally finding cooperative equilibria with other Pavlov agents.

It seems possible that Pavlov-cluster algorithms can do well in Agent Simulates Predictor-type problems, too. Pavlov happily defects against a rock because the rock exerts no agentic pressure toward cooperation. However, in Agent Simulates Predictor, the predictor does exert “agentic pressure”: it makes things go poorly for agents who predictably defect. A Pavlov agent will respond to this by being less likely to settle on defection as a strategy. (This is very hand-wavy, however. It would be good to construct a setting which illustrates this behavior.)

I’ve already described a generalization to deterministic iterated games, although the convergence behavior is admittedly quite bad (perhaps it can be improved). It isn’t hard to generalize the strategy I described to games with stochastic payoffs, if one is willing to use hacks like aggregating iterations into batches to de-noise the payout information. It would be interesting to see a non-hacky way of generalizing it.
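Here is a hypothetical sketch of that batching hack, as a wrapper around the ContentDiscontentAgent from earlier: hold each chosen action fixed for a batch of rounds, and report only the batch-average payoff to the underlying agent.

```python
class BatchedAgent:
    """Hypothetical wrapper for stochastic payoffs: keep the underlying
    agent's chosen action fixed for a whole batch of rounds, then feed it
    the batch's average payoff to de-noise the signal."""

    def __init__(self, agent, batch_size=20):
        self.agent = agent
        self.batch_size = batch_size
        self.current_action = None
        self.payoffs = []

    def act(self):
        if self.current_action is None:
            self.current_action = self.agent.act()
        return self.current_action

    def observe(self, payoff):
        self.payoffs.append(payoff)
        if len(self.payoffs) == self.batch_size:
            self.agent.observe(sum(self.payoffs) / self.batch_size)
            self.payoffs.clear()
            self.current_action = None
```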

Moving to a more logical setting, we may again want to substitute the notion of “previous round” with “prediction”. TfT-generalization matched its behavior to the predicted behavior of the other agent; Pavlov would instead copy its own predicted behavior! Or, rather: if Pavlov expects things to go well, it tries to do what it would expect itself to do. If Pavlov expects things to go poorly, it tries to invalidate its self-prediction.

(This doesn’t seem obviously good, but perhaps it can lead somewhere.)
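Here is a very rough sketch of that idea, with predict_own_action and predict_payoff standing in for unspecified self-prediction machinery (both hypothetical):

```python
import random

def predictive_pavlov(actions, predict_own_action, predict_payoff, good_enough):
    """Sketch: confirm the self-prediction when things look good,
    try to invalidate it when things look bad."""
    predicted = predict_own_action()
    if predict_payoff() >= good_enough:
        return predicted                                   # confirm prediction
    other = [a for a in actions if a != predicted]
    return random.choice(other) if other else predicted    # try to invalidate
```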

I’m not sure what notion of logical control should be associated with Pavlov. It feels connected with Decision Theory is for Making Bad Outcomes Inconsistent: Pavlov is trying to “shake off” bad outcomes, as if that made them less probable, much like the (apparently) bad reasoning discussed in The Happy Dance Problem. The content/discontent algorithm shows that something like this can work in the long run (though not in the short run).

Another idea is that the content/discontent algorithm controls the stationary distribution of the Markov chain (the iterated play) by setting the transition probabilities so as to steer toward optimal outcomes preferentially. Perhaps there’s a notion of logical control in which an agent thinks of itself as steering in this way?

Overall, the Pavlov side of things is very sketchy. Of the two competing visions, TfT-like reasoning may hold up better in the long run. Still, it seems useful to try to flesh out Pavlov-type reasoning as an alternative.