Let’s Read: Superhuman AI for multiplayer poker

On July 11, 2019, a new poker AI was published in Science. Called Pluribus, it plays 6-player No-limit Texas Hold'em at a superhuman level.

In this post, we read through the paper. The level of exposition is between the paper (too serious) and the popular press (too entertaining).

Basics of Texas Hold'em

If, like me, you don't know what the game even is, playing a tutorial is the best way to start. I used Learn Poker on my phone.

Now that you know how to play it, it's time to deal with some of the terminology.

  • Big blind: the minimum bet that sets the stakes of the game, posted as a forced bet before the cards are dealt. For example, $0.1 would be a reasonable amount in casual play.

  • No-limit: you can bet as much as you want. Okay, not really. You can't bet a billion dollars. In practical play, it's usually capped at something "reasonable" like 100 times the big blind.

  • Heads-up: 2-player.

  • Limping: calling the minimum amount needed to keep yourself in the hand. This is generally considered bad: if you feel confident, you should raise the bet, and if you feel doubtful, you should fold.

  • Donk betting: starting a betting round with a bet after having merely called to end the previous round. It's usually considered dumb (like a donkey) in poker folk wisdom.

The authors

The authors are Noam Brown and Tuomas Sandholm. They previously made the news in 2017 with Libratus, a poker AI that beat human champions in 2-player no-limit Texas Hold'em.

Pluribus contains a lot of the code from Libratus and its siblings:

The authors have ownership interest in Strategic Machine, Inc. and Strategy Robot, Inc. which have exclusively licensed prior game-solving code from Prof. Sandholm's Carnegie Mellon University laboratory, which constitutes the bulk of the code in Pluribus.

Scroll to the bottom for more on the two companies.

Highlights from the paper

Is Nash equilibrium even worthwhile?

In multiplayer games, Nash equilibria are not easy to compute, and might not even matter. Consider the Lemonade Stand Game:

It is summer on Lemonade Island, and you need to make some cash. You decide to set up a lemonade stand on the beach (which goes all around the island), as do two others. There are twelve places to set up around the island like the numbers on a clock. Your price is fixed, and all people go to the nearest lemonade stand. The game is repeated. Every night, everyone moves under cover of darkness (simultaneously). There is no cost to move. After 100 days of summer, the game is over. The utility of the repeated game is the sum of the utilities of the single-shot games.

A Nash equilibrium is reached when the three of you are equidistant from each other, but there's no way to achieve that unilaterally. You might decide that you will just stay at Stand 0 and wait for the others to get to Stand 4 and Stand 8, but they might settle on a different Nash equilibrium.

The authors decided to go fully empirical and sidestep the problem of Nash equilibrium:

The shortcomings of Nash equilibria outside of two-player zero-sum games, and the failure of any other game-theoretic solution concept to convincingly overcome them, have raised the question of what the right goal should even be in such games. In the case of six-player poker, we take the viewpoint that our goal should not be a specific game-theoretic solution concept, but rather to create an AI that empirically consistently defeats human opponents, including elite human professionals.

The success of Pluribus appears to vindicate them:

… even though the techniques do not have known strong theoretical guarantees on performance outside of the two-player zero-sum setting, they are nevertheless capable of producing superhuman strategies in a wider class of strategic settings.

Description of Pluribus

Pluribus first produces a "blueprint" by offline self-play, then adapts it during live play:

The core of Pluribus's strategy was computed via self play, in which the AI plays against copies of itself, without any data of human or prior AI play used… Pluribus's self play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. Then during actual play against opponents, Pluribus improves upon the blueprint strategy by searching for a better strategy in real time for the situations it finds itself in during the game.

Since the first round (like a chess opening vs the chess midgame) has the smallest amount of variation, Pluribus could afford to train an almost complete blueprint strategy for it. For later rounds, some real-time search is needed:

Pluribus only plays according to this blueprint strategy in the first betting round (of four)… After the first round, Pluribus instead conducts real-time search to determine a better, finer-grained strategy for the current situation it is in.

Pluribus uses Monte Carlo counterfactual regret minimization. The details can be found in the link.

The blueprint strategy in Pluribus was computed using a variant of counterfactual regret minimization (CFR)… We use a form of Monte Carlo CFR (MCCFR) that samples actions in the game tree rather than traversing the entire game tree on each iteration.
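To get a feel for what CFR is doing, here is a minimal sketch of its core update rule, regret matching, applied to a toy game (rock-paper-scissors) rather than poker. This is not Pluribus's MCCFR, which samples paths through an enormous game tree; it is only the single-infoset kernel of the idea: play each action in proportion to how much you regret not having played it, and average your strategy over time.

```python
import random

ACTIONS = ["rock", "paper", "scissors"]

def payoff(a, b):
    # +1 if action a beats b, -1 if it loses, 0 on a tie
    wins = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
    if a == b:
        return 0
    return 1 if (a, b) in wins else -1

def strategy_from_regrets(regrets):
    # Regret matching: play actions in proportion to their positive regret
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    if total == 0:
        return [1.0 / len(ACTIONS)] * len(ACTIONS)
    return [p / total for p in positives]

def train(iterations=20000, seed=0):
    rng = random.Random(seed)
    regrets = [0.0, 0.0, 0.0]
    strategy_sum = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        strat = strategy_from_regrets(regrets)
        for i, s in enumerate(strat):
            strategy_sum[i] += s
        # Self-play with sampled actions (the "Monte Carlo" flavor):
        my = rng.choices(range(3), weights=strat)[0]
        opp = rng.choices(range(3), weights=strat)[0]
        got = payoff(ACTIONS[my], ACTIONS[opp])
        for alt in range(3):
            # Counterfactual regret: how much better would alt have done?
            regrets[alt] += payoff(ACTIONS[alt], ACTIONS[opp]) - got
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]

avg = train()
# The *average* strategy (not the last one) converges toward the
# uniform equilibrium of rock-paper-scissors.
```

The key subtlety, which carries over to poker, is that the time-averaged strategy is what converges, while the per-iteration strategy keeps oscillating.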

Pluribus can be sneaky:

… if the player bets in [a winning] situation only when holding the best possible hand, then the opponents would know to always fold in response. To cope with this, Pluribus keeps track of the probability it would have reached the current situation with each possible hand according to its strategy. Regardless of which hand Pluribus is actually holding, it will first calculate how it would act with every possible hand, being careful to balance its strategy across all the hands so as to remain unpredictable to the opponent. Once this balanced strategy across all hands is computed, Pluribus then executes an action for the hand it is actually holding.
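The paragraph above can be sketched in a few lines. The hands, strengths, and policy below are made-up toy values, not anything from the paper; the point is only the shape of the procedure: compute the action table for every possible hand first, then sample an action for the hand actually held, so a bet leaks little information.

```python
import random

# Hypothetical toy hands with made-up strength values in [0, 1]
HANDS = {"AA": 0.95, "KQ": 0.60, "72": 0.10}

def action_probs(strength):
    # Illustrative balanced policy: bet probability rises with strength,
    # but never hits 0 or 1, so weak hands sometimes bluff and strong
    # hands sometimes check.
    p_bet = 0.15 + 0.7 * strength
    return {"bet": p_bet, "check": 1.0 - p_bet}

def act(actual_hand, rng):
    # First compute how we'd act with *every* possible hand...
    table = {hand: action_probs(s) for hand, s in HANDS.items()}
    # ...then sample an action only for the hand actually held.
    probs = table[actual_hand]
    return rng.choices(list(probs), weights=list(probs.values()))[0]
```

Because even the trash hand "72" bets 22% of the time under this toy policy, an opponent seeing a bet cannot conclude much about the hand behind it.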

This was corroborated by a comment from a human opponent:

"Pluribus is a very hard opponent to play against," said Chris Ferguson, a World Series of Poker champion. "It's really hard to pin him down on any kind of hand."

Scroll down for how Ferguson lost to Pluribus.

Pluribus is cheap, small, and fast

In order to make Pluribus small, the blueprint strategy is "abstracted", that is, it intentionally lumps together some game actions (because really, $200 and $201 are not so different).

We set the size of the blueprint strategy abstraction to allow Pluribus to run during live play on a machine with no more than 128 GB of memory while storing a compressed form of the blueprint strategy in memory.
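A minimal sketch of what action abstraction looks like, under assumed details: the bucket menu below (fractions of the pot) is invented for illustration and is not the actual abstraction Pluribus uses. The point is just that nearby bet sizes collapse into one bucket, which is what keeps the strategy small.

```python
# Hypothetical bet-size menu, as fractions of the pot (illustrative only)
BET_FRACTIONS = [0.5, 1.0, 2.0]  # half pot, full pot, overbet

def abstract_bet(bet, pot):
    # Snap an observed bet to the nearest fraction-of-pot bucket,
    # so e.g. $200 and $201 into a $400 pot become the same action.
    frac = bet / pot
    return min(BET_FRACTIONS, key=lambda f: abs(f - frac))
```

With this mapping, `abstract_bet(200, 400)` and `abstract_bet(201, 400)` both land in the half-pot bucket, so the blueprint only needs a strategy for three bet sizes instead of thousands.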

The abstraction paid off. Pluribus was cheap to train, cheap to run, and faster than humans:

The blueprint strategy for Pluribus was computed in 8 days on a 64-core server for a total of 12,400 CPU core hours. It required less than 512 GB of memory. At current cloud computing spot instance rates, this would cost about $144 to produce.

When playing, Pluribus runs on two Intel Haswell E5-2695 v3 CPUs and uses less than 128 GB of memory. For comparison… Libratus used 100 CPUs in its 2017 matches against top professionals in two-player poker.

On Amazon right now, an Intel® Xeon® Processor E5-2695 v3 CPU costs just $500, and 128 GB of RAM costs $750. The whole setup can be constructed for under $2,000. It would only take a little while to recoup the cost if it went to online poker.

The amount of time Pluribus takes to conduct search on a single subgame varies between 1 s and 33 s depending on the particular situation. On average, Pluribus plays at a rate of 20 s per hand when playing against copies of itself in six-player poker. This is roughly twice as fast as professional humans tend to play.

Pluribus vs human professionals. Pluribus wins!

We evaluated Pluribus against elite human professionals in two formats: five human professionals playing with one copy of Pluribus (5H+1AI), and one human professional playing with five copies of Pluribus (1H+5AI). Each human participant has won more than $1 million playing poker professionally.

Professional poker is an endurance game, like a marathon:

In this experiment, 10,000 hands of poker were played over 12 days. Each day, five volunteers from the pool of [13] professionals were selected to participate based on availability. The participants were not told who else was participating in the experiment. Instead, each participant was assigned an alias that remained constant throughout the experiment. The alias of each player in each game was known, so that players could track the tendencies of each player throughout the experiment.

And there was prize money, of course, for the humans. Pluribus played for free. What a champ.

$50,000 was divided among the human participants based on their performance to incentivize them to play their best. Each player was guaranteed a minimum of $0.40 per hand for participating, but this could increase to as much as $1.60 per hand based on performance.

Pluribus had a very high win rate, and was statistically determined to be profitable when playing against 5 elite humans:

After applying AIVAT, Pluribus won an average of 48 mbb/game (with a standard error of 25 mbb/game). This is considered a very high win rate in six-player no-limit Texas hold'em poker, especially against a collection of elite professionals, and implies that Pluribus is stronger than the human opponents. Pluribus was determined to be profitable with a p-value of 0.028.

"mbb/game" means "milli big blinds per game". The "big blind" is the least amount that one must bet at the beginning of the game, and poker players use it as a unit for measuring the size of bets. "milli" means 1/1000. So Pluribus would on average win 4.8% of the big blind each game. Very impressive.
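The conversion is simple enough to write down. The $100 big blind below is an assumed stake for illustration only; the experiment's actual stakes may have differed.

```python
# Convert the reported win rate into money, assuming an illustrative
# $100 big blind (not necessarily the experiment's actual stake).
big_blind = 100.0           # dollars, assumed
win_rate_mbb = 48           # milli big blinds per game, from the paper

per_game = win_rate_mbb / 1000 * big_blind   # 4.8% of the big blind = $4.80
over_10k_hands = per_game * 10_000           # $48,000 over the full experiment
```

At a $100 big blind, a 48 mbb/game edge is $4.80 per hand, or $48,000 over the 10,000-hand experiment.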

Performance of Pluribus in the 5 humans + 1 AI experiment

AIVAT is a statistical technique designed specifically to evaluate how good a poker player is. From (Burch et al., 2018):

Evaluating agent performance when outcomes are stochastic and agents use randomized strategies can be challenging when there is limited data available… [AIVAT] was able to reduce the standard deviation of a Texas hold'em poker man-machine match by 85% and consequently requires 44 times fewer games to draw the same statistical conclusion. AIVAT enabled the first statistically significant AI victory against professional poker players in no-limit hold'em.
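This is not the real AIVAT, which constructs its correction terms from the game's value estimates, but the underlying idea is a classic control variate: subtract a zero-mean "luck" term from each outcome, leaving an unbiased but far less noisy estimate of skill. A toy simulation, with all distributions invented for illustration:

```python
import random
import statistics

def simulate(n, seed=0):
    # Toy model: each hand's outcome = small skill edge + huge card luck.
    rng = random.Random(seed)
    raw, corrected = [], []
    for _ in range(n):
        luck = rng.gauss(0, 10)      # card luck: large variance, known mean 0
        skill = rng.gauss(0.5, 1)    # the small edge we want to measure
        outcome = skill + luck
        raw.append(outcome)
        # Assumes the luck term is estimable, which is what AIVAT provides;
        # subtracting a zero-mean term leaves the estimate unbiased.
        corrected.append(outcome - luck)
    return raw, corrected

raw, corrected = simulate(10_000)
# The corrected samples have far smaller spread, so far fewer hands are
# needed to conclude the skill edge is real.
```

In the real AIVAT, the subtracted term is built from value estimates of chance events and the AI's own known strategy, but the variance-reduction mechanism is the same.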

Pluribus vs Jesus (and Elias)

The human participants in the 1H+5AI experiment were Chris "Jesus" Ferguson and Darren Elias. Each of the two humans separately played 5,000 hands of poker against five copies of Pluribus.

Pluribus did not gang up on the poor human:

Pluribus does not adapt its strategy to its opponents and does not know the identity of its opponents, so the copies of Pluribus could not intentionally collude against the human player.

The humans were paid on average $0.60 per hand ($2,000 guaranteed plus a possible $2,000 bonus, spread over 5,000 hands):

To incentivize strong play, we offered each human $2,000 for participation and an additional $2,000 if he performed better against the AI than the other human player did.

Pluribus won!

For the 10,000 hands played, Pluribus beat the humans by an average of 32 mbb/game (with a standard error of 15 mbb/game). Pluribus was determined to be profitable with a p-value of 0.014.

Ferguson lost less than Elias:

Ferguson's lower loss rate may be a consequence of variance, skill, and/or the fact that he used a more conservative strategy that was biased toward folding in unfamiliar difficult situations.

Pluribus is an alien, like AlphaZero

And like AlphaZero, it confirms some human strategies, and dismisses some others:

Because Pluribus's strategy was determined entirely from self-play without any human data, it also provides an outside perspective on what optimal play should look like in multiplayer no-limit Texas hold'em.

Two examples in particular:

Pluribus confirms the conventional human wisdom that limping (calling the "big blind" rather than folding or raising) is suboptimal for any player except the "small blind" player… While Pluribus initially experimented with limping… it gradually discarded this action from its strategy as self play continued. However, Pluribus disagrees with the folk wisdom that "donk betting" (starting a round by betting when one ended the previous betting round with a call) is a mistake; Pluribus does this far more often than professional humans do.

Too dangerous to be released, again

The program is not being released, due to some kind of unspecified risk. (News articles made it specifically about the risk of wrecking the online gambling industry.)

Because poker is played commercially, the risk associated with releasing the code outweighs the benefits. To aid reproducibility, we have included the pseudocode for the major components of our program in the supplementary materials.

Useful quotes from other news reports

From Ars Technica:

Pluribus actually confirmed one bit of conventional poker-playing wisdom: it's just not a good idea to "limp" into a hand, that is, calling the big blind rather than folding or raising. The exception, of course, is if you're in the small blind, when mere calling costs you half as much as the other players.

Pluribus placed donk bets far more often than its human opponents… Pluribus makes unusual bet sizes and is better at randomization. "Its major strength is its ability to use mixed strategies… to do this in a perfectly random way and to do so consistently. Most people just can't."

From MIT Technology Review:

Sandholm cites multi-party negotiation or pricing—such as Amazon, Walmart, and Target trying to come up with the most competitive pricing against each other—as a specific application. Optimal media spending for political campaigns is another example, as well as auction bidding strategies.

There are a few more details on Sandholm's two companies:

Sandholm has already licensed much of the poker technology developed in his lab to two startups: Strategic Machine and Strategy Robot. The first startup is interested in gaming and other entertainment applications; Strategy Robot's focus is on defense and intelligence applications.

Brown says Facebook has no plans to apply the techniques developed for six-player poker, although they could be used to develop better computer games.

"Better computer games"… hm, sounds suspiciously nonspecific.