Let’s Read: Superhuman AI for multiplayer poker
On July 11, 2019, a new poker AI was published in Science. Called Pluribus, it plays six-player no-limit Texas hold’em at a superhuman level.
In this post, we read through the paper. The level of exposition is between the paper (too serious) and the popular press (too entertaining).
Basics of Texas Hold’em
If you don’t know what it even is, like me, then playing a tutorial would be best. I used Learn Poker on my phone.
Now that you know how to play, it’s time to deal with some of the terminology.
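Big blind: a forced bet that one player (the position rotates every hand) must post before any cards are dealt. It also sets the minimum that the other players must put in to stay in the hand. For example, $0.1 would be a reasonable amount in casual play.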
No-limit: you can bet as much as you want. Okay, not really. You can’t bet a billion dollars. In practice, bets are usually limited to something “reasonable” like 100 times the big blind.
Limping: calling the big blind, that is, betting only the minimum needed to stay in the hand instead of folding or raising. This is generally considered bad: if you feel confident, you should raise the bet, and if you feel diffident, you should fold.
Donk betting: an uncommon play that’s usually considered dumb (like a donkey). The paper defines it as starting a round by betting when one ended the previous betting round with a call.
The authors are Noam Brown and Tuomas Sandholm. In 2017, they made the news with Libratus, a poker AI that beat human champions at two-player no-limit Texas Hold’em.
Pluribus contains a lot of the code from Libratus and its siblings:
The authors have ownership interest in Strategic Machine, Inc. and Strategy Robot, Inc. which have exclusively licensed prior game-solving code from Prof. Sandholm’s Carnegie Mellon University laboratory, which constitutes the bulk of the code in Pluribus.
Scroll to the bottom for more on the two companies.
Highlights from the paper
Is Nash equilibrium even worthwhile?
In multiplayer games, Nash equilibria are not easy to compute, and might not even matter. Consider the Lemonade Stand Game:
It is summer on Lemonade Island, and you need to make some cash. You decide to set up a lemonade stand on the beach (which goes all around the island), as do two others. There are twelve places to set up around the island like the numbers on a clock. Your price is fixed, and all people go to the nearest lemonade stand. The game is repeated. Every night, everyone moves under cover of darkness (simultaneously). There is no cost to move. After 100 days of summer, the game is over. The utility of the repeated game is the sum of the utilities of the single-shot games.
One Nash equilibrium is when the three of you are equidistant from each other, but there’s no way to achieve that unilaterally. You might decide that you will just stay at Stand 0 and wait for the others to get to Stand 4 and Stand 8, but they might settle on a different Nash equilibrium.
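To make the equilibrium claim concrete, here is a minimal sketch of a single-shot Lemonade Stand Game. It assumes one customer at each of the twelve positions and an equal split when stands are tied for nearest; neither detail is spelled out in the quote above, and the function names are made up for illustration.

```python
from fractions import Fraction

N_POSITIONS = 12  # the twelve "clock" positions around the island


def circular_distance(a, b):
    """Shortest walk between two positions along the beach."""
    d = abs(a - b) % N_POSITIONS
    return min(d, N_POSITIONS - d)


def payoffs(stands):
    """Customers won by each stand (assumed: one customer per position,
    ties split equally among the nearest stands)."""
    totals = [Fraction(0)] * len(stands)
    for customer in range(N_POSITIONS):
        distances = [circular_distance(customer, s) for s in stands]
        nearest = min(distances)
        winners = [i for i, d in enumerate(distances) if d == nearest]
        for i in winners:
            totals[i] += Fraction(1, len(winners))
    return tuple(totals)


def is_nash(stands):
    """True if no single player can gain strictly more by moving alone."""
    base = payoffs(stands)
    for player in range(len(stands)):
        for alternative in range(N_POSITIONS):
            trial = list(stands)
            trial[player] = alternative
            if payoffs(tuple(trial))[player] > base[player]:
                return False
    return True


print(payoffs((0, 4, 8)), is_nash((0, 4, 8)))
print(is_nash((1, 5, 9)), is_nash((0, 1, 2)))
```

Under these assumptions the equidistant layout (0, 4, 8) is an equilibrium, but so is every rotation of it, such as (1, 5, 9). That is exactly the coordination problem: knowing that equilibria exist does not tell you which one the other two players will pick.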
The authors decided to go all empirical and not consider the problem of Nash equilibrium:
The shortcomings of Nash equilibria outside of two-player zero-sum games, and the failure of any other game-theoretic solution concept to convincingly overcome them, have raised the question of what the right goal should even be in such games. In the case of six-player poker, we take the viewpoint that our goal should not be a specific game-theoretic solution concept, but rather to create an AI that empirically consistently defeats human opponents, including elite human professionals.
The success of Pluribus appears to vindicate them:
… even though the techniques do not have known strong theoretical guarantees on performance outside of the two-player zero-sum setting, they are nevertheless capable of producing superhuman strategies in a wider class of strategic settings.
Description of Pluribus
Pluribus first produces a “blueprint” by offline self-play, then adapts it during live play:
The core of Pluribus’s strategy was computed via self play, in which the AI plays against copies of itself, without any data of human or prior AI play used… Pluribus’s self play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. Then during actual play against opponents, Pluribus improves upon the blueprint strategy by searching for a better strategy in real time for the situations it finds itself in during the game.
The first betting round has the least variation (compare a chess opening with a chess midgame), so Pluribus could afford to train an almost complete blueprint strategy for it. For later rounds, some real-time search was needed:
Pluribus only plays according to this blueprint strategy in the first betting round (of four)… After the first round, Pluribus instead conducts real-time search to determine a better, finer-grained strategy for the current situation it is in.
Pluribus uses Monte Carlo counterfactual regret minimization:
The blueprint strategy in Pluribus was computed using a variant of counterfactual regret minimization (CFR)… We use a form of Monte Carlo CFR (MCCFR) that samples actions in the game tree rather than traversing the entire game tree on each iteration.
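The building block underneath CFR is regret matching: play each action in proportion to how much you regret not having played it so far. The sketch below is not Pluribus’s code and not MCCFR. It is plain regret matching in self-play on rock-paper-scissors, a one-shot two-player zero-sum game, using exact expected payoffs instead of sampling. The seed regret that breaks the initial symmetry and all names are assumptions for illustration.

```python
# Payoff to a player choosing action a against an opponent choosing b.
# Actions: 0 = rock, 1 = paper, 2 = scissors. The game is symmetric.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]


def regret_matching(regrets):
    """Mix actions in proportion to their positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)  # no regret yet: play uniformly


def self_play(iterations=50000):
    """Return each player's average strategy after regret-matching self-play."""
    # Assumed seed: player 0 starts out "regretting" not playing rock,
    # so the two players do not sit at the uniform strategy forever.
    regrets = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        strategies = [regret_matching(r) for r in regrets]
        for player in (0, 1):
            opponent = strategies[1 - player]
            # Expected payoff of each pure action against the opponent's mix.
            values = [sum(PAYOFF[a][b] * opponent[b] for b in range(3))
                      for a in range(3)]
            expected = sum(s * v for s, v in zip(strategies[player], values))
            for a in range(3):
                regrets[player][a] += values[a] - expected
                strategy_sums[player][a] += strategies[player][a]
    return [[s / iterations for s in sums] for sums in strategy_sums]


average_strategies = self_play()
print(average_strategies)
```

The strategy on any single iteration keeps cycling, but the average strategy of each player approaches the uniform mix, which is the equilibrium of rock-paper-scissors. CFR applies this kind of update at every decision point of the game tree, and the Monte Carlo variant described in the quote samples actions instead of traversing the whole tree on each iteration.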
Pluribus can be sneaky:
… if the player bets in [a winning] situation only when holding the best possible hand, then the opponents would know to always fold in response. To cope with this, Pluribus keeps track of the probability it would have reached the current situation with each possible hand according to its strategy. Regardless of which hand Pluribus is actually holding, it will first calculate how it would act with every possible hand, being careful to balance its strategy across all the hands so as to remain unpredictable to the opponent. Once this balanced strategy across all hands is computed, Pluribus then executes an action for the hand it is actually holding.
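Why balancing across hands matters can be seen from the opponent’s side with a small Bayes update. This is a toy sketch: the three hand classes, the prior, and the betting frequencies are invented numbers, not anything taken from Pluribus.

```python
def posterior_over_hands(prior, bet_probability):
    """Opponent's belief about our hand after seeing us bet (Bayes' rule)."""
    joint = {hand: prior[hand] * bet_probability[hand] for hand in prior}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("this strategy never bets, so a bet cannot be observed")
    return {hand: p / total for hand, p in joint.items()}


# Hypothetical chance of holding each class of hand in the current situation.
prior = {"best": 0.1, "medium": 0.4, "weak": 0.5}

# Transparent strategy: bet only when holding the best hand.
transparent = {"best": 1.0, "medium": 0.0, "weak": 0.0}

# Balanced strategy: also bet some weak hands as bluffs.
balanced = {"best": 1.0, "medium": 0.0, "weak": 0.2}

print(posterior_over_hands(prior, transparent))
print(posterior_over_hands(prior, balanced))
```

With the transparent strategy a bet reveals the best hand with certainty, so opponents can always fold. With the balanced strategy a bet is a bluff half the time in this toy setup, so folding every time gives up money. Computing how it would act with every possible hand, as the quote describes, is what lets Pluribus control this kind of ratio.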
This was corroborated by a comment from a human opponent:
“Pluribus is a very hard opponent to play against,” said Chris Ferguson, a World Series of Poker champion. “It’s really hard to pin him down on any kind of hand.”
Scroll down for how Ferguson lost to Pluribus.
Pluribus is cheap, small, and fast
To keep Pluribus small, the blueprint strategy is “abstracted”: it intentionally treats some similar game actions as identical (because really, a $200 bet and a $201 bet are not so different).
We set the size of the blueprint strategy abstraction to allow Pluribus to run during live play on a machine with no more than 128 GB of memory while storing a compressed form of the blueprint strategy in memory.
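A minimal sketch of what action abstraction can look like: every raw bet is mapped to the nearest size from a short list. The list of sizes and the nearest-size rule are assumptions for illustration; the real abstraction in Pluribus is more involved.

```python
# Hypothetical bet sizes (in dollars) that the abstracted game knows about.
ABSTRACT_BET_SIZES = (100, 200, 400, 800)


def abstract_bet(bet, sizes=ABSTRACT_BET_SIZES):
    """Map a raw bet to the closest abstract size (smaller size wins ties)."""
    return min(sizes, key=lambda size: abs(size - bet))


print(abstract_bet(200), abstract_bet(201), abstract_bet(350))
```

Because $200 and $201 land in the same bucket, the strategy table needs one entry for both. Fewer distinct actions means a smaller game tree and a smaller table to hold in memory.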
The abstraction paid off. Pluribus was cheap to train, cheap to run, and faster than humans:
The blueprint strategy for Pluribus was computed in 8 days on a 64-core server for a total of 12,400 CPU core hours. It required less than 512 GB of memory. At current cloud computing spot instance rates, this would cost about $144 to produce.
When playing, Pluribus runs on two Intel Haswell E5-2695 v3 CPUs and uses less than 128 GB of memory. For comparison… Libratus used 100 CPUs in its 2017 matches against top professionals in two-player poker.
On Amazon right now, an Intel® Xeon® Processor E5-2695 v3 CPU costs just $500, and 128 GB of RAM costs $750. The whole setup could be built for under $2000. It would take only a little while to recoup the cost if it were put to work in online poker.
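The cost figures are easy to sanity-check. The sketch below only uses numbers quoted above; the per-core-hour price is derived from them and is not a quoted cloud rate.

```python
# Training: figures quoted from the paper.
TRAINING_DAYS = 8
SERVER_CORES = 64
REPORTED_CORE_HOURS = 12_400
REPORTED_COST_USD = 144

# Running the server flat out for 8 days gives the core-hour ceiling.
wall_clock_core_hours = TRAINING_DAYS * 24 * SERVER_CORES
implied_usd_per_core_hour = REPORTED_COST_USD / REPORTED_CORE_HOURS

# Playing: two CPUs plus 128 GB of RAM, at the prices mentioned in this post.
CPU_PRICE_USD = 500
RAM_PRICE_USD = 750
hardware_total_usd = 2 * CPU_PRICE_USD + RAM_PRICE_USD

print(wall_clock_core_hours)
print(round(implied_usd_per_core_hour, 4))
print(hardware_total_usd)
```

Eight days on 64 cores is 12,288 core hours, in line with the reported 12,400, and $144 works out to a bit over one cent per core hour. The two CPUs and the RAM add up to $1,750, which leaves $250 for the rest of the machine if the total is to stay under $2,000.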
The amount of time Pluribus takes to conduct search on a single subgame varies between 1 s and 33 s depending on the particular situation. On average, Pluribus plays at a rate of 20 s per hand when playing against copies of itself in six-player poker. This is roughly twice as fast as professional humans tend to play.
Pluribus vs Human professionals. Pluribus wins!
We evaluated Pluribus against elite human professionals in two formats: five human professionals playing with one copy of Pluribus (5H+1AI), and one human professional playing with five copies of Pluribus (1H+5AI). Each human participant has won more than $1 million playing poker professionally.
Professional poker is an endurance game, like a marathon:
In this experiment, 10,000 hands of poker were played over 12 days. Each day, five volunteers from the pool of professionals were selected to participate based on availability. The participants were not told who else was participating in the experiment. Instead, each participant was assigned an alias that remained constant throughout the experiment. The alias of each player in each game was known, so that players could track the tendencies of each player throughout the experiment.
And there was prize money, of course, for the humans. Pluribus played for free—what a champ.
$50,000 was divided among the human participants based on their performance to incentivize them to play their best. Each player was guaranteed a minimum of $0.40 per hand for participating, but this could increase to as much as $1.60 per hand based on performance.
Pluribus had a very high win rate, and was shown with statistical significance to be profitable when playing against 5 elite humans:
After applying AIVAT, Pluribus won an average of 48 mbb/game (with a standard error of 25 mbb/game). This is considered a very high win rate in six-player no-limit Texas hold’em poker, especially against a collection of elite professionals, and implies that Pluribus is stronger than the human opponents. Pluribus was determined to be profitable with a p-value of 0.028.
“mbb/game” means “milli big blinds per game”. The “big blind” is the forced bet posted at the start of each hand, and poker players use it as a unit for measuring the size of bets. “milli” means 1/1000. So Pluribus would on average win 4.8% of a big blind each game. Very impressive.
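The win rate and the p-value can be connected with a quick back-of-the-envelope check. The sketch below assumes a one-sided test and a normal approximation, and it uses the rounded figures from the quote, so it is only expected to land near the reported p-value, not to reproduce the authors’ exact calculation.

```python
import math


def mbb_to_big_blinds(mbb):
    """Convert milli big blinds to big blinds."""
    return mbb / 1000.0


def one_sided_p_value(mean, standard_error):
    """P(Z >= mean / standard_error) for a standard normal Z."""
    z = mean / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2))


win_rate_mbb = 48        # reported average win rate, mbb/game
standard_error_mbb = 25  # reported standard error, mbb/game

print(mbb_to_big_blinds(win_rate_mbb))
print(one_sided_p_value(win_rate_mbb, standard_error_mbb))
```

The win rate is a bit under two standard errors above zero, and the approximate p-value comes out close to the 0.028 reported in the paper.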
AIVAT is a statistical technique designed specifically to evaluate how good a poker player is from fewer games, by reducing the variance that comes from luck. From (Neil Burch et al, 2018):
Evaluating agent performance when outcomes are stochastic and agents use randomized strategies can be challenging when there is limited data available… [AIVAT] was able to reduce the standard deviation of a Texas hold’em poker man-machine match by 85% and consequently requires 44 times fewer games to draw the same statistical conclusion. AIVAT enabled the first statistically significant AI victory against professional poker players in no-limit hold’em.
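The “85%” and “44 times” in the quote are two views of the same number. The standard error of an average shrinks like the standard deviation divided by the square root of the number of games, so cutting the standard deviation to 15% of its old value cuts the games needed for the same standard error by a factor of 1/0.15². A minimal check (the function name is made up):

```python
def games_reduction_factor(std_reduction):
    """How many times fewer games give the same standard error of the mean
    after the per-game standard deviation is cut by `std_reduction`."""
    remaining = 1.0 - std_reduction
    return 1.0 / remaining ** 2


print(games_reduction_factor(0.85))
```

The result is a little over 44, matching the quote.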
Pluribus vs Jesus (and Elias)
The human participants in the 1H+5AI experiment were Chris “Jesus” Ferguson and Darren Elias. Each of the two humans separately played 5,000 hands of poker against five copies of Pluribus.
Pluribus did not gang up on the poor human:
Pluribus does not adapt its strategy to its opponents and does not know the identity of its opponents, so the copies of Pluribus could not intentionally collude against the human player.
The humans were paid $0.60 per hand on average ($6,000 in total across 10,000 hands):
To incentivize strong play, we offered each human $2,000 for participation and an additional $2,000 if he performed better against the AI than the other human player did.
For the 10,000 hands played, Pluribus beat the humans by an average of 32 mbb/game (with a standard error of 15 mbb/game). Pluribus was determined to be profitable with a p-value of 0.014.
Ferguson lost less than Elias:
Ferguson’s lower loss rate may be a consequence of variance, skill, and/or the fact that he used a more conservative strategy that was biased toward folding in unfamiliar difficult situations.
Pluribus is an alien, like AlphaZero
And like AlphaZero, it confirms some human strategies, and dismisses some others:
Because Pluribus’s strategy was determined entirely from self-play without any human data, it also provides an outside perspective on what optimal play should look like in multiplayer no-limit Texas hold’em.
Two examples in particular:
Pluribus confirms the conventional human wisdom that limping (calling the “big blind” rather than folding or raising) is suboptimal for any player except the “small blind” player… While Pluribus initially experimented with limping… it gradually discarded this action from its strategy as self play continued. However, Pluribus disagrees with the folk wisdom that “donk betting” (starting a round by betting when one ended the previous betting round with a call) is a mistake; Pluribus does this far more often than professional humans do.
Too dangerous to be released, again
The program was not released because of some unspecified risk. (News articles made it specifically about the risk of wrecking the online gambling industry.)
Because poker is played commercially, the risk associated with releasing the code outweighs the benefits. To aid reproducibility, we have included the pseudocode for the major components of our program in the supplementary materials.
Useful quotes from other news reports
From Ars Technica:
Pluribus actually confirmed one bit of conventional poker-playing wisdom: it’s just not a good idea to “limp” into a hand, that is, calling the big blind rather than folding or raising. The exception, of course, is if you’re in the small blind, when mere calling costs you half as much as the other players.
Pluribus placed donk bets far more often than its human opponents… Pluribus makes unusual bet sizes and is better at randomization. “Its major strength is its ability to use mixed strategies… to do this in a perfectly random way and to do so consistently. Most people just can’t.”
From MIT Technology Review:
Sandholm cites multi-party negotiation or pricing—such as Amazon, Walmart, and Target trying to come up with the most competitive pricing against each other—as a specific application. Optimal media spending for political campaigns is another example, as well as auction bidding strategies.
There are a few more details about Sandholm’s two companies:
Sandholm has already licensed much of the poker technology developed in his lab to two startups: Strategic Machine and Strategy Robot. The first startup is interested in gaming and other entertainment applications; Strategy Robot’s focus is on defense and intelligence applications.
“Better computer games”… hm, sounds suspiciously nonspecific.
Brown says Facebook has no plans to apply the techniques developed for six-player poker, although they could be used to develop better computer games.