A well-known issue in game theory is that of multiple Nash equilibria in coordination games.

For example, consider the following game. Players 1 and 2 both choose either “high” or “low”. If both choose “high”, they both get 10 utility; if both choose “low”, they both get 5 utility; if they choose different actions, then they get no utility. There are 3 Nash equilibria: one where both always play “high”, one where both always play “low”, and one where each plays “high” with ¹⁄₃ probability and “low” with ²⁄₃ probability. While (high, high) is obviously the best Nash equilibrium, unilaterally switching to high when the current equilibrium is (low, low) is probably unwise.

You could imagine TDT arguments based on symmetry between the different players to argue that “high” is the only rational choice, but it is easy to imagine asymmetric variants of this problem where the symmetry arguments are much less clear.

The presence of suboptimal Nash equilibria looks like a serious problem for reconciling individual and collective rationality, even when players have common payoffs. If everyone is going to play “low”, then playing “low” is individually rational (at least according to CDT); however, the collective of agents fails to act rationally due to a coordination failure.

What I hoped to do before coming up with the idea in this post is to somehow reduce collective rationality to individual rationality. Can we prescribe individual rationality conditions to each player, such that if each player follows these conditions, then a near-optimal outcome for everyone is attained?

Somewhat surprisingly, the answer is “kind of.” It certainly seems like solving coordination problems requires global optimization instead of just local optimization, but under appropriate conditions, local optimization is actually sufficient.

Problem setup

We will consider the setting of normal-form common-payoff games. Let $n$ be the number of players. Let $A$ be some finite set of actions; each player $i$ will select the action $A_{i}$ . Let $U : A^{n} \to R$ be the shared utility function. Each player receives the same utility $U (a_{1}, \dots, a_{n})$ .

I will now assume that the players play some number of practice rounds before the “real” round. These practice rounds are very similar to those in the fictitious play algorithm. Ontologically, what are these practice rounds? One interpretation is that they are actual rounds of the game that have been played in the past; another interpretation is that they are a sequence of analogous one-shot rounds played by agents who have more and more compute and are thus able to simulate previous rounds. (The second interpretation might seem strange, but this is essentially how logical inductors achieve generalization bounds in decision theory problems)

Let $T$ be the number of practice rounds. Let $A_{i}^{t}$ be the action selected by player $i$ in practice round $t$ . I will assume $A_{i} = A_{i}^{T}$ , i.e. the last practice round is the “real” round.

We will assume that, in each round $t$ , some shared random number $R_{t} \sim U n i f o r m ({1, \dots, n})$ is generated, and that players’ policies in round $t$ may depend on this number.

How are players’ policies represented? We will assume that each player has only black-box access to each others’ actions in previous practice rounds. Let $π_{i} : L i s t (A^{n}) \times {1, \dots, n} \to Δ A$ be player $i$ ’s policy, which selects an action distribution for player $i$ in round $t$ given knowledge of the actions in all previous rounds and of the current shared random number.

The generative model is now obvious: For each $t$ in sequence, sample $R_{t} \sim U n i f o r m ({1, \dots, n})$ and then sample each $A_{i}^{t} \sim π_{i} (A_{1 : n}^{1 : t - 1}, R_{t})$ independently, where the notation $A_{1 : n}^{1 : t - 1}$ refers to the list $(A_{1 : n}^{1}, \dots, A_{1 : n}^{t - 1})$ .

Given a set of policies, it is possible to compute, for each $t$ , expected utilities at each time step (which are the expected utilities attained if $T = t$ ):

$v_{t} (π_{1 : n}) := E [U (A_{1 : n}^{t}) | π_{1 : n}]$

The limit ${lim}_{t \to \infty} v_{t} (π_{1 : n})$ and corresponding $lim inf$ and $lim sup$ are interesting; they indicate how well a set of policies does in expectation after a large number of practice rounds.

Policies that solve the game

The policies we will consider are ones that usually stick with the same action they played in the last round, but sometimes switch to a different action, and are more likely to switch the worse the utility in the last round was.

Let $u_{min}$ be a lower bound on $U$ . Let $α > 0$ be some “rationality parameter”. Consider the following set of policies:

$π_{i}^{α} (A_{1 : n}^{1 : t - 1}, R_{t}) = U n i f o r m (A) if t = 1$ $π_{i}^{α} (A_{1 : n}^{1 : t - 1}, R_{t}) = δ (A_{i}^{t - 1}) if t > 1, R_{t} \neq i$ $π_{i}^{α} (A_{1 : n}^{1 : t - 1}, R_{t}) = e^{α \cdot (u_{min} - U (A_{1 : n}^{t - 1}))} \cdot U n i f o r m (A) + (1 - e^{α \cdot (u_{min} - U (A_{1 : n}^{t - 1}))}) \cdot δ (A_{i}^{t - 1}) otherwise$

where $δ (a)$ is a delta distribution on $a$ , and probability distributions are taken to be vectors.

These policies start out with each player selecting a random action. In each round, a random player is selected (according to $R_{t}$ ); this player switches their action to some uniformly random action with a probability that is lower the higher the utility in the previous round was, and otherwise sticks with the same action. Players other than the randomly selected player always stick with the same action.

An important aspect of these policies is that each player only does local optimization. That is, each player shifts their own action depending on the utility in the last round, such that over the course of many iterations they are more likely to take an action that attains a high utility given the other players’ actions. Naively, one would expect a local optimization algorithm like this one to get stuck in traps such as the (low, low) Nash equilibrium, but this turns out not to be the case.

We will prove the following theorem:

Theorem 1: For all $ϵ > 0$ there exists $α$ such that ${lim}_{t \to \infty} v_{t} (π_{1 : n}^{α}) \geq {max}_{a_{1 : n} \in A^{n}} U (a_{1 : n}) - ϵ$ .

i.e. by setting $α$ high enough, these policies achieve an arbitrarily-close-to-optimal expected utility in the limit.

Analyzing the process as a Markov chain

To prove this theorem, we can analyze the multi-round process with these policies as a Markov chain where the state is the action vector $A_{1 : n}^{t}$ . The transition function (giving the distribution of $A_{1 : n}^{t + 1}$ given $A_{1 : n}^{t}$ ) is:

$t^{α} (a_{1 : n}) (a_{1 : i - 1}, a_{i}^{'}, a_{i + 1 : n}) := \frac{e^{α \cdot (u_{min} - U (a_{1 : n}))}}{n | A |} if a_{i}^{'} \neq a_{i}$

$t^{α} (a_{1 : n}) (a_{1 : n}) := 1 - \sum a_{i}^{'} \neq a_{i} t (a_{1 : n}) (a_{1 : i - 1}, a_{i}^{'}, a_{i + 1 : n})$

$t^{α} (a_{1 : n}) (a_{1 : n}^{'}) := 0 if a_{1 : n} and a_{1 : n}^{'} differ on at least two elements$

Consider the following distribution over states:

$p^{α} (a_{1 : n}) := \frac{e^{α U (a_{1 : n})}}{\sum_{a^{'} \in A^{n}} e^{α U (a_{1 : n}^{'})}}$

i.e. it is a softmax that is more likely to select states with higher utilities.

We will show:

Lemma 1: $p$ is the unique stationary distribution of the Markov chain defined by $t$ .

Proof:

It is well-known (e.g. see this Wikipedia article) that an ergodic Markov chain satisfying detailed balance (a condition defined later) with respect to a probability distribution over states has this distribution of states as the unique stationary distribution.

The Markov chain is aperiodic, since every state has a nonzero probability of transitioning to itself. Additionally, it is irreducible (i.e. it is possible to get from any state to any other state), since any action in the state can change to any other action with nonzero probability while leaving the other actions fixed. Therefore, the Markov chain is ergodic.

This Markov chain satisfies a detailed balance condition relative to the state distribution $p$ . Specifically, for any states $a_{1 : n}$ and $a_{1 : n}^{'}$ :

$p^{α} (a_{1 : n}) t^{α} (a_{1 : n}) (a_{1 : n}^{'}) = p^{α} (a_{1 : n}^{'}) t^{α} (a_{1 : n}^{'}) (a_{1 : n})$

i.e. if a state is sampled from $p$ then transitioned once, the probability of a transition from $a_{1 : n}$ to $a_{1 : n}^{'}$ equals the probability of a transition from $a_{1 : n}^{'}$ to $a_{1 : n}$ .

The proof of this detailed balance property proceeds by case analysis on $a_{1 : n}$ and $a_{1 : n}^{'}$ .

If $a_{1 : n} = a_{1 : n}^{'}$ , then the condition is trivial.
If $a_{1 : n}$ and $a_{1 : n}^{'}$ differ on more than one element, then both transition probabilities are 0, so the condition is trivial.
If they differ on a single element $i$ , then:

$p^{α} (a_{1 : n}) t^{α} (a_{1 : n}) (a_{1 : n}^{'}) = \frac{e^{α U (a_{1 : n})}}{\sum_{b_{1 : n} \in A^{n}} e^{α U (b_{1 : n})}} \frac{e^{α \cdot (u_{min} - U (a_{1 : n}))}}{n | A |}$ $= \frac{e^{α u_{min}}}{n | A | \sum_{b_{1 : n} \in A^{n}} e^{α U (b_{1 : n})}}$

and similarly

$p^{α} (a_{1 : n}^{'}) t^{α} (a_{1 : n}^{'}) (a_{1 : n}) = \frac{e^{α U (a_{1 : n}^{'})}}{\sum_{b_{1 : n} \in A^{n}} e^{α U (b_{1 : n})}} \frac{e^{α \cdot (u_{min} - U (a_{1 : n}^{'}))}}{n | A |}$ $= \frac{e^{α u_{min}}}{n | A | \sum_{b_{1 : n} \in A^{n}} e^{α U (b_{1 : n})}} .$

Since the Markov chain is ergodic and satisfies detailed balance with respect to $p^{α}$ , it follows that $p^{α}$ is the unique stationary distribution of the Markov chain.

$□$

Now we will prove Theorem 1.

Theorem 1: For all $ϵ > 0$ there exists $α$ such that ${lim}_{t \to \infty} v_{t} (π_{1 : n}^{α}) \geq {max}_{a_{1 : n} \in A^{n}} U (a_{1 : n}) - ϵ$ .

Proof:

Clearly, ${lim}_{t \to \infty} v_{t} (π_{1 : n}^{α}) = E_{a_{1 : n} \sim p^{α}} [U (a_{1 : n})]$ , since $p^{α}$ is the unique stationary distribution of $A_{1 : n}^{t}$ considered as a Markov chain.

Define $u^{*} := {max}_{a_{1 : n} \in A^{n}} U (a_{1 : n})$ , i.e. the highest achievable utility. Define $u^{'}$ to be the second-highest achievable utility.

At this point, we need only set $α$ to be large enough that the $p^{α}$ assigns almost all of its probability mass to optimal states. The argument that this is possible to do follows.

Fix $ϵ > 0$ . Set $α := \frac{n log | A | + log (u^{*} - u_{min}) - log ϵ}{u^{*} - u^{'}}$ .

Let $a_{1 : n}^{*}$ be an optimal action profile, and let $a_{1 : n}^{'}$ be a non-optimal action profile. We have

$\frac{p^{α} (a_{1 : n}^{*})}{p^{α} (a_{1 : n}^{'})} = e^{α (U (a_{1 : n}^{*}) - U (a_{1 : n}^{'}))} \geq e^{α (u^{*} - u^{'})} = \frac{| A |^{n} (u^{*} - u_{min})}{ϵ}$

Since there is at least one optimal state and fewer than $| A |^{n}$ non-optimal states, the ratio between the probability (under $p^{α}$ ) of a state being optimal and the probability of a state being non-optimal is at least $\frac{u^{*} - u_{min}}{ϵ}$ . Since in general $\frac{x}{1 + x} \geq 1 - \frac{1}{x}$ for positive $x$ , this means that the probability of a state being optimal is at least $1 - \frac{ϵ}{u^{*} - u_{min}}$ . A non-optimal state has utility at least $u_{min}$ , so the overall expected suboptimality (i.e. difference between $u^{*}$ and $u (a_{1 : n})$ ) is at most $ϵ$ , as desired.

$□$

Extension to non-common-payoff games

The policies defined above can be extended to play non-common-payoff games. Specifically, each agent can have a higher likelihood of switching action in each round if their own utility is lower. Unfortunately, this results in mutual defection in a prisoner’s dilemma; while I won’t present a detailed analysis here, the gist of the argument is that (C, C) is much more likely to switch into (C, D) or (D, C), than (C, D) or (D, C) is to switch into (C, C). So a different approach than this one will be needed for achieving Pareto optimal outcomes in non-common-payoff games.

Conclusion

In this post I have shown that, in common-payoff games under a multi-round structure, there is a set of policies that only does local optimization but which, together, yield a globally near-optimal outcome. This is important because getting global optimality from local optimization is required for decision theory (i.e. each agent taking an action based on its own expected utility) to produce Pareto optimal outcomes in common-payoff games.

In a previous post on coordination games, I presented a plan for solving formal decision theory:

Solve Cartesian common-payoff decision theory problems where players reason about each other using something like reflective oracles.
Extend the solution in 1 to Cartesian multiplayer game theory problems where players have different utility functions and reason about each other using something like reflective oracles.
Extend the solution in 2 to a logically uncertain naturalized setting.

This post has done something pretty close to solving 1, with the multi-round structure taking the place of reflective oracles. Unfortunately, step 2 fails since the solution method does not extend to non-common-payoff games. So it appears that, for my plan to work, 1 must be solved in a different way that can be extended to also solve non-common-payoff games.

Alternatively, the problem of non-common-payoff games may be attacked directly, though this seems difficult given that non-common-payoff games are more complex than common-payoff games, and include all the dynamics present in common-payoff games. But perhaps restricted subsets of general games, other than the set of common-payoff games, can be found and solved.

Reducing collective rationality to individual optimization in common-payoff games using MCMC

Problem setup

Policies that solve the game

Analyzing the process as a Markov chain

Extension to non-common-payoff games

Conclusion