My take on higher-order game theory

Nisan30 Nov 2021 5:56 UTC

LW: 40 AF: 18

Domain Theory Game Theory World Modeling

This is how I currently think about higher-order game theory, the study of agents thinking about agents thinking about agents....

This post doesn’t add any new big ideas beyond what was already in the post by Diffractor linked above. I just have a slightly different perspective that emphasizes the “metathreat” approach and the role of nondeterminism.

This is a work in progress. There’s a bunch of technical work that must be done to make this rigorous. I’ll save the details for the last section.

Multiple levels of strategic thinking

Suppose you’re an agent with accurate beliefs about your opponents. It doesn’t matter where your beliefs come from; perhaps you have experience with these opponents, or perhaps you read your opponents’ source code and thought about it. Your beliefs are accurate, although for now we’ll be vague about what exactly “accurate” means.

In a game of Chicken, you may want to play Swerve, since that’s the maximin strategy. This is zeroth-order thinking because you don’t need to predict what your opponent will do.

Or maybe you predict what your opponent will do and play the best response to that, Swerving if they’ll go Straight and going Straight if they’ll Swerve. This is first-order thinking.

Or maybe you know your opponent will use first-order thinking. So you resolve to go Straight, your opponent will predict this, and they will Swerve. This is second-order thinking.

Beyond second-order, Chicken turns into a commitment race.

In Prisoner’s Dilemma, zeroth- and first-order thinking recommend playing Defect, as that’s a dominant strategy. Second-order thinking says that it’s good to Cooperate on the margin if by doing so you cause your opponent to also Cooperate on the margin. Let’s say it’s worth it to Cooperate with probability $p$ if your opponent thereby Cooperates with probability more than $\frac{p}{2}$ (although the exact numbers depend on the payoff matrix).

If you’re a third-order agent, you might commit to Cooperating with probability equal to 51% times the probability that your opponent Cooperates. This causes your opponent to Cooperate with certainty, and thus you’re bound to Cooperate with probability 51%. This is a Pareto-optimal outcome that favors you heavily.

Fixed points

So far we’ve talked about agents of order $n + 1$ playing against agents of order $n$ . What about games between agents of equal order?

If two first-order agents play the best response to each other, then they play a Nash equilibrium. Let’s see how this works in Matching Pennies, which has a mixed-strategy Nash equilibrium:

If the Even player thinks Odd is more likely to play Heads, then Even will play Heads. If the Even player thinks Odd is more likely to play Tails, then Even will play Tails. If Even thinks Odd is equally likely to play Heads or Tails, then Even will be indifferent. So if both players have well-calibrated beliefs, we must be in the equilibrium where both players are indifferent and both players play $\frac{1}{2}$ Heads, $\frac{1}{2}$ Tails.

(What if Even always plays Heads when it’s indifferent? In that case, there is no equilibrium, which is awkward for our theory. But this behavior requires Even to represent its beliefs as real numbers. If we make the more realistic assumption that Even’s beliefs are arbitrary-precision numbers, then it can never be sure that it’s on the decision boundary. It gets stuck in a for loop, querying its belief for ever more precision in case it’s not exactly $0.5000 \dots$ . (Then some other process, like infinitesimal errors in the beliefs, causes the agent to output the equilibrium strategy of $\frac{1}{2}$ Heads, $\frac{1}{2}$ Tails.))

The space of all agents

Now we’re ready to define our types. (I’ll leave some details to the last section.) The space $A_{0}$ of 0th-order agents is just the set of actions or pure strategies.^[1]

The space of $(n + 1)$ st-order agents is the space of functions on probability distributions on $A_{n}$ . That is, $A_{n + 1} = [Δ A_{n} \to Δ A_{n}]$ .

So an $(n + 1)$ st-order agent is a stochastic function from its $n$ th-order beliefs to a successor agent of order $n$ . Creating a successor agent is a form of currying.

Playing a game

Two agents $a_{n + 1}, b_{n + 1}$ in $A_{n + 1}$ interact as follows:

Find distributions $c_{n}, d_{n}$ on $A_{n}$ such that $c_{n} = a_{n + 1} (d_{n})$ and $d_{n} = b_{n + 1} (c_{n})$ .

So $(c_{n}, d_{n})$ is a fixpoint representing consistent $n$ th-order beliefs that the agents have about each other.^[2]
Sample $a_{n}$ from $c_{n}$ and sample $b_{n}$ from $d_{n}$ .
Repeat steps 1 and 2 on $a_{n}$ and $b_{n}$ until we end up with $a_{0}, b_{0}$ in $A_{0}$ .
Calculate the payoff $u (a_{0}, b_{0})$ to the first player.

Now take the expectation over all samples in step 2, and take the minimum over all choices of fixpoint in step 1. This gives a lower bound on the first player’s expected payoff.

Some agents

Now we can write down formulas for the kind of higher-order reasoning discussed earlier.

The first-order agent that always plays the best response to its opponent is defined by $a_{1} (d_{0}) = {argmax}_{a_{0}} E [u (a_{0}, d_{0})]$ . It achieves a Nash equilibrium when playing against itself.

A second-order agent that always plays the best response to its opponent is $a_{2} (d_{1}) (d_{0}) = {argmax}_{c_{0}} E [u (c_{0}, d_{1} (c_{0}))]$ .^[3] This agent also achieves a Nash equilibrium when playing against itself. In a future post I’ll construct a different 2nd-order agent that both plays the best response to any first-order agent, and also achieves a Pareto-optimal outcome when playing against itself.

Summary

Order | Beliefs about other agent | Strategic behavior | n vs. n equilibrium
----------------------------------------------------------------------------
0     | None                      | Maximin            | Maximin
----------------------------------------------------------------------------
1     | 0th order                 | Best response      | Nash equilibrium
----------------------------------------------------------------------------
2     | 1st order                 | Argmax/optimize    | Pareto optimum
----------------------------------------------------------------------------
...

Infinity

There’s a sense in which the 0th-order maximin agent is an approximation of the 1st-order best-response agent, which is itself an approximation of the 2nd-order argmaxing agent. Standard domain-theoretical constructions should allow construction of infinite-order agents, including an infinite-order strategic agent that subsumes all finite-order best-response behavior. Fortunately the hierarchy ends at infinity; $A_{\infty}$ has type $[Δ A_{\infty} \to Δ A_{\infty}]$ . However, whether and how this works depends a lot on the technical details I haven’t worked out, so I hesitate to say any more about it.

Correlated strategies

This framework doesn’t have correlated strategies built in. But I have a hunch that agents of sufficiently high order can play correlated strategies anyways.

Technical details

Everything is a domain; in particular, a dcpo. $X_{0}$ is also a lower semilattice. The mapping spaces $[X \to Y]$ are spaces of Scott-continuous functions.

$Δ$ is some kind of probabilistic powerdomain. An approach similar to (Jones & Plotkin 1989) might just work: $Δ X$ would be the set of probability measures on the Borel $σ$ -algebra of $X$ which restrict to continuous evaluations on open sets.

But it’s possible $Δ X$ will have to contain sets of probability measures, in order to express nondeterministic choice of fixpoint. Possible approaches include (Mislove 2000), (Tix, Keimel, Plotkin 2009), (Goubault-Larrecq 2007), and of course (Appel and Kosoy 2021). There’s also the field of quantitative semantics, which I don’t know anything about.

I’d like to have fixpoints $(c_{n}, d_{n})$ which are maximal elements of $Δ X_{n} \times Δ X_{n}$ . This might follow easily from the Kakutani fixed point theorem, but it really depends on how $Δ$ is defined.

Another issue is that higher-order strategic agents have to compute fixpoints themselves (in order to calculate the expected value of playing a lower-order policy). I’m not sure if this can be done continuously. It might require the use of a reflective oracle somehow.

This work was supported by a grant from the Center on Long-Term Risk. Thanks to Alex Appel, Adele Dewey-Lopez, and Patrick LaVictoire for discussions about these ideas.

↩︎
Actually, we want the set of nonempty sets of actions, so an agent can express indifference between actions.
↩︎
Really we want $c_{n}$ and $d_{n}$ to be maximal elements of the poset $Δ A_{n}$ such that $c_{n} ⊒ a_{n + 1} (d_{n})$ and $d_{n} ⊒ b_{n + 1} (c_{n})$ . See the section on technical details.
↩︎
Actually you can’t compute an argmax over a function of a continuous variable. The best you can do is a distribution that’s biased towards higher-payoff outcomes, like quantilization or best-of- $k$ sampling. I’ll write more about this in the future.

What links here?

Nisan30 Nov 2021 5:56 UTC

LW: 40 AF: 18

8 comments5 min readLW link

Domain Theory Game Theory World Modeling

Cleo Nardo 17 Oct 2023 1:11 UTC
LW: 5 AF: 2
0
AF
Hey Nisan. Check the following passage from Domain Theory (Samson Abramsky and Achim Jung). This might be helpful for equipping $Δ$ with an appropriate domain structure. (You mention [JP89] yourself.)
We should also mention the various attempts to define a probabilistic version of the powerdomain construction, see [SD80, Mai85, Gra88, JP89, Jon90].
- [SD80] N. Saheb-Djahromi. CPO’s of measures for nondeterminism. Theoretical Computer Science, 12:19–37, 1980.
- [Mai85] M. Main. Free constructions of powerdomains. In A. Melton, editor, Mathematical Foundations of Programming Semantics, volume 239 of Lecture Notes in Computer Science, pages 162–183. Springer Verlag, 1985.
- [Gra88] S. Graham. Closure properties of a probabilistic powerdomain construction. In M. Main, A. Melton, M. Mislove, and D. Schmidt, editors, Mathematical Foundations of Programming Language Semantics, volume 298 of Lecture Notes in Computer Science, pages 213–233. Springer Verlag, 1988.
- [JP89] C. Jones and G. Plotkin. A probabilistic powerdomain of evaluations. In Proceedings of the 4th Annual Symposium on Logic in Computer Science, pages 186–195. IEEE Computer Society Press, 1989.
- [Jon90] C. Jones. Probabilistic Non-Determinism. PhD thesis, University of Edinburgh, Edinburgh, 1990. Also published as Technical Report No. CST63-90.
During my own incursion into agent foundations and game theory, I also bumped into this exact obstacle — namely, that there is no obvious way to equip $Δ$ with a least-fixed-point constructor ${Fix}_{X}^{Δ} : (X \to Δ X) \to Δ X$ . In contrast, we can equip $P$ with a LFP constructor ${Fix}_{X}^{P} : (X \to P X) \to P X, g \mapsto {x \in X : x \in g (x)}$ .
One trick is to define ${Fix}_{X}^{Δ} (g)$ to be the distribution $π \in Δ X$ which maximises the entropy $H (π)$ subject to the constraint $g^{Δ} (π) = π$ .
- A maximum entropy distribution $π^{*}$ exists, because —
  - For $g : X \to Δ X$ , let $g^{Δ} : Δ X \to Δ X$ be the lift via the $Δ$ monad, and let $G = {π \in Δ X | g^{Δ} (π) = π}$ be the set of fixed points of $g^{Δ}$ .
  - $Δ X$ is Hausdorff and compact, and $g^{Δ} : Δ X \to Δ X$ is continuous, so $G = {π \in Δ X : π = g^{Δ} (π)}$ is compact.
  - $H : Δ X \to R$ is continuous, and $G \subseteq Δ X$ is compact, so $H$ obtains a maximum $π^{*}$ in $G$ .
- Moreover, $π^{*}$ must be unique, because —
  - $G$ is a convex set, i.e. if $π_{1} = g^{Δ} (π_{1})$ and $π_{2} = g^{Δ} (π_{2})$ then $λ_{1} π_{1} + λ_{2} π_{2} = g^{Δ} (λ_{1} π_{1} + λ_{2} π_{2})$ for all $λ_{1} + λ_{2} = 1$ .
  - $H : Δ X \to R$ is strictly concave, i.e. $H (λ_{1} π_{1} + λ_{2} π_{2}) \geq λ_{1} H (π_{1}) + λ_{2} H (π_{2})$ for all $λ_{1} + λ_{2} = 1$ , and moreover the inequality is strict if $π_{1} \neq π_{2}$ and $λ_{1}, λ_{2} > 0$ .
  - Hence if $π_{1}^{*}, π_{2}^{*} \in G$ both obtain the maximum entropy, then $π_{1}^{*} \neq π_{2}^{*} ⟹ H (0.5 π_{1} + 0.5 π_{2}) > 0.5 H (π_{1}) + 0.5 H (π_{2})$ , a contradiction.
The justification here is the Principle of Maximum Entropy:
Given a set of constraints on a probability distribution, then the “best” distribution that fits the data will be the one of maximum entropy.
More generally, we should define ${Fix}_{X}^{Δ} (g)$ to be the distribution $π \in Δ X$ which minimises cross-entropy $D_{K L} (π | | π_{0})$ subject to the constraint $π = g (π)$ , where $π_{0}$ is some uninformed prior such as Solomonoff. The previous result is a special case by considering $π_{0}$ to be the uniform prior. The proof generalises by noting that $D_{K L} (- | | π_{0}) : Δ X \to R$ is continuous and strictly convex. See the Principle of Minimum Discrimination.
Ideally, we’d like ${Fix}^{P}$ and ${Fix}^{Δ}$ to “coincide” modulo the maps $Supp : Δ X \to P X$ , i.e. $Supp ({Fix}^{Δ} (g)) = {Fix}^{P} (Supp \circ g)$ for all $g : X \to Δ X$ . Unfortunately, this isn’t the case — if $g : H \mapsto 0.5 \cdot | H ⟩ + 0.5 \cdot | T ⟩, T \mapsto | T ⟩$ then ${Fix}^{P} (Supp \circ g) = {H, T}$ but $Supp ({Fix}^{Δ} (g)) = {T}$ .
Alternatively, we could consider the convex sets of distributions over $X$ .
Let $C (X)$ denote the set of convex sets of distributions over $X$ . There is an ordering $\leq$ on $C (X)$ where $A \leq B ⟺ A \supseteq B$ . We have a LFP operator ${Fix}_{X}^{C} : (X \to C X) \to C X$ via $g \mapsto ⋃ {S \in C X : g_{X}^{C} (S) = S}$ where $g^{C} : C X \to C X, S \mapsto {\sum_{i = 1}^{n} α_{i} \cdot π_{i} | π_{i} \in g (x_{i}), \sum_{i = 1}^{n} α_{i} | x_{i} ⟩ \in S}$ is the lift of $g : X \to C X$ via the $C$ monad.
- Nisan 21 Oct 2023 19:44 UTC
  LW: 3 AF: 1
  0
  AF Parent
  Thanks! For convex sets of distributions: If you weaken the definition of fixed point to $g^{C} (S) \leq S$ , then the set ${S \in C X : g^{C} (S) = S}$ has a least element which really is a least fixed point.
romeostevensit 1 Dec 2021 16:21 UTC
LW: 3 AF: 2
0
AF
Tangential, but did you ever happen to read statistical physics of human cooperation?
- Nisan 1 Dec 2021 18:59 UTC
  LW: 2 AF: 1
  0
  AF Parent
  No, I just took a look. The spin glass stuff looks interesting!
  - romeostevensit 1 Dec 2021 22:10 UTC
    LW: 3 AF: 1
    0
    AF Parent
    Are we talking about the same thing?
    
    https://www.sciencedirect.com/science/article/am/pii/S0370157317301424
    - Nisan 2 Dec 2021 4:48 UTC
      LW: 2 AF: 1
      0
      AF Parent
      Yep, I skimmed it by looking at the colorful plots that look like Ising models and reading the captions. Those are always fun.
Charlie Steiner 1 Dec 2021 4:48 UTC
LW: 2 AF: 1
0
AF
I have a question about this entirely divorced from practical considerations. Can we play silly ordinal games here?
If you assume that the other agent will take the infinite-order policy, but then naively maximize your expected value rather than unrolling the whole game-playing procedure, this is sort of like $ω + 1$ . So I guess my question is, if you take this kind of dumb agent (that still has to compute the infinite agent) as your baseline and then re-build an infinite tower of agents (playing other agents of the same level) on top of it, does it reconverge to $A_{\infty}$ or does it converge to some weird $A_{ω 2}$ ?
- Nisan 1 Dec 2021 7:23 UTC
  LW: 2 AF: 1
  0
  AF Parent
  I think you’re saying $A_{ω + 1} := [Δ A_{ω} \to Δ A_{0}]$ , right? In that case, since $A_{0}$ embeds into $A_{ω}$ , we’d have $A_{ω + 1}$ embedding into $A_{ω}$ . So not really a step up.
  
  If you want to play ordinal games, you could drop the requirement that agents are computable / Scott-continuous. Then you get the whole ordinal hierarchy. But then we aren’t guaranteed equilibria in games between agents of the same order.
  
  I suppose you could have a hybrid approach: Order $ω + 1$ is allowed to be discontinuous in its order- $ω$ beliefs, but higher orders have to be continuous? Maybe that would get you to $ω 2$ .