# Game Theory without Argmax [Part 2]

*Written during the SERI MATS program under the joint mentorship of **John Wentworth**, **Nicholas Kees**, and **Janus**.*

# Motivation

I am surrounded, incessantly, by other people, possessing a range of different capabilities, striving for goals both common and opposing — and, what is more, they are incessantly surrounded by me. Fangs pierce the cartesian boundary. Outsideness leaks in. Insideness leaks out. It is the source of all joy and misery in my life.

Anyway. It’s the interaction of agents that characterises game theory.^{[1]} Classical game theory says that when two utility-maximisers interact in a shared environment, they will choose options in nash equilibrium, where a profile of options is in nash equilibrium if no player can increase their utility by unilaterally changing their option.

What does higher-order game theory (which dispenses with utility functions and maximisation) say about the interaction between agents? Would a satisficer cooperate with a quantiliser in prisoner’s dilemma?

Let’s find out.

# Preliminaries

All you need to know is Definition 3 from [Part 1], and some basic classical game theory.

**Definition 3.** Let $X$ be any set of *options* and $R$ be any set of *payoffs*. An optimiser is any functional $\psi : (X \to R) \to \mathcal{P}(X)$. A $\psi$-*task* is any function $u : X \to R$. An option $x \in X$ is $\psi$-*optimal* for a task $u : X \to R$ if and only if $x \in \psi(u)$.

Let’s review the textbook definition of classical simultaneous games.

**Definition 5.**

Let $(X_i)_{i \in I}$ be a family of sets indexed by a set $I$ of players.

A *classical simultaneous game* over $(X_i)_{i \in I}$ is a function $g : \prod_{i \in I} X_i \to \mathbb{R}^I$.

The *option-profiles* of $g$ are the elements of the product $X = \prod_{i \in I} X_i$.

The *$i$-th deviation map* is the function $U_i : X \to (X_i \to X)$ given by $U_i(x)(\alpha)_j = \alpha$ if $j = i$ and $U_i(x)(\alpha)_j = x_j$ otherwise, for all $i, j \in I$, $x \in X$, and $\alpha \in X_i$.

In the finite case, where $I = \{1, \ldots, n\}$, then $U_i(x)(\alpha) = (x_1, \ldots, x_{i-1}, \alpha, x_{i+1}, \ldots, x_n)$.

The deviation map describes how the $i$-th player can change an option-profile by unilaterally changing their own option. Note that the deviation maps depend only on the option spaces, not the game itself.

The *best-response function* of $g$ is the function $B : X \to \mathcal{P}(X)$ defined by $B(x) = \prod_{i \in I} \operatorname{argmax}(\pi_i \circ g \circ U_i(x))$, where $U_i$ is the $i$-th deviation map and $\pi_i : \mathbb{R}^I \to \mathbb{R}$ is the $i$-th projection.

Finally, the *nash equilibria* of $g$ are the set of option-profiles $x \in X$ such that $x \in B(x)$.

Note that the fact that the option-profiles are in nash equilibrium follows *logically* from the assumption that each player’s choice is utility-maximising for the task they face, plus the assumption that, if the option-profile is $x$, then the $i$-th player faces the task $\pi_i \circ g \circ U_i(x) : X_i \to \mathbb{R}$.

It *doesn’t* require the assumption that the players have “common knowledge” of each other’s rationality, or that the players have any kinds of beliefs whatsoever. That being said, an economist might explain the optimality of the choices by assuming that each player is rational and knows all the relevant facts (i.e. the rules of the game, the rationality of their peers, and the beliefs of their peers). However, an evolutionary biologist might appeal to natural selection to explain optimality. An ML engineer might appeal to SGD, etc.
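Definition 5 can be made concrete with a brute-force search. Below is a minimal possibilistic sketch in Python; the helper names (`argmax`, `classical_nash`, `deviate`) and the prisoner’s-dilemma payoff numbers are my own choices for illustration. It checks $x \in B(x)$ by testing, for each player, whether their component is an argmax over their unilateral deviations.

```python
# A brute-force sketch of Definition 5: enumerate option-profiles and keep
# those fixed by the best-response function. Helper names are hypothetical.
from itertools import product

def argmax(options, u):
    """All options attaining the maximum of u -- the classical optimiser."""
    best = max(u(x) for x in options)
    return {x for x in options if u(x) == best}

def classical_nash(option_sets, g):
    """Profiles x with x in B(x), where B(x) is the product of each player's
    argmax over their unilateral deviations from x."""
    equilibria = []
    for x in product(*option_sets):
        def deviate(i, a):  # the i-th deviation map U_i(x)(a)
            return x[:i] + (a,) + x[i+1:]
        if all(x[i] in argmax(option_sets[i], lambda a, i=i: g(deviate(i, a))[i])
               for i in range(len(option_sets))):
            equilibria.append(x)
    return equilibria

# Prisoner's dilemma with conventional payoff numbers (an assumption).
pd = {('C','C'): (2,2), ('C','D'): (0,3), ('D','C'): (3,0), ('D','D'): (1,1)}
print(classical_nash([['C','D'], ['C','D']], lambda x: pd[x]))  # [('D', 'D')]
```

With these payoffs, the only fixed point of the best-response function is mutual defection.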

# Higher-Order Nash Equilibrium

By inspection, we can see that at no point does the definition of the classical nash equilibrium appeal to any special properties of $\mathbb{R}$ or $\operatorname{argmax}$. This suggests that we can effortlessly generalise the definition of nash equilibrium to any family of optimisers, rather than only utility-maximisers.

**Definition 6.**

Let $(X_i)_{i \in I}$ be a family of sets indexed by a set $I$ of players. Let $R$ be a set of payoffs, and let $(\psi_i)_{i \in I}$ be a family of optimisers with $\psi_i : (X_i \to R) \to \mathcal{P}(X_i)$.^{[2]}

A *higher-order simultaneous game* is a function $g : \prod_{i \in I} X_i \to R$.

Like before, the *option-profiles* of $g$ are the elements of the product $X = \prod_{i \in I} X_i$.

Like before, the *$i$-th deviation map* is the function $U_i : X \to (X_i \to X)$ given by $U_i(x)(\alpha)_j = \alpha$ if $j = i$ and $U_i(x)(\alpha)_j = x_j$ otherwise, for all $i, j \in I$, $x \in X$, and $\alpha \in X_i$.

The *$(\psi_i)_{i \in I}$-best-response function* of $g$ is the function $B : X \to \mathcal{P}(X)$ defined by $B(x) = \prod_{i \in I} \psi_i(g \circ U_i(x))$, where $U_i$ is the $i$-th deviation map.

Like before, the *$(\psi_i)_{i \in I}$-nash equilibria* of $g$ are the set of option-profiles $x \in X$ such that $x \in B(x)$. When clear from context, I’ll just say *game*, *best-response*, and *nash equilibrium*.

The intuition is this —

Suppose that $x \in X$ is the option-profile that’s collectively chosen.

Then player $i$ can achieve the payoff $g(U_i(x)(\alpha))$ by unilaterally changing their option to $\alpha \in X_i$, because $U_i(x)(\alpha)$ would be the resulting option-profile.

So player $i$ is faced with the task $g \circ U_i(x) : X_i \to R$.

Their choice $x_i$ must be $\psi_i$-optimal for this task, so $x_i \in \psi_i(g \circ U_i(x))$.

If we assume $\psi_i$-optimality for every player conjunctively, then $x \in \prod_{i \in I} \psi_i(g \circ U_i(x)) = B(x)$.

If we know that $x \in B(x)$, and we know nothing else, then the possible option profiles are $\{x \in X \mid x \in B(x)\}$.
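The steps above can be run mechanically under the possibilistic reading: enumerate option-profiles and keep those with $x_i \in \psi_i(g \circ U_i(x))$ for every $i$. A sketch follows; the helper names, the payoff numbers, and the particular pair of optimisers (a maximiser of the first payoff coordinate against a satisficer of the second, anchored at `'C'`) are all my own assumptions for illustration.

```python
# Brute-force higher-order nash equilibria: x is nash iff every player's
# component is psi_i-optimal for the task g . U_i(x). Names are hypothetical.
from itertools import product

def nash_equilibria(option_sets, optimisers, g):
    """All option-profiles x with x[i] in psi_i(g . U_i(x)) for every i."""
    eq = []
    for x in product(*option_sets):
        # lambda a: g(x with i-th entry replaced by a) is the psi_i-task
        if all(x[i] in optimisers[i](lambda a, i=i, x=x: g(x[:i] + (a,) + x[i+1:]))
               for i in range(len(x))):
            eq.append(x)
    return eq

# Shared payoff space R = pairs of numbers (assumed payoff matrix).
X = ('C', 'D')
pd = {('C','C'): (2,2), ('C','D'): (0,3), ('D','C'): (3,0), ('D','D'): (1,1)}
g = lambda x: pd[x]
maximiser  = lambda v: {a for a in X if v(a)[0] == max(v(b)[0] for b in X)}
satisficer = lambda v: {a for a in X if v(a)[1] >= v('C')[1]}  # anchor 'C'
print(nash_equilibria([X, X], [maximiser, satisficer], g))  # [('D', 'C'), ('D', 'D')]
```

With these numbers the maximiser always defects, while the anchored satisficer is content either way, so two profiles are nash.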

This definition applies even to unusual cases —

Zero-player games? When $I = \emptyset$, then a game is just an element of the payoff space $R$. There’s only one option-profile, the empty sequence, and it’s vacuously nash.

One-player games? When $I = \{1\}$, then a game is just a $\psi_1$-task for the solitary player. An option-profile for the game is specified by an option for the player, and this option-profile is nash whenever the option is optimal for the player.

When $I$ is countably infinite, or indeed uncountably infinite, then the definition still makes sense and gives the right answer.^{[3]}

We obtain the *classical* simultaneous game when $R = \mathbb{R}^I$ and $\psi_i = \operatorname{argmax}(\pi_i \circ -)$, and in this case the higher-order nash equilibrium coincides with the classical nash equilibrium.

The higher-order nash equilibrium allows us to answer somewhat esoteric questions —

An argmaxer, a satisficer, and a quantiliser walk into a bar. They can order from menus $X_1$, $X_2$, and $X_3$ respectively, and the barman will mix their three orders $(x_1, x_2, x_3)$ into a single cocktail $g(x_1, x_2, x_3) \in R$. The argmaxer’s preferences over the cocktails are given by $u_1 : R \to \mathbb{R}$, and likewise $u_2$ for the satisficer and $u_3$ for the quantiliser.

What cocktail might they be served?

Well, the possible cocktails are $g(x_1, x_2, x_3)$ such that $x_1 \in \operatorname{argmax}(u_1 \circ g \circ U_1(x))$, $x_2 \in \operatorname{satisfice}(u_2 \circ g \circ U_2(x))$, and $x_3 \in \operatorname{quantilise}(u_3 \circ g \circ U_3(x))$.

Many well-studied variations of the classical nash equilibrium can be seen as the higher-order nash equilibrium between non-classical optimisers. For example, epsilon-equilibrium is precisely the higher-order nash equilibrium between epsilon-maximisers $(\operatorname{argmax}^{\epsilon})_{i \in I}$.^{[4]}
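As a quick check of that claim, here is epsilon-argmax (footnote 4) dropped into a brute-force equilibrium search: with zero slack we recover the classical equilibrium of the prisoner’s dilemma, while a slack of 1 admits every profile. The payoff numbers and helper names are assumptions for illustration.

```python
# epsilon-equilibrium as higher-order nash equilibrium between eps-argmaxers.
from itertools import product

def eps_argmax(options, eps):
    """argmax^eps(u) = {x : u(x) >= u(x') - eps for all x'} (footnote 4)."""
    return lambda u: {x for x in options
                      if all(u(x) >= u(y) - eps for y in options)}

def nash(option_sets, optimisers, g):
    """Each optimiser i receives the scalar task a |-> g(U_i(x)(a))[i]."""
    eq = []
    for x in product(*option_sets):
        if all(x[i] in optimisers[i](lambda a, i=i, x=x: g(x[:i] + (a,) + x[i+1:])[i])
               for i in range(len(x))):
            eq.append(x)
    return eq

X = ('C', 'D')
pd = {('C','C'): (2,2), ('C','D'): (0,3), ('D','C'): (3,0), ('D','D'): (1,1)}
g = lambda x: pd[x]
print(nash([X, X], [eps_argmax(X, 0)] * 2, g))  # [('D', 'D')] -- exact argmax
print(nash([X, X], [eps_argmax(X, 1)] * 2, g))  # all four profiles at slack 1
```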

# Optimisation is nash-sum closed

There’s a nice upshot of higher-order game theory — the nash equilibria between optimisers themselves constitute an optimiser!

Suppose that two agents are modelled by optimisers $\psi_1 : (X_1 \to R) \to \mathcal{P}(X_1)$ and $\psi_2 : (X_2 \to R) \to \mathcal{P}(X_2)$. We can combine them into a single function which assigns, to each game $g : X_1 \times X_2 \to R$, the set of $(\psi_1, \psi_2)$-nash equilibria of the game $g$. This function has type-signature $(X_1 \times X_2 \to R) \to \mathcal{P}(X_1 \times X_2)$, so it corresponds to a single optimiser with option space $X_1 \times X_2$ and payoff space $R$.

This gives us a binary operator on optimisers, $\oplus$, given by $(\psi_1 \oplus \psi_2)(g) = \{x \in X_1 \times X_2 \mid x \in \psi_1(g \circ U_1(x)) \times \psi_2(g \circ U_2(x))\}$.

This operator is both associative and commutative, justifying the name **“nash sum”.**

For example, the nash sum $\operatorname{argmax}(\pi_1 \circ -) \oplus \operatorname{argmax}(\pi_2 \circ -)$ will calculate the classical nash equilibria of a generic payoff matrix $g : X_1 \times X_2 \to \mathbb{R}^2$.

The operator can be extended to any family of optimisers, i.e. $\bigoplus_{i \in I} \psi_i$ where $\psi_i : (X_i \to R) \to \mathcal{P}(X_i)$.

Terminology:

- $\bigoplus_{i \in I} \psi_i$ is a single optimiser, while $(\psi_i)_{i \in I}$ is a family of optimisers.
- $\bigoplus_{i \in I} \psi_i$ is the nash-sum of $(\psi_i)_{i \in I}$, and $(\psi_i)_{i \in I}$ are subagents of $\bigoplus_{i \in I} \psi_i$.
- An option for $\bigoplus_{i \in I} \psi_i$ is an option-profile for $(\psi_i)_{i \in I}$.
- A task for the optimiser $\bigoplus_{i \in I} \psi_i$ is a game for the family of optimisers $(\psi_i)_{i \in I}$.
- An option is optimal for $\bigoplus_{i \in I} \psi_i$ in the task $g$ if and only if the option-profile is a nash equilibrium for $(\psi_i)_{i \in I}$ in the game $g$.
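The nash sum can be sketched as an actual operator: it takes a family of optimisers and returns a single optimiser on the product option space, with the type-signature discussed above. Function names and payoff numbers are my own assumptions.

```python
# nash_sum: family of optimisers -> one optimiser on the product option space.
from itertools import product

def nash_sum(optimisers, option_sets):
    """(nash_sum psi_i)(g) = the set of nash equilibria of the game g,
    so the combined thing has type (prod X_i -> R) -> P(prod X_i)."""
    def combined(g):
        eq = set()
        for x in product(*option_sets):
            if all(x[i] in optimisers[i](lambda a, i=i, x=x: g(x[:i] + (a,) + x[i+1:]))
                   for i in range(len(x))):
                eq.add(x)
        return eq
    return combined

# Classical coordinate-maximisers recover the classical nash equilibria.
X = ('C', 'D')
pd = {('C','C'): (2,2), ('C','D'): (0,3), ('D','C'): (3,0), ('D','D'): (1,1)}
max_coord = lambda i: (lambda v: {a for a in X if v(a)[i] == max(v(b)[i] for b in X)})
psi = nash_sum([max_coord(0), max_coord(1)], [X, X])
print(psi(lambda x: pd[x]))  # {('D', 'D')}
```

Because `combined` is again a function from tasks to sets of options, it can itself be fed into `nash_sum`, which is the compositionality the section is pointing at.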

# Totality is not nash-sum closed

Just because two optimisers $\psi_1$ and $\psi_2$ are total does not imply that their nash sum $\psi_1 \oplus \psi_2$ is total.^{[5]} In other words, we may have a model of an agent which works perfectly in all solitary situations, but breaks down when many such agents interact in a shared environment.

The textbook example is Matching Pennies. Each player must choose either heads or tails — player 1 wins if the coins match and player 2 wins if they don’t. If both players are utility-maximisers then there are no nash equilibria, even though utility-maximisers are total optimisers.

|   | H | T |
|---|---|---|
| **H** | $(1, 0)$ | $(0, 1)$ |
| **T** | $(0, 1)$ | $(1, 0)$ |

Formally speaking, $X_1 = X_2 = \{H, T\}$, $R = \mathbb{R}^2$, and $g(x_1, x_2) = (\delta_{x_1 x_2}, 1 - \delta_{x_1 x_2})$.^{[6]} Then $(\psi_1 \oplus \psi_2)(g) = \emptyset$, where $\psi_1 = \operatorname{argmax}(\pi_1 \circ -)$ and $\psi_2 = \operatorname{argmax}(\pi_2 \circ -)$, although $\psi_1(u_1) \neq \emptyset$ and $\psi_2(u_2) \neq \emptyset$ for all tasks $u_1$ and $u_2$.
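The failure of totality can be checked by brute force (helper names are mine): each player is a total coordinate-maximiser, yet the combined search over Matching Pennies returns the empty set.

```python
# Matching Pennies: g(x1, x2) = (delta, 1 - delta), delta the Kronecker delta.
from itertools import product

X = ('H', 'T')
g = lambda x: (1, 0) if x[0] == x[1] else (0, 1)

def argmax_coord(i):
    """A total optimiser: argmax of the i-th payoff coordinate."""
    return lambda v: {a for a in X if v(a)[i] == max(v(b)[i] for b in X)}

def nash(g, optimisers):
    return {x for x in product(X, X)
            if all(x[i] in optimisers[i](lambda a, i=i, x=x: g(x[:i] + (a,) + x[i+1:]))
                   for i in range(len(x)))}

print(nash(g, [argmax_coord(0), argmax_coord(1)]))  # set() -- no equilibria
```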

This shows why we *couldn’t* have defined optimisers as functions with type-signature $(X \to R) \to \mathcal{P}^{+}(X)$, where $\mathcal{P}^{+}(X)$ denotes the set of *non-empty* subsets of $X$. In order to ensure nash-sum closure of optimisers, we needed to include non-total optimisers.

# Utility-maximisation is not nash-sum closed.

It is well-known that the nash sum of utility-maximisers isn’t a utility-maximiser $\operatorname{argmax}(u \circ -)$ for any utility function $u$. For many games, none of the nash equilibria are even pareto-optimal!

The textbook case is the **prisoner’s dilemma.** Let $R = \mathbb{R}^2$ and consider two players $\psi_1 = \operatorname{argmax}(\pi_1 \circ -)$ and $\psi_2 = \operatorname{argmax}(\pi_2 \circ -)$ which respectively value the first and second coordinate of the payoff. Their nash sum is the optimiser $\psi_1 \oplus \psi_2$ which calculates the classical nash equilibria of 2-by-2 payoff matrices $g : \{C, D\}^2 \to \mathbb{R}^2$. In particular, when $g$ is the payoff matrix defined below, then $(\psi_1 \oplus \psi_2)(g) = \{(D, D)\}$. However, both players prefer the payoff $g(C, C)$ to $g(D, D)$.

|   | C | D |
|---|---|---|
| **C** | $(2, 2)$ | $(0, 3)$ |
| **D** | $(3, 0)$ | $(1, 1)$ |

# Consequentialism is not nash-sum closed.

In fact, the nash sum of two utility-maximisers isn’t even consequentialist!^{[7]}

Suppose Angelica is babysitting her cousin Tommy, and each child has been given a cookie by their grandmother. If Angelica is awake while Tommy sleeps, then she can steal both cookies for herself. If Angelica sleeps while Tommy is awake, then he’ll cause mayhem, getting both children in trouble. If Angelica and Tommy both stay asleep, or if they both wake up, then they’ll each keep one cookie.

The payoffs are summarised in the matrix below, with Angelica choosing the row and Tommy the column.

|   | Wake | Sleep |
|---|---|---|
| **Wake** | $(1, 1)$ | $(2, 0)$ |
| **Sleep** | $(0, 0)$ | $(1, 1)$ |

When Angelica and Tommy are both awake, this is nash — she doesn’t want to sleep because then he’ll cause mayhem, and he doesn’t want to sleep because then she’ll steal his cookie. In contrast, when they’re both sleeping, this *isn’t* nash — she can wake up and steal his cookie. But the same outcome will result whether Angelica and Tommy are both sleeping or both awake. So the nash sum of Angelica and Tommy isn’t a consequentialist optimiser!^{[8]}
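The story can be checked mechanically. The payoff numbers below are read off from the prose (any values with the same ordering would do, and the helper names are mine): awake-awake is the only equilibrium here, yet it shares its payoff with asleep-asleep, so optimality is not determined by payoff alone.

```python
# Angelica (player 0) and Tommy (player 1) each choose Wake or Sleep.
from itertools import product

X = ('W', 'S')
g = {('W','W'): (1,1), ('W','S'): (2,0), ('S','W'): (0,0), ('S','S'): (1,1)}

def argmax_coord(i):
    return lambda v: {a for a in X if v(a)[i] == max(v(b)[i] for b in X)}

def nash(optimisers):
    return {x for x in product(X, X)
            if all(x[i] in optimisers[i](lambda a, i=i, x=x: g[x[:i] + (a,) + x[i+1:]])
                   for i in range(len(x)))}

eq = nash([argmax_coord(0), argmax_coord(1)])
print(eq)                            # {('W', 'W')}: both-awake is nash...
print(g[('W','W')] == g[('S','S')])  # ...but both-asleep has the same payoff
```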

This shows why we *couldn’t* have defined optimisers as functions with type-signature $(X \to R) \to \mathcal{P}(R)$, which bakes in the assumption that all optimisers are consequentialist. In order to ensure the nash-sum closure of optimisers, we needed to include nonconsequentialist optimisers.^{[9]}

# Beyond CDT?

**Warning:** This section is a speculative extension of the existing literature; it’s merely a proof-of-concept of how higher-order game theory would extend to non-CDT agents.

Let $g : \{C, D\}^2 \to \mathbb{R}^2$ be the game defined by the payoff matrix below. We may readily check that $(\psi_1 \oplus \psi_2)(g) = \{(D, D)\}$, which corresponds to the well-known result that two utility-maximisers will defect in the prisoners’ dilemma.

|   | C | D |
|---|---|---|
| **C** | $(2, 2)$ | $(0, 3)$ |
| **D** | $(3, 0)$ | $(1, 1)$ |

Now, some decision theorists think that it’s not generally true that two rational utility-maximisers will defect in the prisoners’ dilemma — e.g. if they’re both perfect replicas of one another then they’ll both cooperate! The reasoning is this — cooperation from the first prisoner would count as evidence (to the first player) of high utility, and defection from the first prisoner would count as evidence of low utility. This corresponds to the observation that $\pi_1(g(C, C)) > \pi_1(g(D, D))$. So the first player should cooperate. Ditto the second player.

So what went wrong?

It all comes down to an ambiguity with the task $g \circ U_i(x) : X_i \to R$. What relationship is the task supposed to capture, exactly? The formalism itself is agnostic.

According to **causal decision theory**, the task $g \circ U_i(x) : X_i \to R$ is supposed to track which payoff is *caused* by the agent choosing $\alpha \in X_i$.

According to **evidential decision theory**, the task is supposed to track which payoff is *evidenced* by the agent choosing $\alpha \in X_i$.

So let’s explain our answer $(D, D)$, which agrees with CDT. We can trace the problem to the deviation maps $U_i$ in the definition of $B$.

The *$i$-th deviation map* $U_i : X \to (X_i \to X)$ is given by $U_i(x)(\alpha)_j = \alpha$ if $j = i$ and $U_i(x)(\alpha)_j = x_j$ otherwise, for all $i, j \in I$, $x \in X$, and $\alpha \in X_i$.

The map $U_i(x) : X_i \to X$ is the *causal* dependence of the overall option-profile on the $i$-th player’s choice. Hence, when the game $g : X \to R$ is the causal dependence of the overall payoff on the option-profile, it follows that the task $g \circ U_i(x) : X_i \to R$ is the causal dependence of the payoff on the $i$-th player’s choice. This is why we got the CDT answer — causation composed with causation is causation.

According to EDT, however, the task is supposed to capture the *evidential* dependence of the payoff on the player’s choice. And when the game $g$ is the evidential dependence of the overall payoff on the option-profile, it isn’t true that the task $g \circ U_i(x)$ is the evidential dependence of the payoff on the $i$-th player’s choice. Evidence composed with causation isn’t evidence.

To achieve the EDT answer, we need a function $E_i(x) : X_i \to X$ which captures the evidential dependence of the option-profile on the $i$-th player’s choice. In the case that all the players are replicas, $E_i(x)(\alpha) = (\alpha, \ldots, \alpha)$. Replacing $U_i$ in the definition of $B$ with $E_i$, we’d get a different binary operator $\oplus'$ which yields the EDT prediction, namely that $(\psi_1 \oplus' \psi_2)(g) = \{(C, C)\}$.

It seems that we have another free parameter in our model! We can replace the family of deviation maps $(U_i)_{i \in I}$ with an arbitrary family of **non-causal deviation maps** $(D_i)_{i \in I}$ with $D_i : X \to (X_i \to X)$, and thereby encode information about each player’s idiosyncratic decision theory. If the $i$-th player is CDT then $D_i = U_i$, and if the $i$-th player is EDT (among replicas) then $D_i(x)(\alpha) = (\alpha, \ldots, \alpha)$.

Suppose we have a payoff space $R$, a family of sets $(X_i)_{i \in I}$ with $X = \prod_{i \in I} X_i$, a family of optimisers $(\psi_i)_{i \in I}$ where $\psi_i : (X_i \to R) \to \mathcal{P}(X_i)$, and a family of decision theories $(D_i)_{i \in I}$ where $D_i : X \to (X_i \to X)$, and a game is a map $g : X \to R$. Then we can define the non-CDT best-response function of $g$ to be the function $B : x \mapsto \prod_{i \in I} \psi_i(g \circ D_i(x))$, and the non-CDT nash equilibria of $g$ to be the set $\{x \in X \mid x \in B(x)\}$. Or something like that.

With this machinery in place, we can answer even sillier questions —

An EDT utility-satisficer and a CDT utility-maximiser are playing the game $g : X \times X \to \mathbb{R}^2$. They are causally separated, but they are negligibly likely to choose different options. What options might they choose?^{[10]}

Well, a pair of options $(a, b) \in X \times X$ is in (non-CDT) nash equilibrium if and only if $u_1(a, a) \geq u_1(k, k)$ and $\forall b' \in X.\ u_2(a, b') \leq u_2(a, b)$, where $k \in X$ is the satisficer’s anchor point. We can also filter out the option profiles such that $a \neq b$. The resulting list of option-profiles is easily computable from $g$ and $k$.

Suppose that $g$ is the prisoner’s dilemma, and $C$ is the satisficer’s anchor point. Then the unique nash equilibrium is $(C, D)$.
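That worked example can be sketched with the deviation map supplied per player: the EDT satisficer’s task runs through the replica map $\alpha \mapsto (\alpha, \alpha)$, while the CDT maximiser keeps the usual unilateral deviation. The payoff numbers and helper names are my own assumptions.

```python
# Non-CDT nash equilibria: per-player deviation maps encode decision theory.
from itertools import product

X = ('C', 'D')
pd = {('C','C'): (2,2), ('C','D'): (0,3), ('D','C'): (3,0), ('D','D'): (1,1)}
g = lambda x: pd[x]

def cdt(i):
    """Usual i-th deviation map U_i: vary only your own entry."""
    return lambda x: (lambda a: x[:i] + (a,) + x[i+1:])

def edt(i):
    """Replica map: your choice is evidence that everyone chooses it."""
    return lambda x: (lambda a: (a,) * len(x))

def nash(optimisers, deviations):
    return {x for x in product(X, X)
            if all(x[i] in optimisers[i](lambda a, i=i, x=x: g(deviations[i](x)(a)))
                   for i in range(len(x)))}

# EDT satisficer of coordinate 0 (anchor 'C') vs CDT maximiser of coordinate 1.
satisficer = lambda v: {a for a in X if v(a)[0] >= v('C')[0]}
maximiser  = lambda v: {a for a in X if v(a)[1] == max(v(b)[1] for b in X)}
print(nash([satisficer, maximiser], [edt(0), cdt(1)]))  # {('C', 'D')}
```

With these numbers the satisficer’s replica-task makes cooperation the only anchor-dominating option, while the maximiser best-responds causally, reproducing the $(C, D)$ answer above.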

Exercise 7:If Angelica is an EDT utility-maximiser and Tommy is a CDT utility-maximiser, then which profiles are nash?

# Recap

Classical game theory is the study of agents which choose options causing maximum utility. When many such agents (CDT-utility-maximisers) make simultaneous choices, they’ll collectively choose an option-profile in nash equilibrium, i.e. $x \in B(x)$ where $B$ is the best-response function.

Many AI safety researchers are worried about agents which choose options causing maximum utility, whether the trouble lies with “cause”, “maximum”, or “utility”, so they study agents which relax one (or many) of these three assumptions.

Today we introduced higher-order game theory, which generalises classical game theory by considering agents which don’t maximise utility. We generalised the classical nash equilibrium to arbitrarily many agents with arbitrary optimisers.

Finally, I gestured at how we might generalise the nash equilibrium even further to the simultaneous choices of non-CDT agents.

Pretty cool!

# Next time...

By tomorrow, I will hopefully have posted about mesa-optimisation, sequential choice, and non-possibilistic models of agents.

- ^
The study of $0$-player games is called automata theory (in the discrete case) and dynamics (in the continuous case), while the study of $1$-player games is called optimisation.

- ^
Recall that $Y^X$ denotes the set of functions of type $X \to Y$.

That is, if $f \in Y^X$ and $x \in X$, then $f(x) \in Y$.

- ^
See Aumann’s 1964 paper *Markets with a Continuum of Traders* for a game with uncountably many players.

- ^
Recall that $\operatorname{argmax}^{\epsilon}$ is the optimiser which maximises the function up to some fixed slack $\epsilon > 0$.

That is, $\operatorname{argmax}^{\epsilon}(u) = \{x \in X \mid \forall x' \in X.\ u(x) \geq u(x') - \epsilon\}$.

- ^
Recall that an optimiser $\psi : (X \to R) \to \mathcal{P}(X)$ is *total* if $\psi(u) \neq \emptyset$ for all tasks $u : X \to R$.

- ^
$\delta_{xy}$ denotes the Kronecker delta function: $\delta_{xy} = 1$ if $x = y$, and $0$ otherwise.

- ^
Recall that an optimiser $\psi : (X \to R) \to \mathcal{P}(X)$ is *consequentialist* if $\psi(u) = u^{-1}(q(u))$ for some quantifier $q : (X \to R) \to \mathcal{P}(R)$. In other words, for any task $u$, if $u(x) = u(x')$ then $x \in \psi(u) \iff x' \in \psi(u)$.

This condition says that, once we know the agent’s task, the optimality of a particular choice is determined by its payoff.

- ^
Maybe there’s something deep here about group-level deontology arising from individual-level consequentialism.

- ^
**Warning:** There is a remark that one regularly encounters in metaethics: moral consequentialism is tautological, because we can consider the choice of the decision-maker as a consequence of the decision itself, ensuring that any decision-rule is consequentialist.

This metaethical remark corresponds to the following fact in higher-order decision theory: if a *nonconsequentialist* optimiser $\psi$ faces the task $u : X \to R$, then they will choose the same options as some *consequentialist* optimiser facing the task $x \mapsto (x, u(x)) : X \to X \times R$. It follows that whether an agent is a *consequentialist* heavily depends on which physical consequences of the decision we are tracking.

- ^
Recall that a satisficer might choose any option which dominates some fixed option $k \in X$, i.e. $\operatorname{satisfice}_k(u) = \{x \in X \mid u(x) \geq u(k)\}$. The option $k$ is called the **anchor point**.

Isn’t this pretty trivial though? I guess it’s still probably convenient for the math.

The observation is trivial mathematically, but it motivates the characterisation of an optimiser as something with the type-signature (X→R)→P(X).

You might instead be motivated to characterise optimisers by...

A utility function u:X→R

A quantifier (X→R)→P(X)

A preorder (R,≤) over the outcomes

Etc.

However, were you to characterise optimisers in any of the ways above, then the nash equilibrium between optimisers would not itself be an optimiser, and therefore we lose compositionality. The compositionality is conceptually helpful because it means that your n≥2 definitions/theorems reduce to the n=1 case.

Here you mean (X→R)→P(X), right?

Wait I mean a quantifier in (X→R)→P(R).

If we characterise an agent with a quantifier (X→R)→P(R), then we’re saying which payoffs the agent might achieve given each task. Namely, r∈q(u) if and only if it’s possible that the agent achieves payoff r∈R when faced with a task u:X→R.

But this definition doesn’t play well with nash equilibria.

Thank you, this is incredibly interesting! Did you ever write up more on the subject? I’m excited to see how it relates to mesa-optimisation in particular.

Typo: I think you mean α↦(x1,…,xi−1,α,xi+1,…,xn)?

I would have liked a footnote saying πi(x)=xi.

I’m confused about the last example, EDT-satisficer vs. CDT-maximizer. It says that we assume the agents to choose the same options. Accordingly, it says to “filter out the option profiles such that a≠b”. But then, in the concrete case of the Prisoner’s dilemma, it says the solution is (C,D) instead of ∅.

Exercise 7:

In the Prisoner’s dilemma, the answer doesn’t change compared to the one in the worked-out example, because in that example the satisficer was already anchoring to the best EDT solution (C,C). So the Nash equilibrium is (C,D). More generally, the first agent’s condition is replaced by ∀a′∈X.u1(a′,a′)≤u1(a,a).

Here ∏i∈IAi={(ai)i∈I∣∀i.ai∈Ai}, right?

Yes, ∏i∈IAi={(ai)i∈I.∀i∈I.ai∈Ai}, i.e. the cartesian product of a family of sets. Sorry if this wasn’t clear, it’s standard maths notation. I don’t know what the other commenter is saying.

Got it, I misunderstood the semantics of what B(x) was supposed to capture. I thought the elements needed to be mutual best-responses. Thank you for the clarification, I’ve updated my implementation accordingly!

I interpreted it the standard way too initially, but then I had a hunch there was… I dunno, something fishy, and then indeed it turned out @StrivingForLegibility understood it in a completely different way, so somehow it wasn’t clear! Magic.

Edit: Cleo Nardo has confirmed that they intended ∏i∈I to mean the cartesian product of sets, the ordinary thing for that symbol to mean in that context. I misunderstood the semantics of what B(x) was intended to represent. I’ve updated my implementation to use the intended cartesian product when calculating the best response function, the rest of this comment is my initial (wrong) interpretation of B(x).

I needed to go back to one of the papers cited in Part 1 to understand what that ∏i∈I was doing in that expression. I found the answer in A Generalization of Nash’s Theorem with Higher-Order Functionals. I’m going to do my best to paraphrase Hedges’ notation into Cleo’s notation, to avoid confusion.

TLDR: B(x) is picking out the set of option-profiles P(X) that are simultaneously best-responses by all players to that option-profile x. It does this by considering all of the option-profiles that can result by each player best-responding, then takes the intersection of those sets.

On page 6, Hedges defines the best response correspondence B∈X→P(X)

B(x)=⋂i∈IBi(x)

Where

Bi∈X→P(X)

Hedges builds up the idea of Nash Equilibria using quantifiers rather than optimizers, (like max rather than argmax), but I believe the approaches are equivalent. Unpacking B:x↦∏i∈Iψi(g∘Ui(x)) from the inside out:

Ui(x)∈Xi→X

g∘Ui(x)∈Xi→R

That makes g∘Ui(x) a ψi-task. Since ψi∈(Xi→R)→P(Xi), we know that ψi(g∘Ui(x))∈P(Xi).

This is where I had to go looking through papers. What sort of product takes a set of best-responses from each player, relative to a given option-profile, and returns a set of option-profiles that are simultaneously regarded by each player as a best-response? I thought about just taking the Cartesian product of the sets, but that wouldn’t get us only the *mutual* best-responses.

Let’s call the way that each player maps option-profiles to best-responses bi∈X→P(Xi). These are exactly the sets we want to take the product of:

bi(x)=ψi(g∘Ui(x))

Hedges introduces notation on page 3 to handle the operation of taking an option-profile, varying one player’s option, and leaving the rest the same. Paraphrasing, Hedges defines x(i↦α)∈∏j∈IXj by

x(i↦α)j={αif i=jxjotherwise

You can read x(i↦α) as “give me a new copy of x, where the ith entry has been set to the value α.” Hedges uses this to define the deviation maps equivalently to the way Cleo did. Ui:X→(Xi→X)

Ui(x)(α)=x(i↦α)

The correspondences Bi∈X→P(X) take as input an option profile, and returns the set of option-profiles which are player i’s optimal unilateral deviations from that option profile. To construct Bi from bi, we want to map bi(x)∈P(Xi) to the option-profiles which deviate from x in those exact ways.

Bi(x)={x(i↦α):α∈bi(x)}

We can then use Hedges’ B(x)=⋂i∈IBi(x) to get the best-response correspondence! We can unpack this to get a definition of B using objects that Cleo defined, using that deviation notation from Hedges:

B(x)=⋂i∈I{x(i↦α):α∈ψi(g∘Ui(x))}

Thank you Cleo for writing this article! This was my first introduction to Higher-Order Game Theory, and I wrote up an implementation in TypeScript to help me understand how all of the pieces fit together!

I’m weirded out by this. To look at everything together, I write the original expression, and your expression rewritten using the OP’s notation:

Original: B:x↦∏i∈Iψi(g∘Ui(x))

Yours: B(x)=⋂i∈I{x(i↦α):α∈ψi(g∘Ui(x))}=⋂i∈IUi(x)(ψi(g∘Ui(x)))

(I’m using the notation that a function applied to a set is the image of that set.)

So the big pi symbol stands for

∏i∈IAi=⋂i∈IUi(x)(Ai)

So it’s not a standalone operator: it’s context-dependent because it pops out an implicit x. The OP otherwise gives the impression of a more functional mindset, so I suspect the OP may mean something different from your guess.

Another problem with your interpretation: it yields the empty set unless all agents consider doing nothing an option. The only possible non-empty output is {x}. Reason: each set you are intersecting contains tuples with all elements equal to the ones in x, but for one. So the intersection will necessarily only contain tuples with all elements equal to those in x.

Edit: Cleo Nardo has confirmed that they intended ∏i∈I to mean the cartesian product of sets, the ordinary thing for that symbol to mean in that context. I misunderstood the semantics of what B(x) was intended to represent. I’ve updated my implementation to use the intended cartesian product when calculating the best response function, the rest of this comment is based on my initial (wrong) interpretation of B(x).

This is a totally clear and valid rewriting using that notation! My background is in programming and I spent a couple minutes trying to figure out how mathematicians write “apply this function to this set.”

I believe the way that B(x) is being used is to find Nash equilibria, using Cleo’s definition 6.5:

These are going to be option-profiles where “not deviating” is considered optimal by every player simultaneously. I agree with your conclusion that this leads B(x) to take on values that are either {} or {x}. When B(x)={}, this indicates that x is not a Nash equilibrium. When B(x)={x}, we know that x is a Nash equilibrium.

Oh I see now, B just needs to work to pinpoint Nash equilibria, I did not make that connection.

But anyway, the reason I’m suspicious of your interpretation is not that your math is not correct, but that it makes the OP notation so unnatural. The unnatural things are:

∏ being context-dependent.

∏ not having its standard meaning.

Ui used implicitly instead of explicitly, when later it takes on a more important role to change decision theory.

Using x∈B(x) as condition without mentioning that already B(x)≠∅⟺x is Nash if |I|≥2.

So I guess I will stay in doubt until the OP confirms “yep I meant that”.

B(x)≠∅ isn’t equivalent to x being Nash.

Suppose Alice and Bob are playing prisoner’s dilemma. Then the best-response function of every option-profile is nonempty. But only one option-profile is nash.

x∈B(x) is equivalent to x being Nash.