Formalizing Newcombian Problems with Fuzzy Infra-Bayesianism

Introduction

In this post, we introduce contributions and supracontributions[1], which are basic objects from infra-Bayesianism that go beyond the crisp case (the case of credal sets). We then define supra-POMDPs, a generalization of partially observable Markov decision processes (POMDPs). This generalization has state transition dynamics that are described by supracontributions.

We use supra-POMDPs to formalize various Newcombian problems in the context of learning theory, where an agent repeatedly encounters the problem. The one-shot versions of these problems are well known for highlighting flaws in classical decision theories.[2] In particular, we discuss the opaque, transparent, and epsilon-noisy versions of Newcomb’s problem, XOR blackmail, and counterfactual mugging.

We conclude by stating a theorem that describes when optimality for the supra-POMDP relates to optimality for the Newcombian problem. This theorem is significant because it gives a sufficient condition under which infra-Bayesian decision theory (IBDT) can approximate the optimal decision. Furthermore, we demonstrate through the examples that IBDT is optimal for problems for which evidential and causal decision theory fail.

Fuzzy infra-Bayesianism

Contributions, a generalization of probability distributions, are defined as follows.

Definition: Contribution

A contribution on a finite set is a function[3] such that The set of contributions on is denoted by .

Given we write to denote A partial order on is given by if for all subsets For example, the constant-zero function 0 is a contribution that lies below every element in in the partial order. A set of contributions is downward closed if for all implies Given the downward closure of in is defined by
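To make these definitions concrete, here is a minimal Python sketch of contributions on a small finite set, the partial order, and membership in the downward closure of a finite set of contributions. The dictionary representation and the function names are ours, not part of the formalism.

```python
# A contribution on a finite set X is represented as a dict mapping each
# element of X to a mass in [0, 1], with total mass at most 1.

def is_contribution(m, X, tol=1e-9):
    return (all(-tol <= m.get(x, 0.0) <= 1 + tol for x in X)
            and sum(m.get(x, 0.0) for x in X) <= 1 + tol)

def leq(m1, m2, X, tol=1e-9):
    """Partial order: m1 <= m2 iff m1 assigns no more mass than m2 to every
    subset of X; on a finite set this reduces to a pointwise check."""
    return all(m1.get(x, 0.0) <= m2.get(x, 0.0) + tol for x in X)

def in_downward_closure(m, generators, X):
    """m lies in the downward closure of a finite set of contributions iff
    it lies below at least one of them."""
    return any(leq(m, g, X) for g in generators)

X = ["a", "b"]
q = {"a": 0.5, "b": 0.5}      # a probability distribution, hence a contribution
zero = {"a": 0.0, "b": 0.0}   # the constant-zero contribution
print(is_contribution(q, X), leq(zero, q, X))             # True True
print(in_downward_closure({"a": 0.2, "b": 0.4}, [q], X))  # True
```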

Figure 1 illustrates a set of contributions together with its downward closure.

Figure 1: (Purple) Graphical representation of a set of contributions on (Gray shaded region) Elements in the downward closure of that are not in

The set of contributions shown in Figure 1 together with its downward closure is an example of a supracontribution, defined as follows.

Definition: Supracontribution

A supracontribution on is a set of contributions such that

  1. ,

  2. is closed,

  3. is convex, and

  4. is downward closed.

The set of supracontributions on is denoted by .

Figure 2 shows another example of a supracontribution.

Figure 2: Graphical representation of a supracontribution on

Supracontributions can be regarded as fuzzy sets of distributions, namely sets of distributions in which membership is described by a value in rather than In particular, the membership of in is given by

where denotes the scaling of by See Figure 3 for two examples. Note that is well-defined since all supracontributions contain 0. By this viewpoint, supracontributions can be seen as a natural generalization of credal sets, which are “crisp” or ordinary sets of distributions.

Figure 3: (Left) A distribution and a supracontribution that is the downward closure of a different distribution. Here (Right) A distribution and a supracontribution with
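As a worked illustration of the membership formula, consider the special case (as in the left panel of Figure 3) where the supracontribution is the downward closure of a single distribution q. Then the membership of a distribution p is the largest scaling factor that keeps the scaled p below q, which can be computed pointwise. The sketch below assumes this special case; the names are ours.

```python
def membership_in_downward_closure(p, q):
    """Membership of the distribution p in the downward closure of the
    distribution q: the largest lambda in [0, 1] with lambda * p <= q pointwise."""
    lam = 1.0
    for x, px in p.items():
        if px > 0:
            lam = min(lam, q.get(x, 0.0) / px)
    return lam

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.75, "b": 0.25}
print(membership_in_downward_closure(p, q))  # 0.5: scaling p by 0.5 fits below q, no larger scaling does
print(membership_in_downward_closure(q, q))  # 1.0: q has full membership in its own downward closure
```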

There is a natural embedding of credal sets on into the space of supracontributions on Let Define Note that under this definition, (and otherwise the union with in the definition of is redundant).

Figure 4: (Purple line segment) A credal set on , and (gray shaded region) the supracontribution

Generalizing the notion of expected value, we write We define expectation (as in the crisp case) as the maximum over all expectations for elements of the supracontribution. By definition, a supracontribution is closed (and it is bounded, hence compact), so the maximum is attained and this notion of expectation is well-defined.

Definition: Expectation with respect to a supracontribution

Given a continuous function and define the expectation of with respect to by

Let denote a non-empty set of contributions. Then where denotes convex hull and denotes closure. Therefore, in the context of optimization we may always replace a non-empty set of contributions by the supracontribution obtained by taking the convex, downward, and topological closure.
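For a supracontribution generated by finitely many contributions, the expectation of a nonnegative function (such as a loss taking values in [0, 1]) is attained at one of the generators, since averaging or scaling down can only decrease the value. A minimal sketch, with representation and names of our own choosing:

```python
def expectation(contribution, f):
    """Expected value of f under a single contribution (lost mass contributes 0)."""
    return sum(mass * f(x) for x, mass in contribution.items())

def supra_expectation(generators, f):
    """Expectation with respect to the supracontribution generated by `generators`:
    the maximum expectation over the generating contributions.  Valid for f >= 0,
    where the convex, downward, and topological closure cannot increase the max."""
    return max(expectation(m, f) for m in generators)

loss = {"good": 0.0, "bad": 1.0}.get
m1 = {"good": 0.9, "bad": 0.1}
m2 = {"good": 0.5, "bad": 0.3}   # a contribution with 0.2 of its mass missing
print(supra_expectation([m1, m2], loss))  # max(0.1, 0.3) = 0.3
```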

Recall that environments in the classical theory and in crisp infra-Bayesianism have type We generalize this notion to the fuzzy setting using semi-environments.[4]

Definition: Semi-environment

A semi-environment is a map of the form

The interaction of a semi-environment and a policy determines a contribution on destinies [5]

Definition: (Fuzzy) law

A (fuzzy) law generated by a set of semi-environments is a map such that for all where denotes convex hull and denotes closure.

Fuzzy Supra-POMDPs

Our tool for formalizing Newcombian problems using the mathematical objects described in the previous section is the fuzzy supra-POMDP, a generalization of partially observable Markov decision processes (POMDPs). Given a set of states and a contribution over them, the missing probability mass can be interpreted as the probability of a logical contradiction.

Under a fuzzy supra-POMDP model, uncertainty of the initial state is described by an initial supracontribution over states. Similarly to the state transition dynamics of a crisp supra-POMDP as defined in the preceding post, the state transition dynamics of a fuzzy supra-POMDP are multivalued. Given a state and action, the transition suprakernel returns a supracontribution, and the true dynamics are described by any element of the supracontribution.

A significant feature of fuzzy supra-POMDPs is that for some state and action we may have which corresponds to a logical contradiction in which the state transition dynamics come to a halt. We use this feature to model Newcombian problems where it is assumed there is a perfect predictor Omega predicting an agent’s policy. When an action deviates from the predicted policy (which is encoded in the state), the transition kernel returns

We formally define fuzzy supra-POMDPs as follows.

Definition: Fuzzy supra-POMDP

A fuzzy supra-partially observable Markov decision process (supra-POMDP) is a tuple where

  • is a set of states,

  • is an initial supracontribution over states,

  • is a set of actions,

  • is a set of observations,

  • is a transition suprakernel,[6]

  • is an observation mapping.
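One possible way to package these data in code is sketched below. All names are illustrative, and we represent each supracontribution by a finite list of generating contributions (dictionaries from states to masses), which suffices for the examples in this post.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Hashable

State = Hashable
Action = Hashable
Observation = Hashable
Contribution = Dict[State, float]        # masses in [0, 1] summing to at most 1
Supracontribution = List[Contribution]   # finite list of generating contributions

@dataclass
class FuzzySupraPOMDP:
    states: List[State]
    initial: Supracontribution                                 # initial supracontribution over states
    actions: List[Action]
    observations: List[Observation]
    transition: Callable[[State, Action], Supracontribution]   # transition suprakernel
    observe: Callable[[State], Observation]                    # observation mapping
```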

Defining laws from fuzzy supra-POMDPs

Every fuzzy supra-POMDP defines a (fuzzy) law. The construction is similar to the construction of a crisp law given a crisp supra-POMDP, which is discussed in the preceding post. A copolicy to a fuzzy supra-POMDP is a map that is consistent with the transition suprakernel. More specifically, given a history of states and actions, the transition kernel determines a supracontribution from the most recent state and action. The copolicy can be thought of as a map that selects a contribution from that supracontribution.

Definition: Copolicy to a fuzzy supra-POMDP

Let be a fuzzy supra-POMDP. A map is an -copolicy if

  1. for the empty string and
  2. For all non-empty strings , .

An -copolicy and the observation map of together determine a semi-environment Let

Then defines the law generated by
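To illustrate how a copolicy turns the supra-POMDP into an ordinary semi-environment, the following sketch rolls a policy out against a copolicy for a fixed number of steps and accumulates the resulting contribution over action-observation histories. The representation (copolicies as functions choosing one generating contribution per state-action history) is our own simplification of the formal construction.

```python
from typing import Callable, Dict, Tuple, Hashable

State, Action, Obs = Hashable, Hashable, Hashable
Contribution = Dict[State, float]
# A copolicy maps a (state, action) history -- the empty history included -- to a
# contribution over next states, chosen from the relevant supracontribution.
Copolicy = Callable[[Tuple[Tuple[State, Action], ...]], Contribution]

def interact(copolicy: Copolicy,
             policy: Callable[[Tuple[Obs, ...]], Action],
             observe: Callable[[State], Obs],
             horizon: int) -> Dict[Tuple[Tuple[Action, Obs], ...], float]:
    """Roll `policy` out against `copolicy` for `horizon` steps and return the
    resulting contribution over action-observation histories (destinies).
    Mass can be lost along the way, e.g. when the copolicy returns the zero
    contribution; lost mass simply never reaches any history."""
    result: Dict[Tuple[Tuple[Action, Obs], ...], float] = {}

    def step(sa_history, ao_history, state, mass, t):
        if mass == 0.0:
            return
        if t == horizon:
            result[ao_history] = result.get(ao_history, 0.0) + mass
            return
        action = policy(tuple(o for _, o in ao_history))
        next_states = copolicy(sa_history + ((state, action),))
        for s, p in next_states.items():
            step(sa_history + ((state, action),),
                 ao_history + ((action, observe(s)),),
                 s, mass * p, t + 1)

    for s0, p0 in copolicy(()).items():   # empty history: a choice from the initial supracontribution
        step((), (), s0, p0, 0)
    return result
```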

Formalizing Newcombian problems

In this section, we give a mathematical definition for Newcombian problems and describe how to model Newcombian problems using fuzzy supra-POMDPs.

Let denote a set of observations. Given a time-horizon let denote strings in of length less than Let denote the set of horizon- policies, i.e. maps of the form

Definition: Newcombian problem with horizon

A Newcombian problem with horizon is a map together with a loss function

Intuitively speaking, given a policy and a sequence of observations, a Newcombian problem specifies some distribution that describes uncertainty about the next observation. This framework allows for a mathematical description of an environment in which there is a perfect predictor Omega and a distribution over observations that depends on the policy that Omega predicts.

Optimal policy for a Newcombian problem

Similar to how the interaction of a policy and an environment produces a distribution over destinies, a Newcombian problem and a policy together determine a distribution on outcomes, The policy that minimizes expected loss with respect to this distribution is said to be a -optimal policy.

Definition: Optimal policy for a Newcombian problem

A policy is optimal for a Newcombian problem if .

If is optimal for then is said to be the optimal loss for

Formalism for multiple episodes

In order to discuss learning, we consider the case of multiple episodes where an agent repeatedly encounters the Newcombian problem.

Given some number of episodes let denote the set of multi-episode policies. A multi-episode policy gives rise to a sequence of single-episode policies

By means of this sequence of single-episode policies, a Newcombian problem and a multi-episode policy together determine a distribution on outcomes over multiple episodes, which we also denote by

The loss function can be naturally extended to multiple episodes by considering the mean loss per episode. In particular, if and is given by then the mean loss over episodes is defined as

It is also possible to extend the loss to multiple episodes by considering the sum of the per-episode loss with a geometric time discount. In this case, the total loss is defined by
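For concreteness, with $L_k$ denoting the loss incurred in episode $k$ and $\gamma \in [0,1)$ a geometric discount factor (notation ours), one natural way to write these two extensions is

$$\frac{1}{n}\sum_{k=1}^{n} L_k \qquad\text{and}\qquad (1-\gamma)\sum_{k=1}^{\infty}\gamma^{\,k-1} L_k,$$

where the factor $1-\gamma$ is one natural normalization keeping the total loss in $[0,1]$.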

The fuzzy supra-POMDP associated with a Newcombian problem

In this section, we describe how to model iterated Newcombian problems (i.e. repeated episodes of the problem) by a fuzzy supra-POMDP. We work in the iterated setting since this allows us to talk about learning. Examples are given in the following section.

The state space, initialization, and observation mapping

Let (together with ) be a Newcombian problem. Let Informally speaking, the state always encodes both a policy and a sequence of observations.

Let the initial supracontribution over states be where denotes the empty observation. This supracontribution represents complete ambiguity over the policy (and certainty over the empty observation).

The observation mapping simply returns the most recent observation datum from a state, i.e. for non-empty observation strings. If the observation string of the state is the empty string then may be chosen arbitrarily.

The transition suprakernel

We start with an informal description of the transition suprakernel which is defined in three cases. In short:

  1. If the action is compatible with the policy encoded by the state, then is determined by

  2. If the action is not compatible with the policy encoded by the state, then returns

  3. If the end of the episode is reached, then returns

To elaborate, we have the following three cases.

  1. The action is compatible with the policy encoded by the state, and the time-step is less than the horizon, i.e. the episode is not yet complete. In this case, the policy datum of the state remains the same with certainty, whereas determines a distribution over next observations. This distribution determines a distribution over next states. The transition suprakernel returns the downward closure of this distribution. (Here the downward closure is used simply to obtain a supracontribution.)

  2. The action is not compatible with the policy encoded by the state, and the time-step is less than the horizon. This case is a logical contradiction, and thus returns

  3. The time-step is equal to the horizon, i.e. the end of the episode has been reached. The transition suprakernel returns the initial supracontribution again, which represents complete ambiguity over the policy for the next episode (and certainty over the empty observation).

We now formally define Given such that define by . Namely, appends a given observation to the prefix of observations and returns the corresponding state. Let denote the pushforward of

Define
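The following sketch implements the three cases for a generic Newcombian problem, representing states as (policy, observation-string) pairs and each supracontribution by a finite list of generating contributions. The names are ours; `newcomb_next` stands in for the problem's map from a policy and an observation history to a distribution over the next observation.

```python
from typing import Callable, Dict, List, Tuple, Hashable

Obs = Hashable
Policy = Callable[[Tuple[Obs, ...]], Hashable]   # horizon-H policy: observation history -> action
State = Tuple[Policy, Tuple[Obs, ...]]
Contribution = Dict[State, float]
Supracontribution = List[Contribution]

def make_suprakernel(newcomb_next: Callable[[Policy, Tuple[Obs, ...]], Dict[Obs, float]],
                     policies: List[Policy],
                     horizon: int) -> Callable[[State, Hashable], Supracontribution]:
    """Transition suprakernel of the supra-POMDP associated with a Newcombian problem."""
    # Reset: complete ambiguity over the policy, empty observation string.
    reset: Supracontribution = [{(pi, ()): 1.0} for pi in policies]

    def T(state: State, action) -> Supracontribution:
        policy, obs_string = state
        if len(obs_string) >= horizon:
            # Case 3: end of episode -- start a fresh episode.
            return reset
        if action != policy(obs_string):
            # Case 2: the action contradicts the policy encoded in the state.
            return [{}]          # the zero contribution: all mass is lost
        # Case 1: the action is compatible; push the next-observation distribution
        # forward onto states and take (the generator of) its downward closure.
        dist = newcomb_next(policy, obs_string)
        return [{(policy, obs_string + (o,)): p for o, p in dist.items()}]

    return T
```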

Examples of Newcombian problems

In this section, we explain how to formalize various Newcombian problems using the fuzzy supra-POMDP framework. We provide the most detail for the first two examples.

Newcomb’s Problem

We first consider Newcomb’s Problem. In this problem, there are two boxes: Box A, a transparent box that always contains $1K, and Box B, an opaque box that either contains $0 or $1M. An agent can choose to “one-box”, meaning that they only take Box B, or “two-box”, meaning they take both boxes. A perfect predictor Omega fills Box B with $1M if and only if Omega predicts that the agent will one-box.

Evidential decision theory (EDT) prescribes that an agent should choose the action that maximizes the expected utility conditioned on choosing that action. Thus, EDT recommends one-boxing because choosing to one-box can be seen as evidence that Box B contains $1M. This is the case even though the correlation is spurious, i.e. choosing to one-box does not cause there to be $1M in Box B. We will see that IBDT also recommends one-boxing. In comparison to EDT, causal decision theory (CDT)[7] prescribes that an agent should only take into account what an action causes to happen and therefore recommends two-boxing.

Let where denotes the empty observation and the remaining observations represent the total amount of money received.

Let where corresponds to one-boxing and corresponds to two-boxing. Without loss of generality, where and

Then is defined by

The loss of an episode is defined by

(We don’t define for because under this model, this observation never occurs.)

Note that and Therefore, is optimal.

We now consider the corresponding supra-POMDP The transition suprakernel is given by

Figure 5 shows the state transition graph of Notably, it is not possible under for an agent to two-box when Box B is full (the left branch in Figure 5). This is ensured by the fact that

Figure 5: State transitions of the supra-POMDP that represents Newcomb’s problem.

The interaction of and produces a supracontribution over outcomes given by Similarly, the interaction of and produces the supracontribution

Then the expected loss for in one round is

Similarly, the expected loss for in one round is

From another viewpoint, the optimal (worst-case from the agent’s perspective) copolicy to initializes the state to for i.e. In other words, the policy encoded in the state chosen by the copolicy matches the policy of the agent. The law defined by this supra-POMDP is equivalent to an ordinary environment in which one-boxing results in observing a full box and two-boxing results in observing an empty box.
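As a sanity check of these one-round expected losses, the snippet below evaluates both policies against every choice the copolicy can make for the policy encoded in the state, using an illustrative normalization of the loss (the exact normalization does not affect the comparison).

```python
# Illustrative normalization: loss = 1 - payoff / 1_001_000 (the precise
# normalization only rescales the numbers; the ordering is what matters).
def loss(payoff):
    return 1 - payoff / 1_001_000

def payoff(agent_action, predicted_action):
    box_b = 1_000_000 if predicted_action == "one-box" else 0
    return box_b if agent_action == "one-box" else box_b + 1_000

def worst_case_loss(agent_action):
    # The copolicy picks the predicted policy encoded in the state.  If the agent's
    # action contradicts the prediction, all mass is lost and that branch contributes 0.
    branches = []
    for predicted in ["one-box", "two-box"]:
        if agent_action == predicted:
            branches.append(loss(payoff(agent_action, predicted)))
        else:
            branches.append(0.0)   # contradiction: the zero contribution
    return max(branches)

for a in ["one-box", "two-box"]:
    print(a, round(worst_case_loss(a), 6))
# one-box ~0.000999  vs  two-box ~0.999001  -> one-boxing is optimal
```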

We see from the above calculations that the optimal policy for is and moreover achieves the -optimal loss. This analysis holds for any number of episodes. This is significant because if a learning agent has in their hypothesis space, then they must converge to one-boxing if they are to achieve low regret for the iterated Newcomb’s problem.

Note that for this example, we have only used what might be considered the most basic supracontributions, namely and the downward closure of a single probability distribution. In the next example, we will see the full power of supracontributions.

XOR Blackmail

In this section we describe how to use a supra-POMDP to model the XOR blackmail problem. For a more in-depth discussion of XOR blackmail, see e.g. Toward Idealized Decision Theory §2.1 (Soares and Fallenstein, 2015) and Cheating Death in Damascus §2 (Levinstein and Soares, 2020).

The problem is given as follows. Suppose there is a 1% probability that an agent’s house has a termite infestation that would cause $1M in damages. A blackmailer can predict the agent and also knows whether or not there is an infestation. The blackmailer sends a letter stating that exactly one of the following is true, and sends it if and only if the letter is truthful:

  1. There are no termites, and you will pay $1K, or

  2. There are termites, and you will not pay.

The agent can then accept or reject the blackmail. Note that, as stated, the probability of blackmail depends on the agent’s policy. Because policies are encoded in the state space of the associated supra-POMDP, we are able to model this. EDT recommends accepting the blackmail because accepting is evidence that there is no infestation, even though this correlation is spurious (i.e. accepting the blackmail does not causally influence whether or not there is an infestation). On the other hand, CDT recommends rejecting the blackmail. Thus the two decision theories are split across the two examples we have seen so far, and neither always recommends the optimal action. We will see that IBDT again recommends the optimal action.

Let where denotes the empty observation, represents receiving the blackmail, represents not receiving the blackmail, and the remaining observations represent the various monetary outcomes.

Let where corresponds to accepting the blackmail and corresponds to rejecting the blackmail. Without loss of generality, where and

Interpreting the statement of the problem, we define as follows:

We normalize in order to define the loss of an episode by

Note that whereas Therefore is optimal.
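A quick check of the expected dollar losses, reading the payoffs and the 1% infestation probability directly from the problem statement (the normalized loss only rescales these numbers):

```python
P_TERMITES = 0.01

def expected_dollar_loss(accepts_blackmail):
    # When there are no termites, the letter is sent iff the agent would pay.
    no_termites = (1 - P_TERMITES) * (1_000 if accepts_blackmail else 0)
    # When there are termites, the damage happens regardless of the letter.
    termites = P_TERMITES * 1_000_000
    return no_termites + termites

print(expected_dollar_loss(True))   # 10990.0  (accept the blackmail)
print(expected_dollar_loss(False))  # 10000.0  (reject the blackmail) -> rejecting is optimal
```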

We now consider the corresponding supra-POMDP The state transitions of are summarized in Figure 6. We first define the transition suprakernel on Using we have

We now consider the next level of the supra-POMDP. Since

Here we see that when the action is , which is consistent with the policy of the state , the transition kernel returns the downward closure of a distribution specified by On the other hand, when the action is , which is not consistent with , the transition kernel returns .

The action does not matter when there is no blackmail (i.e. both actions are consistent with the policy), so

A similar analysis for yields

Then on the final level: for all define

Figure 6: State transitions of the supra-POMDP that represents XOR Blackmail.

We now consider the expected loss of each policy for The interaction of and produces a supracontribution over outcomes given by where

Here, arises from the interaction of with the branch starting with state We have a probability distribution in this case because is always consistent with itself, which is the policy encoded in the states of this branch. On the other hand, the contribution arises from the interaction of with the branch starting with state In the case of blackmail, and disagree on the action and thus a probability mass of 0.01 is lost on this branch.

Therefore, the expected loss for in one round is given by

Another way to view this calculation is that the optimal -copolicy initializes the state to meaning

By a similar calculation,

Therefore, the optimal policy for is also , i.e. under this formulation it is optimal to reject the blackmail. This analysis holds for any number of episodes. Moreover, the optimal loss for is equal to the optimal loss for This is significant because if a learning agent has in their hypothesis space, then they must converge to rejecting the blackmail if they are to achieve low regret for the iterated Newcombian problem.

Counterfactual Mugging

We now consider the problem of counterfactual mugging. In this problem, a perfect predictor (the “mugger”) flips a coin. If the outcome is heads, the mugger asks the agent for $100, at which point the agent can decide to pay or not pay. If the outcome is tails, the mugger gives the agent $10K if and only if the mugger predicts that the agent would have paid the $100 had the outcome been heads.

Both CDT and EDT recommend not paying, and yet we will see that IBDT recommends to pay the mugger.

Let where represents heads, represents tails, and the remaining (non-empty) observations represent the various monetary outcomes. Let where represents paying the mugger and represents not paying the mugger. Without loss of generality, where and

Let . We normalize[8] to define the loss of an episode by

Note that whereas Therefore, is -optimal.
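In expectation over the coin flip, the payoffs from the problem statement give the following (the loss above is a normalization of these amounts):

```python
def expected_payoff(pays_when_heads):
    heads = 0.5 * (-100 if pays_when_heads else 0)
    tails = 0.5 * (10_000 if pays_when_heads else 0)   # mugger pays out iff it predicts the agent would pay
    return heads + tails

print(expected_payoff(True))   # 4950.0 -> paying the mugger is optimal in expectation
print(expected_payoff(False))  # 0.0
```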

The state transitions of are shown in Figure 7.

Figure 7: State transitions of the supra-POMDP that represents counterfactual mugging.

We have

whereas

Therefore, the optimal policy for is also , i.e. under this formulation it is optimal to pay the mugger.

Transparent Newcomb’s Problem

We now consider Transparent Newcomb’s Problem. In this problem, both boxes of the original problem are transparent. We consider three versions. See Figure 11 in the next section for a summary of the decision theory recommendations.

Empty-box dependent

In the empty-box dependent version, a perfect predictor Omega leaves Box B empty if and only if Omega predicts that the agent will two-box upon seeing that Box B is empty.

Let Here corresponds to observing an empty box and corresponds to observing a full box. Let where corresponds to one-boxing and corresponds to two-boxing.

Without loss of generality, where and for Namely, the policies are distinguished by the action chosen upon observing an empty or full box.

Let . Define by

The state transition graph of the supra-POMDP representing this problem is shown in Figure 8.

Figure 8: State transitions of the supra-POMDP that represents the empty-box dependent transparent Newcomb’s problem.

We now consider the expected loss in one round for each policy interacting with Similarly to the original version of Newcomb’s problem, the optimal copolicy to a policy initializes the state to , meaning the policy encoded in the state chosen by the copolicy matches the true policy of the agent.

We have

Therefore, the optimal policy for is meaning it is optimal to one-box upon observing an empty box and to two-box upon seeing a full box. Note that is also the -optimal policy.
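Enumerating the four policies against a perfect predictor confirms this, where a policy is written as (action upon seeing an empty box, action upon seeing a full box) and the payoffs are taken from the problem statement:

```python
def payoff(on_empty, on_full):
    # Omega leaves Box B empty iff it predicts two-boxing upon seeing an empty box.
    if on_empty == "two-box":
        return 1_000                          # Box B is empty; the agent takes both boxes
    # Otherwise Box B is full and the agent acts on seeing the full box.
    return 1_000_000 + (1_000 if on_full == "two-box" else 0)

for e in ["one-box", "two-box"]:
    for f in ["one-box", "two-box"]:
        print((e, f), payoff(e, f))
# best: ('one-box', 'two-box') -> 1_001_000
```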

Full-box dependent

In the full-box dependent version, Omega (a perfect predictor) puts $1M in Box B if and only if Omega predicts that the agent will one-box upon seeing that Box B is full. This example does not satisfy pseudocausality (discussed below), and therefore we will see that there is an inconsistency between the optimal policies for the supra-POMDP and the Newcombian problem.

Let and Without loss of generality, we again have where and for

The state transition graph of the supra-POMDP representing this problem is shown in Figure 9.

Figure 9: State transitions of the supra-POMDP that represents the full-box dependent transparent Newcomb’s problem.

Define by

Then

Furthermore,

Therefore, the optimal policies for the supra-POMDP are and From another perspective, the optimal copolicy to for initializes the state to . The optimal copolicy to for initializes the state to . As a result, an agent can achieve low regret on and either one- or two-box upon observing the full box. In particular, for all policies they will learn to expect an empty box.

On the other hand, the optimal policies for the Newcombian problem are and . To see this, note that and Then whereas and The inconsistency between the optimal policies for and is a result of the fact that this Newcombian problem fails to satisfy pseudocausality, a condition we describe in the last section.
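The mismatch can be checked by evaluating each of the four policies in two ways: its actual payoff against a perfect predictor, and its worst-case expected loss in the supra-POMDP, where an adversarial copolicy picks the policy encoded in the state and any deviation by the agent destroys the remaining probability mass (contributing zero loss). The sketch below uses an illustrative loss normalization; the representation and names are ours.

```python
ACTIONS = ["one-box", "two-box"]

def loss(payoff):
    return 1 - payoff / 1_001_000       # illustrative normalization, as before

def trajectory(agent, predicted):
    """Payoff received when `predicted` is the policy Omega predicts;
    returns None if the agent's action contradicts the prediction."""
    obs = "full" if predicted["full"] == "one-box" else "empty"
    action = agent[obs]
    if action != predicted[obs]:
        return None                     # contradiction: mass is lost
    box_b = 1_000_000 if obs == "full" else 0
    return box_b + (1_000 if action == "two-box" else 0)

def newcomb_value(agent):
    """Actual payoff against a perfect predictor (predicted policy = agent's policy)."""
    return trajectory(agent, agent)

def supra_worst_case_loss(agent):
    """Worst case over the copolicy's choice of predicted policy."""
    branches = []
    for e in ACTIONS:
        for f in ACTIONS:
            t = trajectory(agent, {"empty": e, "full": f})
            branches.append(0.0 if t is None else loss(t))
    return max(branches)

for e in ACTIONS:
    for f in ACTIONS:
        agent = {"empty": e, "full": f}
        print((e, f), newcomb_value(agent), round(supra_worst_case_loss(agent), 6))
# Policies that one-box on a full box earn $1M against the perfect predictor, but the
# supra-POMDP's worst case rewards two-boxing on an empty box -- hence the mismatch.
```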

Epsilon-noisy full-box dependent

We now consider a variant of the full-box dependent version in which we assume that Omega is not a perfect predictor. In this case, Omega also puts $1M in Box B with probability when the agent will two-box upon seeing that Box B is full.

Figure 10 shows the state transitions starting from states and (The full state transition graph is the graph in which the two right-most paths of the graph of Figure 9 are replaced by the trees in Figure 10.)

Figure 10: Modified branches of the two right-most paths in the state transitions in Figure 9 for the epsilon-noisy variant of the full-box dependent Newcomb’s problem.

Let and Define by

A distinguishing feature of the supra-POMDP for this problem (compared to the other problems we have considered) is that the optimal policy for depends on the number of episodes. In the case of one episode, the optimal copolicy to and is the same. Namely, Then the expected loss for one episode is

We consider the extension of given by the mean per-episode loss. For two episodes, the optimal copolicy to and again has The interaction of and produces a supracontribution over given by where [9] and Then the expected mean loss for over two episodes is

On the other hand, the interaction of and produces a supracontribution over given by where is defined by

Then

More generally, for any number of episodes, the worst-case copolicy to has As a result, for all

On the other hand, over episodes, the copolicy to that always has results in an expected mean loss of which tends to zero as The copolicy to that instead has on most episodes results in a mean expected loss that converges to Therefore, for sufficiently many episodes, is the optimal policy for and moreover the optimal loss for converges to the optimal loss for

Summary of decision theory recommendations

In Figure 11, we summarize the extent to which each decision theory makes optimal recommendations on the example problems.

Figure 11: Table indicating for which problems CDT, EDT, and IBDT make optimal recommendations. For readability, a blank entry indicates that the recommendation is not optimal. Note that EDT is ill-posed when it involves conditioning on a probability zero event. The only example where IBDT fails to be optimal or optimal in the limit does not satisfy pseudocausality.

Readers familiar with decision theory will observe that IBDT can be seen as an approximation to functional decision theory, which makes optimal recommendations across all the examples here. IBDT has the advantage of being well-defined in the sense that it can be run as code in an agent learning from the environment.

Pseudocausality

In this section, we define a condition (pseudocausality) that holds for all of the Newcombian problems discussed above, except for the (non-noisy) full-box dependent transparent Newcomb’s problem. We then state a theorem that illuminates the significance of this condition. In particular, pseudocausality allows one to translate optimality for the supra-POMDP into optimality for the corresponding Newcombian problem. Intuitively, pseudocausality means that there does not exist a suboptimal policy for such that the optimal policy and the suboptimal policy disagree only on events that are probability zero under the suboptimal policy.

To formally define pseudocausality, we consider the set of outcomes that are compatible with a given policy Namely, define

In other words, if the sequence of actions in agrees with whereas the observations can be arbitrary.

Definition: Pseudocausality

A Newcombian problem satisfies pseudocausality if there exists a -optimal policy such that for all if then is also optimal for

An example where pseudocausality fails

To see why pseudocausality fails for the full-box dependent transparent Newcomb’s problem, recall that the optimal policies for are and We have

and

However,

and

but and are not optimal for

We leave it to the reader to check that all other examples discussed in this post satisfy pseudocausality.

Theorem on pseudocausality and optimality

The significance of pseudocausality is given by the next theorem. It states that if pseudocausality holds for a Newcombian problem , then the optimal loss for the corresponding fuzzy supra-POMDP converges to the optimal loss for the Newcombian problem. Furthermore, given a time discount if a indexed family of policies is optimal for in the limit, then the family is also optimal for in the limit.

Theorem [Alexander Appel (@Diffractor), Vanessa Kosoy (@Vanessa Kosoy)]:

Let be a Newcombian problem that satisfies pseudocausality. Then

Furthermore, if is a family of policies such that then

See the proof section for the proof.

Acknowledgements

Many thanks to Vanessa Kosoy, Marcus Ogren, and Mateusz Bagiński for their valuable feedback on initial drafts. Vanessa’s video lecture on formalizing Newcombian problems was also very helpful in writing this post.

  1. ^

    Previously called ultracontributions.

  2. ^

    To make comparisons, we briefly review these decision theories, but this is not the focus of the post.

  3. ^

    More generally, if is a measurable space, we define a contribution to be a measure such that

  4. ^

    This terminology is motivated by the notions of semimeasures and semi-probabilities as discussed in An Introduction to Universal Artificial Intelligence (M. Hutter, D. Quarel, and E. Catt).

  5. ^

    Here it is necessary to use the more general definition of a contribution as a measure.

  6. ^

    Here suprakernel simply means a map in which the range is the set of supracontributions over some set.

  7. ^

    For more detail, see Toward Idealized Decision Theory §2.2 (Soares and Fallenstein, 2015) and Cheating Death in Damascus §3 (Levinstein and Soares, 2020).

  8. ^

    If then

  9. ^

    As a technicality, we always assume that the empty action is taken at the beginning of an episode.
