An Introduction to Reinforcement Learning for Understanding Infra-Bayesianism
Introduction
The goal of this post is to give a summary of classical reinforcement learning theory that primes the reader to learn about infra-Bayesianism, a new framework for reinforcement learning that aims to solve problems related to AI alignment. We will concentrate on basic aspects of the classical theory that have analogous concepts in infra-Bayesianism, and explain these concepts using infra-Bayesian conventions. The more technical proofs are contained in the proof section.
For the first part of this sequence and for links to other writings, see What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism.
One special case of reinforcement learning is the case of stochastic bandits. For example, a bird may have a choice of three levers. When the bird steps on a lever, either some tasty seeds come tumbling out of a feeder or nothing happens. A mechanical system that the bird cannot observe delivers these seeds according to probabilities that are fixed for each lever.
An all-knowing bird can devise a strategy to maximize the expected number of seeds they will obtain. Some natural questions include: If a bird doesn’t have any knowledge of the mechanical system, what is a good algorithm to follow to learn how to gather lots of seeds? How much worse off is this bird compared to the all-knowing bird? These are the types of questions that can be mathematically formalized and studied using stochastic bandits.
Actions, Observations, and Environments
The formal set-up for reinforcement learning starts with time-steps, actions, observations, and environments:
At discrete time-steps indexed by natural numbers, an agent performs an action a, chosen from a set A of actions.
The agent then receives an observation from a set O of observations.
Assumption: Finiteness of actions and observations
For simplicity, we assume that the set of actions and the set of observations are both finite and non-empty.
The time-horizon of a problem is the number of time-steps after which the alternation of action and observations terminates, which may be finite or infinite. In the infra-Bayesianism sequence, time-horizons are assumed to be infinite, which is arguably the more realistic assumption for modeling open-world agents. We will similarly assume infinite time-horizons, in contrast to episodic reinforcement learning, in which the time-horizon is finite and the action-observation process sequentially repeats in episodes, such as in repeated games of Go.
In the bird example from the introduction, we have
A = {Lever 1, Lever 2, Lever 3}, and O = {Seeds, No Seeds}.
Time-steps correspond to sequential opportunities to step on levers.
An element of the cartesian product A×O is a pair of an action a and the following observation o, which is written as (a,o).
Definition: Histories and destinies
The set (A×O)∗ denotes the set of finite sequences of action-observation pairs. Elements of (A×O)∗ are called histories.
The set (A×O)ω denotes the set of infinite sequences of action-observation pairs. Elements of (A×O)ω are called destinies.
Cantor’s diagonal argument proves that the set of infinite binary sequences is an uncountable set. If either A or O has at least two elements, then there exists a bijection from a subset of (A×O)ω to the set of infinite binary sequences, and thus (A×O)ω is also uncountable. For example, if A contains a0 and a1 and O contains o, then the subset of (A×O)ω of sequences with actions equal to a0 or a1 and observations all equal to o is in bijection with the set of infinite binary sequences.
The levers in the bird example are what we intuitively might want to call the “environment,” which responds to the agent's actions by providing observations. An environment is formally defined as follows:
Definition: Probabilistic environment
A probabilistic environment μ : (A×O)∗ × A → ΔO is a function that takes in a history and an action and returns a probability distribution over observations.
Here is a concrete example of an environment. Let h∈(A×O)∗ denote a history. Define μ:(A×O)∗×A→ΔO by
μ(h, Lever 1) := P(Seeds) = 0.9, μ(h, Lever 2) := P(Seeds) = 0.5, and μ(h, Lever 3) := P(Seeds) = 0.3.
We’ll use the notation μ(o|h,a) to denote the probability of o∈O according to the distribution that the environment outputs when given (h,a)∈(A×O)∗×A.
Notice that in this example, μ does not depend on h. However, in full generality, an environment may depend on h. For example, a lever could give seeds with probability 1 only if it was not the most recently visited lever, and otherwise it could give seeds with probability 0.5.
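To make the formal definition concrete, here is a minimal Python sketch (my own illustration, not from the original post; names such as bandit_env and SEED_PROBS are hypothetical) of the three-lever environment and of a history-dependent variant like the one just described. A distribution over observations is represented as a dictionary of probabilities.
```python
import random

ACTIONS = ["Lever 1", "Lever 2", "Lever 3"]        # the set A
OBSERVATIONS = ["Seeds", "No Seeds"]               # the set O

# Seed probabilities for the three levers, matching the example values above.
SEED_PROBS = {"Lever 1": 0.9, "Lever 2": 0.5, "Lever 3": 0.3}

def bandit_env(history, action):
    """Probabilistic environment: (history, action) -> distribution over observations.
    A distribution is represented as a dict mapping each observation to its probability.
    This environment ignores the history entirely."""
    p = SEED_PROBS[action]
    return {"Seeds": p, "No Seeds": 1.0 - p}

def history_dependent_env(history, action):
    """A history-dependent variant: a lever gives seeds with probability 1 unless it was
    the most recently stepped-on lever, in which case the probability drops to 0.5."""
    last_action = history[-1][0] if history else None
    p = 0.5 if action == last_action else 1.0
    return {"Seeds": p, "No Seeds": 1.0 - p}

def sample_from(distribution):
    """Sample one outcome from a finite distribution given as a dict of probabilities."""
    outcomes = list(distribution.keys())
    weights = list(distribution.values())
    return random.choices(outcomes, weights=weights, k=1)[0]

print(bandit_env([], "Lever 1"))               # {'Seeds': 0.9, 'No Seeds': 0.1}
print(sample_from(bandit_env([], "Lever 1")))  # 'Seeds' with probability 0.9
```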
Formalizing Preferences: Loss Functions and Momentary Loss
One motivation for reinforcement learning as a framework for artificial intelligence is that many—or perhaps all—”intelligence-requiring” tasks can be formulated by means of an agent who prefers some observations over others and seeks to take actions that bring about their preferred observations. This preference is captured numerically with loss or reward functions.
For example, we can define a function L : O → R that captures the bird’s preference for seeds over no seeds, and thus the lost value when no seeds are received, by:
L(Seeds) = 0, and L(No Seeds) = 1.
This is an example of a loss function.
Definition: Loss function
A deterministic loss function is a function L : (A×O)∗ → [0,1]. For a given history h ending in observation o ∈ O, L(h) is the momentary loss of o.
Rather than focusing on lost value, we could measure gained value with a reward function R : O → R such that R(Seeds) = 1 and R(No Seeds) = 0. In this post, we will use loss functions, but everything can be reformulated in terms of reward functions.
Total loss: What is an agent optimizing for?
Imagine that in the bird example, a bird steps on Lever 3, and receives seeds. This gives some evidence that Lever 3 is a good choice, but a bird would be much happier in the long run knowing that, in expectation, Lever 1 delivers more seeds. Any such bird has a tradeoff between exploring to gain a better idea of how good the various levers are versus exploiting their current knowledge, i.e. stepping on the lever that looks best with the current information. This is called the exploration versus exploitation tradeoff.
Suppose a bird is famished, and therefore cares strongly about obtaining seeds in the short-term. This bird might step on the lever that looks best under the current information. As a result, a famished bird might not discover a better lever, getting stuck stepping on a worse lever forever, exploiting too much and exploring too little. So, “famished bird” doesn’t seem like the best model of a rational learning agent.
Suppose the time-horizon is theoretically infinite, but the bird is mortal and has no knowledge about when they will die. Such a bird should value seeds in the short-term and the long-term unequally. Furthermore, a bird will get hungry without enough seeds in the short-term. This motivates the notion of discounting, which is weighting the momentary loss of observations unequally over time when it comes to calculating the sum of the momentary losses. This sum is defined as total loss.
Definition: Total loss of a history (with geometric time discount)
Let γ ∈ [0,1) and let N denote the time-horizon, which may be infinite. The total loss of a given history (a_t,o_t)_{t=0}^{N} with time discount γ is defined by L_γ((a_t,o_t)_{t=0}^{N}) = (1−γ) ∑_{t=0}^{N} γ^t L(o_t).
Note that the coefficient (1−γ) on the total loss ensures that it is upper-bounded by 1. This fact is used in various proofs. The time discount is especially important when N is infinite, since without it, the total loss of various histories can diverge, making it impossible to meaningfully compare them. So, the time-discount is (1) a choice that makes strategies more or less farsighted, and (2) a mathematical trick that ensures that the total loss is finite.
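As a small illustration of the definition (my own sketch, with hypothetical helper names), the following Python snippet computes the normalized discounted total loss of a finite history and shows that the (1−γ) factor keeps it below 1 even when the momentary loss is 1 at every step.
```python
def momentary_loss(observation):
    """Momentary loss for the bird example: 0 for seeds, 1 otherwise."""
    return 0.0 if observation == "Seeds" else 1.0

def total_loss(history, gamma):
    """Normalized, geometrically discounted total loss of a finite history.
    history is a list of (action, observation) pairs; gamma is in [0, 1).
    With momentary losses in [0, 1], the (1 - gamma) factor keeps the result in [0, 1]."""
    return (1.0 - gamma) * sum(
        gamma ** t * momentary_loss(obs) for t, (act, obs) in enumerate(history)
    )

# A history with momentary loss 1 at every step has total loss 1 - gamma**N, approaching 1.
worst_case = [("Lever 3", "No Seeds")] * 50
print(total_loss(worst_case, gamma=0.9))   # roughly 1 - 0.9**50, about 0.995
```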
Formalizing “Strategies”: Policies
We alluded earlier to the idea that a bird could develop a strategy for gathering lots of seeds; such a strategy is formally called a policy. For example, one valid, albeit bad, policy is to choose Lever 3 with probability one without regard to past actions and observations.
As another example, before the bird has stepped on any levers, they could flip a coin to decide whether to step on Lever 1 or 2. At later time-steps, the bird could calculate the past percentage of times that each lever yielded seeds, and step on the lever with the best seed percentage 2⁄3 of the time, taking a random action otherwise. In the case of ties, the bird could also make a random choice.
By using the seed percentages, which are determined by past actions and observations, this strategy is a function of (A×O)∗. The range is ΔA because the policy outputs a probability distribution over actions. Formally defined, a policy is exactly a function of this type:
Definition: Policy
A policy is a function π:(A×O)∗→ΔA. We denote the set of policies by Π. We say that a policy is deterministic if it can be written as π:(A×O)∗→A. Otherwise, the policy is said to be stochastic.
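Here is a sketch of the example policy described above, written as a function from histories to distributions over actions. The text does not pin down how "taking a random action otherwise" works or how untried levers are scored, so the choices below (a uniform random action with the remaining 1/3 of probability, untried levers counted as seed percentage zero) are assumptions.
```python
ACTIONS = ["Lever 1", "Lever 2", "Lever 3"]

def example_policy(history):
    """Policy: history -> distribution over actions (a dict of probabilities).
    First step: a fair coin flip between Lever 1 and Lever 2. Afterwards: a lever with the
    best observed seed percentage gets probability 2/3 (split in case of ties), and the
    remaining 1/3 is spread uniformly over all levers."""
    if not history:
        return {"Lever 1": 0.5, "Lever 2": 0.5, "Lever 3": 0.0}
    counts = {a: 0 for a in ACTIONS}
    seeds = {a: 0 for a in ACTIONS}
    for action, observation in history:
        counts[action] += 1
        if observation == "Seeds":
            seeds[action] += 1
    rates = {a: seeds[a] / counts[a] if counts[a] > 0 else 0.0 for a in ACTIONS}
    best_rate = max(rates.values())
    best = [a for a in ACTIONS if rates[a] == best_rate]
    distribution = {a: (1.0 / 3.0) / len(ACTIONS) for a in ACTIONS}
    for a in best:
        distribution[a] += (2.0 / 3.0) / len(best)
    return distribution
```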
Real-valued sequences are formally defined as functions from N to R, with the notation (a_0,a_1,…) meaning that a_i ∈ R is the image of i ∈ N. For example, a sequence of zeros is the constant function that sends every natural number to zero. The set of real-valued sequences can then be written as ∏_N R := {α : N → R}, which is known as the product of R over N. In the same way, the set of stochastic policies is the product space ∏_{(A×O)∗} ΔA.
Lemma 2 below states that the set of policies can be given a topology in which it is compact. For our purposes, topology provides generalized notions of continuity and convergence in spaces that are not the real numbers. Compactness is a significant property for a topological space to have in the context of optimization, because if a space is compact, then any continuous function from that space to the real numbers attains both a maximum and a minimum in the space.
The proof of Lemma 2 depends on Lemma 1, which states that ΔA is a compact metric space. In the infra-Bayesianism sequence, the metric used on ΔA is the Kantorovich-Rubinstein metric, but in the case when A is finite, the Kantorovich-Rubinstein metric is equivalent to a simpler metric known as the total-variation metric.
Definition: Total-variation metric
Let P1 and P2 denote probability distributions over a finite set X. The total-variation metric, which we denote by d_TV, is defined by d_TV(P1,P2) := (1/2) ∑_{x∈X} |P1(x) − P2(x)|.
It is straightforward to check that d_TV satisfies the definition of a metric. As a simple case, if X = {0,1}, then given P1 and P2 in ΔX, d_TV(P1,P2) = (1/2)(|P1(0)−P2(0)| + |P1(1)−P2(1)|), but |P1(1)−P2(1)| = |1−P1(0)−(1−P2(0))| = |P1(0)−P2(0)|. Observe then that in this simple example, d_TV(P1,P2) = |P1(0)−P2(0)|.
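A short sketch of the total-variation distance for distributions on a finite set, represented as dictionaries (my own illustration; it reproduces the binary example above):
```python
def total_variation(p1, p2):
    """Total-variation distance between two distributions on a finite set,
    each given as a dict mapping outcomes to probabilities."""
    support = set(p1) | set(p2)
    return 0.5 * sum(abs(p1.get(x, 0.0) - p2.get(x, 0.0)) for x in support)

# The binary case from above: d_TV reduces to |P1(0) - P2(0)|.
p1 = {0: 0.7, 1: 0.3}
p2 = {0: 0.4, 1: 0.6}
print(total_variation(p1, p2))   # 0.5 * (0.3 + 0.3) = 0.3 = |0.7 - 0.4|
```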
Lemma 1: Compactness of ΔA under the total-variation metric
If A is finite, then the set ΔA is compact under the total-variation metric.
Proof: By assumption, A is finite, so we can enumerate A as A = {a_1, a_2, …, a_n}. Elements of ΔA, which are probability distributions over A, can be written as vectors in R^n of the form (x_1, x_2, …, x_n) such that x_i denotes the probability of a_i. By definition of a probability distribution, ∑_{i=1}^{n} x_i = 1 and x_i ≥ 0 for all i. Therefore, there is a bijection between ΔA and the standard (n−1)-dimensional simplex, which is defined as the set Δ_{n−1} := {(x_1, x_2, …, x_n) ∈ R^n : ∑_{i=1}^{n} x_i = 1, x_i ≥ 0 for all i}. (See Figure 1 for a depiction of the standard 1-dimensional and 2-dimensional simplices.)
The total-variation metric is half the metric induced by the ℓ1 norm on R^n, which is equivalent to the Euclidean norm, so both induce the same topology on ΔA. Under the Euclidean metric, the standard (n−1)-simplex is compact by the Heine-Borel theorem since it is closed and bounded. Thus, ΔA is compact. □
Figure 1: (Left) The standard 1-simplex. (Right) The standard 2-simplex.
Visualizing ΔA as Δ_{n−1} as in Figure 1, computing the total-variation distance between two elements of ΔA can be conceptualized as locating the corresponding points in R^n, projecting the points to each of the coordinate axes, then connecting the projections with a line segment and computing 1/2 of the sum of the line segment lengths. As the earlier calculation (and Euclidean geometry) suggest, in the special case of Δ_1, the two segments drawn for any two elements of Δ_1 will have the same length.
Lemma 2: Compactness of the set of policies
The set of policies Π is compact with respect to the product topology.
Proof: By Lemma 1, ΔA is compact. Then Π = ∏_{(A×O)∗} ΔA is compact with respect to the product topology by Tychonoff’s Theorem, which states that any product of compact topological spaces is compact with respect to the product topology. □
Lemma 2 remains true if we restrict attention to the set of deterministic policies.
Expected Total Loss of a Policy
Suppose we have a policy, an environment, and a history of the form: “Lever 1, No seeds.” How likely is this history to occur compared to other histories of length one? This question is fundamental to comparing various policies. To calculate the answer, we would want to know:
The probability of stepping on Lever 1 at the start according to the policy
The probability of no seeds conditional on stepping on Lever 1 according to the environment
For example, consider the previously described policy:
Before the bird has stepped on any levers, they could flip a coin to decide whether to step on Lever 1 or 2. At later time-steps, the bird could calculate the past percentage of times that each lever yielded seeds, and step on the lever with the best seed percentage 2⁄3 of the time, taking a random action otherwise. In the case of ties, the bird could also make a random choice.
We will examine how this policy interacts with the environment given by:
μ(h,Lever 1):=P(Seeds)=0.9,μ(h,Lever 2):=P(Seeds)=0.5, and μ(h,Lever 3):=P(Seeds)=0.3.
Let ϵ denote the empty history, so that π(Lever 1|ϵ) denotes the probability that Lever 1 is the first action. Using the definition of conditional probability, the probability of “Lever 1, no seeds” among length-one histories is:
P(Lever 1 and no seeds) = P(Lever 1) ⋅ P(no seeds | Lever 1) = π(Lever 1 | ϵ) ⋅ μ(no seeds | ϵ, Lever 1) = 0.5 ⋅ 0.1 = 0.05.
We could repeat this process to calculate the probability of any history of length one: the policy and environment determine everything we need to do that. The resulting distribution over histories of length one is represented as the tree in Figure 2.
Figure 2: Each path in the tree corresponds to a history of length one, with probabilities along the edges determined by the policy or the environment. To the right is the probability p of the path ending at that height, which is obtained by multiplying together the probabilities along the path.
We can further calculate the probability of a specific history of length l+1 > 1 among all length-(l+1) histories. By the chain rule of probabilities, P((a_0,o_0),...,(a_l,o_l)) is given by
P(a_0) ⋅ P(o_0 | a_0) ⋅ ∏_{t=1}^{l} P(a_t | a_0,o_0,...,a_{t−1},o_{t−1}) ⋅ P(o_t | a_0,o_0,...,a_t).
Note that for all t, P(a_t | a_0,o_0,...,a_{t−1},o_{t−1}) is determined by a chosen policy, whereas P(o_t | a_0,o_0,...,a_t) is determined by a given environment. So pairing together a policy and an environment produces a distribution over histories of a fixed length.[1] We denote all such distributions by μ^π, since the length will hopefully always be clear from context. This is formally stated in the next definition. As a convention, a sequence written with an ellipsis such that the final index is negative denotes an empty sequence.
Definition: Probability distributions on histories induced by a policy and environment
Let (a_t,o_t)_{t=0}^{l} ∈ (A×O)∗. The probability distribution μ^π ∈ Δ(A×O)∗ is defined by μ^π((a_t,o_t)_{t=0}^{l}) = ∏_{t=0}^{l} π(a_t | a_0,o_0,…,a_{t−1},o_{t−1}) ⋅ μ(o_t | a_0,o_0,…,a_t).
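The product formula translates directly into code. The sketch below (reusing the hypothetical example_policy and bandit_env functions from the earlier sketches) computes μ^π of a finite history and reproduces the length-one calculation above:
```python
def history_probability(history, policy, environment):
    """Probability of a finite history under mu^pi, via the chain-rule product above.
    policy(history) and environment(history, action) both return dicts of probabilities."""
    probability = 1.0
    for t in range(len(history)):
        past = history[:t]
        action, observation = history[t]
        probability *= policy(past)[action]                     # pi(a_t | past)
        probability *= environment(past, action)[observation]   # mu(o_t | past, a_t)
    return probability

# Reproduces the length-one calculation from the text: 0.5 * 0.1 = 0.05.
print(history_probability([("Lever 1", "No Seeds")], example_policy, bandit_env))
```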
Recall that we previously defined the total loss of a history h = (a_t,o_t)_{t=0}^{N} by L_γ(h) = (1−γ) ∑_{t=0}^{N} γ^t L(o_t). Since μ^π is a distribution over the domain of L_γ, we can now calculate the expected total loss of a policy for a given environment.
Definition: Expected total loss of policy
The expected total loss of a policy π in the environment μ is E_{h∼μ^π}[L_γ(h)].
The distribution μ^π gives another way to describe an environment: For a fixed environment, we can construct a function ~μ that takes in a policy and outputs a probability distribution over histories. This function is an example of a Markov kernel. We will repeat a similar construction to define crisp causal laws in the next post.
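Before moving on, here is a rough sketch of how the expected total loss can be estimated by sampling histories from μ^π (a Monte Carlo rollout), reusing the sample_from, total_loss, and bandit_env helpers from the earlier sketches. Truncating at a finite horizon is an approximation; the geometric discount bounds the resulting error by γ raised to the horizon.
```python
def rollout(policy, environment, horizon):
    """Sample a history of the given finite length by alternating policy and environment."""
    history = []
    for _ in range(horizon):
        action = sample_from(policy(history))
        observation = sample_from(environment(history, action))
        history.append((action, observation))
    return history

def estimate_expected_total_loss(policy, environment, gamma, horizon=200, num_samples=2000):
    """Monte Carlo estimate of E_{h ~ mu^pi}[L_gamma(h)].
    Truncating at a finite horizon changes the total loss by at most gamma**horizon."""
    total = 0.0
    for _ in range(num_samples):
        total += total_loss(rollout(policy, environment, horizon), gamma)
    return total / num_samples

# The policy that always steps on Lever 1 should have expected total loss near 0.1.
always_lever_1 = lambda history: {"Lever 1": 1.0, "Lever 2": 0.0, "Lever 3": 0.0}
print(estimate_expected_total_loss(always_lever_1, bandit_env, gamma=0.9))
```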
Optimal and Bayes-Optimal Policies
In this section, we will pin down what we mean by a “best” policy, which provides a benchmark to which we can compare other policies. One choice is to compare the performance of an agent learning about the environment to a hypothetical agent who “knows” the environment ahead of time and therefore acts so as to achieve minimum loss in expectation. The policy of this “all-knowing” agent is the optimal policy. Given uncertainty about the true environment, we could instead consider the policy that achieves minimum expected loss, where the expectation is taken both over a prior distribution over possible environments and over the randomness of the environments themselves. This is called the Bayes-optimal policy.
Definition: Optimal policy for an environment μ
Let an environment μ be fixed. We say that π∗ is an optimal policy for μ if π∗ ∈ argmin_{π∈Π} E_{h∼μ^π}[L_γ(h)].
So, the optimal policy for a given environment is the policy that minimizes the expected total loss in that environment. There is something to prove, which we do later in this section, in order to say that the optimal policy exists, namely that
argmin_{π∈Π} E_{h∼μ^π}[L_γ(h)] ≠ ∅.
Note that the optimal policy isn’t necessarily unique, as different observations could have the same momentary loss, or different actions could lead to a certain observation with equal probability.
If you understand the expression
π∗ ∈ argmin_{π∈Π} E_{h∼μ^π}[L_γ(h)],
you can skip the next example and go straight to the section where we prove that the optimal policy is well-defined. Otherwise, read the next section to understand the notation through an example.
Example of an optimal policy in the HeavenHell environment:
To understand the definition of the optimal policy, we will walk through an example of an optimal policy for the HeavenHell environment described by Marcus Hutter in Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability.
Let the set of actions be A={l,r} for left and right and O={V,H} for heaven and hell. The loss function is defined by L(V)=0 and L(H)=1. We will use an infinite time horizon. If the first action is l, then the agent is in heaven and receives loss of zero at each time step forever. If the first action is r, then the agent is in hell and receives loss of one at each time step forever.
One optimal policy π∗ is to go left with probability one at every time step, which is denoted by π∗(l|h)=1 and π∗(r|h)=0 for all h∈(A×O)∗.
Given any history that starts with l, at every time-step, the environment outputs a distribution over {V,H} that is one at V. So μ^{π∗} is a distribution that is one on histories of the form (l,V,l,V,...) and zero otherwise.
The loss of any history h starting with l is L_γ(h) = (1−γ) ∑_{t=0}^{∞} 0 ⋅ γ^t = 0. So the expected total loss is E_{h∼μ^{π∗}}[L_γ(h)] = 1 ⋅ 0 = 0. Since expected total loss is always nonnegative, π∗ is an optimal policy.
We can check that the loss of any history that starts with r is (1−γ) ∑_{t=0}^{∞} 1 ⋅ γ^t = (1−γ) ⋅ 1/(1−γ) = 1. So a policy π_p that goes right with probability p ∈ (0,1] at the first time-step has expected loss E_{h∼μ^{π_p}}[L_γ(h)] = 1 ⋅ p + (1−p) ⋅ 0 = p > 0.
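The HeavenHell calculations can also be checked numerically. The sketch below (my own encoding; heaven_hell and the other names are hypothetical) represents HeavenHell as an environment function and verifies that a policy going right with probability p has expected total loss approximately p:
```python
def heaven_hell(history, action):
    """HeavenHell: the very first action determines every observation.
    A first action of l yields heaven (V, loss 0) forever; r yields hell (H, loss 1) forever."""
    first_action = history[0][0] if history else action
    return {"V": 1.0, "H": 0.0} if first_action == "l" else {"V": 0.0, "H": 1.0}

def hh_total_loss(first_action, gamma, horizon=500):
    """Truncated total loss of the destiny obtained by repeating one action forever."""
    history, total = [], 0.0
    for t in range(horizon):
        distribution = heaven_hell(history, first_action)
        observation = "V" if distribution["V"] == 1.0 else "H"
        total += (1.0 - gamma) * gamma ** t * (0.0 if observation == "V" else 1.0)
        history.append((first_action, observation))
    return total

gamma, p = 0.9, 0.25   # p = probability of going right at the first time-step
loss_left, loss_right = hh_total_loss("l", gamma), hh_total_loss("r", gamma)
print(loss_left, loss_right)                 # about 0.0 and 1.0
print(p * loss_right + (1 - p) * loss_left)  # expected loss about 0.25 = p
```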
Existence of an optimal policy
Proposition 1: Existence of an optimal policy
If A and O are finite, an optimal policy for a given environment μ exists; namely, argmin_{π∈Π} E_{h∼μ^π}[L_γ(h)] ≠ ∅. Furthermore, the optimal policy can be chosen to be deterministic.
We start this section with a summary of the main idea of the existence proof. The full proof is in the proof section. A restatement of Proposition 1 is that the function π ↦ E_{h∼μ^π}[L_γ(h)] attains a minimum. The usual technique to show that a function attains a minimum or maximum over some domain is to prove that the domain is compact and the function is continuous: then the result follows from the fact that a continuous function over a compact domain attains a maximum and a minimum. We have compactness of Π by Lemma 2.
To show continuity, we take a convergent sequence π_n → π ∈ Π, aiming to show that E_{h∼μ^{π_n}}[L_γ(h)] → E_{h∼μ^π}[L_γ(h)] as n → ∞. In other words, we need to show that for large enough n, |E_{h∼μ^{π_n}}[L_γ(h)] − E_{h∼μ^π}[L_γ(h)]| becomes arbitrarily small. The key idea of the proof for the infinite time-horizon case is to define L_γ|_N for N ∈ N as
L_γ|_N((a_t,o_t)_{t=0}^{∞}) := (1−γ) ∑_{t=0}^{N} γ^t L(o_t),
which we can conceptualize as the restriction of L_γ to a finite time-horizon. We can show that, uniformly over all probability distributions, the expectation of L_γ|_N gets close to the expectation of L_γ as N becomes large. Furthermore, for any fixed N < ∞, |E_{h∼μ^{π_n}}[L_γ|_N(h)] − E_{h∼μ^π}[L_γ|_N(h)]| becomes arbitrarily small as n becomes large. By applying the triangle inequality, these two statements are enough to complete the proof for infinite time-horizons.
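Written out, the two estimates behind this argument take the following form (my own reconstruction of the hinted steps; the exact constants are not stated in the post):
```latex
% Tail bound: since momentary losses lie in [0,1], restricting to the first N+1 time-steps
% changes the total loss of any destiny h by at most gamma^{N+1}:
\[
\bigl| L_\gamma(h) - L_\gamma|_N(h) \bigr|
  = (1-\gamma)\sum_{t=N+1}^{\infty} \gamma^{t} L(o_t)
  \le (1-\gamma)\sum_{t=N+1}^{\infty} \gamma^{t}
  = \gamma^{N+1}.
\]
% Triangle inequality: applying the tail bound under both distributions,
\[
\Bigl| \mathbb{E}_{h\sim\mu^{\pi_n}}[L_\gamma(h)] - \mathbb{E}_{h\sim\mu^{\pi}}[L_\gamma(h)] \Bigr|
  \le 2\gamma^{N+1}
  + \Bigl| \mathbb{E}_{h\sim\mu^{\pi_n}}[L_\gamma|_N(h)] - \mathbb{E}_{h\sim\mu^{\pi}}[L_\gamma|_N(h)] \Bigr|,
\]
% so one can first choose N to make the tail terms small and then n to make the
% finite-horizon term small.
```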
It is natural to make an approximation argument using a finite time horizon because of the definition of the product topology. Recall that Π = ∏_{(A×O)∗} ΔA. A basis for the product topology on Π consists of products of open subsets of ΔA that are equal to ΔA in all but finitely many components. Convergence of π_n → π in the product topology then means that for every open set U containing π, we can find an M large enough such that for all n ≥ M, π_n ∈ U. For any basic open set U, this only puts a condition on π_n that is relevant to finitely many histories, because U differs from ΔA in only finitely many components.
Bayes Optimal Policy
Since the problem of reinforcement learning is about agents learning in an unknown environment, we now define a different notion of optimality, the Bayes optimal policy, which is relevant when we have a prior over environments rather than certainty about the true environment. To simplify the definition, we can write a prior over environments as a new environment called a Bayes mixture of environments.
Bayes Mixture of Environments
Consider for example an environment, HeavenHell Opp (H&H Opp), which does the opposite of HeavenHell (H&H): the first action r leads to heaven and l leads to hell. A possible prior ζ over these environments specifies probability 0.4 of being in HeavenHell and probability 0.6 of being in HeavenHell Opp. This prior represents a new environment, μ_ζ, that takes in a history and action and returns a distribution over observations in the following way.
Given a history, Bayes’ rule produces a posterior distribution over the environments. For example, given the history (l,V), the probability of H&H is
p(H&H | l,V) = p(l,V | H&H) ⋅ p(H&H) / p(l,V) = (1 ⋅ 0.4) / 0.4 = 1.
This calculation can be conceptualized using the tree in Figure 3. To calculate p(l,V|H&H), we use the path starting at H&H and ending at V. To calculate p(l,V), we start from the base node of the tree, and find all paths that go through l then V. We then sum together the probabilities of these paths, which gives a weighted sum with weights determined by the prior. The value of p(H&H) is determined by the prior ζ.
Observe that this calculation agrees with the fact that the history (l,V) is impossible in H&H Opp.
Figure 3: Each path in the tree corresponds to a history of length one, with probabilities along the edges determined by the prior, policy, or the environment.
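Here is a sketch of the posterior update and the resulting mixture environment μ_ζ for the 0.4/0.6 prior, reusing the heaven_hell function from the earlier HeavenHell sketch (heaven_hell_opp and the other names are my own). The action probabilities from the policy cancel in the posterior, so they are omitted:
```python
def heaven_hell_opp(history, action):
    """HeavenHell Opp: the opposite convention, so a first action of r leads to heaven."""
    first_action = history[0][0] if history else action
    return {"V": 1.0, "H": 0.0} if first_action == "r" else {"V": 0.0, "H": 1.0}

PRIOR = {"H&H": 0.4, "H&H Opp": 0.6}
ENVS = {"H&H": heaven_hell, "H&H Opp": heaven_hell_opp}

def likelihood(env, history):
    """Probability of the observations in a history under an environment, with actions given."""
    probability = 1.0
    for t, (action, observation) in enumerate(history):
        probability *= env(history[:t], action)[observation]
    return probability

def posterior(prior, history):
    """Posterior over environments given a history, by Bayes' rule."""
    weights = {name: prior[name] * likelihood(ENVS[name], history) for name in prior}
    normalizer = sum(weights.values())
    return {name: w / normalizer for name, w in weights.items()}

def mixture_environment(history, action):
    """The Bayes mixture mu_zeta: average the environments' predictions under the posterior."""
    post = posterior(PRIOR, history)
    return {obs: sum(post[name] * ENVS[name](history, action)[obs] for name in ENVS)
            for obs in ["V", "H"]}

print(posterior(PRIOR, [("l", "V")]))    # {'H&H': 1.0, 'H&H Opp': 0.0}, as in the text
print(mixture_environment([], "l"))      # {'V': 0.4, 'H': 0.6}
```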
Now we define the Bayes optimal policy with respect to a prior over environments.
Definition: Bayes optimal policy
Let ζ be a prior over environments, which can be written as an environment μ_ζ. A policy π∗_ζ is said to be Bayes optimal with respect to the prior ζ if it is the optimal policy for μ_ζ. Equivalently, π∗_ζ ∈ argmin_{π∈Π} E_{μ∼ζ}[E_{h∼μ^π}[L_γ(h)]].
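As a concrete illustration of the definition (my own calculation, not from the post): in both HeavenHell and HeavenHell Opp the first action determines the entire destiny, so a Bayes-optimal policy for the 0.4/0.6 mixture only needs to choose its first action well. Reusing PRIOR from the sketch above:
```python
# The total loss is 1 exactly when the first action leads to hell, and 0 otherwise.
# Expected total loss under the prior for the two deterministic first actions:
expected_loss_left = PRIOR["H&H"] * 0.0 + PRIOR["H&H Opp"] * 1.0   # l is hell only in H&H Opp
expected_loss_right = PRIOR["H&H"] * 1.0 + PRIOR["H&H Opp"] * 0.0  # r is hell only in H&H

print(expected_loss_left, expected_loss_right)   # 0.6 0.4
# Under this prior, a Bayes-optimal policy therefore goes right at the first time-step.
```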
The existence of the Bayes optimal policy given a prior over environments follows from the existence of the optimal policy for the environment formed out of that prior. Note that these optimality definitions depend on the choice of time-discount γ. A Bayes-optimal family of policies is a set of policies indexed by γ, each of which is Bayes-optimal for the corresponding γ.
Definition: Bayes-optimal family of policies
Let ζ be a prior over environments. A family of policies {π_γ}_{γ∈[0,1)} is a Bayes-optimal family with respect to ζ if for all γ ∈ [0,1), π_γ is Bayes-optimal with respect to ζ.
Regret and Bayesian Regret
One way to evaluate a policy given an environment is to measure the difference between the expected loss for that policy and the expected loss of the optimal policy. Informally speaking, this quantifies how much an agent “regrets” using a policy that is not the optimal one for some environment.
Definition: Regret of a policy π with respect to an environment μ
Let an environment μ and continuous loss function L_γ be given, and let π∗_μ be an optimal policy for μ. Then the regret of a policy π is Reg(π,μ,L_γ) := E_{h∼μ^π}[L_γ(h)] − E_{h∼μ^{π∗_μ}}[L_γ(h)].
Since the optimal policy minimizes expected loss, an equivalent definition of the regret of a policy π is Reg(π,μ,γ) := E_{h∼μ^π}[L_γ(h)] − min_{π∗∈Π} E_{h∼μ^{π∗}}[L_γ(h)]. Note that regret is always nonnegative.
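As a small worked example (my own, reusing SEED_PROBS from the bandit sketch): in the three-lever environment the history is irrelevant, so always stepping on Lever 1 is an optimal policy, and the regret of any other constant policy can be computed in closed form:
```python
# Expected total loss of the constant policy "always step on lever a" is 1 - P(Seeds | a),
# because the environment ignores history and the (1 - gamma)-normalized discount sums to 1.
expected_total_loss = {lever: 1.0 - p for lever, p in SEED_PROBS.items()}

optimal_loss = min(expected_total_loss.values())   # 0.1, achieved by always choosing Lever 1
regret_always_lever_3 = expected_total_loss["Lever 3"] - optimal_loss
print(regret_always_lever_3)   # 0.7 - 0.1 = 0.6
```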
Similarly to how we define a Bayes optimal policy for a prior, Bayesian regret is defined for a prior over environments:
Definition: Bayesian regret of a policy π given a prior ζ over environments (Bayes risk)
Let ζ denote a prior over environments and L_γ denote a continuous loss function. Then the Bayesian regret of a policy π is BReg(π,ζ,L_γ) := E_{μ∼ζ}[Reg(π,μ,L_γ)].
In the next section, we will use regret and Bayesian regret to define a property of priors and classes of environments called learnability.
Traps and Learnability
Recall that γ∈[0,1) is a time discount factor in the definition of the loss function. The closer γ is to one, the slower the weights decay when proceeding through time. A time discount near one is called a low time-discount, since future rewards are weighted similarly to rewards at the start.
Different choices of time discounts result in different loss functions. Therefore, it might be desirable to construct a set of policies that depend on the time discount. This is denoted by a family of policies {π_γ}_{γ∈[0,1)} indexed by γ ∈ [0,1).
Learning requires acquiring information and therefore takes time. One way to judge a time-indexed family of policies while taking this into account is to judge the performance of the policies as the time-discount goes to one. This is captured below in the definition of non-anytime learnability.
Low loss in the γ-to-one limit necessitates avoiding traps; in the case when total loss is normalized to be less than or equal to one, a trap is a state that forces the regret to be bounded away from zero. (In the case when total loss is not normalized, a trap is a state that forces the regret to grow linearly with time.)
An example of a trap is the hell state of HeavenHell and HeavenHell Opp, where the momentary regret is one at each time-step. There is no policy that can be used in both HeavenHell and HeavenHell Opp such that its regret goes to zero in the low-time-discount limit: we can design a policy with this property for one of these environments, but not for both at once. The non-existence of such a policy means that HeavenHell and HeavenHell Opp, considered together as one class of environments, are not “learnable” by the agent, which we now define:
Definition: Non-anytime learnable class of environments
A class of environments (hypotheses) {μ_i}_{i∈I} is non-anytime learnable if there exists a family of policies π_γ indexed by γ ∈ [0,1) such that for all i ∈ I, lim_{γ→1} Reg(π_γ, μ_i, γ) = 0. In this case, we say that π_γ learns the class of environments.
In this definition, “non-anytime” refers to the fact that we have a different policy for each time discount. The definition does not require that any particular element of the family is actually optimal in any of the environments. Instead, it requires that the family approaches optimal performance in every environment as the time discount goes to one.
We can write an analogous definition for a learnable prior.
Definition: Non-anytime learnable prior over environments
A prior ζ over environments is non-anytime learnable if there exists a family of policies π_γ indexed by γ ∈ [0,1) such that lim_{γ→1} BReg(π_γ, ζ, γ) = 0. In this case, we say that π_γ learns the prior.
We now consider the connection between learnability of a class of environments and Bayes-optimality with respect to a prior on that class. The condition on the prior is that it is non-dogmatic, defined as follows:
Definition: Non-dogmatic prior
A prior ζ on a countable class of environments {μ_i}_{i∈I} is non-dogmatic if for all i ∈ I, ζ(μ_i) ≠ 0.
The next proposition states that if a countable class of environments is learnable, then for any non-dogmatic prior over it, any family of policies that is Bayes-optimal with respect to that prior must also learn the class. As is demonstrated in the proof (in the proof section), countability is an important hypothesis. The non-dogmatic prior condition is also essential: if the prior assigns probability zero to some environment, the performance of the family on that environment does not affect the Bayesian regret, which means that the policy could perform arbitrarily badly in that environment and still be Bayes-optimal.
Proposition 2: If a countable class of environments {μ_i}_{i∈I} is non-anytime learnable, then the Bayes-optimal family of policies for any non-dogmatic prior ζ on {μ_i}_{i∈I} learns the class.
Markov Decision Processes
A Markov decision process (MDP) is a commonly-used model for sequential decision making. In this section, we discuss how every environment can be modeled as an MDP in which the set of states is the set of histories from the domain of the environment.
Definition: Markov decision process
A Markov decision process M is a tuple M=(S,A,T,L,γ) where
S is a set of states;
A is a set of actions;
T:S×A→ΔS is a transition kernel;
L:S×A→[0,1] is a momentary loss function; and
γ∈[0,1) is a time discount.
Intuitively, the transition kernel of the MDP that models an environment must ensure that the agent can only transition to a history that is equal to a single observation appended to the end of the current history. Furthermore, the transition kernel should agree with the environment about the various probabilities of those valid transitions. To ensure this, the probability of a valid transition can be taken to be equal to the probability of the corresponding observation that is appended. This is formalized in the proof of the following proposition.
Proposition 3: Every probabilistic environment can be modeled as an MDP with at most countably many states.
Proof: Recall that a probabilistic environment is defined to be a function μ:(A×O)∗×A→ΔO. Define S=(A×O)∗, which is countable since by assumption, A and O are finite. Define T:S×A→ΔS by
T(s′ | s, a) := μ(o | s, a) if s′ = (s, a, o) for some o ∈ O, and T(s′ | s, a) := 0 otherwise.
This is a valid transition kernel: for each state-action pair (s,a), the probabilities T(⋅|s,a) sum to one because μ(⋅|s,a) is a probability distribution over O. □
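A sketch of this construction in code, representing a state as a tuple of action-observation pairs and reusing the bandit_env function from the earlier sketch (the name history_mdp_transition is hypothetical):
```python
def history_mdp_transition(environment, state, action):
    """Transition kernel T(. | s, a) of the history-MDP built from an environment.
    A state is a history, represented as a tuple of (action, observation) pairs. The only
    successors with positive probability append (action, o) to the current history,
    with probability mu(o | s, a)."""
    observation_distribution = environment(list(state), action)
    return {state + ((action, observation),): probability
            for observation, probability in observation_distribution.items()}

# From the empty history, stepping on Lever 2 in the bandit environment leads to one of
# two successor histories, each with probability 0.5.
print(history_mdp_transition(bandit_env, (), "Lever 2"))
```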
In the following post, we consider the infra-Bayesian analogues to the machinery we have developed so far.
Acknowledgements
I am grateful to Vanessa Kosoy for her valuable comments and suggestions on initial drafts. I also thank Vanessa and Alex Appel for addressing numerous questions and Hannah Gabor, whose feedback improved the writing and clarity of ideas.
^The space of destinies is a measurable space with respect to the cylindrical σ−algebra. By Carathéodory’s extension theorem, there exists a unique probability measure on this space that agrees on cylinder sets with the probability measure on histories. For more detail, see Theorem 3.1 in Probability and Measure by Patrick Billingsley (1995) or remarks of Eilon Solan on page 8 of A Course in Stochastic Game Theory (2022).