An Introduction to Reinforcement Learning for Understanding Infra-Bayesianism
Introduction
The goal of this post is to give a summary of classical reinforcement learning theory that primes the reader to learn about infra-Bayesianism, a new framework for reinforcement learning that aims to solve problems related to AI alignment. We will concentrate on basic aspects of the classical theory that have analogous concepts in infra-Bayesianism, and explain these concepts using infra-Bayesian conventions. The more technical proofs are contained in the proof section.
For the first part of this sequence and for links to other writings, see What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism.
One special case of reinforcement learning is the case of stochastic bandits. For example, a bird may have a choice of three levers. When the bird steps on a lever, either some tasty seeds come tumbling out of a feeder or nothing happens. A mechanical system that the bird cannot observe delivers these seeds according to probabilities that are fixed for each lever.
An all-knowing bird can devise a strategy to maximize the expected number of seeds they will obtain. Some natural questions include: If a bird doesn’t have any knowledge of the mechanical system, what is a good algorithm to follow to learn how to gather lots of seeds? How much worse off is this bird compared to the all-knowing bird? These are the types of questions that can be mathematically formalized and studied using stochastic bandits.
Actions, Observations, and Environments
The formal set-up for reinforcement learning starts with time-steps, actions, observations, and environments:
At discrete time-steps indexed by natural numbers, an agent performs an action a, chosen from a set A of actions.
The agent then receives an observation from a set O of observations.
Assumption: Finiteness of actions and observations
For simplicity, we assume that the set of actions and the set of observations are both finite and non-empty.
The time-horizon of a problem is the number of time-steps after which the alternation of action and observations terminates, which may be finite or infinite. In the infra-Bayesianism sequence, time-horizons are assumed to be infinite, which is arguably the more realistic assumption for modeling open-world agents. We will similarly assume infinite time-horizons, in contrast to episodic reinforcement learning, in which the time-horizon is finite and the action-observation process sequentially repeats in episodes, such as in repeated games of Go.
In the bird example from the introduction, we have
A = {Lever 1, Lever 2, Lever 3}, and O = {Seeds, No Seeds}.
Time-steps correspond to sequential opportunities to step on levers.
An element of the cartesian product A×O is a pair of an action a and the following observation o, which is written as (a,o).
Definition: Histories and destinies
The set (A×O)∗ denotes the set of finite sequences of action-observation pairs. Elements of (A×O)∗ are called histories.
The set (A×O)ω denotes the set of infinite sequences of action-observation pairs. Elements of (A×O)ω are called destinies.
Cantor’s diagonal argument proves that the set of infinite binary sequences is an uncountable set. If either A or O has at least two elements, then there exists a bijection from a subset of (A×O)ω to the set of infinite binary sequences, and thus (A×O)ω is also uncountable. For example, if A contains a0 and a1 and O contains o, then the subset of (A×O)ω of sequences with actions equal to a0 or a1 and observations all equal to o is in bijection with the set of infinite binary sequences.
The levers in the bird example are what we intuitively might want to call the “environment,” which responds to the agent's actions by providing observations. An environment is formally defined as follows:
Definition: Probabilistic environment
A probabilistic environment μ : (A×O)∗ × A → ΔO is a function that takes in a history and an action and returns a probability distribution over observations.
Here is a concrete example of an environment. Let h∈(A×O)∗ denote a history. Define μ:(A×O)∗×A→ΔO by
μ(h, Lever 1) := P(Seeds) = 0.9, μ(h, Lever 2) := P(Seeds) = 0.5, and μ(h, Lever 3) := P(Seeds) = 0.3.
We’ll use the notation μ(o|h,a) to denote the probability of o∈O according to the distribution that the environment outputs when given (h,a)∈(A×O)∗×A.
Notice that in this example, μ does not depend on h. However, in full generality, an environment may depend on h. For example, a lever could give seeds with probability 1 only if it was not the most recently visited lever, and otherwise it could give seeds with probability 0.5.
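To make the formal definition concrete, here is a minimal Python sketch (my own illustration, not from the original post; names such as bandit_env and SEED_PROBS are hypothetical) of the three-lever environment and of a history-dependent variant like the one just described. A distribution over observations is represented as a dictionary of probabilities.
```python
import random

ACTIONS = ["Lever 1", "Lever 2", "Lever 3"]        # the set A
OBSERVATIONS = ["Seeds", "No Seeds"]               # the set O

# Seed probabilities for the three levers, matching the example values above.
SEED_PROBS = {"Lever 1": 0.9, "Lever 2": 0.5, "Lever 3": 0.3}

def bandit_env(history, action):
    """Probabilistic environment: (history, action) -> distribution over observations.
    A distribution is represented as a dict mapping each observation to its probability.
    This environment ignores the history entirely."""
    p = SEED_PROBS[action]
    return {"Seeds": p, "No Seeds": 1.0 - p}

def history_dependent_env(history, action):
    """A history-dependent variant: a lever gives seeds with probability 1 unless it was
    the most recently stepped-on lever, in which case the probability drops to 0.5."""
    last_action = history[-1][0] if history else None
    p = 0.5 if action == last_action else 1.0
    return {"Seeds": p, "No Seeds": 1.0 - p}

def sample_from(distribution):
    """Sample one outcome from a finite distribution given as a dict of probabilities."""
    outcomes = list(distribution.keys())
    weights = list(distribution.values())
    return random.choices(outcomes, weights=weights, k=1)[0]

print(bandit_env([], "Lever 1"))               # {'Seeds': 0.9, 'No Seeds': 0.1}
print(sample_from(bandit_env([], "Lever 1")))  # 'Seeds' with probability 0.9
```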
Formalizing Preferences: Loss Functions and Momentary Loss
One motivation for reinforcement learning as a framework for artificial intelligence is that many—or perhaps all—”intelligence-requiring” tasks can be formulated by means of an agent who prefers some observations over others and seeks to take actions that bring about their preferred observations. This preference is captured numerically with loss or reward functions.
For example, we can define a function L : O → R that captures the bird’s preference for seeds over no seeds, and thus the lost value when no seeds are received, by:
L(Seeds) = 0, and L(No Seeds) = 1.
This is an example of a loss function.
Definition: Loss function
A deterministic loss function is a function L : (A×O)∗ → [0,1]. For a given history h ending in observation o ∈ O, L(h) is the momentary loss of o.
Rather than focusing on lost value, we could measure gained value with a reward function R : O → R such that R(Seeds) = 1 and R(No Seeds) = 0. In this post, we will use loss functions, but everything can be reformulated in terms of reward functions.
Total loss: What is an agent optimizing for?
Imagine that in the bird example, a bird steps on Lever 3, and receives seeds. This gives some evidence that Lever 3 is a good choice, but a bird would be much happier in the long run knowing that, in expectation, Lever 1 delivers more seeds. Any such bird has a tradeoff between exploring to gain a better idea of how good the various levers are versus exploiting their current knowledge, i.e. stepping on the lever that looks best with the current information. This is called the exploration versus exploitation tradeoff.
Suppose a bird is famished, and therefore cares strongly about obtaining seeds in the short-term. This bird might step on the lever that looks best under the current information. As a result, a famished bird might not discover a better lever, getting stuck stepping on a worse lever forever, exploiting too much and exploring too little. So, “famished bird” doesn’t seem like the best model of a rational learning agent.
Suppose the time-horizon is theoretically infinite, but the bird is mortal and has no knowledge about when they will die. Such a bird should value seeds in the short-term and the long-term unequally. Furthermore, a bird will get hungry without enough seeds in the short-term. This motivates the notion of discounting, which is weighting the momentary loss of observations unequally over time when it comes to calculating the sum of the momentary losses. This sum is defined as total loss.
Definition: Total loss of a history (with geometric time discount)
Let γ ∈ [0,1) and let N denote the time-horizon, which may be infinite. The total loss of a given history (a_t,o_t)_{t=0}^{N} with time discount γ is defined by L_γ((a_t,o_t)_{t=0}^{N}) = (1−γ) ∑_{t=0}^{N} γ^t L(o_t).
Note that the coefficient (1−γ) on the total loss ensures that it is upper-bounded by 1. This fact is used in various proofs. The time discount is especially important when N is infinite, since without it, the total loss of various histories can diverge, making it impossible to meaningfully compare them. So, the time-discount is (1) a choice that makes strategies more or less farsighted, and (2) a mathematical trick that ensures that the total loss is finite.
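As a small illustration of the definition (my own sketch, with hypothetical helper names), the following Python snippet computes the normalized discounted total loss of a finite history and shows that the (1−γ) factor keeps it below 1 even when the momentary loss is 1 at every step.
```python
def momentary_loss(observation):
    """Momentary loss for the bird example: 0 for seeds, 1 otherwise."""
    return 0.0 if observation == "Seeds" else 1.0

def total_loss(history, gamma):
    """Normalized, geometrically discounted total loss of a finite history.
    history is a list of (action, observation) pairs; gamma is in [0, 1).
    With momentary losses in [0, 1], the (1 - gamma) factor keeps the result in [0, 1]."""
    return (1.0 - gamma) * sum(
        gamma ** t * momentary_loss(obs) for t, (act, obs) in enumerate(history)
    )

# A history with momentary loss 1 at every step has total loss 1 - gamma**N, approaching 1.
worst_case = [("Lever 3", "No Seeds")] * 50
print(total_loss(worst_case, gamma=0.9))   # roughly 1 - 0.9**50, about 0.995
```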
Formalizing “Strategies”: Policies
We alluded earlier to the idea that a bird could develop a strategy for gathering lots of seeds; such a strategy is formally called a policy. For example, one valid, albeit bad, policy is to choose Lever 3 with probability one without regard to past actions and observations.
As another example, before the bird has stepped on any levers, they could flip a coin to decide whether to step on Lever 1 or 2. At later time-steps, the bird could calculate the past percentage of times that each lever yielded seeds, and step on the lever with the best seed percentage 2⁄3 of the time, taking a random action otherwise. In the case of ties, the bird could also make a random choice.
By using the seed percentages, which are determined by past actions and observations, this strategy is a function of (A×O)∗. The range is ΔA because the policy outputs a probability distribution over actions. Formally defined, a policy is exactly a function of this type:
Definition: Policy
A policy is a function π:(A×O)∗→ΔA. We denote the set of policies by Π. We say that a policy is deterministic if it can be written as π:(A×O)∗→A. Otherwise, the policy is said to be stochastic.
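Here is a sketch of the example policy described above, written as a function from histories to distributions over actions. The text does not pin down how "taking a random action otherwise" works or how untried levers are scored, so the choices below (a uniform random action with the remaining 1/3 of probability, untried levers counted as seed percentage zero) are assumptions.
```python
ACTIONS = ["Lever 1", "Lever 2", "Lever 3"]

def example_policy(history):
    """Policy: history -> distribution over actions (a dict of probabilities).
    First step: a fair coin flip between Lever 1 and Lever 2. Afterwards: a lever with the
    best observed seed percentage gets probability 2/3 (split in case of ties), and the
    remaining 1/3 is spread uniformly over all levers."""
    if not history:
        return {"Lever 1": 0.5, "Lever 2": 0.5, "Lever 3": 0.0}
    counts = {a: 0 for a in ACTIONS}
    seeds = {a: 0 for a in ACTIONS}
    for action, observation in history:
        counts[action] += 1
        if observation == "Seeds":
            seeds[action] += 1
    rates = {a: seeds[a] / counts[a] if counts[a] > 0 else 0.0 for a in ACTIONS}
    best_rate = max(rates.values())
    best = [a for a in ACTIONS if rates[a] == best_rate]
    distribution = {a: (1.0 / 3.0) / len(ACTIONS) for a in ACTIONS}
    for a in best:
        distribution[a] += (2.0 / 3.0) / len(best)
    return distribution
```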
Real-valued sequences are formally defined as functions from N to R, with the notation (a_0,a_1,…) meaning that a_i ∈ R is the image of i ∈ N. For example, a sequence of zeros is the constant function that sends every natural number to zero. The set of real-valued sequences can then be written as ∏_N R := {α : N → R}, which is known as the product of R over N. In the same way, the set of stochastic policies is the product space ∏_{(A×O)∗} ΔA.
Lemma 2 below states that the set of policies can be given a topology in which it is compact. For our purposes, topology provides generalized notions of continuity and convergence in spaces that are not the real numbers. Compactness is a significant property for a topological space to have in the context of optimization, because if a space is compact, then any continuous function from that space to the real numbers attains both a maximum and a minimum in the space.
The proof of Lemma 2 depends on Lemma 1, which states that ΔA is a compact metric space. In the infra-Bayesianism sequence, the metric used on ΔA is the Kantorovich-Rubinstein metric, but in the case when A is finite, the Kantorovich-Rubinstein metric is equivalent to a simpler metric known as the total-variation metric.
Definition: Total-variation metric
Let P1 and P2 denote probability distributions over a finite set X. The total-variation metric, which we denote by d_TV, is defined by d_TV(P1,P2) := (1/2) ∑_{x∈X} |P1(x) − P2(x)|.
It is straightforward to check that d_TV satisfies the definition of a metric. As a simple case, if X = {0,1}, then given P1 and P2 in ΔX, d_TV(P1,P2) = (1/2)(|P1(0)−P2(0)| + |P1(1)−P2(1)|), but |P1(1)−P2(1)| = |1−P1(0)−(1−P2(0))| = |P1(0)−P2(0)|. Observe then that in this simple example, d_TV(P1,P2) = |P1(0)−P2(0)|.
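A short sketch of the total-variation distance for distributions on a finite set, represented as dictionaries (my own illustration; it reproduces the binary example above):
```python
def total_variation(p1, p2):
    """Total-variation distance between two distributions on a finite set,
    each given as a dict mapping outcomes to probabilities."""
    support = set(p1) | set(p2)
    return 0.5 * sum(abs(p1.get(x, 0.0) - p2.get(x, 0.0)) for x in support)

# The binary case from above: d_TV reduces to |P1(0) - P2(0)|.
p1 = {0: 0.7, 1: 0.3}
p2 = {0: 0.4, 1: 0.6}
print(total_variation(p1, p2))   # 0.5 * (0.3 + 0.3) = 0.3 = |0.7 - 0.4|
```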
Lemma 1: Compactness of ΔA under the total-variation metric
If A is finite, then the set ΔA is compact under the total-variation metric.
Proof: By assumption, A is finite, so we can enumerate A as A = {a_1, a_2, …, a_n}. Elements of ΔA, which are probability distributions over A, can be written as vectors in R^n of the form (x_1, x_2, …, x_n) such that x_i denotes the probability of a_i. By definition of a probability distribution, ∑_{i=1}^{n} x_i = 1 and x_i ≥ 0 for all i. Therefore, there is a bijection between ΔA and the standard (n−1)-dimensional simplex, which is defined as the set Δ_{n−1} := {(x_1, x_2, …, x_n) ∈ R^n : ∑_{i=1}^{n} x_i = 1, x_i ≥ 0 for all i}. (See Figure 1 for a depiction of the standard 1-dimensional and 2-dimensional simplices.)
The total-variation metric is half the metric induced by the ℓ1 norm on R^n, which is equivalent to the Euclidean norm, so both induce the same topology on ΔA. Under the Euclidean metric, the standard (n−1)-simplex is compact by the Heine-Borel theorem since it is closed and bounded. Thus, ΔA is compact. □
Figure 1: (Left) The standard 1-simplex. (Right) The standard 2-simplex.
Visualizing ΔA as Δ_{n−1} as in Figure 1, computing the total-variation distance between two elements of ΔA can be conceptualized as locating the corresponding points in R^n, projecting the points to each of the coordinate axes, then connecting the projections with a line segment and computing 1/2 of the sum of the line segment lengths. As the earlier calculation (and Euclidean geometry) suggest, in the special case of Δ_1, the two segments drawn for any two elements of Δ_1 will have the same length.
Lemma 2: Compactness of the set of policies
The set of policies Π is compact with respect to the product topology.
Proof: By Lemma 1, ΔA is compact. Then Π = ∏_{(A×O)∗} ΔA is compact with respect to the product topology by Tychonoff’s Theorem, which states that any product of compact topological spaces is compact with respect to the product topology. □
Lemma 2 remains true if we restrict attention to the set of deterministic policies.
Expected Total Loss of a Policy
Suppose we have a policy, an environment, and a history of the form: “Lever 1, No seeds.” How likely is this history to occur compared to other histories of length one? This question is fundamental to comparing various policies. To calculate the answer, we would want to know:
The probability of stepping on Lever 1 at the start according to the policy
The probability of no seeds conditional on stepping on Lever 1 according to the environment
For example, consider the previously described policy:
Before the bird has stepped on any levers, they could flip a coin to decide whether to step on Lever 1 or 2. At later time-steps, the bird could calculate the past percentage of times that each lever yielded seeds, and step on the lever with the best seed percentage 2⁄3 of the time, taking a random action otherwise. In the case of ties, the bird could also make a random choice.
We will examine how this policy interacts with the environment given by:
μ(h,Lever 1):=P(Seeds)=0.9,μ(h,Lever 2):=P(Seeds)=0.5, and μ(h,Lever 3):=P(Seeds)=0.3.
Let ϵ denote the empty history, so that π(Lever 1|ϵ) denotes the probability that Lever 1 is the first action. Using the definition of conditional probability, the probability of “Lever 1, no seeds” among length-one histories is:
P(Lever 1 and no seeds) = P(Lever 1) ⋅ P(no seeds | Lever 1) = π(Lever 1 | ϵ) ⋅ μ(no seeds | ϵ, Lever 1) = 0.5 ⋅ 0.1 = 0.05.
We could repeat this process to calculate the probability of any history of length one: the policy and environment determine everything we need to do that. The resulting distribution over histories of length one is represented as the tree in Figure 2.
Figure 2: Each path in the tree corresponds to a history of length one, with probabilities along the edges determined by the policy or the environment. To the right is the probability p of the path ending at that height, which is obtained by multiplying together the probabilities along the path.
We can further calculate the probability of a specific history of length l+1 > 1 among all length-(l+1) histories. By the chain rule of probabilities, P((a_0,o_0),...,(a_l,o_l)) is given by
P(a_0) ⋅ P(o_0 | a_0) ⋅ ∏_{t=1}^{l} P(a_t | a_0,o_0,...,a_{t−1},o_{t−1}) ⋅ P(o_t | a_0,o_0,...,a_t).
Note that for all t, P(a_t | a_0,o_0,...,a_{t−1},o_{t−1}) is determined by a chosen policy, whereas P(o_t | a_0,o_0,...,a_t) is determined by a given environment. So pairing together a policy and an environment produces a distribution over histories of a fixed length.[1] We denote all such distributions by μ^π, since the length will hopefully always be clear from context. This is formally stated in the next definition. As a convention, a sequence written with an ellipsis such that the final index is negative denotes an empty sequence.
Definition: Probability distributions on histories induced by a policy and environment
Let (a_t,o_t)_{t=0}^{l} ∈ (A×O)∗. The probability distribution μ^π ∈ Δ(A×O)∗ is defined by μ^π((a_t,o_t)_{t=0}^{l}) = ∏_{t=0}^{l} π(a_t | a_0,o_0,…,a_{t−1},o_{t−1}) ⋅ μ(o_t | a_0,o_0,…,a_t).
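The product formula translates directly into code. The sketch below (reusing the hypothetical example_policy and bandit_env functions from the earlier sketches) computes μ^π of a finite history and reproduces the length-one calculation above:
```python
def history_probability(history, policy, environment):
    """Probability of a finite history under mu^pi, via the chain-rule product above.
    policy(history) and environment(history, action) both return dicts of probabilities."""
    probability = 1.0
    for t in range(len(history)):
        past = history[:t]
        action, observation = history[t]
        probability *= policy(past)[action]                     # pi(a_t | past)
        probability *= environment(past, action)[observation]   # mu(o_t | past, a_t)
    return probability

# Reproduces the length-one calculation from the text: 0.5 * 0.1 = 0.05.
print(history_probability([("Lever 1", "No Seeds")], example_policy, bandit_env))
```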
Recall that we previously defined the total loss of a history h = (a_t,o_t)_{t=0}^{N} by L_γ(h) = (1−γ) ∑_{t=0}^{N} γ^t L(o_t). Since μ^π is a distribution over the domain of L_γ, we can now calculate the expected total loss of a policy for a given environment.
Definition: Expected total loss of policy
The expected total loss of a policy π in the environment μ is E_{h∼μ^π}[L_γ(h)].
The distribution μ^π gives another way to describe an environment: For a fixed environment, we can construct a function ~μ that takes in a policy and outputs a probability distribution over histories. This function is an example of a Markov kernel. We will repeat a similar construction to define crisp causal laws in the next post.
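Before moving on, here is a rough sketch of how the expected total loss can be estimated by sampling histories from μ^π (a Monte Carlo rollout), reusing the sample_from, total_loss, and bandit_env helpers from the earlier sketches. Truncating at a finite horizon is an approximation; the geometric discount bounds the resulting error by γ raised to the horizon.
```python
def rollout(policy, environment, horizon):
    """Sample a history of the given finite length by alternating policy and environment."""
    history = []
    for _ in range(horizon):
        action = sample_from(policy(history))
        observation = sample_from(environment(history, action))
        history.append((action, observation))
    return history

def estimate_expected_total_loss(policy, environment, gamma, horizon=200, num_samples=2000):
    """Monte Carlo estimate of E_{h ~ mu^pi}[L_gamma(h)].
    Truncating at a finite horizon changes the total loss by at most gamma**horizon."""
    total = 0.0
    for _ in range(num_samples):
        total += total_loss(rollout(policy, environment, horizon), gamma)
    return total / num_samples

# The policy that always steps on Lever 1 should have expected total loss near 0.1.
always_lever_1 = lambda history: {"Lever 1": 1.0, "Lever 2": 0.0, "Lever 3": 0.0}
print(estimate_expected_total_loss(always_lever_1, bandit_env, gamma=0.9))
```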
Optimal and Bayes-Optimal Policies
In this section, we will pin down what we mean by a “best” policy, which provides a benchmark to which we can compare other policies. One choice is to compare the performance of an agent learning about the environment to a hypothetical agent who “knows” the environment ahead of time and therefore acts so as to achieve minimum loss in expectation. The policy of this “all-knowing” agent is the optimal policy. Given uncertainty about the true environment, we could instead consider the policy that achieves minimum expected loss, where the expectation is taken both over a prior distribution over possible environments and over the randomness of the environments themselves. This is called the Bayes-optimal policy.
Definition: Optimal policy for an environment μ
Let an environment μ be fixed. We say that π∗ is an optimal policy for μ if π∗ ∈ argmin_{π∈Π} E_{h∼μ^π}[L_γ(h)].
So, the optimal policy for a given environment is the policy that minimizes the expected total loss in that environment. There is something to prove, which we do later in this section, in order to say that the optimal policy exists, namely that
argmin_{π∈Π} E_{h∼μ^π}[L_γ(h)] ≠ ∅.
Note that the optimal policy isn’t necessarily unique, as different observations could have the same momentary loss, or different actions could lead to a certain observation with equal probability.
If you understand the expression
π∗ ∈ argmin_{π∈Π} E_{h∼μ^π}[L_γ(h)],
you can skip the next example and go straight to the section where we prove that the optimal policy is well-defined. Otherwise, read the next section to understand the notation through an example.
Example of an optimal policy in the HeavenHell environment:
To understand the definition of the optimal policy, we will walk through an example of an optimal policy for the HeavenHell environment described by Marcus Hutter in Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability.
Let the set of actions be A={l,r} for left and right and O={V,H} for heaven and hell. The loss function is defined by L(V)=0 and L(H)=1. We will use an infinite time horizon. If the first action is l, then the agent is in heaven and receives loss of zero at each time step forever. If the first action is r, then the agent is in hell and receives loss of one at each time step forever.
One optimal policy π∗ is to go left with probability one at every time step, which is denoted by π∗(l|h)=1 and π∗(r|h)=0 for all h∈(A×O)∗.
Given any history that starts with l, at every time-step, the environment outputs a distribution over {V,H} that is one at V. So μ^{π∗} is a distribution that is one on histories of the form (l,V,l,V,...) and zero otherwise.
The loss of any history h starting with l is L_γ(h) = (1−γ) ∑_{t=0}^{∞} 0 ⋅ γ^t = 0. So the expected total loss is E_{h∼μ^{π∗}}[L_γ(h)] = 1 ⋅ 0 = 0. Since expected total loss is always nonnegative, π∗ is an optimal policy.
We can check that the loss of any history that starts with r is (1−γ) ∑_{t=0}^{∞} 1 ⋅ γ^t = (1−γ) ⋅ 1/(1−γ) = 1. So a policy π_p that goes right with probability p ∈ (0,1] at the first time-step has expected loss E_{h∼μ^{π_p}}[L_γ(h)] = 1 ⋅ p + (1−p) ⋅ 0 = p > 0.
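The HeavenHell calculations can also be checked numerically. The sketch below (my own encoding; heaven_hell and the other names are hypothetical) represents HeavenHell as an environment function and verifies that a policy going right with probability p has expected total loss approximately p:
```python
def heaven_hell(history, action):
    """HeavenHell: the very first action determines every observation.
    A first action of l yields heaven (V, loss 0) forever; r yields hell (H, loss 1) forever."""
    first_action = history[0][0] if history else action
    return {"V": 1.0, "H": 0.0} if first_action == "l" else {"V": 0.0, "H": 1.0}

def hh_total_loss(first_action, gamma, horizon=500):
    """Truncated total loss of the destiny obtained by repeating one action forever."""
    history, total = [], 0.0
    for t in range(horizon):
        distribution = heaven_hell(history, first_action)
        observation = "V" if distribution["V"] == 1.0 else "H"
        total += (1.0 - gamma) * gamma ** t * (0.0 if observation == "V" else 1.0)
        history.append((first_action, observation))
    return total

gamma, p = 0.9, 0.25   # p = probability of going right at the first time-step
loss_left, loss_right = hh_total_loss("l", gamma), hh_total_loss("r", gamma)
print(loss_left, loss_right)                 # about 0.0 and 1.0
print(p * loss_right + (1 - p) * loss_left)  # expected loss about 0.25 = p
```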
Existence of an optimal policy
Proposition 1: Existence of an optimal policy
If A and O are finite, an optimal policy for a given environment μ exists; namely, argmin_{π∈Π} E_{h∼μ^π}[L_γ(h)] ≠ ∅. Furthermore, the optimal policy can be chosen to be deterministic.
We start this section with a summary of the main idea of the existence proof. The full proof is in the proof section. A restatement of Proposition 1 is that the function π ↦ E_{h∼μ^π}[L_γ(h)] attains a minimum. The usual technique to show that a function attains a minimum or maximum over some domain is to prove that the domain is compact and the function is continuous: then the result follows from the fact that a continuous function over a compact domain attains a maximum and a minimum. We have compactness of Π by Lemma 2.
To show continuity, we take a convergent sequence π_n → π ∈ Π, aiming to show that E_{h∼μ^{π_n}}[L_γ(h)] → E_{h∼μ^π}[L_γ(h)] as n → ∞. In other words, we need to show that for large enough n, |E_{h∼μ^{π_n}}[L_γ(h)] − E_{h∼μ^π}[L_γ(h)]| becomes arbitrarily small. The key idea of the proof for the infinite time-horizon case is to define L_γ|_N for N ∈ N as
L_γ|_N((a_t,o_t)_{t=0}^{∞}) := (1−γ) ∑_{t=0}^{N} γ^t L(o_t),
which we can conceptualize as the restriction of L_γ to a finite time-horizon. We can show that, uniformly over all probability distributions, the expectation of L_γ|_N gets close to the expectation of L_γ as N becomes large. Furthermore, for any fixed N < ∞, |E_{h∼μ^{π_n}}[L_γ|_N(h)] − E_{h∼μ^π}[L_γ|_N(h)]| becomes arbitrarily small as n becomes large. By applying the triangle inequality, these two statements are enough to complete the proof for infinite time-horizons.
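Written out, the two estimates behind this argument take the following form (my own reconstruction of the hinted steps; the exact constants are not stated in the post):
```latex
% Tail bound: since momentary losses lie in [0,1], restricting to the first N+1 time-steps
% changes the total loss of any destiny h by at most gamma^{N+1}:
\[
\bigl| L_\gamma(h) - L_\gamma|_N(h) \bigr|
  = (1-\gamma)\sum_{t=N+1}^{\infty} \gamma^{t} L(o_t)
  \le (1-\gamma)\sum_{t=N+1}^{\infty} \gamma^{t}
  = \gamma^{N+1}.
\]
% Triangle inequality: applying the tail bound under both distributions,
\[
\Bigl| \mathbb{E}_{h\sim\mu^{\pi_n}}[L_\gamma(h)] - \mathbb{E}_{h\sim\mu^{\pi}}[L_\gamma(h)] \Bigr|
  \le 2\gamma^{N+1}
  + \Bigl| \mathbb{E}_{h\sim\mu^{\pi_n}}[L_\gamma|_N(h)] - \mathbb{E}_{h\sim\mu^{\pi}}[L_\gamma|_N(h)] \Bigr|,
\]
% so one can first choose N to make the tail terms small and then n to make the
% finite-horizon term small.
```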
It is natural to make an approximation argument using a finite time horizon because of the definition of the product topology. Recall that Π = ∏_{(A×O)∗} ΔA. A basis for the product topology on Π consists of products of open subsets of ΔA that are equal to ΔA in all but finitely many components. Convergence of π_n → π in the product topology then means that for every open set U containing π, we can find an M large enough such that for all n ≥ M, π_n ∈ U. For any basic open set U, this only puts a condition on π_n that is relevant to finitely many histories, because U differs from ΔA in only finitely many components.
Bayes Optimal Policy
Since the problem of reinforcement learning is about agents learning in an unknown environment, we now define a different notion of optimality, the Bayes optimal policy, which is relevant when we have a prior over environments rather than certainty about the true environment. To simplify the definition, we can write a prior over environments as a new environment called a Bayes mixture of environments.
Bayes Mixture of Environments
Consider for example an environment, HeavenHell Opp (H&H Opp), which does the opposite of HeavenHell (H&H): the first action r leads to heaven and l leads to hell. A possible prior ζ over these environments specifies probability 0.4 of being in HeavenHell and probability 0.6 of being in HeavenHell Opp. This prior represents a new environment, μ_ζ, that takes in a history and action and returns a distribution over observations in the following way.
Given a history, Bayes’ rule produces a posterior distribution over the environments. For example, given the history (l,V), the probability of H&H is
p(H&H | l,V) = p(l,V | H&H) ⋅ p(H&H) / p(l,V) = (1 ⋅ 0.4) / 0.4 = 1.
This calculation can be conceptualized using the tree in Figure 3. To calculate p(l,V|H&H), we use the path starting at H&H and ending at V. To calculate p(l,V), we start from the base node of the tree, and find all paths that go through l then V. We then sum together the probabilities of these paths, which gives a weighted sum with weights determined by the prior. The value of p(H&H) is determined by the prior ζ.
Observe that this calculation agrees with the fact that the history (l,V) is impossible in H&H Opp.
Figure 3: Each path in the tree corresponds to a history of length one, with probabilities along the edges determined by the prior, policy, or the environment.
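Here is a sketch of the posterior update and the resulting mixture environment μ_ζ for the 0.4/0.6 prior, reusing the heaven_hell function from the earlier HeavenHell sketch (heaven_hell_opp and the other names are my own). The action probabilities from the policy cancel in the posterior, so they are omitted:
```python
def heaven_hell_opp(history, action):
    """HeavenHell Opp: the opposite convention, so a first action of r leads to heaven."""
    first_action = history[0][0] if history else action
    return {"V": 1.0, "H": 0.0} if first_action == "r" else {"V": 0.0, "H": 1.0}

PRIOR = {"H&H": 0.4, "H&H Opp": 0.6}
ENVS = {"H&H": heaven_hell, "H&H Opp": heaven_hell_opp}

def likelihood(env, history):
    """Probability of the observations in a history under an environment, with actions given."""
    probability = 1.0
    for t, (action, observation) in enumerate(history):
        probability *= env(history[:t], action)[observation]
    return probability

def posterior(prior, history):
    """Posterior over environments given a history, by Bayes' rule."""
    weights = {name: prior[name] * likelihood(ENVS[name], history) for name in prior}
    normalizer = sum(weights.values())
    return {name: w / normalizer for name, w in weights.items()}

def mixture_environment(history, action):
    """The Bayes mixture mu_zeta: average the environments' predictions under the posterior."""
    post = posterior(PRIOR, history)
    return {obs: sum(post[name] * ENVS[name](history, action)[obs] for name in ENVS)
            for obs in ["V", "H"]}

print(posterior(PRIOR, [("l", "V")]))    # {'H&H': 1.0, 'H&H Opp': 0.0}, as in the text
print(mixture_environment([], "l"))      # {'V': 0.4, 'H': 0.6}
```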
Now we define the Bayes optimal policy with respect to a prior over environments.
Definition: Bayes optimal policy
Let ζ be a prior over environments, which can be written as an environment μ_ζ. A policy π∗_ζ is said to be Bayes optimal with respect to the prior ζ if it is the optimal policy for μ_ζ. Equivalently, π∗_ζ ∈ argmin_{π∈Π} E_{μ∼ζ}[E_{h∼μ^π}[L_γ(h)]].
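As a concrete illustration of the definition (my own calculation, not from the post): in both HeavenHell and HeavenHell Opp the first action determines the entire destiny, so a Bayes-optimal policy for the 0.4/0.6 mixture only needs to choose its first action well. Reusing PRIOR from the sketch above:
```python
# The total loss is 1 exactly when the first action leads to hell, and 0 otherwise.
# Expected total loss under the prior for the two deterministic first actions:
expected_loss_left = PRIOR["H&H"] * 0.0 + PRIOR["H&H Opp"] * 1.0   # l is hell only in H&H Opp
expected_loss_right = PRIOR["H&H"] * 1.0 + PRIOR["H&H Opp"] * 0.0  # r is hell only in H&H

print(expected_loss_left, expected_loss_right)   # 0.6 0.4
# Under this prior, a Bayes-optimal policy therefore goes right at the first time-step.
```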
The existence of the Bayes optimal policy given a prior over environments follows from the existence of the optimal policy for the environment formed out of that prior. Note that these optimality definitions depend on the choice of time-discount γ. A Bayes-optimal family of policies is a set of policies indexed by γ, each of which is Bayes-optimal for the corresponding γ.
Definition: Bayes-optimal family of policies
Let ζ be a prior over environments. A family of policies {π_γ}_{γ∈[0,1)} is a Bayes-optimal family with respect to ζ if for all γ ∈ [0,1), π_γ is Bayes-optimal with respect to ζ.
Regret and Bayesian Regret
One way to evaluate a policy given an environment is to measure the difference between the expected loss for that policy and the expected loss of the optimal policy. Informally speaking, this quantifies how much an agent “regrets” using a policy that is not the optimal one for some environment.
Definition: Regret of a policy π with respect to an environment μ
Let an environment μ and continuous loss function L_γ be given, and let π∗_μ be an optimal policy for μ. Then the regret of a policy π is Reg(π,μ,L_γ) := E_{h∼μ^π}[L_γ(h)] − E_{h∼μ^{π∗_μ}}[L_γ(h)].
Since the optimal policy minimizes expected loss, an equivalent definition of the regret of a policy π is Reg(π,μ,γ) := E_{h∼μ^π}[L_γ(h)] − min_{π∗∈Π} E_{h∼μ^{π∗}}[L_γ(h)]. Note that regret is always nonnegative.
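As a small worked example (my own, reusing SEED_PROBS from the bandit sketch): in the three-lever environment the history is irrelevant, so always stepping on Lever 1 is an optimal policy, and the regret of any other constant policy can be computed in closed form:
```python
# Expected total loss of the constant policy "always step on lever a" is 1 - P(Seeds | a),
# because the environment ignores history and the (1 - gamma)-normalized discount sums to 1.
expected_total_loss = {lever: 1.0 - p for lever, p in SEED_PROBS.items()}

optimal_loss = min(expected_total_loss.values())   # 0.1, achieved by always choosing Lever 1
regret_always_lever_3 = expected_total_loss["Lever 3"] - optimal_loss
print(regret_always_lever_3)   # 0.7 - 0.1 = 0.6
```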
Similarly to how we define a Bayes optimal policy for a prior, Bayesian regret is defined for a prior over environments:
Definition: Bayesian regret of a policy π given a prior ζ over environments (Bayes risk)
Let ζ denote a prior over environments and L_γ denote a continuous loss function. Then the Bayesian regret of a policy π is BReg(π,ζ,L_γ) := E_{μ∼ζ}[Reg(π,μ,L_γ)].
In the next section, we will use regret and Bayesian regret to define a property of priors and classes of environments called learnability.
Traps and Learnability
Recall that γ∈[0,1) is a time discount factor in the definition of the loss function. The closer γ is to one, the slower the weights decay when proceeding through time. A time discount near one is called a low time-discount, since future rewards are weighted similarly to rewards at the start.
Different choices of time discounts result in different loss functions. Therefore, it might be desirable to construct a set of policies that depend on the time discount. This is denoted by a family of policies {π_γ}_{γ∈[0,1)} indexed by γ ∈ [0,1).
Learning requires acquiring information and therefore takes time. One way to judge a time-indexed family of policies while taking this into account is to judge the performance of the policies as the time-discount goes to one. This is captured below in the definition of non-anytime learnability.
Low loss in the γ-to-one limit necessitates avoiding traps; in the case when total loss is normalized to be less than or equal to one, a trap is a state that forces the regret to be bounded away from zero. (In the case when total loss is not normalized, a trap is a state that forces the regret to grow linearly with time.)
An example of a trap is the hell state of HeavenHell and HeavenHell Opp, where the momentary regret is one at each time-step. There is no policy that can be used in both HeavenHell and HeavenHell Opp such that its regret goes to zero in the low-time-discount limit: we can design a policy with this property for one of these environments, but not for both at once. The non-existence of such a policy means that HeavenHell and HeavenHell Opp, considered together as one class of environments, are not “learnable” by the agent, which we now define:
Definition: Non-anytime learnable class of environments
A class of environments (hypotheses) {μ_i}_{i∈I} is non-anytime learnable if there exists a family of policies π_γ indexed by γ ∈ [0,1) such that for all i ∈ I, lim_{γ→1} Reg(π_γ, μ_i, γ) = 0. In this case, we say that π_γ learns the class of environments.
In this definition, “non-anytime” refers to the fact that we have a different policy for each time discount. The definition does not require that any particular element of the family is actually optimal in any of the environments. Instead, it requires that the family approaches optimal performance in every environment as the time discount goes to one.
We can write an analogous definition for a learnable prior.
Definition: Non-anytime learnable prior over environments
A prior ζ over environments is non-anytime learnable if there exists a family of policies π_γ indexed by γ ∈ [0,1) such that lim_{γ→1} BReg(π_γ, ζ, γ) = 0. In this case, we say that π_γ learns the prior.
We now consider the connection between learnability of a class of environments and Bayes-optimality with respect to a prior on that class. The condition on the prior is that it is non-dogmatic, defined as follows:
Definition: Non-dogmatic prior
A prior ζ on a countable class of environments {μ_i}_{i∈I} is non-dogmatic if for all i ∈ I, ζ(μ_i) ≠ 0.
The next proposition states that if a countable class of environments is learnable, then for any non-dogmatic prior over it, any family of policies that is Bayes-optimal with respect to that prior must also learn the class. As is demonstrated in the proof (in the proof section), countability is an important hypothesis. The non-dogmatic prior condition is also essential: if the prior assigns probability zero to some environment, the performance of the family on that environment does not affect the Bayesian regret, which means that the policy could perform arbitrarily badly in that environment and still be Bayes-optimal.
Proposition 2: If a countable class of environments {μ_i}_{i∈I} is non-anytime learnable, then the Bayes-optimal family of policies for any non-dogmatic prior ζ on {μ_i}_{i∈I} learns the class.
Markov Decision Processes
A Markov decision process (MDP) is a commonly-used model for sequential decision making. In this section, we discuss how every environment can be modeled as an MDP in which the set of states is the set of histories from the domain of the environment.
Definition: Markov decision process
A Markov decision process M is a tuple M=(S,A,T,L,γ) where
S is a set of states;
A is a set of actions;
T:S×A→ΔS is a transition kernel;
L:S×A→[0,1] is a momentary loss function; and
γ∈[0,1) is a time discount.
Intuitively, the transition kernel of the MDP that models an environment must ensure that the agent can only transition to a history that is equal to a single observation appended to the end of the current history. Furthermore, the transition kernel should agree with the environment about the various probabilities of those valid transitions. To ensure this, the probability of a valid transition can be taken to be equal to the probability of the corresponding observation that is appended. This is formalized in the proof of the following proposition.
Proposition 3: Every probabilistic environment can be modeled as an MDP with at most countably many states.
Proof: Recall that a probabilistic environment is defined to be a function μ:(A×O)∗×A→ΔO. Define S=(A×O)∗, which is countable since by assumption, A and O are finite. Define T:S×A→ΔS by
T(s′ | s, a) := μ(o | s, a) if s′ = (s, a, o) for some o ∈ O, and T(s′ | s, a) := 0 otherwise.
This is a valid transition kernel: for each state-action pair (s,a), the probabilities T(⋅|s,a) sum to one because μ(⋅|s,a) is a probability distribution over O. □
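A sketch of this construction in code, representing a state as a tuple of action-observation pairs and reusing the bandit_env function from the earlier sketch (the name history_mdp_transition is hypothetical):
```python
def history_mdp_transition(environment, state, action):
    """Transition kernel T(. | s, a) of the history-MDP built from an environment.
    A state is a history, represented as a tuple of (action, observation) pairs. The only
    successors with positive probability append (action, o) to the current history,
    with probability mu(o | s, a)."""
    observation_distribution = environment(list(state), action)
    return {state + ((action, observation),): probability
            for observation, probability in observation_distribution.items()}

# From the empty history, stepping on Lever 2 in the bandit environment leads to one of
# two successor histories, each with probability 0.5.
print(history_mdp_transition(bandit_env, (), "Lever 2"))
```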
In the following post, we consider the infra-Bayesian analogues to the machinery we have developed so far.
Acknowledgements
I am grateful to Vanessa Kosoy for her valuable comments and suggestions on initial drafts. I also thank Vanessa and Alex Appel for addressing numerous questions and Hannah Gabor, whose feedback improved the writing and clarity of ideas.
^The space of destinies is a measurable space with respect to the cylindrical σ−algebra. By Carathéodory’s extension theorem, there exists a unique probability measure on this space that agrees on cylinder sets with the probability measure on histories. For more detail, see Theorem 3.1 in Probability and Measure by Patrick Billingsley (1995) or remarks of Eilon Solan on page 8 of A Course in Stochastic Game Theory (2022).