Proof Section for An Introduction to Reinforcement Learning for Understanding Infra-Bayesianism

Introduction

This post accompanies An Introduction to Reinforcement Learning for Understanding Infra-Bayesianism. The goal of this introduction is to provide a high-level overview of the proofs contained in this post.

The proof of Proposition 1 is achieved through three lemmas. I believe the most insightful part of the proof is the use of the “epsilon over three” trick that lets us break the proof down into these lemmas. The first lemma uses the concept of the product topology on the space of policies, and I found it very useful for building intuition about what convergence in this topology actually means. The text Topology by James Munkres is a standard reference for topology.
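As a rough sketch of the “epsilon over three” trick (in shorthand of my own that may not match the post's notation: $L(\pi)$ for the expected loss of a policy $\pi$ in the fixed environment, and $L_T(\pi)$ for the loss truncated to a finite time-horizon $T$), the triangle inequality splits the quantity of interest into three pieces, each handled by one lemma:

```latex
% Sketch of the "epsilon over three" decomposition (my notation, not necessarily the post's).
% Roughly: Lemma 3 controls the first term, Lemma 1 the second, Lemma 2 the third.
\[
\big| L(\pi_n) - L(\pi) \big|
  \;\le\; \underbrace{\big| L(\pi_n) - L_T(\pi_n) \big|}_{<\,\varepsilon/3 \text{ (Lemma 3)}}
  \;+\; \underbrace{\big| L_T(\pi_n) - L_T(\pi) \big|}_{<\,\varepsilon/3 \text{ (Lemma 1)}}
  \;+\; \underbrace{\big| L_T(\pi) - L(\pi) \big|}_{<\,\varepsilon/3 \text{ (Lemma 2)}}
  \;<\; \varepsilon .
\]
```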

The second lemma uses the concept of a Lebesgue integral and is simply an application of the dominated convergence theorem from measure theory. One standard reference for these ideas is Real Analysis: Modern Techniques and Their Applications by Gerald Folland.

The proof of Proposition 2 is less technical than the other proofs and should be readable to those with a basic real analysis and probability background. For basic real analysis, Principles of Mathematical Analysis by Walter Rudin is a standard reference.

Proposition 1

For brevity throughout this section, we use to denote

Proposition 1 Lemma 1: Suppose that in the product topology on Then for any finite time-horizon

Proof: Suppose that in the product topology on . Let be given and fix . To satisfy the definition of the limit, we must show that there exists such that for all

Let denote the number of histories of length Since and are finite sets and Let the set of histories of length then be indexed by

Then we have:

At this point, we want to bound the magnitude of terms of the form for The key idea is to show that these terms can be factored into the product of two terms: one term that is finite and one term that becomes arbitrarily small for sufficiently large .

Let denote the sequence of the first elements of By the product rule,

Since and are induced by the same environment, we can observe a common factor of and . Namely, we have:
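As a generic illustration of the kind of factorization being invoked (in illustrative notation of my own: $\pi$ a policy, $\mu$ the environment's observation kernel, and $h_{<t}$ the history before time $t$), the probability a policy–environment pair assigns to a finite history factors by the chain rule as:

```latex
% Chain-rule factorization of a history's probability (illustrative notation only):
\[
\Pr^{\pi,\mu}\!\big(a_1 o_1 \cdots a_T o_T\big)
  \;=\; \prod_{t=1}^{T} \pi\!\big(a_t \mid h_{<t}\big)\,\mu\!\big(o_t \mid h_{<t}, a_t\big),
\]
% so two policies interacting with the same environment \mu share the common factor
% \prod_{t=1}^{T} \mu(o_t \mid h_{<t}, a_t).
```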

Let which is finite. Consider now the second term of the product, which we will show becomes arbitrarily small for sufficiently large since in the product topology.

First, we establish some notation: Let the cardinality of the set of histories of length at most which is a finite set since and and are finite sets. We can enumerate elements of this set of histories by with the index

Given , let denote the ball of radius centered at under the total variation distance, which we denote by . Importantly, is an open set of probability distributions over actions. Let , which is open in the product topology since it is the product of open sets in not equal to in finitely many components.
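For reference, the total variation distance between two distributions $p$ and $q$ on a finite action set $A$ (a standard definition, stated in notation of my own choosing) is:

```latex
% Total variation distance on a finite set (standard definition and equivalent form):
\[
d_{TV}(p, q) \;=\; \max_{E \subseteq A} \big| p(E) - q(E) \big|
             \;=\; \tfrac{1}{2} \sum_{a \in A} \big| p(a) - q(a) \big| .
\]
```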

Since in the product topology, by definition: for all there exists such that for all By construction, if for By definition, then for
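The standard fact being used is that convergence in the product topology is exactly coordinatewise convergence; writing $\pi(h)$ for the action distribution a policy assigns after a finite history $h$ (my shorthand), this reads:

```latex
% Product-topology convergence is pointwise convergence at every finite history:
\[
\pi_n \to \pi \ \text{in the product topology}
  \quad\Longleftrightarrow\quad
  d_{TV}\big(\pi_n(h), \pi(h)\big) \to 0 \ \text{for every finite history } h .
\]
% Consequently, for any finite collection of histories h_1, ..., h_M and any \delta > 0,
% there is an N with d_{TV}(\pi_n(h_i), \pi(h_i)) < \delta for all i \le M and all n \ge N.
```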

Consider the term we’re trying to bound,

We want to rewrite the terms using the approximation provided above. We have that

Then by the triangle inequality,

Thus so we now have:

Notice that when expanded, the only term of that isn’t the product of and some finite term is , which cancels out with an identical term in the original expression. Since can be chosen arbitrarily small by choosing the index of to be large, this means: given any there exists such that for all

Then if for all Then

Choose, Then for

Proposition 1 Lemma 2: Define for as Let be a probability measure on destinies. Then as

Proof: This lemma follows directly from the dominated convergence theorem, so our proof will be a check that all the conditions of that theorem are satisfied. First note that converges to pointwise. Also since is a probability measure (and thus ) and for all Also, for all and Then, by the dominated convergence theorem,
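For completeness, here is the standard statement of the dominated convergence theorem being applied (see Folland for the precise measure-theoretic hypotheses):

```latex
% Dominated convergence theorem (standard statement):
% if f_n \to f pointwise, |f_n| \le g for all n, and \int g \, d\mu < \infty, then
\[
\lim_{n \to \infty} \int f_n \, d\mu \;=\; \int \lim_{n \to \infty} f_n \, d\mu \;=\; \int f \, d\mu .
\]
```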

Proposition 1 Lemma 3: Suppose that is a convergent sequence in Then for any there exists and a finite time-horizon such that for all and

Proof: Let be given. Note that is a metric space since it is the countable product of metric spaces (namely, ). This implies that if is a convergent sequence, is Cauchy.
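One standard way to metrize such a countable product (with $h_1, h_2, \dots$ an enumeration of the finite histories, an indexing of my own): since each factor carries the total variation metric, which is bounded by 1, the product topology is induced by

```latex
% A standard metric inducing the product topology on a countable product of
% metric spaces whose metrics are bounded by 1:
\[
D(\pi, \pi') \;=\; \sum_{i=1}^{\infty} 2^{-i} \, d_{TV}\big(\pi(h_i), \pi'(h_i)\big) .
\]
```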

For any we have by the triangle inequality that

Since is Cauchy, the same technique used in Proposition 1 Lemma 1 can be used to show that there exists such that for all there exists such that for all

Let be fixed. By Proposition 1 Lemma 2, there exists such that for all

Note that

By Proposition 1 Lemma 2 there exists such that for all , By the same lemma, there exists such that for all , As previously stated, if then Therefore, if then

Let Then if

Proposition 1: If and are finite, the optimal policy for a given environment exists; namely, Furthermore, the optimal policy can be chosen to be deterministic.

Proof: The usual technique to show that a function attains a minimum is to show that it is continuous and that the domain is compact. Then the existence of a minimum follows by the generalization of the Extreme Value Theorem to topological spaces. Recall that Lemma 2 (from the main text) states that is compact.
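The generalization being invoked is the standard extreme value theorem for compact topological spaces:

```latex
% Extreme value theorem on a compact topological space (standard statement):
% if X is compact and f : X -> R is continuous, then f attains its infimum and
% supremum on X, i.e. there exist x_*, x^* in X with
\[
f(x_*) \;=\; \inf_{x \in X} f(x)
\qquad\text{and}\qquad
f(x^*) \;=\; \sup_{x \in X} f(x) .
\]
```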

So, it suffices to show that the map from to is continuous. Since the space of policies with the product topology is metrizable (as noted in the proof of Proposition 1 Lemma 3), continuity can be verified using sequences. Assume that is a sequence of policies converging to with respect to the product topology on Then we want to show that . Define for as which we can conceptualize as the truncation of to a finite time-horizon.

By Proposition 1 Lemma 2, there exists such that for all By Proposition 1 Lemma 3, there exists and such that for all and

Let By Proposition 1 Lemma 1, there exists such that for all

Let be given. If , then by the triangle inequality,

Therefore,

Now we prove that the optimal policy can be chosen to be deterministic. The environment remains fixed. Let denote the set of deterministic policies, which is compact. By the same argument given more generally for is well-defined.

Any stochastic policy is equivalent to a probabilistic mixture of deterministic policies. Therefore, any stochastic policy induces a probability measure on the countable set As a result, can be written as a countable mixture of measures given by .

This implies . (See problem 9.7 in reference [1] for an outline of how this can be justified[1], as also cited in reference [2].) By the definition of , . Thus, the expected loss of a stochastic policy is always at least the minimum loss over the deterministic policies.
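To spell out the final inequality (in notation of my own: $\alpha$ the mixing distribution over deterministic policies $\pi_d$, and $L(\cdot)$ the expected loss in the fixed environment), once the loss of a mixture is known to equal the mixture of the losses, we get

```latex
% A mixture of deterministic policies incurs at least the loss of its best component:
\[
L(\pi) \;=\; \sum_{d} \alpha(d)\, L(\pi_d)
       \;\ge\; \sum_{d} \alpha(d)\, \min_{d'} L(\pi_{d'})
       \;=\; \min_{d'} L(\pi_{d'}),
\]
% using that \alpha is a probability distribution, so \sum_d \alpha(d) = 1.
```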

Proposition 2

Proposition 2: If a countable class of environments is learnable, then the Bayes-optimal family of policies for any non-dogmatic prior on learns the class.

Proof: Suppose a class of hypotheses is learnable, and let be a non-dogmatic prior on We will proceed by contrapositive and show that if the family of policies does not learn the class, then it is not the Bayes-optimal family for

Assume that there exists such that By negating the definition of a limit, this means that there exists such that for all there exists such that and By definition, if and only if
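For reference, the negation of the limit statement has the standard form (with $a_n$ a generic placeholder for the nonnegative quantity whose convergence to zero is being denied):

```latex
% Negation of "a_n -> 0" for a nonnegative sequence (standard logical negation):
\[
\neg\Big( \lim_{n \to \infty} a_n = 0 \Big)
  \quad\Longleftrightarrow\quad
  \exists\, \varepsilon > 0 \ \ \forall N \ \ \exists\, n \ge N : \ a_n \ge \varepsilon .
\]
```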

We will show that this contradicts the assumption that is a Bayes-optimal family for

With steps explained in the following paragraphs, we have:

In the above, the first and second equalities follow from definition. To get the third equality, we switch the sum and the limit, which is justified by the dominated convergence theorem. In the language of countably infinite sums, this theorem says: Let be a sequence of functions indexed by such that the pointwise limit of , exists for all . Suppose there exists such that and for all and . Then

So, this theorem essentially states that if we can find a sequence of terms that “dominates” all of the pointwise in and the sum of the doesn’t blow up, then the limit and sum can be interchanged.
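Written out concretely (in placeholder notation of my own), the series form of the dominated convergence theorem used here is:

```latex
% Dominated convergence for countable sums (standard statement):
% if f_n(k) -> f(k) for every k, |f_n(k)| <= g(k) for all n and k, and \sum_k g(k) < \infty, then
\[
\lim_{n \to \infty} \sum_{k} f_n(k) \;=\; \sum_{k} \lim_{n \to \infty} f_n(k) \;=\; \sum_{k} f(k) .
\]
```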

To apply the dominated convergence theorem, let

Note that

So is an appropriate dominating function. By the definition of a prior, Therefore, the assumptions of the dominated convergence theorem are satisfied.

The second to last inequality follows from the fact that for all

On the other hand, by assumption, is learnable, so there exists some family of policies such that for all Consider a parallel argument to what we have above, replacing with :

Then there exists such that for all On the other hand, if , then there exists such that Hence,

This implies

By definition and by linearity of expectation,

Then . Then by definition, is not the Bayes-optimal family for

References:

[1] Schilling, René L. 2005. Measures, Integrals and Martingales. Cambridge: Cambridge University Press.

[2] humanStampedist (https://math.stackexchange.com/users/474469/humanstampedist), How to integrate for a countable sum of measures?, URL (version: 2018-08-27): https://math.stackexchange.com/q/2896276

  1. ^

    It is a standard approximation argument using an increasing sequence of step-functions. The difference between what is needed here and what is stated as the exercise in the book is that we have the coefficients determined by This is not a problem when applying the argument since all of the coefficients are nonnegative.
