Proof Section to an Introduction to Reinforcement Learning for Understanding Infra-Bayesianism
Introduction
This post accompanies An Introduction to Reinforcement Learning for Understanding Infra-Bayesianism. The goal of this introduction is to provide a high-level overview of the proofs contained in this post.
The proof of Proposition 1 is achieved through three lemmas. I believe the most insightful part of the proof is the use of the "epsilon over three" trick that lets us break the proof down into these lemmas. The first lemma uses the concept of the product topology on the space of policies, and I found this lemma very useful for better understanding what convergence in the product topology on the space of policies actually means. The text Topology by James Munkres is a standard reference for topology.
The second lemma uses the concept of a Lebesgue integral and is simply an application of the dominated convergence theorem from measure theory. One standard reference for these ideas is Real Analysis: Modern Techniques and Their Applications by Gerald Folland.
The proof of Proposition 2 is less technical than the other proofs and should be readable to those with a basic real analysis and probability background. For basic real analysis, Principles of Mathematical Analysis by Walter Rudin is a standard reference.
Proposition 1
For brevity throughout this section, we use $\mu^\pi(L_\gamma)$ to denote $\mathbb{E}_{h\sim\mu^\pi}[L_\gamma(h)]$.
Proposition 1 Lemma 1: Suppose that $\pi_n\to\pi$ in the product topology on $\Pi$. Then for any finite time-horizon $N\in\mathbb{N}$, $\mu^{\pi_n}(L_{\gamma|N})\to\mu^{\pi}(L_{\gamma|N})$.
Proof: Suppose that $\pi_n\to\pi$ in the product topology on $\Pi$. Let $\epsilon>0$ be given and fix $N\in\mathbb{N}$. To satisfy the definition of the limit, we must show that there exists $M\in\mathbb{N}$ such that for all $n\ge M$, $|\mu^{\pi_n}(L_{\gamma|N})-\mu^{\pi}(L_{\gamma|N})|<\epsilon$.
Let $N(A,O)$ denote the number of histories of length $N$. Since $A$ and $O$ are finite sets and $N<\infty$, $N(A,O)<\infty$. Let the set of histories of length $N$ then be indexed by $\{1,2,\dots,N(A,O)\}$.
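Concretely, if a history of length $N$ consists of the $N+1$ action–observation pairs $(a_0,o_0),\dots,(a_N,o_N)$ (the indexing used by the product rule below), the count is explicit. A minimal sketch under that assumption:

```latex
% Number of histories of length N, assuming a history is the
% (N+1)-tuple of action-observation pairs (a_0,o_0),...,(a_N,o_N):
N(A,O) \;=\; \bigl(|A|\cdot|O|\bigr)^{N+1}
```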
Then we have:
$$\begin{aligned}
\left|\mu^{\pi_n}(L_{\gamma|N})-\mu^{\pi}(L_{\gamma|N})\right| &= \left|\mathbb{E}_{h\sim\mu^{\pi_n}}[L_{\gamma|N}(h)]-\mathbb{E}_{h\sim\mu^{\pi}}[L_{\gamma|N}(h)]\right| \\
&= \left|\sum_{i=1}^{N(A,O)}\mu^{\pi_n}(h_i)L_{\gamma|N}(h_i)-\mu^{\pi}(h_i)L_{\gamma|N}(h_i)\right| \\
&= \left|\sum_{i=1}^{N(A,O)}L_{\gamma|N}(h_i)\left(\mu^{\pi_n}(h_i)-\mu^{\pi}(h_i)\right)\right| \\
&\le \sum_{i=1}^{N(A,O)}\left|\mu^{\pi_n}(h_i)-\mu^{\pi}(h_i)\right|,
\end{aligned}$$

where the last inequality uses $L_{\gamma|N}(h_i)\le 1$. At this point, we want to bound the magnitude of terms of the form $\mu^{\pi_n}(h_i)-\mu^{\pi}(h_i)$ for $1\le i\le N(A,O)$. The key idea is to show that these terms can be factored into the product of two terms: one term that is finite and one term that becomes arbitrarily small for sufficiently large $n$.
Let $(h_i)_{\le t}$ denote the sequence of the first $t$ elements of $h_i$. By the product rule,
$$\mu^{\pi}(h_i):=\prod_{t=0}^{N}\pi\left(a_t\mid (h_i)_{\le t-1}\right)\cdot\mu\left(o_t\mid (h_i)_{\le t}\right).$$

Since $\mu^{\pi_n}$ and $\mu^{\pi}$ are induced by the same environment, we can observe a common factor of $\mu^{\pi_n}(h_i)$ and $\mu^{\pi}(h_i)$. Namely, we have:
$$\mu^{\pi_n}(h_i)-\mu^{\pi}(h_i)=\prod_{t=0}^{N}\mu\left(o_t\mid (h_i)_{\le t}\right)\cdot\left[\prod_{t=0}^{N}\pi_n\left(a_t\mid (h_i)_{\le t-1}\right)-\prod_{t=0}^{N}\pi\left(a_t\mid (h_i)_{\le t-1}\right)\right].$$

Let $c=\max_{1\le i\le N(A,O)}\left\{\prod_{t=0}^{N}\mu\left(o_t\mid (h_i)_{\le t}\right)\right\}$, which is finite. Consider now the second factor of the product, which we will show becomes arbitrarily small for sufficiently large $n$ since $\pi_n\to\pi$ in the product topology.
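As a quick side illustration (not part of the proof), here is a minimal numerical sketch of this factorization on a single toy history; the arrays `env_prob`, `pi_prob`, and `pin_prob` are hypothetical probabilities, not taken from the main text.

```python
import numpy as np

# Toy history of length N=2: all probabilities below are made up.
# Along the fixed history h, pi_prob[t] plays the role of
# pi(a_t | (h)_{<=t-1}) and env_prob[t] of mu(o_t | (h)_{<=t});
# the environment factor is SHARED by both policies.
env_prob = np.array([0.5, 0.8, 0.9])                 # environment terms
pi_prob  = np.array([0.6, 0.3, 0.7])                 # policy pi terms
pin_prob = pi_prob + np.array([0.01, -0.02, 0.015])  # a nearby policy pi_n

mu_pi  = np.prod(pi_prob)  * np.prod(env_prob)   # mu^pi(h)
mu_pin = np.prod(pin_prob) * np.prod(env_prob)   # mu^{pi_n}(h)

# The difference factors: common environment term times policy-product gap.
factored = np.prod(env_prob) * (np.prod(pin_prob) - np.prod(pi_prob))
assert np.isclose(mu_pin - mu_pi, factored)
```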
First, we establish some notation. Let $J=|\{h : h\in(A\times O)^*,\ |h|\le N\}|$, the cardinality of the set of histories of length at most $N$, which is a finite set since $N<\infty$ and $A$ and $O$ are finite sets. We can enumerate the elements of this set of histories by $h_j$ with the index $1\le j\le J$.
Given $\hat{\epsilon}>0$, let $B_{\hat{\epsilon}}(\pi(h_j))$ denote the ball of radius $\hat{\epsilon}$ centered at $\pi(h_j)\in\Delta A$ under the total variation distance, which we denote by $d_{TV}$. Importantly, $B_{\hat{\epsilon}}(\pi(h_j))$ is an open set of probability distributions over actions. Let $U:=\prod_{j=1}^{J}B_{\hat{\epsilon}}(\pi(h_j))\times\prod_{j=J+1}^{\infty}\Delta A$, which is open in the product topology since it is a product of open sets in $\Delta A$, with only finitely many components not equal to $\Delta A$.
Since $\pi_n\to\pi$ in the product topology, by definition: for all $\hat{\epsilon}>0$, there exists $M$ such that for all $n\ge M$, $\pi_n\in U$. By construction, if $\pi_n\in U$, then for $1\le j\le J$, $d_{TV}(\pi(h_j),\pi_n(h_j))<\hat{\epsilon}$. By definition, then for $1\le j\le J$, $\sup_{a\in A}|\pi(a\mid h_j)-\pi_n(a\mid h_j)|<\hat{\epsilon}$.
Consider the term we’re trying to bound,
$$\left|\prod_{t=0}^{N}\pi_n\left(a_t\mid (h_i)_{\le t-1}\right)-\prod_{t=0}^{N}\pi\left(a_t\mid (h_i)_{\le t-1}\right)\right|.$$

We want to rewrite the terms $\pi_n(a_t\mid (h_i)_{\le t-1})$ using the approximation provided above. We have that
$$\pi_n(a_t\mid (h_i)_{\le t-1})=\left|\pi_n(a_t\mid (h_i)_{\le t-1})\right|=\left|\pi_n(a_t\mid (h_i)_{\le t-1})-\pi(a_t\mid (h_i)_{\le t-1})+\pi(a_t\mid (h_i)_{\le t-1})\right|.$$

Then by the triangle inequality,
$$\left|\pi_n(a_t\mid (h_i)_{\le t-1})-\pi(a_t\mid (h_i)_{\le t-1})+\pi(a_t\mid (h_i)_{\le t-1})\right|\le\left|\pi_n(a_t\mid (h_i)_{\le t-1})-\pi(a_t\mid (h_i)_{\le t-1})\right|+\left|\pi(a_t\mid (h_i)_{\le t-1})\right|\le\hat{\epsilon}+\pi(a_t\mid (h_i)_{\le t-1}).$$

Thus $\pi_n(a_t\mid (h_i)_{\le t-1})\le\hat{\epsilon}+\pi(a_t\mid (h_i)_{\le t-1})$, so we now have:
$$\left|\prod_{t=0}^{N}\pi_n\left(a_t\mid (h_i)_{\le t-1}\right)-\prod_{t=0}^{N}\pi\left(a_t\mid (h_i)_{\le t-1}\right)\right|\le\left|\prod_{t=0}^{N}\left(\hat{\epsilon}+\pi\left(a_t\mid (h_i)_{\le t-1}\right)\right)-\prod_{t=0}^{N}\pi\left(a_t\mid (h_i)_{\le t-1}\right)\right|.$$

Notice that when expanded, the only term of $\prod_{t=0}^{N}\left(\hat{\epsilon}+\pi(a_t\mid (h_i)_{\le t-1})\right)$ that isn't the product of $\hat{\epsilon}$ and some finite term is $\prod_{t=0}^{N}\pi(a_t\mid (h_i)_{\le t-1})$, which cancels out with the identical term in the original expression. (The symmetric bound $\pi_n(a_t\mid (h_i)_{\le t-1})\ge\pi(a_t\mid (h_i)_{\le t-1})-\hat{\epsilon}$ controls the difference from below in the same way.) Since $\hat{\epsilon}$ can be chosen arbitrarily small by choosing the index $n$ of $\pi_n$ to be large, this means: given any $\tilde{\epsilon}>0$, there exists $M_i$ such that for all $n\ge M_i$,
$$\left|\prod_{t=0}^{N}\pi_n\left(a_t\mid (h_i)_{\le t-1}\right)-\prod_{t=0}^{N}\pi\left(a_t\mid (h_i)_{\le t-1}\right)\right|\le\tilde{\epsilon}.$$

Then if $n\ge\max\{M_i : 1\le i\le N(A,O)\}$, for all $1\le i\le N(A,O)$, $|\mu^{\pi_n}(h_i)-\mu^{\pi}(h_i)|\le c\cdot\tilde{\epsilon}$. Then
$$\left|\mu^{\pi_n}(L_{\gamma|N})-\mu^{\pi}(L_{\gamma|N})\right|\le N(A,O)\cdot c\cdot\tilde{\epsilon}.$$

Choose $\tilde{\epsilon}<\frac{\epsilon}{N(A,O)\cdot c}$. Then for $n\ge\max\{M_i : 1\le i\le N(A,O)\}$, $|\mu^{\pi_n}(L_{\gamma|N})-\mu^{\pi}(L_{\gamma|N})|<\epsilon$. $\square$
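For readers who want the expansion step made quantitative, here is one explicit bound (an addition of mine, not part of the original argument), using only that the conditional probabilities lie in $[0,1]$:

```latex
% If 0 <= p_t <= 1 and |q_t - p_t| <= eps_hat for t = 0,...,N, then a
% telescoping argument gives
\left|\prod_{t=0}^{N} q_t \;-\; \prod_{t=0}^{N} p_t\right|
  \;\le\; \sum_{t=0}^{N} \hat{\epsilon}\,(1+\hat{\epsilon})^{t}
  \;=\; (1+\hat{\epsilon})^{N+1} - 1,
% which tends to 0 as eps_hat -> 0 for fixed N. Taking p_t and q_t to be
% the conditional action probabilities of pi and pi_n yields M_i.
```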
Proposition 1 Lemma 2: Define $L_{\gamma|N}$ for $N\in\mathbb{N}$ as $L_{\gamma|N}\left((a_t,o_t)_{t=0}^{N}\right)=\sum_{t=0}^{N}\gamma^t L(o_t)$. Let $m\in\mathcal{M}\left((A\times O)^\infty\right)$ be a probability measure on destinies. Then $m(L_{\gamma|N})\to m(L_\gamma)$ as $N\to\infty$.
Proof: This lemma follows directly from the dominated convergence theorem, so our proof will be a check that all the conditions of that theorem are satisfied. First note that $L_{\gamma|N}$ converges to $L_\gamma$ pointwise. Also, $\int_{(A\times O)^\infty}|L_\gamma(h)|\,dm<\infty$ since $m$ is a probability measure (and thus $m\left((A\times O)^\infty\right)=1$) and $L_\gamma(h)\le 1$ for all $h\in(A\times O)^\infty$. Also, for all $N\in\mathbb{N}$ and $h\in(A\times O)^\infty$, $|L_{\gamma|N}(h)|\le|L_\gamma(h)|$. Then, by the dominated convergence theorem, $\lim_{N\to\infty}\int L_{\gamma|N}\,dm=\int L_\gamma\,dm$. $\square$
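In fact, for this particular truncation the convergence is quantitative. Assuming $L(o)\in[0,1]$ (an assumption consistent with, but not stated in, the lemma), a direct geometric-tail estimate gives:

```latex
% Quantitative tail bound for Lemma 2, assuming L(o) in [0,1] and
% L_gamma(h) = sum_t gamma^t L(o_t) as in the statement above:
\left| m(L_{\gamma}) - m(L_{\gamma|N}) \right|
  \;\le\; \int \sum_{t=N+1}^{\infty} \gamma^{t} L(o_t)\, dm
  \;\le\; \sum_{t=N+1}^{\infty} \gamma^{t}
  \;=\; \frac{\gamma^{N+1}}{1-\gamma},
% which tends to 0 as N -> infinity for any fixed gamma in [0,1).
```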
Proposition 1 Lemma 3: Suppose that $\{\pi_n\}_{n\in\mathbb{N}}$ is a convergent sequence in $\Pi$. Then for any $\epsilon>0$, there exist $M\in\mathbb{N}$ and a finite time-horizon $N_0\in\mathbb{N}$ such that for all $m\ge M$ and $N\ge N_0$, $|\mu^{\pi_m}(L_{\gamma|N})-\mu^{\pi_m}(L_\gamma)|<\epsilon$.
Proof: Let $\epsilon>0$ be given. Note that $\Pi$ is a metric space since it is the countable product of metric spaces (namely, copies of $\Delta A$). This implies that if $\{\pi_n\}_{n\in\mathbb{N}}$ is a convergent sequence, then $\{\pi_n\}_{n\in\mathbb{N}}$ is Cauchy.
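For concreteness, one standard choice of metric inducing the product topology on a countable product of metric spaces (here, copies of $(\Delta A, d_{TV})$ indexed by an enumeration $h_1,h_2,\dots$ of histories) is the following; the specific formula is an illustration, not taken from the main text:

```latex
% A standard metrization of the countable product topology:
d(\pi,\pi') \;=\; \sum_{j=1}^{\infty} 2^{-j}\,
  \min\bigl\{1,\; d_{TV}\bigl(\pi(h_j),\,\pi'(h_j)\bigr)\bigr\}.
% Convergence in d is exactly coordinatewise (per-history) convergence,
% i.e., convergence in the product topology.
```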
For any n,m∈N, we have by the triangle inequality that
$$\begin{aligned}
\left|\mu^{\pi_m}(L_\gamma)-\mu^{\pi_m}(L_{\gamma|N})\right| &= \left|\mu^{\pi_m}(L_\gamma)-\mu^{\pi_n}(L_\gamma)+\mu^{\pi_n}(L_\gamma)-\mu^{\pi_n}(L_{\gamma|N})+\mu^{\pi_n}(L_{\gamma|N})-\mu^{\pi_m}(L_{\gamma|N})\right| \\
&\le \left|\mu^{\pi_m}(L_\gamma)-\mu^{\pi_n}(L_\gamma)\right|+\left|\mu^{\pi_n}(L_\gamma)-\mu^{\pi_n}(L_{\gamma|N})\right|+\left|\mu^{\pi_n}(L_{\gamma|N})-\mu^{\pi_m}(L_{\gamma|N})\right|.
\end{aligned}$$

Since $\{\pi_n\}_{n\in\mathbb{N}}$ is Cauchy, the same technique used in Proposition 1 Lemma 1 can be used to show that there exists $M$ such that for all $n,m\ge M$, there exists $N_1$ such that for all $N\ge N_1$, $|\mu^{\pi_n}(L_{\gamma|N})-\mu^{\pi_m}(L_{\gamma|N})|<\frac{\epsilon}{9}$.
Let $n,m\ge M$ be fixed. By Proposition 1 Lemma 2, there exists $N_2\in\mathbb{N}$ such that for all $N\ge N_2$, $|\mu^{\pi_n}(L_\gamma)-\mu^{\pi_n}(L_{\gamma|N})|<\frac{\epsilon}{3}$.
Note that
$$\begin{aligned}
\left|\mu^{\pi_m}(L_\gamma)-\mu^{\pi_n}(L_\gamma)\right| &= \left|\mu^{\pi_m}(L_\gamma)-\mu^{\pi_m}(L_{\gamma|N})+\mu^{\pi_m}(L_{\gamma|N})-\mu^{\pi_n}(L_{\gamma|N})+\mu^{\pi_n}(L_{\gamma|N})-\mu^{\pi_n}(L_\gamma)\right| \\
&\le \left|\mu^{\pi_m}(L_\gamma)-\mu^{\pi_m}(L_{\gamma|N})\right|+\left|\mu^{\pi_m}(L_{\gamma|N})-\mu^{\pi_n}(L_{\gamma|N})\right|+\left|\mu^{\pi_n}(L_{\gamma|N})-\mu^{\pi_n}(L_\gamma)\right|.
\end{aligned}$$

By Proposition 1 Lemma 2, there exists $N_3$ such that for all $N\ge N_3$, $|\mu^{\pi_m}(L_\gamma)-\mu^{\pi_m}(L_{\gamma|N})|<\frac{\epsilon}{9}$. By the same lemma, there exists $N_4$ such that for all $N\ge N_4$, $|\mu^{\pi_n}(L_{\gamma|N})-\mu^{\pi_n}(L_\gamma)|<\frac{\epsilon}{9}$. As previously stated, if $N\ge N_1$, then $|\mu^{\pi_m}(L_{\gamma|N})-\mu^{\pi_n}(L_{\gamma|N})|<\frac{\epsilon}{9}$. Therefore, if $N\ge\max\{N_1,N_3,N_4\}$, then $|\mu^{\pi_m}(L_\gamma)-\mu^{\pi_n}(L_\gamma)|<\frac{\epsilon}{3}$.
Let $N_0:=\max\{N_1,N_2,N_3,N_4\}$. Then if $N\ge N_0$,
$$\left|\mu^{\pi_m}(L_\gamma)-\mu^{\pi_m}(L_{\gamma|N})\right|<\frac{\epsilon}{3}+\frac{\epsilon}{3}+\frac{\epsilon}{9}<\epsilon.\ \square$$

Proposition 1: If $A$ and $O$ are finite, the optimal policy for a given environment $\mu$ exists; namely, $\arg\min_{\pi\in\Pi}\mathbb{E}_{h\sim\mu^\pi}[L_\gamma(h)]\ne\emptyset$. Furthermore, the optimal policy can be chosen to be deterministic.
Proof: The usual technique to show that a function attains a minimum is to show that the function is continuous and that its domain is compact. The existence of a minimum then follows from the generalization of the Extreme Value Theorem to topological spaces. Recall that Lemma 2 (from the main text) states that $\Pi$ is compact.
So, it suffices to show that the map $\pi\mapsto\mathbb{E}_{h\sim\mu^\pi}[L_\gamma(h)]$ from $\Pi$ to $\mathbb{R}$ is continuous. Assume that $\{\pi_n\}_{n\in\mathbb{N}}$ is a sequence of policies converging to $\pi$ with respect to the product topology on $\Pi$. Then we want to show that $\mu^{\pi_n}(L_\gamma)\to\mu^{\pi}(L_\gamma)$. Recall that $L_{\gamma|N}$ for $N\in\mathbb{N}$ is defined as $L_{\gamma|N}\left((a_t,o_t)_{t=0}^{N}\right):=\sum_{t=0}^{N}\gamma^t L(o_t)$, which we can conceptualize as the truncation of $L_\gamma$ to a finite time-horizon. Let $\epsilon>0$ be given.
By Proposition 1 Lemma 2, there exists $N_1$ such that for all $N\ge N_1$, $|\mu^{\pi}(L_{\gamma|N})-\mu^{\pi}(L_\gamma)|<\frac{\epsilon}{3}$. By Proposition 1 Lemma 3, there exist $M_1$ and $N_2$ such that for all $n\ge M_1$ and $N\ge N_2$, $|\mu^{\pi_n}(L_\gamma)-\mu^{\pi_n}(L_{\gamma|N})|<\frac{\epsilon}{3}$.
Let $N_0=\max\{N_1,N_2\}$. By Proposition 1 Lemma 1, there exists $M_2$ such that for all $n\ge M_2$, $|\mu^{\pi}(L_{\gamma|N_0})-\mu^{\pi_n}(L_{\gamma|N_0})|<\frac{\epsilon}{3}$.
If $n\ge\max\{M_1,M_2\}$, then by the triangle inequality,
$$\begin{aligned}
\left|\mu^{\pi_n}(L_\gamma)-\mu^{\pi}(L_\gamma)\right| &= \left|\mu^{\pi_n}(L_\gamma)-\mu^{\pi_n}(L_{\gamma|N_0})+\mu^{\pi_n}(L_{\gamma|N_0})-\mu^{\pi}(L_{\gamma|N_0})+\mu^{\pi}(L_{\gamma|N_0})-\mu^{\pi}(L_\gamma)\right| \\
&\le \left|\mu^{\pi_n}(L_\gamma)-\mu^{\pi_n}(L_{\gamma|N_0})\right|+\left|\mu^{\pi}(L_{\gamma|N_0})-\mu^{\pi_n}(L_{\gamma|N_0})\right|+\left|\mu^{\pi}(L_{\gamma|N_0})-\mu^{\pi}(L_\gamma)\right| \\
&<\epsilon.
\end{aligned}$$

Therefore, $\mu^{\pi_n}(L_\gamma)\to\mu^{\pi}(L_\gamma)$.
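As a sanity check on this continuity claim, here is a toy one-step example in Python; the environment table, the loss $L(o)=o$, and the sequence $\pi_n$ are all made up for illustration:

```python
import numpy as np

# Toy one-step problem: two actions, two observations, loss L(o) = o.
# mu_o1_given_a[a] is the (hypothetical) probability of o=1 after action a.
mu_o1_given_a = np.array([0.2, 0.7])

def expected_loss(p_a1):
    """Expected one-step loss of the policy playing a=1 with prob p_a1."""
    p_a = np.array([1 - p_a1, p_a1])
    return float(p_a @ mu_o1_given_a)  # E[L] = sum_a pi(a) * P(o=1 | a)

limit = expected_loss(0.5)             # loss of the limit policy pi
for n in [10, 100, 1000, 10000]:       # pi_n plays a=1 w.p. 0.5 + 1/n
    gap = abs(expected_loss(0.5 + 1.0 / n) - limit)
    print(f"n={n:6d}  |mu^pi_n(L) - mu^pi(L)| = {gap:.6f}")
# The gap shrinks as pi_n -> pi, as the continuity argument predicts.
```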
Now we prove that the optimal policy can be chosen to be deterministic. The environment $\mu$ remains fixed. Let $\Pi_{det}\subset\Pi$ denote the set of deterministic policies, which is compact. By the same argument given more generally for $\Pi$, $d:=\min_{\pi\in\Pi_{det}}\mathbb{E}_{\mu^\pi}[L_\gamma]$ is well-defined.
Any stochastic policy $\pi_s$ is equivalent to a probabilistic mixture of deterministic policies. Therefore, any stochastic policy $\pi_s$ induces a probability measure $\nu_s$ on the countable set $\mu^{\Pi_{det}}:=\{\mu^\pi : \pi\in\Pi_{det}\}$. As a result, $\mu^{\pi_s}$ can be written as a countable mixture of measures given by $\mu^{\pi_s}=\sum_{\mu^{\pi_i}\in\mu^{\Pi_{det}}}\nu_s(\mu^{\pi_i})\,\mu^{\pi_i}$.
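The following toy Python snippet illustrates this mixture decomposition in the simplest possible case, a single decision with two actions; the environment table `mu` is hypothetical:

```python
import numpy as np

# One-step illustration: a stochastic policy over two actions is a mixture
# of the two deterministic policies "always a=0" and "always a=1", and the
# induced joint over (a, o) is the same mixture of the induced joints.
mu = np.array([[0.8, 0.2],    # P(o | a=0), made-up numbers
               [0.3, 0.7]])   # P(o | a=1)

def induced_joint(pi):
    """Joint P(a, o) induced by playing action distribution pi once."""
    return pi[:, None] * mu

pi_s = np.array([0.25, 0.75])   # stochastic policy
det0 = np.array([1.0, 0.0])     # deterministic: always a=0
det1 = np.array([0.0, 1.0])     # deterministic: always a=1

# Here nu_s puts weight pi_s(a) on the deterministic policy that plays a.
mixture = pi_s[0] * induced_joint(det0) + pi_s[1] * induced_joint(det1)
assert np.allclose(induced_joint(pi_s), mixture)
```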
This implies $\mathbb{E}_{\mu^{\pi_s}}[L_\gamma]=\mathbb{E}_{\nu_s}\left[\mathbb{E}_{\mu^\pi}[L_\gamma]\right]$. (See Problem 9.7 in reference [1] for an outline of how this can be justified, as also cited in reference [2]; see the note following the references.) By the definition of $d$, $\mathbb{E}_{\nu_s}\left[\mathbb{E}_{\mu^\pi}[L_\gamma]\right]\ge\mathbb{E}_{\nu_s}[d]=d$. Thus, the expected loss of a stochastic policy is at least the minimum expected loss over deterministic policies. $\square$
Proposition 2
Proposition 2: If a countable class of environments $\{\mu_i\}_{i\in I}$ is learnable, then the Bayes-optimal family of policies for any non-dogmatic prior $\zeta$ on $\{\mu_i\}_{i\in I}$ learns the class.
Proof: Suppose a class of hypotheses $\{\mu_i\}_{i\in I}$ is learnable, and let $\zeta$ be a non-dogmatic prior on $\{\mu_i\}_{i\in I}$. We will proceed by contrapositive and show that if the family of policies $\{\pi^\gamma_\zeta\}_{\gamma\in[0,1)}$ does not learn the class, then it is not the Bayes-optimal family for $\zeta$.
Assume that there exists $j$ such that $\lim_{\gamma\to 1}\mathrm{Reg}(\pi^\gamma_\zeta,\mu_j,\gamma)\ne 0$. By negating the definition of a limit, this means that there exists $\epsilon>0$ such that for all $\delta>0$, there exists $\gamma$ such that $|\gamma-1|<\delta$ and $|\mathrm{Reg}(\pi^\gamma_\zeta,\mu_j,\gamma)|\ge\epsilon$. By definition, $|\mathrm{Reg}(\pi^\gamma_\zeta,\mu_j,\gamma)|\ge\epsilon$ if and only if
$$\left|\min_{\pi\in\Pi}\left\{\mathbb{E}_{h\sim\mu_j^{\pi}}[L_\gamma(h)]\right\}-\mathbb{E}_{h\sim\mu_j^{\pi^\gamma_\zeta}}[L_\gamma(h)]\right|\ge\epsilon.$$

We will show that this contradicts the assumption that $\{\pi^\gamma_\zeta\}_{\gamma\in[0,1)}$ is a Bayes-optimal family for $\zeta$.
With steps explained in the following paragraphs, we have:
$$\begin{aligned}
\lim_{\gamma\to 1}\mathrm{BReg}(\pi^\gamma_\zeta,\zeta,\gamma) &= \lim_{\gamma\to 1}\mathbb{E}_{\mu\sim\zeta}\,\mathrm{Reg}(\pi^\gamma_\zeta,\mu,\gamma) \\
&= \lim_{\gamma\to 1}\sum_{i\in I}\zeta(\mu_i)\left(\mathbb{E}_{h\sim\mu_i^{\pi^\gamma_\zeta}}[L_\gamma(h)]-\min_{\pi\in\Pi}\mathbb{E}_{h\sim\mu_i^{\pi}}[L_\gamma(h)]\right) \\
&= \sum_{i\in I}\zeta(\mu_i)\lim_{\gamma\to 1}\left(\mathbb{E}_{h\sim\mu_i^{\pi^\gamma_\zeta}}[L_\gamma(h)]-\min_{\pi\in\Pi}\mathbb{E}_{h\sim\mu_i^{\pi}}[L_\gamma(h)]\right) \\
&\ge \zeta(\mu_j)\cdot\epsilon \\
&> 0.
\end{aligned}$$

In the above, the first and second equalities follow from definitions. To get the third equality, we switch the sum and the limit, which is justified by the dominated convergence theorem. In the language of countably infinite sums, this theorem says: Let $f_\gamma(i):I\to\mathbb{R}$ be a family of functions indexed by $\gamma\in[0,1)$ such that the pointwise limit $\lim_{\gamma\to 1}f_\gamma(i)$ exists for all $i\in I$. Suppose there exists $g(i):I\to\mathbb{R}$ such that $\sum_{i\in I}g(i)<\infty$ and, for all $\gamma$ and $i$, $|f_\gamma(i)|\le g(i)$. Then
$$\lim_{\gamma\to 1}\sum_{i\in I}f_\gamma(i)=\sum_{i\in I}\lim_{\gamma\to 1}f_\gamma(i).$$

So, this theorem essentially states that if we can find a sequence of terms $g(i)$ that "dominates" all of the $f_\gamma(i)$ pointwise in $i$, and the sum of the $g(i)$ doesn't blow up, then the limit and sum can be interchanged.
To apply the dominated convergence theorem, let
$$f_\gamma(i):=\zeta(\mu_i)\left(\mathbb{E}_{h\sim\mu_i^{\pi^\gamma_\zeta}}[L_\gamma(h)]-\min_{\pi\in\Pi}\mathbb{E}_{h\sim\mu_i^{\pi}}[L_\gamma(h)]\right).$$

Note that
$$\mathbb{E}_{h\sim\mu_i^{\pi^\gamma_\zeta}}[L_\gamma(h)]-\min_{\pi\in\Pi}\mathbb{E}_{h\sim\mu_i^{\pi}}[L_\gamma(h)]\le 1.$$

So $g(i):=\zeta(\mu_i)$ is an appropriate dominating function. By the definition of a prior, $\sum_{i\in I}\zeta(\mu_i)=1<\infty$. Therefore, the assumptions of the dominated convergence theorem are satisfied.
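Here is a toy numerical illustration of the interchange; the prior and the regret-like terms `r` are invented for the example and are not the actual Bayesian regret:

```python
import numpy as np

# Toy illustration of dominated convergence for countable sums.
# The prior zeta(i) = 2^{-(i+1)} sums to 1 and dominates
# f_gamma(i) = zeta(i) * r_gamma(i), where r_gamma(i) in [0, 1] is a
# stand-in for the regret term and tends to 0 as gamma -> 1.
I = np.arange(50)              # truncated index set (the tail is negligible)
zeta = 0.5 ** (I + 1)

def f(gamma):
    r = (1.0 - gamma) / (1.0 + I)   # hypothetical regret-like term -> 0
    return zeta * r                  # |f_gamma(i)| <= zeta(i) for all gamma

for gamma in [0.9, 0.99, 0.999]:
    print(f"gamma={gamma}: sum_i f_gamma(i) = {f(gamma).sum():.6f}")
# The sums tend to 0 = sum_i lim_gamma f_gamma(i), as the theorem predicts.
```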
The second-to-last inequality follows from the fact that for all $i\in I$,
$$\lim_{\gamma\to 1}\left(\mathbb{E}_{h\sim\mu_i^{\pi^\gamma_\zeta}}[L_\gamma(h)]-\min_{\pi\in\Pi}\mathbb{E}_{h\sim\mu_i^{\pi}}[L_\gamma(h)]\right)\ge 0,$$

together with the observation that the $j$-th term has limit at least $\epsilon$: there are $\gamma$ arbitrarily close to $1$ with $\mathrm{Reg}(\pi^\gamma_\zeta,\mu_j,\gamma)\ge\epsilon$, so the limit of that term, when it exists, is at least $\epsilon$. On the other hand, by assumption, $\{\mu_i\}_{i\in I}$ is learnable, so there exists some family of policies $\{\pi^\gamma\}_{\gamma\in[0,1)}$ such that for all $i$, $\lim_{\gamma\to 1}\mathrm{Reg}(\pi^\gamma,\mu_i,\gamma)=0$. Consider a parallel argument to what we have above, replacing $\pi^\gamma_\zeta$ with $\pi^\gamma$:
$$\begin{aligned}
\lim_{\gamma\to 1}\mathrm{BReg}(\pi^\gamma,\zeta,\gamma) &= \lim_{\gamma\to 1}\sum_{i\in I}\zeta(\mu_i)\left(\mathbb{E}_{h\sim\mu_i^{\pi^\gamma}}[L_\gamma(h)]-\min_{\pi\in\Pi}\mathbb{E}_{h\sim\mu_i^{\pi}}[L_\gamma(h)]\right) \\
&= \sum_{i\in I}\zeta(\mu_i)\lim_{\gamma\to 1}\left(\mathbb{E}_{h\sim\mu_i^{\pi^\gamma}}[L_\gamma(h)]-\min_{\pi\in\Pi}\left\{\mathbb{E}_{h\sim\mu_i^{\pi}}[L_\gamma(h)]\right\}\right) \\
&= \sum_{i\in I}\zeta(\mu_i)\lim_{\gamma\to 1}\mathrm{Reg}(\pi^\gamma,\mu_i,\gamma) \\
&= \sum_{i\in I}\zeta(\mu_i)\cdot 0 = 0.
\end{aligned}$$

Then there exists $\Gamma\in[0,1)$ such that for all $\gamma\ge\Gamma$, $\mathrm{BReg}(\pi^\gamma,\zeta,\gamma)<\frac{\zeta(\mu_j)\cdot\epsilon}{2}$. On the other hand, since $\lim_{\gamma\to 1}\mathrm{BReg}(\pi^\gamma_\zeta,\zeta,\gamma)\ge\zeta(\mu_j)\cdot\epsilon$, there exists $\gamma_0\ge\Gamma$ such that $\mathrm{BReg}(\pi^{\gamma_0}_\zeta,\zeta,\gamma_0)>\frac{\zeta(\mu_j)\cdot\epsilon}{2}$. Hence, $\mathrm{BReg}(\pi^{\gamma_0},\zeta,\gamma_0)<\mathrm{BReg}(\pi^{\gamma_0}_\zeta,\zeta,\gamma_0)$.
This implies
$$\mathrm{BReg}(\pi^{\gamma_0}_\zeta,\zeta,\gamma_0)-\mathrm{BReg}(\pi^{\gamma_0},\zeta,\gamma_0)>0.$$

By definition and by linearity of expectation,
$$\begin{aligned}
\mathrm{BReg}(\pi^{\gamma_0}_\zeta,\zeta,\gamma_0)-\mathrm{BReg}(\pi^{\gamma_0},\zeta,\gamma_0) &= \mathbb{E}_\zeta\left[\mathbb{E}_{\mu^{\pi^{\gamma_0}_\zeta}}[L_{\gamma_0}(h)]-\min_{\pi\in\Pi}\mathbb{E}_{\mu^{\pi}}[L_{\gamma_0}(h)]\right]-\mathbb{E}_\zeta\left[\mathbb{E}_{\mu^{\pi^{\gamma_0}}}[L_{\gamma_0}(h)]-\min_{\pi\in\Pi}\mathbb{E}_{\mu^{\pi}}[L_{\gamma_0}(h)]\right] \\
&= \mathbb{E}_\zeta\,\mathbb{E}_{\mu^{\pi^{\gamma_0}_\zeta}}[L_{\gamma_0}(h)]-\mathbb{E}_\zeta\,\mathbb{E}_{\mu^{\pi^{\gamma_0}}}[L_{\gamma_0}(h)].
\end{aligned}$$

Then $\mathbb{E}_\zeta\,\mathbb{E}_{\mu^{\pi^{\gamma_0}_\zeta}}[L_{\gamma_0}(h)]-\mathbb{E}_\zeta\,\mathbb{E}_{\mu^{\pi^{\gamma_0}}}[L_{\gamma_0}(h)]>0$. Then by definition, $\{\pi^\gamma_\zeta\}_{\gamma\in[0,1)}$ is not the Bayes-optimal family for $\zeta$. $\square$
References:
[1] Schilling, René L. 2005. Measures, Integrals and Martingales. Cambridge: Cambridge University Press.
[2] humanStampedist (https://math.stackexchange.com/users/474469/humanstampedist), How to integrate for a countable sum of measures?, URL (version: 2018-08-27): https://math.stackexchange.com/q/2896276
Note: It is a standard approximation argument using an increasing sequence of step functions. The difference between what is needed here and what is stated as the exercise in the book is that here the coefficients are determined by $\nu_s$. This is not a problem when applying the argument, since all of the coefficients are nonnegative.