Proof Section to an Introduction to Credal Sets and Infra-Bayes Learnability
This post accompanies An Introduction to Credal Sets and Infra-Bayes Learnability.
Notation
We use ΔX to denote the space of probability distributions over a set X, which is assumed throughout to be a compact metric space. We use □X to denote the set of credal sets over X.
Given f:X→R and m∈ΔX, let m(f):=Em[f].
Let C(X,Y) denote the space of continuous functions from X to Y.
Proof of Lemma 1
Lemma 1: If A and O are finite, the set of countably infinite histories (A×O)ω is a compact metric space under the metric d(h,h′)=γt(h,h′) where γ∈(0,1) and t(h,h′) is the time of first difference between h and h′.
Proof. The space A×O is compact under the discrete topology since it is finite. Therefore, (A×O)ω is compact under the product topology P by Tychonoff's theorem. The stated metric induces a topology M. Since compactness passes from a topology to any coarser topology, it suffices to show that every basis element of M is open in P, i.e., that M⊆P.
The basis elements of M have the form
$$B_\delta(h) := \{\,h' \in (A\times O)^\omega : d(h,h') < \delta\,\}$$
for $h\in(A\times O)^\omega$ and $\delta\in(0,1]$.
Furthermore, sets of the form
$$U_{h,N} := \prod_{i=0}^{N} \{h_i\} \times \prod_{i=N+1}^{\infty} (A\times O)$$
are basis elements of P, where $h_i\in A\times O$ and $N\in\mathbb{N}$.
Given $\delta\in(0,1]$, define $N(\delta)=\max\{n\in\mathbb{N} : \gamma^n \ge \delta\}$. We will verify that $d(h,h')=\gamma^{t(h,h')}<\delta$ if and only if $t(h,h')\ge N(\delta)+1$. Note that if $t(h,h')\ge N(\delta)+1>N(\delta)$, then by construction $\gamma^{t(h,h')}<\delta$. For the other direction, if $t(h,h')\le N(\delta)$, then since $\gamma\in(0,1)$, $\gamma^{t(h,h')}\ge\gamma^{N(\delta)}\ge\delta$. By the contrapositive, if $\gamma^{t(h,h')}<\delta$, then $t(h,h')\ge N(\delta)+1$.
Since $t(h,h')\ge N(\delta)+1$ if and only if $h_i=h'_i$ for $0\le i\le N(\delta)$, this proves that $U_{h,N(\delta)}$ and $B_\delta(h)$ are equal as sets. Thus all basis elements of M are open in P, and (A×O)ω is compact in the metric topology. □
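The ball–cylinder correspondence above is easy to check numerically. The following Python sketch is purely illustrative: the encoding of histories as tuples of action–observation pairs and the helper names `first_difference` and `n_of_delta` are our own, not from the post. It verifies that $d(h,h')<\delta$ exactly when the histories agree through time $N(\delta)$.

```python
GAMMA = 0.5  # the discount factor gamma in (0, 1)

def first_difference(h1, h2):
    """Time t(h, h') of the first difference between two (truncated) histories."""
    for t, (a, b) in enumerate(zip(h1, h2)):
        if a != b:
            return t
    return len(h1)  # the truncations agree everywhere

def d(h1, h2):
    """The metric d(h, h') = gamma^t(h, h')."""
    return GAMMA ** first_difference(h1, h2)

def n_of_delta(delta):
    """N(delta) = max{n in N : gamma^n >= delta}."""
    return max(n for n in range(64) if GAMMA ** n >= delta)

# Two histories over A x O = {0, 1} x {0, 1} whose first difference is at time 3.
h = ((0, 0), (1, 0), (0, 1), (0, 0), (1, 1))
hp = ((0, 0), (1, 0), (0, 1), (1, 0), (0, 0))

delta = 0.2  # for gamma = 0.5, N(0.2) = 2 since 0.5^2 = 0.25 >= 0.2 > 0.125
assert n_of_delta(delta) == 2
# d(h, h') < delta exactly when the histories agree through time N(delta):
assert d(h, hp) < delta
assert first_difference(h, hp) >= n_of_delta(delta) + 1
```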
Proof of Proposition 1
The following lemma is used both in the proof of Proposition 1 and Proposition 2.
Lemma 2: The set of deterministic policies Π is first-countable under the product topology.
Proof: Let π∈Π. We aim to show that π has a countable neighborhood basis. Consider the collection B of open sets in Π of the form $\prod_{h_i\in(A\times O)^*}U_{h_i}$, where $U_{h_i}=\{\pi(h_i)\}$ for finitely many finite histories $h_i$ and otherwise $U_{h_i}=A$. Each element of B is determined by a finite subset of the countable set of finite histories, so B is countable.
Let N be a neighborhood of π. By the definition of a basis, there exists a basis element V for the product topology such that π∈V⊂N. By definition of the product topology, V can be written as $\prod_{h_i\in(A\times O)^*}V_{h_i}$, where $V_{h_i}\ne A$ for only finitely many $h_i$. Since π∈V, an element of B is contained in V and thus in N. Therefore, B is a countable neighborhood basis. □
Proposition 1: Every crisp causal law Λ:Π→□(A×O)ω is continuous with respect to the product topology on Π and the Hausdorff topology on □(A×O)ω.
Proof: By Lemma 2, Π is first-countable under the product topology. Consequently, Λ:Π→□(A×O)ω is continuous provided it is sequentially continuous, i.e., provided Λ(πn)→Λ(π) whenever πn→π. Let such a convergent sequence {πn}n∈N be given.
By definition, there exists a set of environments E that generates Λ. We will show convergence using the weak topology, so let g:(A×O)ω→[0,1] be a 1-Lipschitz continuous function.
Since πn→π, given any finite time horizon T<∞, there exists N such that for all n≥N, πn and π agree up to time T. Furthermore, the set of destinies (A×O)ω can be partitioned into a disjoint union $\coprod_{j=0}^{J(T)} D_j$, where each $D_j$ is a set of destinies that are equal up to time T and $J(T)<\infty$. Note that by construction, if $h,h'\in D_j$, then $d(h,h')\le\gamma^T$.
Then for every environment μ∈E,
$$\begin{aligned}
\left|\mathbb{E}_{\mu\pi_n}g-\mathbb{E}_{\mu\pi}g\right| &= \left|\sum_{j=0}^{J(T)}\int_{D_j}g\,d(\mu\pi_n)-\int_{D_j}g\,d(\mu\pi)\right| \\
&\le \sum_{j=0}^{J(T)}\left|\int_{D_j}g\,d(\mu\pi_n)-\int_{D_j}g\,d(\mu\pi)\right| \\
&\le \sum_{j=0}^{J(T)}\left|\int_{D_j}\sup_{h\in D_j}g(h)\,d(\mu\pi_n)-\int_{D_j}\inf_{h\in D_j}g(h)\,d(\mu\pi)\right| \\
&= \sum_{j=0}^{J(T)}\left|\sup_{h\in D_j}g(h)\,\mu\pi_n(D_j)-\inf_{h\in D_j}g(h)\,\mu\pi(D_j)\right|.
\end{aligned}$$
For all n≥N, $\mu\pi_n(D_j)=\mu\pi(D_j)$. By the 1-Lipschitz property of g and the construction of $D_j$,
$$\left|\sup_{h\in D_j}g(h)-\inf_{h\in D_j}g(h)\right| \le \sup_{h,h'\in D_j} d(h,h') \le \gamma^T.$$
Therefore,
$$\sum_{j=0}^{J(T)}\left|\sup_{h\in D_j}g(h)\,\mu\pi_n(D_j)-\inf_{h\in D_j}g(h)\,\mu\pi(D_j)\right| \le \sum_{j=0}^{J(T)}\gamma^T\mu\pi(D_j)=\gamma^T.$$
As T can be taken arbitrarily large and γ∈(0,1), $\left|\mathbb{E}_{\mu\pi_n}g-\mathbb{E}_{\mu\pi}g\right|$ tends to zero as n tends to infinity. The Kantorovich-Rubinstein metric induces the weak topology on Δ(A×O)ω, so for all μ∈E, μπn→μπ. Therefore, Λ(πn)→Λ(π) in the Hausdorff topology. □
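The prefix-partition estimate can be illustrated numerically. In the sketch below, the finite truncations of destinies, the particular measures, and the reference destiny `h0` are our own hypothetical choices: two measures that place the same mass on every time-T prefix cell produce expectations of a 1-Lipschitz function within $\gamma^T$ of each other.

```python
GAMMA = 0.5
T = 2  # the two measures below agree on every prefix cell up to time T

def d(h1, h2):
    """Ultrametric d(h, h') = gamma^(time of first difference); 0 if equal."""
    for t, (a, b) in enumerate(zip(h1, h2)):
        if a != b:
            return GAMMA ** t
    return 0.0

h0 = (0, 1, 0, 0, 0)          # a fixed reference destiny
g = lambda h: d(h, h0)        # 1-Lipschitz by the triangle inequality

# Two distributions over length-5 binary destinies with the SAME mass on each
# time-T prefix cell D_j (prefixes (0,1,0) and (1,0,1)), differing inside cells.
mu_pi_n = {(0, 1, 0, 0, 0): 0.5, (1, 0, 1, 1, 1): 0.5}
mu_pi   = {(0, 1, 0, 1, 1): 0.5, (1, 0, 1, 0, 0): 0.5}

E = lambda mu: sum(p * g(h) for h, p in mu.items())
gap = abs(E(mu_pi_n) - E(mu_pi))
assert gap <= GAMMA ** T  # the gamma^T bound from the proof
```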
Proof of Proposition 2 and Corollary 1
Corollary 1 below corresponds to Proposition 5 in this proof section of the original infra-Bayesianism sequence. The proof of that proposition was organized in three phases. Since we expand the ideas in more detail, we have several lemmas; “phase 1” is captured here in Lemma 5, and “phase 2” and “phase 3” are captured here in Proposition 2. The other lemmas cover prerequisite ideas that are used in the proof.
The following lemma is a special case of Theorem 6.9 of [1]. We provide a simplified proof under the assumption that X is a compact metric space using the fact that the set of all Lipschitz functions is dense in C(X,[0,1]), the space of continuous functions from X to [0,1] [2].
Lemma 3: Let X be a compact metric space, f:X→[0,1] be continuous, and m∈ΔX. Then m↦m(f) is continuous as a function from ΔX with the Kantorovich-Rubinstein (KR-) metric to [0,1].
Proof. Since ΔX is a metric space, ΔX is first-countable. Thus m↦m(f) is continuous if mn→m implies mn(f)→m(f). Let mn→m and ϵ>0. By definition, there exists N∈N such that for all n≥N,
$$\sup_{f_{\mathrm{lip}}}\left|m_n(f_{\mathrm{lip}})-m(f_{\mathrm{lip}})\right|<\epsilon,$$
where the supremum is taken over all 1-Lipschitz continuous functions $f_{\mathrm{lip}}:X\to[-1,1]$.
Suppose f is 1-Lipschitz. Then mn(f)→m(f) since
$$|m_n(f)-m(f)| \le \sup_{f_{\mathrm{lip}}}\left|m_n(f_{\mathrm{lip}})-m(f_{\mathrm{lip}})\right| < \epsilon.$$
Suppose f is K-Lipschitz. Then $\frac{1}{K}f$ is 1-Lipschitz. So, by the first case, there exists $N\in\mathbb{N}$ such that for all n≥N,
$$\left|m_n\!\left(\tfrac{1}{K}f\right)-m\!\left(\tfrac{1}{K}f\right)\right| < \tfrac{1}{K}\epsilon.$$
By linearity of expectation, for all n≥N,
$$|m_n(f)-m(f)| = K\left|m_n\!\left(\tfrac{1}{K}f\right)-m\!\left(\tfrac{1}{K}f\right)\right| < \epsilon.$$
Suppose f is continuous. By the density of the set of all Lipschitz functions in C(X,[0,1]) [2], there exists a Lipschitz function g:X→[0,1] such that $\sup_{x\in X}|f(x)-g(x)|<\frac{\epsilon}{3}$. By the second case, there exists $N\in\mathbb{N}$ such that for all n≥N, $|m_n(g)-m(g)|<\frac{\epsilon}{3}$. Applying the triangle inequality, we obtain that for all n≥N,
$$\begin{aligned}
|m_n(f)-m(f)| &= |m_n(f)-m_n(g)+m_n(g)-m(g)+m(g)-m(f)| \\
&\le |m_n(f)-m_n(g)|+|m_n(g)-m(g)|+|m(g)-m(f)| \\
&\le m_n(X)\sup_{x\in X}|f(x)-g(x)|+\tfrac{\epsilon}{3}+m(X)\sup_{x\in X}|f(x)-g(x)| \\
&< \tfrac{\epsilon}{3}+\tfrac{\epsilon}{3}+\tfrac{\epsilon}{3} = \epsilon.
\end{aligned}$$
Here we have used the fact that m,mn∈ΔX and thus m(X)=mn(X)=1. This shows that mn(f)→m(f), which completes the proof by the remark at the beginning. □
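A minimal numerical illustration of the lemma, with our own hypothetical choices of mixtures and test function: the measures $m_n=(1-\tfrac{1}{n})\delta_{1/n}+\tfrac{1}{n}\delta_1$ on $X=[0,1]$ converge to $\delta_0$ in the KR metric, and the expectations $m_n(f)$ converge to $f(0)$ for a continuous f.

```python
import math

# Mixtures m_n = (1 - 1/n) * delta_{1/n} + (1/n) * delta_1 on X = [0, 1];
# m_n converges to the point mass delta_0 in the KR metric.
f = lambda x: math.sin(math.pi * x) + x ** 2  # an arbitrary continuous f

def m_n_of_f(n):
    """The expectation m_n(f)."""
    return (1 - 1 / n) * f(1 / n) + (1 / n) * f(1.0)

limit = f(0.0)  # delta_0(f) = f(0)
errs = [abs(m_n_of_f(n) - limit) for n in (10, 100, 1000)]
assert errs[0] > errs[1] > errs[2]  # m_n(f) -> f(0)
assert errs[2] < 1e-2
```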
Lemma 4: Let θ∈□X be a credal set over a compact metric space X. Then for all continuous f:X→[0,1] there exists m∗∈θ such that m∗(f)=maxm∈θ{m(f)}.
Proof. By Lemma 3, the map m↦m(f) is continuous. By definition, θ is compact. A continuous function over a compact set achieves a maximum over that set, which proves the result. □
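Since m↦m(f) is linear in m, the maximum in Lemma 4 is attained at an extreme point of the credal set. A quick sketch with a hypothetical two-vertex credal set over a three-point space (all numbers below are illustrative stand-ins):

```python
# A credal set given as the convex hull of two distributions p and q over a
# three-point space X = {0, 1, 2}. The map m -> m(f) is linear in m, so its
# maximum over the hull is attained at a vertex (an extreme point).
p = [0.2, 0.5, 0.3]
q = [0.6, 0.1, 0.3]
f = [1.0, 0.0, 2.0]

def m_of_f(lam):
    """Expectation of f under the mixture m = lam * p + (1 - lam) * q."""
    return sum((lam * pi + (1 - lam) * qi) * fi for pi, qi, fi in zip(p, q, f))

vals = [m_of_f(k / 100) for k in range(101)]  # sweep the segment between p and q
assert max(vals) == max(m_of_f(0.0), m_of_f(1.0))  # the maximizer is a vertex
```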
The next lemma corresponds to “phase one” of the proof of Proposition 5 in this proof section of the original infra-Bayesianism sequence.
Lemma 5: Let Λ:Π→□(A×O)ω be a crisp causal law, and let g:Π→C((A×O)ω,[0,1]) be continuous. Suppose a sequence of policies {πn}n∈N converges to π in the KR-metric on Π. Then limn→∞|EΛ(πn)[g(πn)]−EΛ(πn)[g(π)]|=0.
Proof. Let a convergent sequence πn→π and ϵ>0 be given. We aim to show that there exists N such that for all n≥N, |EΛ(πn)[g(πn)]−EΛ(πn)[g(π)]|<ϵ.
By Lemma 1 and Lemma 4, for all $n\in\mathbb{N}$ there exists $m_n\in\Lambda(\pi_n)$ such that $\mathbb{E}_{\Lambda(\pi_n)}[g(\pi_n)]=m_n(g(\pi_n))$. Similarly, there exists $m\in\Lambda(\pi_n)$ (which also depends on n, though we suppress this in the notation) such that $\mathbb{E}_{\Lambda(\pi_n)}[g(\pi)]=m(g(\pi))$. Then
$$\text{(1)}\quad \left|\mathbb{E}_{\Lambda(\pi_n)}[g(\pi_n)]-\mathbb{E}_{\Lambda(\pi_n)}[g(\pi)]\right| = |m_n(g(\pi_n))-m(g(\pi))|.$$
By definition, for all $n\in\mathbb{N}$, $m_n(g(\pi_n))\ge m(g(\pi_n))$ and $m(g(\pi))\ge m_n(g(\pi))$. This implies
$$\text{(2)}\quad m_n(g(\pi_n))-m(g(\pi_n))\ge 0,\qquad\text{and}\qquad \text{(3)}\quad m(g(\pi))-m_n(g(\pi))\ge 0.$$
By (2),
$$m(g(\pi))-m_n(g(\pi_n)) \le m(g(\pi))-m_n(g(\pi_n))+m_n(g(\pi_n))-m(g(\pi_n)) = m(g(\pi))-m(g(\pi_n)).$$
By applying (3) similarly, we obtain
$$m_n(g(\pi_n))-m(g(\pi)) \le m_n(g(\pi_n))-m_n(g(\pi)).$$
These inequalities imply that
$$\text{(4)}\quad |m_n(g(\pi_n))-m(g(\pi))| \le \max\{\,m(g(\pi))-m(g(\pi_n)),\; m_n(g(\pi_n))-m_n(g(\pi))\,\}.$$
We will now consider how to bound each of the expressions in the set on the right hand side.
Since Π is compact (by Lemma 2 of An Introduction to Reinforcement Learning for Understanding Infra-Bayesianism) and g is continuous, g is uniformly continuous by the Heine-Cantor theorem. Recall that the metric on C((A×O)ω,[0,1]) is the Chebyshev metric defined by d(f1,f2)=suph∈(A×O)ω|f1(h)−f2(h)|. By the definition of uniform continuity, there exists δ>0 such that dKR(πn,π)<δ implies that suph∈(A×O)ω|g(πn)(h)−g(π)(h)|<ϵ.
Since πn→π, there exists N∈N such that for all n≥N, dKR(πn,π)<δ. Then (with steps explained below),
$$|m_n(g(\pi_n))-m_n(g(\pi))| \le m_n(|g(\pi_n)-g(\pi)|) = \int_{(A\times O)^\omega}|g(\pi_n)-g(\pi)|\,dm_n < \int_{(A\times O)^\omega}\epsilon\,dm_n = \epsilon.$$
Note that $|m_n(g(\pi_n))-m_n(g(\pi))|\le m_n(|g(\pi_n)-g(\pi)|)$ due to the linearity of expectation and the triangle inequality for integrals, which states that for any measure m and integrable function f, $\left|\int f\,dm\right|\le\int|f|\,dm$. By definition, $m_n(|g(\pi_n)-g(\pi)|)=\int_{(A\times O)^\omega}|g(\pi_n)-g(\pi)|\,dm_n$. By monotonicity of the Lebesgue integral,
$$\int_{(A\times O)^\omega}|g(\pi_n)-g(\pi)|\,dm_n \le \int_{(A\times O)^\omega}\sup_{h\in(A\times O)^\omega}|g(\pi_n)(h)-g(\pi)(h)|\,dm_n.$$
By the fact that $m_n$ is a probability measure and thus $m_n((A\times O)^\omega)=1$,
$$\int_{(A\times O)^\omega}\sup_{h\in(A\times O)^\omega}|g(\pi_n)(h)-g(\pi)(h)|\,dm_n < \epsilon$$
for all n≥N.
By a similar argument, for all n≥N,
$$|m(g(\pi_n))-m(g(\pi))| < \epsilon.$$
Returning to equations (1) and (4), we obtain that for all n≥N,
$$\left|\mathbb{E}_{\Lambda(\pi_n)}[g(\pi_n)]-\mathbb{E}_{\Lambda(\pi_n)}[g(\pi)]\right| < \epsilon,$$
which completes the proof. □
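The estimate at the heart of the proof, namely that $|m_n(f_1)-m_n(f_2)|\le\sup|f_1-f_2|$ for any probability measure $m_n$, can be spot-checked numerically. Everything below (finite support, random measures) is an illustrative stand-in for the abstract objects in the lemma:

```python
import random

random.seed(0)  # deterministic demo
xs = range(10)
f1 = [random.random() for _ in xs]
f2 = [v + random.uniform(-0.05, 0.05) for v in f1]  # sup|f1 - f2| <= 0.05
sup = max(abs(a - b) for a, b in zip(f1, f2))

worst_gap = 0.0
for _ in range(100):  # 100 random probability measures m_n on the 10 points
    w = [random.random() for _ in xs]
    s = sum(w)
    m = [wi / s for wi in w]
    gap = abs(sum(mi * (a - b) for mi, a, b in zip(m, f1, f2)))
    worst_gap = max(worst_gap, gap)

assert worst_gap <= sup <= 0.05  # the expectation gap never exceeds the sup norm
```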
The next proposition is a special case of “phase 2” and “phase 3” of the proof of Proposition 5 in this proof section of the original infra-Bayesianism sequence. The ideas here are the same, although we provide a direct proof.
Proposition 2: Let Λ:Π→□(A×O)ω be a crisp causal law, and let g:Π→C((A×O)ω,[0,1]) be continuous. Then the function π↦EΛ(π)[g(π)] is continuous.
Proof: By Lemma 2, Π is first-countable, and thus it is sufficient to show that for any convergent sequence πn→π, limn→∞EΛ(πn)[g(πn)]=EΛ(π)[g(π)]. By Lemma 5, it is furthermore sufficient to prove that limn→∞EΛ(πn)[g(π)]=EΛ(π)[g(π)].
To that end, let πn→π. We will show that liminfn→∞EΛ(πn)[g(π)]≥EΛ(π)[g(π)] and limsupn→∞EΛ(πn)[g(π)]≤EΛ(π)[g(π)].
By Lemma 1 and Lemma 4, there exists m∈Λ(π) such that EΛ(π)[g(π)]=m(g(π)). By the continuity of Λ, there exists a convergent sequence {mn}n∈N such that mn→m and mn∈Λ(πn) for all n∈N. By assumption, g is continuous, and thus Lemma 3 implies that mn(g(π))→m(g(π)). By construction, EΛ(πn)[g(π)]≥mn(g(π)). Then
$$\liminf_{n\to\infty}\mathbb{E}_{\Lambda(\pi_n)}[g(\pi)] \ge \liminf_{n\to\infty} m_n(g(\pi)) = m(g(\pi)) = \mathbb{E}_{\Lambda(\pi)}[g(\pi)].$$
For the second argument, choose a subsequence $\{n_k\}_{k\in\mathbb{N}}$ such that
$$\text{(1)}\quad \lim_{k\to\infty}\mathbb{E}_{\Lambda(\pi_{n_k})}[g(\pi)] = \limsup_{n\to\infty}\mathbb{E}_{\Lambda(\pi_n)}[g(\pi)].$$
Invoking Lemma 1 and Lemma 4 again, for each $n_k$, choose $m_{n_k}\in\Lambda(\pi_{n_k})$ such that $\mathbb{E}_{\Lambda(\pi_{n_k})}[g(\pi)]=m_{n_k}(g(\pi))$. Since Λ is continuous and Λ(π) is closed, there exists a convergent subsequence $\{m_{n_{k_j}}\}_{j\in\mathbb{N}}$ with a limit point m∈Λ(π).
Then by (1), Lemma 3, and the definition of expectation with respect to a credal set,
$$\limsup_{n\to\infty}\mathbb{E}_{\Lambda(\pi_n)}[g(\pi)] = \lim_{j\to\infty} m_{n_{k_j}}(g(\pi)) = m(g(\pi)) \le \mathbb{E}_{\Lambda(\pi)}[g(\pi)].$$
Thus $\lim_{n\to\infty}\mathbb{E}_{\Lambda(\pi_n)}[g(\pi)]=\mathbb{E}_{\Lambda(\pi)}[g(\pi)]$, which completes the proof. □
Corollary 1 then follows from Proposition 2 by a standard continuity argument: by Proposition 2 each map π↦EΛ(π)[g(π)] is continuous, hence so is the ζ-expectation; a continuous function on the compact set Π attains its minimum, and the argmin is closed as the preimage of the minimum value.
Corollary 1: For all crisp causal laws Λ and for all continuous functions g:Π→C((A×O)ω,[0,1]), argminπ∈ΠEΛ∼ζ[EΛ(π)[g(π)]] is a closed, non-empty set.
Proof of Proposition 3
The following proposition corresponds to Proposition 12 in this proof section of the infra-Bayesian sequence. Equation (2) below appears there, and we explain it in more detail here. The remainder of the proof differs from the original version.
Proposition 3: For any non-dogmatic prior ζ over a learnable collection of crisp causal laws {Λi}∞i=0, if a family π∗γ of policies is infra-Bayes optimal with respect to ζ, then π∗γ learns {Λi}∞i=0.
Proof: Assume that {Λi}∞i=0 is learnable. We will proceed by contrapositive and show that if a family of policies π∗γ does not learn {Λi}∞i=0, then for any non-dogmatic prior ζ over {Λi}∞i=0, π∗γ is not infra-Bayes optimal. By definition, there exists a family of policies πγ such that for all i≥0,
$$\text{(1)}\quad \lim_{\gamma\to1}\mathrm{Reg}(\pi_\gamma,\Lambda_i,L_\gamma)=0.$$
Equivalently, by the definition of regret, for all i≥0,
$$\text{(1)}\quad \lim_{\gamma\to1}\left(\mathbb{E}_{\Lambda_i(\pi_\gamma)}[L_\gamma]-\min_{\pi\in\Pi}\mathbb{E}_{\Lambda_i(\pi)}[L_\gamma]\right)=0.$$
We preliminarily aim to show that
$$\text{(2)}\quad \lim_{\gamma\to1}\mathrm{IBReg}(\pi_\gamma,\zeta,L_\gamma)=\lim_{\gamma\to1}\mathbb{E}_{\Lambda_i\sim\zeta}\,\mathrm{Reg}(\pi_\gamma,\Lambda_i,L_\gamma)=0.$$
Let ϵ>0 be given. Note that for all $N\in\mathbb{N}$,
$$\mathbb{E}_{\Lambda_i\sim\zeta}\,\mathrm{Reg}(\pi_\gamma,\Lambda_i,L_\gamma)=\sum_{i=0}^{N}\zeta(\Lambda_i)\,\mathrm{Reg}(\pi_\gamma,\Lambda_i,L_\gamma)+\sum_{i=N+1}^{\infty}\zeta(\Lambda_i)\,\mathrm{Reg}(\pi_\gamma,\Lambda_i,L_\gamma).$$
Since $\{\Lambda_i\}_{i=0}^\infty$ is countable and $\sum_{i\ge0}\zeta(\Lambda_i)=1<\infty$, there exists N<∞ such that $\sum_{i\ge N+1}\zeta(\Lambda_i)<\epsilon/2$. Since $\mathrm{Reg}(\pi_\gamma,\Lambda_i,L_\gamma)\le1$ for all i, the second sum above is bounded by ϵ/2. We will now bound the first sum.
Let $\hat\epsilon:=\frac{\epsilon}{2N}$. Applying equation (1) for each 0≤i≤N, we can obtain $\gamma_i\in(0,1)$ such that for all $\gamma\in[\gamma_i,1)$, $\mathrm{Reg}(\pi_\gamma,\Lambda_i,L_\gamma)<\hat\epsilon$. Let $\gamma^*=\max_{0\le i\le N}\gamma_i$. Then $\mathrm{Reg}(\pi_\gamma,\Lambda_i,L_\gamma)<\hat\epsilon$ holds simultaneously for all 0≤i≤N and γ≥γ∗. Thus for γ≥γ∗, $\mathbb{E}_{\Lambda_i\sim\zeta}\,\mathrm{Reg}(\pi_\gamma,\Lambda_i,L_\gamma)<\epsilon$, proving Equation (2).
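The head/tail splitting above can be mirrored numerically. In this sketch we take a hypothetical geometric prior $\zeta(\Lambda_i)=2^{-(i+1)}$ and regrets bounded by 1; both are stand-ins for the abstract objects in the proof:

```python
eps = 0.1
zeta = lambda i: 2.0 ** -(i + 1)  # hypothetical non-dogmatic geometric prior

# Choose N so that the tail mass sum_{i > N} zeta(Lambda_i) = 2^{-(N+1)} < eps/2.
N = 0
while 2.0 ** -(N + 1) >= eps / 2:
    N += 1
assert 2.0 ** -(N + 1) < eps / 2  # tail contribution < eps/2 even if Reg = 1

# If each head regret is below eps_hat = eps/(2N), the total expectation < eps.
eps_hat = eps / (2 * N)
regs = [eps_hat / 2] * (N + 1)  # hypothetical regrets Reg(pi, Lambda_i) < eps_hat
head = sum(zeta(i) * r for i, r in enumerate(regs))
tail = sum(zeta(i) * 1.0 for i in range(N + 1, 200))  # worst case: Reg = 1
assert head + tail < eps
```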
Since π∗γ does not learn $\{\Lambda_i\}_{i=0}^\infty$, there exists $j\in\mathbb{N}$ such that $\lim_{\gamma\to1}\mathrm{Reg}(\pi^*_\gamma,\Lambda_j,L_\gamma)\ne0$. We will show that this implies $\lim_{\gamma\to1}\mathrm{IBReg}(\pi^*_\gamma,\zeta,L_\gamma)>0$, which when combined with (2) indicates that π∗γ is not infra-Bayes optimal with respect to ζ.
With steps explained in the following paragraph, we have that
$$\begin{aligned}
\lim_{\gamma\to1}\mathrm{IBReg}(\pi^*_\gamma,\zeta,L_\gamma) &= \lim_{\gamma\to1}\mathbb{E}_{\Lambda_i\sim\zeta}\,\mathrm{Reg}(\pi^*_\gamma,\Lambda_i,L_\gamma) \\
&= \lim_{\gamma\to1}\sum_{i=0}^{\infty}\zeta(\Lambda_i)\,\mathrm{Reg}(\pi^*_\gamma,\Lambda_i,L_\gamma) \\
&= \sum_{i=0}^{\infty}\zeta(\Lambda_i)\lim_{\gamma\to1}\mathrm{Reg}(\pi^*_\gamma,\Lambda_i,L_\gamma) \\
&\ge \zeta(\Lambda_j)\lim_{\gamma\to1}\mathrm{Reg}(\pi^*_\gamma,\Lambda_j,L_\gamma) > 0.
\end{aligned}$$
The first equality follows from the definition of infra-regret, and the second from the definition of expectation with respect to a measure on a countable set. The third equality is an application of the Lebesgue dominated convergence theorem, using the constant function 1 as a dominating function. The inequalities follow from the facts that all terms in the sum are non-negative, that $\lim_{\gamma\to1}\mathrm{Reg}(\pi^*_\gamma,\Lambda_j,L_\gamma)>0$, and that ζ(Λj)≠0, which holds because ζ is non-dogmatic.
We will now show that π∗γ is not infra-Bayes optimal with respect to ζ. There exists γ0∈[0,1) such that
$$\mathrm{IBReg}(\pi^*_{\gamma_0},\zeta,L_{\gamma_0}) > \mathrm{IBReg}(\pi_{\gamma_0},\zeta,L_{\gamma_0}).$$
Therefore,
$$\mathbb{E}_{\Lambda\sim\zeta}\,\mathrm{Reg}(\pi^*_{\gamma_0},\Lambda,L_{\gamma_0})-\mathbb{E}_{\Lambda\sim\zeta}\,\mathrm{Reg}(\pi_{\gamma_0},\Lambda,L_{\gamma_0})>0.$$
By linearity,
$$\mathbb{E}_{\Lambda\sim\zeta}\!\left[\mathrm{Reg}(\pi^*_{\gamma_0},\Lambda,L_{\gamma_0})-\mathrm{Reg}(\pi_{\gamma_0},\Lambda,L_{\gamma_0})\right]>0.$$
By definition,
$$\mathbb{E}_{\Lambda\sim\zeta}\!\left[\mathbb{E}_{\Lambda(\pi^*_{\gamma_0})}[L_{\gamma_0}]-\min_{\pi\in\Pi}\mathbb{E}_{\Lambda(\pi)}[L_{\gamma_0}]-\left(\mathbb{E}_{\Lambda(\pi_{\gamma_0})}[L_{\gamma_0}]-\min_{\pi\in\Pi}\mathbb{E}_{\Lambda(\pi)}[L_{\gamma_0}]\right)\right]>0.$$
Simplifying and applying linearity of expectation, we obtain
$$\mathbb{E}_{\Lambda\sim\zeta}\!\left[\mathbb{E}_{\Lambda(\pi^*_{\gamma_0})}[L_{\gamma_0}]-\mathbb{E}_{\Lambda(\pi_{\gamma_0})}[L_{\gamma_0}]\right]=\mathbb{E}_{\Lambda\sim\zeta}\,\mathbb{E}_{\Lambda(\pi^*_{\gamma_0})}[L_{\gamma_0}]-\mathbb{E}_{\Lambda\sim\zeta}\,\mathbb{E}_{\Lambda(\pi_{\gamma_0})}[L_{\gamma_0}]>0.$$
Therefore, π∗γ is not an infra-Bayes optimal family of policies. □
Acknowledgements
Many thanks to Vanessa Kosoy and Marcus Ogren for their constructive feedback on the initial draft.
References
[1] Villani, Cédric, Optimal Transport: Old and New. Springer Berlin, Heidelberg, 2009.
[2] Carothers, N. L., “The Stone-Weierstrass Theorem.” In Real Analysis. Cambridge University Press, 2012.