Bounded Solomonoff induction using optimal predictor schemes
Most of the content of this post was covered by the talk I gave at the Los Angeles MIRIx in October, minus the proofs and a minor amendment to Theorem 1 (the role of $\Delta^2_{\mathrm{sqp},\phi}$).
We define variants of the concept of a generatable distributional estimation problem and show that these variants also admit a uniformly universal optimal predictor scheme. We then show how to use this to implement a form of bounded Solomonoff induction.
Results
We have previously defined a “word ensemble” to be a collection $\{\mu^k\}_{k \in \mathbb{N}}$ of probability measures on $\{0,1\}^*$ s.t. for some polynomial $p$, $\operatorname{supp} \mu^k \subseteq \{0,1\}^{\leq p(k)}$. This was convenient when the formalism was based on Boolean circuits, but it is unnecessary for Turing machines: it is enough to allow the Turing machine to read only the beginning of the input, and thus to halt in time arbitrarily smaller than the length of the input. In the following we will use “word ensemble” to mean an arbitrary sequence of probability measures on $\{0,1\}^*$, allow such word ensembles in distributional estimation problems, etc.
All proofs are in the Appendix.
We start by defining “$\Delta$(log)-sampler” and “$\Delta$(log)-generator” for $\Delta$ an error space of rank 2 (they were previously defined for an error space of rank 1). Fix such an error space $\Delta$.
Definition 1
Consider $\mu$ a word ensemble. A (poly,log)-bischeme $\hat{S}$ of signature $1 \to \{0,1\}^*$ is called a $\Delta$(log)-sampler for $\mu$ when
$$\sum_{x \in \{0,1\}^*} \left| \mu^k(x) - \Pr[\hat{S}^{kj} = x] \right| \in \Delta$$
When $\hat{S}$ has no advice, it is called a $\Delta$-sampler.
A $\mu$ that admits such an $\hat{S}$ is called $\Delta$(log)-sampleable or $\Delta$-sampleable correspondingly.
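To make the sum in Definition 1 concrete, here is a minimal numerical sketch (ours, not from the post), assuming both the target measure and the sampler's output distribution are finitely supported and given explicitly:

```python
def sampler_error(mu_k, sampler_kj):
    """The L1 sum from Definition 1 for one pair (k, j): both arguments
    are dicts mapping words to probabilities (finite support assumed).
    A Delta(log)-sampler must make this quantity, as a function of
    (k, j), a member of the error space Delta."""
    support = set(mu_k) | set(sampler_kj)
    return sum(abs(mu_k.get(x, 0.0) - sampler_kj.get(x, 0.0)) for x in support)

# Toy usage: a sampler off by 0.1 on each of two words gives error 0.2.
print(sampler_error({"0": 0.5, "1": 0.5}, {"0": 0.6, "1": 0.4}))
```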
Definition 2
Consider $(f,\mu)$ a distributional estimation problem. A (poly,log)-bischeme $\hat{G}$ of signature $1 \to \{0,1\}^* \times [0,1]$ is called a $\Delta$(log)-generator for $(f,\mu)$ when
(i) $\hat{G}_1$ is a $\Delta$(log)-sampler for $\mu$.
(ii) $\operatorname{E}_{U_{r_G(k,j)}}\left[\left(\hat{G}^{kj}(y)_2 - f\left(\hat{G}^{kj}(y)_1\right)\right)^2\right] \in \Delta$
When $\hat{G}$ has no advice, it is called a $\Delta$-generator.
An $(f,\mu)$ that admits such a $\hat{G}$ is called $\Delta$(log)-generatable or $\Delta$-generatable correspondingly.
We now introduce a variant in which the generator matches f on average but is allowed to have arbitrary variance.
Definition 3
Consider $(f,\mu)$ a distributional estimation problem. A (poly,log)-bischeme $\hat{G}$ of signature $1 \to \{0,1\}^* \times [0,1]$ is called a weak $\Delta$(log)-generator for $(f,\mu)$ when
(i) $\hat{G}_1$ is a $\Delta$(log)-sampler for $\mu$.
(ii) $\operatorname{E}_{U_{r_G(k,j)}}\left[\left(\operatorname{E}_{U_{r_G(k,j)}}\left[\hat{G}^{kj}(y')_2 \mid \hat{G}^{kj}(y')_1 = \hat{G}^{kj}(y)_1\right] - f\left(\hat{G}^{kj}(y)_1\right)\right)^2\right] \in \Delta$
When $\hat{G}$ has no advice, it is called a weak $\Delta$-generator.
An $(f,\mu)$ that admits such a $\hat{G}$ is called weakly $\Delta$(log)-generatable or weakly $\Delta$-generatable correspondingly.
Proposition 1
Consider $(f,\mu)$ a distributional estimation problem. Any $\Delta$(log)-generator for $(f,\mu)$ is in particular a weak $\Delta$(log)-generator for $(f,\mu)$.
We now show that weak generators enable an existence theorem for optimal predictor schemes of the same form as generators.
Construction 1
Given $\phi \in \Phi$ we define $\Delta^2_{\mathrm{sqp},\phi}$ to be the set of bounded functions $\delta: \mathbb{N}^2 \to \mathbb{R}^{\geq 0}$ s.t. for any $\psi \in \Phi$, if $\psi \leq \phi$ then $\sup_{j \geq t_\psi(k)} \delta(k,j) \in \Delta^1_\psi$. It is easy to see that $\Delta^2_{\mathrm{sqp},\phi}$ is an error space.
We define $\Delta^2_{\mathrm{sqp}} := \bigcap_{\phi \in \Phi} \Delta^2_{\mathrm{sqp},\phi}$. $\Delta^2_{\mathrm{sqp}}$ is also an error space.
Proposition 2
$$\Delta^2_{\mathrm{sqp},\phi} \subseteq \Delta^2_{\mathrm{ll},\phi}$$
$$\Delta^2_{\mathrm{sqp}} \subseteq \Delta^2_{\mathrm{ll}}$$
Construction 2
We describe an oracle machine $\hat{\Lambda}$ that accepts an oracle of signature $O: \mathbb{N}^2 \to \{0,1\}^* \times [0,1]$ and implements a (poly,0)$^O$-predictor scheme. Consider the computation of $\hat{\Lambda}[O]^{kj}(x)$.
We loop over the first $j$ words in lexicographic order. For each word $Q$, we loop over $j^2$ “test runs”. At test run $i$, we generate $(x_i \in \{0,1\}^*, t_i \in [0,1])$ by evaluating $O^{kj}$ (we treat $O$ as random, that is, we don't expect repeated answers to the same query to coincide). We then sample $z_i$ from $U_j$ and compute $s_i := \operatorname{ev}^j(Q, x_i, z_i)$. At the end of the test runs, we compute the average error $\epsilon(Q) := \frac{1}{j^2} \sum_i (s_i - t_i)^2$. At the end of the loop over programs, the program $Q^*$ with the lowest error is selected and the output $\operatorname{ev}^j(Q^*, x)$ is produced.
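The selection loop is simple enough to sketch in code. The following Python is an illustrative rendering (ours) of Construction 2, not the formal bischeme: `oracle` and `ev` are assumed interfaces standing in for $O^{kj}$ and the bounded universal evaluator $\operatorname{ev}^j$, and the resource bounds are not modeled.

```python
import random

def first_words(n):
    """First n binary words in shortlex order: '', '0', '1', '00', ..."""
    out, length = [], 0
    while True:
        for m in range(2 ** length):
            out.append(format(m, "b").zfill(length) if length > 0 else "")
            if len(out) == n:
                return out
        length += 1

def random_word(n):
    return "".join(random.choice("01") for _ in range(n))

def Lambda(oracle, ev, k, j, x):
    """Sketch of Lambda[O]^{kj}(x): select the program whose outputs best
    match the oracle's samples, then run it on the actual input."""
    best_Q, best_err = None, float("inf")
    for Q in first_words(j):                   # the first j words
        err = 0.0
        for _ in range(j * j):                 # j^2 test runs
            xi, ti = oracle(k, j)              # fresh oracle sample (x_i, t_i)
            si = ev(j, Q, xi, random_word(j))  # s_i := ev^j(Q, x_i, z_i)
            err += (si - ti) ** 2
        err /= j * j                           # average error eps(Q)
        if err < best_err:
            best_Q, best_err = Q, err
    return ev(j, best_Q, x, random_word(j))    # output for the selected Q*
```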
Theorem 1
Consider $(f,\mu)$ a distributional estimation problem, $\phi \in \Phi$ and $\hat{G}$ a weak $\Delta^2_{\mathrm{sqp},\phi}$(log)-generator for $(f,\mu)$. Then $\hat{\Lambda}[\hat{G}]$ is a $\Delta^2_{\mathrm{ll},\phi}$(poly,log)-optimal predictor scheme for $(f,\mu)$.
Corollary 1
Consider $(f,\mu)$ a distributional estimation problem and $\hat{G}$ a weak $\Delta^2_{\mathrm{ll}}$(log)-generator for $(f,\mu)$. Then $\hat{\Lambda}[\hat{G}]$ is a $\Delta^2_{\mathrm{ll}}$(poly,log)-optimal predictor scheme for $(f,\mu)$.
To demonstrate the difference between generators and weak generators, we give the following property of generators which weak generators don’t share.
Theorem 2
Consider $(f,\mu)$ a distributional estimation problem and $\hat{G}$ a corresponding $\Delta$(log)-generator. Suppose $\tau: [0,1] \to [0,1]$ is an $\alpha$-Hölder continuous function for some $\alpha \in (0,1]$ and $\hat{\tau}$ is a $[0,1]$-valued (poly,log)-bischeme s.t.
$$\forall x \in \{0,1\}^*: \operatorname{E}\left[\left(\hat{\tau}^{kj}(x) - \tau(\beta(x))\right)^2\right] \in \Delta$$
Define $\hat{G}_\tau$ by
$$\hat{G}^{kj}_\tau(y) := \left(\hat{G}^{kj}(y)_1, \hat{\tau}^{kj}\left(\hat{G}^{kj}(y)_2\right)\right)$$
Then $\hat{G}_\tau$ is a $\Delta$(log)-generator for $(\tau \circ f, \mu)$.
This means that a generator yields an entire “logical probability distribution” for $f(x)$ whereas a weak generator yields only a “logical expectation value.”
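As a toy illustration of the gap (our example, not from the post): take $f \equiv \frac{1}{2}$ and let the second output of $\hat{G}^{kj}$ be a fair coin in $\{0,1\}$, independent of the first. Then
$$\operatorname{E}\left[\hat{G}^{kj}(y')_2 \mid \hat{G}^{kj}(y')_1 = \hat{G}^{kj}(y)_1\right] = \frac{1}{2} = f,$$
so condition (ii) of Definition 3 holds with error $0$ and $\hat{G}$ is a weak generator, yet $\operatorname{E}[(\hat{G}^{kj}(y)_2 - f)^2] = \frac{1}{4}$, so $\hat{G}$ is not a generator for any reasonable $\Delta$. Theorem 2 indeed fails for it: composing with $\tau(t) = t^2$ leaves the coin unchanged ($0^2 = 0$, $1^2 = 1$), so the conditional expectation stays $\frac{1}{2}$ while $(\tau \circ f) \equiv \frac{1}{4}$.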
The following construction shows how to use Theorem 2 to implement a form of bounded Solomonoff induction.
Construction 3
Given $k \in \mathbb{N}$, we define the probability distribution $\sigma^k$ on $\mathbb{N}$ by
$$\forall j \in \mathbb{N}: \Pr_{\sigma^k}[n \geq j] := \min\left(\frac{\log\log(k+3)}{\log\log(j+3)}, 1\right)$$
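Unpacking the definition (our remark, not in the post): the point masses are differences of consecutive tails, and since the tail equals $1$ for every $j \leq k$, the distribution $\sigma^k$ assigns no mass below $k$:
$$\sigma^k(n) = \min\left(\frac{\log\log(k+3)}{\log\log(n+3)}, 1\right) - \min\left(\frac{\log\log(k+3)}{\log\log(n+4)}, 1\right), \qquad \sigma^k(n) = 0 \text{ for } n < k.$$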
We define the probability distribution $\mu^k_{\mathrm{BSI}}$ on $\{0,1\}^k \sqcup \{\bot\}$ by
$$\forall x \in \{0,1\}^k: \mu^k_{\mathrm{BSI}}(x) := \sum_{n=0}^{\infty} \sigma^k(n) \frac{\#\{(y \in \{0,1\}^n, z \in \{0,1\}^n) \mid \operatorname{ev}^n(y,z)_{<k} = x\}}{4^n}$$
$$\mu^k_{\mathrm{BSI}}(\bot) := \sum_{n=0}^{\infty} \sigma^k(n) \frac{\#\{(y \in \{0,1\}^n, z \in \{0,1\}^n) \mid |\operatorname{ev}^n(y,z)| < k\}}{4^n}$$
Given $i \in \{0,1\}$, we define $f_{\mathrm{BSI}(i)}: \{0,1\}^* \sqcup \{\bot\} \to [0,1]$ by
$$\forall x \in \{0,1\}^*: f_{\mathrm{BSI}(i)}(x) := \sum_{n=0}^{\infty} \sigma^{|x|}(n) \frac{\#\{(y \in \{0,1\}^n, z \in \{0,1\}^n) \mid \operatorname{ev}^n(y,z)_{\leq |x|} = xi\}}{4^n}$$
$$f_{\mathrm{BSI}(i)}(\bot) := 0$$
Using an appropriate encoding, $\{0,1\}^* \sqcup \{\bot\}$ can be identified with $\{0,1\}^*$ and $(f_{\mathrm{BSI}(i)}, \mu_{\mathrm{BSI}})$ regarded as a distributional estimation problem.
Proposition 3
$(f_{\mathrm{BSI}(i)}, \mu_{\mathrm{BSI}})$ is weakly $\Delta^2_{\mathrm{sqp}}$-generatable.
The following claim is likely to be a theorem, but we haven’t worked out the details of the proof.
Claim
Consider a random Turing machine $M$ which produces an infinite sequence $\omega_M$ s.t. the time to compute the $k$-th bit is bounded by a quasi-polynomial in $k$. Given $k \in \mathbb{N}$, define the probability measure $\mu^k_M$ on $\{0,1\}^*$ by
$$\mu^k_M(x) := \Pr[(\omega_M)_{<k} = x]$$
Given $i \in \{0,1\}$, define $f_{M(i)}: \{0,1\}^* \to [0,1]$ by
$$f_{M(i)}(x) := \Pr[(\omega_M)_{|x|} = i \mid (\omega_M)_{<|x|} = x]$$
Then the identity function is a $\Delta^2_{\mathrm{avg}}$-pseudo-invertible reduction of $(f_{M(i)}, \mu_M)$ to $(f_{\mathrm{BSI}(i)}, \mu_{\mathrm{BSI}})$.
Corollary 2
Suppose $\hat{G}_{\mathrm{BSI}(i)}$ is a weak $\Delta^2_{\mathrm{ll}}$-generator for $(f_{\mathrm{BSI}(i)}, \mu_{\mathrm{BSI}})$. Then $\hat{\Lambda}[\hat{G}_{\mathrm{BSI}(i)}]$ is not only a $\Delta^2_{\mathrm{ll}}$(poly,log)-optimal predictor scheme for $(f_{\mathrm{BSI}(i)}, \mu_{\mathrm{BSI}})$ but also a $\Delta^2_{\mathrm{avg}}$(poly,log)-optimal predictor scheme for $(f_{M(i)}, \mu_M)$, for any $M$ as above.
Note 1
It is straightforward to extend this approach to predicting functions of the next several bits rather than only the next bit.
Note 2
$M$ computes $\omega_M$ directly rather than recursively (that is, $(\omega_M)_k$ can depend directly on the random coin flips used to generate the previous bits rather than only on the previous bits themselves), which means this approach avoids the problem presented by Taylor. That is, if we construct an agent that uses $\hat{\Lambda}$ to compute expected reward as a function of action based on previous observations, then in Taylor's example the correct action will be selected (since in that example the reward can be computed in polynomial time, and by the reduction and uniqueness theorems $\hat{\Lambda}$ will produce an asymptotically exact prediction).
Appendix
Proof of Proposition 1
Suppose $\hat{G}$ is a $\Delta$(log)-generator for $(f,\mu)$. Define $\Sigma \subseteq \mathbb{N}^2 \times \{0,1\}^*$ by
$$\Sigma := \bigcup_{k,j \in \mathbb{N}} \{(k,j)\} \times \hat{G}^{kj}_1\left(\{0,1\}^{r_G(k,j)}\right)$$
and $F: \Sigma \to [0,1]$ by
$$F^{kj}(x) := \operatorname{E}_{U_{r_G(k,j)}}\left[\hat{G}^{kj}(y)_2 \mid \hat{G}^{kj}(y)_1 = x\right]$$
We have
$$\operatorname{E}\left[(\hat{G}_2 - f(\hat{G}_1))^2\right] = \operatorname{E}\left[(\hat{G}_2 - F(\hat{G}_1) + F(\hat{G}_1) - f(\hat{G}_1))^2\right]$$
$$= \operatorname{E}\left[(F(\hat{G}_1) - f(\hat{G}_1))^2\right] - 2\operatorname{E}\left[(F(\hat{G}_1) - f(\hat{G}_1))(F(\hat{G}_1) - \hat{G}_2)\right] + \operatorname{E}\left[(F(\hat{G}_1) - \hat{G}_2)^2\right]$$
The middle term on the right hand side satisfies
$$\begin{aligned} \operatorname{E}_{U_{r_G}}\left[(F(\hat{G}(y)_1) - f(\hat{G}(y)_1))(F(\hat{G}(y)_1) - \hat{G}(y)_2)\right] &= \operatorname{E}_{U_{r_G}}\left[\operatorname{E}_{U_{r_G}}\left[(F(\hat{G}(y)_1) - f(\hat{G}(y)_1))(F(\hat{G}(y)_1) - \hat{G}(y')_2) \mid \hat{G}(y')_1 = \hat{G}(y)_1\right]\right] \\ &= \operatorname{E}_{U_{r_G}}\left[(F(\hat{G}(y)_1) - f(\hat{G}(y)_1)) \operatorname{E}_{U_{r_G}}\left[F(\hat{G}(y)_1) - \hat{G}(y')_2 \mid \hat{G}(y')_1 = \hat{G}(y)_1\right]\right] \\ &= \operatorname{E}_{U_{r_G}}\left[(F(\hat{G}(y)_1) - f(\hat{G}(y)_1))(F(\hat{G}(y)_1) - F(\hat{G}(y)_1))\right] \\ &= 0 \end{aligned}$$
We get
$$\operatorname{E}\left[(\hat{G}_2 - f(\hat{G}_1))^2\right] = \operatorname{E}\left[(F(\hat{G}_1) - f(\hat{G}_1))^2\right] + \operatorname{E}\left[(F(\hat{G}_1) - \hat{G}_2)^2\right]$$
$$\operatorname{E}\left[(\hat{G}_2 - f(\hat{G}_1))^2\right] \geq \operatorname{E}\left[(F(\hat{G}_1) - f(\hat{G}_1))^2\right]$$
Since the left hand side is in $\Delta$,
$$\operatorname{E}\left[(F(\hat{G}_1) - f(\hat{G}_1))^2\right] \in \Delta$$
which is exactly condition (ii) of Definition 3.
Proof of Proposition 2
Consider $\delta \in \Delta^2_{\mathrm{sqp},\phi}$ and $\psi \in \Phi$, $\psi \leq \phi$. We have
$$\operatorname{E}_{\lambda^k_\psi}[\delta(k,j)] = \Pr_{\lambda^k_\psi}\left[j < t_{\psi^{1/2}}(k)\right] \operatorname{E}_{\lambda^k_\psi}\left[\delta(k,j) \mid j < t_{\psi^{1/2}}(k)\right] + \Pr_{\lambda^k_\psi}\left[j \geq t_{\psi^{1/2}}(k)\right] \operatorname{E}_{\lambda^k_\psi}\left[\delta(k,j) \mid j \geq t_{\psi^{1/2}}(k)\right]$$
$$\operatorname{E}_{\lambda^k_\psi}[\delta(k,j)] \leq \psi(k)^{-\frac{1}{2}} (\sup \delta) + \sup_{j \geq t_{\psi^{1/2}}(k)} \delta(k,j)$$
For $k \gg 0$, $\psi^{\frac{1}{2}} \leq \phi$, hence $\sup_{j \geq t_{\psi^{1/2}}(k)} \delta(k,j) \in \Delta^1_{\psi^{1/2}} = \Delta^1_\psi$. We conclude that $\operatorname{E}_{\lambda^k_\psi}[\delta(k,j)] \in \Delta^1_\psi$ and therefore $\delta \in \Delta^2_{\mathrm{ll},\phi}$.
Proposition 4
Suppose $\hat{G}$ is a weak $\Delta$(log)-generator for $(f,\mu)$. Then there is $\delta \in \Delta$ s.t. for any $k,j \in \mathbb{N}$ and bounded $g: \{0,1\}^* \to \mathbb{R}$
$$\left|\operatorname{E}_{U_{r_G(k,j)}}\left[g(\hat{G}^{kj}(y)_1)\left(\hat{G}^{kj}(y)_2 - f(\hat{G}^{kj}(y)_1)\right)\right]\right| \leq (\sup|g|)\, \delta(k,j)$$
Proof of Proposition 4
$$\begin{aligned} \operatorname{E}_{U_{r_G(k,j)}}\left[g(\hat{G}^{kj}(y)_1)(\hat{G}^{kj}(y)_2 - f(\hat{G}^{kj}(y)_1))\right] &= \operatorname{E}_{U_{r_G(k,j)}}\left[\operatorname{E}_{U_{r_G(k,j)}}\left[g(\hat{G}^{kj}(y)_1)(\hat{G}^{kj}(y')_2 - f(\hat{G}^{kj}(y)_1)) \mid \hat{G}^{kj}(y')_1 = \hat{G}^{kj}(y)_1\right]\right] \\ &= \operatorname{E}_{U_{r_G(k,j)}}\left[g(\hat{G}^{kj}(y)_1) \operatorname{E}_{U_{r_G(k,j)}}\left[\hat{G}^{kj}(y')_2 - f(\hat{G}^{kj}(y)_1) \mid \hat{G}^{kj}(y')_1 = \hat{G}^{kj}(y)_1\right]\right] \\ &= \operatorname{E}_{U_{r_G(k,j)}}\left[g(\hat{G}^{kj}(y)_1)\left(\operatorname{E}_{U_{r_G(k,j)}}\left[\hat{G}^{kj}(y')_2 \mid \hat{G}^{kj}(y')_1 = \hat{G}^{kj}(y)_1\right] - f(\hat{G}^{kj}(y)_1)\right)\right] \end{aligned}$$
Property (ii) of $\hat{G}$ (together with the Cauchy–Schwarz inequality) implies the result.
Proposition 5
Suppose $\hat{G}$ is a weak $\Delta$(log)-generator for $(f,\mu)$. Then there is $\delta \in \Delta$ s.t. for any $k,j \in \mathbb{N}$, finite set $Z$, probability measure $\nu$ on $Z$ and $h: \{0,1\}^* \times Z \to [0,1]$
$$\left|\operatorname{E}_{U_{r_G(k,j)} \times \nu}\left[(h(\hat{G}^{kj}(y)_1, z) - \hat{G}^{kj}(y)_2)^2\right] - \operatorname{E}_{\mu^k \times \nu}\left[(h(x,z) - f(x))^2\right] - \operatorname{E}_{U_{r_G(k,j)}}\left[(\hat{G}^{kj}(y)_2 - f(\hat{G}^{kj}(y)_1))^2\right]\right| \leq \delta(k,j)$$
Proof of Proposition 5
$$\operatorname{E}_{U_{r_G(k,j)} \times \nu}\left[(h(\hat{G}^{kj}(y)_1, z) - \hat{G}^{kj}(y)_2)^2\right] = \operatorname{E}_{U_{r_G(k,j)} \times \nu}\left[(h(\hat{G}^{kj}(y)_1, z) - f(\hat{G}^{kj}(y)_1) + f(\hat{G}^{kj}(y)_1) - \hat{G}^{kj}(y)_2)^2\right]$$
Suppressing the arguments $y$ and $z$:
$$\operatorname{E}\left[(h(\hat{G}^{kj}_1) - \hat{G}^{kj}_2)^2\right] = \operatorname{E}\left[(h(\hat{G}^{kj}_1) - f(\hat{G}^{kj}_1))^2\right] + \operatorname{E}\left[(\hat{G}^{kj}_2 - f(\hat{G}^{kj}_1))^2\right] - 2\operatorname{E}\left[(h(\hat{G}^{kj}_1) - f(\hat{G}^{kj}_1))(\hat{G}^{kj}_2 - f(\hat{G}^{kj}_1))\right]$$
Since $\hat{G}_1$ is a $\Delta$(log)-sampler for $\mu$, the first term on the right hand side satisfies
$$\left|\operatorname{E}_{U_{r_G(k,j)} \times \nu}\left[(h(\hat{G}^{kj}_1, z) - f(\hat{G}^{kj}_1))^2\right] - \operatorname{E}_{\mu^k \times \nu}\left[(h(x,z) - f(x))^2\right]\right| \leq \delta_1(k,j)$$
where $\delta_1 \in \Delta$ doesn't depend on $h$. Applying Proposition 4 to the last term on the right hand side completes the proof.
Proposition 6
Given $\phi \in \Phi$ and $\delta \in \Delta^2_{\mathrm{sqp},\phi}$, define $\delta_{\mathrm{mon}}(k,j) := \sup_{m \geq j} \delta(k,m)$. Then $\delta_{\mathrm{mon}} \in \Delta^2_{\mathrm{sqp},\phi}$.
Proof of Proposition 6
Consider $\psi \in \Phi$, $\psi \leq \phi$. Then
$$\sup_{j \geq t_\psi(k)} \delta_{\mathrm{mon}}(k,j) = \sup_{j \geq t_\psi(k)} \sup_{m \geq j} \delta(k,m) = \sup_{j \geq t_\psi(k)} \delta(k,j) \in \Delta^1_\psi$$
Proof of Theorem 1
Consider $\hat{P} = (P, r, a)$ a (poly,log)-predictor scheme. Choose a polynomial $p: \mathbb{N}^2 \to \mathbb{N}$ satisfying $p(k,j) \geq j$ s.t. evaluating $\Lambda[G]^{k,p(k,j)}$ involves running $\hat{P}^{kj}$ until it halts “naturally” (such a $p$ exists because $\hat{P}$ runs in at most polynomial time and has at most logarithmic advice). Given $i,j \in \mathbb{N}$, consider the execution of $\Lambda[G]^{k,p(k,j)+i}$. The standard deviation of $\epsilon(\hat{P}^{kj})$ with respect to the internal coin tosses of $\Lambda$ is at most $(p(k,j)+i)^{-1}$. According to Proposition 5, the expectation value is $\operatorname{E}[(\hat{P}^{kj}-f)^2] + \operatorname{E}[(\hat{G}^{k,p(k,j)+i}_2 - f(\hat{G}^{k,p(k,j)+i}_1))^2] + \gamma_P(k,p(k,j)+i)$ where $|\gamma_P| \leq \delta$ for some $\delta \in \Delta^2_{\mathrm{sqp},\phi}$ that doesn't depend on $P$. Thanks to Proposition 6 we can assume without loss of generality that $\delta$ is non-increasing in the second argument, and in particular $\delta(k,p(k,j)+i) \leq \delta(k,j)$. Denote $\alpha^{kj} := \operatorname{E}[(\hat{G}^{k,p(k,j)+i}_2 - f(\hat{G}^{k,p(k,j)+i}_1))^2]$. By Chebyshev's inequality,
$$\Pr\left[\epsilon(\hat{P}^{kj}) \geq \operatorname{E}[(\hat{P}^{kj}-f)^2] + \alpha^{kj} + \delta(k,j) + (p(k,j)+i)^{-\frac{1}{2}}\right] \leq (p(k,j)+i)^{-1}$$
Hence
$$\Pr\left[\epsilon(Q^*) \geq \operatorname{E}[(\hat{P}^{kj}-f)^2] + \alpha^{kj} + \delta(k,j) + (p(k,j)+i)^{-\frac{1}{2}}\right] \leq (p(k,j)+i)^{-1}$$
The standard deviation of $\epsilon(Q)$ for any $Q$ is also at most $(p(k,j)+i)^{-1}$. The expectation value is $\operatorname{E}[(\operatorname{ev}^{p(k,j)+i}(Q)-f)^2] + \alpha^{kj} + \gamma_Q$ where $|\gamma_Q| \leq \delta(k,j)$. Therefore
$$\Pr\left[\exists Q < p(k,j)+i: \epsilon(Q) \leq \operatorname{E}[(\operatorname{ev}^{p(k,j)+i}(Q)-f)^2] + \alpha^{kj} - \delta(k,j) - (p(k,j)+i)^{-\frac{1}{4}}\right] \leq (p(k,j)+i) \cdot (p(k,j)+i)^{-\frac{3}{2}} = (p(k,j)+i)^{-\frac{1}{2}}$$
The extra $p(k,j)+i$ factor comes from summing probabilities over $p(k,j)+i$ programs. Combining, we get
$$\Pr\left[\operatorname{E}[(\operatorname{ev}^{p(k,j)+i}(Q^*)-f)^2] \geq \operatorname{E}[(\hat{P}^{kj}-f)^2] + 2\delta(k,j) + (p(k,j)+i)^{-\frac{1}{2}} + (p(k,j)+i)^{-\frac{1}{4}}\right] \leq (p(k,j)+i)^{-1} + (p(k,j)+i)^{-\frac{1}{2}}$$
$$\operatorname{E}[(\Lambda[G]^{k,p(k,j)+i}-f)^2] \leq \operatorname{E}[(\hat{P}^{kj}-f)^2] + 2\delta(k,j) + (p(k,j)+i)^{-1} + 2(p(k,j)+i)^{-\frac{1}{2}} + (p(k,j)+i)^{-\frac{1}{4}}$$
$$\operatorname{E}[(\Lambda[G]^{k,p(k,j)+i}-f)^2] \leq \operatorname{E}[(\hat{P}^{kj}-f)^2] + 2\delta(k,j) + p(k,j)^{-1} + 2p(k,j)^{-\frac{1}{2}} + p(k,j)^{-\frac{1}{4}}$$
Using Proposition 2 and the Lemma about $\Delta^2_{\mathrm{ll},\phi}$, we get the desired result.
Proof of Theorem 2
Up to an error term in $\Delta$ coming from the defining property of $\hat{\tau}$, we have
$$\operatorname{E}_{U_{r_G(k,j)}}\left[(\hat{G}_\tau(y)_2 - (\tau \circ f)(\hat{G}_\tau(y)_1))^2\right] = \operatorname{E}_{U_{r_G(k,j)}}\left[(\tau(\hat{G}(y)_2) - \tau(f(\hat{G}(y)_1)))^2\right]$$
Since $\tau$ is $\alpha$-Hölder continuous, there is a constant $c > 0$ which doesn't depend on $k,j$ s.t.
$$\operatorname{E}_{U_{r_G(k,j)}}\left[(\tau(\hat{G}(y)_2) - \tau(f(\hat{G}(y)_1)))^2\right] \leq c \operatorname{E}_{U_{r_G(k,j)}}\left[(\hat{G}(y)_2 - f(\hat{G}(y)_1))^{2\alpha}\right] \leq c \operatorname{E}_{U_{r_G(k,j)}}\left[(\hat{G}(y)_2 - f(\hat{G}(y)_1))^2\right]^{\alpha}$$
where the second inequality is Jensen's. Since $\hat{G}$ is a $\Delta$(log)-generator,
$$\operatorname{E}_{U_{r_G(k,j)}}\left[(\hat{G}_\tau(y)_2 - (\tau \circ f)(\hat{G}_\tau(y)_1))^2\right] \in \Delta$$
Proposition 7
$$\min\left(\frac{\log\log(k+3)}{\log\log(j+3)}, 1\right) \in \Delta^2_{\mathrm{sqp}}$$
Proof of Proposition 7
Consider $\phi \in \Phi$. We have
$$\sup_{j \geq t_\phi(k)} \min\left(\frac{\log\log(k+3)}{\log\log(j+3)}, 1\right) = \min\left(\frac{\log\log(k+3)}{\log\log(t_\phi(k)+3)}, 1\right)$$
Choosing any $\epsilon \in (0,1)$ we get
$$\lim_{k \to \infty} \phi(k)^{\epsilon} \sup_{j \geq t_\phi(k)} \min\left(\frac{\log\log(k+3)}{\log\log(j+3)}, 1\right) = \lim_{k \to \infty} \phi(k)^{\epsilon - 1} = 0$$
Proof of Proposition 3
Given $k,j \in \mathbb{N}$, define $\sigma^{kj}$ to be the probability distribution on $\mathbb{N}$ given by $\sigma^{kj} := \sigma^k \mid n \leq \max(k,j)$, i.e. $\sigma^k$ conditioned on $n \leq \max(k,j)$. We describe the execution of the weak generator $\hat{G}^{kj}_{\mathrm{BSI}(i)}$.
$n$ is sampled from $\sigma^{kj}$ (more precisely, from a probability distribution that differs from $\sigma^{kj}$ by a rounding error of $2^{-j} \in \Delta^2_{\mathrm{sqp}}$). $y, z$ are sampled from $U_n$. $x := \operatorname{ev}^n(y,z)$ is computed. If $|x| < k$, $(\bot, 0)$ is produced. If $|x| > k$ and $x_k = i$, $(x_{<k}, 1)$ is produced. Otherwise, $(x_{<k}, 0)$ is produced.
It is easy to see $\hat{G}_{\mathrm{BSI}(i)}$ generates $(f_{\mathrm{BSI}(i)}, \mu_{\mathrm{BSI}})$ up to an error of order $\Pr_{\sigma^k}[n > \max(k,j)] < \min\left(\frac{\log\log(k+3)}{\log\log(j+3)}, 1\right)$. By Proposition 7, it is therefore a weak $\Delta^2_{\mathrm{sqp}}$-generator for $(f_{\mathrm{BSI}(i)}, \mu_{\mathrm{BSI}})$.
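For concreteness, here is an illustrative Python rendering (ours) of the weak generator just described. `ev` is an assumed interface standing in for the bounded universal evaluator $\operatorname{ev}^n$, the rounding issue in sampling $n$ is ignored, resource bounds are not modeled, and `None` stands in for $\bot$.

```python
import math
import random

def sigma_tail(k, n):
    """Pr_{sigma^k}[n' >= n] = min(loglog(k+3)/loglog(n+3), 1)."""
    return min(math.log(math.log(k + 3)) / math.log(math.log(n + 3)), 1.0)

def G_BSI(i, k, j, ev):
    """One run of the weak generator from the proof of Proposition 3."""
    N = max(k, j)
    # pmf of sigma^{kj} = sigma^k conditioned on n <= N; the masses are
    # differences of consecutive tails (zero below k, as noted above)
    pmf = [sigma_tail(k, n) - sigma_tail(k, n + 1) for n in range(N + 1)]
    n = random.choices(range(N + 1), weights=pmf)[0]
    y = "".join(random.choice("01") for _ in range(n))   # y sampled from U_n
    z = "".join(random.choice("01") for _ in range(n))   # z sampled from U_n
    x = ev(n, y, z)                                      # x := ev^n(y, z)
    if len(x) < k:
        return (None, 0.0)                               # (bottom, 0)
    if len(x) > k and x[k] == str(i):
        return (x[:k], 1.0)                              # (x_{<k}, 1)
    return (x[:k], 0.0)                                  # (x_{<k}, 0)
```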