Most of the content of this post was covered by the talk I gave in Los Angeles MIRIx in October, minus the proofs and a minor amendment of Theorem 1 (the role of $Δ_{s q p, ϕ}^{2}$ ).

We define variants of the concept of generatable distributional estimation problem and show that these variants also admits a uniformly universal optimal predictor scheme. We show how to use this to implement a form of bounded Solomonoff induction.

Results

We have previously defined a “word ensemble” to be a collection ${μ^{k}}_{k \in N}$ of probability measures on ${0, 1}^{*}$ s.t. for some polynomial $p$ , $supp μ^{k} \subseteq {0, 1}^{\leq p (k)}$ . This was convenient when the formalism was based on Boolean circuits but unnecessary for Turing machines. It is enough to assume that the Turing machine is allowed to read only the beginning of the input and thus halt in time arbitrarily smaller than the length of the input. In the following we will use “word ensemble” to mean an arbitrary sequence of probability measures on ${0, 1}^{*}$ , allow such word ensembles in distributional estimation problems etc.

All proofs are in the Appendix.

We start by defining “ $Δ (l o g)$ -sampler” and “ $Δ (l o g)$ -generator” for $Δ$ an error space of rank 2 (they were previously defined for an error space of rank 1). Fix such an error space.

Definition 1

Consider $μ$ a word ensemble. A $(p o l y, l o g)$ -bischeme $^S$ of signature $1 \to {0, 1}^{*}$ is called a $Δ (l o g)$ -sampler for $μ$ when

$∑x∈{0,1}∗|μk(x)−Pr[^Skj=x]|∈Δ$

When $^S$ has no advice, it is called a $Δ$ -sampler.

$μ$ which admits such $^S$ is called $Δ (l o g)$ -sampleable or $Δ$ -sampleable correspondingly.

Definition 2

Consider $(f, μ)$ a distributional estimation problem. A $(p o l y, l o g)$ -bischeme $^G$ of signature $1 \to {0, 1}^{*} \times [0, 1]$ is called a $Δ (l o g)$ -generator for $(f, μ)$ when

(i) ${^G}_{1}$ is a $Δ (l o g)$ -sampler for $μ$ .

(ii) $E_{U^{r_{G} (k, j)}} [({^G}^{k j} (y)_{2} - f ({^G}^{k j} (y)_{1}))^{2}] \in Δ$

When $^G$ has no advice, it is called a $Δ$ -generator.

$(f, μ)$ which admits such $^G$ is called $Δ (l o g)$ -generatable or $Δ$ -generatable correspondingly.

We now introduce a variant in which the generator matches $f$ on average but is allowed to have arbitrary variance.

Definition 3

Consider $(f, μ)$ a distributional estimation problem. A $(p o l y, l o g)$ -bischeme $^G$ of signature $1 \to {0, 1}^{*} \times [0, 1]$ is called a weak $Δ (l o g)$ -generator for $(f, μ)$ when

(i) ${^G}_{1}$ is a $Δ (l o g)$ -sampler for $μ$ .

(ii) $E_{U^{r_{G} (k, j)}} [(E_{U^{r_{G} (k, j)}} [{^G}^{k j} (y^{'})_{2} ∣ {^G}^{k j} (y^{'})_{1} = {^G}^{k j} (y)_{1}] - f ({^G}^{k j} (y)_{1}))^{2}] \in Δ$

When $^G$ has no advice, it is called a weak $Δ$ -generator.

$(f, μ)$ which admits such $^G$ is called weakly $Δ (l o g)$ -generatable or weakly $Δ$ -generatable correspondingly.

Proposition 1

Consider $(f, μ)$ a distributional estimation problem. Any $Δ (l o g)$ -generator for $(f, μ)$ is in particular a weak $Δ (l o g)$ -generator for $(f, μ)$ .

We now show that weak generators enable an existence theorem for optimal predictor schemes of the same form as generators.

Construction 1

Given $ϕ \in Φ$ we define $Δ_{s q p, ϕ}^{2}$ to be the set of bounded functions $δ : N^{2} \to R^{\geq 0}$ s.t. for any $ψ \in Φ$ , if $ψ \leq ϕ$ then ${sup}_{j \geq t_{ψ} (k)} δ (k, j) \in Δ_{ψ}^{1}$ . It is easy to see $Δ_{s q p, ϕ}^{2}$ is an error space.

We define $Δ_{s q p}^{2} := ⋂_{ϕ \in Φ} Δ_{s q p, ϕ}^{2}$ . $Δ_{s q p}^{2}$ is also an error space.

Proposition 2

$Δ_{s q p, ϕ}^{2} \subseteq Δ_{l l, ϕ}^{2}$

$Δ_{s q p}^{2} \subseteq Δ_{l l}^{2}$

Construction 2

We describe an oracle machine $^Λ$ that accepts an oracle of signature $O : N^{2} \to {0, 1}^{*} \times [0, 1]$ and implements a $(p o l y, 0)^{O}$ -predictor scheme. Consider the computation of $^Λ [O]^{k j} (x)$ .

We loop over the first $j$ words in lexicographic order. For each word $Q$ , we loop over $j^{2}$ “test runs”. At test run $i$ , we generate $(x_{i} \in {0, 1}^{*}, t_{i} \in [0, 1])$ by evaluating $O^{k j}$ (we treat $O$ as random, that is, we don’t expect repeated answers to the same query to coincide). We then sample $z_{i}$ from $U^{j}$ and compute $s_{i} := e v^{j} (Q, x_{i}, z_{i})$ . At the end of the test runs, we compute the average error $ϵ (Q) := \frac{1}{j k} \sum_{i} (s_{i} - t_{i})^{2}$ . At the end of the loop over programs, the program $Q^{*}$ with the lowest error is selected and the output $e v^{j} (Q^{*}, x)$ is produced.

Theorem 1

Consider $(f, μ)$ a distributional estimation problem, $ϕ \in Φ$ and $^G$ a weak $Δ_{s q p, ϕ}^{2} (l o g)$ -generator for $(f, μ)$ . Then, $^Λ [^G]$ is a $Δ_{l l, ϕ}^{2} (p o l y, l o g)$ -optimal predictor scheme for $(f, μ)$ .

Corollary 1

Consider $(f, μ)$ a distributional estimation problem and $^G$ a weak $Δ_{l l}^{2} (l o g)$ -generator for $(f, μ)$ . Then, $^Λ [^G]$ is a $Δ_{l l}^{2} (p o l y, l o g)$ -optimal predictor scheme for $(f, μ)$ .

To demonstrate the difference between generators and weak generators, we give the following property of generators which weak generators don’t share.

Theorem 2

Consider $(f, μ)$ a distributional estimation problem and $^G$ a corresponding $Δ (l o g)$ -generator. Suppose $τ : [0, 1] \to [0, 1]$ is an $α$ -Hoelder continuous function for $α \in (0, 1]$ and $^τ$ is a $[0, 1]$ -vallued $(p o l y, l o g)$ -bischeme s.t.

$∀x∈{0,1}∗:E[(^τkj(x)−τ(β(x)))2]∈Δ$

Define ${^G}_{τ}$ by

${^G}_{τ}^{k j} (y) := ({^G}^{k j} (y)_{1}, {^τ}^{k j} ({^G}^{k j} (y)_{2}))$

Then, ${^G}_{τ}$ is a $Δ (l o g)$ -generator for $(τ \circ f, μ)$ .

This means that a generator yields an entire “logical probability distribution” for $f (x)$ whereas a weak generator yields only a “logical expectation value.”

The following constructions shows how to use Theorem 2 to implement a form of bounded Solomonoff induction.

Construction 3

Given $k \in N$ , we define the probability distribution $σ^{k}$ on $N$ by

$\forall j \in N : P r_{σ^{k}} [n \geq j] := min (\frac{log log (k + 3)}{log log (j + 3)}, 1)$

We define the probability distribution $μ_{B S I}^{k}$ on ${0, 1}^{k} ⊔ {⊥}$ by

$\forall x \in {0, 1}^{k} : μ_{B S I}^{k} (x) := \infty \sum n = 0 σ^{k} (n) \frac{# {(y \in {0, 1}^{n}, z \in {0, 1}^{n}) ∣ e v^{n} (y, z)_{< k} = x}}{4^{n}}$

$μ_{B S I}^{k} (⊥) := \infty \sum n = 0 σ^{k} (n) \frac{# {(y \in {0, 1}^{n}, z \in {0, 1}^{n}) ∣ | e v^{n} (y, z) | < k}}{4^{n}}$

Given $i \in {0, 1}$ , we define $f_{B S I (i)} : {0, 1}^{*} ⊔ {⊥} \to [0, 1]$ by

$\forall x \in {0, 1}^{*} : f_{B S I (i)} (x) := \infty \sum n = 0 σ^{| x |} (n) \frac{# {(y \in {0, 1}^{n}, z \in {0, 1}^{n}) ∣ e v^{n} (y, z)_{\leq | x |} = x i}}{4^{n}}$

$f_{B S I (i)} (⊥) := 0$

Using an appropriate encoding ${0, 1}^{*} ⊔ {⊥}$ can be identified with ${0, 1}^{*}$ and $(f_{B S I (i)}, μ_{B S I})$ regarded as a distributional estimation problem.

Proposition 3

$(f_{B S I (i)}, μ_{B S I})$ is weakly $Δ_{s q p}^{2}$ -generatable.

The following claim is likely to be a theorem, but we haven’t worked out the details of the proof.

Claim

Consider a random Turing machine $M$ which produces an infinite sequence $ω^{M}$ s.t. the time to compute the $k$ -th bit is bounded by a quasi-polynomial in $k$ . Given $k \in N$ , define the probability measure $μ_{M}^{k}$ on ${0, 1}^{*}$ by

$μ_{M}^{k} (x) := P r [ω_{< k}^{M} = x]$

Given $i \in {0, 1}$ , define $f_{M (i)} : {0, 1}^{*} \to [0, 1]$ by

$f_{M (i)} (x) := P r [ω_{| x |}^{M} = i ∣ ω_{< | x |}^{M} = x]$

Then, the identity function is a $Δ_{a v g}^{2}$ -pseudo-invertible reduction of $(f_{M (i)}, μ_{M})$ to $(f_{B S I (i)}, μ_{B S I})$ .

Corollary 2

Suppose ${^G}_{B S I (i)}$ is a weak $Δ_{l l}^{2}$ -generator for $(f_{B S I (i)}, μ_{B S I})$ . Then, $^Λ [{^G}_{B S I (i)}]$ is not only a $Δ_{l l}^{2} (p o l y, l o g)$ -optimal predictor scheme for $(f_{B S I (i)}, μ_{B S I})$ but also a $Δ_{a v g}^{2} (p o l y, l o g)$ -optimal predictor scheme for $(f_{M (i)}, μ_{M})$ , for any $M$ as above.

Note 1

It is straightforward to extend this approach to predicting functions of the several following bits rather than only the next bit.

Note 2

$M$ is computing $ω^{M}$ directly rather than recursively (that is, $ω_{k}^{M}$ can depend directly on the random coin flips used to generate the previous bits rather than only on the previous bits) which means this approach avoids the problem presented by Taylor. That is, if we construct an agent that uses $^Λ$ to compute expected reward as a function of action based on previous observations, then in Taylor’s example the correct action will be selected (since in that example the reward can be computed in polynomial time and by the reduction and uniqueness theorems $^Λ$ will produce an asymptotically exact prediction).

Appendix

Proof of Proposition 1

Suppose $^G$ is a $Δ (l o g)$ -generator for $(f, μ)$ . Define $Σ \subseteq N^{2} \times {0, 1}^{*}$ by

$Σ:=⋃k,j∈N(k,j)×^G1({0,1}rG(k,j))$

and $F : Σ \to [0, 1]$ by

$F^{k j} (x) := E_{U^{r_{G} (k, j)}} [{^G}^{k j} (y)_{2} ∣ {^G}^{k j} (y)_{1} = x]$

We have

$E [({^G}_{2} - f ({^G}_{1}))^{2}] = E [({^G}_{2} - F ({^G}_{1}) + F ({^G}_{1}) - f ({^G}_{1}))^{2}]$

$E [({^G}_{2} - f ({^G}_{1}))^{2}] = E [(F ({^G}_{1}) - f ({^G}_{1}))^{2}] - 2 E [(F ({^G}_{1}) - f ({^G}_{1})) (F ({^G}_{1}) - {^G}_{2})] + E [(F ({^G}_{1}) - {^G}_{2})^{2}]$

The middle term on the right hand side satisfies

$E_{U^{r_{G}}} [(F (^G (y)_{1}) - f (^G (y)_{1})) (F (^G (y)_{1}) -^G (y)_{2})] = E_{U^{r_{G}}} [E_{U^{r_{G}}} [(F (^G (y)_{1}) - f (^G (y)_{1})) (F (^G (y)_{1}) -^G (y^{'})_{2}) ∣^G (y^{'})_{1} =^G (y)_{1}]]$

$E_{U^{r_{G}}} [(F (^G (y)_{1}) - f (^G (y)_{1})) (F (^G (y)_{1}) -^G (y)_{2})] = E_{U^{r_{G}}} [(F (^G (y)_{1}) - f (^G (y)_{1})) E_{U^{r_{G}}} [(F (^G (y)_{1}) -^G (y^{'})_{2}) ∣^G (y^{'})_{1} =^G (y)_{1}]]$

$E_{U^{r_{G}}} [(F (^G (y)_{1}) - f (^G (y)_{1})) (F (^G (y)_{1}) -^G (y)_{2})] = E_{U^{r_{G}}} [(F (^G (y)_{1}) - f (^G (y)_{1})) (F (^G (y)_{1}) - F (^G (y)_{1}))]$

$E_{U^{r_{G}}} [(F (^G (y)_{1}) - f (^G (y)_{1})) (F (^G (y)_{1}) -^G (y)_{2})] = 0$

We get

$E [({^G}_{2} - f ({^G}_{1}))^{2}] = E [(F ({^G}_{1}) - f ({^G}_{1}))^{2}] + E [(F ({^G}_{1}) - {^G}_{2})^{2}]$

$E [({^G}_{2} - f ({^G}_{1}))^{2}] \geq E [(F ({^G}_{1}) - f ({^G}_{1}))^{2}]$

$E [(F ({^G}_{1}) - f ({^G}_{1}))^{2}] \in Δ$

Proof of Proposition 2

Consider $δ \in Δ_{s q p, ϕ}^{2}$ . Consider $ψ \in Φ$ , $ψ \leq ϕ$ . We have

$E_{λ_{ψ}^{k}} [δ (k, j)] = P r_{λ_{ψ}^{k}} [j < t_{ψ^{\frac{1}{2}}} (k)] E_{λ_{ψ}^{k}} [δ (k, j) ∣ j < t_{ψ^{\frac{1}{2}}} (k)] + P r_{λ_{ψ}^{k}} [j \geq t_{ψ^{\frac{1}{2}}} (k)] E_{λ_{ψ}^{k}} [δ (k, j) ∣ j \geq t_{ψ^{\frac{1}{2}}} (k)]$

$E_{λ_{ψ}^{k}} [δ (k, j)] \leq ψ (k)^{- \frac{1}{2}} (sup δ) + sup j \geq t_{ψ^{\frac{1}{2}}} (k) δ (k, j)$

For $k >> 0$ , $ψ^{\frac{1}{2}} \leq ϕ$ hence ${sup}_{j \geq t_{ψ^{\frac{1}{2}}} (k)} δ (k, j) \in Δ_{ψ^{\frac{1}{2}}}^{1} = Δ_{ψ}^{1}$ . We conclude that $E_{λ_{ψ}^{k}} [δ (k, j)] \in Δ_{ψ}^{1}$ and therefore $δ \in Δ_{l l, ϕ}^{2}$ .

Proposition 4

Suppose $^G$ is a weak $Δ (l o g)$ -generator for $(f, μ)$ . Then there is $δ \in Δ$ s.t. for any $k, j \in N$ and $g : {0, 1}^{*} \to R$ bounded

$| E_{U^{r_{G} (k, j)}} [g ({^G}^{k j} (y)_{1}) ({^G}^{k j} (y)_{2} - f ({^G}^{k j} (y)_{1}))] | \leq (sup g) δ (k, j)$

Proof of Proposition 4

$E_{U^{r_{G} (k, j)}} [g ({^G}^{k j} (y)_{1}) ({^G}^{k j} (y)_{2} - f ({^G}^{k j} (y)_{1}))] = E_{U^{r_{G} (k, j)} \times U^{r_{G} (k, j)}} [g ({^G}^{k j} (y)_{1}) ({^G}^{k j} (y^{'})_{2} - f ({^G}^{k j} (y)_{1})) ∣ {^G}^{k j} (y^{'})_{1} = {^G}^{k j} (y)_{1}]$

$E_{U^{r_{G} (k, j)}} [g ({^G}^{k j} (y)_{1}) ({^G}^{k j} (y)_{2} - f ({^G}^{k j} (y)_{1}))] = E_{U^{r_{G} (k, j)}} [g ({^G}^{k j} (y)_{1}) E_{U^{r_{G} (k, j)}} [{^G}^{k j} (y^{'})_{2} - f ({^G}^{k j} (y)_{1}) ∣ {^G}^{k j} (y^{'})_{1} = {^G}^{k j} (y)_{1}]]$

$E_{U^{r_{G} (k, j)}} [g ({^G}^{k j} (y)_{1}) ({^G}^{k j} (y)_{2} - f ({^G}^{k j} (y)_{1}))] = E_{U^{r_{G} (k, j)}} [g ({^G}^{k j} (y)_{1}) (E_{U^{r_{G} (k, j)}} [{^G}^{k j} (y^{'})_{2} ∣ {^G}^{k j} (y^{'})_{1} = {^G}^{k j} (y)_{1}]] - f ({^G}^{k j} (y)_{1}))]$

Property (ii) of $^G$ implies the result.

Proposition 5

Suppose $^G$ is a weak $Δ (l o g)$ -generator for $(f, μ)$ . Then there is $δ \in Δ$ s.t. for any $k, j \in N$ , finite set $Z$ , $ν$ a probability measure on $Z$ and $h : {0, 1}^{*} \times Z \to [0, 1]$

$| E_{U^{r_{G} (k, j)} \times ν} [(h ({^G}^{k j} (y)_{1}, z) - {^G}^{k j} (y)_{2})^{2}] - E_{μ^{k} \times ν} [(h (x, z) - f (x))^{2}] - E_{U^{r_{G} (k, j)}} [({^G}^{k j} (y)_{2} - f ({^G}^{k j} (y)_{1}))^{2}] | \leq δ (k, j)$

Proof of Proposition 5

$E_{U^{r_{G} (k, j)} \times ν} [(h ({^G}^{k j} (y)_{1}, z) - {^G}^{k j} (y)_{2})^{2}] = E_{U^{r_{G} (k, j)} \times ν} [(h ({^G}^{k j} (y)_{1}, z) - f ({^G}^{k j} (y)_{1}) + f ({^G}^{k j} (y)_{1}) - {^G}^{k j} (y)_{2})^{2}]$

$E [(h ({^G}_{1}^{k j}) - {^G}_{2}^{k j})^{2}] = E [(h ({^G}_{1}^{k j}) - f ({^G}_{1}^{k j}))^{2}] + E [({^G}_{2}^{k j} - f ({^G}_{1}^{k j}))^{2}] - 2 E [(h ({^G}_{1}^{k j}) - f ({^G}_{1}^{k j})) ({^G}_{2}^{k j} - f ({^G}_{1}^{k j}))]$

Since ${^G}_{1}$ is a $Δ (l o g)$ -sampler of $μ$ , the first term on the right hand side satisfies

$| E_{U^{r_{G} (k, j)} \times ν} [(h ({^G}_{1}^{k j}) - f ({^G}_{1}^{k j}))^{2}] - E_{μ^{k} \times ν} [(h (x) - f (x)^{2}] | \leq δ_{1} (k, j)$

where $δ_{1} \in Δ$ doesn’t depend on $h$ . Applying Proposition 4 to the last term on the right hand hands completes the desired result.

Proposition 6

Given $ϕ \in Φ$ and $δ \in Δ_{s q p, ϕ}^{2}$ , define $δ_{m o n} (k, j) := {sup}_{m \geq j} δ (k, m)$ . Then, $δ_{m o n} \in Δ_{s q p, ϕ}^{2}$ .

Proof of Proposition 6

Consider $ψ \in Φ$ , $ψ \leq ϕ$ .

$sup j \geq t_{ψ} (k) δ_{m o n} (k, j) = sup j \geq t_{ψ} (k) sup m \geq j δ (k, m)$

$sup j \geq t_{ψ} (k) δ_{m o n} (k, j) = sup j \geq t_{ψ} (k) δ (k, j)$

$sup j \geq t_{ψ} (k) δ_{m o n} (k, j) \in Δ_{ψ}^{1}$

Proof of Theorem 1

Consider $^P = (P, r, a)$ a $(p o l y, l o g)$ -predictor scheme. Choose $p : N^{2} \to N$ a polynomial satisfying $p (k, j) \geq j$ s.t. evaluating $Λ [G]^{k, p (k, j)}$ involves running ${^P}^{k j}$ until it halts “naturally” (such $p$ exists because $^P$ runs in at most polynomial time and has at most logarithmic advice). Given $i, j \in N$ , consider the execution of $Λ [G]^{k, p (k, j) + i}$ . The standard deviation of $ϵ ({^P}^{k j})$ with respect to the internal coin tosses of $Λ$ is at most $(p (k, j) + i)^{- 1}$ . According to Proposition 5, the expectation value is $E [({^P}^{k j} - f)^{2}] + E [({^G}_{2}^{k, p (k, j) + i} - f ({^G}_{1}^{k, p (k, j) + i}))^{2}] + γ_{P} (k, p (k, j) + i)$ where $| γ_{P} | \leq δ$ for $δ \in Δ_{s q p, ϕ}^{2}$ that doesn’t depend on $P$ . Thanks to Proposition 6 we can assume without loss of generality that $δ$ is non-increasing in the second argument, and in particular $δ (k, p (k, j) + i) \leq δ (k, j)$ . Denote $α^{k j} := E [({^G}_{2}^{k, p (k, j) + i} - f ({^G}_{1}^{k, p (k, j) + i}))^{2}]$ . By Chebyshev’s inequality,

$P r [ϵ ({^P}^{k j}) \geq E [({^P}^{k j} - f)^{2}] + α^{k j} + δ (k, j) + (p (k, j) + i)^{- \frac{1}{2}}] \leq (p (k, j) + i)^{- 1}$

Hence

$P r [ϵ (Q^{*}) \geq E [({^P}^{k j} - f)^{2}] + α^{k j} + δ (k, j) + (p (k, j) + i)^{- \frac{1}{2}}] \leq (p (k, j) + i)^{- 1}$

The standard deviation of $ϵ (Q)$ for any $Q$ is also at most $(p (k, j) + i)^{- 1}$ . The expectation value is $E [(e v^{p (k, j) + i} (Q) - f)^{2}] + α^{k j} + γ_{Q}$ where $| γ_{Q} | \leq δ (k, j)$ . Therefore

$P r [\exists Q < p (k, j) + i : ϵ (Q) \leq E [(e v^{p (k, j) + i} (Q) - f)^{2}] + α^{k j} - δ (k, j) - (p (k, j) + i)^{- \frac{1}{4}}] \leq (p (k, j) + i) (p (k, j) + i)^{- \frac{3}{2}} = (p (k, j) + i)^{- \frac{1}{2}}$

The extra $p (k, j) + i$ factor comes from summing probabilities over $p (k, j) + i$ programs. Combining we get

$P r [E [(e v^{p (k, j) + i} (Q^{*}) - f)^{2}] \geq E [({^P}^{k j} - f)^{2}] + 2 δ (k, j) + (p (k, j) + i)^{- \frac{1}{2}} + (p (k, j) + i)^{- \frac{1}{4}}] \leq (p (k, j) + i)^{- 1} + (p (k, j) + i)^{- \frac{1}{2}}$

$E [(Λ [G]^{k, p (k, j) + i} - f)^{2}] \leq E [({^P}^{k j} - f)^{2}] + 2 δ (k, j) + (p (k, j) + i)^{- 1} + 2 (p (k, j) + i)^{- \frac{1}{2}} + (p (k, j) + i)^{- \frac{1}{4}}$

$E [(Λ [G]^{k, p (k, j) + i} - f)^{2}] \leq E [({^P}^{k j} - f)^{2}] + 2 δ (k, j) + p (k, j)^{- 1} + 2 p (k, j)^{- \frac{1}{2}} + p (k, j)^{- \frac{1}{4}}$

Using Proposition 2 and the Lemma about $Δ_{l l, ϕ}^{2}$ , we get the desired result.

Proof of Theorem 2

$E_{U^{r_{G} (k, j)}} [({^G}_{τ} (y)_{2} - (τ \circ f) ({^G}_{τ} (y)_{1}))^{2}] = E_{U^{r_{G} (k, j)}} [(τ (^G (y)_{2}) - τ (f (^G (y)_{1})))^{2}]$

Since $τ$ is $α$ -Hoelder continuous, there is a constant $c > 0$ which doesn’t depend on $k, j$ s.t.

$E_{U^{r_{G} (k, j)}} [({^G}_{τ} (y)_{2} - (τ \circ f) ({^G}_{τ} (y)_{1}))^{2}] \leq c E_{U^{r_{G} (k, j)}} [(^G (y)_{2} - f (^G (y)_{1}))^{2 α}]$

$E_{U^{r_{G} (k, j)}} [({^G}_{τ} (y)_{2} - (τ \circ f) ({^G}_{τ} (y)_{1}))^{2}] \leq c E_{U^{r_{G} (k, j)}} [(^G (y)_{2} - f (^G (y)_{1}))^{2}]^{α}$

$E_{U^{r_{G} (k, j)}} [({^G}_{τ} (y)_{2} - (τ \circ f) ({^G}_{τ} (y)_{1}))^{2}] \in Δ$

Proposition 7

$min (\frac{log log (k + 3)}{log log (j + 3)}, 1) \in Δ_{s q p}^{2}$

Proof of Proposition 7

Consider $ϕ \in Φ$ . We have

$sup j \geq t_{ϕ} (k) min (\frac{log log (k + 3)}{log log (j + 3)}, 1) = min (\frac{log log (k + 3)}{log log (t_{ϕ} (k) + 3)}, 1)$

Choosing any $ϵ \in (0, 1)$ we get

$lim k \to \infty ϕ (k)^{ϵ} sup j \geq t_{ϕ} (k) min (\frac{log log (k + 3)}{log log (j + 3)}, 1) = lim k \to \infty ϕ (k)^{ϵ - 1}$

$lim k \to \infty ϕ (k)^{ϵ} sup j \geq t_{ϕ} (k) min (\frac{log log (k + 3)}{log log (j + 3)}, 1) = 0$

Proof of Proposition 3

Given $k, j \in N$ , define $σ^{k j}$ to be the probability distribution on $N$ given by $σ^{k j} := σ^{k} ∣ n \leq max (k, j)$ . We describe the execution of the weak generator ${^G}_{B S I (i)}^{k j}$ .

$n$ is sampled from $σ^{k j}$ (more precisely from a probability distribution that differs from $σ^{k j}$ by a rounding error of $2^{- j} \in Δ_{s q p}^{2}$ ). $y, z$ are sampled from $U^{n}$ . $x := e v^{n} (y, z)$ is computed. If $| x | < k$ , $(⊥, 0)$ is produced. If $| x | > k$ and $x_{k} = i$ , $(x_{< k}, 1)$ is produced. Otherwise, $(x_{< k}, 0)$ is produced.

It is easy to see ${^G}_{B S I (i)}$ generates $(f_{B S I (i)}, μ_{B S I})$ up to an error of order $P r_{σ^{k}} [n > max (k, j)] < min (\frac{log log (k + 3)}{log log (j + 3)}, 1)$ . By Proposition 7, it is therefore a weak $Δ_{s q p}^{2}$ -generator for $(f_{B S I (i)}, μ_{B S I})$ .

Bounded Solomonoff induction using optimal predictor schemes