Bounded Solomonoff induction using optimal predictor schemes

Most of the content of this post was covered by the talk I gave at the Los Angeles MIRIx in October, minus the proofs and a minor amendment to Theorem 1 (the role of ).

We define variants of the concept of a generatable distributional estimation problem and show that these variants also admit a uniformly universal optimal predictor scheme. We show how to use this to implement a form of bounded Solomonoff induction.

Results

We have previously defined a “word ensemble” to be a collection of probability measures on s.t. for some polynomial , . This was convenient when the formalism was based on Boolean circuits, but it is unnecessary for Turing machines: it is enough to allow the Turing machine to read only the beginning of the input and thus halt in time arbitrarily smaller than the length of the input. In the following, we will use “word ensemble” to mean an arbitrary sequence of probability measures on , allow such word ensembles in distributional estimation problems, etc.
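As a minimal illustration of this relaxed notion (the representation and names below are mine, not the post's notation), a word ensemble can be thought of as a rule assigning to each index a sampler over finite binary words, with no polynomial bound on word length; the machine is simply permitted to read only a prefix of its input:

```python
import random

def word_ensemble(k: int) -> str:
    """Hypothetical example ensemble: the k-th measure is uniform on words of
    length 2**k. Note there is no polynomial bound on the word length in k."""
    return "".join(random.choice("01") for _ in range(2 ** k))

def prefix_machine(x: str, budget: int) -> str:
    """A machine that reads only the first `budget` symbols of its input and then
    halts, so its running time can be arbitrarily smaller than the input length."""
    return x[:budget]
```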

All proofs are in the Appendix.

We start by defining “-sampler” and “-generator” for an error space of rank 2 (they were previously defined for an error space of rank 1). Fix such an error space.

Definition 1

Consider a word ensemble. A -bischeme of signature is called a -sampler for when

When has no advice, it is called a -sampler.

which admits such is called -sampleable or -sampleable, respectively.

Definition 2

Consider a distributional estimation problem. A -bischeme of signature is called a -generator for when

(i) is a -sampler for .

(ii)

When has no advice, it is called a -generator.

which admits such is called -generatable or -generatable, respectively.


We now introduce a variant in which the generator matches on average but is allowed to have arbitrary variance.

Definition 3

Consider a distributional estimation problem. A -bischeme of signature is called a weak -generator for when

(i) is a -sampler for .

(ii)

When has no advice, it is called a weak -generator.

which admits such is called weakly -generatable or weakly -generatable, respectively.
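Informally, and in hypothetical notation (the displayed conditions above did not survive formatting): writing $(G_1, G_2)$ for the pair produced by the bischeme, where $G_1$ is the sampled word and $G_2$ the accompanying guess for $f(G_1)$, a generator requires something like

$$\operatorname{E}\big[\,\lvert G_2 - f(G_1)\rvert\,\big] \quad\text{small in the sense of the error space},$$

whereas a weak generator only requires

$$\big\lvert \operatorname{E}\big[\,G_2 - f(G_1)\,\big]\big\rvert \quad\text{small in the sense of the error space},$$

so the guess must be correct on average but may have arbitrary variance around $f(G_1)$.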

Proposition 1

Consider a distributional estimation problem. Any -generator for is in particular a weak -generator for .


We now show that weak generators enable an existence theorem for optimal predictor schemes of the same form as the one for generators.

Construction 1

Given we define to be the set of bounded functions s.t. for any , if then . It is easy to see is an error space.

We define . is also an error space.

Proposition 2

Construction 2

We describe an oracle machine that accepts an oracle of signature and implements a -predictor scheme. Consider the computation of .

We loop over the first words in lexicographic order. For each word , we loop over “test runs”. At test run , we generate by evaluating (we treat as random, that is, we don’t expect repeated answers to the same query to coincide). We then sample from and compute . At the end of the test runs, we compute the average error . At the end of the loop over programs, the program with the lowest error is selected and the output is produced.
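The following sketch (in Python, with placeholder names and a placeholder error term, since the post's parameters and signature are omitted above) is a rough rendering of this selection loop, assuming the generator oracle can be queried for fresh samples:

```python
def select_predictor(oracle_sample, run_program, candidate_words, num_test_runs, query):
    """Rough sketch of the selection loop in Construction 2 (names are placeholders).

    oracle_sample(): draws a fresh (x, y) pair from the generator oracle; repeated
                     calls need not agree, i.e. the oracle is treated as random.
    run_program(w, x): the prediction produced by interpreting the word w as a program.
    """
    best_word, best_error = None, float("inf")
    for w in candidate_words:                     # the first words in lexicographic order
        total = 0.0
        for _ in range(num_test_runs):            # independent test runs
            x, y = oracle_sample()                # fresh sample from the oracle
            guess = run_program(w, x)             # candidate program's prediction on x
            total += (guess - y) ** 2             # placeholder error term
        avg_error = total / num_test_runs         # average error over the test runs
        if avg_error < best_error:                # keep the program with the lowest error
            best_word, best_error = w, avg_error
    return run_program(best_word, query)          # output of the selected program
```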

Theorem 1

Consider a distributional estimation problem, and a weak -generator for . Then, is a -optimal predictor scheme for .

Corollary 1

Consider a distributional estimation problem and a weak -generator for . Then, is a -optimal predictor scheme for .


To demonstrate the difference between generators and weak generators, we give the following property of generators which weak generators don’t share.

Theorem 2

Consider a distributional estimation problem and a corresponding -generator. Suppose is an -Hölder continuous function for and is a -valued -bischeme s.t.

Define by

Then, is a -generator for .


This means that a generator yields an entire “logical probability distribution” for whereas a weak generator yields only a “logical expectation value.”

The following construction shows how to use Theorem 2 to implement a form of bounded Solomonoff induction.

Construction 3

Given , we define the probability distribution on by

We define the probability distribution on by

Given , we define by

Using an appropriate encoding, can be identified with and regarded as a distributional estimation problem.

Proposition 3

is weakly -generatable.


The following claim is likely to be a theorem, but we haven’t worked out the details of the proof.

Claim

Consider a random Turing machine which produces an infinite sequence s.t. the time to compute the -th bit is bounded by a quasi-polynomial in . Given , define the probability measure on by

Given , define by

Then, the identity function is a -pseudo-invertible reduction of to .

Corollary 2

Suppose is a weak -generator for . Then, is not only a -optimal predictor scheme for but also a -optimal predictor scheme for , for any as above.

Note 1

It is straightforward to extend this approach to predicting functions of the next several bits rather than only the next bit.

Note 2

is computing directly rather than recursively (that is, can depend directly on the random coin flips used to generate the previous bits rather than only on the previous bits), which means this approach avoids the problem presented by Taylor. That is, if we construct an agent that uses to compute the expected reward as a function of the action based on previous observations, then in Taylor’s example the correct action will be selected (since in that example the reward can be computed in polynomial time, and by the reduction and uniqueness theorems will produce an asymptotically exact prediction).
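As a toy rendering of this (all names below are placeholders, not the post's notation), such an agent queries the predictor scheme for the expected reward of each available action given the observations so far and takes the argmax:

```python
def choose_action(predictor, observations, actions):
    """predictor(observations, action) is assumed to return the predicted expected
    reward of taking `action` after seeing `observations`. In Taylor's example the
    reward is polynomial-time computable, so (per the post) the prediction is
    asymptotically exact and the correct action is selected."""
    return max(actions, key=lambda a: predictor(observations, a))
```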

Appendix

Proof of Proposition 1

Suppose is a -generator for . Define by

and by

We have

The middle term on the right hand side satisfies

We get

Proof of Proposition 2

Consider . Consider , . We have

For , hence . We conclude that and therefore .

Proposition 4

Suppose is a weak -generator for . Then there is s.t. for any and bounded

Proof of Proposition 4

Property (ii) of implies the result.

Proposition 5

Suppose is a weak -generator for . Then there is s.t. for any , finite set , a probability measure on and

Proof of Proposition 5

Since is a -sampler for , the first term on the right hand side satisfies

where doesn’t depend on . Applying Proposition 4 to the last term on the right hand side yields the desired result.

Proposition 6

Given and , define . Then, .

Proof of Proposition 6

Consider , .

Proof of Theorem 1

Consider a -predictor scheme. Choose a polynomial satisfying s.t. evaluating involves running until it halts “naturally” (such exists because runs in at most polynomial time and has at most logarithmic advice). Given , consider the execution of . The standard deviation of with respect to the internal coin tosses of is at most . According to Proposition 5, the expectation value is where for that doesn’t depend on . Thanks to Proposition 6 we can assume without loss of generality that is non-increasing in the second argument, and in particular . Denote . By Chebyshev’s inequality,

Hence

The standard deviation of for any is also at most . The expectation value is where . Therefore

The extra factor comes from summing probabilities over programs. Combining, we get

Using Proposition 2 and the Lemma about , we get the desired result.
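For reference, the probabilistic step invoked above is the standard Chebyshev bound for an empirical average (stated here in generic notation, not with the proof's constants): if $\bar{e}$ is the average of $M$ independent test runs with common mean $\mu$ and variance at most $\sigma^2$, then

$$\Pr\big[\,\lvert \bar{e} - \mu \rvert \geq t\,\big] \;\leq\; \frac{\sigma^2}{M t^2}.$$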

Proof of Theorem 2

Since is -Hölder continuous, there is a constant which doesn’t depend on s.t.

Proposition 7

Proof of Proposition 7

Consider . We have

Choosing any we get

Proof of Proposition 3

Given , define to be the probability distribution on given by . We describe the execution of the weak generator .

is sampled from (more precisely from a probability distribution that differs from by a rounding error of ). are sampled from . is computed. If , is produced. If and , is produced. Otherwise, is produced.

It is easy to see generates up to an error of order . By Proposition 7, it is therefore a weak -generator for .
