What is the advantage of the Kolmogorov complexity prior?

As I understand it, Solomonoff induction works by a procedure loosely stated as follows: we start with a Kolmogorov complexity universal prior (a formalized Occam’s razor), then update using Bayesian inference any time we see new data.

More precisely, suppose the universe is a computable sequence X of 0s and 1s. We want to predict the nth term before it happens, using the first n-1 terms.

To do this, we use the following ‘algorithm’:

  1. List all computable sequences in increasing order of the complexity of the algorithm used to generate them to get a list A1, A2, A3, …

  2. We start with the prior P(Ak)=1/2^k.

  3. With this prior, we predict to see a string b of observations with probability P(b)=∑k P(Ak)[b<Ak], where [b<Ak]=1 if Ak starts with the string b, and [b<Ak]=0 otherwise.

  4. After we see a string b of observations, we update our beliefs using Bayes’ theorem: P(Ak|b) = P(b|Ak)P(Ak)/P(b) = [b<Ak]P(Ak)/P(b), and similarly P(bc|b) = P(b|bc)P(bc)/P(b) = P(bc)/P(b) (using P(b|bc)=1, since bc starts with b).

I’m putting the word ‘algorithm’ in quotes, because there is no list of all computable sequences where [b<Ak] is a computable function of k and b, let alone one where the sequences are listed in increasing Kolmogorov complexity. You can think of it as an algorithm relative to an oracle that computes such a function.
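
To make steps 2–4 concrete, here is a minimal Python sketch that runs the same update over a hand-picked finite list of sequences. The three sequences, their ordering, and the truncation of the list are illustrative assumptions of mine; a real version would need the oracle just mentioned to enumerate all computable sequences by complexity.

```python
# Toy stand-in for the 'algorithm': three hand-picked computable sequences,
# the prior P(Ak) = 1/2^k, and a Bayesian update after observing a prefix b.

def constant_zeros(n):          # 0, 0, 0, ...
    return [0] * n

def alternating(n):             # 0, 1, 0, 1, ...
    return [i % 2 for i in range(n)]

def ones_at_squares(n):         # 1 exactly at the square indices 0, 1, 4, 9, ...
    squares = {i * i for i in range(n)}
    return [1 if i in squares else 0 for i in range(n)]

hypotheses = [constant_zeros, alternating, ones_at_squares]  # stand-ins for A1, A2, A3
prior = [2.0 ** -(k + 1) for k in range(len(hypotheses))]    # P(Ak) = 1/2^k

def starts_with(seq_fn, b):
    """[b<Ak]: does the sequence produced by seq_fn begin with the prefix b?"""
    return seq_fn(len(b)) == list(b)

def posterior(b):
    """P(Ak|b) by Bayes' theorem: proportional to [b<Ak] * P(Ak)."""
    weights = [p if starts_with(h, b) else 0.0 for h, p in zip(hypotheses, prior)]
    total = sum(weights)        # plays the role of P(b), up to the truncation of the list
    return [w / total for w in weights]

def prob_next_bit_is_one(b):
    """Posterior-predictive probability that the bit after b is 1."""
    return sum(p for h, p in zip(hypotheses, posterior(b)) if h(len(b) + 1)[-1] == 1)

# The prefix 1,1,0,0 rules out the first two hypotheses, so all posterior mass
# lands on ones_at_squares, whose next bit (index 4, a perfect square) is 1.
print(posterior([1, 1, 0, 0]))             # [0.0, 0.0, 1.0]
print(prob_next_bit_is_one([1, 1, 0, 0]))  # 1.0
```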

If we follow this ‘algorithm’, we get a nice theorem:

For all ε>0 there is some N such that if b is an initial segment of X of length at least N, then P(bc|b) > 1-ε whenever bc is an initial segment of X, and P(bc|b) < ε otherwise.

So in the limit, we can correctly extrapolate from what we’ve seen with high confidence.


Proof: Let K = {k : X=Ak}. Let p0 = ∑k in K 1/2^k. Let M ≥ -log2(εp0), so that 2^(-M) ≤ εp0. Let b be a sufficiently long initial segment of X such that b is not an initial segment of any Am, where m≤M and m is not in K. Then if bc is an initial segment of X,

P(bc) = ∑k 1/2^k [bc<Ak] ≥ p0,

but on the other hand

P(b) = ∑k 1/2^k [b<Ak] ≤ p0 + ∑k>M 1/2^k = p0 + 2^(-M) ≤ (1+ε)p0.

Therefore,

P(bc|b) = P(bc)/P(b) ≥ p0/((1+ε)p0) = 1/(1+ε) > 1-ε.

If instead bc is not an initial segment of X, then no Ak with k in K, and no Am with m≤M outside K, starts with bc, so P(bc) ≤ ∑k>M 1/2^k = 2^(-M) ≤ εp0 ≤ εP(b), and therefore P(bc|b) ≤ ε.
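
As a quick sanity check on the constants, here is a numeric instance of the bound; the index of the true sequence and the values of ε and p0 are made up purely for illustration.

```python
import math

# Plugging numbers into the proof's bound (illustrative values only: suppose the
# true X appears at index 3 of the enumeration, so K = {3} and p0 = 1/2^3).
p0 = 2.0 ** -3
eps = 0.1
M = math.ceil(-math.log2(eps * p0))    # M >= -log2(eps*p0), so that 2^-M <= eps*p0

# Once b is long enough that no Am with m <= M, m not in K, starts with b:
P_bc_lower = p0                        # P(bc) >= p0 when bc is an initial segment of X
P_b_upper = p0 + 2.0 ** -M             # P(b) <= p0 + 2^-M <= (1+eps)*p0

print(M)                               # 7
print(P_bc_lower / P_b_upper)          # ~0.941, which is >= 1/(1+eps) ~ 0.909 > 1 - eps
```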


What if the universe isn’t actually computable? Maybe there is some true randomness in the universe, as quantum mechanics suggests? If you assume the universe is plucked at random from some computable probability distribution on the set of all binary sequences, you get almost the same result. Repeat the same procedure as above, but instead of listing all computable sequences, list all computable probability distributions on the set of binary sequences and do a similar sort of Bayesian inference; you again get a theorem telling you that, as you make observations, your distribution (conditioned on what you’ve seen so far) converges to the true distribution (conditioned on what you’ve seen so far).
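
Here is a sketch of the simplest stochastic version of this I can think of: replace “all computable probability distributions” with a small family of Bernoulli coins (an assumption purely for illustration), put any full-support prior on them, and update bit by bit; the posterior predictive converges to the true bias.

```python
import random

random.seed(0)
ps = [0.1, 0.3, 0.5, 0.7, 0.9]             # stand-ins for 'computable distributions'
weights = [1.0 / len(ps)] * len(ps)        # any full-support prior works for the convergence claim

true_p = 0.7
for _ in range(2000):
    bit = 1 if random.random() < true_p else 0
    # Bayes update: multiply each weight by the likelihood of the observed bit.
    weights = [w * (p if bit else 1 - p) for w, p in zip(weights, ps)]
    total = sum(weights)
    weights = [w / total for w in weights]

predictive = sum(w * p for w, p in zip(weights, ps))
print(predictive)                          # ~0.7 after enough observations
```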

This is all well and good. But in the end, what does it have to do with Kolmogorov complexity? The exact prior distribution we started with and the order of the list of all computable sequences were not used in any essential way in the proof. What’s the advantage of Occam’s razor? After all, we could skip steps 1 and 2 completely, start with an arbitrary ordering of the computable sequences and an arbitrary prior that assigns positive probability to each of them, and we would end up with the exact same theorem. (For the random-universe version, we need to start with a prior that is a mixture giving positive weight to every computable probability distribution, but again, we don’t need Kolmogorov complexity at all.)
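
To illustrate the claim, here is the same kind of toy computation as before, run with two different priors over the same hand-picked hypotheses (again my own illustrative stand-ins, not a genuine enumeration): once the data has eliminated the competing sequences, the complexity-weighted prior and a uniform prior make the same prediction.

```python
def alternating(n):                        # 0, 1, 0, 1, ...
    return [i % 2 for i in range(n)]

def constant_ones(n):                      # 1, 1, 1, ...
    return [1] * n

def period_three(n):                       # 0, 1, 1, 0, 1, 1, ...
    return [0 if i % 3 == 0 else 1 for i in range(n)]

hypotheses = [alternating, constant_ones, period_three]

def prob_next_bit_is_one(b, prior):
    """Posterior-predictive probability that the bit after b is 1."""
    weights = [p if h(len(b)) == list(b) else 0.0 for h, p in zip(hypotheses, prior)]
    total = sum(weights)
    return sum(w / total for w, h in zip(weights, hypotheses) if h(len(b) + 1)[-1] == 1)

b = [0, 1, 0, 1, 0]                                 # a prefix of the alternating sequence
print(prob_next_bit_is_one(b, [0.5, 0.25, 0.125]))  # complexity-style prior -> 1.0
print(prob_next_bit_is_one(b, [1/3, 1/3, 1/3]))     # uniform prior          -> 1.0
```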

Is there a stronger theorem that we get when we use the Kolmogorov complexity prior that we don’t get otherwise? Why are people making a big deal about the relationship between Solomonoff induction and Occam’s razor, if the part of Solomonoff induction that corresponds to Occam’s razor is not necessary for Solomonoff induction to work? Occam’s razor seems to work well in practice, but I don’t see how it’s in any way necessary (or even useful!) for the math.