I’m confused. Isn’t one of the standard justifications for the Solomonoff prior that you can get it without talking about K-complexity, just by assuming a uniform prior over programs of length l on a universal monotone Turing machine and letting l tend to infinity?
What you describe is not the Solomonoff prior on hypotheses, but the Solomonoff a priori distribution on sequences/histories! This is the distribution I call M in my post. It can then be written as a mixture of LSCSMs, with the weights given either by the Solomonoff prior P_sol (involving Kolmogorov complexity) or the a priori prior P_ap in my work. Those priors are not the same.
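For concreteness, here is the usual way that a priori distribution is written in terms of a universal monotone machine U fed uniformly random bits (a sketch of the standard construction; the post's exact conventions, e.g. minimality of programs, may differ):

$$M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)},$$

where the sum ranges over (minimal) programs p whose output begins with x and ℓ(p) is the program length. The choice of prior only enters once this same M is rewritten as a mixture Σ_ν w(ν) ν(x) over LSCSMs, and that rewriting is not unique.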
If they have the same prior on sequences/histories, then in what relevant sense are they not the same prior on hypotheses? If they both sum to M(x), how can their predictions come to differ?
Well, their induced mixture distributions are the same up to a constant, but the priors on hypotheses are different. I’m not sure if you consider the difference “relevant”, perhaps you only care about the induced mixture distribution?
To make a simple example: assume there were only three Turing machines T, T0, and T1, with T(0p) = T0(p) and T(1p) = T1(p). Let ν, ν0, and ν1 be the LSCSMs induced by T, T0, and T1. Notice that ν is a mixture of ν0 and ν1: ν = 1/2 ν0 + 1/2 ν1.
Let M be the mixture distribution given as M = 1/3 ν + 1/3 ν0 + 1/3 ν1. Then clearly, M can also be represented as M = 1/2 ν0 + 1/2 ν1. My viewpoint is that the prior distribution giving weight 1/3 to each of the three hypotheses is different from the one giving weight 1/2 to each of ν0 and ν1, even if their mixture distributions are exactly the same.
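Spelling out the arithmetic behind "clearly", using only the decomposition ν = 1/2 ν0 + 1/2 ν1:

$$\tfrac{1}{3}\nu + \tfrac{1}{3}\nu_0 + \tfrac{1}{3}\nu_1 \;=\; \tfrac{1}{3}\big(\tfrac{1}{2}\nu_0 + \tfrac{1}{2}\nu_1\big) + \tfrac{1}{3}\nu_0 + \tfrac{1}{3}\nu_1 \;=\; \big(\tfrac{1}{6}+\tfrac{1}{3}\big)\nu_0 + \big(\tfrac{1}{6}+\tfrac{1}{3}\big)\nu_1 \;=\; \tfrac{1}{2}\nu_0 + \tfrac{1}{2}\nu_1.$$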
And this is exactly the situation we’re in with the true mixture distribution M from the post. Some of the LSCSMs ν in the mixture are of the form ν = νT for a separate universal monotone Turing machine, which means that νT is itself a mixture of all LSCSMs. Any such mixtures among the LSCSMs make it possible to redistribute prior weight from that LSCSM to all the others without affecting the mixture M in any way.
This is also related to what makes a prior based on Kolmogorov complexity ultimately so arbitrary: we could have chosen just about anything and it would still essentially sum to M. That said, a posteriori, Kolmogorov complexity does have some mathematical advantages, as outlined in the post.
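For reference, the Kolmogorov-complexity-based weight has roughly the shape

$$P_{\mathrm{sol}}(\nu) \;\propto\; 2^{-K(\nu)}$$

(the post's precise definition may differ in details, e.g. which complexity variant is used), and the arbitrariness point is that swapping K for many other codelength functions still yields a mixture that equals M up to a multiplicative constant.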
> My viewpoint is that the prior distribution giving weight 1/3 to each of the three hypotheses is different from the one giving weight 1/2 to each of ν0 and ν1, even if their mixture distributions are exactly the same.
That’s pretty unintuitive to me. What does it matter whether we happen to write out our belief state one way or the other? So long as the predictions come out the same, what we do and don’t choose to call our ‘hypotheses’ doesn’t seem particularly relevant for anything?
We made our choice when we settled on M as the prior. Everything past that point just seems like different choices of notation to me? If our induction procedure turned out to be wrong or suboptimal, it’d be because M was a bad prior to pick, not because we happened to write M down in a weird way, right?
I answered in the parallel thread, which is probably getting down to the crux now. To add a few more points:
The prior matters for the Solomonoff bound, see Theorem 5. (To be clear, the true value of the prediction error is the same irrespective of the prior, but the bound we can prove differs; see the sketch after this list.)
I think different priors have different aesthetics. Choosing a prior because it gives you a nice result (i.e., the Solomonoff prior) feels different from choosing it because it’s a priori correct (like the a priori prior in this post). To me, aesthetics matter.
It’s also useful to emphasize why, even if the mixtures are the same, having different priors can make a ~~practical~~ difference. E.g., imagine that in the example above we had one prior giving 100% weight to ν, and another prior giving 50% weight to each of ν0 and ν1. They give the same mixture, but the first prior can’t update, and the second prior can!
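Regarding the first point, here is a sketch of the standard form such a bound takes (a generic dominance argument, not necessarily the exact statement of Theorem 5): if μ appears in the mixture M = Σ_ν w(ν) ν with weight w(μ) > 0, then M(x) ≥ w(μ) μ(x), so the summed expected KL divergence between μ's and M's next-symbol predictions satisfies

$$\sum_{t=1}^{n} \mathbb{E}_{\mu}\Big[\mathrm{KL}\big(\mu(\cdot \mid x_{<t}) \,\big\|\, M(\cdot \mid x_{<t})\big)\Big] \;=\; \mathbb{E}_{\mu}\!\left[\ln \frac{\mu(x_{1:n})}{M(x_{1:n})}\right] \;\le\; \ln \frac{1}{w(\mu)}.$$

The left-hand side does not depend on how M is decomposed; only the right-hand side, and hence the provable guarantee, changes with the prior.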
Wait, are you saying we’re not propagating updates into ν to change the mass it puts on inputs 0 vs. 1?

Okay, I think I overstated the extent to which the difference in priors matters in the previous comments and crossed out “practical”.
Basically, I was right that the prior that gives 100% to ν cannot update: it gives all its weight to ν no matter how much data comes in. However, ν itself can update with more data and shift between ν0 and ν1.
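A minimal numeric sketch of exactly this point (the Bernoulli components, weights, and names below are my own illustration, not anything from the post): let ν0 and ν1 be i.i.d. coins and ν their 50/50 mixture, and compare the prior putting 100% on ν with the prior putting 50% on each of ν0 and ν1.

```python
# Toy check: two priors with the same induced mixture make identical predictions,
# even though their posterior weights over "hypotheses" evolve differently.
# Illustrative choices: nu0 = i.i.d. Bernoulli(0.1), nu1 = i.i.d. Bernoulli(0.9),
# nu = the 50/50 mixture of nu0 and nu1.

def bernoulli_seq(theta, bits):
    """Probability of a bit string under i.i.d. Bernoulli(theta)."""
    p = 1.0
    for b in bits:
        p *= theta if b == 1 else 1.0 - theta
    return p

def nu0(bits): return bernoulli_seq(0.1, bits)
def nu1(bits): return bernoulli_seq(0.9, bits)
def nu(bits):  return 0.5 * nu0(bits) + 0.5 * nu1(bits)

def posterior(prior, data):
    """Bayesian posterior weights over the hypotheses listed in `prior`."""
    joint = {name: w * fn(data) for name, (w, fn) in prior.items()}
    z = sum(joint.values())
    return {name: j / z for name, j in joint.items()}

def predictive_one(prior, data):
    """Posterior predictive probability that the next bit is 1."""
    num = sum(w * fn(data + [1]) for w, fn in prior.values())
    den = sum(w * fn(data) for w, fn in prior.values())
    return num / den

prior_all_on_nu = {"nu": (1.0, nu)}
prior_half_half = {"nu0": (0.5, nu0), "nu1": (0.5, nu1)}

data = [1, 1, 1, 1, 1]  # evidence strongly favouring the Bernoulli(0.9) world

print(posterior(prior_all_on_nu, data))    # {'nu': 1.0}: the weight never moves
print(posterior(prior_half_half, data))    # nearly all weight shifts to 'nu1'
print(predictive_one(prior_all_on_nu, data),
      predictive_one(prior_half_half, data))  # the two predictions are identical
```

The first prior’s weight on ν is pinned at 1, but ν’s own conditional predictions drift toward the Bernoulli(0.9) component, which is the sense in which “ν itself can update”; the posterior predictive distributions of the two priors agree exactly.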
I can see that this feels perhaps very syntactic, but in my mind the two priors still feel different. One of them is saying “The world first samples a bit indicating whether the world will continue with world 0 or world 1”, and the other one is saying “I am uncertain about whether we live in world 0 or world 1”.
The difference is not a “practical” one as long as you only use the posterior predictive distribution, but in some AIXI variants (KSA, certain safety proposals) the posterior weights themselves are accessed and the form may matter. Arguably this is a defect of those variants.
Might be worth more explicitly noting in the post that P_sol and P_ap in fact define the same semimeasure over strings (up to a multiplicative factor). From a skim I was confused about this point: “Wait, is he saying that not only are alt-complexity and K-complexity different, but that they even define different probability distributions? That seems to contradict the universality of P_sol, doesn’t it...?”
Good idea, I now added the following to the opening paragraphs of the section doing the comparisons:
> Importantly, due to Theorem 4, this means that the Solomonoff prior P_sol and the a priori prior P_ap lead, up to a constant, to the same predictions on sequences. The advantages of the priors that we analyze are thus not statements about their induced predictive distributions.