I find it intellectually exhilarating, as I have not been introduced to Solomonoff before, and his work may be very informative for my studies. I have come at inference from quite a different direction, and I am hopeful that an appreciation of Solomonoff will broaden my scope.
One thing that puzzles me is the assertion that:
Therefore an algorithm that is one bit longer is half as likely to be the true algorithm. Notice that this intuitively fits Occam’s razor; a hypothesis that is 8 bits long is much more likely than a hypothesis that is 34 bits long. Why bother with extra bits? We’d need evidence to show that they were necessary.
First, my understanding of the principle of maximum entropy suggests that prior probabilities are constrained only by evidence, not by the length of the algorithm encoding the hypothesis. In fact, Jaynes argues that 'Ockham's' razor is already built into Bayesian inference.
Second, given that the probability is halved with every added bit of algorithm length, wouldn't that imply that algorithms of length 1 bit are the most likely, with probability 1/2? In fact, I doubt that any algorithm at all is describable with 1 bit. Some comments, as well as the body of the article, suggest that the real accomplishment of Solomonoff's approach is to provide the set of all possible algorithms/hypotheses, and that the probabilities assigned to each are not part of a probability distribution but rather serve to rank them. Why do they need to be ranked? Why not assign them all probability 1/N, where N = 2^(n+1) − 2 is the number of algorithms having length up to and including n?
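To make the two priors being contrasted concrete, here is a small sketch (a toy over bit strings, not a real UTM): the uniform 1/N prior over all strings of length up to n, versus the length-weighted 2^(-length) prior. It also shows why the prefix-free restriction matters for the latter, via Kraft's inequality. All numbers are illustrative.

```python
# Toy comparison of a uniform prior with a length-weighted prior over
# binary strings of length 1..n (standing in for "algorithms").

n = 5
strings = [format(i, "b").zfill(L) for L in range(1, n + 1) for i in range(2 ** L)]
N = len(strings)
assert N == 2 ** (n + 1) - 2          # N = 2^(n+1) - 2, as in the question

# Uniform prior: every hypothesis equally likely, regardless of length.
uniform = {s: 1.0 / N for s in strings}

# Length-weighted prior: each extra bit halves the weight.
weighted = {s: 2.0 ** (-len(s)) for s in strings}

# Without a prefix-free restriction the weights sum to n, not 1, so by
# themselves they only rank hypotheses rather than form a distribution.
print(sum(weighted.values()))         # 5.0

# Over a prefix-free subset (no string is a prefix of another), the
# total stays at or below 1, by Kraft's inequality.
prefix_free = ["0", "10", "110", "1110"]
print(sum(2.0 ** (-len(s)) for s in prefix_free))   # 0.9375
```

This is part of the answer to the puzzle: the 2^(-length) weights only behave like probabilities once the programs form a prefix-free set.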
Clearly I am missing something important.
Could it be that ranking them by length is for the purpose of determining the sequence in which the possible hypotheses should be evaluated? If we rank hypotheses by length and then evaluate them against the evidence in sequence, from shorter to longer, our search will stop at the shortest consistent algorithm, which by Occam's razor is the preferred one.
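The search procedure described above can be sketched as follows. The "hypotheses" here are toy predicates tagged with invented description lengths, not real programs on a UTM; the numbers 8 and 34 echo the quoted example.

```python
# Enumerate hypotheses in order of description length and return the
# first one consistent with all of the evidence.

def shortest_consistent(hypotheses, evidence):
    """hypotheses: list of (description_length, predict_fn) pairs."""
    for length, predict in sorted(hypotheses, key=lambda h: h[0]):
        if all(predict(x) == y for x, y in evidence):
            return length, predict
    return None

# Toy evidence generated by "y equals x squared".
evidence = [(1, 1), (2, 4), (3, 9)]
hypotheses = [
    (34, lambda x: x ** 2),   # longer description that also fits
    (8,  lambda x: x ** 2),   # shortest description that fits
    (5,  lambda x: x + 1),    # shorter still, but ruled out by evidence
]
print(shortest_consistent(hypotheses, evidence)[0])  # 8
```

Note, though, that Solomonoff induction proper does not stop at the first match; it keeps a weighted mixture over all consistent programs.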
Second, given that the probability is halved with every added bit of algorithm length, wouldn't that imply that algorithms of length 1 bit are the most likely, with probability 1/2? In fact, I doubt that any algorithm at all is describable with 1 bit.
Keep in mind that Solomonoff induction is defined only up to the choice of a prefix-free universal Turing machine. Some of these UTMs can represent a valid program with 1 bit, but this isn't particularly relevant: each UTM leads to a different Solomonoff prior. Solomonoff induction is universal only in an asymptotic sense: as you accumulate more and more bits of evidence, the posterior distributions converge with high probability.
Probability estimates dominated by short programs are biased by the choice of the UTM. As more evidence accumulates, the likelihood of a short program being consistent with all of it decreases (intuitively, by the pigeonhole principle), and the probability estimate will be dominated by longer and longer programs, “washing out” the initial bias.
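The "washing out" of the prior can be illustrated with a standard Bayesian-updating toy example: two very different priors over the same three candidate hypotheses (here, coin biases standing in for programs) converge to essentially the same posterior as evidence accumulates. All numbers are invented for illustration.

```python
# Two priors over three candidate coin biases; repeated evidence from
# the true bias drives both posteriors to the same answer.

import random
random.seed(0)

biases = [0.2, 0.5, 0.8]                 # three candidate "programs"
posterior_a = [0.90, 0.05, 0.05]         # prior heavily favoring the first
posterior_b = [1 / 3, 1 / 3, 1 / 3]      # uniform prior

true_bias = 0.8
for _ in range(200):                     # accumulate 200 bits of evidence
    heads = random.random() < true_bias
    for post in (posterior_a, posterior_b):
        for i, b in enumerate(biases):
            post[i] *= b if heads else (1 - b)   # Bayes update
        total = sum(post)
        for i in range(3):
            post[i] /= total                     # renormalize

# Both posteriors now concentrate on the true bias, despite very
# different starting points.
print([round(p, 3) for p in posterior_a])
print([round(p, 3) for p in posterior_b])
```

The initial 18-to-1 disagreement between the priors becomes irrelevant once the data dominate, which is the sense in which the choice of UTM is "washed out".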
Why do they need to be ranked? Why not assign them all probability 1/N, where N = 2^(n+1) − 2 is the number of algorithms having length up to and including n?
In order to normalize properly, you have to divide the raw Solomonoff measure by Chaitin's omega for that particular UTM, that is, the probability that a randomly generated program halts.
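The arithmetic of that normalization can be shown on a finite toy. Chaitin's omega is uncomputable for a real UTM, so the "programs" below are an invented prefix-free set with made-up halting behavior; the point is only how dividing by the halting weight yields a proper distribution over halting programs.

```python
# Toy normalization: divide each raw weight 2^(-len) by the total
# weight of halting programs (a finite stand-in for Chaitin's omega).

programs = {          # code -> halts?  (invented for illustration)
    "0":    True,
    "10":   True,
    "110":  False,    # non-halting: contributes nothing to omega
    "1110": True,
}

omega = sum(2.0 ** -len(p) for p, halts in programs.items() if halts)
print(omega)          # 0.8125  (1/2 + 1/4 + 1/16)

normalized = {p: (2.0 ** -len(p)) / omega
              for p, halts in programs.items() if halts}
print(round(sum(normalized.values()), 6))   # 1.0
```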