Okay, I have a “stupid” question. Why is the longer binary sequence that represents the hypothesis less likely to be ‘true’ data generator? I read the part below but I don’t get the example, can someone explain in a different way?
We have a list, but we’re trying to come up with a probability, not just a list of possible explanations. So how do we decide what the probability is of each of these hypotheses? Imagine that the true algorithm is produced in a most unbiased way: by flipping a coin. For each bit of the hypothesis, we flip a coin. Heads will be 0, and tails will be 1. In the example above, 01001101, the coin landed heads, tails, heads, heads, tails, and so on. Because each flip of the coin has a 50% probability, each bit contributes ½ to the final probability.
Therefore an algorithm that is one bit longer is half as likely to be the true algorithm. Notice that this intuitively fits Occam’s razor; a hypothesis that is 8 bits long is much more likely than a hypothesis that is 34 bits long. Why bother with extra bits? We’d need evidence to show that they were necessary.
This is a completely different explanation, and I am not sure about its correctness. But it may help your intuition.
Imagine that you randomly generate programs in a standard computer language. And—contrary to the S.I. assumption—let’s say that all those programs have the same length N.
Let’s try a few examples. For simplicity, let’s suppose that the input is in variable x, and the programming language has only two commands: “x++” (increases x by one) and “return x” (returns the current value of x as a result). There are four possible programs:
x++; x++;
x++; return x;
return x; x++;
return x; return x;
The first program does not return a value. Let’s say that it is not a valid program, and remove it from the pool. The second program returns the value “x+1”. The third and fourth programs both return the value “x”, because both are equivalent to one-line program “return x”.
It seems that the shorter programs have an advantage, because a longer program can give the same results as the shorter program. More precisely, for every short program, there are infinitely many equivalent longer programs; as you increase the maximum allowed length N, you increase their numbers exponentially (though not necessarily 2^x exponentially; this is just a result of our example language having 2 instructions). This is why in this example the one-line program won the competition, even if it wasn’t allowed to participate.
OK, this is not a proof, just an intuition. Seems to me that even if you use something else instead of S.I., the S.I. may appear spontaneously anyway; so it is a “natural” distribution. (Similarly, the Gauss curve is a natural distribution for adding independent random variables. If you don’t like it, just choose a different curve F(x), and then calculate what happens if you add together thousand or million random numbers selected by F(x)… the result will be more and more similar to the Gauss curve.)
The algorithmic probability of a string is the probability that the binary sequence will be obtained when you feed infinite string of random digits on input tape to universal prefix Turing machine (look up prefix Turing machine for more info).
The sum weighted by powers of two is simply the alternative way of writing it down as the probability of obtaining specific prefix of length l via coin flips is 2^-l , in the model whereby you split the string into some ‘prefix’ that is executed as program, and the ‘data’. This division is pretty arbitrary as the prefix may be an interpreter.
AFAIK, Solomonoff induction can be specified in such way: probability of seeing 1 after [string] is algorithmic probability ( [ string ] 1 ) / algorithmic_probability ( [ string ] )
Further reading: papers by Marcus Hutter.
edit: i’ve always said that markdown is for people who use italics more than algebra.
edit2: and if its not clear from the post, the programs are self delimiting; this is what specifies the end of program. Prefix turing machine does not have special ‘stop’ marker. The end of program is largely in the eye of the beholder; you can make a prefix that would proceed to read the input tape and output the digits from input tape; you can view it as a short program, or you can view it as a set of of infinitely long programs.
Okay, I have a “stupid” question. Why is the longer binary sequence that represents the hypothesis less likely to be ‘true’ data generator? I read the part below but I don’t get the example, can someone explain in a different way?
This is a completely different explanation, and I am not sure about its correctness. But it may help your intuition.
Imagine that you randomly generate programs in a standard computer language. And—contrary to the S.I. assumption—let’s say that all those programs have the same length N.
Let’s try a few examples. For simplicity, let’s suppose that the input is in variable x, and the programming language has only two commands: “x++” (increases x by one) and “return x” (returns the current value of x as a result). There are four possible programs:
x++;
x++;
x++;
return x;
return x;
x++;
return x;
return x;
The first program does not return a value. Let’s say that it is not a valid program, and remove it from the pool. The second program returns the value “x+1”. The third and fourth programs both return the value “x”, because both are equivalent to one-line program “return x”.
It seems that the shorter programs have an advantage, because a longer program can give the same results as the shorter program. More precisely, for every short program, there are infinitely many equivalent longer programs; as you increase the maximum allowed length N, you increase their numbers exponentially (though not necessarily 2^x exponentially; this is just a result of our example language having 2 instructions). This is why in this example the one-line program won the competition, even if it wasn’t allowed to participate.
OK, this is not a proof, just an intuition. Seems to me that even if you use something else instead of S.I., the S.I. may appear spontaneously anyway; so it is a “natural” distribution. (Similarly, the Gauss curve is a natural distribution for adding independent random variables. If you don’t like it, just choose a different curve F(x), and then calculate what happens if you add together thousand or million random numbers selected by F(x)… the result will be more and more similar to the Gauss curve.)
The algorithmic probability of a string is the probability that the binary sequence will be obtained when you feed infinite string of random digits on input tape to universal prefix Turing machine (look up prefix Turing machine for more info).
The sum weighted by powers of two is simply the alternative way of writing it down as the probability of obtaining specific prefix of length l via coin flips is 2^-l , in the model whereby you split the string into some ‘prefix’ that is executed as program, and the ‘data’. This division is pretty arbitrary as the prefix may be an interpreter.
AFAIK, Solomonoff induction can be specified in such way: probability of seeing 1 after [string] is algorithmic probability ( [ string ] 1 ) / algorithmic_probability ( [ string ] )
Further reading: papers by Marcus Hutter.
edit: i’ve always said that markdown is for people who use italics more than algebra.
edit2: and if its not clear from the post, the programs are self delimiting; this is what specifies the end of program. Prefix turing machine does not have special ‘stop’ marker. The end of program is largely in the eye of the beholder; you can make a prefix that would proceed to read the input tape and output the digits from input tape; you can view it as a short program, or you can view it as a set of of infinitely long programs.