If one confines oneself to elaborate, baroque rules, but sticks to a particular style of baroque elaboration, one obtains the same effect.
The point of Kolmogorov complexity is that there is some limit to how baroque your rules can ever become, or how baroque you can make a fundamentally simple rule by choosing a really weird prior.
In algorithmic information theory, the problem of choosing a prior is equivalent to choosing a particular universal Turing machine. If you pick a really weird Turing machine, you will end up assigning low prior probability to things that “normal” people would consider simple—like a sine wave, for example. But because of universal computation, there’s a limit to how low a probability you can assign. Your weird machine is still universal, so somehow or other it’s got to be able to produce a sine wave, after whatever translation-prefix machinations you have to perform to get it to act like a normal programming language.
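For reference, the precise form of that limit is the invariance theorem of algorithmic information theory; a standard statement (notation mine) is:

```latex
% Invariance theorem: for any two universal prefix machines U and V there is
% a constant c_{UV} -- the length of a translation prefix -- that does not
% depend on the string x:
\[
  K_U(x) \le K_V(x) + c_{UV}
  \quad\Longrightarrow\quad
  2^{-K_U(x)} \ge 2^{-c_{UV}} \cdot 2^{-K_V(x)} .
\]
% A weird choice of U can therefore depress the prior probability of a simple
% string only by the fixed factor 2^{-c_{UV}}, however long x gets.
```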
Another way of viewing this is just to eschew the notion of generalization and state that the goal of learning is compression. If you do this, you end up doing all the same kinds of work, with many fewer philosophical headaches. Now the great deep problem of picking a prior boils down to the rather more quotidian one of picking a data format to use to transmit/encode data.
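As a toy illustration of that reframing (a sketch of mine, not part of the original exchange): each choice of data format assigns the same data a different code length, and a code of L bits corresponds to an implicit prior probability of 2^(-L).

```python
# Minimal sketch: three "formats" for the same very regular sequence.
# Each encoder's output length is the code length it assigns the data,
# i.e. an implicit prior probability of 2^(-bits).
import json
import struct
import zlib

data = [i % 7 for i in range(1000)]  # a simple, highly regular sequence

as_json   = json.dumps(data).encode()              # verbose text format
as_binary = struct.pack(f"<{len(data)}h", *data)   # packed 16-bit integers
as_zlib   = zlib.compress(as_binary, level=9)      # off-the-shelf compressor

for name, blob in [("json", as_json), ("binary", as_binary), ("zlib", as_zlib)]:
    bits = 8 * len(blob)
    print(f"{name:>6}: {len(blob):5d} bytes  ->  implicit prior 2^-{bits}")
```

The "prior" you end up with is whatever your encoder happens to reward, which is exactly the quotidian flavor of the problem described above.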
> But because of universal computation, there’s a limit to how low a probability you can assign.
I don’t understand this claim. Surely we can make the probability of a specific program as low as we want by restricting the programming language in ad hoc ways, while letting it stay Turing complete.
I agree there is some philosophical greyness here. But let me try again.
Let’s say we are adversaries. I am going to choose a data set which I claim is simple, and you are going to try to prove me wrong by picking a weird Turing machine which assigns the data a low probability. I generate my data by taking T samples from a sine wave. You pick some strange Turing machine which is designed to make sine waves as expensive to produce as possible. But regardless of the choice you make, I can always crank T up high enough that the compression rate of the data set under your machine gets arbitrarily close to 100%, proving its simplicity.
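Here is a rough numerical sketch of that argument (my own made-up numbers: the one-megabit overhead stands in for however hostile the weird machine's translation prefix is):

```python
# Sketch of the adversarial argument: the weird machine's overhead is a fixed
# constant, but the raw data grows linearly with T, so the achievable
# compression rate still tends to 100%.
import math

WEIRD_MACHINE_OVERHEAD_BITS = 1_000_000   # hypothetical translation-prefix cost
MODEL_BITS = 3 * 64                       # amplitude, frequency, phase as doubles
RAW_BITS_PER_SAMPLE = 64                  # one double per sample

for T in [10**3, 10**6, 10**9, 10**12]:
    raw_bits = RAW_BITS_PER_SAMPLE * T
    # description: translation prefix + sine-wave model + the sample count T
    compressed_bits = WEIRD_MACHINE_OVERHEAD_BITS + MODEL_BITS + math.ceil(math.log2(T))
    rate = 1 - compressed_bits / raw_bits
    print(f"T = {T:>13,}:  compression rate = {rate:.6%}")
```

For small T the weird machine wins (the rate even goes negative), but its overhead is a fixed constant, so cranking T up high enough always pushes the rate toward 100%.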