Just found this note in Shalizi’s notebooks which casts an interesting shadow on the Solomonoff prior:
The technical results say that a classification rule is simple if it has a short description, measured in bits. (That is, we are in minimum description length land, or very close to it.) The shorter the description, the tighter the bound on the generalization error. I am happy to agree that this is a reasonable (if language-dependent) way of defining “simplicity” for classifier rules. However, so far as I can tell, this really isn’t what makes the proofs work.
What actually makes the proofs go is the fact that there are not very many binary strings of length m for small m. Finding a classifier with a short description means that you have searched over a small space. It’s the restriction of the search space which really does all the work. It’s not the simplicity of the rules which matters, but the fact that simple rules are scarce in the space of all possible rules. If one confines oneself to elaborate, baroque rules, but sticks to a particular style of baroque elaboration, one obtains the same effect.

For context, see the Wikipedia page on PAC learning or this lecture by Scott Aaronson.
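To make the quoted argument concrete, here is a minimal sketch of the bound in question; this is the standard finite-class Occam bound from PAC learning, and the notation is mine, not Shalizi’s. If a rule h, drawn from a set H of rules fixed before seeing the data, is consistent with m i.i.d. samples, then with probability at least 1 − δ,

$$ \operatorname{err}(h) \;\le\; \frac{\ln|H| + \ln(1/\delta)}{m}. $$

Only the cardinality |H| appears. Description length enters solely through counting: there are at most $2^{\ell+1}$ binary strings of length at most $\ell$, so “rules describable in $\ell$ bits” is just one convenient way of naming a small set.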
Here’s my understanding of the dialog, which (as I read it) is not particularly critical of the Solomonoff prior, if that is what cousin_it meant by “casts a shadow”.
(Background knowledge) Shalizi understands “Occam’s Razor” to be something like “In order to reach the truth, among the theories compatible with the evidence, choose the simplest”.
There is a claim that he wishes to refute. The claim is that a certain result is an explanation or proof of Occam’s Razor. The result says that if one finds a simple classification rule which works well in-sample, then it is highly probable that it will continue to work well out-of-sample.
This is a failure of relevance. Occam’s Razor, as Shalizi understands it, is a way of obtaining TRUTH, but the proof only concludes something about GENERALIZATION PERFORMANCE. To illustrate the difference, he points to an example where, in order to increase generalization performance, one might decrease truth.
Shalizi contrasts the algorithmic information theory proof with Kevin T. Kelly’s Ockham Efficiency Theorem, which seems to Shalizi more productive. In particular, Kevin T. Kelly’s formalization does talk about truth rather than generalization performance.
Finally, Shalizi provides an alternative ending to the algorithmic information theory proof. If instead of choosing the simplest classification rule, one chose the simplest rule within a sparse random subset of rules (even a non-computable random subset), one could still conclude a bound on generalization performance. By providing an alternative ending, he has constructed an alternative proof. Presumably this alternative proof does NOT seem like it is a formalization of Occam’s Razor. Therefore, the interpretation of the original proof as demonstrating some version of Occam’s Razor must also be mistaken.
To sum up, Shalizi is arguing that a certain rhetorical/motivational/interpretational notion which often occurs near a specific proof is wrong. I don’t think he’s concluding anything at all about the Solomonoff prior.
Looks like I almost missed a very interesting discussion. Hope I’m not too late in joining it.
As far as I can tell, you still need the Solomonoff prior for decision making. Kevin T. Kelly’s Ockham Efficiency Theorem says that by using Ockham’s Razor you minimize reversals of opinion prior to finding the true theory, but that seems irrelevant when you have to bet on something, especially since even after you’ve found the truth using Ockham’s Razor, you don’t know that you’ve found it.
Also, I think there’s a flaw in Shalizi’s argument:
Suppose one of these rules correctly classifies all the data.
But if you’re working with a sparse random subset of the rules, why would any of them correctly classify all the data? In algorithmic information theory, the set of rules is universal so one of them is guaranteed to fit the data (assuming the input is computable, which may not be a good assumption but that’s a separate issue).
All good points if the universe has important aspects that are computable, which seems uncontroversial to me. Thanks for the link; I’d lost it some time ago. That thread is pretty epic.
If one confines oneself to elaborate, baroque rules, but sticks to a particular style of baroque elaboration, one obtains the same effect.
The point of Kolmogorov complexity is that there is some limit to how baroque your rules can ever become, or how baroque you can make a fundamentally simple rule by choosing a really weird prior.
In algorithmic information theory, the problem of choosing a prior is equivalent to choosing a particular universal Turing machine. If you pick a really weird Turing machine, you will end up assigning low prior probability to things that “normal” people would consider simple—like a sine wave, for example. But because of universal computation, there’s a limit to how low a probability you can assign. Your weird machine is still universal, so somehow or other it’s got to be able to produce a sine wave, after whatever translation-prefix machinations you have to perform to get it to act like a normal programming language.
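The standard way to make “there’s a limit” precise is the invariance theorem of algorithmic information theory (a sketch, in my notation): for any two universal machines U and V there is a constant $c_{UV}$, the length of an interpreter for V written for U, such that

$$ K_U(x) \;\le\; K_V(x) + c_{UV} \quad \text{for all } x, $$

so the prior probability U assigns to any string is within a fixed multiplicative factor $2^{-c_{UV}}$ of what V assigns. The catch, relevant to the objection below, is that nothing bounds $c_{UV}$ itself.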
Another way of viewing this is just to eschew the notion of generalization and state that the goal of learning is compression. If you do this, you end up doing all the same kinds of work, with many fewer philosophical headaches. Now the great deep problem of picking a prior boils down to the rather more quotidian one of picking a data format to use to transmit/encode data.
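In MDL terms (a standard formulation, not spelled out in the comment), the compression view means choosing the hypothesis H that minimizes the two-part code length

$$ L(H) + L(D \mid H), $$

the bits needed to transmit the model plus the bits needed to transmit the data given the model. The “prior” is then just the codebook agreed on for the first part.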
But because of universal computation, there’s a limit to how low a probability you can assign.
I don’t understand this claim. Surely we can make the probability of a specific program as low as we want by restricting the programming language in ad hoc ways, while letting it stay Turing complete.
I agree there is some philosophical greyness here. But let me try again.
Let’s say we are adversaries. I am going to choose a data set which I claim is simple, and you are going to try to prove me wrong by picking a weird Turing machine which assigns the data a low probability. I generate my data by taking T samples from a sine wave. You pick some strange Turing machine which is designed to be unable to produce sine waves. But regardless of the choice you make, I can always just crank up T to a high enough value so that the compression rate of the data set is arbitrarily close to 100%, proving its simplicity.
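A quick empirical illustration of the “crank up T” move (my sketch; zlib is only a crude stand-in for a universal compressor, and the sine is made exactly periodic so its repetition is visible to a dictionary compressor):

```python
# As T grows, the per-sample cost of describing sine data shrinks:
# the compressed size is dominated by a roughly constant "program"
# plus bookkeeping, while the raw size grows linearly in T.
import math
import zlib

def sine_bytes(T, period=64, levels=256):
    """T samples of a sine wave, quantized to one byte each."""
    return bytes(
        int((math.sin(2 * math.pi * t / period) + 1) / 2 * (levels - 1))
        for t in range(T)
    )

for T in (100, 1_000, 10_000, 100_000):
    raw = sine_bytes(T)
    ratio = len(zlib.compress(raw, 9)) / len(raw)
    print(f"T={T:>6}  compressed/raw = {ratio:.4f}")
```

The ratio heads toward zero as T grows, i.e., the savings head toward 100%. In the idealized version any universal machine must eventually do the same, since a constant-size program plus O(log T) bits regenerates the data.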
cousin_it quoting Shalizi:

It’s not the simplicity of the rules which matters, but the fact that simple rules are scarce in the space of all possible rules. If one confines oneself to elaborate, baroque rules, but sticks to a particular style of baroque elaboration, one obtains the same effect.
But a particular style of baroque elaboration is one that has a short description.
Not necessarily. (Or did I misread your comment?) The particular style can have an arbitrarily long/complex description, and learning will still work as long as the class of described rules is small enough. This observation seems to imply that algorithmic simplicity doesn’t play the central role I’d imagined it to play. This is precisely the point where I’d like to hear LW’s informed replies.
Not necessarily. The particular style can have an arbitrarily long description, and learning will still work as long as the class of described rules is small.
Fixed overheads are ignored in the definition of Kolmogorov complexity.
A “particular style of baroque elaboration” is simply a program through whose eyes certain rules look short, even when those same rules look elaborate and baroque through the eyes of another program.
The PAC-learning scenario doesn’t have any parameter that goes to infinity, so I’m not sure why you dismiss “fixed” overheads :-)
Once you’ve chosen them, they’re fixed, and don’t run off to infinity.
This definition-only-up-to-a-constant is one of the weaknesses of minimum description length. (The other is its uncomputability. Shalizi somewhere else remarks that in discussions of algorithmic complexity, it is traditional to solemnly take out Kolmogorov complexity, exhibit its theoretical properties, remark on its uncomputability, and put it away again before turning to practical matters.)
This is precisely the point where I’d like to hear LW’s informed replies.
FWIW, this is mine, informed or otherwise. Anyone else have light to shed?
No, let me try nailing this jelly to the wall once again. The definition-only-up-to-a-constant is a weakness of MDL, but this weakness isn’t relevant to my question at all! Even if we had some globally unique variant of MDL derived from some nice mathematical idea, learning theory still doesn’t use description lengths, and would be perfectly happy with rules that have long descriptions as long as we delineate a small set of those rules. To my mind this casts doubt on the importance of MDL.
Consider this alternative characterization. Someone wants to fit a polynomial to some data. They have pre-selected a sparse set of polynomials, which are in general ridiculously complex. Against all odds, they get a good fit to the training data. This theorem says that, because they haven’t examined lots and lots of polynomials, they almost certainly haven’t fallen into the trap of overfitting. Therefore, the good fit to the training data can be expected to carry over to out-of-sample data.
Shalizi is saying that this story is fine as far as it goes—it’s just not Occam’s Razor.
Good characterization. It’s worth noting that learning theory never gives any kind of guarantee that you will actually find a function that provides a good fit to the training data, it just tells you that if you do, and the function comes from a low-complexity set, it will probably give good generalization.
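A small simulation of that observation (my construction, not from the thread): the pre-selected rules are individually baroque random lookup tables, yet any rule that happens to fit the training data generalizes, simply because only N + 1 rules were ever in the running.

```python
import random

random.seed(0)
m, N, domain = 200, 1000, range(1000)   # training size, rule-set size, inputs
true_rule = lambda x: x % 7 < 3         # ground truth, unknown to the learner

# N "baroque" rules: random lookup tables over the whole domain,
# plus one planted rule that agrees with the truth.
rules = [[random.random() < 0.5 for _ in domain] for _ in range(N)]
rules.append([true_rule(x) for x in domain])

train = random.sample(domain, m)
consistent = [r for r in rules if all(r[x] == true_rule(x) for x in train)]

for r in consistent:
    err = sum(r[x] != true_rule(x) for x in domain) / len(domain)
    print(f"error of a consistent rule over the whole domain: {err:.3f}")
# Occam bound: with prob >= 1 - delta, err <= (ln(N+1) + ln(1/delta)) / m,
# about 0.046 for delta = 0.1 -- and no rule's description length appeared.
```

With overwhelming probability only the planted rule survives the consistency check, and its error is zero; the random tables each match 200 training points with probability $2^{-200}$.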
learning theory still doesn’t use description lengths, and would be perfectly happy with rules that have long descriptions as long as we delineate a small set of those rules
Any delineation of a small set of rules leads immediately to a short description length for the rules. You just need to encode the index of the rule in the set, costing log(N) bits for a set of size N.
Note that MDL is not the same as algorithmic information theory (definition-up-to-a-constant comes up in AIT, not MDL), though they’re of course related.
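In symbols: if describing the rule set S itself takes L(S) bits, the full two-part description length of a rule is

$$ L(S) + \lceil \log_2 N \rceil, $$

and the learning-theory bound charges only the second term, because S is fixed before any data arrive. The reply below is, in effect, that nothing in the learning story caps the first term.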
You just need to encode the index of the rule in the set, costing log(N) bits for a set of size N.
I see everyone’s assuming that some things, by their nature, always go to infinity (e.g. the number of samples) while others stay constant (e.g. the rule set). This is a nice convention, but not always realistic—and it certainly wasn’t part of the original formulation of the learning problem, where everything is finite. If you really want things to grow, why not allow the set itself to be specified by increasingly convoluted algorithms as N goes to infinity? Like, exponential in N? :-) Learning can still work—in theory, it’ll work just as well as with a simple rule set—but you’ll have a hard time explaining that with MDL.
(If there’s some theoretical justification why this kind of outrageous bloatware won’t be as successful as simple algorithms, I’d really like to hear it...)
You might find what you’re looking for in Kevin T. Kelly’s work:

http://www.andrew.cmu.edu/user/kk3n/ockham/Ockham.htm

Dr. Shalizi mentioned it as an alternative formalization of Occam’s Razor.
Johnicolas, thanks. This link and your other comments in those threads are very close to what I’m looking for, though this realization took me some time.
To my mind this casts doubt on the importance of MDL.
I think its uncomputability already does that. When you make a computable version by limiting attention to some framework of descriptive capabilities smaller than universal computation, different choices of that framework will give you different measures of simplicity. What is simple in one framework may seem elaborate and baroque in another. Or as some military strategist once put it:
“To the foot-soldier, the strategy of a general may seem obscure, shrouded in shadows and fog, but to the general himself, his way is as plain as if he were marching his army down a broad, straight highway.”