Kevin T. Kelly’s Ockham Efficiency Theorem

There is a game studied in Philosophy of Science and Probably Approximately Correct (machine) learning. It’s a cousin to the Looney Labs game “Zendo”, but less fun to play with your friends. http://​​en.wikipedia.org/​​wiki/​​Zendo_(game) (By the way, playing this kind of game is excellent practice at avoiding confirmation bias.) The game has two players, who are asymmetric. One player plays Nature, and the other player plays Science. First Nature makes up a law, a specific Grand Unified Theory, and then Science tries to guess it. Nature provides some information about the law, and then Science can change their guess, if they want to. Science wins if it converges to the rule that Nature made up.

Kevin T. Kelly is a philosopher of science, and studies (among other things) the justification within the philosophy of science for William of Occam’s razor: “entities should not be multiplied beyond necessity”. The way that he does this is by proving theorems about the Nature/​Science game with all of the details elaborated.

Why should you care? Firstly, his justification is different from the overfitting justification that sometimes shows up in Bayesian literature. Roughly speaking, the overfitting justification characterizes our use of Occam’s razor as pragmatic—we use Occam’s razor to do science because we get good generalization performance from it. If we found something else (e.g. boosting and bagging, or some future technique) that proposed oddly complex hypotheses, but achieved good generalization performance, we would switch away from Occam’s razor.

An aside regarding boosting and bagging: These are ensemble machine learning techniques. Suppose you had a technique that created decision tree classifiers, such as C4.5 (http://​​en.wikipedia.org/​​wiki/​​C4.5_algorithm), or even decision stumps (http://​​en.wikipedia.org/​​wiki/​​Decision_stump). Adaboost (http://​​en.wikipedia.org/​​wiki/​​Adaboost) would start by weighting all of the examples identically and invoking your technique to find an initial classifier. Then it would reweight the examples, prioritizing the ones that the first iteration got wrong, and invoke your technique on the reweighted input. Eventually, Adaboost outputs an ensemble of decision trees (or decision stumps), and taking the majority opinion of the ensemble might well be more effective (generalize better beyond training) than the original classifier. Bagging is similar (http://​​en.wikipedia.org/​​wiki/​​Bootstrap_aggregating).

Ensemble learning methods are a challenge for “prevents overfitting” justifications of Occam’s razor since they propose weirdly complex hypotheses, but suffer less from overfitting than the weak classifiers that they are built from.

Secondly, his alternative justification bears on ordering hypotheses “by simplicity”, providing an alternative to (approximations of) Kolmogorov complexity as a foundation of science.

Let’s take a philosopher’s view of physics—from a distance, a philosopher listens to the particle physicists and hears: “There are three fundamental particles that make up all matter!”, “We’ve discovered another, higher-energy particle!”, “Another, even higher-energy particle!”. There is no reason the philosopher can see why this should stop at any particular number of particles. What should the philosopher believe at any given moment?

This is a form of the Nature vs. Science game where Nature’s “Grand Unified Theory” is known to be a nonnegative number (of particles), and at each round Nature can reveal a new fundamental particle (by name) or remain silent. What is the philosopher of science’s strategy? Occam’s razor suggests that we prefer simpler hypotheses, but if there were 254 known particles, how would we decide whether to claim that there are 254, 255, or 256 particles? Note that in many encodings, 255 and 256 are quite round numbers and therefore have short descriptions; low Kolmogorov complexity.

An aside regarding “round numbers”: Here are some “shortest expressions” of some numbers, according to one possible grammar of expressions. (Kolmogorov complexity is unique up to an additive constant, but since we never actually use Kolmogorov complexity, that isn’t particularly helpful.)

1 == (1)

2 == (1+1)

3 == (1+1+1)

4 == (1+1+1+1)

5 == (1+1+1+1+1)

6 == (1+1+1+1+1+1)

7 == (1+1+1+1+1+1+1)

8 == ((1+1)*(1+1+1+1))

9 == ((1+1+1)*(1+1+1))

10 == ((1+1)*(1+1+1+1+1))

11 == (1+1+1+1+1+1+1+1+1+1+1)

12 == ((1+1+1)*(1+1+1+1))

13 == ((1)+((1+1+1)*(1+1+1+1)))

14 == ((1+1)*(1+1+1+1+1+1+1))

15 == ((1+1+1)*(1+1+1+1+1))

16 == ((1+1+1+1)*(1+1+1+1))

17 == ((1)+((1+1+1+1)*(1+1+1+1)))

18 == ((1+1+1)*(1+1+1+1+1+1))

19 == ((1)+((1+1+1)*(1+1+1+1+1+1)))

20 == ((1+1+1+1)*(1+1+1+1+1))

21 == ((1+1+1)*(1+1+1+1+1+1+1))

22 == ((1)+((1+1+1)*(1+1+1+1+1+1+1)))

24 == ((1+1+1+1)*(1+1+1+1+1+1))

25 == ((1+1+1+1+1)*(1+1+1+1+1))

26 == ((1)+((1+1+1+1+1)*(1+1+1+1+1)))

27 == ((1+1+1)*((1+1+1)*(1+1+1)))

28 == ((1+1+1+1)*(1+1+1+1+1+1+1))

30 == ((1+1+1+1+1)*(1+1+1+1+1+1))

32 == ((1+1)*((1+1+1+1)*(1+1+1+1)))

35 == ((1+1+1+1+1)*(1+1+1+1+1+1+1))

36 == ((1+1+1)*((1+1+1)*(1+1+1+1)))

40 == ((1+1)*((1+1+1+1)*(1+1+1+1+1)))

42 == ((1+1+1+1+1+1)*(1+1+1+1+1+1+1))

45 == ((1+1+1)*((1+1+1)*(1+1+1+1+1)))

As you can see, there are several inversions, where a larger number (in magnitude) is “simpler” in that it has a shorter shortest expression (in this grammar). The first inversion occurs at 12, which is simpler than 11. In this grammar, squares, powers of two, and smooth numbers generally (http://​​en.wikipedia.org/​​wiki/​​Smooth_number) will be considered simple. Even though “the 256th prime” sounds like a short description of a number, this grammar isn’t flexible enough to capture that concept, and so this grammar does not consider primes to be simple. I believe this illustrates that picking a particular approximate Kolmogorov-complexity-like concept is a variety of justification of Occam’s razor by aesthetics. Humans can argue about aesthetics, and be convinced by arguments somewhat like logic, but ultimately it is a matter of taste.

In contrast, Kelly’s idea of “simplicity” is related to Popper’s falsifiability. In this sense of simplicity, a complex theory is one that can (for a while) camouflage itself as another (simple) theory, but a simple theory cannot pretend to be complex. So if Nature really had 256 particles, it could refuse to reveal them for a while (maybe they’re hidden in “very high energy regimes”), and the 256-particle universe would exactly match the givens for a 254-particle universe. However, the 254-particle universe cannot reveal new particles; it’s already shown everything that it has.

Remember, we’re not talking about elaborate data analysis, where there could be “mirages” or “aliasing”, patterns in the data that look initially like a new particle, but later reveal themselves to be explicable using known particles. We’re talking about a particular form of the Nature vs Science game where each round Nature either reveals (by name) a new particle, or remains silent. This illustrates that Kelly’s simplicity is relative to the possible observables. In this scenario, where Nature identifies new particles by name, then the hypothesis that we have seen all of the particles and there will be no new particles is always the simplest, the “most falsifiable”. With more realistic observables, the question of what is the simplest consistent hypothesis becomes trickier.

A more realistic example that Kelly uses (following Kuhn) is the Copernican hypothesis that the earth and other planets circle the sun. In what sense is it simpler than the geocentric hypothesis? From a casual modern perspective, both hypotheses might seem symmetric and of similar complexity. The crucial effect is that Ptolemy’s model parameters (velocities, diameters) have to be carefully adjusted to create a “coincidence”—apparent retrograde motion that always coincides with solar conjunction (for mercury and venus) and solar opposition (for the other planets). (Note: Retrograde means moving in the unusual direction, west to east. Conjunction means two entities near each other in the sky. Opposition means the entities are at opposite points of the celestial sphere.) The Copernican model “predicts” that coincidence; not in the sense that the creation of the model precedes knowledge of the effect, but that any tiny deviation from exact coincidence to be discovered in the future would be evidence against the Copernican model. In this sense, the Copernican model is more falsifiable; simpler.

The Ockham Efficiency Theorems explain in what sense this version of Occam’s razor is strictly better than other strategies for the Science vs. Nature game. If what we care about is the number of public mind changes (saying “I was mistaken, actually there are X particles.” would count as a mind change for any X), and the timing of the mind changes, then Occam’s razor is the best strategy for the Science vs. Nature game. The Occam strategy for the number-of-particles game will achieve exactly as many mind changes as there are particles. A scientist who deviates from Occam’s razor allows Nature to extract a mind change from the (hasty) scientist “for free”.

The way this works in the particles game is simple. To extract a mind change from a hasty scientist who jumps to predicting 12 particles when they’ve only seen 11, or 256 particles when they’ve only seen 254, Nature can simply continuously refuse to reveal new particles. If the scientist doesn’t ever switch back down to the known number of particles, then they’re a nonconvergent scientist—they lose the game. If the scientist, confronted with a long run of “no new particles found” does switch to the known number of particles, then Nature has extracted a mind change from the scientist without a corresponding particle. The Occam strategy achieves the fewest possible number of mind changes (that is, equal to the number of particles), given such an adversarial Nature.

The “Ockham Efficiency Theorems” refer to the worked-out details of more elaborate Science vs. Nature games—where Nature chooses a polynomial GUT, for example.

This entire scenario does generalize to noisy observations as well (learning Perlean causal graphs) though I don’t understand this aspect fully. If I understand correctly, the scientist guesses a probability distribution over the possible worlds and you count “mind changes” as changes of that probability distribution, so adjusting a 0.9 probability to 0.8 would be counted as a fraction of a mind change. Anyway, read the actual papers, they’re well-written and convincing.

www.andrew.cmu.edu/​user/​kk3n/​ockham/​Ockham.htm

This post benefited greatly from encouragement and critique by cousin_it.