Occam’s Razor

The more complex an explanation is, the more evidence you need just to find it in belief-space. (In Traditional Rationality this is often phrased misleadingly, as “The more complex a proposition is, the more evidence is required to argue for it.”) How can we measure the complexity of an explanation? How can we determine how much evidence is required?

Occam’s Razor is often phrased as “The simplest explanation that fits the facts.” Robert Heinlein replied that the simplest explanation is “The lady down the street is a witch; she did it.”

One observes that the length of an English sentence is not a good way to measure “complexity.” And “fitting” the facts by merely failing to prohibit them is insufficient.

Why, exactly, is the length of an English sentence a poor measure of complexity? Because when you speak a sentence aloud, you are using labels for concepts that the listener shares—the receiver has already stored the complexity in them. Suppose we abbreviated Heinlein’s whole sentence as “Tldtsiawsdi!” so that the entire explanation can be conveyed in one word; better yet, we’ll give it a short arbitrary label like “Fnord!” Does this reduce the complexity? No, because you have to tell the listener in advance that “Tldtsiawsdi!” stands for “The lady down the street is a witch; she did it.” “Witch,” itself, is a label for some extraordinary assertions—just because we all know what it means doesn’t mean the concept is simple.

An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, “Maybe a really powerful agent was angry and threw a lightning bolt.” The human brain is the most complex artifact in the known universe. If anger seems simple, it’s because we don’t see all the neural circuitry that’s implementing the emotion. (Imagine trying to explain why Saturday Night Live is funny, to an alien species with no sense of humor. But don’t feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.

To a human, Maxwell’s equations take much longer to explain than Thor. Humans don’t have a built-in vocabulary for calculus the way we have a built-in vocabulary for anger. You’ve got to explain your language, and the language behind the language, and the very concept of mathematics, before you can start on electricity.

And yet it seems that there should be some sense in which Maxwell’s equations are simpler than a human brain, or Thor the thunder-agent.

There is. It’s enormously easier (as it turns out) to write a computer program that simulates Maxwell’s equations, compared to a computer program that simulates an intelligent emotional mind like Thor.

The formalism of Solomonoff induction measures the “complexity of a description” by the length of the shortest computer program which produces that description as an output. To talk about the “shortest computer program” that does something, you need to specify a space of computer programs, which requires a language and interpreter. Solomonoff induction uses Turing machines, or rather, bitstrings that specify Turing machines. What if you don’t like Turing machines? Then there’s only a constant complexity penalty to design your own universal Turing machine that interprets whatever code you give it in whatever programming language you like. Different inductive formalisms are penalized by a worst-case constant factor relative to each other, corresponding to the size of a universal interpreter for that formalism.
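
As a very rough sketch (not Solomonoff’s actual formalism, which ranges over bitstrings encoding Turing machines), we can use the length of Python source text as a stand-in for program length; the candidate “programs” below are purely illustrative.

```python
# Rough sketch: Python source length as a stand-in for "shortest program."
# Real Solomonoff induction uses bitstrings that specify Turing machines;
# switching languages changes every length by at most the constant cost of
# an interpreter, not the ordering in the limit.

candidates = {
    "repeat":  'print("01" * 50)',              # short program, regular output
    "literal": 'print("' + "01" * 50 + '")',    # stores the output verbatim
}

for name, source in candidates.items():
    print(f"{name}: {len(source.encode()) * 8} bits of program text")
```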

In the better (in my humble opinion) versions of Solomonoff induction, the computer program does not produce a deterministic prediction, but assigns probabilities to strings. For example, we could explain a fair coin by writing a program that assigns equal probabilities to all 2^N strings of length N. This is Solomonoff induction’s approach to fitting the observed data. The higher the probability a program assigns to the observed data, the better that program fits the data. And probabilities must sum to 1, so for a program to better “fit” one possibility, it must steal probability mass from some other possibility which will then “fit” much more poorly. There is no superfair coin that assigns 100% probability to heads and 100% probability to tails.
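
Here is a minimal sketch of “programs that assign probabilities,” with made-up helper names (fair_coin, fixed_sequence) rather than anything from the formalism itself:

```python
# Two hypothetical "programs," each viewed as an assignment of probability
# to coin-flip strings; the names are made up for illustration.

def fair_coin(s):
    # Spreads probability equally over all 2^N strings of length N.
    return 0.5 ** len(s)

def fixed_sequence(s, claim="htthht"):
    # Bets all of its probability mass on one particular outcome.
    return 1.0 if s == claim else 0.0

observed = "htthht"
print(fair_coin(observed))        # 0.015625, i.e. 1/64
print(fixed_sequence(observed))   # 1.0 -- a far better fit to this data
print(fixed_sequence("hhhhhh"))   # 0.0 -- the stolen mass had to come from somewhere
```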

How do we trade off the fit to the data, against the complexity of the program? If you ignore complexity penalties, and think only about fit, then you will always prefer programs that claim to deterministically predict the data, assigning it 100% probability. If the coin shows htthht, then the program that claims the coin was fixed to show htthht fits the observed data 64 times better than the program which claims the coin is fair. Conversely, if you ignore fit, and consider only complexity, then the “fair coin” hypothesis will always seem simpler than any other hypothesis. Even if the coin turns up hthhthhhthhhhthhhhht . . .

Indeed, the fair coin is simpler and it fits this data exactly as well as it fits any other string of 20 coinflips—no more, no less—but we see another hypothesis, seeming not too complicated, that fits the data much better.
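
A sketch of that comparison, with a hypothetical “lengthening runs” program standing in for the not-too-complicated hypothesis:

```python
observed = "hthhthhhthhhhthhhhht"   # runs of heads of increasing length, each ended by a tail

def fair_coin(s):
    return 0.5 ** len(s)

def lengthening_runs(s):
    # Hypothetical pattern program: h-runs of length 1, 2, 3, 4, 5,
    # each terminated by a single t.
    return 1.0 if s == "".join("h" * n + "t" for n in range(1, 6)) else 0.0

print(fair_coin(observed))          # 2**-20, roughly one in a million
print(lengthening_runs(observed))   # 1.0 -- about a million times better fit
```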

If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should cost at least a “factor of two gain” in the fit. If you try to design a computer program that explicitly stores an outcome like htthht, the six bits that you lose in complexity must destroy all plausibility gained by a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.
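
A back-of-the-envelope check of that accounting, charging each hypothetical program a prior of one half per extra bit it stores:

```python
# Back-of-the-envelope accounting: prior 2**-(extra bits stored) times fit.

observed = "htthht"

hypotheses = {
    "fair coin":             {"extra_bits": 0, "fit": 0.5 ** len(observed)},  # fit = 1/64
    "fixed to show htthht":  {"extra_bits": 6, "fit": 1.0},
}

for name, h in hypotheses.items():
    score = 2.0 ** -h["extra_bits"] * h["fit"]
    print(name, score)   # both scores are 1/64: the six stored bits exactly
                         # cancel the 64-fold improvement in fit
```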

Unless your program is being smart, and compressing the data, it should do no good just to move one bit from the data into the program description.

The way Solomonoff induction works to predict sequences is that you sum up over all allowed computer programs—if every program is allowed, Solomonoff induction becomes uncomputable—with each program having a prior probability of (1/2) to the power of its code length in bits, and each program is further weighted by its fit to all data observed so far. This gives you a weighted mixture of experts that can predict future bits.
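
Here is a minimal sketch of such a mixture, restricted to three hand-picked “experts” with invented code lengths so that it stays computable; real Solomonoff induction sums over all programs.

```python
# Minimal sketch of the mixture, restricted to three hand-picked "experts"
# with invented code lengths; real Solomonoff induction sums over all
# programs and is uncomputable.

experts = [
    # (name, code length in bits, P(next flip is "h" | history))
    ("fair coin",    5, lambda hist: 0.5),
    ("always heads", 4, lambda hist: 1.0),
    ("alternating",  7, lambda hist: 0.0 if (hist and hist[-1] == "h") else 1.0),
]

def predict_next_h(history):
    total, weighted = 0.0, 0.0
    for _name, bits, prob_h in experts:
        weight = 2.0 ** -bits            # prior: 1/2 per bit of code
        for i, flip in enumerate(history):
            p_h = prob_h(history[:i])    # fit: probability assigned to each observed flip
            weight *= p_h if flip == "h" else (1.0 - p_h)
        total += weight
        weighted += weight * prob_h(history)
    return weighted / total

print(predict_next_h("hthth"))   # the alternating expert dominates: well below 1/2
print(predict_next_h("hhhhh"))   # "always heads" dominates: close to 1
```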

The Minimum Message Length formalism is nearly equivalent to Solomonoff induction. You send a string describing a code, and then you send a string describing the data in that code. Whichever explanation leads to the shortest total message is the best. If you think of the set of allowable codes as a space of computer programs, and the code description language as a universal machine, then Minimum Message Length is nearly equivalent to Solomonoff induction.1
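
A toy comparison in that spirit, using character counts as a stand-in for message length and two invented codes for the sequence 0101010101:

```python
# Toy MML comparison; character counts stand in for message length in bits,
# and both "codes" are invented for illustration.

data = "0101010101"

# Code A: a trivial "literal" code -- the data is sent verbatim.
message_a = "literal:" + data                # 8 + 10 = 18 characters

# Code B: a "repeat this unit" code -- longer code description, shorter data.
message_b = "repeat:" + "01" + "x5"          # 7 + 2 + 2 = 11 characters

print(len(message_a), len(message_b))        # 18 vs 11: the shorter total wins,
                                             # and the gap grows with longer sequences
```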

This lets us see clearly the problem with using “The lady down the street is a witch; she did it” to explain the pattern in the sequence 0101010101. If you’re sending a message to a friend, trying to describe the sequence you observed, you would have to say: “The lady down the street is a witch; she made the sequence come out 0101010101.” Your accusation of witchcraft wouldn’t let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused.
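
Extending the same toy scheme, the witchcraft hypothesis only prepends a prologue and must still transmit every observed bit:

```python
# The witchcraft "explanation" in the same toy scheme: it adds a prologue
# but still has to transmit every observed bit.

data = "0101010101"

plain_message = "literal:" + data
witch_message = "the lady down the street is a witch; she made it come out " + data

print(len(plain_message), len(witch_message))   # the accusation buys no compression;
                                                # it only lengthens the total message
```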

Witchcraft may fit our observations in the sense of qualitatively permitting them; but this is because witchcraft permits everything, like saying “Phlogiston!” So, even after you say “witch,” you still have to describe all the observed data in full detail. You have not compressed the total length of the message describing your observations by transmitting the message about witchcraft; you have simply added a useless prologue, increasing the total length.

The real sneakiness was concealed in the word “it” of “A witch did it.” A witch did what?

Of course, thanks to hindsight bias and anchoring and fake explanations and fake causality and positive bias and motivated cognition, it may seem all too obvious that if a woman is a witch, of course she would make the coin come up 0101010101. But I’ll get to that soon enough. . .

1 Nearly, because it chooses the shortest program, rather than summing up over all programs.