# Occam’s Razor

The more complex an explanation is, the more evidence you need just to find it in belief-space. (In Traditional Rationality this is often phrased misleadingly, as “The more complex a proposition is, the more evidence is required to argue for it.”) How can we measure the complexity of an explanation? How can we determine how much evidence is required?

Occam’s Razor is often phrased as “The simplest explanation that fits the facts.” Robert Heinlein replied that the simplest explanation is “The lady down the street is a witch; she did it.”

One observes that the length of an English sentence is not a good way to measure “complexity.” And “fitting” the facts by merely failing to prohibit them is insufficient.

Why, exactly, is the length of an English sentence a poor measure of complexity? Because when you speak a sentence aloud, you are using labels for concepts that the listener shares—the receiver has already stored the complexity in them. Suppose we abbreviated Heinlein’s whole sentence as “Tldtsiawsdi!” so that the entire explanation can be conveyed in one word; better yet, we’ll give it a short arbitrary label like “Fnord!” Does this reduce the complexity? No, because you have to tell the listener in advance that “Tldtsiawsdi!” stands for “The lady down the street is a witch; she did it.” “Witch,” itself, is a label for some extraordinary assertions—just because we all know what it means doesn’t mean the concept is simple.

An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, “Maybe a really powerful agent was angry and threw a lightning bolt.” The human brain is the most complex artifact in the known universe. If anger seems simple, it’s because we don’t see all the neural circuitry that’s implementing the emotion. (Imagine trying to explain why Saturday Night Live is funny, to an alien species with no sense of humor. But don’t feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.

To a human, Maxwell’s equations take much longer to explain than Thor. Humans don’t have a built-in vocabulary for calculus the way we have a built-in vocabulary for anger. You’ve got to explain your language, and the language behind the language, and the very concept of mathematics, before you can start on electricity.

And yet it seems that there should be some sense in which Maxwell’s equations are simpler than a human brain, or Thor the thunder-agent.

There is. It’s enormously easier (as it turns out) to write a computer program that simulates Maxwell’s equations, compared to a computer program that simulates an intelligent emotional mind like Thor.

The formalism of Solomonoff induction measures the “complexity of a description” by the length of the shortest computer program which produces that description as an output. To talk about the “shortest computer program” that does something, you need to specify a space of computer programs, which requires a language and interpreter. Solomonoff induction uses Turing machines, or rather, bitstrings that specify Turing machines. What if you don’t like Turing machines? Then there’s only a constant complexity penalty to design your own universal Turing machine that interprets whatever code you give it in whatever programming language you like. Different inductive formalisms are penalized by a worst-case constant factor relative to each other, corresponding to the size of a universal interpreter for that formalism.
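
True shortest-program length is uncomputable, but a general-purpose compressor gives a computable upper bound that illustrates the idea. A minimal sketch in Python, assuming only the standard library; the absolute numbers depend on the compressor chosen, which here plays the role of the “choice of universal machine” above:

```python
# Hedged illustration: zlib's compressed size is a computable stand-in for
# "length of the shortest program producing this string". It only upper-bounds
# the true value, with compressor-dependent overhead (the constant penalty).
import random
import zlib

def description_length_bits(s: str) -> int:
    """Upper-bound the description length of s, in bits, via compression."""
    return 8 * len(zlib.compress(s.encode("utf-8"), 9))

regular = "01" * 500  # highly regular: a short description exists
random.seed(0)
noisy = "".join(random.choice("01") for _ in range(1000))  # patternless

print(description_length_bits(regular))  # small: the pattern compresses well
print(description_length_bits(noisy))    # much larger: no short description found
```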

In the better (in my humble opinion) versions of Solomonoff induction, the computer program does not produce a deterministic prediction, but assigns probabilities to strings. For example, we could write a program to explain a fair coin by writing a program that assigns equal probabilities to all 2^N strings of length N. This is Solomonoff induction’s approach to fitting the observed data. The higher the probability a program assigns to the observed data, the better that program fits the data. And probabilities must sum to 1, so for a program to better “fit” one possibility, it must steal probability mass from some other possibility which will then “fit” much more poorly. There is no superfair coin that assigns 100% probability to heads and 100% probability to tails.
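
As a quick sanity check of that constraint, here is a minimal sketch (plain Python, nothing assumed beyond the paragraph above): the fair-coin program spreads exactly one unit of probability mass over all 2^N strings, so fitting one string better necessarily means fitting another worse.

```python
# The "fair coin" program assigns 2^-N to each of the 2^N strings of
# length N; the probabilities sum to exactly 1, so there is no slack
# for a "superfair" coin that over-fits everything at once.
from itertools import product

N = 6
fair = {"".join(bits): 0.5 ** N for bits in product("01", repeat=N)}
print(len(fair), sum(fair.values()))  # 64 strings, total probability 1.0
```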

How do we trade off the fit to the data, against the complexity of the program? If you ignore complexity penalties, and think only about fit, then you will always prefer programs that claim to deterministically predict the data, assign it 100% probability. If the coin shows HTTHHT, then the program that claims that the coin was fixed to show HTTHHT fits the observed data 64 times better than the program which claims the coin is fair. Conversely, if you ignore fit, and consider only complexity, then the “fair coin” hypothesis will always seem simpler than any other hypothesis. Even if the coin turns up HTHHTHHHTHHHHTHHHHHT . . .

Indeed, the fair coin is simpler and it fits this data exactly as well as it fits any other string of 20 coinflips—no more, no less—but we see another hypothesis, seeming not too complicated, that fits the data much better.

If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should cost at least a “factor of two gain” in the fit. If you try to design a computer program that explicitly stores an outcome like “HTTHHT”, the six bits that you lose in complexity must destroy all plausibility gained by a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.
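
To make the bookkeeping concrete, here is a hedged sketch; the 10-bit length of the base program is an invented illustration (real program lengths depend on the universal machine), but the cancellation it exhibits is exact for any base length:

```python
# One stored outcome bit must buy a factor-of-two gain in fit, so storing
# all six bits of "HTTHHT" exactly cancels the 64-fold fit improvement.
data_len = 6                      # six observed coinflips: HTTHHT

fair_fit = 0.5 ** data_len        # fair coin assigns 2^-6 to any 6-flip string
fixed_fit = 1.0                   # "rigged to show HTTHHT" assigns probability 1

fair_prior = 2.0 ** -10           # assumed 10-bit program for "fair coin"
fixed_prior = 2.0 ** -(10 + 6)    # same program plus 6 explicitly stored bits

print(fair_prior * fair_fit)      # 2^-16
print(fixed_prior * fixed_fit)    # 2^-16: complexity cost exactly offsets fit gain
```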

Unless your program is being smart, and compressing the data, it should do no good just to move one bit from the data into the program description.

The way Solomonoff induction works to predict sequences is that you sum up over all allowed computer programs—if every program is allowed, Solomonoff induction becomes uncomputable—with each program having a prior probability of 1/2 to the power of its code length in bits, and each program is further weighted by its fit to all data observed so far. This gives you a weighted mixture of experts that can predict future bits.
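
Here is a toy version of that mixture as a hedged sketch: the three “experts” and their code lengths are invented stand-ins for the space of all programs, but the weighting rule, 2^-length times the probability assigned to the data so far, is the one just described.

```python
# A miniature Solomonoff-style mixture: weight each expert by
# 2^-(code length) times its probability for the observed bits, then
# average the experts' next-bit predictions by those weights.
experts = [
    # (name, assumed code length in bits, P(next bit is "1" | history))
    ("fair coin",   8,  lambda h: 0.5),
    ("always one",  6,  lambda h: 1.0),
    ("alternating", 12, lambda h: 0.0 if h.endswith("1") else 1.0),
]

def predict_next(history: str) -> float:
    """P(next bit is "1") under the complexity-and-fit-weighted mixture."""
    weighted = []
    for _, length, p_one in experts:
        fit = 1.0
        for i, bit in enumerate(history):  # P(observed bits | expert)
            p = p_one(history[:i])
            fit *= p if bit == "1" else 1.0 - p
        weighted.append((2.0 ** -length * fit, p_one))
    total = sum(w for w, _ in weighted)
    return sum(w * p_one(history) for w, p_one in weighted) / total

print(predict_next("101010"))  # ~0.9: the alternating expert dominates
```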

The Minimum Message Length formalism is nearly equivalent to Solomonoff induction. You send a string describing a code, and then you send a string describing the data in that code. Whichever explanation leads to the shortest total message is the best. If you think of the set of allowable codes as a space of computer programs, and the code description language as a universal machine, then Minimum Message Length is nearly equivalent to Solomonoff induction.1

This lets us see clearly the problem with using “The lady down the street is a witch; she did it” to explain the pattern in the sequence 0101010101. If you’re sending a message to a friend, trying to describe the sequence you observed, you would have to say: “The lady down the street is a witch; she made the sequence come out 0101010101.” Your accusation of witchcraft wouldn’t let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused.

Witchcraft may fit our observations in the sense of qualitatively permitting them; but this is because witchcraft permits everything, like saying “Phlogiston!” So, even after you say “witch,” you still have to describe all the observed data in full detail. You have not compressed the total length of the message describing your observations by transmitting the message about witchcraft; you have simply added a useless prologue, increasing the total length.
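
In message-length terms the point looks like this; the bit counts below are illustrative assumptions rather than measured code lengths, but the ordering is forced, since the witch prologue buys no compression of the data:

```python
# Minimum Message Length comparison: total message = model description
# plus data encoded under that model.
data = "0101010101"

raw_message = len(data)           # no model: just send the 10 bits

pattern_message = 6 + 0           # assumed 6-bit code for "repeat '01' five
                                  # times"; the data then costs 0 further bits

witch_message = 20 + len(data)    # assumed 20-bit "a witch did it" prologue;
                                  # witchcraft constrains nothing, so all 10
                                  # data bits must still be sent afterwards

print(raw_message, pattern_message, witch_message)  # 10 6 30
```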

The real sneakiness was concealed in the word “it” of “A witch did it.” A witch did what?

Of course, thanks to hindsight bias and anchoring and fake explanations and fake causality and positive bias and motivated cognition, it may seem all too obvious that if a woman is a witch, of course she would make the coin come up 0101010101. But I’ll get to that soon enough. . .

1 Nearly, because it chooses the shortest program, rather than summing up over all programs.

• The Vapnik-Chervonenkis dimension also offers a way of filling in the details of the concept of “simple” appropriate to Occam’s Razor. I’ve read about it in the context of statistical learning theory, specifically “probably approximately correct learning”.

Having successfully tuned the parameters of your model to fit the data, how likely is it to fit new data; that is, how well does it generalise? The VC dimension comes with formulae that tell you. I’ve not been able to follow the field, but I suspect that VC dimension leads to worst-case estimates whose usefulness is harmed by their pessimism.

• Great post!

• “Your accusation of witchcraft wouldn’t let you shorten the rest of the message; you would still have to describe, in full detail, the data which her witchery caused.”

My model of witches, if I had one, would produce a given simple sequence like 01010101 with greater probability than a given random sequence like 00011011. Wouldn’t yours? I might agree if you said “in nearly full detail”.

• Steven, that means you have to transmit the accusation of witchcraft, followed by a computer program, followed by the coded data. Why not just transmit the computer program followed by the coded data? I don’t expect my own environment to be random noise, but that has nothing to do with witchcraft...

Alan, I agree that VC dimension is an important conceptually different way of thinking about “complexity”. One of its primary selling points is that, for example, it doesn’t attach infinite complexity to a model class that contains one real-valued parameter, if that model class isn’t very flexible (i.e., it says only “the data points are greater than R”). But VC complexity doesn’t plug into standard probability theory as easily as Solomonoff induction.

• In Solomonoff induction it is important to use a two-tape Turing machine where one tape is for the program and one is for the input and work space. The program tape is an infinite random string, but the program length is defined to be the number of bits that the Turing machine actually reads during its execution. This way the set of possible programs becomes a prefix-free set. It follows that the prior probabilities will add up to one when you weight by 2^(-l) where l is program length. (I believe this was realized by Leonid Levin. In Solomonoff’s original scheme the prior probabilities did not add to one.) This also allows the beautiful interpretation that the program tape is assigned by independent coin flips for each bit, and the 2^-l weighting arises naturally rather than as an artificial assumption. I believe this is discussed in the information theory book by Cover and Thomas.
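
The “arises naturally” claim is easy to check empirically with a toy prefix-free code standing in for real program tapes; a hedged sketch in Python:

```python
# Feed a reader random coinflips; the chance that it parses codeword w
# is 2^-len(w), and over a prefix-free set those chances sum to 1.
import random

random.seed(1)
programs = ["0", "10", "110", "111"]  # prefix-free: no codeword extends another

def read_program(bits):
    """Consume bits one at a time until exactly one codeword has been read."""
    word = ""
    for b in bits:
        word += b
        if word in programs:
            return word

trials = 100_000
counts = {p: 0 for p in programs}
for _ in range(trials):
    tape = iter(lambda: random.choice("01"), None)  # endless random bit tape
    counts[read_program(tape)] += 1

for p in programs:
    print(p, counts[p] / trials, 2.0 ** -len(p))  # empirical freq vs 2^-l
```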

• Eliezer,

“I don’t expect my own environment to be random noise, but that has nothing to do with witchcraft...”

I think I misinterpreted the math and now see what you’re getting at. Would it be an accurate translation to human language to say, “a sequence like 10101010 may favor witchcraft over the hypothesis that nothing weird is going on (i.e. the coinflips are random), but it will never favor witchcraft over the simpler hypothesis that something weird is going on that isn’t witchcraft”?

I find it awkward to think of “witchcraft” as just a content-free word; what “witchcraft” means to me is something like the possibility that reality includes human-mind-like things with personalities and with preferences that they achieve through unknown nonstandard causal means. If you coded that up, it would probably no longer be content-free; it would allow shortening the rest of the program generating the sequences in some cases and require lengthening it in some other cases. In all realistic cases the resulting program would still be longer than necessary.

• Good comments, all!

Steven, yes. Stephen, also yes.

• Eli, you said:

An enormous bolt of electricity comes out of the sky and hits something, and the Norse tribesfolk say, “Maybe a really powerful agent was angry and threw a lightning bolt.” The human brain is the most complex artifact in the known universe. If anger seems simple, it’s because we don’t see all the neural circuitry that’s implementing the emotion. (Imagine trying to explain why Saturday Night Live is funny, to an alien species with no sense of humor. But don’t feel superior; you yourself have no sense of fnord.) The complexity of anger, and indeed the complexity of intelligence, was glossed over by the humans who hypothesized Thor the thunder-agent.

I think it’s worth noting that Norse tribesfolk already knew about human beings, so whatever model of the universe they made had to include angry agents in it somewhere.

• I agree. I feel like the post is poking a bit of fun at hokey religion, and in so doing falls into an error. The Norse would do quite badly in life if they switched to a prior based on description lengths in Turing machines rather than a description length in their own language, because their language embodies useful bias concerning their environment. Similarly, English description lengths contain useful bias for our environment. The formalism of Solomonoff induction does not tell us which universal language to use, and English is a fine choice. The “thunder god” theory is not bad because of Occam’s razor, but because it doesn’t hold up when we investigate empirically! Similarly, if the Norse believed that earthquakes were caused by giant animals moving under the earth, it would not be such a bad theory given what evidence they had (even though animals are complex from a Turing-machine perspective); animals caused many things in their environment. We just know it is wrong today, based on what we know now.

• What you are talking about in terms of Solomonoff induction is usually called algorithmic information theory, and the shortest-program-to-produce-a-bit-string is usually called Kolmogorov-Chaitin information. I am sure you know this. Which raises the question, why didn’t you mention this? I agree, it is the neatest way to think about Occam’s razor. I am not sure why some are raising PAC theory and VC-dimension. I don’t quite see how they illuminate Occam. Minimalist inductive learning is hardly the simplest “explanation” in the Occam sense, and is actually closer to Shannon entropy in spirit, in being more of a raw measure. Gregory Chaitin’s ‘Meta Math: The Search for Omega’, which I did a review summary of, is a pretty neat look at this stuff.

• Venkat: I think there is a very good reason to mention PAC learning. Namely, Kolmogorov complexity is uncomputable, so Solomonoff induction is not possible even in principle. Thus one must use approximate methods instead such as PAC learning.

• Occam’s razor is not conclusive and it’s not science. It is not unscientific but I would say that it fits into the category of philosophy. In science you do not get two theories, take the facts you know, and then conclude based on the simplest theory. If you’re doing this, you need to do better experiments to determine the facts. Occam’s razor can be a useful heuristic to suggest what experiments should be done. Just like mathematical elegance, Occam’s razor suggests that something is on the right track but it is not decisive. To look back at the facts and then interpret it through Occam’s razor is just an exercise in hindsight bias.

Your analogy with Norse tribesfolk reminds me of the NRA slogan, “Guns don’t kill people, people kill people”. There are many different levels of causation. The gun can be said to be the secondary cause of why someone died. The person pulling the trigger would be the primary cause. The secondary cause of thunder is nature but the first cause that brought things into existence and created the system is God. Nature cannot be its own cause.

The rest of what you wrote sounds like you’re pulling numbers out of your arse. The last sentence should be read in your best Norse tribesfolk accent.

• Science is just a method of filtering hypotheses. Which is exactly what Occam’s razor is. Occam’s razor is not a philosophy, it is a statistical prediction. To claim that Occam’s razor is not a science would be to claim that statistics is not a science.

Example: You leave a bowl with milk in it overnight; you wake up in the morning and it’s gone. Two possible theories are, one, your cat drank it, or two, someone broke into your house and drank it, then left.

Well, we know that cats like milk, and you have a cat, so you know the probability of there being a cat is 1:1, and you also know your cat likes to steal food when you’re sleeping, so based on past experience you might say the probability of the cat stealing the milk is 1:2, so you know there are two high probabilities. But when we consider the burglar hypothesis, we know that it’s extremely rare for someone to break into our house, thus the probability for that situation, while being physically possible, is very low, say 1 in 10,000. We know that burglars tend to break into houses to steal expensive things, not milk from a bowl, thus the probability of that happening is, say, 1 in a million.

This is Occam’s razor at work: it’s 1/1 · 1/2 vs. 1/10,000 · 1/1,000,000. It’s statistics, and it’s science. Nothing I described here would be inaccessible to experimentation and control groups.
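
Spelling out that arithmetic (the comment’s own numbers, taken at face value):

```python
# Multiply each hypothesis's component probabilities, then compare.
cat = (1 / 1) * (1 / 2)                    # cat present x cat steals at night
burglar = (1 / 10_000) * (1 / 1_000_000)   # break-in x burglar drinks the milk

print(cat / burglar)   # the cat hypothesis wins by a factor of 5e9
```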

• I think that the God reference and foul language used in Cure_of_Ars’s comment have misdirected an important criticism of this article, which I for one would like to hear your responses to. So please, those who downvoted and saved the criticism for his comment: I would like to hear your thoughts and have it explained to me; for me, it is not trivial that he has no point in his first paragraph.

But to clarify, I’d restate my open questions on the subject, which were partly described by his comment.

The original formulation of this principle is: “Entities should not be multiplied without necessity.” This formulation is not that clear to me; what I can understand from it is that one shouldn’t add unnecessary complexity to a theory.

A clear example where Occam’s razor may be used as intended is as follows: assume I have a program that takes a single number as an input and returns a number. Now, if we observe the following sequence: f(1) = 2, f(4) = 16 and f(10) = 1024, we might be tempted to say f(x) = 2^x. But this is not the only option; we could have: f(x) = {x > 0 → 2^x, x ≤ 0 → 10239999999} or even f(x) = {1 → 2, 4 → 16, 10 → 1024, [ANY OTHER INPUT TO OUTPUT]}.

Since these examples all make the same predictions in all experimental tests so far, it follows we should choose the simplest one, being 2^x. [And if more experimental tests were to follow, we could have chosen in advance similarly complex alternatives that would have predicted the correct observations for these tests just as well as 2^x. In fact, we can only make a finite number of experimental tests, and as such there are an infinite number of hypotheses that would correctly predict these tests and have an additional, useless layer of complexity added to them.]
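
Writing the three candidate functions out as code makes the point mechanical: every observation so far is matched by all of them, so the data alone cannot separate them and only a simplicity preference can.

```python
# The commenter's three hypotheses, each consistent with f(1)=2, f(4)=16,
# f(10)=1024; the fallback values in the latter two are deliberately absurd.
def f_simple(x):
    return 2 ** x

def f_patched(x):                  # gerrymandered outside the tested range
    return 2 ** x if x > 0 else 10239999999

def f_lookup(x):                   # pure memorization of the observations
    return {1: 2, 4: 16, 10: 1024}.get(x, 0)

for x in (1, 4, 10):               # the three experiments actually performed
    assert f_simple(x) == f_patched(x) == f_lookup(x)
print("all three hypotheses fit the observed data equally well")
```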

What exactly entities mean, or how multiplication of them is defined, I could only guess based on my understanding of these concepts and the popular interpretations of this principle, such as: “Occam’s razor says that when presented with competing hypotheses that make the same predictions, one should select the solution with the fewest assumptions.”

In any case, I sense (after reading multiple sources that emphasize this) that there is an emphasis here that isn’t properly addressed in this article and skipped over in these replies, and it is that Occam’s razor is not meant to be a way of choosing between hypotheses that make different predictions.

In the article, the question of how to weigh simplicity against precision arises; if we have two theories, T1 and T2, which have different precision (say T1 has a 90% success rate where T2 has 82%) and different complexity (T1 being more complex than T2), how can we decide between the two?

From my understanding, and this is where I would like to hear your thoughts, this question cannot be solved by Occam’s razor. That being said, I think this question is even more interesting and important than the one that Occam’s razor attempts at solving. And to answer that question, it appears that Occam’s razor has been generalized, to something like: “The explanation requiring the fewest assumptions is most likely to be correct.” These generalizations are even given a different name (the law of parsimony, or the rule of simplicity) to stress they are not the same as Occam’s razor.

But that is neither the original purpose of the principle, nor is it a proven fact. The following quote stresses this issue: “The principle of simplicity works as a heuristic rule of thumb, but some people quote it as if it were an axiom of physics, which it is not. [...] The law of parsimony is no substitute for insight, logic and the scientific method. It should never be relied upon to make or defend a conclusion. As arbiters of correctness, only logical consistency and empirical evidence are absolute.”

A usage of this principle that does appeal to my logic is to get rid of hypothetical absurdities, esp. if they cannot be tested using the scientific method. This has been done in the field of physics, and this quote illustrates my point:

“In physics we use the razor to shave away metaphysical concepts. [...] The principle has also been used to justify uncertainty in quantum mechanics. Heisenberg deduced his uncertainty principle from the quantum nature of light and the effect of measurement.

Stephen Hawking writes in A Brief History of Time:
We could still imagine that there is a set of laws that determines events completely for some supernatural being, who could observe the present state of the universe without disturbing it. However, such models of the universe are not of much interest to us mortals. It seems better to employ the principle known as Occam’s razor and cut out all the features of the theory that cannot be observed.”

My point here is not to disagree with the rule of simplicity (and surely not the original razor) but to stress why it is somewhat philosophical (after all, it was invented in the 14th century, well before the scientific method), or at least, that it isn’t proven that this law is right for all cases; there are strong cases in history that support it, but that is not the same as being proven.

I think that this law is a very good heuristic. Especially when we try to locate our belief in belief-space. But I believe this razor is wielded with less care than it should be—please let me know if and why you disagree.

Additionally, I do not think I have gained a practical tool to evaluate precision vs. simplicity. Solomonoff’s induction seems practically impossible to use in real life, esp. when evaluating theories outside of the laboratories (in our actual life!) I do understand it’s a very hard problem, but Rationality’s purpose is all about using our brains, with all their weaknesses and biases, to the best of our abilities, in order to have the maximum chance of reaching Truth. This implies practical tools, however imperfect they may be (hopefully, as little imperfect as possible), to deal with these kinds of problems in our private lives. I do not think that Solomonoff’s induction is such a tool, and I do think we could use some heuristic to help us in this task.

To dudeicus: one cannot argue for a theory by giving an example of it and then conclude by saying “if it were tested with proper research, it would be proven.” This is not the scientific method at work. What I do take from your comment is only that this has not been formally proven—thus relating to the philosophy discussion again.

• In science you do not get two theories

You’re right—there are an infinite number of theories consistent with any set of observations. Any set. All observed facts are technically consistent with the prediction that gravity will reverse in one hour, but nobody believes that because of… Occam’s Razor!

• I don’t think this is what’s actually going on in the brains of most humans.

Suppose there were ten random people who each told you that gravity would be suddenly reversing soon, but each one predicted a different month. For simplicity, person 1 predicts the gravity reversal will come in 1 month, person 2 predicts it will come in 2 months, etc.

Now you wait a month, and there’s no gravity reversal, so clearly person 1 is wrong. You wait another month, and clearly person 2 is wrong. Then person 3 is proved wrong, as is person 4 and then 5 and then 6 and 7 and 8 and 9. And so when you approach the 10-month mark, you probably aren’t going to be expecting a gravity-reversal.

Now, do you not suspect the gravity-reversal at month ten simply because it’s not as simple as saying “there will never be a gravity reversal,” or is your dismissal substantially motivated by the fact that the claim type-matches nine other claims that have already been disproven? I think that in practice most people end up adopting the latter approach.

• The rest of what you wrote sounds like you’re pulling numbers out of your arse.

Cure of Ars, I should prefer it if you no longer commented on my posts. There may be a place on Overcoming Bias for Catholics; but none for those who despise math they don’t understand.

• MIT Press has just published Peter Grünwald’s The Minimum Description Length Principle. His Preface, Chapter 1, and Chapter 17 are available at that link. Chapter 17 is a comparison of different conceptions of induction.

I don’t know this area well enough to judge Peter’s work, but it is certainly informative. Many of his points echo Eliezer’s. If you find this topic interesting, Peter’s book is definitely worth checking out.

• “Different inductive formalisms are penalized by a worst-case constant factor relative to each other”

You mean a constant term; it’s additive, not multiplicative.

• That depends on whether you’re thinking of the length or the probability. Since the length is the log-probability, it works out.

• Occam’s razor actually suggests that entities are not to be multiplied without necessity.

Unfortunately, most people happily bastardize Occam’s Razor, abusing it to suggest the simpler explanation is usually the better one.

First off, define simple. Simple how? Can you objectively define simplicity? (It’s not easily done.) Second, explanations must fit the facts. Third, this is a heuristic-based argument, not a logical proof of something. (This same argument was used against Boltzmann and his idea of the atom, but Boltzmann was right.) Fourth, what does “usually” mean anyway? Define that objectively. Black swan events are seemingly impossible, yet they happen much more regularly than people imagine (because they are based on power laws/fractal statistics, not the Bell Curve reality we often think in terms of, where the past gives us some sense of what to expect).

Consequently, I don’t consider an offhand mention of Occam’s razor as a compelling argument. I would stop shaking your head and reconsider what it is you think you know.

• Several of these points are explicitly addressed in the article.

• shanerg is right, Occam’s razor is not “The simplest answer is usually the right one.” It is, “do not suggest entities for which there is no need”.

That is a common misrepresentation of Occam’s razor, and it is extremely vague; I think it shouldn’t be used, as it has too many hidden assumptions. Now, I do agree with everything that was written in the article, but everything in the article was the underlying explanation for why Occam’s razor is true, which, simply put, has to do with statistics. I was disappointed, though, that this article that was about Occam’s razor didn’t actually have Occam’s razor in it.

• I’m sure we could have a fruitful discussion about the proper form of Occam’s Razor; generally speaking it is taken slightly differently than the precise wording attributed to William of Occam.

However, shanerg’s post includes several questions answered explicitly and prominently in the post to which ey is responding. Based on this, I expected that a lengthy philosophical response would be wasted.

• There is: It’s enormously easier (as it turns out) to write a computer program that simulates Maxwell’s Equations, compared to a computer program that simulates an intelligent emotional mind like Thor.

• Coming back to this post, I finally noticed that “emotional” is a necessary word in the quoted sentence. If we leave it out, the sentence might just become false! That is, if you believe there’s some sort of simple mathematical “key” to intelligence (my impression is that you do believe that), then you also ought to believe that the Solomonoff prior makes an intelligent god quite probable a priori. Maybe even more probable than the currently most elegant known formulations of physical laws, which include a whole zoo of elementary particles etc. Of course, if we take into account the evidence we’ve seen so far, it looks like our universe is based on physics rather than a “god”.

• What sort of specification for Thor are you thinking of that could possibly be simpler than Maxwell’s equations? A description of macroscopic electrical phenomena is more complex, as is “a being that wants to simulate Maxwell’s equations.”

If you’re thinking of comparing all “god-like” hypotheses to Maxwell’s equations, sure. But that comparison is a bit false—you should really be comparing all “god-like” hypotheses to all “natural law-like” hypotheses, in which case I confidently predict that the “natural law-like” hypotheses will win handily.

• Yeah, I agree. The shortest god-programs are probably longer than the shortest physics-programs, just not “enormously” longer.

• Probably enormously longer if you want it to produce a god that would cause the world to act in a way as if basic EM held.

I.e., you don’t just need a mind, you need to specify the sort of mind that would want to cause the world to be in a specific way...

• Is there a nice way to quantify how fast these theoretical priors drop off with the length of something? By how much should I favor a simple explanation X over an only moderately more complicated explanation Y?

• Interesting question. If you have a countable infinity of mutually exclusive explanations (e.g. they are all finite strings using letters from some finite alphabet), then your only constraint is that the infinite sum of all their prior probabilities must converge to 1. Otherwise you’re free to choose. You could make the convergence really fast (say, by making the prior of a hypothesis inversely proportional to the exponent of the exponent of its length), or slower if you wish to. A very natural and popular choice is restricting the hypotheses to form a “prefix-free set” (no hypothesis can begin with another shorter hypothesis) and then assigning every hypothesis of N bits a prior of 2^-N, which makes the sum converge by Kraft’s inequality.
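
Under that 2^-N prior, the direct answer to the question above is a one-liner; a minimal sketch (the specific lengths are made up for illustration):

```python
# Under P(h) = 2^-len(h), each extra bit of length halves the prior,
# so hypothesis X is favored over Y by 2^(len(Y) - len(X)) before any
# data arrives -- and Y must fit the data that much better to catch up.
def prior_odds(len_x_bits: int, len_y_bits: int) -> float:
    """Prior odds for X over Y under the 2^-N prefix-free prior."""
    return 2.0 ** (len_y_bits - len_x_bits)

print(prior_odds(10, 13))   # Y is 3 bits longer: 8-to-1 odds favoring X
```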

• What is the reasoning behind using a prefix-free set?

• Apart from giving a simple formula for the prior, it comes in handy in other theoretical constructions. For example, if you have a “universal Turing machine” (a computer that can execute arbitrary programs) and feed it an infinite input stream of bits, perhaps coming from a random source because you intend to “execute a random program”… then it needs to know where the program ends. You could introduce an end-of-program marker, but a more general solution is to make valid programs form a prefix-free set, so that when the machine has finished reading a valid program, it knows that reading more bits won’t result in a longer but still valid program. (Note that adding an end-of-program marker is one of the ways to make your set of programs prefix-free!)

Overall this is a nice example of an idea that “just smells good” to a mathematician’s intuition.

• Ah! I must have had a brain-stnank—this makes total sense in math / theoretical CS terms, I was substituting an incorrect interpretation of “hypothesis” when reading the comment out of context. Thanks :)

• And, in particular, we’re looking at god-programs that produce the output we’ve observed, which seems to cut out a lot of them (and specifically a lot of simple ones).

• Occam’s Razor is “entities must not be multiplied beyond necessity” (entia non sunt multiplicanda praeter necessitatem)

NOT “The simplest explanation that fits the facts.”

Now that’s just pure definition. I think both are true. I think there are problems with both. The problem with Occam’s razor is that, yes, it’s true; however, it doesn’t cover all the bases. There is a deeper underlying principle that makes Occam’s razor true, which is the one you described in the article. However, summing up your article as “The simplest explanation that fits the facts” is also misleading, in that, while it does seem to cover all the bases, it only does so if you use a very specific definition of simple which really doesn’t fit with everyday language.

Example: Stonehenge. Let me suggest two theories: 1. it was built by ancient humans, 2. it fell together through purely random geological processes. Both theories fit with the facts; we know that both are physically possible (yes, 2. is vastly less probable, I’ll get to that in a second). Occam’s razor suggests 2. as the answer, and “The simplest explanation” appears to be 2. also. Both seem to be failing. The real underlying principle as to why Occam’s razor is true is statistics, not simplicity. Now don’t get me wrong, I understand why “The simplest explanation that fits the facts” actually points to 1., but then you have to go through this long process of what you actually mean by simplest, which basically just ends up being a long explanation of how “simple” actually means “probable”.

Anyways, I’m just arguing over semantics; I do in fact agree with everything you said. I just wish there was no Occam’s razor, it should just be “The theory which is the most statistically probable is usually the right one.” This is what people actually mean to say when they say “The simplest explanation that fits the facts.”

• Occam’s Razor is “entities must not be multiplied beyond necessity” (entia non sunt multiplicanda praeter necessitatem)

NOT “The simplest explanation that fits the facts.”

The form you list it in is the historical form of Occam’s Razor, but it isn’t the form that the Razor has been applied in for a fairly long time. Among other problems, defining what one means by distinct entities is problematic. And we really do want to prefer simpler explanations to more complicated ones. Indeed, the most general form of the razor doesn’t even need to have an explanatory element (I in general prefer a low-degree polynomial to interpolate some data to a high-degree polynomial, even if I have no explanation attached to why I should expect the actual phenomenon to fit a linear or quadratic polynomial).

• I may be missing something here -- Occam’s Razor is “entities must not be multiplied beyond necessity” (entia non sunt multiplicanda praeter necessitatem)

NOT “The simplest explanation that fits the facts.”


-- but isn’t the post using the first definition anyway? So even if he explicitly wrote the second definition instead of the first, he was clearly aware of the first since that’s what corresponds with his argument.

• In statistics, generally the model that has the fewest variables and is the most statistically probable is the one used. See things like AIC or the Bayesian Information Criterion on how to choose a good model. This means that Occam’s razor is accurate. Given that it is possible to continuously add variables to a model and get a perfect fit, but have the model be blown apart by the addition of an additional observation that is not otherwise influential, then, unless you are defining probability to include an information criterion, your formulation is less useful.
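
As a hedged sketch of that idea on synthetic data (NumPy assumed; the true model is linear, and the high-degree fit chases noise): AIC and BIC both charge a per-parameter penalty, so the extra variables must earn their keep in fit.

```python
# Compare a 2-parameter and a 7-parameter polynomial by AIC and BIC.
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 1.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.2, n)  # linear truth plus noise

for degree in (1, 6):
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    k = degree + 1                            # number of fitted parameters
    # Gaussian log-likelihood at the MLE noise variance sigma^2 = RSS / n:
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    print(f"degree {degree}: RSS={rss:.3f}  "
          f"AIC={2 * k - 2 * loglik:.1f}  BIC={k * np.log(n) - 2 * loglik:.1f}")
# The degree-6 fit has lower RSS, but the penalties typically hand the
# win (lower AIC/BIC) to the honest low-degree model.
```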

• Witchcraft may fit our observations in the sense of qualitatively permitting them; but this is because witchcraft permits everything

I think replacing witchcraft with godhood is also a common mistake.

• What I don’t understand is so much insistence that Occam’s Razor applies only to explanations you address to God. Or else, how do you avoid the observation that the simplicity of an explanation is a function of whom you are explaining it to? In the post, you actually touch on the issue, only to observe that there are difficulties interpreting Occam’s Razor in the frame of explaining things to humans (in their own natural language), so let’s transpose to a situation where humans are completely removed from the picture. Curiously enough, where the same issue occurs in the context of machine languages it is quickly “solved”. Makes one wonder what Occam—who had no access to Turing machines—himself had in mind.

Also, if you deal in practice with shortening the code length of actual programs, at some point you have exploited all the low-lying fruit; further progress can come after a moment of contemplation makes you observe that distinct paths of control through the code have “something in common” that you may try to enhance to the point where you can factor it out. This “enhancing” follows from the quest for minimal “complexity”, but it drives you to do locally, on the code, just the contrary of what you did during the “low-lying fruit” phase: you “complexify” rather than “simplify” two distinct areas of the code to make them resemble each other (and the target result emerges during the process, fun). What I mean to say, I guess, is that even the frame proposed by Chaitin-Kolmogorov complexity gives only fake reasons to neglect bias (from shared background or the equivalent).

• “each program is further weighted by its fit to all data observed so far. This gives you a weighted mixture of experts that can predict future bits.”

I don’t see it explained anywhere what algorithm is used to weight the experts for this measure. Does it matter? And how are the “fit” probabilities and “complexity” probabilities combined? Multiply and normalize?

• What I find fascinating is that Solomonoff Induction (and the related concepts from Kolmogorov complexity) very elegantly solves the classical philosophical problem of induction, as well as resolving a lot of other problems:

1. What is the correct “prior” in Bayesian inference, and isn’t the choice of prior all subjective?

2. What does Occam’s razor really mean, and what is a “simple” theory?

3. Why do physicists insist that their theories are “simple” when only they can understand them?

Despite this, it is almost unheard of in the general philosophical (analytic philosophy) community. I’ve read literally dozens of top-grade philosophers discussing these topics, with the implication that these are still big unsolved problems, and in complete ignorance that there is a very rich mathematical theory in this area. And the theory’s not exactly new either… dates back to the 1960s.

Anyone got an explanation for the disconnect?

• Anyone got an explanation for the disconnect?

Philosophers don’t read those things. If that explanation seems lacking, I feel like referring to Feynman.

• Possibly because Solomonoff induction isn’t very suitable for answering the kinds of questions philosophers want answered: questions of fundamental ontology. It can tell you what programme would generate observed data, but it doesn’t tell you what the programme is running on: the laws of physics, God’s mind, or a giant simulation. OTOH, traditional Occam’s razor can exclude a range of ontological hypotheses.

There is also the problem that there is no absolute measure of the complexity of a programme: a programming language is still a language, and some languages can express some things more concisely than others, as explained in kokotajlod’s other comment. http://lesswrong.com/lw/jhm/understanding_and_justifying_solomonoff_induction/ady8

• I don’t think Solomonoff Induction solves any of those three things. I really hope it does, and I can see how it kinda goes half of the way there to solving them, but I just don’t see it going all the way yet. (Mostly I’m concerned with #1. The other two I’m less sure about, but they are also less important.)

I don’t know why the philosophical community seems to be ignoring Solomonoff Induction etc. though. It does seem relevant. Maybe the philosophers are just more cynical than we are about Solomonoff Induction’s chances of eventually being able to solve 1, 2, and 3.

• If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should cost at least a “factor of two gain” in the fit. If you try to design a computer program that explicitly stores an outcome like “HTTHHT”, the six bits that you lose in complexity must destroy all plausibility gained by a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.

I found this paragraph confusing. How about

If you let a program store one more binary bit of information, it will be able to cut down a space of possibilities by half, and hence assign twice as much probability to all the points in the remaining space. This suggests that one bit of program complexity should always buy at least a “factor of two gain” in the fit. If you try to design a computer program that explicitly stores an outcome like “HTTHHT”, the six bits that you pay in complexity must get you at least a 64-fold improvement in fit. Otherwise, you will sooner or later decide that all fair coins are fixed.

Does that mean the same thing?

• Upcoming formal philosophy conference on the foundations of Occam’s razor here. Abstracts included.

• I’ll be there!

• I don’t think it’s quite necessary for people to even be consciously aware of Occam’s Razor. The right predictions will eventually win out because there will exist an economic profit somewhere which will be exploited. If you can think of an area which is overrun with market inefficiencies due to something related to this post, please let me know and I will be sure to grab whatever I can of the economic profits while they last.

• OK, I am coming in way late, but I can tell you that all of you are wrong. Occam’s razor theory is based on observations of human behavior over a long period of time. Some humans want to attribute mystical or supernatural significance to any event out of the ordinary that occurs in their lives. Advanced brains like yourselves seek to apply equations and theorems that reduce life events to an equation. Sometimes shit happens that just can’t or won’t fit into anyone’s strongly held beliefs or theories. Live life, stop trying to use math or what the fuck ever to explain it!

• Hello,

I need some help understanding the article after “Unless your program is being smart, and compressing the data, it should do no good just to move one bit from the data into the program description.”

How is the connection being made from complexity and fit to data and program description?

• Complexity, as defined in Solomonoff Induction, means program description—that is, code length in bits.

Sidenote: thank you for reminding me that Eliezer was talking about better versions of SI in 2007, before starting his quantum mechanics sequence.

• I found a reference to a very nice overview of the mathematical motivations of Occam’s Razor on Wikipedia.

It’s Chapter 28: Model Comparison and Occam’s Razor; from (page 355 of) Information Theory, Inference, and Learning Algorithms (legally free to read pdf) by David J. C. MacKay.

The Solomonoff Induction stuff went over my head, but this overview’s talk of trade-offs between communicating increasing numbers of model parameters vs. having to communicate fewer residuals (i.e. offsets from real data) was very informative.

• My own way of thinking of Occam’s Razor is through model selection. Suppose you have two competing statements, $H_1$ (the witch did it) and $H_2$ (it was chance, or possibly something other than a witch, that caused it), and some observations $D$ (the sequence came up 0101010101). Then the preferred statement is whichever is more probable, calculated as

$$P(H_i \mid D, I) = \frac{P(D \mid H_i, I)\,P(H_i \mid I)}{P(D \mid I)};$$

this is simply Bayes’ rule, where

$$P(D \mid H_i, I) = \int P(D \mid \theta, H_i, I)\,P(\theta \mid H_i, I)\,d\theta$$

and the model is parametrized by some parameters $\theta$.

Now all this is just the mathematical way of writing that a hypothesis that has more parameters (or more specifically, more possible values that it predicts) will not be as strong as a statement that predicts a smaller space of outcomes.

In the witch example this would be:

• $H_1$: There exists an advanced intelligent being (at least not much less than human in intelligence) that can do things beyond what has ever been reproduced in a scientific way, that for some reason chooses to live on our street and act mostly as a human, and that will choose to influence my sequence of coin tosses to end up in some seemingly meaningful pattern.
• $H_2$: The coin toss is ruled by chance and might end up in the set of possible outcomes that seem to form a pattern.
• $D$: The coin toss ended up as 0101010101.
• $I$: The way I stated the hypotheses.

Now what remains is to estimate the priors and the fraction of outcomes that look like a pattern. We can skip $P(D \mid I)$ as we are interested in the ratio $P(H_1 \mid D, I) / P(H_2 \mid D, I)$.

Now, comparing the amount of conditionals in the hypotheses and how surprised I am by them, I would roughly estimate a ratio of the priors as something like $2^{100}$ in favor of chance, as the witch hypothesis goes against many of my formed beliefs about the world collected over many years; it includes weird choices of living for this hypothetical alien entity, it picks out me as a possible agent of many in the neighborhood, and it singles out an arbitrary action of mine and an arbitrary set of outcomes.

For the sake of completeness: the fraction of outcomes that look like a pattern is kind of hard to estimate exactly. However, my way of thinking about it is how soon in the sequence I would postulate the specific sequence that it ended up in. After 0101, I think that the sequence 0101010101 is the most obvious pattern to continue it in. So roughly this is six bits of evidence.

In conclusion, I would say that the probability of the witch hypothesis is lacking around 94 bits of evidence for me to believe it as much as the chance hypothesis.
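
The bookkeeping in that conclusion, spelled out (the comment’s own estimates taken as given; the prior ratio is its rough guess, not a derived quantity):

```python
# ~100 bits of prior odds against the witch, minus ~6 bits of evidence
# from the pattern, leaves a ~94-bit deficit.
prior_bits_against_witch = 100   # rough prior estimate from the comment
evidence_bits_for_witch = 6      # the pattern pinned down after "0101"

deficit_bits = prior_bits_against_witch - evidence_bits_for_witch
print(deficit_bits)                   # 94
print(f"{2.0 ** deficit_bits:.2e}")   # about 2e28 to 1 against the witch
```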

The downside of this approach compared to Solomonoff induction and minimum message length is that it is clunkier to use, and it might be easy to forget to include conditionals or complexity in the priors, the same way they can be lost in the English language. The upside is that as a model it is simpler, less ad hoc, and builds directly on the product rule of probability and the fact that probabilities sum to one, and should thus be preferred by Occam’s Razor ;).