# Friedman’s “Prediction vs. Explanation”

We do ten experiments. A scientist observes the results, constructs a theory consistent with them, and uses it to predict the results of the next ten. We do them and the results fit his predictions. A second scientist now constructs a theory consistent with the results of all twenty experiments.

The two theories give different predictions for the next experiment. Which do we believe? Why?

One of the commenters links to Overcoming Bias, but as of 11PM on Sep 28th, David’s blog’s time, no one has given the exact answer that I would have given. It’s interesting that a question so basic has received so many answers.

• Hrm, I’d have to say go with whichever is simpler (choose your favorite reasonable method of measuring the complexity of a hypothesis) for the usual reasons: fewer bits to describe it means less stuff that has to be “just so”, etc. Of course, modify this a bit if one of the hypotheses has a significantly different prior than the other due to previously learned info. But yeah, the less complex one that works is more likely to be closer to the underlying dynamic.

If you’re handed the two hypotheses as black boxes, so that you can’t actually see inside them and work out which is more complex, then go with the first one, since it’s more likely to be less complex: at most the first ten data points could have been in some way explicitly hard-coded into it, and it successfully predicted the next ten. The second one could potentially have all twenty data points hard-coded into it, and thus be more complex, and thus effectively less likely to actually have anything resembling the underlying dynamic encoded into it.
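The “fewer bits means a higher prior” idea can be sketched numerically. This is a hypothetical illustration, not anything from the thread: the description lengths below are invented, and weighting each hypothesis by 2^(-bits) is one standard, MDL-flavored choice among many.

```python
# Hypothetical sketch: a complexity-weighted prior over hypotheses.
# Description lengths (in bits) are invented numbers for illustration;
# the 2**-bits weighting is one common MDL-style convention.
def complexity_prior(description_bits):
    """Map {hypothesis_name: bits} to a normalized prior distribution."""
    weights = {name: 2.0 ** -bits for name, bits in description_bits.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

priors = complexity_prior({"theory_1": 10, "theory_2": 20})
# theory_1, being 10 bits shorter, gets 1024 times the weight of theory_2.
print(priors)
```

Under this convention a modest difference in description length translates into a large difference in prior odds, which is why the black-box argument above leans on the first theory.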

• Is it cheating to say that it depends hugely on the content of the theories, and their prior probabilities?

• To decide objectively, not knowing the content of the theories is more effective.

• The theories screen off the theorists, so if we knew the theories then we could (given enough cleverness) decide based on the theories themselves what our belief should be.

But before we even look at the theories, you ask me which theory I expect to be correct. I expect the one which was written earlier to be correct. This is not because it matters which theory came first, irrespective of their content; it is because I have different beliefs about what each of the two theories might look like.

The first theorist had less data to work with, and so had less data available to insert into the theory as parameters. This is evidence that the first theory will be smaller than the second theory. I assign greater prior probabilities to small theories than to large theories, so I think the first theory is more likely to be correct than the second one.

• I rather like the 3rd answer on his blog (Doug D’s). A slight elaboration on that: one virtue of a scientific theory is its generality, and prediction is a better way of determining generality than explanation. Demanding predictive power from a theory excludes ad hoc theories of the sort Doug D mentioned, that do nothing more than re-state the data. This reasoning, note, does not require any math. :-)

• (Noting that the math-ey version of that reason has just been stated by Peter and Psy-kosh.)

• The first guy has demonstrated prediction, the second only hindsight. We assume the first theory is right, but of course, we do the next experiment, and then we’ll know.

• Assuming both of them can produce values (i.e., are formulated in such a way that they can produce a new value from just the past values plus the environment):

The second theory runs the risk of being more descriptive than predictive. It has more potential to be fit to the input data, including all its noise, and to be a (maybe complex) enumeration of its values.

The first one has at least proven it could be used to predict, while the second one can only produce a new value.

I would thus give more credit to the first theory. At least it has won against ten coin flips without omniscience.

• Which do we believe?

What exactly is meant here by ‘believe’? I can imagine various interpretations.

a. Which do we believe to be ‘a true capturing of an underlying reality’?
b. Which do we believe to be ‘useful’?
c. Which do we prefer, which seems more plausible?

a. Neither. Real scientists don’t believe in theories, they just test them. Engineers believe in theories :-)

b. Utility depends on what you’re trying to do. If you’re an economist, then a beautifully complicated post-hoc explanation of 20 experiments may get your next grant more easily than a simple theory that you can’t get published.

c. Who developed the theories? Which theory is simpler? (Ptolemy, Copernicus?) Which theory fits in best with other well-supported pre-existing theories? (Creationism, Evolution vs. theories about disease behaviour.) Did any unusual data appear in the last 10 experiments that ‘fitted’ the original theory but hinted towards an even better theory? What is meant by ‘consistent’ (how well did it fit within error bands, how accurate is it)? Perhaps theory 1 came from Newton, and theory 2 was thought up by Einstein. How similar were the second set of experiments to the original set?

How easy or difficult were the predictions? In other words, how well did they steer us through ‘theory-space’? If theory 1 predicts the sun will come up each day, it’s hardly as powerful as theory 2, which suggests the earth revolves around the sun.

What do we mean when we use the word ‘constructs’? Perhaps the second theorist blinded himself to half of the results, constructed a theory, then tested it, placing himself in the same position as the original theorist but with the advantage of having tested his theory before proclaiming it to the world. Perhaps the constructor repeated this many times using different subsets of the data to build a predictor and test it, and chose the theory which was most consistently suggested by the data and verified by subsequent testing.

Perhaps he found that no matter how he sliced and diced and blinded himself to parts of the data, his hand unerringly fell on the same ‘piece of paper in the box’ (to use the metaphor from the other site).

Another issue is ‘how important is the theory’? For certain important theories (development of cancer, space travel, building new types of nuclear reactors, etc.), neither 10 nor 20 large experiments might be sufficient for society to confer ‘belief’ in an engineering sense.

Other social issues may exist. Galileo ‘believed’ bravely, but perhaps foolishly, depending on how he valued his freedom.

d. Setting aside these other issues, and in the absence of any other information: As a scientist, my attitude would be to believe neither, and test both. As an engineer, my attitude would be to ‘prefer’ the first theory (if forced to ‘believe’ only one), and ask a scientist to check out the other one.

• Both theories fit 20 data points. That some of those are predictions is irrelevant, except for the inferences about theory simplicity that result. Since the likelihoods are the same, the priors are also the posteriors.

My state of belief is then represented by a certain probability that each theory is true. If forced to pick one of the two, I would examine the penalties and payoffs of being correct and wrong, à la Pascal’s wager.

• We do ten experiments. A scientist observes the results, constructs a theory consistent with them

Huh? How did the scientist know what to observe without already having a theory? Theories arise as explanations for problems, explanations which yield predictions. When the first ten experiments were conducted, our scientist would therefore be testing predictions arising from an explanation to a problem. He wouldn’t just be conducting any old set of experiments.

Similarly, the second scientist’s theory would be a different explanation of the problem situation, one yielding a different prediction. Before the decisive test, the theory that emerges as the best explanation under the glare of critical scrutiny would be the preferred explanation. Without knowing the problem situation and the explanations that have been advanced, it cannot be determined which is to be preferred.

• One theory has a track record of prediction, and what is being asked for is a prediction, so at first glance I would choose that one. But the explanation-based one is built on more data.

But it is neither prediction nor explanation that makes things happen in the real world, but causality. So I would look into the two theories and pick the one that looks to have identified a real cause instead of simply identifying a statistical pattern in the data.

• Whichever is simpler, assuming we don’t know anything about the scientists’ abilities or track record.

Having two different scientists seems to pointlessly confound the example with extraneous variables.

• I don’t think the second theory is any less “predictive” than the first. It could have been proposed at the same time as or before the first, but it wasn’t. Why should the predictive ability of a theory vary depending on the point in time at which it was created? David Friedman seems to prefer the first because it demonstrates more ability on the part of the scientist who created it (i.e., he got it after only 10 tries).

Unless we are given more information on the problem, I think I agree with David.

• These theories are evidence about the true distribution of the data, so I construct a new theory based on them. I could then predict the next data point using my new theory, and if I have to play this game, go back and choose one of the original theories that gives the same prediction, based only on the prediction about this particular next data point, independently of whether the selected theory as a whole is deemed better.

Having more data is strictly better. But I could expect that there is a good chance that a particular scientist will make an error (worse than me now, judging his result, since he himself could think about all of this and, say, construct a theory from the first 11 data points and verify the absence of this systematic error using the rest, or use a reliable methodology). Success of the first theory gives evidence for it, which, depending on my priors, can significantly outweigh the expected improvement from more data points coming through an imperfect procedure of converting data into a theory.

• Here’s my answer, prior to reading any of the comments here, or on Friedman’s blog, or Friedman’s own commentary immediately following his statement of the puzzle. So, it may have already been given and/or shot down.

We should believe the first theory. My argument is this. I’ll call the first theory T1 and the second theory T2. I’ll also assume that both theories made their predictions with certainty. That is, T1 and T2 gave 100% probability to all the predictions that the story attributed to them.

First, it should be noted that the two theories should have given the same prediction for the next experiment (experiment 21). This is because T1 should have been the best theory that (would have) predicted the first batch. And since T1 also correctly predicted the second batch, it should have been the best theory that would do that, too. (Here, “best” is according to whatever objective metric evaluates theories with respect to a given body of evidence.)

But we are told that T2 makes exactly the same predictions for the first two batches. So it also should have been the best such theory. It should be noted that T2 has no more information with which to improve itself. T1, for all intents and purposes, also knew the outcomes of the second batch of experiments, since it predicted them with 100% certainty. Therefore, the theories should have been the best possible given the first two batches. In particular, they should have been equally good.

But if “being the best, given the first two batches” doesn’t determine a prediction for experiment 21, then neither of these “best” theories should be predicting the outcome of experiment 21 with certainty. Therefore, since it is given that they are making such predictions, they should be making the same one.

It follows that at least one of the theories is not the best, given the evidence that it had. That is, at least one of them was constructed using flawed methods. T2 is more likely to be flawed than T1, because T2 only had to post-dict the second batch. This is trivial to formalize using Bayes’s theorem. Roughly speaking, it would have been harder for T1 to have been constructed in a flawed way and still have gotten its predictions for the second batch right.

Therefore, T1 is more likely to be right than T2 about the outcome of experiment 21.

• (And, of course, the first theory could be improved using the next 10 data points by Bayes’ rule, which will give a candidate for being the second theory. This new theory can even disagree with the first on which value of a particular data point is most likely.)

• Knowing how the theories and experiments were chosen would make this a more sensible problem. Having that information would affect our expectations about the theories. As others have noted, there are a lot of theories one could form in an ad hoc manner, but the question is which of them was selected.

The first theory was selected using the first ten experiments, and it seems to have survived the second set of experiments. If those experiments were independent of the first set and of each other, this is quite unlikely, so this is strong evidence that the first theory is the connection between the experiments.

Given a reasonable way of choosing theories I would rate both theories as likely, but given finite resources and fallible theorists I would prefer the first theory, as we have evidence that it was chosen sensibly and that the problem is explainable with a theory of its calibre, but only to the extent that I doubt the rationality of the theorist making the second theory.

• Gah, others got there first.

• I would go with the first one in general. The first one has proved itself on some test data, while all the second one has done is fit a model to given data. There is always the risk that the second theory has overfitted a model with no worthwhile generalization accuracy. Even if the second theory is simpler than the first, the fact that the first theory has been proved right on unseen data makes it a slam-dunk winner. Of course, further experiments may cause us to update our beliefs, particularly if theory 2 is proving just as accurate.
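The worst case of “fitting a model to given data” can be made concrete with a toy sketch. Everything here is an invented stand-in (the linear hidden dynamic, the function names): a theory that merely restates the twenty observations fits them perfectly yet says nothing useful about experiment 21.

```python
# Toy illustration (assumed setup): the hidden dynamic is y = 3*x.
def hidden_dynamic(x):
    return 3 * x

observations = [(x, hidden_dynamic(x)) for x in range(20)]  # experiments 1-20

# Theory 1: a general rule (here it happens to match the hidden dynamic).
def theory_1(x):
    return 3 * x

# Theory 2: a lookup table that merely restates all twenty results,
# with an arbitrary fallback for anything unseen.
lookup = dict(observations)
def theory_2(x):
    return lookup.get(x, 0)

# Both theories are perfectly consistent with experiments 1-20...
assert all(theory_1(x) == y and theory_2(x) == y for x, y in observations)

# ...but only the general rule generalizes to experiment 21 (x = 20).
print(theory_1(20), theory_2(20), hidden_dynamic(20))
```

Both theories are indistinguishable on the evidence so far; the lookup table is the extreme of a theory that “does nothing more than re-state the data”, as discussed above.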

• There are an infinite number of models that can predict 10 variables, or 20 for that matter. The only plausible way for scientist A to pick a model out of the infinite possible ones is to bring prior knowledge to the table about the nature of that model and the data. This is also true for the second scientist, but only slightly less so.

Therefore, scientist A has demonstrated a higher probability of having valuable prior knowledge.

I don’t think there is much more to this than that. If the two scientists have equal knowledge, there is no reason the second model need be more complicated than the first, since the first fully described the extra revealed data in the second.

If it was the same scientist with both sets of data, then you would pick the second model.

• Tyrrell’s argument seems to me to hit the nail on the head. (Although I would have liked to see that formalization; it seems to me that while T1 will be preferred, the preference may be extremely slight, depending. No, I’m too lazy to do it myself :-))

• Formalizing Vijay’s answer here:

The short answer is that you should put more of your probability mass on T1’s prediction because experts vary, and an expert’s past performance is at least somewhat predictive of his future performance.

We need to assume that all else is symmetrical: you had equal priors over the results of the next experiment before you heard the scientists’ theories; the scientists were of equal apparent caliber; P( the first twenty experimental results | T1 ) = P( the first twenty experimental results | T2 ); neither theorist influenced the process by which the next experiment was chosen; etc.

Suppose we have a bag of experts, each of which contains a function for generating theories from data. We draw a first expert from our bag at random and show him data points 1-10; expert 1 generates theory T1. We draw a second expert from our bag at random and show him data points 1-20; expert 2 generates theory T2.

Given the manner in which real human experts vary (some know more than others about a given domain; some aim for accuracy where others aim to support their own political factions; etc.), it is reasonable to suppose that some experts have priors that are well aligned with the problem at hand (or behave as if they do) while others have priors that are poorly aligned. Expert 1 distinguished himself by accurately predicting the results of experiments 11-20 from the results of experiments 1-10; many predictive processes would not have done so well. Expert 2 has only shown an ability to find some theory that is consistent with the results of experiments 1-20; many predictive processes put a non-zero prior on some such theory that would not have given the results of experiments 11-20 “most expected” status based only on the results from experiments 1-10. We should therefore expect better future performance from Expert 1, all else equal.

The problem at hand is complicated slightly in that we are judging, not experts, but theories, and the two experts generated their theories at different times from different amounts of information. If Expert 1 would have assigned a probability < 1 to results 11-20 (despite producing a theory that predicted those results), Expert 2 is working from more information than Expert 1, which gives Expert 2 at least a slight advantage. Still, given the details of human variability and the fact that Expert 1 did predict results 11-20, I would expect the former consideration to outweigh the latter.
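The bag-of-experts argument above can be checked with a quick Monte Carlo sketch. All numbers here are assumptions chosen only for illustration: a population that is half “well-aligned” experts and half “poorly aligned” ones, with made-up per-prediction accuracies. Conditioning on Expert 1 having called experiments 11-20 correctly shifts the posterior sharply toward Expert 1 being one of the good ones.

```python
import random

random.seed(0)

# Assumed expert population: half "well-aligned" (each prediction correct
# with probability 0.9), half "poorly aligned" (correct with probability 0.5).
GOOD, BAD = 0.9, 0.5

def draw_expert():
    return GOOD if random.random() < 0.5 else BAD

trials = 200_000
survived = correct_on_21 = 0
for _ in range(trials):
    p = draw_expert()
    # Keep only histories where the expert called experiments 11-20 correctly,
    # as Expert 1 did.
    if all(random.random() < p for _ in range(10)):
        survived += 1
        correct_on_21 += random.random() < p  # his call on experiment 21

conditional = correct_on_21 / survived
baseline = 0.5 * GOOD + 0.5 * BAD  # a freshly drawn expert, like Expert 2
print(round(conditional, 3), round(baseline, 3))
```

Under these assumptions the conditional accuracy on experiment 21 lands well above the unconditioned baseline, which is the “past performance predicts future performance” point in numerical form.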

• Scientist 2’s theory is more susceptible to over-fitting of the data; we have no reason to believe it’s particularly generalizable. His theory could, in essence, simply be restating the known results and then giving a more or less random prediction for the next one. Let’s make it 100,000 trials rather than 20 (and say that Scientist 1 has based his yet-to-be-falsified theory off the first 50,000 trials), and stipulate that Scientist 2 is a neural network; then the answer seems clear.

• I wrote in my last comment that “T2 is more likely to be flawed than T1, because T2 only had to post-dict the second batch. This is trivial to formalize using Bayes’s theorem. Roughly speaking, it would have been harder for T1 to have been constructed in a flawed way and still have gotten its predictions for the second batch right.”

Benja Fallenstein asked for a formalization of this claim. So here goes :).

Define a method to be a map that takes in a batch of evidence and returns a theory. We have two assumptions:

ASSUMPTION 1: The theory produced by giving an input batch to a method will at least predict that input. That is, no matter how flawed a method of theory-construction is, it won’t contradict the evidence fed into it. More precisely,

p( M(B) predicts B ) = 1.

(A real account of hypothesis testing would need to be much more careful about what constitutes a “contradiction”. For example, it would need to deal with the fact that inputs aren’t absolutely reliable in the real world. But I think we can ignore these complications in this problem.)

ASSUMPTION 2: If a method M is known to be flawed, then its theories are less likely to make correct predictions of future observations. More precisely, if B2 is not contained in B1, then

p( M(B1) predicts B2 | M flawed ) < p( M(B1) predicts B2 ).

(Outside of toy problems like this one, we would need to stipulate that B2 is not a logical consequence of B1, and so forth.)

Now, let B1 and B2 be two disjoint and nonempty sets of input data. In the problem, B1 is the set of results of the first ten experiments, and B2 is the set of results of the next ten experiments.

My claim amounted to the following. Let

P1 := p( M is flawed | M(B1) predicts B2 ),

P2 := p( M is flawed | M(B1 union B2) predicts B2 ).

Then P1 < P2.

To prove this, note that, by Bayes’s theorem, the second quantity P2 is given by

P2 = p( M(B1 union B2) predicts B2 | M is flawed ) * p( M is flawed ) / p( M(B1 union B2) predicts B2 ).

Since p(X) = 1 implies p(X|Y) = 1 whenever p(Y) > 0, Assumption 1 tells us that this reduces to

P2 = p( M is flawed ).

On the other hand, the first quantity P1 is

P1 = p( M(B1) predicts B2 | M is flawed ) * p( M is flawed ) / p( M(B1) predicts B2 ).

By Assumption 2, this becomes

P1 < p( M is flawed ).

Hence, P1 < P2, as claimed.
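As a sanity check on this derivation, here is a small simulation under invented numbers: the prior on flawedness and the two hit rates are assumptions chosen only to satisfy the two Assumptions above. Conditioning on a genuine prediction of B2 lowers the probability of flawedness, while post-diction leaves it at its prior.

```python
import random

random.seed(0)

# Assumed parameters, consistent with the two Assumptions above:
P_FLAWED = 0.5        # prior p( M is flawed )
HIT_SOUND = 0.9       # p( M(B1) predicts B2 | M sound )
HIT_FLAWED = 0.2      # p( M(B1) predicts B2 | M flawed )  (Assumption 2: lower)

trials = 200_000
n1 = f1 = n2 = f2 = 0
for _ in range(trials):
    flawed = random.random() < P_FLAWED
    predicts_b2 = random.random() < (HIT_FLAWED if flawed else HIT_SOUND)

    # Case 1: condition on "M(B1) predicts B2" (T1's situation).
    if predicts_b2:
        n1 += 1
        f1 += flawed

    # Case 2: "M(B1 union B2) predicts B2" holds with probability 1
    # (Assumption 1), so conditioning on it excludes nothing (T2's situation).
    n2 += 1
    f2 += flawed

P1, P2 = f1 / n1, f2 / n2
print(round(P1, 3), round(P2, 3))  # P1 < P2, as claimed
```

With these made-up numbers, P2 stays at the 0.5 prior while P1 drops to roughly 0.1 * 0.5 / 0.55, matching the Bayes computation above.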

• Throughout these replies there is a belief that theory 1 is ‘correct through skill’. With that in mind, it is hard to come to any other conclusion than ‘scientist 1 is better’.

Without knowing more about the experiments, we can’t determine whether theory 1’s 10 good predictions were simply ‘good luck’ or accident.

If your theory is that the next 10 humans you meet will have the same number of arms as they have legs, for example...

There’s also potential for survivorship bias here. If the first scientist’s results had been 5 correct, 5 wrong, we wouldn’t be having this discussion about the quality of their theory-making skills. Without knowing whether we are ‘picking a lottery winner for this comparison’, we can’t tell if those ten results are chance or are meaningful predictions.

• I’d use the only tool we have to sort theories: Occam’s razor.

1. Weed out all the theories that do not match the experiments (keep both in that case).

2. Sort them by how simple they are.
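The two steps above can be sketched as code. Everything here is an invented stand-in: the theories, their predictions, and the bit counts used as a complexity measure.

```python
# Minimal sketch of "filter by consistency, then sort by simplicity".
# Theories are (name, predict_fn, complexity_in_bits); all values are made up.
def occam_select(theories, observations):
    consistent = [t for t in theories
                  if all(t[1](x) == y for x, y in observations)]
    return sorted(consistent, key=lambda t: t[2])  # simplest first

theories = [
    ("theory_1", lambda x: 2 * x, 12),
    ("theory_2", lambda x: 2 * x if x < 20 else 0, 40),  # restates the data
    ("wrong",    lambda x: x + 1, 5),
]
observations = [(x, 2 * x) for x in range(20)]

ranking = occam_select(theories, observations)
print([name for name, _, _ in ranking])
```

Step 1 discards the theory that contradicts an experiment regardless of how simple it is; step 2 then prefers the shorter of the survivors.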

This is what many do by assuming the second is “over-fitted”; I believe a good scientist would search the literature before stating a theory, and so know about the first one; as he would also appreciate elegance, I’d expect him to come up with a simpler theory. But, as you pointed out, some time in an economics lab could easily prove me wrong, although I’m assuming that daunting complexity corresponds to patching a theory against experiments that disproved a previous one, which is not the case we consider here.

In one word: the second (longer references).

The barrel and box analogy hides that simplicity argument, by making all theories a ‘paper’. A stern wag of the finger to anyone who used statistical references, because there aren’t enough data to do that.

• Peter, your point that we have different beliefs about the theories prior to looking at them is helpful. AFAICT theories don’t screen off theorists, though. My belief that the college baseball team will score at least one point in every game (“theory A”), including the next one (“experiment 21”), may reasonably be increased by a local baseball expert telling me so and by evidence about his expertise. This holds even if I independently know something about baseball.

As to the effect of “number of parameters” on the theories’ probabilities, would you bet equally on the two theories if you were told that they contained an identical number of parameters? I wouldn’t, given the asymmetric information contained in the two experts vouching for the theories.

Tim, I agree that if you remove the distinct scientists and have the hypotheses produced instead by a single process (drawn from the same bucket), you should prefer whichever prediction has the highest prior probability. Do you mean that the prior probability is equal to the prediction’s simplicity, or just that simplicity is a good rule of thumb in assigning prior probabilities? If we have some domain knowledge, I don’t see why simplicity should correspond exactly to our priors; even Solomonoff inducers move away from their initial notion of simplicity with increased data. (You’ve studied that math and I haven’t; is there a non-trivial updated-from-data notion of “simplicity” that has identical ordinal structure to an updated Solomonoff inducer’s prior?)

Tyrrell, I like your solution a lot. A disagreement anyhow: as you say, if experts 1 and 2 are good probability theorists, T1 will contain the most likely predictions given the experimental results according to Expert 1, and T2 likewise according to Expert 2. Still, if the experts have different starting knowledge and at least one cannot see the other’s predictions, I don’t see anything that surprising in their “highest probability predictions given the data” calculations disagreeing with one another. This part isn’t in disagreement with you, but it is also relevant that if the space of outcomes is small, or if experiments 1-20 are part of some local regime that experiment 21 is not (e.g. physics at macroscopic scales, or housing prices before the bubble broke), it may not be surprising to see two theories that agree on a large body of data and diverge elsewhere. Theories that agree in one regime and disagree in others seem relatively common.

Alex, Bertil, and others, I may be wrong, but I think we should taboo “overfitting” and “ad hoc” for this problem and substitute mechanistic, probability-theory-based explanations for where phenomena like “overfitting” come from.

• Tyrrell, right, thanks. :) Your formalization makes clear that P1/P2 = p( M(B1) predicts B2 | M flawed ) / p( M(B1) predicts B2 ), which is a stronger result than I thought. Argh, I wish I were able to see this sort of thing immediately.

One small nitpick: It could be more explicit that in Assumption 2, B1 and B2 range over actual observations, whereas in Assumption 1, B ranges over all possible observations. :)

Anna, right, I think we need some sort of “other things being equal” proviso to Tyrrell’s solution. If experiments 11-20 were chosen by scientist 1, experiment 21 is chosen by scientist 2, and experiments 1-10 were chosen by a third party, and scientist 2 knows scientist 1’s theory, for example, we could speculate that scientist 2 has found a strange edge case in 1’s formalization that 1 did not expect. I think I was implicitly taking the question to refer to a case where all 21 experiments are of the same sort and chosen independently; say, lowest temperatures at the magnetic north pole in consecutive years, that sort of thing.

• “One small nitpick: It could be more explicit that in Assumption 2, B1 and B2 range over actual observations, whereas in Assumption 1, B ranges over all possible observations. :)”

Actually, I implicitly was thinking of the “B” variables as ranging over actual observations (past, present, and future) in both assumptions. But you’re right: I definitely should have made that explicit.

• We know that the first researcher is able to successfully predict the results of experiments. We don’t know that about the second researcher. Therefore I would bet on the first researcher’s prediction (but only assuming other things being equal).

Then we’ll do the experiment and know for sure.

• Benja --

I disagree with Tyrrell (see below), but I can give a version of Tyrrell’s “trivial” formalization:

We want to show that:

Averaging over all theories T, P(T makes correct predictions | T passes 10 tests) > P(T makes correct predictions)

By Bayes’ rule,

P(T makes correct predictions | T passes 10 tests) = P(T makes correct predictions) * P(T passes 10 tests | T makes correct predictions) / P(T passes 10 tests)

So our conclusion is equivalent to:

Averaging over all theories T, P(T passes 10 tests | T makes correct predictions) / P(T passes 10 tests) > 1

which is equivalent to

Averaging over all theories T, P(T passes 10 tests | T makes correct predictions) > P(T passes 10 tests)

which has to be true for any plausible definition of “makes correct predictions”. The effect is only small if nearly all theories can pass the 10 tests.

I disagree with Tyrrell’s conclusion. I think his fallacy is to work with the undefined concept of “the best theory”, and to assume that:

• If a theory consistent with past observations makes incorrect predictions, then there was something wrong with the process by which that theory was formed. (Not true; making predictions is inherently an unreliable process.)

• Therefore we can assume that that process produces bad theories with a fixed frequency. (Not meaningful; the observations made so far are a varying input to the process of forming theories.)

In the math above, the fallacy shows up because the set of theories that are consistent with the first 10 observations is different from the set of theories that are consistent with the first 20 observations, so the initial statement isn’t really what we wanted to show. (If that fallacy is a problem with my understanding of Tyrrell’s post, he should have done the “trivial” formalization himself.)

There are lots of ways to apply Bayes’ rule, and this wasn’t the first one I tried, so I also disagree with Tyrrell’s claim that this is trivial.

• Hi, Anna. I definitely agree with you that two equally-good the­o­ries could agree on the re­sults of ex­per­i­ments 1--20 and then dis­agree about the re­sults of ex­per­i­ment 21. But I don’t think that they could both be best-pos­si­ble the­o­ries, at least not if you fix a “good” crite­rion for eval­u­at­ing the­o­ries with re­spect to given data.

What I was think­ing when I claimed that in my origi­nal com­ment was the fol­low­ing:

Sup­pose that T1 says “re­sult 21 will be X” and the­ory T2 says “re­sult 21 will be Y”.

Then I claim that there is an­other the­ory T3, which cor­rectly pre­dicts re­sults 1--20, and which also pre­dicts “re­sult 21 will be Z”, where Z is a less-pre­cise de­scrip­tion that is satis­fied by both X and Y. (E.g., maybe T1 says “the ball will be red”, T2 says “the ball will be blue”, and T3 says “the ball will be visi­ble”.)

So T3 has had the same suc­cess­ful pre­dic­tions as T1 and T2, but it re­quires less in­for­ma­tion to spec­ify (in the Kol­mogorov-com­plex­ity sense), be­cause it makes a less pre­cise pre­dic­tion about re­sult 21.

I think that’s right, any­way. There’s definitely still some hand-wav­ing here. I haven’t proved that a the­ory’s be­ing va­guer about re­sult 21 im­plies that it re­quires less in­for­ma­tion to spec­ify. I think it should be true, but I lack the for­mal in­for­ma­tion the­ory to prove it.

But sup­pose that this can be for­mal­ized. Then there is a the­ory T3 that re­quires less in­for­ma­tion to spec­ify than do T1 and T2, and which has performed as well as T1 and T2 on all ob­ser­va­tions so far. A “good” crite­rion should judge T3 to be a bet­ter the­ory in this case, so T1 and T2 weren’t best-pos­si­ble.

• Among the many ex­cel­lent, and some in­spiring, con­tri­bu­tions to Over­com­ingBias, this sim­ple post, to­gether with its com­ments, is by far the most im­pact­ful for me. It’s scary in al­most the same way as the way the gen­eral pub­lic ap­proaches se­lec­tion of their elected rep­re­sen­ta­tives and lead­ers.

• Tyrrell, um. If “the ball will be visi­ble” is a bet­ter the­ory, then “we will ob­serve some ex­per­i­men­tal re­sult” would be an even bet­ter the­ory?

Solomonoff in­duc­tion, the in­duc­tion method based on Kol­mogorov com­plex­ity, re­quires the the­ory (pro­gram) to out­put the pre­cise ex­per­i­men­tal re­sults of all ex­per­i­ments so far, and in the fu­ture. So your T3 would not be a sin­gle pro­gram; rather, it would be a set of pro­grams, each en­cod­ing speci­fi­cally one ex­per­i­men­tal out­come con­sis­tent with “the ball is visi­ble.” (Which gets rid of the prob­lem that “we will ob­serve some ex­per­i­men­tal re­sult” is the best pos­si­ble the­ory :))

• Here is my an­swer with­out look­ing at the com­ments or in­deed even at the post linked to. I’m work­ing solely from Eliezer’s post.

Both the­o­ries are sup­ported equally well by the re­sults of the ex­per­i­ments, so the ex­per­i­ments have no bear­ing on which the­ory we should pre­fer. (We can see this by switch­ing the­ory A with the­ory B: the ex­per­i­men­tal re­sults will not change.) Ap­ply­ing bayescraft, then, we should pre­fer whichever the­ory was a pri­ori more plau­si­ble. If we could ac­tu­ally look at the con­tents of the the­ory we could make a judge­ment straight from that, but since we can’t we’re forced to in­fer it from the be­hav­ior of sci­en­tist A and sci­en­tist B.

Scien­tist A only needed ten ex­per­i­men­tal pre­dic­tions of the­ory A borne out be­fore he was will­ing to pro­pose the­ory A, whereas sci­en­tist B needed twenty pre­dic­tions of the­ory B borne out be­fore he was will­ing to pro­pose the­ory B. In ab­sence of other in­for­ma­tion (per­haps sci­en­tist B is very shy, or had been sick while the first nine­teen ex­per­i­ments were be­ing performed), this sug­gests that the­ory B is much less a pri­ori plau­si­ble than the­ory A. There­fore, we should put much more weight on the pre­dic­tion of the­ory A than that of the­ory B.

If I’m lucky this post is both right and novel. Here’s hop­ing!

• I’ve seen too many cases of overfit­ting data to trust the sec­ond the­ory. Trust the val­i­dated one more.

The ques­tion would be more in­ter­est­ing if we said that the origi­nal the­ory ac­counted for only some of the new data.

If you know a lot about the space of pos­si­ble the­o­ries and “pos­si­ble” ex­per­i­men­tal out­comes, you could try to com­pute which the­ory to trust, us­ing (sur­prise) Bayes’ law. If it were the case that the first the­ory ap­plied to only 9 of the 10 new cases, you might find pa­ram­e­ters such that you should trust the new the­ory more.

In the given case, I don’t think there is any way to de­duce that you should trust the 2nd the­ory more, un­less you have some a pri­ori mea­sure of a the­ory’s like­li­hood, such as its com­plex­ity.

• Benja, I have never stud­ied Solomonoff in­duc­tion for­mally. God help me, but I’ve only read about it on the In­ter­net. It definitely was what I was think­ing of as a can­di­date for eval­u­at­ing the­o­ries given ev­i­dence. But since I don’t re­ally know it in a rigor­ous way, it might not be suit­able for what I wanted in that hand-wavy part of my ar­gu­ment.

How­ever, I don’t think I made quite so bad a mis­take as highly-rank­ing the “we will ob­serve some ex­per­i­men­tal re­sult” the­ory. At least I didn’t make that mis­take in my own mind ;). What I ac­tu­ally wrote was cer­tainly vague enough to in­vite that in­ter­pre­ta­tion. But what I was think­ing was more along these lines:

[looks up color spec­trum on Wikipe­dia and jug­gles num­bers to make things work out]

The visi­ble wave­lengths are 380 nm -- 750 nm. Within that range, blue is 450 nm -- 495 nm, and red is 620 nm -- 750 nm.

Let f(x) be the dec­i­mal ex­pan­sion of (x − 380nm)/​370nm. This moves the visi­ble spec­trum into the range [0,1].

I was imag­in­ing that T3 (“the ball is visi­ble”) was predicting

“The only digit to the left of the dec­i­mal point in f(color of ball in nm) is a 0 (with­out a nega­tive sign).”

while T1 (“the ball is red”) predicts

“The only digit to the left of the dec­i­mal point in f(color of ball in nm) is a 0 (with­out a nega­tive sign), and the digit im­me­di­ately to the right is a 7.”

and T2 (“the ball is blue”) predicts

“The only digit to the left of the dec­i­mal point in f(color of ball in nm) is a 0 (with­out a nega­tive sign), and the digit im­me­di­ately to the right is a 2.”

So I was really thinking of all the theories T1, T2, and T3 as giving precise predictions. It’s just that T3 opted not to make a prediction about something that T1 and T2 did predict on.
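The mapping can be sketched in a few lines of code. The sample wavelengths below (650 nm for red, 470 nm for blue) are my own illustrative picks, chosen so that the digits come out as in the comment:

```python
def f(wavelength_nm):
    """Map the visible spectrum [380 nm, 750 nm] onto [0, 1]."""
    return (wavelength_nm - 380.0) / 370.0

def digits(x, n=2):
    """First n digits of the decimal expansion of x after the point."""
    return f"{x:.10f}".split(".")[1][:n]

print(digits(f(470)))  # a blue wavelength: f ~ 0.24, leading digit 2
print(digits(f(650)))  # a red wavelength:  f ~ 0.73, leading digit 7
```

So T3’s prediction constrains only the digit left of the decimal point, while T1 and T2 each additionally pin down the first digit to the right.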

How­ever, I definitely take the point that Solomonoff in­duc­tion might still not be suit­able for my pur­poses. I was sup­pos­ing that T3 would be a “bet­ter” the­ory by some crite­rion like Solomonoff in­duc­tion. (I’m as­sum­ing, BTW, that T3 did pre­dict ev­ery­thing that T1 and T2 pre­dicted for the first 20 re­sults. It’s only for the 21st re­sult that T3 didn’t give an an­swer as de­tailed as those of T1 and T2. ) But from read­ing your com­ment, I guess maybe Solomonoff in­duc­tion wouldn’t even com­pare T3 to T1 and T2, since T3 doesn’t pur­port to an­swer all of the same ques­tions.

If so, I think that just means the Solomonoff isn’t quite gen­eral enough. There should be a way to com­pare two the­o­ries even if one of them an­swers ques­tions that the other doesn’t ad­dress. In par­tic­u­lar, in the case un­der con­sid­er­a­tion, T1 and T2 are given to be “equally good” (in some un­speci­fied sense), but they both pur­port to an­swer the same ques­tion in a differ­ent way. To my mind, that should mean that each of them isn’t re­ally jus­tified in choos­ing its an­swer over the other. But T3, in a sense, ac­knowl­edges that there is no rea­son to fa­vor one an­swer over the other. There should be some rigor­ous sense in which this makes T3 a bet­ter the­ory.

Tim Free­man, I hope to re­ply to your points soon, but I think I’m at my “re­cent com­ments” limit already, so I’ll try to get to it to­mor­row.

• Upon first read­ing, I hon­estly thought this post was ei­ther a joke or a se­man­tic trick (e.g., as­sum­ing the sci­en­tists were them­selves perfect Bayesi­ans which would re­quire some “There are blue-eyed peo­ple” rea­son­ing).

Be­cause the­o­ries that can make ac­cu­rate fore­casts are a small frac­tion of the­o­ries that can make ac­cu­rate hind­casts, the Bayesian weight has to be on the first guy.

In my mind, I see this vi­su­ally as the first guy pro­ject­ing a sur­face that con­tains the first 10 ob­ser­va­tions into the fu­ture and it in­ter­sect­ing with the ac­tual fu­ture. The sec­ond guy just wrapped a sur­face around his pre­sent (which con­tains the first guy’s fu­ture). Who says he pro­jected it in the right di­rec­tion?

But then I’m not as smart as Eliezer and could have missed some­thing.

• Both the­o­ries are equally good. Both are cor­rect. There is no way to choose one, ex­cept to make an­other ex­per­i­ment and see which the­ory—if any (still might be both well or both bro­ken) - will pre­vail.

• Thomas

• That the first the­ory is right seems ob­vi­ous and not the least bit coun­ter­in­tu­itive. There­fore, based on what I know about the psy­chol­ogy of this blog, I pre­dict that it is false and the sec­ond one is true.

• We have two theories that explain all the available data—and this is Overcoming Bias—so how come only a tiny number of people have mentioned the possibility of using Occam’s razor? Surely that must be part of any sensible response.

• I don’t think you’ve given enough in­for­ma­tion to make a rea­son­able choice. If the re­sults of all 20 ex­per­i­ments are con­sis­tent with both the­o­ries but the sec­ond the­ory would not have been made with­out the data from the sec­ond set of ex­per­i­ments, then it stands to rea­son that the sec­ond the­ory makes more pre­cise pre­dic­tions.

If the the­o­ries are equally com­plex and the sec­ond makes more pre­cise pre­dic­tions, then it ap­pears to be a bet­ter the­ory. If the sec­ond the­ory con­tains a bunch of ad hoc pa­ram­e­ters to im­prove the fit, then it’s likely a worse the­ory.

But of course the origi­nal ques­tion does not say that the sec­ond the­ory makes more pre­cise pre­dic­tions, nor that it would not have been made with­out the sec­ond set of ex­per­i­ments.

• Hi Tyrrell,

Let T1_21 and T2_21 be the two the­o­ries’ pre­dic­tions for the twenty-first ex­per­i­ment.

As you note, if all else is equal, our prior beliefs about P(T1_21) and P(T2_21) -- the odds we would’ve accepted on bets before hearing T1’s and T2’s predictions -- are relevant to the probability we should assign after hearing T1’s and T2’s predictions. It takes more evidence to justify a high-precision or otherwise low-prior-probability prediction. (Of course, by the same token, high-precision and otherwise low-prior predictions are often more useful.)

The precision (or more exactly, the prior probability) of the predictions T1 and T2 assign to the first twenty experimental results is also relevant. The precision of these tested predictions, however, pulls in the opposite direction: if theory T1 made extremely precise, low-prior-probability predictions and got them right, this should more strongly increase our prior probability that T1’s set of predictions is entirely true. You can formalize this with Bayes’ theorem. [However, the obvious formalization only shows how the probability of the conjunction of all of T1’s predictions increases; you need a model of how T1 and T2 were generated to know how indicative each theory’s track record is of its future predictive accuracy, or how much your belief about P(T1_21) specifically should increase. If you replace the scientists with random coin-flip machines, and your prior probability for each event is 1/2, T1’s past success shouldn’t increase your P(T1_21) belief at all.]

As to whether there is a sin­gle “best” met­ric for eval­u­at­ing the­o­ries, you are right that for any one ex­pert, with one set of start­ing (prior) be­liefs about the world and one set of data with which to up­date those be­liefs, there will be ex­actly one best (Bayes’-score-max­i­miz­ing) prob­a­bil­ity to as­sign to events T1_21 and T2_21. How­ever, if the two ex­perts are work­ing from non-iden­ti­cal back­ground in­for­ma­tion (e.g., if one has back­ground knowl­edge the other lacks), there is no rea­son to sup­pose the two ex­perts’ prob­a­bil­ities will match even if both are perfect Bayesi­ans. If you want to stick with the Solomonoff for­mal­ism, we can make the same point there: a given Solomonoff in­ducer will in­deed have ex­actly one best (prob­a­bil­is­tic) pre­dic­tion for the next ex­per­i­ment. How­ever, two differ­ent Solomonoff in­duc­ers, work­ing from two differ­ent UTM’s and as­so­ci­ated pri­ors (or up­dat­ing to two differ­ent sets of ob­ser­va­tions) may dis­agree. There is no known way to con­struct a perfectly canon­i­cal no­tion of “sim­plic­ity”, “prior prob­a­bil­ity” or “best” in your sense.

If you want to re­spond but are afraid of the “re­cent com­ments” limit, per­haps email me? We’re both friends of Jen­nifer Muel­ler’s (I think. I’m as­sum­ing you’re the Tyrrell McAllister she knows?), so be­tween that and our Over­com­ing Bias in­ter­sec­tion I’ve been mean­ing to try talk­ing to you some­time. an­nasala­mon at gmail dot com.

Also, have you read A Tech­ni­cal Ex­pla­na­tion ? It’s brilli­ant on many of these points.

• We believe the first (T1).

Why: Correctly predicted outcomes update its probability of being correct (Bayes).

The additional information available to the second theory is redundant, since it was correctly predicted by T1.

• A few thoughts.

I would like the one that:

0) Doesn’t violate any useful rules of thumb, e.g. conservation of energy, or allowing transmission of information faster than the speed of light in a vacuum.

1) Gives more precise predictions. Being consistent with a theory isn’t hard if the theory gives a large range of uncertainty. (E.g. if one theory is …)

2) Doesn’t have any infinities in its range.

If all these are equal, I would pre­fer them equally. Other­wise I would have to think that some­thing was spe­cial about the time they were sug­gested, and be money pumped.

For ex­am­ple: As­sume that I was asked this ques­tion many times, but my mem­ory wiped in be­tween times. If I preferred the pre­dict­ing the­ory, they could al­ter­nate which sci­en­tist dis­cov­ered the the­ory first, and charge me a small amount of money to get the first guys the­ory, but get the ex­plana­tory one for free. So I would be for­ever switch­ing be­tween the­o­ries, purely on their tem­po­ral­ness. Which seems a lit­tle weird.

• As a ma­chine-learn­ing prob­lem, it would be straight­for­ward: The sec­ond learn­ing al­gorithm (sci­en­tist) did it wrong. He’s sup­posed to train on half the data and test on the other half. In­stead he trained on all of it and skipped val­i­da­tion. We’d also be able to mea­sure how rel­a­tively com­plex the the­o­ries were, but the prob­lem state­ment doesn’t give us that in­for­ma­tion.

As a hu­man learn­ing prob­lem, it’s fog­gier. The sec­ond guy could still have hon­estly val­i­dated his the­ory against the data, or not. And it’s not straight­for­ward to show that one hu­man-read­able the­ory is more com­plex than an­other.

But with the in­for­ma­tion we’re given, we don’t know any­thing about that. So ISTM the prob­lem state­ment has ab­stracted away those el­e­ments, leav­ing us with learn­ing al­gorithms done right and done wrong.
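The train/validate distinction can be illustrated with a toy sketch of my own construction: a model that exactly interpolates all of its data (the analogue of training on everything with no validation) versus a simple model checked against held-out points. The true process here is assumed linear with a little noise.

```python
import random

random.seed(0)

# Noisy observations of a true linear process y = 2x + 1.
xs = list(range(20))
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.5) for x in xs]

def lagrange_eval(px, py, x):
    """Evaluate the unique interpolating polynomial through (px, py) at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(px, py)):
        term = yj
        for m, xm in enumerate(px):
            if m != j:
                term *= (x - xm) / (xj - xm)
        total += term
    return total

def linear_fit(px, py):
    """Least-squares line a + b*x."""
    n = len(px)
    mx, my = sum(px) / n, sum(py) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(px, py))
         / sum((x - mx) ** 2 for x in px))
    return my - b * mx, b

# Degree-9 polynomial through the first 10 points: zero "training" error,
# but wild extrapolation on the held-out points 10..19.
a, b = linear_fit(xs[:10], ys[:10])
err_interp = max(abs(lagrange_eval(xs[:10], ys[:10], x) - y)
                 for x, y in zip(xs[10:], ys[10:]))
err_linear = max(abs(a + b * x - y) for x, y in zip(xs[10:], ys[10:]))
assert err_interp > err_linear  # the exact interpolant overfits badly
```

The interpolant fits its ten points perfectly, exactly as a theory "consistent with the data" must, yet fails the held-out data by orders of magnitude more than the simple validated model.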

• We should take into ac­count the costs to a sci­en­tist of be­ing wrong. As­sume that the first sci­en­tist would pay a high price if the sec­ond ten data points didn’t sup­port his the­ory. In this case he would only pro­pose the the­ory if he was con­fi­dent it was cor­rect. This con­fi­dence might come from his in­tu­itive un­der­stand­ing of the the­ory and so wouldn’t be cap­tured by us if we just ob­served the 20 data points.

In con­trast, if there will be no more data the sec­ond sci­en­tist knows his the­ory will never be proved wrong.

• Sorry, I mis­read the ques­tion. Ig­nore my last an­swer.

• So re­view­ing the other com­ments now I see that I am es­sen­tially in agree­ment with M@ (on David’s blog) who posted prior to Eli. There­fore, Eli dis­agrees with that. Count me cu­ri­ous.

• Peter de Blanc got it right, IMHO. I don’t agree with any of the an­swers that in­volve in­fer­ence about the the­o­rists them­selves; they each did only one thing, so it is not the case that you can take one thing they did as ev­i­dence for the na­ture of some other thing they did.

• Peter de Blanc is right: The­o­ries screen off the the­o­rists. It doesn’t mat­ter what data they had, or what pro­cess they used to come up with the the­ory. At the end of the data, you’ve got twenty data points, and two the­o­ries, and you can use your pri­ors in the do­main (along with things like Oc­cam’s Ra­zor) to com­pute the like­li­hoods of the two the­o­ries.

But that’s not the puz­zle. The puz­zle doesn’t give us the two the­o­ries. Hence, strictly speak­ing, there is no cor­rect an­swer.

That said, we can start guess­ing like­li­hoods for what an­swer we would come up with, if we knew the two the­o­ries. And here what is im­por­tant is that all we know is that both the­o­ries are “con­sis­tent” with the data they had seen so far. Well, there are an in­finite num­ber of con­sis­tent the­o­ries for any data set, so that’s a pretty weak con­straint.

Hence peo­ple are jump­ing into the guess that sci­en­tist #2 will “overfit” the data, given the ex­tra 10 ob­ser­va­tions.

But that’s not a con­clu­sion you ought to make be­fore see­ing the de­tails of the two the­o­ries. Either he did overfit the data, or he didn’t, but we can’t de­ter­mine that un­til we see the the­o­ries.

So what it comes down to is that the first sci­en­tist has less op­por­tu­nity to overfit the data, since he only saw the first 10 points. And, its suc­cess­ful pre­dic­tion of the next 10 points is rea­son­able ev­i­dence that the­ory #1 is on the right track, whereas we have pre­cious lit­tle ev­i­dence (from the puz­zle) about the­ory #2.

But this doesn’t say that theory #1 _is_ better than theory #2. It only says that, if we ever had the chance to actually correctly evaluate both theories (using Bayesian priors on both theories and all the data), then we currently expect theory #1 will win that battle more often than theory #2.

But that’s a weak, kind of in­di­rect, con­clu­sion.

• The short an­swer is, “it de­pends.” For all we can tell from the state­ment of the prob­lem, the sec­ond “the­ory” could be “I prayed for di­v­ine rev­e­la­tion of the an­swers and got these 20.” Or it could be spe­cial rel­a­tivity in 1905. So I don’t think this “puz­zle” poses a real ques­tion.

• Oh, and Thomas says: “There is no way to choose one, ex­cept to make an­other ex­per­i­ment and see which the­ory—if any (still might be both well or both bro­ken) - will pre­vail.”

Which leads me to think he is con­strained by the Scien­tific Method, and hasn’t yet learned the Way of Bayes.

• Ac­tu­ally I’d like to take back my last com­ment. To the ex­tent that pre­dic­tions 11-20 and 21-30 are gen­er­ated by differ­ent in­de­pen­dent “parts” of the the­ory, then the qual­ity of the former part is ev­i­dence about the qual­ity of the lat­ter part via the the­o­rist’s com­pe­tence.

• Of course you can make an in­fer­ence about the ev­i­denced skill of the sci­en­tists. Scien­tist 1 was ca­pa­ble of pick­ing out of a large set of mod­els that cov­ered the first 10 vari­ables, the con­sid­er­ably smaller set of mod­els that also cov­ered the sec­ond 10. He did that by refer­ence to prin­ci­ples and knowl­edge he brought to the table about the na­ture of in­fer­ence and the prob­lem do­main. The sec­ond sci­en­tist has not shown any of this ca­pa­bil­ity. I think our prior ex­pec­ta­tion for the skill of the sci­en­tists would be ir­rele­vant, as­sum­ing that the prior was at least equal for both of them.

Peter: “The first the­o­rist had less data to work with, and so had less data available to in­sert into the the­ory as pa­ram­e­ters. This is ev­i­dence that the first the­ory will be smaller than the sec­ond the­ory”

The data is not equiv­a­lent to the model pa­ram­e­ters. A lin­ear pre­dic­tion model of [PREDICT_VALUE = CONSTANT * DATA_POINT_SEQUENCE_NUMBER] can model an in­finite num­ber of data points. Ad­ding more data points does not in­crease the model pa­ram­e­ters. If there is a model that pre­dicts 10 vari­ables, and sub­se­quently pre­dicts an­other 10 vari­ables there is no rea­son to add com­plex­ity un­less one prefers com­plex­ity.

• Ce­teris paribus, I’d choose the sec­ond the­ory since the pro­cess that gen­er­ated it had strictly more in­for­ma­tion. As­sume that the sci­en­tists would’ve gen­er­ated the same the­ory given the same data, and the data in ques­tion are coin flips. The first sci­en­tist sees a ran­dom look­ing se­ries of 10 coin flips with 5 heads and 5 tails and hy­poth­e­sizes that they are gen­er­ated by the ran­dom flips of a fair coin. We col­lect 10 more data points, and again we get 5 heads and 5 tails, the max­i­mum like­li­hood re­sult given the first the­ory. Now the sec­ond sci­en­tist sees the same 20 coin flips, and no­tices that the sec­ond se­ries of 10 flips ex­actly du­pli­cates the first. So the sec­ond sci­en­tist hy­poth­e­sizes that the gen­er­at­ing pro­cess is de­ter­minis­tic with a pe­riod of 10 flips. So even though the same 20 data points are max­i­mum like­li­hood given both the­o­ries, the sec­ond the­ory as­signs them more prob­a­bil­ity mass. I think this be­comes more salient in­tu­itively if we imag­ine in­creas­ing the length of the re­peat­ing se­ries to, say, 1,000,000.
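The coin-flip example can be made quantitative with a short sketch. The uniform prior over the 2^10 possible repeating patterns is my own modeling assumption for the second theory:

```python
# 20 flips whose second half exactly repeats the first (any "random-looking"
# pattern will do; this particular one is illustrative).
first_half = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
flips = first_half + first_half

# Theory 1: independent fair-coin flips.
p_fair = 0.5 ** len(flips)

# Theory 2: deterministic with period 10, pattern unknown a priori
# (uniform over the 2^10 possible patterns).
def p_periodic(seq, period=10):
    pattern = seq[:period]
    if all(seq[i] == pattern[i % period] for i in range(len(seq))):
        return 0.5 ** period
    return 0.0

# The periodic theory assigns the observed data 2^10 = 1024x more mass.
assert p_periodic(flips) / p_fair == 2 ** 10
```

With a repeating series of length 1,000,000 the ratio would be 2^500000, which is why the intuition becomes so much more salient there.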

• Ex­pe­rience alone leads me to pick The­ory #2. In what I do I’m con­stantly bat­tling aca­demic ex­perts ped­dling The­ory #1. Typ­i­cally they have looked at say 10 epi­demiolog­i­cal stud­ies and con­cluded that the the­ory “A causes B” is con­sis­tent with the data and thus true. A thou­sand law­suits against the maker of “A” are then launched on be­half of those who suffer from “B”.

Even­tu­ally, and al­most in­vari­ably with ad­mit­tedly a few no­table ex­cep­tions, the molec­u­lar peo­ple then come along and more con­vinc­ingly the­o­rize that “C causes A and B” such that the rea­son plain­tiff was say in­gest­ing “A” and then suffer­ing from “B” was be­cause “C” was pro­duc­ing an urge/​need for “A” as an early symp­tom of the even­tual “B”. Or, to take a more con­crete ex­am­ple, it may be that they demon­strate that plain­tiffs were ex­posed to “A” (e.g. a vac­cine) at the typ­i­cal age of on­set (e.g. autism) of “B” so that the per­ceived causal con­nec­tion was merely co­in­ci­den­tal.

The abil­ity to iden­tify and con­trol for con­founders is for some rea­son (per­haps to do with over­com­ing bias) height­ened in the sec­ond set of eyes to re­view data and the the­o­ries they gen­er­ate.

• I wrote:

To the ex­tent that pre­dic­tions 11-20 and 21 are gen­er­ated by differ­ent in­de­pen­dent “parts” of the the­ory, the qual­ity of the former part is ev­i­dence about the qual­ity of the lat­ter part via the the­o­rist’s com­pe­tence.

...how­ever, this is much less true of cases like New­ton or GR where you can’t change a small part of the the­ory with­out chang­ing all the pre­dic­tions, than it is of cases like “evolu­tion the­ory is true and by the way gen­eral rel­a­tivity is also true”, which is re­ally two the­o­ries, or cases like “New­ton is true on week­days and GR on week­ends”, which is a bad the­ory.

So I think that to first or­der, Peter’s an­swer is still right; and more­over, I think it can be restated from Oc­cam to Bayes as fol­lows:

Ex­per­i­ments 11-20 have given the late the­o­rizer more in­for­ma­tion on what false the­o­ries are con­sis­tent with the ev­i­dence, but they have not given the early the­o­rizer any us­able in­for­ma­tion on what false the­o­ries are con­sis­tent with the ev­i­dence. Ex­per­i­ments 11-20 have also given the late the­o­rizer more in­for­ma­tion on what the­o­ries are con­sis­tent with the ev­i­dence, but this does not help the late the­o­rizer rel­a­tive to the early the­o­rizer, whose the­ory af­ter all turned out to be con­sis­tent with the ev­i­dence. So ex­per­i­ments 11-20 made it more likely for a ran­dom false late the­ory to be con­sis­tent with the ev­i­dence, rel­a­tive to a ran­dom false early the­ory; but they did not make it more likely for a ran­dom late the­ory to be con­sis­tent with the ev­i­dence, rel­a­tive to the early the­ory that was put for­ward. There­fore, ac­cord­ing to some Bayes math that I’m too lazy to do, it must be the case that there are more false the­o­ries among late the­o­ries con­sis­tent with the ev­i­dence, than among early the­o­ries con­sis­tent with the ev­i­dence.

Does this make sense? I think I will let it stand as my fi­nal an­swer, with the caveat about the­o­ries with in­de­pen­dent parts pre­dict­ing differ­ent ex­per­i­ments, in which case our new in­for­ma­tion about the the­o­rists mat­ters.

• “relative to the early theory that was put forward” should have read “relative to a random early theory, given that it was consistent with the evidence”.

• Let’s sup­pose, purely for the sake of ar­gu­ment of course, that the sci­en­tists are su­per­ra­tional.

The first sci­en­tist chose the most prob­a­ble the­ory given the 10 ex­per­i­ments. If the pre­dic­tions are 100% cer­tain then it will still be the most prob­a­ble af­ter 10 more suc­cess­ful ex­per­i­ments. So, since the sec­ond sci­en­tist chose a differ­ent the­ory, there is un­cer­tainty and the other the­ory as­signed an even higher prob­a­bil­ity to these out­comes.

In re­al­ity peo­ple are bad at as­sess­ing pri­ors (hind­sight bias), lead­ing to overfit­ting. But these sci­en­tists are as­sumed to have as­sessed the pri­ors cor­rectly, and given this as­sump­tion you should be­lieve the sec­ond ex­pla­na­tion.

Of course, given more re­al­is­tic sci­en­tists, overfit­ting may be likely.

• The first the­o­rist had mul­ti­ple the­o­ries to choose from that would have been con­sis­tent with the first 10 data points—some of them bet­ter than oth­ers. Later ev­i­dence in­di­cates that he chose well, that he ap­par­ently has some kind of skill in choos­ing good the­o­ries. No such ev­i­dence is available re­gard­ing the skill of the sec­ond the­o­rist.

• My ap­proach: (us­ing Bayes’ The­o­rem ex­plic­itly)

A: first the­ory
B: sec­ond the­ory
D: data ac­cu­mu­lated be­tween the 10th and 20th trials

We’re in­ter­ested in the ra­tio P(A|D)/​P(B|D).

By Bayes’ The­o­rem:
P(A|D) = P(D|A)P(A)/P(D)
P(B|D) = P(D|B)P(B)/P(D)

Then
P(A|D)/​P(B|D) = P(D|A)P(A)/​(P(D|B)P(B)).

If each the­ory pre­dicts the data ob­served with equal like­li­hood (that is, un­der nei­ther the­ory is the data more likely to be ob­served), then P(D|A) = P(D|B) so we can sim­plify,
P(A|D)/​P(B|D) = P(A)/​P(B) >> 1
given that pre­sum­ably the­ory A was a much more plau­si­ble prior hy­poth­e­sis than the­ory B. Ac­cord­ingly we have P(A|D) >> P(B|D), so we should pre­fer the first the­ory.

In practice, these assumptions may not be warranted. In that case, we have to balance the plausibility of the priors (as best we can guess) against how well the theories predict the observed data (as we should be able to estimate directly from the theories).
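As a numeric illustration of the ratio above (the prior and likelihood values below are made-up, purely for illustration):

```python
def posterior_ratio(prior_a, prior_b, like_d_given_a, like_d_given_b):
    """P(A|D)/P(B|D) = [P(D|A)P(A)] / [P(D|B)P(B)]; P(D) cancels."""
    return (like_d_given_a * prior_a) / (like_d_given_b * prior_b)

# Equal likelihoods: the posterior ratio reduces to the prior ratio (50x).
print(posterior_ratio(0.50, 0.01, 0.2, 0.2))
# If B predicted the data 10x better, A's prior advantage shrinks to 5x.
print(posterior_ratio(0.50, 0.01, 0.2, 2.0))
```

This is just the odds form of Bayes’ Theorem: posterior odds = prior odds times the likelihood ratio.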

• If you’re handed the two hy­poth­e­sies as black boxes, so that you can’t ac­tu­ally see in­side them and work out which is more com­plex, then go with the first one.

...un­less you are at­tend­ing a magic show. For­tu­nately, it is not com­mon for sci­en­tists to be asked to choose be­tween hy­pothe­ses with­out even know­ing what they are.

Suppose the scientists S_10 and S_20 are fitting curves f(i) to noisy observations y(i) at points i = 0...20. Suppose there are two families of models, a polynomial g(i;a) and a trigonometric h(i;ω,φ):

g(i) ← sum(a[k]*i^k, k=0..infinity)
h(i) ← cos(ω*i + φ)

The angular frequency ω is predetermined. The phase φ is random:

φ ~ Flat(), equivalently φ ~ Uniform(0, 2*π)

The co­effi­cients a[k] are in­de­pen­dently nor­mally dis­tributed with mo­ments matched to the marginal mo­ments of the co­effi­cients in the Tay­lor ex­pan­sion of h(i):

a[k] ~ Normal(mean=0, stddev=(ω^k)/(sqrt(2)*factorial(k)))

There is some prob­a­bil­ity q that the true curve f(i) is gen­er­ated by the tri­gono­met­ric model h(i), and oth­er­wise f(i) is gen­er­ated by the polyno­mial model g(i):

isTri­gono­met­ric ~ Bernoulli(q)
f(i) ← if(isTri­gono­met­ric, then_val=h(i), else_val=g(i))

Noise is iid Gaus­sian:

n[i] ~ Normal(mean=0, stddev=σ)
y[i] ← f(i) + n[i]

(The no­ta­tion has been abused to use i as an in­dex in n[i] and y[i] be­cause each point i is sam­pled at most once.)
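The generative process above can be sampled directly. In this minimal sketch, the infinite polynomial series is truncated at K terms (my own assumption for tractability); omega, sigma, and q follow the definitions in the text:

```python
import math
import random

def sample_curve(omega, sigma, q, n_points=21, K=30):
    """Sample observations y[0..n_points-1] from the generative model above.
    K truncates the (in principle infinite) polynomial series."""
    if random.random() < q:
        # Trigonometric model: f(i) = cos(omega*i + phi), phi ~ Uniform(0, 2*pi).
        phi = random.uniform(0.0, 2.0 * math.pi)
        f = lambda i: math.cos(omega * i + phi)
    else:
        # Polynomial model: f(i) = sum_k a[k]*i^k, with a[k] moment-matched
        # to the Taylor coefficients of the trigonometric model.
        a = [random.gauss(0.0, omega ** k / (math.sqrt(2.0) * math.factorial(k)))
             for k in range(K)]
        f = lambda i: sum(a[k] * i ** k for k in range(K))
    # Add iid Gaussian observation noise.
    return [f(i) + random.gauss(0.0, sigma) for i in range(n_points)]
```

S_10 would see the first 10 samples and S_20 all 20; estimating the optimal mixing coefficient t would then require Monte Carlo over many such sampled worlds.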

Scientists S_10 and S_20 were randomly chosen from a pool of scientists S_j having different beliefs about q. A known fraction s of the scientists in the pool understand that the trigonometric model is possible. Their belief q_{S_j} about the value of q for this problem is that q is equal to v. The remaining scientists do not understand that the trigonometric model is possible, and resort to polynomial approximations to predict everything. Their belief q_{S_j} about the value of q for this problem is that q equals 0:

understandsTrigonometricModel(S_j) ~ Bernoulli(s)
q_{S_j} ← if(understandsTrigonometricModel(S_j), then_val=v, else_val=0)

(As a variation, the scientists can have beta-distributed beliefs q_{S_j} ~ Beta(α, β).)

Both scientists report their posterior means for f(i) conditional on their knowledge. S_10 knows y[i] for i=0...9 and S_20 knows y[i] for i=0...19. Both scientists are Bayesians and know the probabilistic structure of the problem and the values of ω and σ. Both scientists also predict posterior means for f(20), and therefore for the observable y(20).

You are given the values of ω, σ, q, s, and v and the fact that, for each scientist, the mean of the squared differences between the posterior means for f(i) and the observations y[i] is less than σ^2 (“the theory is consistent with the experiments”). You are not given the values y[i]. (You are also not given any information about any of the scientists’ predictive densities at y, conditional or not, which is maddening if you’re a Bayesian.) You are asked to choose a mixing coefficient t to combine the two scientists’ predictions for y[20] into a mixed prediction y_t[20]:

y_t[20] ← t*y_{S_10}[20] + (1-t)*y_{S_20}[20]

Your goal in choosing t is to minimize the expectation of the squared error (y_t[20]-y[20])^2. For some example values of ω, σ, q, s, and v, what are the optimal values of t?

(In the variation with beta-distributed q_{S_j}, the optimal t depends on α and β and not on s and v.)

Note that if σ is small, ω is not small, q is not small, s is not small, and v is not small, then the given information implies with very high probability that isTrigonometric==True, that the first scientist understands that the trigonometric model is possible, and that the first scientist’s posterior belief that the trigonometric model is correct is very high. (If the polynomial model had been correct, the first scientist’s narrow prediction of y[10]...y[19] would have been improbable.) What happens when s is high, so that the second scientist is likely to agree? Would S_20 then be a better predictor than S_10?

In this for­mu­la­tion the sci­en­tists are mak­ing pre­dic­tive dis­tri­bu­tions, which are not what most peo­ple mean by “the­o­ries”. How do you draw the line be­tween a pre­dic­tive dis­tri­bu­tion and a the­ory? When peo­ple in this thread use the words “sin­gle best the­ory”, what does that even mean? Even the Stan­dard Model and Gen­eral Rel­a­tivity use con­stants which are only known from mea­sure­ments up to an ap­prox­i­mate mul­ti­vari­ate Gaus­sian pos­te­rior dis­tri­bu­tion. Any­one who uses these phys­i­cal the­o­ries to pre­dict the “ideal” out­comes of ex­per­i­ments which mea­sure phys­i­cal con­stants must pre­dict a dis­tri­bu­tion of out­comes, not a point. Does this mean they are us­ing a “dis­tri­bu­tion over phys­i­cal the­o­ries” and not a “sin­gle best phys­i­cal the­ory”? Why do we even care about that dis­tinc­tion?

• Everything else being equal, go for the first theory.

• Two points I’d like to com­ment on.

I don’t think this is rele­vant if—as I un­der­stood from the de­scrip­tion—the first sci­en­tist’s the­ory pre­dicted ex­per­i­ments 11..20 with high ac­cu­racy. In this sce­nario, I don’t think the first sci­en­tist should have learned any­thing that would make them re­ject their pre­vi­ous view. This seems like an im­por­tant point. (I think I un­der­stood this from Tyrrell’s com­ment.)

Re: Theories screen off theorists

I agree that we should pick the simpler theory, if we’re able to judge their simplicity and one is the clear winner. This may not be easy. (To judge General Relativity to be appropriately simple, we may have to be familiar with the discussion around symmetries in physics, not just with the formulas of GR, for example...)

I un­der­stood Tyrrell to say that both of the sci­en­tists are im­perfect Bayesian rea­son­ers, and so are we. If we were perfect Bayesi­ans, both sci­en­tists and we would look at the data and im­me­di­ately make the same pre­dic­tion about the next trial. In prac­tice, all three of us use some large blob of heuris­tics. Each such blob of heuris­tics is go­ing to have a bias, and we want to pick the one that has the smaller ex­pected bias. (If we for­mal­ize the­o­ries as func­tions from ex­per­i­ments to prob­a­bil­ity dis­tri­bu­tions of re­sults, I think the “bias” would nat­u­rally be the Kul­lback-Leibler di­ver­gence be­tween the the­ory and the true dis­tri­bu­tion.) Us­ing Tyrrell’s ar­gu­ment, it seems we can show that the first sci­en­tist’s bias is likely to be smaller than the sec­ond sci­en­tist’s bias (other things be­ing equal).
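The KL-divergence notion of “bias” in the paragraph above can be made concrete with a toy discrete example (all of the distributions below are invented for illustration):

```python
import math

def kl(p, q):
    # KL(p || q): expected extra log-loss from predicting q when p is true.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical three-outcome experiment.
true_dist = [0.50, 0.30, 0.20]  # the underlying dynamic
theory_a  = [0.45, 0.35, 0.20]  # close to the truth
theory_b  = [0.70, 0.20, 0.10]  # overfit to a lucky-looking sample, say

print(kl(true_dist, theory_a), kl(true_dist, theory_b))
```

Under this formalization, “smaller expected bias” just means a smaller KL divergence from the true distribution, and a theory that equals the truth has bias exactly zero.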

• The way sci­ence is cur­rently done, ex­per­i­men­tal data that the for­mu­la­tor of the hy­poth­e­sis did not know about is much stronger ev­i­dence for a hy­poth­e­sis than ex­per­i­men­tal data he did know about.

A hy­poth­e­sis for­mu­lated by a perfect Bayesian rea­soner would not have that prop­erty, but hy­pothe­ses from hu­man sci­en­tists do, and I know of no cost-effec­tive way to stop hu­man sci­en­tists from gen­er­at­ing the effect. Part of the rea­son hu­man sci­en­tists do it is be­cause the origi­na­tor of a hy­poth­e­sis is too op­ti­mistic about the hy­poth­e­sis (and this op­ti­mism stems in part from the fact that be­ing known as the origi­na­tor of a suc­cess­ful hy­poth­e­sis is very ca­reer-en­hanc­ing), and part of the rea­son is be­cause a sci­en­tist tends to stop search­ing for hy­pothe­ses once he has one that fits the data (and I be­lieve this has been called mo­ti­vated stop­ping on this blog).

Most of the time, these hu­man bi­ases will swamp the other con­sid­er­a­tions (ex­cept that con­sid­er­a­tion men­tioned be­low) men­tioned so far in these com­ments. Con­se­quently, the hy­poth­e­sis ad­vanced by Scien­tist 1 is more prob­a­ble.

Some­one made a very good com­ment to the effect that Scien­tist 1 is prob­a­bly mak­ing bet­ter use of prior in­for­ma­tion. It might be the case that that is an­other way of de­scribing the effect I have de­scribed.

• Who­ever (E or Fried­man) chose the ti­tle, “Pre­dic­tion vs. Ex­pla­na­tion”, was prob­a­bly think­ing along the same lines.

• If I were to travel to the North Pole and live there through the months of Jan­uary and Fe­bru­ary with no prior knowl­edge of the area, then I would al­most cer­tainly be­lieve (one could even say The­o­rize) that it is con­stantly night time at the North Pole. I could move back to The United States, and may never know that my the­ory is wrong. If I had, how­ever, stayed through March and maybe into April, I would then know that the Sun does even­tu­ally rise. From this ex­tra in­for­ma­tion, I could pos­tu­late a new the­ory that would likely be more cor­rect.

“The Sun rises and falls in months-long cycles at the North Pole” is, subjectively, more complex than “The Sun never rises at the North Pole”, and yet the more complex theory is the correct one.

A theory based on more information (assuming the experiments were pure and the variables controlled) has to be more accurate. Fear of “over-fitting” is a bias: the principle can only be applied in hindsight, after the mistakes are already known. Over-fitting also seems to be a product of human error, and since we are given no information about the scientists running the experiments and constructing the theories, we must assume that they are faithful and diligent, and in a word, perfect.

Oc­cam’s Ra­zor is it­self a bias. It as­sumes hu­man er­ror in the form of over-com­pli­ca­tion via in­suffi­cient in­for­ma­tion. Given the in­for­ma­tion we have for this puz­zle, we can­not use any tool that as­sumes any such thing. I vote that Oc­cam’s Ra­zor sit this one out.

Given only what we have, even if E21 sides with T1 (Say that T1 = A is true, and T2 = A is true ex­cept when B. E21 yields A in spite of B.), then we must con­clude T3 (A is true ex­cept when B, ex­cept when C), which will be closer to T2 than to T1.

T1 < T2 < T3 etc.

Now if we were given in­for­ma­tion on the The­o­ries, Ex­per­i­ments, or Scien­tists, then it might be a com­pletely differ­ent story and Oc­cam’s Ra­zor might come off the sidelines. Un­til then though, I am of the opinion that this is the only log­i­cal con­clu­sion to this puz­zle us­ing only the in­for­ma­tion we were given.

• Re­mem­ber the first com­ment way back in the thread? Psy-Kosh? I’m pretty much with him.

We as­sume that both hy­pothe­ses are equally pre­cise—that they have equally pointed like­li­hood func­tions in the vicinity of the data so far.

If you know what’s in­side the boxes, and it’s di­rectly com­pa­rable via Oc­cam’s Ra­zor, then Oc­cam’s Ra­zor should prob­a­bly take over.

The main caveat on this point is that counting symbols in an equation doesn’t always get you the true prior probability of something, and the scientist’s ability to predict the next ten experiments from the first ten may suggest that his version of Occam’s Razor / prior probability is unusually good, if there’s a dispute about which Razor or prior to use.

For ex­am­ple, it might be that each ex­per­i­ment only gives you 4 bits of data, and when you write out the first sci­en­tist’s hy­poth­e­sis in sym­bols, it comes out to 60 bits worth of causal net­work, or some­thing like that. But it was the first hy­poth­e­sis the sci­en­tist thought of, af­ter see­ing only 10 ex­per­i­ments or 40 bits worth of data—and what’s more, it worked. Which sug­gests that the first sci­en­tist has a higher effec­tive prior for that hy­poth­e­sis, than the 60-bit Oc­cam mea­sure­ment of “count­ing sym­bols” would have you be­lieve. Direct Oc­cam stuff only gives you an up­per bound on the prob­a­bil­ity, not a lower bound.
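The arithmetic in the paragraph above can be sketched explicitly. This is just toy bookkeeping with the comment’s illustrative numbers, treating bit-lengths as log-probabilities in the usual universal-prior way:

```python
# Illustrative numbers from the comment above, nothing more.
bits_per_experiment = 4
experiments_seen = 10
data_bits = bits_per_experiment * experiments_seen  # 40 bits observed
hypothesis_bits = 60                                # symbol-counting length

# Symbol-counting Occam prior: an upper bound of 2^-60 on the hypothesis.
occam_prior_bound = 2.0 ** -hypothesis_bits

# 40 bits of data can raise a posterior by a factor of at most 2^40
# (a crude likelihood-ratio bound), so even a perfect fit leaves:
posterior_bound = occam_prior_bound * 2.0 ** data_bits  # = 2^-20
print(posterior_bound)
```

Yet the scientist not only found the hypothesis from those 40 bits but bet on it and won, which is the sense in which his effective prior must be higher than the 60-bit symbol count suggests.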

If you don’t know what’s in­side the boxes or you don’t have a good Oc­cam prior for it, the first the­ory wins be­cause the sec­ond black box is pre­sumed to have used more of the data.

The main cir­cum­stance un­der which the sec­ond the­ory wins out­right, is if you can look in­side the boxes and the sec­ond the­ory is strictly sim­pler—that is, it cap­tures all the suc­cesses so far, while con­tain­ing strictly fewer el­e­ments—not just a shorter de­scrip­tion, but a de­scrip­tion that is a strict sub­set of the first. Then we just say that the first the­ory had a dan­gling part that needs snip­ping off, which there was never a rea­son to hy­poth­e­size in the first place.

• The question is whether the likelihood that the 21st experiment will validate the best theory constructed from 20 data points (and invalidate the best theory constructed from 10 data points, when that theory also fits the other ten) is greater than the likelihood that scientist B is just being dumb.

The likelihood of the former is very hard to calculate, but it’s definitely less than 1/11; in other words, more than 90% of the time the first theory will still be, if not the best possible theory, good enough to predict the results of one more experiment. The likelihood that a random scientist, who has 20 data points and a theory that explains them, will come up with a different theory which is total crap, is easily more than 1 in 10.

Ergo, we trust the­ory A.

• Part of the problem here is that the situation presented is an extremely unusual one. Unless scientist B’s theory is deliberately idiotic, experiment 21 has to strike at a point of contention between two theories which otherwise agree, and it has to be the only experiment out of 21 which does so. On top of that, both scientists have to pick one of these theories, and they have to pick different ones. Even if those theories are the only ones which make any sense, and they’re equally likely from the available data, your chance of ending up in the situation the problem presents is less than 1/100.

Even if you’re given 21 data points that fol­low a pat­tern which could be pre­dicted from the first 10, and you have to de­liber­ately come up with a the­ory that fits the first 20 but not the 21st, it’s quite tricky to do so. I would be sur­prised if any­one could come up with even a sin­gle ex­am­ple of the situ­a­tion pre­sented in this puz­zle (or an analo­gous one with even more ex­per­i­ments) ever oc­cur­ring in the real world.

Un­less ex­per­i­ment 21 is of a differ­ent na­ture than ex­per­i­ments 1-20. A differ­ent level of pre­ci­sion, say. Then I’d go with sci­en­tist B, be­cause with more data he can make a model that’s more pre­cise, and if pre­ci­sion sud­denly mat­ters a lot more, it’s easy to see how he could be right and A could be wrong.

• If in­stead of ten ex­per­i­ments per set, there were only 3, who here would pick the­ory B in­stead?

• Since both the­o­ries satisfy all 20 ex­per­i­ments, for all in­tents and pur­poses of ex­per­i­men­ta­tion the the­o­ries are both equally valid or equally in­valid.

• My the­ory af­ter see­ing all 20 ex­per­i­ments is:

• The first in the se­ries will be ‘A’

• The sec­ond in the se­ries will be ‘B’

• The third in the se­ries will be ‘C’

• The fourth in the se­ries will be ‘D’

• The fifth in the se­ries will be ‘E’

• The sixth in the se­ries will be ‘F’

• The sev­enth in the se­ries will be ‘G’

• The eighth in the se­ries will be ‘H’

• The ninth in the se­ries will be ‘I’

• The tenth in the se­ries will be ‘J’

• The eleventh in the se­ries will be ‘K’

• The twelfth in the se­ries will be ‘L’

• The thir­teenth in the se­ries will be ‘M’

• The four­teenth in the se­ries will be ‘N’

• The fif­teenth in the se­ries will be ‘O’

• The six­teenth in the se­ries will be ‘P’

• The eigh­teenth in the se­ries will be ‘Q’

• The twen­tieth in the se­ries will be ‘R’

• The twenty first in the se­ries will be a camel.

Those guys who have the the­o­ries “Each ex­per­i­ment will give a suc­ces­sive let­ter of the alpha­bet” and “Each ex­per­i­ment will give the next ASCII char­ac­ter” may be ‘equally valid or in­valid’ but, well, they lack cre­ativity, don’t you think?

• Imag­ine the ten ex­per­i­ments pro­duced the fol­low­ing num­bers as re­sults: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

The first scientist’s hypothesis is this function: if n ≤ 20 then n else 5 (where n is the number of the variable being tested in the experiment).

10 more experiments are done, and of course it predicts the answers perfectly. Scientist two comes up with his hypothesis: n. That’s it; just the value of n is the value that will be measured by the experiment.

Now, would you re­ally trust the first hy­poth­e­sis be­cause it hap­pened to have been made be­fore the next ten ex­per­i­men­tal re­sults were known?
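The two hypotheses in this example are easy to write down and compare directly (using n ≤ 20 so that the first one fits all twenty observed experiments):

```python
def hypothesis_1(n):
    # First scientist: matches the data, but hard-codes a special case
    # that none of the twenty experiments ever tested.
    return n if n <= 20 else 5

def hypothesis_2(n):
    # Second scientist: simply the identity function.
    return n

# Both hypotheses fit all twenty observations perfectly...
observed = {n: n for n in range(1, 21)}
assert all(hypothesis_1(n) == r for n, r in observed.items())
assert all(hypothesis_2(n) == r for n, r in observed.items())

# ...but they disagree about experiment 21.
print(hypothesis_1(21), hypothesis_2(21))  # prints: 5 21
```

The data alone cannot distinguish them; only the hard-coded special case, visible when you inspect the hypotheses, marks the first one as overfit.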

In practice it’s usually better to choose the hypothesis that has made more successful predictions in the past, because that is evidence that it isn’t overfit. But complexity is also a more general way to keep overfitting in check, useful when you can’t simply run more experiments to test your hypotheses, or when doing so would be expensive.