# Unknown Knowns

Previously (Marginal Revolution): Gambling Can Save Science

A study attempted to replicate 21 studies published in Science and Nature.

Beforehand, prediction markets were used to predict which studies would replicate, and with what probability. The results were as follows (from the original paper):

Fig. 4: Prediction market and survey beliefs.

The prediction market beliefs and the survey beliefs of replicating (from treatment 2 for measuring beliefs; see the Supplementary Methods for details and Supplementary Fig. 6 for the results from treatment 1) are shown. The replication studies are ranked in terms of prediction market beliefs on the y axis, with replication studies more likely to replicate than not to the right of the dashed line. The mean prediction market belief of replication is 63.4% (range: 23.1–95.5%, 95% CI = 53.7–73.0%) and the mean survey belief is 60.6% (range: 27.8–81.5%, 95% CI = 53.0–68.2%). This is similar to the actual replication rate of 61.9%. The prediction market beliefs and survey beliefs are highly correlated, but imprecisely estimated (Spearman correlation coefficient: 0.845, 95% CI = 0.652–0.936, P < 0.001, n = 21). Both the prediction market beliefs (Spearman correlation coefficient: 0.842, 95% CI = 0.645–0.934, P < 0.001, n = 21) and the survey beliefs (Spearman correlation coefficient: 0.761, 95% CI = 0.491–0.898, P < 0.001, n = 21) are also highly correlated with a successful replication.

That is not only a super impressive result. That result is suspiciously amazingly great.

The mean prediction market belief of replication is 63.4%, the survey mean was 60.6% and the final result was 61.9%. That’s impressive all around.

What’s far more striking is that they knew exactly which studies would replicate. Every study that would replicate traded at a higher probability of success than every study that would fail to replicate.

Combining that with an almost exactly correct mean success rate, we have a stunning display of knowledge and of under-confidence.

Then we combine that with this fact from the paper:

Second, among the unsuccessful replications, there was essentially no evidence for the original finding. The average relative effect size was very close to zero for the eight findings that failed to replicate according to the statistical significance criterion.

That means there was a clean cut. Thirteen of the studies successfully replicated. The other eight not only didn’t replicate, but showed very close to no effect.

Now combine these facts: The rate of replication was estimated correctly. The studies were exactly correctly sorted by whether they would replicate. None of the studies that failed to replicate came close to replicating, so there was a ‘clean cut’ in the underlying scientific reality. Some of the studies found real results. All others were either fraud, p-hacking, or the light p-hacking of a bad hypothesis and a small sample size. No in between.

The implementation of the prediction market used a market maker who began anchored to a 50% probability of replication. This, and the fact that participants had limited tokens with which to trade (and thus had to prioritize which probabilities to move), explains some of the under-confidence in the individual results. The rest seems to be legitimate under-confidence.
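For intuition about how such an anchor and token budget interact: automated market makers of this kind are commonly implemented with Hanson’s logarithmic market scoring rule (LMSR). The paper doesn’t spell out the exact mechanism here, so treat this as an illustrative sketch under that assumption, not the study’s actual implementation. With no trades outstanding, an LMSR market quotes exactly 50%, and moving the price away from 50% costs tokens:

```python
import math

def lmsr_price(q_yes: float, q_no: float, b: float = 100.0) -> float:
    """Instantaneous YES price under an LMSR market maker; with equal
    outstanding shares the quoted probability is exactly 50%."""
    ey, en = math.exp(q_yes / b), math.exp(q_no / b)
    return ey / (ey + en)

def lmsr_cost(q_yes: float, q_no: float, b: float = 100.0) -> float:
    """LMSR cost function; a trade costs the change in C across the trade."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

# With no trades, the market quotes the 50% "ignorance prior".
assert abs(lmsr_price(0, 0) - 0.5) < 1e-12

# Buying 30 YES shares moves the quoted probability above 50%,
# and the cost comes out of the trader's limited token budget.
cost = lmsr_cost(30, 0) - lmsr_cost(0, 0)
new_price = lmsr_price(30, 0)
```

Because each share costs tokens and budgets were limited, traders who split their tokens across many studies could only push each price part of the way toward their true belief, which mechanically produces some of the observed under-confidence.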

What we have here is an example of that elusive object, the unknown known: things we don’t know that we know. This completes Rumsfeld’s 2×2. We pretend that we don’t have enough information to know which studies represent real results and which ones don’t. We are modest. We don’t fully update on information that doesn’t conform properly to the formal rules of inference, or the norms of scientific debate. We don’t dare make the claim that we know, even to ourselves.

And yet, we know.

What else do we know?

• This is a very interesting post that seems to be a clean example of a really important problem. If it’s true, I expect it will be an important building block in my model of the world.

However, I feel confused about it. For example, the participants had limited tokens and the ignorance prior was set before they traded, which appears to have induced underconfidence by default, and it’s not clear to me whether this entire effect is explained by that. Also, the blue diamonds aren’t actually a great predictor of the blue circles, and I don’t know why that would happen.

So I’m nominating this for review. If people review it in detail and find it’s valid, then I think it’s very important, but they might not, and that’s also valuable work.

• I… had totally forgotten what the actual content of this post was. (I looked at it while pondering things to nominate, vaguely remembered some anecdote that led up to ‘and therefore, unknown knowns exist’, and thought ‘well, it might be important that unknown knowns exist, but I haven’t used that in the past year, so I probably shouldn’t nominate it.’)

But, yeah, the meat of this post seems incredibly important-if-true.

• Second Bena’s nomination

• Tl;dr: I don’t think this post stands up to close scrutiny, although there may be unknown knowns anyway. This is partly due to a couple of things in the original paper which I think are a bit misleading for the purposes of analysing the markets.

The unknown knowns claim is based on 3 patterns in the data:

“The mean prediction market belief of replication is 63.4%, the survey mean was 60.6% and the final result was 61.9%. That’s impressive all around.”

“Every study that would replicate traded at a higher probability of success than every study that would fail to replicate.”

“None of the studies that failed to replicate came close to replicating, so there was a ‘clean cut’ in the underlying scientific reality.”

Taking these in reverse order:

## Clean cut in results

I don’t think that there is as clear a distinction between successful and unsuccessful replications as stated in the OP:

“None of the studies that failed to replicate came close to replicating”

This assertion is based on a statement in the paper:

“Second, among the unsuccessful replications, there was essentially no evidence for the original finding. The average relative effect size was very close to zero for the eight findings that failed to replicate according to the statistical significance criterion.”

However this doesn’t necessarily support the claim of a dichotomy: the average being close to 0 doesn’t imply that all the results were close to 0, nor that every successful replication passed cleanly. If you ignore the colours, this graph from the paper suggests that the normalised effect sizes are more of a continuum than a clean cut (the central section, b, is the relevant chart).

Eyeballing that graph, there is 1 failed replication which nearly succeeded and 4 successful ones which could have failed. If the effect size had shifted by less than 1 s.d. (for some of them, less than 0.5 s.d.) then the success would have become a failure or vice versa (although some might have then passed at stage 2). [1]

## Monotonic market belief vs replication success

Of the 5 replications noted above, the 1 which nearly passed was ranked last by market belief, and the 4 which nearly failed were ranked 3, 4, 5 and 7. If any of these had gone the other way it would have ruined the beautiful monotonic result.

According to the planned procedure [1], the 1 study which nearly passed replication should have been counted as a pass, as it successfully replicated in stage 1 and should not have proceeded to stage 2, where the significance disappeared. I think it is right to count this as an overall failed replication, but for the sake of analysing the market it should be listed as a success.

Having said that, the pattern is still a very impressive result, which I look into below.

## Mean market belief

The OP notes that there is a good match between the mean market belief of replication and the actual fraction of successful replications. To me this doesn’t really say much about whether the participants in the market were under-confident or not. If they were to suddenly become more confident then the mean market belief could easily move away from the result.

If the market is under-confident, it seems like one could buy options in all the markets trading above 0.5 and sell options in all the ones below, and expect to make a profit. If I did this then I would buy options in 16 of the 21 markets (76%) and would actually increase the mean market belief away from the actual percentage of successful replications. By this metric, becoming more confident would lower accuracy.
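To make that thought experiment concrete, here is a sketch of the strategy’s payoff with made-up prices and outcomes (not the paper’s data), assuming a bought option costs its quoted price and pays 1 if the study replicates, with a sold option as the mirror image:

```python
def strategy_profit(beliefs, outcomes, threshold=0.5):
    """Buy one option wherever the price exceeds `threshold`, sell one
    wherever it is below 1 - threshold; return total profit at resolution."""
    profit = 0.0
    for price, replicated in zip(beliefs, outcomes):
        if price > threshold:
            # Buy: pay `price`, receive 1 if the study replicates.
            profit += (1 - price) if replicated else -price
        elif price < 1 - threshold:
            # Sell: receive `price`, pay out 1 if the study replicates.
            profit += -(1 - price) if replicated else price
    return profit

# A hypothetical under-confident market: every study the market leans
# toward resolves in the market's favour, so the strategy profits.
beliefs = [0.6, 0.7, 0.8, 0.4, 0.3]
outcomes = [1, 1, 1, 0, 0]
profit = strategy_profit(beliefs, outcomes)  # comes to roughly 1.6 here
```

The `threshold` parameter also captures the variant of sitting out the 40%–60% range by passing `threshold=0.6` instead of the default.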

In a similar vein, I also don’t think Spearman coefficients can tell us much about over/under-confidence. Spearman coefficients are based on rank order, so if every option on the market became less/more confident by the same amount, the Spearman coefficients wouldn’t change.
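That invariance is easy to verify directly. A minimal pure-Python check with made-up beliefs and outcomes (Spearman is just the Pearson correlation of the ranks, so any rank-preserving change in confidence leaves it untouched):

```python
def ranks(xs):
    """Rank values from 1..n, averaging tied ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical market beliefs and binary replication outcomes.
beliefs = [0.25, 0.40, 0.55, 0.70, 0.90]
outcomes = [0, 0, 1, 1, 1]
base = spearman(beliefs, outcomes)

# Push every belief 10 points toward its nearest extreme (more
# "confident"): the ranks are unchanged, so Spearman is unchanged.
confident = [p + 0.1 if p > 0.5 else p - 0.1 for p in beliefs]
assert abs(spearman(confident, outcomes) - base) < 1e-12
```

So a high Spearman coefficient tells us the market sorts studies well, but nothing about whether the quoted probabilities are too timid or too bold.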

## Are there unknown knowns anyway?

Notwithstanding the above, the graph in the OP still looks to me as though the market is under-confident. If I were to buy an option in every study with market belief >0.5 and sell in every study <0.5, I would still make a decent profit when the market resolved. However it is not clear whether this is a consistent pattern across similar markets.

Fortunately the paper also includes data on 2 other markets (success in stage 1 of the replication, based on 2 different sets of participants), so it is possible to check whether these markets were similarly under-confident. [2]

If I performed the same action of buying and selling depending on market belief, I would make a very small gain in one market and a small loss in the other. This does not suggest that there is a consistent pattern of under-confidence.

It is possible to check for calibration across the markets. I split the 63 market predictions (3 markets × 21 studies) into 4 groups depending on the level of market belief: 50-60%, 60-70%, 70-80% and 80-100% (any market beliefs with p<50% are converted to 1-p for grouping).

For beliefs of 50-60% confidence, the market was correct 29% of the time. Across the 3 markets this varied from 0-50% correct.

For beliefs of 60-70% confidence, the market was correct 93% of the time. Across the 3 markets this varied from 75-100% correct.

For beliefs of 70-80% confidence, the market was correct 78% of the time. Across the 3 markets this varied from 75-83% correct.

For beliefs of 80-100% confidence, the market was correct 89% of the time. Across the 3 markets this varied from 75-100% correct.

We could claim that anything the markets place in the 50-60% range is genuinely uncertain, but that everything above 60% should be adjusted up to at least 75%, maybe something like an 80-85% chance.

If I perform the same buying/selling that I discussed previously but set my limit to 0.6 instead of 0.5 (i.e. don’t buy or sell in the range 40%-60%), then I would make a tidy profit in all 3 markets.

But I’m not sure I’m completely persuaded. Essentially there is only one range which differs significantly from the market being well calibrated (p=0.024, two-tailed binomial). If I adjust for multiple hypothesis testing this is no longer significant. There is some Bayesian evidence here, but not enough to completely persuade me.
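For reference, an exact two-tailed binomial test of this kind needs nothing beyond the standard library. The counts below are illustrative stand-ins (13 correct out of 14 predictions, tested against a quoted confidence of 65%), not the actual numbers behind the p=0.024 figure above:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_test_two_tailed(k: int, n: int, p: float) -> float:
    """Exact two-tailed test: sum the probabilities of every outcome
    no more likely than the observed one (the 'small p-values' method)."""
    observed = binom_pmf(k, n, p)
    return sum(binom_pmf(i, n, p) for i in range(n + 1)
               if binom_pmf(i, n, p) <= observed * (1 + 1e-9))

# Illustrative: were 13 hits out of 14 predictions consistent with the
# market actually being 65% confident in each? (Numbers are made up.)
p_value = binom_test_two_tailed(13, 14, 0.65)  # significant at the 5% level
```

With only ~14 predictions per bin, a single bin clearing p<0.05 before a multiple-comparisons correction is weak evidence, which is the point being made above.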

## Summary

I don’t think the paper in question provides sufficient evidence to conclude that there are unknown knowns in predicting study replication. It is good to know that we are fairly good at predicting which results will replicate, but I think the question of how well calibrated we are remains an open topic.

Hopefully the replication markets study will give more insights into this.

***

[1] The replication was performed in 2 stages. The first was intended to have a 95% chance of finding an effect size of 75% of the original finding. If the study replicated here, it was to stop and be ticked off as a successful replication. Those that didn’t replicate in stage 1 proceeded to stage 2, where the sample size was increased in order to have a 95% chance of finding effect sizes at 50% of the original finding.

[2] Fig. 7 in the supplementary information shows the same graph as in the OP but based on Treatment 1 market beliefs, which relate to stage 1 predictions. This still looks quite impressively monotonic. However the colouring system is misleading for analysing market success, as it relates to success after stage 2 of the replication, while the market was predicting stage 1. If this is corrected then the graph looks a lot less monotonic, flipping the results for Pyc & Rawson (6th), Duncan et al. (8th) and Ackerman et al. (19th).

• This is awesome :) Thank you very much for reading through it all and writing down your thoughts and conclusions.

• 80,000 Hours now has a quiz up where you can test your own ability to guess which studies replicated: https://www.lesswrong.com/posts/KsyD6GmFN2EtirX74/psychology-replication-quiz

• That result is suspiciously amazingly great.

After this sentence, I thought you were going in an entirely different direction. I am also somewhat suspicious. How trustworthy is the replication paper?

• Looking at the comments on the quiz link post, the average was 34.8/42. Using the same scoring system, the prediction market would have got 36/42 (there were 3 studies that didn’t replicate to which the market gave >50% credence). If competent laypeople (some of whom by their own admission didn’t spend long on the task) can get 34.8, 36 doesn’t seem unreasonable.

I think the paper looks especially impressive due to the ability to give probability estimates, but having done the quiz, the ones I got wrong were amongst those which I felt least confident about.

• I curated this post; I really appreciated having the post’s framing. It made the study’s implications for the old debate on modesty really clear and crisp, and I doubt I’d’ve framed it as clearly in my own head had I found out about the study a different way.

• Like ricraz, I was initially expecting a different post, but I liked what was done.

However we still have the underlying problem that the replication test performed does not seem to do what it claims. https://www.sciencenews.org/blog/context/debates-whether-science-broken-dont-fit-tweets has some interesting comments, I think. If I understood correctly, the conclusion that a later test produced a different p-value says nothing about the underlying hypothesis; in other words, the hypothesis is not tested, only the data. So unless this is all about running the same data sets... but that suggests other problems.

• I think this explanation is mostly true, but the participants may have been more confident of the relative degree of fishiness than the absolute degree. A single unknown variable affecting all of the studies (in this case, the degree of truly invisible fudging in the field) ought to regress each individual probability away from extremes somewhat.

• An example that comes to mind is from poker. Say you just sit down at the table and a kid with a backwards baseball cap and sunglasses makes an aggressive move. A lot of people will fold with the rationale:

I have a feeling he’s bluffing, but I just sat down so I don’t know. I have to see him prove that he’s capable before I give him credit.

Similar example: take the same situation and replace the kid with a twenty-something-year-old Asian player. Twenty-something-year-old Asian players tend to be aggressive. People know that, but often still won’t give the player credit for being capable of bluffing because they “don’t want to stereotype”.

• Given that the stereotypes are known to all players and can be manipulated (more so the baseball cap than race), refusing to believe the signals seems like the correct thing at high-level tables where all players can be assumed to have thought through their presentation. Even with something like race, if the 20-year-old Asian player knows you think he’s likely to be aggressive, he can use that to his advantage.

• I would probably have thought the same thing if I didn’t play poker, but my impression from playing poker is that players just aren’t that sophisticated (in this manner) at anything but the highest of stakes.

• I think a little more goes into it with poker, at least with Texas Hold’em. The odds change every time a new card is laid down. The player who goes all-in before the flop might actually have a pair of Aces, but another player could still win with a flush once all the cards are down.

I’m not sure what your underlying point here is; I might not be disagreeing with you. One lesson I take from poker is that there is little cost to folding when the stakes are high, but a very large cost to betting and being wrong. It is safer to sit and watch for a while and wait for a hand you have great confidence in before challenging the “all-in” player.

Similarly, there seem to be greater social downsides to believing something that turns out to be false than to being skeptical of something that turns out to be true.

• The central point I’m making is that people often know that the kid with the backwards baseball cap and sunglasses is likely to be bluffing, even though they don’t know that they know it, and thus it’s an example of an unknown known.

It is true that the cards change every hand, and so the kid may not be bluffing, but the probabilities don’t change (for a given context), so the kid is just as likely to be bluffing each time. E.g. on a 9-6-4 flop, if the kid is the preflop raiser, he could have AA, but on that flop he’s likely to be bluffing, say, 80% of the time.