# [Question] How do you assess the quality / reliability of a scientific study?

When you look at a pa­per, what signs cause you to take it se­ri­ously? What signs cause you to dis­card the study as too poorly de­signed to be much ev­i­dence one way or the other?

I’m hop­ing to com­pile a repos­i­tory of heuris­tics on study eval­u­a­tion, and would love to hear peo­ple’s tips and tricks, or their full eval­u­a­tion-pro­cess.

I’m look­ing for things like...

• “If the n (sam­ple size) is be­low [some thresh­old value], I usu­ally don’t pay much at­ten­tion.”

• “I’m mostly on the look­out for big effect sizes.”

• “I read the ab­stract, then I spend a few min­utes think­ing about how I would de­sign the ex­per­i­ment, in­clud­ing which con­founds I would have to con­trol for, and how I could do that. Then I read the meth­ods sec­tion, and see how their study de­sign com­pares to my 1-3 minute sketch. Does their de­sign seem sen­si­ble? Are they ac­count­ing for the first-or­der-ob­vi­ous con­founds?”

• etc.

• I’ve prob­a­bly read about 1000 pa­pers. Les­sons learned the hard way...

1. Look at the sponsorship of the research and of the researchers (previous sponsorship, "consultancies", etc. are also important for up to 10-15 years). This creates massive bias. E.g.: a lot of medical bodies and researchers are owned by pharmaceutical companies.

2. Look at ide­olog­i­cal bi­ases of the au­thors. E.g. a lot of so­cial sci­ence re­search as­sumes as a given that genes have no effect on per­son­al­ity or in­tel­li­gence. (Yes, re­ally).

3. Un­der­stand statis­tics very deeply. There is no pain-free way to get this knowl­edge, but with­out it you can­not win here. E.g. a) The as­sump­tions be­hind all the statis­ti­cal mod­els b) the limi­ta­tions of alleged “cor­rec­tions”. You need to un­der­stand both Bayesian and Fre­quen­tist statis­tics in depth, to the point that they are ob­vi­ous and in­tu­itive to you.

4. Understand how researchers rig results, e.g. undisclosed multiple comparisons, peeking at the data before deciding what analysis to do, failing to pre-publish the design and end points and to follow that pre-publication, "run-in periods" for drug trials, sponsor-controlled committees to review and change diagnoses… There are papers about this, e.g. "Why Most Published Research Findings Are False".

5. After checking sponsorship, read the methods section carefully. Look for problems. Have valid and appropriate statistics been used? Were the logical end points assessed? Maybe then look at the conclusions. Do the conclusions match the body of the paper? Has the data from the study been made available to all qualified researchers to check the analysis? Things can change a lot when that happens, e.g. Tamiflu. If the data is only available to commercial interests and their stooges, this is a bad sign.

6. Has the study been repli­cated by in­de­pen­dent re­searchers?

7. Is the study observational? If so, does it meet generally accepted criteria for valid observational studies? (Large effect, dose-response gradient, well understood causal model, well understood confounders, confounders smaller than the published effect, etc.)

8. Do not think you can read ab­stracts only and learn much that is use­ful.

9. Read some of the vitriolic books about the problems in research, e.g. "Deadly Medicines and Organised Crime: How Big Pharma Has Corrupted Healthcare" by Peter C. Gøtzsche. Not everything in this book is true but it will open your eyes about what can happen.

10. Face up to the fact that 80-90% of stud­ies are use­less or wrong. You will spend a lot of time read­ing things only to con­clude that there is not much there.

• One of the most mis­er­able things about the LW ex­pe­rience is re­al­iz­ing how lit­tle you ac­tu­ally know with con­fi­dence.

• I’ve prob­a­bly read about 1000 pa­pers. Les­sons learned the hard way...

Very cool. How have these been split across differ­ent fields/​do­mains?

• Mostly medicine, nu­tri­tion, metabolism. Also fi­nance and eco­nomics.

• What kinds of experiences were the hard lessons? What did the moments of learning look like?

• Mostly be­lat­edly re­al­iz­ing that stud­ies I took as Gospel turned out to be wrong. This trig­gered an in­tense de­sire to know why and how.

• This is a great an­swer and should be taught to ev­ery­one.

• (a minor thing—I used to have a sep­a­rate MSWord file with a table for “tech­niques”. Some peo­ple pre­fer Ex­cel and so on, but I find that Word helps me keep it la­conic. The columns were: Species; Pur­pose; Fix­a­tion/​Stor­age; Treat­ment; and Refer­ence (with a hy­per­link). Within Treat­ment I just high­lighted spe­cific terms. Very easy to see some­thing out of the or­di­nary.)

• Is there an on­line way to bet­ter tag which stud­ies are sus­pect and which ones aren’t—for the sake of ev­ery­one else who reads af­ter?

• Check out PubPeer.

• I am using https://scite.ai/ with a plugin for browsers, but I would love a similar service with user-generated flags.

• Con­text: My ex­pe­rience is pri­mar­ily with psy­chol­ogy pa­pers (heuris­tics & bi­ases, so­cial psych, and similar ar­eas), and it seems to gen­er­al­ize pretty well to other so­cial sci­ence re­search and fields with similar sorts of meth­ods.

1. Is this “re­sult” just noise? Or would it repli­cate?

2. (If there's something besides noise) Is there anything interesting going on here? Or are all the "effects" just confounds, statistical artifacts, demonstrations of the obvious, etc.?

3. (If there is some­thing in­ter­est­ing go­ing on here) What is go­ing on here? What’s the main take­away? What can we learn from this? Does it sup­port the claim that some peo­ple are tempted to use it to sup­port?

There is some benefit just to ex­plic­itly con­sid­er­ing all three ques­tions, and keep­ing them sep­a­rate.

For #1 (“Is this just noise?”) peo­ple ap­par­ently do a pretty good job of pre­dict­ing which stud­ies will repli­cate. Rele­vant fac­tors in­clude:

1a. How strong is the em­piri­cal re­sult (tiny p value, large sam­ple size, pre­cise es­ti­mate of effect size, etc.).

1b. How plau­si­ble is this effect on pri­ors? In­clud­ing: How big an effect size would you ex­pect on pri­ors? And: How defini­tively does the re­searchers’ the­ory pre­dict this par­tic­u­lar em­piri­cal re­sult?

1c. Ex­per­i­menter de­grees of free­dom /​ gar­den of fork­ing paths /​ pos­si­bil­ity of p-hack­ing. Pr­ereg­is­tra­tion is best, visi­ble signs of p-hack­ing are worst.

1d. How filtered is this ev­i­dence? How much pub­li­ca­tion bias?

1e. How much do I trust the re­searchers about things like (c) and (d)?

I've found that this post, on how to think about whether a replication study "failed", has also helped clarify my thinking about whether a study is likely to replicate.

If there are many stud­ies of es­sen­tially the same phe­nomenon, then try to find the method­olog­i­cally strongest few and fo­cus mainly on those. (Rather than pick­ing one study at ran­dom and dis­miss­ing the whole area of re­search if that study is bad, or as­sum­ing that just be­cause there are lots of stud­ies they must add up to solid ev­i­dence.)

If you care about effect size, it’s also worth keep­ing in mind that the things which turn noise into “statis­ti­cally sig­nifi­cant re­sults” also tend to in­flate effect sizes.
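As a concrete illustration of that inflation, here is a minimal simulation sketch of the "significance filter" (the true effect of d = 0.2, the per-group n of 30, and the number of simulated studies are illustrative assumptions, not values from the thread): among the simulated studies that happen to reach p < 0.05, the average estimated effect comes out far larger than the true one.

```python
# Minimal simulation of the "significance filter": among studies that reach
# p < 0.05, the average estimated effect size is much larger than the true one.
# The true effect (d = 0.2) and per-group n (30) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n, n_studies = 0.2, 30, 10_000

significant_estimates = []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    t, p = stats.ttest_ind(treatment, control)
    # Pooled-SD estimate of Cohen's d for this simulated study
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    d_hat = (treatment.mean() - control.mean()) / pooled_sd
    if p < 0.05:
        significant_estimates.append(d_hat)

print(f"True d: {true_d}")
print(f"Share of studies significant: {len(significant_estimates) / n_studies:.2f}")
print(f"Mean |d| among significant studies: {np.mean(np.abs(significant_estimates)):.2f}")
```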

For #2 (“Is there any­thing in­ter­est­ing go­ing on here?”), un­der­stand­ing method­ol­ogy & statis­tics is pretty cen­tral. Partly that’s back­ground knowl­edge & ex­per­tise that you keep build­ing up over the years, partly that’s tak­ing the time & effort to sort out what’s go­ing on in this study (if you care about this study and can’t sort it out quickly), some­times you can find other writ­ings which com­ment on the method­ol­ogy of this study which can help a lot. You can try googling for crit­i­cisms of this par­tic­u­lar study or line of re­search (or check google scholar for pa­pers that have cited it), or google for crit­i­cisms of spe­cific meth­ods they used. It is of­ten eas­ier to rec­og­nize when some­one makes a good ar­gu­ment than to come up with that ar­gu­ment your­self.

One fram­ing that helps me think about a study’s method­ol­ogy (and whether or not there’s any­thing in­ter­est­ing go­ing on here) is to try to flesh out “null hy­poth­e­sis world”: in the world where noth­ing in­ter­est­ing is go­ing on, what would I ex­pect to see come out of this ex­per­i­men­tal pro­cess? Some­times I’ll come up with more than one world that feels like a null hy­poth­e­sis world. Ex­er­cise: try that with this study (Egan, San­tos, Bloom 2007). Another ex­er­cise: Try that with the hot hand effect.
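As one way to run that exercise for the hot hand, here is a small "null hypothesis world" simulation sketch (the 50% hit rate, sequence length, and number of simulated shooters are illustrative assumptions): shooters with no hot hand at all, where we compute the statistic a naive analysis would look at. The downward bias it exposes is the small-sample conditioning issue that the hot-hand reanalyses hinge on.

```python
# One way to flesh out the "null hypothesis world" for the hot hand:
# simulate shooters with NO hot hand (i.i.d. 50% shots) and look at the
# statistic a naive analysis would compute -- the within-sequence proportion
# of hits immediately following a hit. Sequence length (20) and the number
# of simulated shooters (100,000) are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_shots, n_shooters, p_hit = 20, 100_000, 0.5

estimates = []
for _ in range(n_shooters):
    shots = rng.random(n_shots) < p_hit
    after_hit = shots[1:][shots[:-1]]   # outcomes that immediately follow a hit
    if after_hit.size > 0:              # skip shooters with no hits to condition on
        estimates.append(after_hit.mean())

print(f"True hit rate: {p_hit}")
print(f"Average within-sequence P(hit | previous hit): {np.mean(estimates):.3f}")
# Even in a world with no hot hand, this average comes out below 0.5, so
# "P(hit after hit) is no higher than the base rate" is weaker evidence
# against streak shooting than it first appears.
```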

#3 ("What is going on here?") is the biggest/broadest question of the three. It's the one that I spend the most time on (at least if the study is any good), and it's the one that I could most easily write a whole bunch about (making lots of points and elaborating on them). But it's also the one that is the most distant from Eli's original question, and I don't want to turn this post into a big huge essay, so I'll just highlight a few things here.

A big part of the challenge is think­ing for your­self about what’s go­ing on and not be­ing too an­chored on how things are de­scribed by the au­thors (or the press re­lease or the per­son who told you about the study). Some moves here:

3a. Imag­ine (us­ing your in­ner sim) be­ing a par­ti­ci­pant in the study, such that you can pic­ture what each part of the study was like. In par­tic­u­lar, be sure that you un­der­stand ev­ery ex­per­i­men­tal ma­nipu­la­tion and mea­sure­ment in con­crete terms (okay, so then they filled out this ques­tion­naire which asked if you agree with state­ments like such-and-such and blah-blah-blah).

3b. Be sure you can clearly state the pat­tern of re­sults of the main find­ing, in a con­crete way which is not laden with the au­thors’ the­ory (e.g. not “this group was de­pleted” but “this group gave up on the puz­zles sooner”). You need this plus 3a to un­der­stand what hap­pened in the study, then from there you’re try­ing to draw in­fer­ences about what the study im­plies.

3c. Come up with (one or sev­eral) pos­si­ble mod­els/​the­o­ries about what could be hap­pen­ing in this study. Espe­cially look for ones that seem com­mon­sen­si­cal /​ that are based in how you’d in­ner sim your­self or other peo­ple in the ex­per­i­men­tal sce­nario. It’s fine if you have a model that doesn’t make a crisp pre­dic­tion, or if you have a the­ory that seems a lot like the au­thors’ the­ory (but with­out their jar­gon). Ex­er­cise: try that with a typ­i­cal willpower de­ple­tion study.

3d. Have in mind the key take­away of the study (e.g., the one sen­tence sum­mary that you would tell a friend; this is the thing that’s the main rea­son why you’re in­ter­ested in read­ing the study). Poke at that sen­tence to see if you un­der­stand what each piece of it means. As you’re look­ing at the study, see if that key take­away ac­tu­ally holds up. e.g., Does the main pat­tern of re­sults match this take­away or do they not quite match up? Does the study dis­t­in­guish the var­i­ous mod­els that you’ve come up with well enough to strongly sup­port this main take­away? Can you edit the take­away claim to make it more pre­cise /​ to more clearly re­flect what hap­pened in the study /​ to make the speci­fics of the study un­sur­pris­ing to some­one who heard the take­away? What sort of re­search would it take to provide re­ally strong sup­port for that take­away, and how does the study at hand com­pare to that?

3e. Look for con­crete points of refer­ence out­side of this study which re­sem­ble the sort of thing the re­searchers are talk­ing about. Search in par­tic­u­lar for ones that seem out-of-sync with this study. e.g., This study says not to tell other peo­ple your goals, but the other day I told Alex about some­thing I wanted to do and that seemed use­ful; do the speci­fics of this ex­per­i­ment change my sense of whether that con­ver­sa­tion with Alex was a good idea?

Some nar­rower points which don’t neatly fit into my 3-cat­e­gory break­down:

A. If you care about effect sizes then con­sider do­ing a Fermi es­ti­mate, or oth­er­wise trans­lat­ing the effect size into num­bers that are in­tu­itively mean­ingful to you. Also think about the range of pos­si­ble effect sizes rather than just the point es­ti­mate, and re­mem­ber that the is­sues with noise in #1 also in­flate effect size.

B. If the paper finds a null effect and claims that it's meaningful (e.g., that the intervention didn't help), then you do care about effect sizes. (E.g., if it claims the intervention failed because it had no effect on mortality rates, then you might assume a value of $10M per life and try to calculate a 95% confidence interval on the value of the intervention based solely on its effect on mortality.)

C. New papers that claim to debunk an old finding are often right when they claim that the old finding has issues with #1 (it didn't replicate) or #2 (it had methodological flaws), but are rarely actually debunkings if they claim that the old finding has issues with #3 (it misdescribes what's really going on). The new study on #3 might be important and cause you to change your thinking in some ways, but it's generally an incremental update rather than a debunking. Examples that look to me like successful debunkings: behavioral social priming research (#1), the Dennis-dentist effect (#2), the hot hand fallacy (#2 and some of B), the Stanford Prison Experiment (closest to #2), various other things that didn't replicate (#1). Examples of alleged "debunkings" which seem like interesting but overhyped incremental research: the bystander effect (#3), loss aversion (this study) (#3), the endowment effect (#3).

• Often I want to form a quick impression as to whether it is worth me analysing a given paper in more detail. A couple of quick calculations can go a long way. Some of this will be obvious, but I've tried to give the approximate thresholds for the results which up until now I've been using subconsciously. I'd be very interested to hear other people's thresholds. (A short code sketch of these calculations follows the list below.)

## Calculations

• Calculate how many p-values (could) have been calculated.
  • If the study and analysis techniques were pre-registered, then count how many p-values were calculated.
  • If the study was not pre-registered, calculate how many different p-values could have been calculated (had the data looked different) which would have been equally justified as the ones that they did calculate (see Gelman's garden of forking paths). This depends on how aggressive any hacking has been, but roughly speaking I'd calculate: number of input variables (including interactions) × number of measurement variables.
• Calculate the expected number of type I errors.
  • Multiply the answer from the previous step by the threshold p-value of the paper.
  • Different results may have different thresholds, which makes life a little more complicated.
• Estimate Cohen's d for the experiment (without looking at the actual result!).
  • One option in estimating effect size is to not consider the specific intervention, but just to estimate how easy the target variable is to move for any intervention – see putanumonit for a more detailed explanation. I wouldn't completely throw away my prior on how effective the particular intervention in question is, but I do consider it helpful advice to not let my prior act too powerfully.
• Calculate experimental power.
  • You can calculate this properly, but alternatively you can use Lehr's formula. Sample size equations for different underlying distributions can be found here.
  • To get power > 0.8 we require a sample size per group of roughly 16 / d². This is based on α = 0.05, a single p-value calculated, 2 samples of equal size, and a 2-tailed t-test.
  • A modification to this rule to account for multiple p-values would be to add 3.25 to the numerator for each doubling of the number of p-values calculated previously.
  • If sample sizes are very unequal (ratio of >10) then the number required in the smaller sample is the above calculation divided by 2. This also works for single-sample tests against a fixed value.
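A minimal sketch of the two quick checks described above, assuming Lehr's rule of thumb as stated (the function names and the example numbers at the bottom are illustrative, not from the original answer):

```python
# Sketch of the two quick checks above: expected type I errors from the
# (possible) p-value count, and Lehr's approximation for the per-group sample
# size needed for ~80% power. Function names and example numbers are illustrative.
from math import log2

def expected_type_i_errors(n_possible_p_values: int, alpha: float = 0.05) -> float:
    """Expected number of false positives if every (possible) test is run at alpha."""
    return n_possible_p_values * alpha

def lehr_n_per_group(cohens_d: float, n_possible_p_values: int = 1) -> float:
    """Lehr's rule of thumb: n per group ~= 16 / d^2 for 80% power at alpha = 0.05,
    two equal groups, two-tailed t-test. Per the modification above, add 3.25 to
    the numerator for each doubling of the number of p-values calculated."""
    numerator = 16 + 3.25 * log2(max(n_possible_p_values, 1))
    return numerator / cohens_d ** 2

# Example: a non-preregistered study with 8 input variables and 2 outcome
# measures could have computed ~16 p-values; suppose we expect d ~= 0.3.
n_p = 8 * 2
print(f"Expected type I errors: {expected_type_i_errors(n_p):.2f}")  # 0.80 -> write off
print(f"Required n per group:  {lehr_n_per_group(0.3, n_p):.0f}")    # ~320 per group
```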
## Thresholds

Roughly speaking, if the expected number of type I errors is above 0.25 I'll write the study off; between 0.05 and 0.25 I'll be suspicious. If multiple significant p-values are found this gets a bit tricky due to non-independence of the p-values, so more investigation may be required.

If the sample size is sufficient for power > 0.8 then I'm happy. If it comes out below then I'm suspicious and have to check whether my estimation for Cohen's d is reasonable. If I'm still convinced N is a long way from being large enough I'll write the study off. Obviously, as the paper has been published, the calculated Cohen's d is large enough to get a significant result, but the question is: do I believe that the effect size calculated is reasonable?

## Test

I tried Lehr's formula on the 80,000 Hours replication quiz. Of the 21 replications, my calculation gave a decisive answer in 17 papers, getting them all correct – 9 studies with comfortably oversized samples replicated successfully, and 8 studies with massively undersized samples (less than half the required sample size I calculated) failed to replicate. Of the remaining 4, where the sample sizes were 0.5–1.2x my estimate from Lehr's equation, all successfully replicated. (I remembered the answer to most of the replications but tried my hardest to ignore this when estimating Cohen's d.) Just having a fixed minimum N wouldn't have worked nearly as well – of the 5 smallest studies only 1 failed to replicate.

• I just came across an example of this which might be helpful: Good grades and a desk 'key for university hopes' (BBC News). Essentially, getting good grades and having a desk in your room are apparently good predictors of whether you want to go to university or not. The former seemed sensible; the latter seemed like it shouldn't have a big effect size, but I wanted to give it a chance. The paper itself is here.

Just from the abstract you can tell there are at least 8 input variables, so the numerator on Lehr's equation becomes ~26. This means a Cohen's d of 0.1 (which I feel is pretty generous for having a desk in your room) would require 2,600 results in each sample. As the samples are unlikely to be of equal size, I would estimate they would need a total of ~10,000 samples for this to have any chance of finding a meaningful result for smaller effect sizes. The actual number of samples was ~1,000. At this point I would normally write off the study without bothering to go deeper, the process taking less than 5 minutes.

I was curious to see how they managed to get multiple significant results despite the sample size limitations. It turns out that they decided against reporting p-values because "we could no longer assume randomness of the sample". Instead they report the odds ratio of each result and said that anything with a large ratio had an effect, ignoring any uncertainty in the results. It turns out there were only 108 students in the no-desk sample.
Definitely what Andrew Gelman calls a kangaroo measurement. There are a lot of other problems with the paper, but just looking at the sample size (even though the sample size was ~1,000) was a helpful check to confidently reject the paper with minimal effort.

• Additional thoughts:

1. For reasonable assumptions, if you're studying an interaction then you might need 16x larger samples—see Gelman. Essentially, standard error is double for interactions, and Andrew thinks that interaction effects being half the size of main effects is a good starting point for estimates, giving 16 times larger samples.

2. When estimating Cohen's d, it is important that you know whether the study is between- or within-subjects—within-subject studies will give much lower standard error and thus require much smaller samples. Again, Gelman discusses this.

• 1. For health-related research, one of the main failure modes I've observed when people I know try to do this is tunnel vision and a lack of priors about what's common and relevant. Reading raw research papers before you've read broad-overview stuff will make this worse, so read UpToDate first and Wikipedia second. If you must read raw research papers, find them with PubMed, but do this only rarely and only with a specific question in mind.

2. Before looking at the study itself, check how you got there. If you arrived via a search engine query that asked a question or posed a topic without presupposing an answer, that's good; if there are multiple studies that say different things, you've sampled one of them at random. If you arrived via a query that asked for confirmation of a hypothesis, that's bad; if there are multiple studies that said different things, you've sampled in a way that was biased towards that hypothesis. If you arrived via a news article, that's the worst; if there are multiple studies that said different things, you sampled in a way that was biased opposite reality.

3. Don't bother with studies in rodents, animals smaller than rodents, cell cultures, or undergraduate psychology students. These studies are done in great numbers because they are cheap, but they have low average quality. The fact that they are so numerous makes the search-sampling problems in (2) more severe.

4. Think about what a sensible endpoint or metric would be before you look at what endpoint/metric was reported. If the reported metric is not the metric you expected, this will often be because the relevant metric was terrible. Classic examples are papers about battery technologies reporting power rather than capacity, and biomedical papers reporting effects on biomarkers rather than symptoms or mortality.

5. Correctly controlling for confounders is much, much harder than people typically give it credit for. Adding extra things to the list of things controlled for can create spurious correlations, and study authors are not incentivized to handle this correctly. The practical upshot is that observational studies only count if the effect size is very large.

• One tactic I like to use is asking "how do they know this?", and asking myself or investigating whether it's possible for their answer to demonstrate the thing they're claiming. A lot of work doesn't tell you.
Those aren't necessarily wrong, because they might have a good answer they're not incentivized to share, but at a minimum it's going to make it hard to learn from the work.

A lot of work claims to tell you, but when you look they are lying. For example, when I investigated the claim that humans could do 4 hours of thought-work per day, I looked up the paper's citations, and found they referred to experiments of busy work. Even if those studies were valid, they couldn't possibly prove anything about thought-work. I consider "pretending to have sources and reasons" a worse sin than "not giving a source or reason".

More ambiguously, I spent a lot of time trying to figure out how much we could tell, and at what resolution, from ice core data. I still don't have a great answer on this for the time period I was interested in. But I learned enough to know that the book I was reading (The Fate of Rome) was presenting the data as more clear-cut than it was.

On the other end, The Fall of Rome spends a lot of time explaining why pottery is useful in establishing the economic and especially trade status of an area/era. This was pretty hard to verify from external sources because it's original research from the author, but it absolutely makes sense and produces a lot of claims and predictions that could be disproved. Moreover, none of the criticism I found of The Fall of Rome addressed his points on pottery – no one was saying "well I looked at Roman pottery and think the quality stayed constant through the 600s".

• Thanks. This point in particular sticks with me: "I consider 'pretending to have sources and reasons' a worse sin than 'not giving a source or reason'." I notice that one of the things that tips me off that a scientist is good is if her/his work demonstrates curiosity. Do they seem like they're actually trying to figure out the answer? Do they think through and address counterarguments, or just try to obscure those counterarguments? This seems related: a person who puts no source might still be sharing their actual belief, but a person who puts a fake source seems like they're trying to sound legitimate.

• Yes, this seems like a good guideline, although I can't immediately formalize how I detect curiosity. Vague list of things this made me think of:

• I think this is a better guideline for books than scientific articles, which are heavily constrained by academic social and funding norms.

• One good sign is if *I* feel curious in a concrete way when I read the book. What I mean by concrete is...

• E.g. Fate of Rome had a ton of very specific claims about how climate worked and how historical climate conditions could be known. I spent a lot of time trying to verify these, and even though I ultimately found them insufficiently supported, there was a concreteness that I still give positive marks for.

• In contrast, in my most recently written epistemic spot check (not yet published), I spent a long time on several claims along the lines of "Pre-industrial Britain had a more favorable legal climate for entrepreneurship than continental Europe". I don't recall the author giving any specifics on what he meant by "more favorable", nor how he determined it was true. Investigating felt like a slog because I wasn't even sure what I was looking for.
• I worry I'm being unfair here, because maybe if I'd found lots of other useful sources I'd be rating the original book better. But when I investigated I found there wasn't even a consensus on whether Britain had a strong or weak patent system.

• Moralizing around conclusions tends to inhibit genuine curiosity in me, although it can loop around to spite curiosity (e.g., Carol Dweck).

• Already many good answers, but I want to reinforce some and add others.

1. Beware of multiplicity—does the experiment include a large number of hypotheses, explicitly or implicitly? Implicit hypotheses include "Does the intervention have an effect on subjects with attributes A, B or C?" (subgroups) and "Does the intervention have an effect that is shown by measuring X, Y or Z?" (multiple endpoints). If multiple hypotheses were tested, were the results for each diligently reported? Note that multiplicity can be sneaky and you're often looking for what was left unsaid, such as a lack of a plausible mechanism for the reported effect. For example, take the experimental result "Male subjects who regularly consume Vitamin B in a non-multi-vitamin form have a greater risk of developing lung cancer (irrespective of dose)." Did they *intentionally* hypothesize that vitamin B would increase the likelihood of cancer, but only if 1) it was not consumed as part of a multivitamin and 2) in a manner that was not dose-dependent? Unlikely! The real conclusion of this study should have been "Vitamin B consumption does not appear correlated to lung cancer risk. Some specific subgroups did appear to have a heightened risk, but this may be a statistical anomaly."

2. Beware of small effect sizes and look for clinical significance—does the reported effect sound like something that matters? Consider the endpoint (e.g. change in symptoms of depression, as measured by the Hamilton Depression Rating Scale) and the effect size (e.g. d = 0.3, which is generally interpreted as a small effect). As a depressive person, I don't really care about a drug that has a small effect size.* I don't care if the effect is real but small or not real at all, because I'm not going to bother with that intervention. The "should I care" question cuts through a lot of the bullshit, binary thinking and the difficulty in interpreting small effect sizes (given their noisiness).

3. Beware of large effect sizes—lots of underpowered studies + publication bias = lots of inflated effect sizes reported. Andrew Gelman's "Type M" (magnitude) errors are a good way to look at this—an estimate of how inflated the effect size is likely to be. However, this isn't too helpful unless you're ready to bust out R when reading research. Alternately, a good rule of thumb is to be skeptical of 1) large effect sizes reported from small-N studies and 2) confidence intervals wide enough to drive a truck through.

4. Beware of low prior odds—is this finding in a highly exploratory field of research, and itself rather extraordinary? IMO this is an under-considered conclusion of Ioannidis' famous "Why Most Published Research Findings are False" paper. This Shinyapp nicely illustrates "positive predictive value" (PPV), which takes into account bias & prior odds. (A small sketch of the basic PPV calculation appears after this answer.)

5. Consider study design—obviously look for placebo control, randomization, blinding etc.
But also look for repeated-measures designs, e.g. "crossover" designs. Crossover designs achieve far higher power with fewer participants. If you're eyeballing study power, keep this in mind.

6. Avoid inconsistent skepticism—for one, don't be too skeptical of research just because of its funding source. All researchers are biased. It's small potatoes $-wise compared to a Pfizer, but postdoc Bob's career/identity is on the line if he doesn't publish. Pfizer may have $3 billion on the line for their Phase III clinical trial, but if Bob can't make a name for himself, he's lost a decade of his life and his career prospects. Then take Professor Susan, who built her career on Effect X being real—what were those last 30 years for, if Effect X was just an anomaly?

In­stead, look at 1) the qual­ity of the study de­sign, 2) the qual­ity and trans­parency of the re­port­ing (in­clud­ing COI dis­clo­sures, pre­reg­is­tra­tions, the de­tail and or­ga­ni­za­tion in said pre­reg­is­tra­tions, etc).

7. Learn to love meta-anal­y­sis—Where pos­si­ble, look at meta-analy­ses rather than in­di­vi­d­ual stud­ies. But be­ware: meta-analy­ses can suffer their own de­sign flaws, lead­ing to some peo­ple say­ing “lies, damn lies and meta-anal­y­sis.” Cochrane is the gold stan­dard. If they have a meta-anal­y­sis for the ques­tion at hand, you’re in luck. Also, check out the GRADE crite­ria—a prag­matic frame­work for eval­u­at­ing the qual­ity of re­search used by Cochrane and oth­ers.

*un­less there is high het­ero­gene­ity in the effect amongst a sub­group with whom I share at­tributes, which is why sub­group­ing is both haz­ardous and yet still im­por­tant.
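For point 4 above, here is a minimal sketch of the positive predictive value calculation (the no-bias case of the formula from Ioannidis's paper); the function name and the example numbers are illustrative assumptions, not values from the answer.

```python
# Minimal positive-predictive-value calculation referenced in point 4 above
# (the no-bias case of Ioannidis's formula): the probability that a
# "significant" finding is true, given the prior odds that the tested
# relationship is real, the study's power, and its alpha.

def positive_predictive_value(prior_odds: float, power: float, alpha: float = 0.05) -> float:
    """P(relationship is real | significant result), ignoring bias."""
    true_positives = power * prior_odds   # true relationships that reach significance
    false_positives = alpha               # null relationships that reach significance
    return true_positives / (true_positives + false_positives)

# Exploratory field, long-shot hypothesis (1:20 prior odds), underpowered study:
print(f"{positive_predictive_value(prior_odds=1/20, power=0.35):.2f}")  # ~0.26
# Confirmatory setting (1:1 prior odds), well-powered study:
print(f"{positive_predictive_value(prior_odds=1.0, power=0.80):.2f}")   # ~0.94
```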

• On bias see here https://www.bmj.com/content/335/7631/1202 and references. There is a lot of research about this. Note also that you do not even need to bias a particular researcher, just fund the researchers producing the answers you like, or pursuing the avenues you are interested in, e.g. Coke's sponsorship of exercise research, which produces papers suggesting that perhaps exercise is the answer.

One should not sim­ply dis­miss a study be­cause of spon­sor­ship, but be aware of what might be go­ing on be­hind the scenes. And also be aware that peo­ple are oblivi­ous to the effect that spon­sor­ship has on them. One study of pri­mary care doc­tors found a large effect on pre­scribing from free courses, din­ners, etc, but the doc­tors adamantly de­nied any im­pact.

The sug­ges­tions of things to look for are valid and use­ful but of­ten you just don’t know what ac­tu­ally hap­pened.

• Here’s an an­swer for con­densed mat­ter physics:

Step 1: Read the ti­tle, jour­nal name, au­thor list, and af­fili­a­tions.

By read­ing pa­pers in a field, talk­ing to peo­ple in the field, and gen­er­ally keep­ing track of the field as a so­cial en­ter­prise, you should be able to place pa­pers in a con­text even be­fore read­ing them. Peo­ple ab­solutely have rep­u­ta­tions, and that should in­form your pri­ors. You should also have an un­der­stand­ing of what the typ­i­cal re­search meth­ods are to an­swer a cer­tain ques­tion—check ei­ther the ti­tle or the ab­stract to make sure that the meth­ods used match the prob­lem.

Ac­tu­ally, you know what?

Step 0: Spend years reading papers and keeping track of people to develop an understanding of trust and reputation as various results either pan out or don't. Read a few textbooks to understand the physical basis of the commonly-used experimental and theoretical techniques, then check that understanding by reading more papers and keeping track of what kind of data quality is the standard in the field, how techniques are best applied, and which techniques and methods of analysis provide the most reliable results.

For ex­am­ple, by com­bin­ing steps 0 and 1, you can un­der­stand that cer­tain ex­per­i­men­tal tech­niques might be more difficult and eas­ier to fool your­self with, but might be the best method available for an­swer­ing some spe­cific ques­tion. If you see a pa­per ap­ply­ing this tech­nique to this sort of ques­tion, this ac­tu­ally should in­crease your con­fi­dence in the pa­per rel­a­tive to the base rate for this tech­nique, be­cause it shows that the au­thors are ex­er­cis­ing good judg­ment. Next...

Step 2: Read the ab­stract and look at the figures.

This is good for un­der­stand­ing the pa­per too, not just eval­u­at­ing trust­wor­thi­ness. Look for data qual­ity (re­mem­ber that you learned how to judge the data qual­ity of the most com­mon tech­niques in step 0) and whether they’ve pre­sented it in a way that clearly backs up the core claims of the ab­stract, or pre­sents the in­for­ma­tion you’re try­ing to learn from the pa­per. Data that is merely sug­ges­tive of the au­thors’ claims is ac­tu­ally a red flag, be­cause re­mem­ber, ev­ery­one just pre­sents the nicest figure they can. Re­spon­si­ble sci­en­tists re­duce their claims when the ev­i­dence is weak.

If you have spe­cific parts you know you care about, you can usu­ally just read those in de­tail and skim the rest. But if you re­ally care about as­sess­ing this par­tic­u­lar pa­per, check the pro­ce­dures and com­pare it to your knowl­edge of how this sort of work should go. If there are spe­cific parts that you want to check your­self, and you can do so, do so. This is also use­ful so you can...

Step 4: Com­pare it to similar pa­pers.

You should have back­ground knowl­edge, but it’s also use­ful to keep similar pa­pers (both in terms of what meth­ods they used, and what prob­lem they stud­ied) di­rectly on hand if you want to check some­thing. If you know a pa­per that did a similar thing, use that to check their meth­ods. Find some pa­pers on the same prob­lem and cross-check how they pre­sent the de­tails of the prob­lem and the plau­si­bil­ity of var­i­ous an­swers, to get a feel for the con­sen­sus. Speak­ing of con­sen­sus, if there are two similar pa­pers from way in the past that you found via Google Scholar and one of them has 10x the cita­tions of the other, take that into ac­count. When you no­tice con­fus­ing state­ments, you can check those similar pa­pers to see how they han­dled it. But once you’re re­ally get­ting into the de­tails, you’ll have to...

Step 5: Fol­low up cita­tions for things you don’t un­der­stand or want to check.

If some­one is us­ing a con­fus­ing method or ex­pla­na­tion, there should be a nearby cita­tion. If not, that’s a red flag. Find the cita­tion and check whether it sup­ports the claim in the origi­nal pa­per (re­curs­ing if nec­es­sary). Ac­cept that this will re­quire lots of work and think­ing, but hey, at least this feeds back into step 0 so you don’t have to do it as much next time.

There are smart peo­ple out there. Hope­fully you know some, so that if some­thing seems sur­pris­ing and difficult to un­der­stand, you can ask them what they think about it.

• if there are two similar pa­pers from way in the past that you found via Google Scholar and one of them has 10x the cita­tions of the other, take that into ac­count.

This seems great for figur­ing out the con­sen­sus in a field, but not for iden­ti­fy­ing when the con­sen­sus is wrong.

• Sample size is related to how big an effect size you should be surprised by, i.e. power. Big effect sizes in smaller populations = less surprising. Why is there no overall rule of thumb? Because it gets modified a bunch by the base rate of what you're looking at and some other stuff I'm not remembering off the top of my head.

In gen­eral I’d say there’s enough method­olog­i­cal di­ver­sity that there’s a lot of stuff I’m look­ing for as flags that a study wasn’t de­signed well. For ex­am­ples of such you can look at the in­clu­sion crite­ria for meta-analy­ses.

There’s also more qual­i­ta­tive things about how much I’m ex­trap­o­lat­ing based on the dis­cus­sion sec­tion by the study au­thors. In the longevity posts for ex­am­ple, I laud a study for hav­ing a dis­cus­sion sec­tion where the au­thors ex­plic­itly spend a great deal of time talk­ing about what sorts of things are *not* rea­son­able to con­clude from the study even though they might be sug­ges­tive for fur­ther re­search di­rec­tions.

Confounds are kinda like building a key word map. I'm looking at the most well regarded studies in a domain, noting down what they're controlling for, then discounting studies that aren't controlling for them to varying degrees. This is another place where qualitative judgements creep in, even in Cochrane reviews, where they are forced to just develop ad hoc 'tiers' of evidence (like A, B, C, etc.) and give some guidelines for doing so.

I have higher skep­ti­cism in gen­eral than I did years ago as I have learned about the num­ber of ways that effects can sneak into the data de­spite hon­est in­ten­tion by mod­er­ately com­pe­tent sci­en­tists. I’m also much more aware of a fun­da­men­tal prob­lem with se­lec­tion effects in that any­one run­ning a study has some vested in­ter­est in fram­ing hy­pothe­ses in var­i­ous ways be­cause no­body de­votes them­selves to some­thing about which they’re com­pletely dis­in­ter­ested. This shows up as a prob­lem in your own eval­u­a­tion in that it’s al­most im­pos­si­ble to not sneak in iso­lated de­mands for rigor based on pri­ors.

I'm also generally reading over the shoulder of whichever other study reviewers seem to be doing a good job in a domain. Epistemics is a team sport. An example of this is when Scott did a roundup of evidence for low-carb diets, mentioning lots of other people doing meta-reviews and speculating about why different conclusions were reached, e.g. Luke Muehlhauser and I came down on the side that the VLC evidence seemed weak while Will Eden came down on the side that it seemed more robust, seemingly differing on how much weight we placed on inside-view metabolic models vs. outside-view long-term studies.

That’s a hot take. It can be hard to just dump top level heuris­tics vs see­ing what comes up from more spe­cific ques­tions/​dis­cus­sion.

• Re­ca­pitu­lat­ing some­thing I’ve writ­ten about be­fore:

You should first make a se­ri­ous effort to for­mu­late both the spe­cific ques­tion you want an­swered, and why you want an an­swer. It may turn out sur­pris­ingly of­ten that you don’t need to do all this work to eval­u­ate the study.

Short of be­com­ing an ex­pert your­self, your best bet is then to learn how to talk to peo­ple in the field un­til you can un­der­stand what they think about the pa­per and why—and also how they think and talk about these things. This is roughly what Harry Col­lins calls “in­ter­ac­tional” ex­per­tise. (He takes grav­i­ta­tional-wave sci­en­tist Joe We­ber’s late work as an es­pe­cially vivid ex­am­ple: “I can promise such lay read­ers that if they teach them­selves a bit of el­e­men­tary statis­tics and per­se­vere with read­ing the pa­per, they will find it ut­terly con­vinc­ing. Scien­tific pa­pers are writ­ten to be ut­terly con­vinc­ing; over the cen­turies their spe­cial lan­guage and style has been de­vel­oped to make them read con­vinc­ingly.… The only way to know that We­ber’s pa­per is not to be read in the way it is writ­ten is to be a mem­ber of the ‘oral cul­ture’ of the rele­vant spe­cial­ist com­mu­nity.” The full pas­sage is very good.)

If you only learn from pa­pers (or even text­books and pa­pers), you won’t have any idea what you’re miss­ing. A lot of ex­per­tise is bound up in in­di­vi­d­ual tacit knowl­edge and group dy­nam­ics that never get writ­ten down. This isn’t to say that the ‘oral cul­ture’ is always right, but if you don’t have a good grasp of it, you will make at best slow progress as an out­sider.

This is the main thing hold­ing me back from run­ning the course I’ve half-writ­ten on layper­son eval­u­a­tion of sci­ence. Most of the time, the best thing is just to talk to peo­ple. (Cold emails are OK; be po­lite, con­cise, and ask a spe­cific ques­tion. Grad stu­dents tend to be gen­er­ous with their time if you have an in­ter­est­ing ques­tion or pizza and beer. And I’m glad to an­swer physics ques­tions by LW mes­sage.)

Short of talk­ing to peo­ple, you can of­ten find blogs in the field of in­ter­est. More rarely, you can also find good jour­nal­ism do­ing the above kind of work for you. (Quanta is typ­i­cally good in physics, enough so that I more or less trust them on other sub­jects.)

There’s plenty to be said about pri­mary source eval­u­a­tion, which varies with field and which the other an­swers so far get at, but I think this les­son needs to come first.

• If a psychology study doesn't prominently say who its subjects were, the answer is "undergrads at the university, predominantly those in psychology classes" and it is worthless.

• I mean, lots of phe­nom­ena are likely to still be pre­sent in un­der­grad­u­ate psy­chol­ogy stu­dents, so it seems weird to say that the re­sults are go­ing to be worth­less. Seems to me like it de­pends on the do­main on how much you ex­pect re­sults to gen­er­al­ize from that pop­u­la­tion to oth­ers.

• Already par­tially men­tioned by oth­ers, in­clud­ing OP.

I usu­ally start with com­par­ing the con­clu­sion with my ex­pec­ta­tions (I’m painfully aware that this cre­ates a con­fir­ma­tion bias, but what else am I sup­posed to com­pare it with). If they are suffi­ciently differ­ent I try to imag­ine how, us­ing the method de­scribed by the au­thors, I would be able to get a pos­i­tive re­sult to their ex­per­i­ment con­di­tional on my pri­ors be­ing true, i.e. their con­clu­sion be­ing false. This is ba­si­cally the same as try­ing to figure out how I would run the ex­per­i­ment and which data would dis­prove my as­sump­tions, and then see­ing if the pub­lished re­sults fall in that cat­e­gory.

Usually the buck stops there; most published research uses methods that are sufficiently flimsy that (again, conditional on my priors) it is very likely the result was a fluke. This approach is pretty much the same as your third bullet point, and also waveman's point number 5. I would like to stress, though, that it's almost never enough to have a checklist of "common flaws in method sections" (although again, you have to start somewhere). Unfortunately, different strengths and types of results in different fields require different methods.

A small Bayesian twist on the interpretation of this approach: when you're handed a paper (that doesn't match your expectations), that is evidence of something. I'm specifically looking at the chance that, conditional on my priors being accurate, the paper I'm given would still have been published.
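A minimal sketch of that update in odds form, with all numbers made up purely for illustration (the function name and likelihood-ratio values are assumptions, not from the comment):

```python
# A made-up-numbers sketch of the "Bayesian twist" above, in odds form:
# how much should seeing this paper move me, given how likely such a paper
# is to get published in the world where my prior is right vs. the world
# where the paper's conclusion is right?

def posterior_prob(prior_prob: float, likelihood_ratio: float) -> float:
    """Update P(my prior is right) after seeing the paper.
    likelihood_ratio = P(paper published | my prior right) / P(paper published | paper right)."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# If flimsy methods mean such a paper is nearly as likely to appear either way
# (likelihood ratio ~0.8), a strong prior barely moves:
print(f"{posterior_prob(prior_prob=0.90, likelihood_ratio=0.8):.2f}")  # ~0.88
# If the methods are solid enough that the paper is much less likely in the
# world where I'm right (likelihood ratio ~0.1), the prior moves a lot:
print(f"{posterior_prob(prior_prob=0.90, likelihood_ratio=0.1):.2f}")  # ~0.47
```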

• Edit: Awards for the best re­sponses + re­views of an­swers HERE.

I think this is a really important question and I'm eager to see answers. I'm willing to put up $100 of my personal money as a prize for what I think is the best answer and another $50 for what I think is the best summary of multiple answers. (I'm willing to combine these if the best answer includes a summary of other answers.)

This isn't a proper impact certificate, but if it were, I might be awarding this prize at 5% or less of the true value of the impact. So in offering $100, I'm saying the real impact could be worth like $2,000 or more in my mind if it's a really good answer.

As­sum­ing Eli is okay with this, I’ll eval­u­ate in two weeks, end­ing Novem­ber 13 at 12:00 AM, and pledge to award within three weeks (for each day I’m late, I’ll in­crease the prize amounts by 5% com­pound­ing).

A thing I would be in­ter­ested in here is also peo­ple men­tion­ing how they gained their abil­ity to as­sess pa­pers, e.g. “I can do this be­cause I have a statis­tics de­gree” or “I can do this be­cause of my do­main ex­per­tise” and fur­ther bonus points on list­ing re­sources peo­ple could use to en­hance their abil­ity to as­sess re­search.

• As­sum­ing Eli is okay with this

This sounds cool to me!

• ## Awards for the Best Answers

When this question was posted a month ago, I liked it so much that I offered $100 of my own money for what I judged to be the best answer and another $50 for the best distillation. Here's what I think:

Overall prize for best answer ($100): Unnamed

Additional prizes ($25): waveman, Bucky

I will reach out to these au­thors via DM to ar­range pay­ment.

No one attempted what seemed to me like a proper distillation of the other responses, so I won't be awarding the distillation prize here; however, I intend to write and publish my own distillation/synthesis of the responses soon.

Some thoughts on each of the replies:

Unnamed [winner]: This answer felt very thorough and detailed, and it feels like it's a guide I could really follow to dramatically improve my ability to assess studies. I'm assuming limitations of LW's current editor meant the formatting couldn't be nicer, but I also really like how Unnamed broke down his overall response into three main questions ("Is this just noise?", "Is there anything interesting going on here?" and "What is going on here?") and then presented further sub-questions and examples to help one assess the high-level questions.

Much as I'd like to better summarize Unnamed's response, you should really just read it all.

waveman [winner]: waveman's reply hits a solid amount of breadth in how to assess studies. I feel like his response is an easy guide I could pin up on my wall and easily step through while reading papers. What I would really like to see is this response further fleshed out with examples and resources, e.g. "read these specific papers or books on how studies get rigged." I'll note that I do have some pause with this response, since other responders contradicted at least one part of it, e.g., Kristin Lindquist saying not to worry about the funding source of a study. I'd like to see these (perhaps only surface-level) disagreements resolved. Overall though, a really solid answer that deserves its karma.

Bucky [winner]: Bucky's answer is deliciously technical. Rather than discussing high-level qualitative considerations to pay attention to (e.g. funding source, whether there have been reproductions), Bucky dives in and provides actual formulas and guidance about sample sizes, effect sizes, etc. What's more, Bucky discusses how he applied this approach to concrete studies (80k's replication quiz) and the outcome. I love the detail of the reply and it being backed up by concrete usage. I will mention that Bucky opens by saying that he uses subconscious thresholds in his assessments but is interested in discussing the levels other people use.

I do sus­pect that learn­ing to ap­ply the kinds of calcu­la­tions Bucky points at is tricky and vuln­er­a­ble to mis­taken ap­pli­ca­tion. Prob­a­bly a longer re­source/​more train­ing is needed to be able to ap­ply Bucky’s ap­proach suc­cess­fully, but his an­swer at the least sets one on the right path.

Kristin Lindquist: Kristin's answer is really very solid, but it feels like it falls short of the leading responses in terms of depth and guidance and doesn't add too much, though I do appreciate the links that were included. It's a pretty good summary, and also one of the best formatted of all the answers given. I would like to see waveman and Kristin reach agreement on the question of looking at funding sources.

jimrandomh: Jim's answer was short but added important points to the conversation that no one else had stated. I think his suggestion of ensuring you ask yourself how you ended up reading a particular study is excellent and crucial. I'm also intrigued by his response that controlling for confounds is much, much harder than people typically think. I'd very much like to see a longer essay demonstrating this.

Elizabeth: I feel like this answer solidly reminds me to think about core epistemological questions when reading a study, e.g., "how do they know this?"

Romeostevensit: this answer added a few more things to look for not included in other responses, e.g. giving more credit to authors who discuss what can't be concluded from their study. I also like his mentioning that spurious effects can sneak into the data despite the honest intentions of moderately competent scientists. My experience with data analysis supports this. I'd like to see a discussion between Romeostevensit and jimrandomh, since they both seem to have thoughts about confounds (and I further know they both have an interest in nutrition research).

Charlie Steiner: Good additional detail in this one, e.g. the instruction to compare papers to other similar papers and the general encouragement to get a sense of what methods are reasonable. This is a good answer, just not as good as the very top answers. I would like to see some concrete examples to learn from with this one. I appreciate the clarification that this response is for condensed matter physics; I'd be curious to see how researchers in other fields feel it generalizes to their domains.

whales: Good ad­vice and they could be right that a lot of key knowl­edge is tacit (in the oral tra­di­tion) and not in­cluded in pa­pers or text­books. That seems like some­thing well worth re­mem­ber­ing. I’d be rather keen to see whales’s course on layper­son eval­u­a­tion of sci­ence.

The Major: The response seems congruent with other answers but is much shorter and less detailed than them.

• It would be good to know if offering prizes like this is helpful in producing counterfactually more and better responses. So, to all those who responded with the great answers, I have a question:

How did the offer of a prize in­fluence your con­tri­bu­tion? Did it make any differ­ence? If so, how come?

• Thanks Ruby.

Good sum­mary of my an­swer; by the time I got round to writ­ing mine there were so many good qual­i­ta­tive sum­maries I wanted to do some­thing differ­ent. I think you’ve hit the nail on the head with the main weak­ness be­ing difficulty in ap­pli­ca­tion, par­tic­u­larly in es­ti­mat­ing Co­hen’s d.

I am cur­rently tak­ing part in repli­ca­tion mar­kets and bas­ing my judge­ments mainly on ex­per­i­men­tal power. Hope­fully this will give me a bet­ter idea of what works and I may write an up­dated guide next year.

As a data point r.e. the prize, I’m pretty sure that if the prize wasn’t there I would have done my usual and in­tended to write some­thing and never ac­tu­ally got round to it. I think this kind of prize is par­tic­u­larly use­ful for ques­tions which take a while to work on and at­ten­tion would oth­er­wise drift.

• Hope­fully this will give me a bet­ter idea of what works and I may write an up­dated guide next year.

I’d be ex­cited to see that.

As a data point r.e. the prize, I’m pretty sure that if the prize wasn’t there I would have done my usual and in­tended to write some­thing and never ac­tu­ally got round to it. I think this kind of prize is par­tic­u­larly use­ful for ques­tions which take a while to work on and at­ten­tion would oth­er­wise drift.

Oh, that's helpful to know, and it reminds me that I intended to ask respondents how the offer of a prize affected their contributions.

• A re­cent pa­per de­vel­oped a statis­ti­cal model for pre­dict­ing whether pa­pers would repli­cate.

We have de­rived an au­to­mated, data-driven method for pre­dict­ing repli­ca­bil­ity of ex­per­i­ments. The method uses ma­chine learn­ing to dis­cover which fea­tures of stud­ies pre­dict the strength of ac­tual repli­ca­tions. Even with our fairly small data set, the model can fore­cast repli­ca­tion re­sults with sub­stan­tial ac­cu­racy — around 70%. Pre­dic­tive ac­cu­racy is sen­si­tive to the vari­ables that are used, in in­ter­est­ing ways. The statis­ti­cal fea­tures (p-value and effect size) of the origi­nal ex­per­i­ment are the most pre­dic­tive. How­ever, the ac­cu­racy of the model is also in­creased by vari­ables such as the na­ture of the find­ing (an in­ter­ac­tion, com­pared to a main effect), num­ber of au­thors, pa­per length and the lack of perfor­mance in­cen­tives. All those vari­ables are as­so­ci­ated with a re­duc­tion in the pre­dicted chance of repli­ca­bil­ity.
...
The first re­sult is that one vari­able that is pre­dic­tive of poor repli­ca­bil­ity is whether cen­tral tests de­scribe in­ter­ac­tions be­tween vari­ables or (sin­gle-vari­able) main effects. Only eight of 41 in­ter­ac­tion effect stud­ies repli­cated, while 48 of the 90 other stud­ies did.

Another, un­re­lated, thing is that au­thors of­ten make in­flated in­ter­pre­ta­tions of their stud­ies (in the ab­stract, the gen­eral dis­cus­sion sec­tion, etc). Whereas there is a lot of crit­i­cism of p-hack­ing and other re­lated prac­tices per­tain­ing to the stud­ies them­selves, there is less scrutiny of how au­thors in­ter­pret their re­sults (in part that’s un­der­stand­able, since what counts as a dodgy in­ter­pre­ta­tion is more sub­jec­tive). Hence when you read the meth­ods and re­sults sec­tions it’s good to think about whether you’d make the same high-level in­ter­pre­ta­tion of the re­sults as the au­thors.

• This ques­tion has loads of great an­swers, with peo­ple shar­ing their hard-earned in­sights about how to en­gage with mod­ern sci­en­tific pa­pers and make sure to get the truth out of them, so I cu­rated it.

• For­give me if I rant a lit­tle against this cu­ra­tion no­tice.