How to Fix Science

Like The Cognitive Science of Rationality, this is a post for beginners. Send the link to your friends!

Science is broken. We know why, and we know how to fix it. What we lack is the will to change things.

In 2005, several analyses suggested that most published results in medicine are false. A 2008 review showed that perhaps 80% of academic journal articles mistake “statistical significance” for “significance” in the colloquial meaning of the word, an elementary error every introductory statistics textbook warns against. This year, a detailed investigation showed that half of published neuroscience papers contain one particular simple statistical mistake.

Also this year, a respected senior psychologist published, in a leading journal, a study claiming to show evidence of precognition. The editors explained that the paper was accepted because it was written clearly and followed the usual standards for experimental design and statistical methods.

Science writer Jonah Lehrer asks: “Is there something wrong with the scientific method?”

Yes, there is.

This shouldn’t be a surprise. What we currently call “science” isn’t the best method for uncovering nature’s secrets; it’s just the first set of methods we’ve collected that isn’t totally useless, the way personal anecdote and appeals to authority generally are.

As time passes we learn new things about how to do science better. The Ancient Greeks practiced some science, but few scientists tested hypotheses against mathematical models before Ibn al-Haytham’s 11th-century Book of Optics (which also contained hints of Occam’s razor and positivism). Around the same time, Al-Biruni emphasized the importance of repeated trials for reducing the effect of accidents and errors. Galileo brought mathematics to greater prominence in scientific method, Bacon described eliminative induction, Newton demonstrated the power of consilience (unification), Peirce clarified the roles of deduction, induction, and abduction, and Popper emphasized the importance of falsification. We’ve also discovered the usefulness of peer review, control groups, blind and double-blind studies, plus a variety of statistical methods, and added these to “the” scientific method.

In many ways, the best science done today is better than ever, but it still has problems, and most science is done poorly. The good news is that we know what these problems are and we know multiple ways to fix them. What we lack is the will to change things.

This post won’t list all the problems with science, nor will it list all the promising solutions for any of these problems. (Here’s one I left out.) Below, I only describe a few of the basics.

Problem 1: Publication bias

When the study claiming to show evidence of precognition was published, psychologist Richard Wiseman set up a registry for advance announcement of new attempts to replicate the study.

Carl Shulman explains:

A replication registry guards against publication bias, and at least 5 attempts were registered. As far as I can tell, all of the subsequent replications have, unsurprisingly, failed to replicate Bem’s results. However, JPSP and the other high-end psychology journals refused to publish the results, citing standing policies of not publishing straight replications.

From the journals’ point of view, this (common) policy makes sense: bold new claims will tend to be cited more and raise journal prestige (which depends on citations per article), even though this means most of the ‘discoveries’ they publish will be false despite their low p-values (high statistical significance). However, this means that overall the journals are giving career incentives for scientists to massage and mine their data for bogus results, but not to challenge bogus results presented by others.

This is an example of publication bias:

Publication bias is the term for what occurs whenever the research that appears in the published literature is systematically unrepresentative of the population of completed studies. Simply put, when the research that is readily available differs in its results from the results of all the research that has been done in an area, readers and reviewers of that research are in danger of drawing the wrong conclusion about what that body of research shows. In some cases this can have dramatic consequences, as when an ineffective or dangerous treatment is falsely viewed as safe and effective. [Rothstein et al. 2005]
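To make the problem concrete, here is a minimal simulation sketch (in Python, with invented numbers) of a literature in which journals only publish “statistically significant” results. Even when the true effect is exactly zero, the published record fills up with false positives whose effect sizes look impressively large.

    # A toy model of publication bias: many studies of a treatment with zero true
    # effect, but only the "significant" ones (p < 0.05) make it into journals.
    # All numbers here are invented for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_studies, n_per_arm = 1000, 30

    all_effects, published_effects = [], []
    for _ in range(n_studies):
        treatment = rng.normal(0.0, 1.0, n_per_arm)   # true effect is exactly zero
        control = rng.normal(0.0, 1.0, n_per_arm)
        effect = treatment.mean() - control.mean()
        _, p = stats.ttest_ind(treatment, control)
        all_effects.append(effect)
        if p < 0.05:                                  # journals accept only "positive" findings
            published_effects.append(effect)

    print(f"published: {len(published_effects)} of {n_studies} studies")
    print(f"mean |effect| across all studies:   {np.mean(np.abs(all_effects)):.3f}")
    print(f"mean |effect| in published studies: {np.mean(np.abs(published_effects)):.3f}")

With a true effect of zero, roughly 5% of these studies come out “significant” by chance, and those are the only ones a reader of the published literature ever sees.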

Sometimes, publication bias can be more deliberate. The anti-inflammatory drug rofecoxib (Vioxx) is a famous case. The drug was prescribed to 80 million people, but it was later revealed that its maker, Merck, had withheld evidence of the drug’s risks. Merck was forced to recall the drug, but it had already resulted in 88,000-144,000 cases of serious heart disease.

Example partial solution

One way to combat publication bias is for journals to only accept experiments that were registered in a public database before they began. This allows scientists to see which experiments were conducted but never reported (perhaps due to negative results). Several prominent medical journals (e.g. The Lancet and JAMA) now operate this way, but this protocol is not as widespread as it could be.
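Preregistration also makes the bias measurable. The sketch below uses invented registry entries and publication records (not any real database or API) to show the basic bookkeeping preregistration enables: comparing what was registered against what eventually appeared in print.

    # A minimal sketch of the bookkeeping a preregistration registry enables.
    # The trial IDs, descriptions, and publication records are invented.
    registered = {
        "TRIAL-001": "drug A vs placebo",
        "TRIAL-002": "drug A vs placebo (replication)",
        "TRIAL-003": "drug B vs placebo",
        "TRIAL-004": "drug B vs placebo (replication)",
    }
    published = {"TRIAL-001", "TRIAL-003"}   # the only trials that appeared in journals

    unpublished = {tid: desc for tid, desc in registered.items() if tid not in published}
    print(f"{len(unpublished)} of {len(registered)} registered trials were never published:")
    for tid, desc in unpublished.items():
        print(f"  {tid}: {desc}")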

Problem 2: Experimenter bias

Scientists are humans. Humans are affected by cognitive heuristics and biases (or, really, humans just are cognitive heuristics and biases), and they respond to incentives that may not align with an optimal pursuit of truth. Thus, we should expect experimenter bias in the practice of science.

There are many stages in research during which experimenter bias can occur:

  1. in reading-up on the field,

  2. in specifying and selecting the study sample,

  3. in [performing the experiment],

  4. in measuring exposures and outcomes,

  5. in analyzing the data,

  6. in interpreting the analysis, and

  7. in publishing the results. [Sackett 1979]

Common biases have been covered elsewhere on Less Wrong, so I’ll let those articles explain how biases work.

Example partial solution

There is some evidence that the skills of rationality (e.g. cognitive override) are teachable. Training scientists to notice and mitigate the biases that arise in their thinking may help them reduce the magnitude and frequency of the thinking errors that derail truth-seeking at each stage of the scientific process.

Problem 3: Bad statistics

I remember when my statistics professor first taught me the reasoning behind “null hypothesis significance testing” (NHST), the standard technique for evaluating experimental results. NHST uses “p-values,” which are statements about the probability of getting data at least as extreme as one’s experimental results, assuming the hypothesis being tested is true. I asked my professor, “But don’t we want to know the probability of the hypothesis we’re testing given the data, not the other way around?” The reply was something about how this was the best we could do. (But that’s false, as we’ll see in a moment.)
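To see why the direction of the conditional matters, here is a minimal two-hypothesis sketch with invented numbers. Even when the data would be surprising if the exciting hypothesis were false, the probability of that hypothesis given the data can remain low if it was implausible to begin with; that is just Bayes’ theorem.

    # P(data | hypothesis) is not P(hypothesis | data).
    # A two-hypothesis illustration with invented numbers.
    prior_h = 0.01             # assumed prior probability that the exciting hypothesis H is true
    p_data_given_h = 0.80      # probability of seeing data like this if H is true
    p_data_given_not_h = 0.05  # probability of seeing data this extreme if H is false

    # Bayes' theorem: P(H | data) = P(data | H) * P(H) / P(data)
    p_data = p_data_given_h * prior_h + p_data_given_not_h * (1 - prior_h)
    posterior_h = p_data_given_h * prior_h / p_data

    print(f"P(data | H) = {p_data_given_h:.2f}")
    print(f"P(H | data) = {posterior_h:.2f}")   # about 0.14: H is still probably false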

Another problem is that NHST computes the probability of getting data as unusual as the data one collected by considering what might be expected if that particular experiment were repeated many, many times. But how do we know anything about these imaginary repetitions? If I want to know something about a particular earthquake, am I supposed to imagine a few dozen repetitions of that earthquake? What does that even mean?
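In practice, the repetitions are hypothetical re-runs of the same experiment under the null hypothesis, and the p-value is a tail area over those re-runs. The sketch below makes that explicit for an invented example: 62 heads in 100 flips of a possibly biased coin.

    # A p-value is a statement about imaginary repetitions of the experiment.
    # Invented data: 62 heads in 100 flips, null hypothesis of a fair coin.
    import numpy as np

    rng = np.random.default_rng(1)
    observed_heads, n_flips = 62, 100

    # Simulate many re-runs of the experiment under the null hypothesis...
    reps = rng.binomial(n_flips, 0.5, size=200_000)

    # ...and ask how often a result at least as far from 50 heads turns up.
    p_value = np.mean(np.abs(reps - 50) >= abs(observed_heads - 50))
    print(f"approximate two-sided p-value: {p_value:.3f}")   # about 0.02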

I tried to answer these questions on my own, but all my textbooks assumed the soundness of the mistaken NHST framework for scientific practice. It’s too bad I didn’t have a class with biostatistician Steven Goodman, who says:

The p-value is almost nothing sensible you can think of. I tell students to give up trying.

The sad part is that the logical errors of NHST are old news, and have been known ever since Ronald Fisher began advocating NHST in the 1920s. By 1960, Fisher had out-advocated his critics, and philosopher William Rozeboom remarked:

Despite the awesome pre-eminence [NHST] has attained… it is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research.

There are many more problems with NHST and with “frequentist” statistics in general, but the central one is this: NHST does not follow from the axioms (foundational logical rules) of probability theory. It is a grab-bag of techniques that, depending on how those techniques are applied, can lead to different results when analyzing the same data, something that should horrify every mathematician.
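A classic textbook illustration (a coin-flip example, not drawn from any of the studies above): the same data, 9 heads and 3 tails, yields two different p-values depending only on whether the experimenter planned to stop after 12 flips or after the third tail. The conclusion depends on the experimenter’s intentions, not just on the data.

    # Same data, two different p-values, depending only on the stopping rule.
    # Data: 9 heads and 3 tails; null hypothesis: a fair coin.
    from scipy import stats

    heads, tails = 9, 3

    # Plan A: "flip exactly 12 times."  P(at least 9 heads out of 12).
    p_fixed_n = stats.binom.sf(heads - 1, heads + tails, 0.5)

    # Plan B: "flip until the 3rd tail."  P(at least 9 heads before the 3rd tail).
    p_stop_at_third_tail = stats.nbinom.sf(heads - 1, tails, 0.5)

    print(f"p-value under plan A: {p_fixed_n:.3f}")             # about 0.073: not "significant"
    print(f"p-value under plan B: {p_stop_at_third_tail:.3f}")  # about 0.033: "significant"

A Bayesian analysis treats both cases identically, because the probability of the observed data is the same under either plan.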

The inferential method that solves the problems with frequentism, and that (more importantly) follows deductively from the axioms of probability theory, is Bayesian inference.

So why aren’t all scientists using Bayesian inference instead of frequentist inference? Partly, we can blame the vigor of NHST’s early advocates. But we can also attribute NHST’s success to the simple fact that Bayesian calculations can be more difficult than frequentist calculations. Luckily, new software tools like WinBUGS let computers do most of the heavy lifting required for Bayesian inference.
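For simple models the heavy lifting is just turning a prior and a likelihood into a posterior, which can be done by brute force on a grid, as in the sketch below (invented data: 62 heads in 100 flips). Tools like WinBUGS and its successors exist to do the same job by sampling when the model is too complex for a grid.

    # Brute-force Bayesian inference for a coin's bias theta, given 62 heads in 100 flips.
    # (Invented data; samplers handle models too complex for a simple grid like this.)
    import numpy as np

    heads, n = 62, 100
    theta = np.linspace(0.001, 0.999, 999)        # candidate values of the coin's bias

    prior = np.ones_like(theta)                   # a flat prior over theta
    likelihood = theta**heads * (1 - theta)**(n - heads)
    posterior = prior * likelihood
    posterior /= posterior.sum()                  # normalize so it sums to 1

    mean_theta = np.sum(theta * posterior)
    prob_favors_heads = posterior[theta > 0.5].sum()

    print(f"posterior mean of theta: {mean_theta:.3f}")         # about 0.62
    print(f"P(theta > 0.5 | data):   {prob_favors_heads:.3f}")  # about 0.99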

There’s also the problem of sheer momentum. Once a practice is enshrined, it’s hard to dislodge it, even for good reasons. I took three statistics courses in university and none of my textbooks mentioned Bayesian inference. I didn’t learn about it until I dropped out of university and studied science and probability theory on my own.

Remember the study about precognition? Not surprisingly, it was done using NHST. A later Bayesian analysis of the data disconfirmed the original startling conclusion.
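The flavor of such a reanalysis can be conveyed with a toy Bayes factor (invented numbers, not the actual data or the published reanalysis): ask how much better a small-effect hypothesis predicts a slightly above-chance hit rate than the chance hypothesis does, and compare that to the verdict of a significance test.

    # A toy Bayes-factor comparison with invented numbers (not the published reanalysis).
    # Suppose 527 "hits" in 1000 binary guesses. Chance predicts a 50% hit rate;
    # a hypothetical small real effect predicts 53%.
    from scipy import stats

    hits, trials = 527, 1000

    p_value = stats.binom.sf(hits - 1, trials, 0.50)      # one-sided p ~ 0.047: "significant"
    bayes_factor = (stats.binom.pmf(hits, trials, 0.53) /
                    stats.binom.pmf(hits, trials, 0.50))  # ~4: only modest evidence

    print(f"one-sided p-value:               {p_value:.3f}")
    print(f"Bayes factor (effect vs chance): {bayes_factor:.1f}")

A result that clears the conventional significance threshold can still constitute only weak evidence once the alternative hypothesis is spelled out and compared directly.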

Example partial solution

This one is obvious: teach students probability theory instead of NHST. Retrain current scientists in Bayesian methods. Make Bayesian software tools easier to use and more widespread.

Conclusion

If I’m right that there is unambiguous low-hanging fruit for improving scientific practice, this suggests that particular departments, universities, or private research institutions can (probabilistically) outperform their rivals (in terms of actual discoveries, not just publications) given similar resources.

I’ll conclude with one specific hypothesis. If I’m right, then a research group should be able to hire researchers trained in Bayesian reasoning and in catching publication bias and experimenter bias, and have them extract from the existing literature valuable medical truths that the mainstream medical community doesn’t yet know about. This prediction, in fact, is about to be tested.