# Two Dark Side Statistics Papers

I.

The message is hardly unique: there are lots of tricks unscrupulous or desperate scientists can use to artificially nudge results to the 5% significance level. The clarity of the presentation is unique. They start by discussing four particular tricks:

1. Measure multiple dependent variables, then report the ones that are significant. For example, if you’re measuring whether treatment for a certain psychiatric disorder improves life outcomes, you can collect five different measures of life outcomes – let’s say educational attainment, income, self-reported happiness, whether or not ever arrested, whether or not in a romantic relationship – and have a 25%-ish probability one of them will come out at significance by chance. Then you can publish a paper called “Psychiatric Treatment Found To Increase Educational Attainment” without ever mentioning the four negative tests.

2. Artificially choose when to end your experiment. Suppose you want to prove that yelling at a coin makes it more likely to come up tails. You yell at a coin and flip it. It comes up heads. You try again. It comes up tails. You try again. It comes up heads. You try again. It comes up tails. You try again. It comes up tails again. You try again. It comes up tails again. You note that it came up tails four out of six times – a 66% success rate compared to the expected 50% – and declare victory. Of course, this particular result wouldn’t be significant, and it seems as if this should be a general rule – that almost by the definition of significance, you shouldn’t be able to obtain it just by stopping the experiment at the right point. But the authors of the study perform several simulations to prove that this trick is more successful than you’d think:

3. Control for “confounders” (in practice, most often gender). I sometimes call this the “Elderly Hispanic Woman Effect”, after drug trials that find that their drug doesn’t have significant effects in the general population, but does significantly help elderly Hispanic women. The trick is to split the population into twenty subgroups (young white men, young white women, elderly white men, elderly white women, young black men, etc); by pure chance the drug will achieve significance in one of those subgroups, and so you declare that it must just somehow be a perfect fit for elderly Hispanic women’s unique body chemistry. This is not always wrong (some antihypertensives have notably different efficacy in white versus black populations) but it is usually suspicious.

4. Test different conditions and report the ones you like. For example, suppose you are testing whether vegetable consumption affects depression. You conduct the trial with three arms: low veggie diet, medium veggie diet, and high veggie diet. You now have four possible comparisons (low-medium, low-high, medium-high, and the low-medium-high trend). One of them will be significant about 20% of the time, so you can just report that one: “People who eat a moderate amount of vegetables are less likely to get depression than people who eat excess vegetables” sounds like a perfectly reasonable result.
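Tricks 1, 3, and 4 are at bottom the same multiplicity problem: run k tests that each have a 5% false-positive rate, and the chance that at least one comes up “significant” approaches 1 − 0.95^k. A minimal sketch (mine, not the authors’; it assumes, unrealistically, that the tests are independent – correlated outcomes or overlapping comparisons inflate the rate a bit less):

```python
import random

def chance_of_some_hit(k, alpha=0.05, trials=200_000, seed=0):
    """Chance that at least one of k independent null tests comes out
    'significant' at level alpha (under the null, each p-value is
    uniform on [0, 1], so P(p < alpha) = alpha)."""
    rng = random.Random(seed)
    hits = sum(any(rng.random() < alpha for _ in range(k))
               for _ in range(trials))
    return hits / trials

for k, trick in [(5, "five outcome measures"),
                 (4, "four diet comparisons"),
                 (20, "twenty demographic subgroups")]:
    print(f"{trick}: ~{chance_of_some_hit(k):.0%} chance of a spurious 'finding'")
```

The numbers line up with the intuitions above: roughly 23% for five outcome measures, 19% for four comparisons, and 64% for twenty subgroups.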

Then they run simulations to show exactly how much more likely you are to get a significant result in random data by employing each trick:

The image demonstrates that by using all four tricks, you can squeeze random data into a result significant at the p < 0.05 level about 61% of the time. The authors then put their money where their mouth is by conducting two studies. The first seems like a very, very classic social psychology study. Subjects are randomly assigned to listen to one of two songs – either a nondescript control song or a child’s nursery song. Then they are asked to rate how old they feel. Sure enough, the subjects who listen to the child’s song feel older (p = 0.03). The second study is very similar, with one important exception. Once again, subjects are randomly assigned to listen to one of two songs – either a nondescript control song or a song about aging, “When I’m Sixty-Four” by The Beatles. Then they are asked to put down their actual age, in years. People who listened to the Beatles song became, on average, a year and a half younger than the control group (p = 0.04).

So either the experimental intervention changed the subjects’ ages, or the researchers were using statistical tricks. It turns out it was the second one. They explain how they used the four statistical tricks described above, and that without those tricks there would have been (obviously) no significant difference. They go on to say that their experiment meets the inclusion criteria for every major journal and that under current reporting rules there’s no way anyone could have detected their data manipulation. They then list the changes they think the scientific establishment needs to prevent papers like theirs from reaching print. These are basically “don’t do the things we just talked about”, but as far as I can tell they rely on the honor system.
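Trick 2 is the least intuitive of the four, but the flavor of the authors’ simulations is easy to reproduce. A rough sketch (my toy setup, not theirs): flip a fair coin, run a normal-approximation z-test on the running tally after every flip, and stop the moment the nominal p-value dips below .05. The exact inflation depends on when you start checking and when you give up, but it lands far above the promised 5%:

```python
import math
import random

def peeking_false_positive(max_flips=50, check_from=10, trials=20_000, seed=1):
    """Yell at a fair coin, flip it repeatedly, and z-test the running
    tails count after every flip; stop and 'publish' as soon as the
    nominal p-value dips below .05."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        tails = 0
        for n in range(1, max_flips + 1):
            tails += rng.random() < 0.5
            if n >= check_from:
                z = (tails - n / 2) / math.sqrt(n / 4)
                if abs(z) > 1.96:      # nominal p < 0.05 – stop here!
                    wins += 1
                    break
    return wins / trials

print(f"'Significant' coins found by peeking: {peeking_false_positive():.0%}")
```

Each individual test is honest; the dishonesty is entirely in getting to choose, after seeing the data, which test counts as the final one.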
I think a broader meta-point is that on important studies scientists should have to submit their experimental protocol to a journal and get it accepted or rejected in advance, so they can’t change tactics mid-stream or drop data. This would also force journals to publish more negative results. See also their interesting discussion of why they think “use Bayesian statistics” is a non-solution to the problem.

II.

This study is very close to my heart, because I’m working on my hospital’s Substance Abuse Team this month. Every day we go see patients struggling with alcoholism, heroin abuse, et cetera, and we offer them treatment at our hospital’s intensive inpatient Chemical Dependency Unit. And every day, our patients say thanks but no thanks, they heard of a program affiliated with their local church that has a 60% success rate, or an 80% success rate, or in one especially rosy-eyed case a frickin’ 97% success rate.

(meanwhile, real rehab programs still struggle to prove they have a success rate greater than placebo)

My attending assumed these programs were scum but didn’t really have a good evidence base for the claim, so I decided to search Google Scholar to find out what was going on. I struck gold in this paper, which is framed as a sarcastic how-to guide for unscrupulous drug treatment program directors who want to inflate their success rates without technically lying.

By far the best way to do this is to choose your denominator carefully. For example, it seems fair to only include the people who attended your full treatment program, not the people who dropped out on Day One or never showed up at all – you can hardly be blamed for that, right? So suppose that your treatment program is one month of intensive rehab followed by a series of weekly meetings continuing indefinitely. At the end of one year, you define successful treatment completers as “the people who are still going to these meetings now, at the end of the year”. But in general, people who relapse into alcoholism are a whole lot less likely to continue attending their AA meetings than people who stay sober. So all you have to do is go up to people at your AA meeting, ask them if they’re still on the wagon, and your one-year success rate looks really good.
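The arithmetic behind the denominator trick is just conditional probability, and it is worth seeing how dramatic it gets. A toy calculation with invented numbers – suppose only 30% of completers are actually sober at one year, sober people keep attending meetings 80% of the time, and relapsed people only 10% of the time:

```python
def attendee_success_rate(true_sober=0.30, attend_if_sober=0.80,
                          attend_if_relapsed=0.10):
    """Success rate as measured only among people still attending
    meetings. All three parameters are invented for illustration."""
    still_there_sober = true_sober * attend_if_sober
    still_there_relapsed = (1 - true_sober) * attend_if_relapsed
    return still_there_sober / (still_there_sober + still_there_relapsed)

print(f"True one-year sobriety: 30%; measured among attendees: "
      f"{attendee_success_rate():.0%}")
```

A program that actually works for 30% of its clients gets to survey the room and report a 77% success rate.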

Another way to hack your treatment population is to only accept the most promising candidates to begin with (it works for private schools and it can work for you). We know that middle-class, employed people with houses and families have a much better prognosis than lower-class, unemployed, homeless single people. Although someone would probably notice if you put up a sign saying “MIDDLE-CLASS EMPLOYED PEOPLE WITH HOUSES AND FAMILIES ONLY”, a very practical option is to just charge a lot of money and let your client population select themselves. This is why for-profit private rehabs will have a higher success rate than public hospitals and government programs that deal with poor people.
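The price filter can be sketched the same way. In this toy model (all numbers invented), recovery depends only on a “life stability” score that also determines who can afford the fee, and the program itself does literally nothing:

```python
import random

def success_rates(fee_cutoff=0.7, n=100_000, seed=6):
    """Recovery depends only on a 'life stability' score (job, house,
    family) – the program itself does nothing. A high fee admits only
    high-stability clients. All numbers are invented."""
    rng = random.Random(seed)
    everyone, paying_clients = [], []
    for _ in range(n):
        stability = rng.random()                      # uniform on [0, 1]
        recovers = rng.random() < 0.2 + 0.6 * stability
        everyone.append(recovers)
        if stability > fee_cutoff:                    # only they can pay
            paying_clients.append(recovers)
    return (sum(everyone) / len(everyone),
            sum(paying_clients) / len(paying_clients))

overall, private_rehab = success_rates()
print(f"Whole population: {overall:.0%}; "
      f"pricey rehab's clients: {private_rehab:.0%}")
```

The expensive do-nothing program reports roughly a 71% success rate against a 50% population baseline, purely from who walks in the door.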

Still another strategy is to follow the old proverb: “If at first you don’t succeed, redefine success”. “Abstinence” is such a harsh word. Why not “drinking in moderation”? This is a wonderful phrase, because you can just let the alcoholic involved determine the definition of moderation. A year after the program ends, you can send out little surveys saying “Remember when we told you God really wants you not to drink? You listened to us and are drinking in moderation now, right? Please check one: Y () N ()”. Who’s going to answer ‘no’ to that? Heck, some of the alcoholics I talk to say they’re drinking in moderation while they are in the emergency room for alcohol poisoning.

If you can’t handle “moderation”, how about “drinking less than you were before the treatment program”? This takes advantage of regression to the mean – you’re going to enter a rehab program at the worst period of your life, the time when your drinking finally spirals out of control. Just by coincidence, most other parts of your life will include less drinking than when you first came into rehab, including the date a year after treatment when someone sends you a survey. Clearly rehab was a success!
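Regression to the mean is easy to demonstrate with a toy model: give everyone a stable baseline plus day-to-day noise, admit them at their observed worst moment, and measure again on an ordinary day a year later (all parameters invented):

```python
import random

def intake_vs_followup(n_people=50_000, seed=4):
    """Everyone's drinking = stable personal baseline + day-to-day
    noise. They enter rehab at their observed worst moment; a year
    later we measure an ordinary day. All parameters are invented."""
    rng = random.Random(seed)
    intake_total = followup_total = 0.0
    for _ in range(n_people):
        baseline = rng.gauss(10, 2)                    # stable drinking level
        observed = [baseline + rng.gauss(0, 3) for _ in range(6)]
        intake_total += max(observed)                  # spiraling out of control
        followup_total += baseline + rng.gauss(0, 3)   # just another day
    return intake_total / n_people, followup_total / n_people

intake, followup = intake_vs_followup()
print(f"Drinking score at intake: {intake:.1f}; one year later: {followup:.1f}")
```

Nobody’s underlying drinking changed, yet the average “improvement” from intake to follow-up is large – the whole effect comes from selecting people at their worst.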

And why wait a year? My attending and I actually looked up what was going on with that one 97% success rate program our patient said he was going to. Here’s what they do – it’s a three-month residential program where you live in a building just off the church, and you’re not allowed to go out except on group treatment activities. Obviously there is no alcohol allowed in the building, and you are surrounded by very earnest counselors and fellow recovering addicts at all times. Then, at the end of the three months, while you are still in the building, they ask you whether you’re drinking or not. You say no. Boom – 97% success rate.

One other tactic – one I have actually seen in studies, and it breaks my heart – is interval subdivision, which reminds me of some of the dirty tricks from the first study above. At five years’ follow-up, you ask people “Did you drink during Year 1? Did you drink during Year 2? Did you drink during Year 3?…” and so on. Now you have five chances to find a significant difference between treatment and control groups. I have literally seen studies that say “Our rehab didn’t have an immediate effect, but by Year 4 our patients were doing better than the controls.” Meanwhile, in years 1, 2, 3, and 5, for all we know the controls were doing better than the patients.
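Interval subdivision is the same multiplicity game as in the first paper, and a sketch shows how often it pays off. Here treatment and control are identical by construction, yet each of the five yearly comparisons is a fresh chance at a nominal p < .05 (my toy setup, not taken from any paper):

```python
import math
import random

def some_year_looks_good(n_per_arm=100, years=5, p_drink=0.5,
                         trials=10_000, seed=5):
    """Treatment and control are identical: in any given year everyone
    drinks with the same probability. Compare the arms separately for
    each follow-up year with a two-proportion z-test (known variance)."""
    rng = random.Random(seed)
    se = math.sqrt(2 * p_drink * (1 - p_drink) / n_per_arm)
    hits = 0
    for _ in range(trials):
        for _ in range(years):
            treated = sum(rng.random() < p_drink for _ in range(n_per_arm))
            control = sum(rng.random() < p_drink for _ in range(n_per_arm))
            if abs(treated - control) / n_per_arm > 1.96 * se:
                hits += 1          # "by Year k our patients did better!"
                break
    return hits / trials

print(f"Null rehab studies with a 'good year': {some_year_looks_good():.0%}")
```

With five looks, a do-nothing rehab gets its “by Year 4 our patients were doing better” headline in something like one out of every four or five null studies.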

But if all else fails, there’s always the old standby of poor researchers everywhere – just don’t include a control group at all. This table really speaks to me:

The great thing about this table isn’t just that it shows that seemingly impressive results are exactly the same as placebo. It’s that the results in the placebo groups of the four studies range anywhere from a 22.5% success rate to an 87% success rate. These aren’t treatment differences – all four groups are placebo! The difference is one hundred percent due to the study populations and the success measures used. In other words, depending on your study protocol, you can prove that there is a 22.5% chance the average untreated alcoholic will achieve remission, or an 87% chance the average untreated alcoholic will achieve remission.

You can bet that rehabs use the study protocol that finds an 87% chance of remission in the untreated. And then they go on to boast of their 90% success rate. Good job, rehab!