Case study: abuse of frequentist statistics

Recently, a colleague was reviewing an article whose key justification rested on some statistics that seemed dodgy to him, so he came to me for advice. (I guess my boss, the resident statistician, was out of his office.) Now, I’m no expert in frequentist statistics. My formal schooling in frequentist statistics comes from my undergraduate chemical engineering curriculum—I wouldn’t rely on it for consulting. But for a year and a half I’ve been working for someone who is essentially a frequentist, so I’ve had some hands-on experience. My boss hired me on the strength of my experience with Bayesian statistics, which I taught myself in grad school, and one thing that reading the Bayesian literature voraciously will equip you for is critiquing frequentist statistics. So I felt competent enough to take a look.1

The article compared an old, trusted experimental method with the authors’ new method; the authors sought to show that the new method gave the same results on average as the trusted method. They performed three replicates using the trusted method and three replicates using the new method; each replicate generated a real-valued data point. They did this in nine different conditions, and for each condition, they did a statistical hypothesis test. (I’m going to lean heavily on Wikipedia for explanations of the jargon terms I’m using, so this post is actually a lot longer than it appears on the page. If you don’t feel like following along, the punch line is three paragraphs down, last sentence.)

The authors used what’s called a Mann-Whitney U test, which, in simplified terms, aims to determine if two sets of data come from different distributions. The essential thing to know about this test is that it doesn’t depend on the actual data except insofar as those data determine the ranks of the data points when the two data sets are combined. That is, it throws away most of the data, in the sense that data sets that generate the same ranking are equivalent under the test. The rationale for doing this is that it makes the test “non-parametric”—you don’t need to assume a particular form for the probability density when all you look at are the ranks.
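
To make the ranks-only point concrete, here is a small Python sketch (a toy illustration with invented numbers, not the article’s data, and it assumes a recent SciPy whose mannwhitneyu supports the exact method): two pairs of data sets whose values differ wildly, but whose pooled rankings are identical, get exactly the same test result.

    from scipy.stats import mannwhitneyu

    # Two invented pairs of data sets. The values differ wildly between the
    # pairs, but in both cases all three "new" points outrank all three
    # "trusted" points once the six values are pooled and ranked.
    trusted_a, new_a = [1.2, 3.4, 5.6], [7.8, 9.0, 11.1]
    trusted_b, new_b = [0.001, 0.002, 0.003], [100.0, 200.0, 300.0]

    # The exact test sees only those ranks, so both calls print an
    # identical statistic and an identical p-value.
    print(mannwhitneyu(new_a, trusted_a, alternative="two-sided", method="exact"))
    print(mannwhitneyu(new_b, trusted_b, alternative="two-sided", method="exact"))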

The output of a statistical hypothesis test is a p-value; one pre-establishes a threshold for statistical significance, and if the p-value is lower than the threshold, one draws a certain conclusion called “rejecting the null hypothesis”. In the present case, the null hypothesis is that the old method and the new method produce data from the same distribution; the authors would like to see data that do not lead to rejection of the null hypothesis. They established the conventional threshold of 0.05, and for each of the nine conditions, they reported either “p > 0.05” or “p = 0.05”2. Thus they did not reject the null hypothesis, and argued that the analysis supported their thesis.
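
In code, the decision rule as described here is just a threshold comparison (a minimal sketch; the strict inequality and the 0.05 threshold follow this post’s description, not any particular software’s convention, and the 0.30 value is a hypothetical stand-in for a “p > 0.05” report):

    ALPHA = 0.05  # the pre-established significance threshold

    def decide(p_value, alpha=ALPHA):
        # Reject the null hypothesis only when the p-value falls below the
        # threshold; otherwise the test simply fails to reject it.
        return "reject the null hypothesis" if p_value < alpha else "do not reject"

    print(decide(0.30))  # a hypothetical "p > 0.05" report: not rejected
    print(decide(0.05))  # the boundary case the authors reported: not rejected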

Now even from a frequentist perspective, this is wacky. Hypothesis testing can reject a null hypothesis, but cannot confirm it, as discussed in the first paragraph of the Wikipedia article on null hypotheses. But this is not the real WTF, as they say. There are twenty ways to choose three objects out of six, so there are only twenty possible p-values, and these can be computed even when the original data are not available, since they only depend on ranks. I put these facts together within a day of being presented with the analysis and quickly computed all twenty p-values. Here I only need discuss the most extreme case, where all three of the data points for the new method are to one side (either higher or lower) of the three data points for the trusted method. This case provides the most evidence against the notion that the two methods produce data from the same distribution, resulting in the smallest possible p-value3: p = 0.05. In other words, even before the data were collected it could have been known that this analysis would give the result the authors wanted.4
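
Here is a sketch of that enumeration, using the one-sided framing alluded to in footnote 3 (the two-sided exact p-value for the most extreme case would simply be twice as large, 0.1). Under the null hypothesis, every way of assigning three of the six pooled ranks to the new method is equally likely, so the exact p-value for an observed U statistic is just the fraction of assignments at least as extreme:

    from itertools import combinations

    ranks = set(range(1, 7))                    # pooled ranks 1..6
    assignments = list(combinations(ranks, 3))  # rank sets the new method could occupy
    assert len(assignments) == 20               # C(6, 3) = 20

    def u_statistic(new_ranks):
        # Mann-Whitney U for the new method: the number of (new, old) pairs
        # in which the new-method point outranks the old-method point.
        old_ranks = ranks - set(new_ranks)
        return sum(1 for n in new_ranks for o in old_ranks if n > o)

    all_u = [u_statistic(a) for a in assignments]

    def p_one_sided(u_obs):
        # Exact one-sided p-value: the fraction of the 20 equally likely
        # rank assignments whose U is at least as large as the observed one.
        return sum(1 for u in all_u if u >= u_obs) / len(all_u)

    achievable = sorted({p_one_sided(u) for u in all_u})
    print(achievable)       # every p-value this design can produce
    print(min(achievable))  # 0.05: the analysis can never get below the threshold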

When I canvassed the Open Thread for interest in this article, Douglas Knight wrote: “If it’s really frequentism that caused the problem, please spell this out.” Frequentism per se is not the proximate cause of this problem; the proximate cause is that the authors either never noticed that their analysis could not falsify their hypothesis or tried to pull a fast one. But frequentism is a distal cause, in the sense that it forbids the Bayesian approach, and thus requires practitioners to become familiar with a grab-bag of unrelated methods for statistical inference5, leaving plenty of room for confusion and malfeasance. Technologos’s reply to Douglas Knight got it exactly right; I almost jokingly requested a spoiler warning.

1 I don’t mind that it wouldn’t be too hard to figure out who I am based on this paragraph. I just use a pseudonym to keep Google from indexing all my blog comments under my actual name.

2 It’s rather odd to report a p-value that is exactly equal to the significance threshold, one of many suspicious things about this analysis (the rest of which I’ve left out as they are not directly germane).

3 For those anxious to check my math, I’ve omitted some blah blah blah about one- and two-sided tests and alternative hypotheses.

4 I quickly emailed the reviewer; it didn’t make much difference, because when we initially talked about the analysis we had noticed enough other flaws that he had already decided to recommend rejection. This was just the final nail in the coffin.

5 … none of which actually address the question OF DIRECT INTEREST! … phew. Sorry.