Frequentist Statistics are Frequently Subjective

Andrew Gelman recently responded to a commenter on the Yudkowsky/Gelman diavlog; the commenter complained that Bayesian statistics were too subjective and lacked rigor. I shall explain why this is unbelievably ironic, but first, the comment itself:

However, the fundamental belief of the Bayesian interpretation, that all probabilities are subjective, is problematic—for its lack of rigor… One of the features of frequentist statistics is the ease of testability. Consider a binomial variable, like the flip of a fair coin. I can calculate that the probability of getting seven heads in ten flips is 11.71875%… At some point a departure from the predicted value may appear, and frequentist statistics give objective confidence intervals that can precisely quantify the degree to which the coin departs from fairness...

Gelman’s first response is “Bayesian probabilities don’t have to be subjective.” Not sure I can back him on that; probability is ignorance and ignorance is a state of mind (although indeed, some Bayesian probabilities can correspond very directly to observable frequencies in repeatable experiments).

My own response is that frequentist statistics are far more subjective than Bayesian likelihood ratios. Exhibit One is the notion of “statistical significance” (which is what the above comment is actually talking about, although “confidence intervals” have almost the same problem). Steven Goodman offers a nicely illustrated example: Suppose we have at hand a coin, which may be fair (the “null hypothesis”) or perhaps biased in some direction. So lo and behold, I flip the coin six times, and I get the result TTTTTH. Is this result statistically significant, and if so, what is the p-value—that is, the probability of obtaining a result at least this extreme?

Well, that depends. Was I planning to flip the coin six times, and count the number of tails? Or was I planning to flip the coin until it came up heads, and count the number of trials? In the first case, the probability of getting “five tails or more” from a fair coin is 11%, while in the second case, the probability of a fair coin requiring “at least five tails before seeing one heads” is 3%.
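To make the arithmetic concrete, here is a minimal sketch of the two p-values for TTTTTH under the two stopping rules. The function names are my own, and exact fractions are used so that nothing is hidden by rounding.

```python
from fractions import Fraction
from math import comb

def p_value_fixed_flips(n_flips=6, min_tails=5):
    """P(at least min_tails tails in n_flips flips) for a fair coin."""
    favorable = sum(comb(n_flips, k) for k in range(min_tails, n_flips + 1))
    return Fraction(favorable, 2 ** n_flips)

def p_value_flip_until_heads(min_tails=5):
    """P(at least min_tails tails before the first head) for a fair coin.

    This happens exactly when the first min_tails flips are all tails.
    """
    return Fraction(1, 2 ** min_tails)

print(p_value_fixed_flips())        # 7/64, about 11%
print(p_value_flip_until_heads())   # 1/32, about 3%
```

Same coin, same six flips, two different answers, and the only thing that changed is what the experimenter says they were planning.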

Whereas a Bayesian looks at the experimental result and says, “I can now calculate the likelihood ratio (evidential flow) between all hypotheses under consideration. Since your state of mind doesn’t affect the coin in any way—doesn’t change the probability of a fair coin or biased coin producing this exact data—there’s no way your private, unobservable state of mind can affect my interpretation of your experimental results.”
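A small sketch of why the stopping rule drops out of the likelihood ratio. The alternative hypothesis here (a coin with a 1/4 chance of heads) is purely an illustrative assumption of mine, as are the function names. The two sampling plans give likelihoods that differ only by a constant factor, and that factor cancels when you take the ratio between hypotheses.

```python
from fractions import Fraction
from math import comb

FAIR = Fraction(1, 2)

def likelihood_fixed_six_flips(p_heads):
    """P(exactly 5 tails in 6 planned flips | p_heads): binomial sampling."""
    return comb(6, 5) * (1 - p_heads) ** 5 * p_heads

def likelihood_flip_until_heads(p_heads):
    """P(first head arrives on flip 6 | p_heads): flip-until-heads sampling."""
    return (1 - p_heads) ** 5 * p_heads

def likelihood_ratio(likelihood, p_alt):
    """Likelihood of the data under p_alt relative to a fair coin."""
    return likelihood(p_alt) / likelihood(FAIR)

p_alt = Fraction(1, 4)  # hypothetical biased coin, chosen only for illustration
print(likelihood_ratio(likelihood_fixed_six_flips, p_alt))   # 243/64
print(likelihood_ratio(likelihood_flip_until_heads, p_alt))  # 243/64
```

The binomial coefficient comb(6, 5) multiplies both the numerator and the denominator in the first case, so it cancels, and the ratio is the same whichever plan the experimenter had in mind.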

If you’re used to Bayesian methods, it may seem difficult to even imagine that the statistical interpretation of the evidence ought to depend on a factor—namely the experimenter’s state of mind—which has no causal connection whatsoever to the experimental result. (Since Bayes says that evidence is about correlation, and no systematic correlation can appear without causal connection; evidence requires entanglement.) How can frequentists manage even in principle to make the evidence depend on the experimenter’s state of mind?

It’s a complicated story. Roughly, the trick is to make yourself artificially ignorant of the data: instead of knowing the exact experimental result, you pick a class of possible results which includes the actual experimental result, and then pretend that you were told only that the result was somewhere in this class. So if the actual result is TTTTTH, for example, you can pretend that this is part of the class {TTTTTH, TTTTTTH, TTTTTTTH, …}, a class whose total probability is 3% (1/32). Or if I preferred to have this experimental result not be statistically significant with p < 0.05, I could just as well pretend that some helpful fellow told me only that the result was in the class {TTTTTH, HHHHHT, TTTTTTH, HHHHHHT, …}, so that the total probability of the class would be 6%, not significant. (In frequentism this question is known as applying a “two-tailed test” or “one-tailed test”.)

The arch-Bayesian E. T. Jaynes ruled out this sort of reasoning by telling us that a Bayesian ought only to condition on events that actually happened, not events that could have happened but didn’t. (This is not to be confused with the dog that doesn’t bark. In this case, the dog was in fact silent; the silence of the dog happened in the real world, not somewhere else. We are rather being told that a Bayesian should not have to worry about alternative possible worlds in which the dog did bark, while estimating the evidence to take from the real world in which the dog did not bark. A Bayesian only worries about the experimental result that was, in fact, obtained; not other experimental results which could have been obtained, but weren’t.)

The process of throwing away the actual experimental result, and substituting a class of possible results which contains the actual one—that is, deliberately losing some of your information—introduces a dose of real subjectivity. Colin Begg reports on one medical trial where the data was variously analyzed as having a significance level—that is, probability of the “experimental procedure” producing an “equally extreme result” if the null hypothesis were true—of p=0.051, p=0.001, p=0.083, p=0.28, and p=0.62. Thanks, but I think I’ll stick with the conditional probability of the actual experiment producing the actual data.

Frequentists are apparently afraid of the possibility that “subjectivity”—that thing they were accusing Bayesians of—could allow some unspecified terrifying abuse of the scientific process. Do I need to point out the general implications of being allowed to throw away your actual experimental results and substitute a class you made up? In general, if this sort of thing is allowed, I can flip a coin, get 37 heads and 63 tails, and decide that it’s part of a class which includes all mixtures with at least 75 heads plus this exact particular sequence. As if I only had the output of a fixed computer program which was written in advance to look at the coinflips and compute a yes-or-no answer as to whether the data is in that class.

Meanwhile, Bayesians are accused of being “too subjective” because we might (gasp!) assign the wrong prior probability to something. First of all, it’s obvious from a Bayesian perspective that science papers should be in the business of reporting likelihood ratios, not posterior probabilities. Likelihood ratios multiply across experiments (equivalently, log-likelihoods add up), so to get the latest posterior you wouldn’t just need a “subjective” prior, you’d also need all the cumulative evidence from other science papers. Now, this accumulation might be a lot more straightforward for a Bayesian than a frequentist, but it’s not the sort of thing a typical science paper should have to do. Science papers should report the likelihood ratios for any popular hypotheses, but above all, make the actual raw data available, so the likelihoods can be computed for any hypothesis. (In modern times there is absolutely no excuse for not publishing the raw data, but that’s another story.)
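As a rough sketch of why reporting likelihood ratios is enough: for independent experiments the reported ratios simply multiply, and a reader can apply whatever prior odds they like afterward. The numbers below are made up for illustration and do not come from any real study.

```python
from math import prod

def combined_likelihood_ratio(ratios):
    """Combine likelihood ratios from independent experiments by multiplying."""
    return prod(ratios)

def posterior_odds(prior_odds, ratios):
    """Posterior odds = the reader's own prior odds times the combined ratio."""
    return prior_odds * combined_likelihood_ratio(ratios)

# Hypothetical ratios reported by three independent papers.
reported = [2.0, 0.5, 4.0]
print(combined_likelihood_ratio(reported))   # 4.0
print(posterior_odds(0.25, reported))        # 1.0
```

The papers never need to agree on a prior. Each one reports its ratio, and the prior enters only once, at the very end, on the reader's side.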

And Bayesian likelihoods really are objective—so long as you use the actual exact experimental data, rather than substituting something else.

Meanwhile, over in frequentist-land… what if you told everyone that you had done 127 trials because that was how much data you could afford to collect, but really you kept performing more trials until you got a p-value that you liked, and then stopped? Unless I’ve got a bug in my test program, a limit of up to 500 trials of a “fair coin” would, 30% of the time, arrive at some step where you could stop and reject the null hypothesis with p<0.05. Or 9% of the time with p<0.01. Of course this requires some degree of scientific dishonesty… or, perhaps, some minor confusion on the scientist’s part… since if this is what you are thinking, you’re supposed to use a different test of “statistical significance”. But it’s not like we can actually look inside their heads to find out what the experimenters were thinking. If we’re worried about scientific dishonesty, surely we should worry about that? (A similar test program done the Bayesian way, set to stop as soon as finding likelihood ratios of 20/1 and 100/1 relative to an alternative hypothesis that the coin was 55% biased, produced false positives of 3.2% and 0.3% respectively. Unless there was a bug; I didn’t spend that much time writing it.)
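For anyone who wants to poke at this themselves, here is a sketch of that kind of test program. It is not the original program, and several choices are my own assumptions: a two-sided z-test under the normal approximation, no testing before 10 flips, and reading “55% biased” as P(heads) = 0.55. Exact rates will shift with those choices and with the random seed.

```python
import math
import random

MAX_FLIPS = 500

def frequentist_rejects(heads, n, z_crit=1.96, min_n=10):
    """Two-sided z-test of 'the coin is fair' after n flips (normal approx.)."""
    if n < min_n:
        return False
    z = abs(2 * heads - n) / math.sqrt(n)
    return z > z_crit

def likelihood_ratio_rejects(heads, n, threshold=20.0, p_alt=0.55):
    """True once the likelihood ratio of 'P(heads) = p_alt' over 'fair' hits threshold."""
    tails = n - heads
    log_lr = heads * math.log(p_alt / 0.5) + tails * math.log((1 - p_alt) / 0.5)
    return log_lr >= math.log(threshold)

def false_positive_rate(rejects, n_sims=2000, seed=0):
    """Fraction of fair-coin runs in which `rejects` fires at some step <= MAX_FLIPS."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        heads = 0
        for n in range(1, MAX_FLIPS + 1):
            heads += rng.random() < 0.5   # one fair flip
            if rejects(heads, n):         # peek after every flip, stop if "significant"
                hits += 1
                break
    return hits / n_sims

if __name__ == "__main__":
    print("stop when p < 0.05:      ", false_positive_rate(frequentist_rejects))
    print("stop when ratio >= 20/1: ", false_positive_rate(likelihood_ratio_rejects))
```

One design note: the likelihood-ratio stopping rule cannot be gamed past a certain point, because under the null hypothesis the chance of ever reaching a ratio of k to 1 is at most 1/k, however long you keep flipping. The p-value stopping rule has no such ceiling.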

The actual subjectivity of standard frequentist methods, the ability to manipulate “statistical significance” by choosing different tests, is not a minor problem in science. There are ongoing scandals in medicine and neuroscience from lots of “statistically significant” results failing to replicate. I would point a finger, not just at publication bias, but at scientists armed with powerful statistics packages with lots of complicated tests to run on their data. Complication is really dangerous in science; unfortunately, it looks like instead we have the social rule that throwing around big fancy statistical equations is highly prestigious. (I suspect that some of the opposition to Bayesianism comes from the fact that Bayesianism is too simple.) The obvious fix is to (a) require raw data to be published; (b) require journals to accept papers before the experiment is performed, with the advance paper including a specification of what statistics were selected in advance to be run on the results; (c) raise the standard “significance” level to p<0.0001; and (d) junk all the damned overcomplicated status-seeking impressive nonsense of classical statistics and go to simple understandable Bayesian likelihoods.

Oh, and this frequentist business of “confidence intervals”? Just as subjective as “statistical significance”. Let’s say I’ve got a measuring device which returns the true value plus Gaussian noise. If I know you’re about to collect 100 results, I can write a computer program such that, before the experiment is run, it’s 90% probable that the true value will lie within the interval output by the program.

So I write one program, my friend writes another program, and my enemy writes a third program, all of which make this same guarantee. And in all three cases, the guarantee is true—the program’s interval will indeed contain the true value at least 90% of the time, if the experiment returns the true value plus Gaussian noise.

So you run the experiment and feed in the data; and the “confidence intervals” returned are [0.9-1.5], [2.0-2.2], and [“Cheesecake”-“Cheddar”].

The problem may be made clearer by considering the third program, which works as follows: 95% of the time, it does standard frequentist statistics to return an interval which will contain the true value 95% of the time, and on the other 5% of the time, it returns the interval [“Cheesecake”-“Cheddar”]. It is left as an exercise to the reader to show that this program will output an interval containing the true value at least 90% of the time.
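For readers who would rather run the exercise than prove it, the guarantee is just 0.95 × 0.95 = 0.9025, and a short simulation sketch agrees. Everything specific below is an assumption of mine for illustration: the noise standard deviation is taken as known, the true value of 1.2 is arbitrary, and the interval is the textbook known-sigma one.

```python
import math
import random
import statistics

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal

def standard_interval(samples, sigma):
    """Textbook 95% interval for the mean, with known noise sigma."""
    mean = statistics.fmean(samples)
    half_width = Z_95 * sigma / math.sqrt(len(samples))
    return (mean - half_width, mean + half_width)

def third_program(samples, sigma, rng):
    """95% of the time do honest statistics; otherwise return nonsense."""
    if rng.random() < 0.95:
        return standard_interval(samples, sigma)
    return ("Cheesecake", "Cheddar")

def covers(interval, true_value):
    """True if the interval is numeric and contains the true value."""
    low, high = interval
    if isinstance(low, str):
        return False
    return low <= true_value <= high

def coverage(true_value=1.2, sigma=1.0, n_samples=100, n_sims=4000, seed=0):
    """Fraction of simulated experiments whose interval contains the truth."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        samples = [rng.gauss(true_value, sigma) for _ in range(n_samples)]
        hits += covers(third_program(samples, sigma, rng), true_value)
    return hits / n_sims

if __name__ == "__main__":
    print(coverage())  # should land near 0.95 * 0.95 = 0.9025
```

The program keeps its advance promise of at least 90% coverage, and yet on any particular run where it hands you cheese, you know with certainty that the interval is useless. The guarantee is about the procedure before the data arrive, not about the interval you are actually holding.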

BTW, I’m pretty sure I recall reading that “90% confidence intervals” as published in journal papers, in those cases where a true value was later pinned down more precisely, did not contain the true value 90% of the time. So what’s the point, even? Just show us the raw data and maybe give us a summary of some likelihoods.

Parapsychology, the control group for science, would seem to be a thriving field with “statistically significant” results aplenty. Oh, sure, the effect sizes are minor. Sure, the effect sizes get even smaller (though still “statistically significant”) as they collect more data. Sure, if you find that people can telekinetically influence the future, a similar experimental protocol is likely to produce equally good results for telekinetically influencing the past. Of which I am less tempted to say, “How amazing! The power of the mind is not bound by time or causality!” and more inclined to say, “Bad statistics are time-symmetrical.” But here’s the thing: Parapsychologists are constantly protesting that they are playing by all the standard scientific rules, and yet their results are being ignored—that they are unfairly being held to higher standards than everyone else. I’m willing to believe that. It just means that the standard statistical methods of science are so weak and flawed as to permit a field of study to sustain itself in the complete absence of any subject matter. With two-thirds of medical studies in prestigious journals failing to replicate, getting rid of the entire actual subject matter would shrink the field by only 33%. We have to raise the bar high enough to exclude the results claimed by parapsychology under classical frequentist statistics, and then fairly and evenhandedly apply the same bar to the rest of science.

Michael Vassar has a theory that when an academic field encounters advanced statistical methods, it becomes really productive for ten years and then bogs down because the practitioners have learned how to game the rules.

For so long as we do not have infinite computing power, there may yet be a place in science for non-Bayesian statistics. The Netflix Prize was not won by using strictly purely Bayesian methods, updating proper priors to proper posteriors. In that acid test of statistical discernment, what worked best was a gigantic ad-hoc mixture of methods. It may be that if you want to get the most mileage out of your data, in this world where we do not have infinite computing power, you’ll have to use some ad-hoc tools from the statistical toolbox—tools that throw away some of the data, that make themselves artificially ignorant, that take all sorts of steps that can’t be justified in the general case and that are potentially subject to abuse and that will give wrong answers now and then.

But don’t do that, and then turn around and tell me that (of all things!) Bayesian probability theory is too subjective. Probability theory is the math in which the results are theorems and every theorem is compatible with every other theorem and you never get different answers by calculating the same quantity in different ways. To resort to the ad-hoc variable-infested complications of frequentism while preaching your objectivity? I can only compare this with the politicians who go around preaching “Family values!” and then get caught soliciting sex in restrooms. So long as you deliver loud sermons and make a big fuss about painting yourself with the right labels, you get identified with that flag, and no one bothers to look very hard at what you do. The case of frequentists calling Bayesians “too subjective” is worth dwelling on for that aspect alone: it shows how important it is to look at what’s actually going on instead of just listening to the slogans, and how rare it is for anyone to even glance in that direction.