I’m not seeing why what you call “the real WTF” is evidence of a problem with frequentist statistics. The fact that the hypothesis test would have given a statistically insignificant p-value whatever the actual 6 data points were just indicates that, whatever the population distributions, 6 data points are simply not enough to disconfirm the null hypothesis. In fact you can see this if you look at Mann & Whitney’s original paper! (See the n=3 subtable in table I, p. 52.)
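To make the arithmetic explicit: with two samples of size 3 there are only C(6,3) = 20 equally likely ways to assign ranks under the null, so even the most extreme possible outcome has a one-sided p-value of 1/20 = 0.05 (0.10 two-sided). A few lines of Python confirm this by brute force:

```python
# Enumerate every way 3 of the 6 pooled ranks could belong to one group.
from itertools import combinations

assignments = list(combinations(range(1, 7), 3))   # C(6,3) = 20 assignments,
                                                   # all equally likely under H0
rank_sums = [sum(a) for a in assignments]
extreme = min(rank_sums)                           # most lopsided outcome: ranks 1, 2, 3
p_one_sided = sum(s <= extreme for s in rank_sums) / len(assignments)
print(len(assignments), p_one_sided)               # 20 0.05 -> two-sided p = 0.10
```

So no conceivable data could have reached significance at the 0.05 level (two-sided) with this design.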
I can picture someone counterarguing that this is not immediately obvious from the details of the statistical test, but I would hope that any competent statistician, frequentist or not, would be sceptical of a nonparametric comparison of means for samples of size 3!
I’m an econometrician by training, and when I was taught non-parametric testing I was told the minimum sample size needed to get a useful result was 10. Either the authors of the article had forgotten this, or there is something very wrong with how they were taught this test.
Thanks for the pointer to the original paper.

I’m not seeing why what you call “the real WTF” is evidence of a problem with frequentist statistics.
Check out the title: abuse of frequentist statistics. Yes, at the end, I argue from a Bayesian perspective, but you don’t have to be a Bayesian to see the structural problems with frequentist statistics as currently taught to and practiced by working scientists.
I would hope that any competent statistician, frequentist or not, would be sceptical of a nonparametric comparison of means for samples of size 3!
Me too. But not all papers with shoddy statistics are sent to statisticians for review. Experimental biologists in particular have a reputation for math-phobia. (Does the fact that when I saw the sample size the word “underpowered” instantly jumped into my head count as evidence that I am competent?)
Check out the title: abuse of frequentist statistics. Yes, at the end, I argue from a Bayesian perspective, but you don’t have to be a Bayesian to see the structural problems with frequentist statistics as currently taught to and practiced by working scientists.
I agree that frequentist statistics are often poorly taught and understood, and that this holds however you like to do your statistics. Still, the main post feels to me like a sales pitch for Bayes brand chainsaws that’s trying to scare me off Neyman-Pearson chainsaws by pointing out how often people using Neyman-Pearson chainsaws accidentally cut off a limb with them. (I am aware that I may be the only reader who feels this way about the post.)
(Does the fact that when I saw the sample size the word “underpowered” instantly jumped into my head count as evidence that I am competent?)
Yes, but it is not sufficient evidence to reject the null hypothesis of incompetence at the 0.05 significance level. (I keed, I keed.)

(I am aware that I may be the only reader who feels this way about the post.)

I get that impression a lot around here.
Still, the main post feels to me like a sales pitch...
It’s a fair point; I’m not exactly attacking the strongest representative of frequentist statistical practice. My only defense is that this actually happened, so it makes a good case study.

That’s true, and having been reminded of that, I think I may have been unduly pedantic about a fine detail at the expense of the main point.

It’s a good case study, but it’s not evidence of a problem with frequentist statistics.
I assert that it is evidence in my concluding paragraph, but it’s true that I don’t give an actual argument. Whether one counts it as evidence would seem to depend on the causal assumptions one makes about the teaching and practice of statistics.

Perhaps it’s frequentist evidence against frequentist statistics.

I think this is just a glib rejoinder, but if there’s a deeper thought there, I’d be interested to hear it.
The critique of frequentist statistics, as I understand it—and I don’t think I do—is that frequentists like to count things, and trust that having large sample sizes will take care of biases for them. Therefore, a case in which frequentist statistics co-occurs with bad results counts against use of frequentist statistics, and you don’t have to worry about why the results were bad.
The whole Bayesian vs. frequentist argument seems a little silly to me. It’s like arguing that screws are better than nails. It’s true that, for any particular joint you wish to connect, a screw will probably hold it more securely and reversibly than a nail. That doesn’t mean there’s no use for nails.
I think that, in this case, the underlying problem was not caused by the way frequentist statistics are commonly taught and practiced by working scientists:

In the present case, the null hypothesis is that the old method and the new method produce data from the same distribution; the authors would like to see data that do not lead to rejection of the null hypothesis.

I’m no statistician, but I’m pretty sure you’re not supposed to make your favored hypothesis the null hypothesis. That’s a pretty simple rule and I think it’s drilled into students and enforced in peer review.

I see that as the underlying problem because it reverses the burden of proof. If they had done it the right way around, six data points would have been not enough to support their method instead of being not enough to reject it. Making your favored hypothesis the null hypothesis can allow you, in the extreme, to rely on a single data point.
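To see how completely this reverses the burden of proof, here is a small simulation (a sketch using scipy’s mannwhitneyu; the distributions are invented for illustration). Even when the two methods differ by five standard deviations, a two-sided Mann-Whitney test at the 0.05 level can never reject with three observations per group, so the favored null hypothesis is guaranteed to “survive”:

```python
# Simulate grossly different "old" and "new" methods, 3 observations each,
# and count how often the Mann-Whitney test rejects at alpha = 0.05.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
trials, rejections = 10_000, 0
for _ in range(trials):
    old = rng.normal(0.0, 1.0, size=3)   # "old method"
    new = rng.normal(5.0, 1.0, size=3)   # "new method", wildly different
    _, p = mannwhitneyu(old, new, alternative="two-sided")
    rejections += p < 0.05
print(rejections / trials)               # 0.0 -- the test can never reject
```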
In the OP I did refer to that when I wrote:

Now even from a frequentist perspective, this is wacky. Hypothesis testing can reject a null hypothesis, but cannot confirm it, as discussed in the first paragraph of the Wikipedia article on null hypotheses.
You wrote:
That’s a pretty simple rule and I think it’s drilled into students and enforced in peer review.
Not all papers are reviewed by people who know the rule. I was taught that rule over ten years ago, and I didn’t remember it when my colleague showed me the analysis. (I did recall it eventually, just after I ran the sanity check. Evidence against my competence!) My colleague whose job it was to review the paper didn’t know/recall the rule either.
Check out the title: abuse of frequentist statistics. Yes, at the end, I argue from a Bayesian perspective, but you don’t have to be a Bayesian to see the structural problems with frequentist statistics as currently taught to and practiced by working scientists.
Well, I don’t see the structural problems. (I don’t even know what a structural problem is.)
Somebody, please write a top-level post addressing this. Stop saying “Frequentists are bad” and leaving it at that. This is a great story, but it’s not valid argumentation to try to convert it into an anti-frequentist tract.
I’d love to see a top-level post where someone suggests the best and/or most realistic way for scientists to do their statistics. I’m actually rather ignorant with regard to probability theory. I got a D in second-semester frequentist statistics (hard teacher + I didn’t go to class or try very hard on the homework), which is indicative of how little I learned in that class. I did better in my applied statistics classes.
When is it good for scientists to do null hypothesis testing?

What specifically is the “this” you want addressed? I’m not sure what its referent is.

Right—show us how you would have done this test correctly using Bayesian statistics.

That did come up in comments; you can find the discussion here.
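For anyone who doesn’t want to chase the link, here is one minimal version of what a Bayesian reanalysis could look like. This is only a sketch, not necessarily the approach from that discussion: it assumes a normal model with Jeffreys priors (under which each group mean has a shifted, scaled Student-t posterior), uses placeholder numbers since the paper’s actual data aren’t given here, and a hypothetical equivalence margin eps that would need domain knowledge to set:

```python
# Posterior probability that two methods agree to within a tolerance eps,
# under independent normal models with Jeffreys priors p(mu, sigma^2) ~ 1/sigma^2.
import numpy as np

rng = np.random.default_rng(0)

def posterior_mean_samples(data, size=100_000):
    """Draw from the posterior of the mean: xbar + (s / sqrt(n)) * t_{n-1}."""
    data = np.asarray(data, dtype=float)
    n, xbar, s = len(data), data.mean(), data.std(ddof=1)
    return xbar + (s / np.sqrt(n)) * rng.standard_t(df=n - 1, size=size)

old = [3.1, 2.7, 3.4]   # placeholder data, not the paper's actual values
new = [3.0, 3.2, 2.9]   # placeholder data
eps = 0.5               # hypothetical equivalence margin

diff = posterior_mean_samples(new) - posterior_mean_samples(old)
print(np.mean(np.abs(diff) < eps))   # posterior P(methods agree to within eps)
```

Unlike the original test, this directly quantifies support for the claim the authors actually cared about, and with only three points per group it will typically report that the data are nearly uninformative.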