They did not prove, for instance, that it cannot be the case, with 95% certainty, that all but one of the children are made hyperactive.
They did not say this but I am confident that if this bizarre hypothesis (all but one of what group of children, exactly?) were tested, the test would reject it as well. (Ignoring the conflicting-measures-of-hyperactivity point, which I am not competent to judge.)
In general, the F-test does not reject all alternate hypotheses equally, which is a problem but a different one. However, it provides evidence against all hypotheses that imply an aggregate difference between the test group and control group: equivalently, we’re testing if the means are different.
If all children are made hyperactive, the means will be different, and the study rejects this hypothesis. But if 99% of children are made hyperactive to the same degree, the means will be different by almost the same amount, and the test would also reject this hypothesis, though not as strongly. I don’t care how you wish to do the calculations, but any hypothesis that suggests the means are different is in fact made less likely by the study.
And hence my list of alternate hypotheses that may be worth considering, and are not penalized as much as others. Let’s recap:
If the effect is present but weak, we expect the means to be close to equal, so the statistical test results don’t falsify this hypothesis. However, we also don’t care about effects that are uniformly weak.
If the effect is strong but present in a small fraction of the population, the means will also be close to equal, and we do care about such an effect. Quantifying “strong” lets us quantify “small”.
We can allow the effect to be strong and present in a larger fraction of the population, if we suppose that some or all of the remaining children are actually affected negatively.
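The three cases in the recap can be made concrete with a little arithmetic. This is only a rough sketch: the function names, the split into "up" and "down" groups, and every number are illustrative assumptions, not values from the study. It shows that all three hypotheses can predict the same small difference in means, and that fixing how strong the effect is puts a cap on how large the affected fraction can be.

```python
def expected_mean_shift(fraction_up, effect_up, fraction_down=0.0, effect_down=0.0):
    """Expected treated-minus-control difference in mean hyperactivity score.

    fraction_up   -- share of children whose score rises by effect_up
    fraction_down -- share of children whose score falls by effect_down
    Everyone else is assumed to be unaffected. Units are arbitrary.
    """
    if fraction_up + fraction_down > 1.0:
        raise ValueError("fractions must sum to at most 1")
    return fraction_up * effect_up - fraction_down * effect_down


def max_fraction_affected(mean_shift_bound, effect_up):
    """Largest affected share compatible with a bound on the mean shift,
    assuming nobody is affected negatively."""
    return min(1.0, mean_shift_bound / effect_up)


# Hypothetical numbers, chosen only so that all three cases give the same shift.
weak_everywhere = expected_mean_shift(1.0, 0.1)
strong_but_rare = expected_mean_shift(0.1, 1.0)
strong_with_offset = expected_mean_shift(0.4, 1.0, fraction_down=0.3, effect_down=1.0)
print(weak_everywhere, strong_but_rare, strong_with_offset)

# "Quantifying strong lets us quantify small":
print(max_fraction_affected(0.1, 1.0))
```

A test on the means alone cannot tell these three apart, which is why the list treats them together.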
If all children are made hyperactive, the means will be different, and the study rejects this hypothesis. But if 99% of children are made hyperactive to the same degree, the means will be different by almost the same amount, and the test would also reject this hypothesis, though not as strongly.
This is math. You can’t say “If 2+2 = 4, then 2+1.9 = 4.” There is no “as strongly” being reported here. There is only accept or reject.
The study rejects a hypothesis using a specific number that was computed using the assumption that the effect is the same in all children. That specific number is not the correct number to reject the hypothesis that the effect is the same in all but one.
It might so happen that the data used in the study would reject that hypothesis, if the correct threshold for it were computed. But the study did not do that, so it cannot claim to have proven that.
The reality in this case is that food dye promotes hyperactivity in around 15% of children. The correct F-value threshold to reject that hypothesis would be much, much lower!
You’re correct in a broader sense that passing the F-test under one set of assumptions is strong evidence that you’ll pass it under a similar set of assumptions. But papers such as this use logic and math in order to say things precisely. What they claimed is supported by, and similar to, what they proved, but it isn’t the same thing, so it’s still an error, just as 3.9 is similar to 4 for most purposes, but it is an error to say that 2 + 1.9 = 4.
The thing is, some such reasoning has to be done in any case to interpret the paper. Even if no logical mistake was made, the F-test can’t possibly disprove a hypothesis such as “the means of these two distributions are different”. There is always room for an epsilon difference in the means to be compatible with the data. A similar objection was stated elsewhere on this thread already:
The failure to reject a null hypothesis is a failure. It doesn’t allow or even encourage you to conclude anything.
And of course it’s legitimate to give up at this step and say “the null hypothesis has not been rejected, so we have nothing to say”. But if we don’t do this, then our only recourse is to say something like: “with 95% certainty, the difference in means is less than X”. In other words, we may be fairly certain that 2 + 1.9 is less than 5, and we’re a bit less certain that 2 + 1.9 is less than 4, as well.
Incidentally, is there some standard statistical test that produces this kind of output?
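One candidate for this kind of output is a one-sided confidence bound on the difference in means. Equivalence testing (the "two one-sided tests" procedure) is built on the same idea. Below is a minimal sketch under loud assumptions: it uses a large-sample normal approximation rather than a t quantile, the function name is made up for illustration, and the data are toy numbers, not anything from the study. Strictly speaking the result is a frequentist confidence bound, so "with 95% certainty" is a loose reading of it.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def upper_bound_on_mean_difference(treated, control, confidence=0.95):
    """One-sided upper confidence bound on mean(treated) - mean(control).

    Uses the unpooled (Welch-style) standard error and a normal quantile,
    so it is only a reasonable approximation for fairly large samples.
    """
    diff = mean(treated) - mean(control)
    std_err = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    z = NormalDist().inv_cdf(confidence)
    return diff + z * std_err


# Toy data (hypothetical): identical samples, so the observed difference is 0.
treated = [1.0, 2.0, 3.0, 4.0]
control = [1.0, 2.0, 3.0, 4.0]
bound_95 = upper_bound_on_mean_difference(treated, control, 0.95)
bound_99 = upper_bound_on_mean_difference(treated, control, 0.99)
print(bound_95, bound_99)
```

Raising the confidence level pushes the bound X upward, which matches the "less than 5" versus "less than 4" intuition: the tighter the claim, the less confident we can be in it.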
I don’t think we actually disagree.
Edit: Nor does reality disagree with either of us.