I think the picture is not actually so grim: the study does reject an entire class of (distributions of) effects on the population.
Specifically, it cannot be the case (with 95% certainty or whatever) that a significant proportion of children are made hyperactive, while the remainder are unaffected. This does leave a few possibilities:
Only a small fraction of the children were affected by the intervention.
Although a significant fraction of the children were affected by the intervention in one direction, the remainder were affected in the opposite direction.
A mix of the two (e.g. a strong positive effect in a few children, and a weak negative effect in many others).
The first possibility would be eliminated by a study with more participants (the smaller the fraction of children affected, the more total children you need to notice).
The second possibility is likely to be missed by the test entirely, since the net effect is much weaker than the net absolute effect. However, careful researchers should notice that the response distribution is bimodal (again, given sufficiently many children). Of course, if the researchers aren’t careful...
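Both points can be illustrated with a toy simulation. Everything here is invented for illustration (effect sizes, sample sizes, the 10%/20% fractions are mine, not the study's), and I use a simple two-sample z-style comparison as a stand-in for the study's F-test (with two groups, the F statistic is just the square of the t statistic):

```python
# Toy simulation (all numbers invented): why an effect confined to a small
# fraction of children, or one that cuts both ways, is hard to detect with
# a test that compares group means.
import random
import statistics

def simulate_trial(n, frac_up, frac_down, effect, rng):
    """One simulated trial: n control and n treated children; only some respond."""
    control = [rng.gauss(0, 1) for _ in range(n)]
    treated = []
    for _ in range(n):
        u = rng.random()
        if u < frac_up:                    # responds in the positive direction
            treated.append(rng.gauss(effect, 1))
        elif u < frac_up + frac_down:      # responds in the negative direction
            treated.append(rng.gauss(-effect, 1))
        else:                              # unaffected
            treated.append(rng.gauss(0, 1))
    return control, treated

def rejection_rate(n, frac_up, frac_down, effect=1.0, trials=400, seed=0):
    """Fraction of trials in which a two-sample z-style test rejects at ~5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        control, treated = simulate_trial(n, frac_up, frac_down, effect, rng)
        diff = statistics.mean(treated) - statistics.mean(control)
        se = (statistics.variance(treated) / n
              + statistics.variance(control) / n) ** 0.5
        if abs(diff / se) > 1.96:
            rejections += 1
    return rejections / trials

print(rejection_rate(n=30, frac_up=1.0, frac_down=0.0))    # effect in everyone: nearly always detected
print(rejection_rate(n=30, frac_up=0.1, frac_down=0.0))    # effect in 10%: usually missed at this size
print(rejection_rate(n=1000, frac_up=0.1, frac_down=0.0))  # same 10%, larger sample: detected far more often
print(rejection_rate(n=1000, frac_up=0.2, frac_down=0.2))  # balanced +/- responders: means cancel, near the 5% floor
```

The last line is the second possibility above: even at n = 1000, a mean-comparison test barely beats its own false-positive rate when positive and negative responders cancel, although a histogram of the treated group would show the bimodality.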
Specifically, it cannot be the case (with 95% certainty or whatever) that a significant proportion of children are made hyperactive, while the remainder are unaffected.
Specifically, it cannot be the case, with 95% certainty, that all children are made hyperactive. That is exactly what they proved with their F-tests (though if you look at the raw data, the measures of hyperactivity they used conflicted with each other so often that it’s hard to believe they measured anything at all). They did not prove, for instance, that it cannot be the case, with 95% certainty, that all but one of the children are made hyperactive.
Yet they claimed, as I quoted, that they proved that no children are made hyperactive. It’s a logic error with large consequences in healthcare and in other domains.
You’re correct that the study data is useful and rules out some possibilities. But the claim they made in their summary is much stronger than what they showed.
They did not prove, for instance, that it cannot be the case, with 95% certainty, that all but one of the children are made hyperactive.
They did not say this but I am confident that if this bizarre hypothesis (all but one of what group of children, exactly?) were tested, the test would reject it as well. (Ignoring the conflicting-measures-of-hyperactivity point, which I am not competent to judge.)
In general, the F-test does not reject all alternate hypotheses equally, which is a problem but a different one. However, it provides evidence against all hypotheses that imply an aggregate difference between the test group and control group: equivalently, we’re testing if the means are different.
If all children are made hyperactive, the means will be different, and the study rejects this hypothesis. But if 99% of children are made hyperactive to the same degree, the means will be different by almost the same amount, and the test would also reject this hypothesis, though not as strongly. I don’t care how you wish to do the calculations, but any hypothesis that suggests the means are different is in fact made less likely by the study.
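A quick simulated check of the "99% vs. 100%" point (invented numbers, and again a z-style statistic standing in for the F-test):

```python
# Simulated data (made-up effect size): a mean-comparison test can barely
# distinguish "every child is affected" from "99% of children are affected",
# because the group means differ by almost the same amount in both cases.
import random
import statistics

def z_stat(a, b):
    """Normal-approximation test statistic for a difference in group means."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return diff / se

rng = random.Random(1)
n = 5000
control = [rng.gauss(0, 1) for _ in range(n)]
all_affected = [rng.gauss(1.0, 1) for _ in range(n)]                       # 100% shifted by 1
almost_all = [rng.gauss(1.0, 1) if rng.random() < 0.99 else rng.gauss(0, 1)
              for _ in range(n)]                                           # 99% shifted by 1

print(z_stat(all_affected, control))  # very large
print(z_stat(almost_all, control))    # almost exactly as large
```

The two statistics come out within a couple of percent of each other: evidence against one hypothesis is, inevitably, nearly the same evidence against the other.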
And hence my list of alternate hypotheses that may be worth considering, and are not penalized as much as others. Let’s recap:
If the effect is present but weak, we expect the means to be close to equal, so the statistical test results don’t falsify this hypothesis. However, we also don’t care about effects that are uniformly weak.
If the effect is strong but present in a small fraction of the population, the means will also be close to equal, and we do care about such an effect. Quantifying “strong” lets us quantify “small”.
We can allow the effect to be strong and present in a larger fraction of the population, if we suppose that some or all of the remaining children are actually affected negatively.
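To make the second point concrete (the symbols $p$, $d$, and $X$ are mine, not the study's): if a fraction $p$ of children shift by $d$ and the rest are unaffected, the treated-group mean shifts by $p\,d$, so a study that bounds the shift in means bounds the product:

```latex
\[
  \underbrace{p\,d}_{\text{shift in the treated mean}} \le X
  \quad\Longleftrightarrow\quad
  p \le \frac{X}{d},
\]
```

i.e. a lower bound on how "strong" the per-child effect $d$ must be to matter yields an upper bound on how "small" the affected fraction $p$ can be while staying compatible with the data.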
If all children are made hyperactive, the means will be different, and the study rejects this hypothesis. But if 99% of children are made hyperactive to the same degree, the means will be different by almost the same amount, and the test would also reject this hypothesis, though not as strongly.
This is math. You can’t say “If 2+2 = 4, then 2+1.9 = 4.” There is no “as strongly” being reported here. There is only accept or reject.
The study rejects a hypothesis using a specific number that was computed under the assumption that the effect is the same in all children. That specific number is not the correct number to reject the hypothesis that the effect is the same in all but one.
It might so happen that the data used in the study would reject that hypothesis, if the correct threshold for it were computed. But the study did not do that, so it cannot claim to have proven that.
The reality in this case is that food dye promotes hyperactivity in around 15% of children. The correct F-value threshold to reject that hypothesis would be much, much lower!
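A sketch of the threshold point (effect size, sample size, and the rejection rule are all my inventions; the z-style statistic again stands in for the F-test). To reject a hypothesis, the observed statistic must be improbably *low* under that hypothesis, and what counts as improbably low depends on the hypothesis:

```python
# Under "all children are affected", large test statistics are expected, so
# even a moderately large observed value can reject that hypothesis. Under
# "15% of children are affected", the expected mean difference is small, so
# only a far lower observed value would be surprising.
import random
import statistics

def z_stat(a, b):
    """Normal-approximation test statistic for a difference in group means."""
    diff = statistics.mean(a) - statistics.mean(b)
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return diff / se

def z_distribution(frac_affected, effect=1.0, n=100, trials=500, seed=2):
    """Sample the statistic under 'frac_affected of children shift by effect'."""
    rng = random.Random(seed)
    stats = []
    for _ in range(trials):
        control = [rng.gauss(0, 1) for _ in range(n)]
        treated = [rng.gauss(effect, 1) if rng.random() < frac_affected
                   else rng.gauss(0, 1) for _ in range(n)]
        stats.append(z_stat(treated, control))
    return sorted(stats)

# 5th percentile of each distribution: reject the hypothesis (at 95%) only
# if the observed statistic falls below this value.
all_affected = z_distribution(1.0)
some_affected = z_distribution(0.15)
print(all_affected[25])   # threshold under "all affected": well above zero
print(some_affected[25])  # threshold under "15% affected": much, much lower
```

An observed statistic that falls below the first threshold but above the second rejects "all children are affected" while saying almost nothing against "15% of children are affected", which is the logic error in a nutshell.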
You’re correct in a broader sense that passing the F-test under one set of assumptions is strong evidence that you’ll pass it under a similar set of assumptions. But papers such as this use logic and math in order to say things precisely, and while what they claimed is supported by, and similar to, what they proved, it isn’t the same thing, so it’s still an error: 3.9 is similar to 4 for most purposes, but it is an error to say that 2 + 1.9 = 4.
The thing is, some such reasoning has to be done in any case to interpret the paper. Even if no logical mistake was made, the F-test can’t possibly disprove a hypothesis such as “the means of these two distributions are different”. There is always room for an epsilon difference in the means to be compatible with the data. A similar objection was stated elsewhere on this thread already:
The failure to reject a null hypothesis is a failure. It doesn’t allow or even encourage you to conclude anything.
And of course it’s legitimate to give up at this step and say “the null hypothesis has not been rejected, so we have nothing to say”. But if we don’t do this, then our only recourse is to say something like: “with 95% certainty, the difference in means is less than X”. In other words, we may be fairly certain that 2 + 1.9 is less than 5, and we’re a bit less certain that 2 + 1.9 is less than 4, as well.
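The "difference in means is less than X" statement is essentially a confidence interval. A minimal sketch (simulated data with no true effect; the numbers and the normal approximation are my choices):

```python
# A 95% confidence interval for the difference between two group means.
# Failing to reject the null does not show the difference is zero; the
# interval shows how large a difference is still compatible with the data.
import random
import statistics

rng = random.Random(3)
control = [rng.gauss(0, 1) for _ in range(200)]
treated = [rng.gauss(0, 1) for _ in range(200)]  # simulated: no true effect

diff = statistics.mean(treated) - statistics.mean(control)
se = (statistics.variance(treated) / len(treated)
      + statistics.variance(control) / len(control)) ** 0.5

# Normal-approximation 95% interval for the difference in means.
low, high = diff - 1.96 * se, diff + 1.96 * se
print(f"95% CI for the difference in means: [{low:.3f}, {high:.3f}]")
```

Equivalence-testing procedures (for example TOST, "two one-sided tests") formalize exactly this kind of "the difference is less than X" conclusion.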
Incidentally, is there some standard statistical test that produces this kind of output?
I don’t think we actually disagree.
Edit: Nor does reality disagree with either of us.