Go back and read the part I added, with the bulleted list. You are trying to get all subtle. No; these people did an F-test, which gave a result of the form “It is not the case that for all x, P(x)”, and they interpreted that as meaning “For all x, it is not the case that P(x).”
I don’t think you responded to my criticisms and I have nothing further to add. However, there are a few critical mistakes in what you have added that you need to correct:
Now pay attention; this is the part everyone gets wrong, including most of the commenters below.
The methodology used in this study, and in most studies, is as follows:
Divide subjects into a test group and a control group.
No, Mattes and Gittelman ran an order-randomized crossover study. In crossover studies, subjects serve as their own controls and they are not partitioned into test and control groups.
If you don’t understand why that is so, read the articles about the t-test and the F-test. The tests compute what a difference in magnitude of response such that, 95% of the time, if the measured effect difference is that large, the null hypothesis (that the responses of all subjects in both groups were drawn from the same distribution) is false.
No, the correct form is:
The tests compute a difference in magnitude of response such that if the null hypothesis is true, then 95% of the time the measured effect is not that large.
The form you quoted is a deadly undergraduate mistake.
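A minimal simulation of the corrected statement, with invented numbers (nothing here is taken from the study): the critical difference is calibrated so that, if the null hypothesis is true, the measured difference exceeds it only 5% of the time.

```python
# Toy illustration of the corrected statement: the critical difference is the
# value that the measured difference exceeds only 5% of the time *when the
# null hypothesis is true*. All numbers below are invented.
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_sims = 20, 100_000

# Simulate many experiments in which the null hypothesis holds:
# both sets of scores are drawn from the same distribution.
a = rng.normal(loc=17.0, scale=5.0, size=(n_sims, n_per_group))
b = rng.normal(loc=17.0, scale=5.0, size=(n_sims, n_per_group))
null_diffs = a.mean(axis=1) - b.mean(axis=1)

# One-sided 95% critical difference under the null:
critical = np.quantile(null_diffs, 0.95)
print(f"critical difference under H0: {critical:.2f}")
print(f"P(measured diff >= critical | H0) ~= {np.mean(null_diffs >= critical):.3f}")
# Note what this does NOT give you: P(H0 | measured difference).
# That would require a prior over hypotheses.
```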
ADDED: People are making comments proving they don’t understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond differently to food dye.
Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour. When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude that the two distributions (test and control) are different.
If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity. Because they shouted/jumped only 20 times per hour, it failed to prove that food dye causes hyperactivity. That failure to prove is then taken as having proved that food dye does not cause hyperactivity, even though the evidence indicated that food dye causes hyperactivity.
This is wrong. There are reasonable prior distributions for which the observation of a small positive sample difference is evidence for a non-positive population difference. For example, this happens when the prior distribution for the population difference can be roughly factored into a null hypothesis and an alternative hypothesis that predicts a very large positive difference.
In particular, contrary to your claim, the small increase of 3 can be evidence that food dye does not cause hyperactivity if the prior distribution can be factored into a null hypothesis and an alternative hypothesis that predicts a positive response much greater than 3. This is analogous to one of Mattes and Gittelman’s central claims (they claim to have studied children for which the alternative hypothesis predicted a very large response).
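To make that concrete with invented numbers: if the alternative hypothesis predicts a response much larger than the observed difference of 3, the likelihood ratio favors the null.

```python
# Sketch of the point above with invented numbers: an observed difference of
# +3 favors the null over an alternative that predicted a much larger response.
from scipy.stats import norm

observed_diff = 3.0   # e.g. 20 vs 17 events per hour (made-up numbers)
se = 2.0              # assumed standard error of the measured difference

# H0: true difference = 0.  H1: true difference = 15 ("a very large response").
like_h0 = norm.pdf(observed_diff, loc=0.0, scale=se)
like_h1 = norm.pdf(observed_diff, loc=15.0, scale=se)
print(f"likelihood under H0: {like_h0:.3g}")
print(f"likelihood under H1: {like_h1:.3g}")
print(f"likelihood ratio (H0 : H1): {like_h0 / like_h1:.3g}")
# The small positive difference is far more probable under H0 than under this
# alternative, so it is evidence for the null, not against it.
```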
If you don’t understand why that is so, read the articles about the t-test and the F-test. The tests compute what a difference in magnitude of response such that, 95% of the time, if the measured effect difference is that large, the null hypothesis (that the responses of all subjects in both groups were drawn from the same distribution) is false.
No, the correct form is:
The tests compute a difference in magnitude of response such that if the null hypothesis is true, then 95% of the time the measured effect is not that large.
The form you quoted is a deadly undergraduate mistake.
I read through most of the comments and was surprised that so little was made of this. Thanks, VincentYu. For anyone who could use a more general wording, it’s the difference between:
P(E≥S|H), the probability P of the evidence E being at least as extreme as the test statistic S, assuming the hypothesis H is true, and
P(H|E), the probability P of the hypothesis H being true given the evidence E.
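A small simulation (with an invented 90% base rate of true nulls) shows how far apart these two quantities can be:

```python
# P(E >= S | H0) is pinned at 5% by the test, but P(H0 | significant result)
# depends on the base rate of true nulls, which is assumed here for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_experiments = 200_000
prior_h0 = 0.9           # assumed: 90% of tested hypotheses are null
effect_when_real = 1.0   # assumed true effect size when H0 is false
n = 25                   # sample size per experiment

h0_true = rng.random(n_experiments) < prior_h0
true_mean = np.where(h0_true, 0.0, effect_when_real)
sample_means = rng.normal(loc=true_mean, scale=1.0 / np.sqrt(n))
z = sample_means * np.sqrt(n)
significant = z > norm.ppf(0.95)   # one-sided test at the 5% level

print(f"P(significant | H0 true)  ~= {significant[h0_true].mean():.3f}")  # ~0.05
print(f"P(H0 true | significant)  ~= {h0_true[significant].mean():.3f}")  # much larger
```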
This is going to be yet another horrible post. I just go meta and personal. Sorry.
I don’t understand how this thread (and a few others like it) on stats can happen; in particular, your second point (re: the basic mistake). It is the single solitary thing any person who knows any stats at all knows. Am I wrong? Maybe ‘knows’ meaning ‘understands’. I seem to recall the same error made by Gwern (and pointed out). I mean the system works in the sense that these comments get upvoted, but it is like. . . people having strong technical opinions with very high confidence about Shakespeare without being able to write out a sentence. It is not inconceivable the opinions are good (stroke, language, etc), but it says something very odd about the community that it happens regularly and is not extremely noticed. My impression is that Less Wrong is insane on statistics, particularly, and some areas of physics (and social aspects of science and philosophy).
I didn’t read the original post, paper, or anything other than some comment by Goetz which seemed to show he didn’t know what a p-value was and had a gigantic mouth. It’s possible I’ve missed something basic. Normally, before concluding a madness in the world, I’d be careful. For me to be right here means madness is very very likely (e.g., if I correctly guess it’s −70 outside without checking any data, I know something unusual about where I live).
It is the single solitary thing any person who knows any stats at all knows.
Many people with statistics degrees, including statisticians and statistics professors, make the p-value fallacy; so perhaps your standards are too high if LWers merely being as good as statistics professors comes as a disappointment to you.
I seem to recall the same error made by Gwern (and pointed out).
I’ve pointed out the mis-interpretation of p-values many times (most recently, by Yvain), and wrote a post with the commonness of the misinterpretation as a major point (http://lesswrong.com/lw/g13/against_nhst/), so I would be a little surprised if I have made that error.
Sorry, Gwern, I may be slandering you, but I thought I noticed it long before that (I’ve been reading, despite my silence). Another thing I have accused you of, in my head, is a failure to appropriately apply a multiple test correction when doing some data exploration for trends in the Less Wrong survey. Again, I may have misidentified you. Such behavior would be striking, if true, since it seems to me to be one of the most basic complaints Less Wrong has about science (somewhat incorrectly).
Edited: Gwern is right (on my misremembering). Either I was skimming and didn’t notice Gwern was quoting or I just mixed corrector with corrected. Sorry about that. In possible recompense: What I would recommend you do for data exploration is decide ahead of time if you have some particularly interesting hypothesis or not. If not and you’re just going to check lots of stuff, then commit to that and the appropriate multiple test correction at the end. That level of correction then also saves your ‘noticing’ something interesting and checking it specifically being circular (because you were already checking ‘everything’ and correcting appropriately).
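For anyone who wants the mechanics, a minimal sketch of that ‘one correction at the end’ step, run over a batch of invented p-values with statsmodels:

```python
# Apply a single multiple-test correction to the whole batch of exploratory
# p-values at the end. The p-values below are invented for illustration.
from statsmodels.stats.multitest import multipletests

exploratory_pvalues = [0.001, 0.008, 0.020, 0.049, 0.120, 0.300, 0.440, 0.700]

# Holm controls the family-wise error rate; Benjamini-Hochberg controls the
# false discovery rate and is less conservative.
for method in ("holm", "fdr_bh"):
    reject, corrected, _, _ = multipletests(exploratory_pvalues, alpha=0.05,
                                            method=method)
    print(method, [f"{p:.3f}" for p in corrected], list(reject))
```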
Another thing I have accused you of, in my head, is a failure to appropriately apply a multiple test correction when doing some data exploration for trends in the less wrong survey.
It’s true I didn’t do any multiple correction for the 2012 survey, but I think you’re simply not understanding the point of multiple correction.
First, ‘Data exploration’ is precisely when you don’t want to do multiple correction, because when data exploration is being done properly, it’s being done as exploration, to guide future work, to discern what signals may be there for followup. But multiple correction controls the false positive rate at the expense of then producing tons of false negatives; this is not a trade-off we want to make in exploration. If you look at the comments, dozens of different scenarios and ideas are being looked at, and so we know in advance that any multiple correction is going to trash pretty much every single result, and so we won’t wind up with any interesting hypotheses at all! Predictably defeating the entire purpose of looking. Why would you do this wittingly? It’s one thing to explore data and find no interesting relationships at all (shit happens), but it’s another thing entirely to set up procedures which nearly guarantee that you’ll ignore any relationships you do find. And which multiple correction, anyway? I didn’t come up with a list of hypotheses and then methodically go through them, I tested things as people suggested them or I thought of them; should I have done a single multiple correction of them all yesterday? (But what if I think of a new hypothesis tomorrow...?)
Second, thresholds for alpha and beta are supposed to be set by decision-theoretic considerations of cost-benefit. A false positive in medicine can be very expensive in lives and money, and hence any exploratory attitude, or undeclared data mining/dredging, is a serious issue (and one I fully agree with Ioannidis on). In those scenarios, we certainly do want to reduce the false positives even if we’re forced to increase the false negatives. But this is just an online survey. It’s done for personal interest, kicks, and maybe a bit of planning or coordination by LWers. It’s also a little useful for rebutting outside stereotypes about intellectual monoculture or homogeneity. In this context, a false positive is not a big deal, and no worse than a false negative. (In fact, rather than sacrifice a disproportionate amount of beta in order to decrease alpha more, we might want to actually increase our alpha!)
This cost-benefit is a major reason why if you look through my own statistical analyses and experiments, I tend to only do multiple correction in cases where I’ve pre-specified my metrics (self-experiments are not data exploration!) and where a false positive is expensive (literally, in the case of supplements, since they cost a non-trivial amount of $ over a lifetime). So in my Zeo experiments, you will see me use multiple correction for melatonin, standing, & 2 Vitamin D experiments (and also in a recent non-public self-experiment); but you won’t see any multiple correction in my exploratory weather analysis.
What I would recommend you do for data exploration is decide ahead of time if you have some particularly interesting hypothesis or not. If not and you’re just going to check lots of stuff, then commit to that and the appropriate multiple test correction at the end.
See above on why this is pointless and inappropriate.
That level of correction then also saves your ‘noticing’ something interesting and checking it specifically being circular (because you were already checking ‘everything’ and correcting appropriately).
If you were doing it at the end, then this sort of ‘double-testing’ would be a concern as it might lead your “actual” number of tests to differ from your “corrected against” number of tests. But it’s not circular, because you’re not doing multiple correction. The positives you get after running a bunch of tests will not have a very high level of confidence, but that’s why you then take them as your new fixed set of specific hypotheses to run against the next dataset and, if the results are important, then perhaps do multiple correction.
So for example, if I cared that much about the LW survey results from the data exploration, what I should ideally do is collect the n positive results I care about, announce in advance the exact analysis I plan to do with the 2013 dataset, and decide in advance whether and what kind of multiple correction I want to do. The 2012 results using 2012 data suggest n hypotheses, and I would then actually test them with the 2013 data. (As it happens, I don’t care enough, so I haven’t.)
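A schematic sketch of that exploration-then-confirmation workflow, with hypothetical survey column names and arbitrarily chosen tests (the split is the point, not these particular tests):

```python
# Explore on one year's survey without correction, pre-register the hits,
# then confirm on the next year's survey with one correction at the end.
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

def explore(survey_2012, candidate_pairs):
    """Uncorrected screening pass over a DataFrame: keep anything that looks interesting."""
    hits = []
    for x, y in candidate_pairs:
        res = spearmanr(survey_2012[x], survey_2012[y], nan_policy="omit")
        if res.pvalue < 0.05:   # deliberately lenient; false positives are cheap here
            hits.append((x, y))
    return hits

def confirm(survey_2013, preregistered_pairs, alpha=0.05):
    """Confirmatory pass on the new dataset, with one multiple-test correction."""
    pvals = [spearmanr(survey_2013[x], survey_2013[y], nan_policy="omit").pvalue
             for x, y in preregistered_pairs]
    reject, corrected, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    return list(zip(preregistered_pairs, corrected, reject))

# e.g. hypotheses = explore(df_2012, [("IQ", "Karma"), ("Age", "Donations")])
#      results    = confirm(df_2013, hypotheses)      # column names are hypothetical
```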
Gwern, I should be able to say that I appreciate the time you took to respond (which is snarky enough), but I am not able to do so. You can’t trust that your response to me is inappropriate, and I can’t find any reason to invest myself in proving that it is. I’ll agree my comment to you was somewhat inappropriate, and while turnabout is fair play (and first provocation warrants an added response), it is not helpful here (whether deliberate or not). Separate from that, I disagree with you (your response is, historically, how people have managed to be wrong a lot). I’ll retire once more.
I believe it was suggested to me when I first asked the potential value of this place that they could help me with my math.
I think you’re interpreting the F-test a little more strictly than you should. Isn’t it fairer to say a null result on an F-test is “It is not the case that for most x, P(x)”, with “most” defined in a particular way?
You’re correct that an F-test is miserable at separating out different classes of responders. (In fact, it should be easy to develop a test that does separate out different classes of responders; I’ll have to think about that. Maybe just fit a GMM with three modes in a way that tries to maximize the distance between the modes?)
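A rough sketch of that idea on simulated difference scores, using an off-the-shelf Gaussian mixture fit by maximum likelihood (so it does not explicitly maximize the distance between modes, and none of these numbers come from the paper):

```python
# Fit one-, two-, and three-component Gaussian mixtures to each child's
# (dye - placebo) difference score and compare BIC; a clear win for k > 1
# would suggest distinct responder classes. Data are simulated.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Simulated difference scores: most children unaffected, a few responders.
diffs = np.concatenate([rng.normal(0, 1, 90), rng.normal(6, 1, 10)]).reshape(-1, 1)

for k in (1, 2, 3):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(diffs)
    print(f"k={k}  BIC={gmm.bic(diffs):.1f}  means={np.round(gmm.means_.ravel(), 2)}")
```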
But I think the detail that you suppressed for brevity also makes a significant difference in how the results are interpreted. This paper doesn’t make the mistake of saying “artificial food coloring does not cause hyperactivity in every child, therefore artificial food coloring affects no children.” The paper says “artificial food coloring does not cause hyperactivity in every child whose parents confidently expect them to respond negatively to artificial food coloring, therefore their parents’ expectation is mistaken at the 95% confidence level.”
Now, it could be the case that there are children who do respond negatively to artificial food coloring, but the Feingold association is terrible at finding them / rejecting those children where it doesn’t have an effect. (This is unsurprising from a Hawthorne Effect or confirmation bias perspective.) As well, for small sample sizes, it seems better to use F and t tests than to try to separate out the various classes of responders, because the class sizes will be tiny; if one child responds poorly after being administered artificial food dye, that’s not much to go on, compared to a distinct subpopulation of 20 children in a sample of 1000.
The section of the paper where they describe their reference class:
If artificial additives affect only a small proportion of hyperactive children, significant dietary effects are unlikely to be detected in heterogeneous samples of hyperactive children. Therefore, children who had been placed on the Feingold diet by their parents and who were reported by their parents to have derived marked behavioral benefit from the diet and to experience marked deterioration when given artificial food colorings were targeted for this study. This sampling approach, combined with high dosage, was chosen to maximize the likelihood of observing behavioral deterioration with ingestion of artificial colorings.
(I should add that the first sentence is especially worth contemplating, here.)
I think I disagree with both of you here. The failure to reject a null hypothesis is a failure. It doesn’t allow or even encourage you to conclude anything.
I think I disagree with both of you here. The failure to reject a null hypothesis is a failure. It doesn’t allow or even encourage you to conclude anything.
Can you conclude that you failed to reject the null hypothesis? And if you expected to reject the null hypothesis, isn’t that failure meaningful? (Note that my language carefully included the confidence value.)
As a general comment, this is why the Bayesian approach is much more amenable to knowledge-generation than the frequentist approach. The statement “the hyperactivity increase in the experimental group was 0.36+/-2.00, and that range solidly includes 0” (with the variance of that estimate pulled out of thin air) is much more meaningful than “we can’t be sure it’s not zero.”
As a general comment, this is why Bayesian statistics is much more amenable to knowledge-generation than frequentist statistics. The statement “the hyperactivity increase in the experimental group was 0.36+/-2.00, and that range solidly includes 0” (with the variance of that estimate pulled out of thin air) is much more meaningful than “we can’t be sure it’s not zero.”
I agree with the second sentence, and the first might be true, but the second isn’t evidence for the first; interval estimation vs. hypothesis testing is an independent issue to Bayesianism vs. frequentism. There are Bayesian hypothesis tests and frequentist interval estimates.
Agreed that both have those tools, and rereading my comment I think “approach” may have been a more precise word than “statistics.” If you think in terms of “my results are certain, reality is uncertain” then the first tool you reach for is “let’s make an interval estimate / put a distribution on reality,” whereas if you think in terms of “reality is certain, my results are uncertain” then the first tool you reach for is hypothesis testing. Such defaults have very important effects on what actually gets used in studies.
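As an illustration of those defaults, here is one invented dataset summarized both ways: an interval estimate of the effect size, and a point-null significance test.

```python
# The same invented data summarized both ways: an interval on the size of the
# effect versus a yes/no significance test. Numbers are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
dye = rng.normal(17.4, 5.0, 20)       # hypothetical scores on dye days
placebo = rng.normal(17.0, 5.0, 20)   # hypothetical scores on placebo days
diff = dye - placebo                  # paired, as in a crossover design

# "Reality is uncertain": estimate the effect with an interval.
mean, sem = diff.mean(), stats.sem(diff)
lo, hi = stats.t.interval(0.95, df=len(diff) - 1, loc=mean, scale=sem)
print(f"estimated increase: {mean:.2f}, 95% interval [{lo:.2f}, {hi:.2f}]")

# "My results are uncertain": test against a point null.
t, p = stats.ttest_rel(dye, placebo)
print(f"paired t-test: t={t:.2f}, p={p:.3f}")
```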
And if you expected to reject the null hypothesis, isn’t that failure meaningful?
To me, but not to the theoretical foundations of the method employed.
Hypothesis testing generally works sensibly because people smuggle in intuitions that aren’t part of the foundations of the method. But since they’re only smuggling things in under a deficient theoretical framework, they’re given to mistakes, particularly when they’re applying their intuitions to the theoretical framework and not the base data.
I agree with the later comment on Bayesian statistics, and I’d go further. Scatterplot the labeled data, or show the distribution if you have tons of data. That’s generally much more productive than any particular confidence interval you might construct.
It would be an interesting study to compare the various statistical tests on the same hypothesis versus the human eyeball. I think the eyeball will hold its own.
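A minimal sketch of the ‘scatterplot the labeled data’ suggestion above, on simulated scores:

```python
# Plot every labeled observation so the eyeball sees the whole distribution,
# not just a single test statistic. Values are simulated.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
placebo = rng.normal(17, 5, 20)
dye = rng.normal(20, 5, 20)

plt.scatter(np.zeros_like(placebo), placebo, alpha=0.6, label="placebo days")
plt.scatter(np.ones_like(dye), dye, alpha=0.6, label="dye days")
plt.xticks([0, 1], ["placebo", "dye"])
plt.ylabel("events per hour (simulated)")
plt.legend()
plt.show()
```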
It’s possible I’ve missed something basic.
Nope, I don’t think you have. Not everyone is crazy, but scholarship is pretty atrocious.