I don’t think you responded to my criticisms and I have nothing further to add. However, there are a few critical mistakes in what you have added that you need to correct:
Now pay attention; this is the part everyone gets wrong, including most of the commenters below.
The methodology used in this study, and in most studies, is as follows:
Divide subjects into a test group and a control group.
No, Mattes and Gittelman ran an order-randomized crossover study. In crossover studies, subjects serve as their own controls and they are not partitioned into test and control groups.
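To make the crossover point concrete, here is a minimal Python sketch (standard library only) that analyses the same simulated numbers two ways: as paired within-subject differences, and as if the two conditions were separate groups. Everything in it is an illustrative assumption, not data from Mattes and Gittelman: the subject count, the baseline spread, the noise level and the 1-unit effect are made up, and `simulate_crossover` and `standard_errors` are hypothetical helper names.

```python
import random
from statistics import mean, stdev, variance

def simulate_crossover(n=20, effect=1.0, seed=1):
    """Simulate n subjects, each measured under placebo and under dye.

    Hypothetical numbers: baselines vary a lot between subjects (sd 10),
    while within-subject measurement noise is small (sd 1).
    """
    rng = random.Random(seed)
    placebo, dye = [], []
    for _ in range(n):
        baseline = rng.gauss(17.0, 10.0)  # subject-specific activity level
        placebo.append(baseline + rng.gauss(0.0, 1.0))
        dye.append(baseline + effect + rng.gauss(0.0, 1.0))
    return placebo, dye

def standard_errors(placebo, dye):
    """Return (paired_se, unpaired_se) for the mean difference dye - placebo."""
    n = len(placebo)
    diffs = [d - p for p, d in zip(placebo, dye)]
    # Paired analysis: each subject serves as their own control.
    paired_se = stdev(diffs) / n ** 0.5
    # Unpaired analysis: wrongly treats the two conditions as separate groups.
    unpaired_se = (variance(placebo) / n + variance(dye) / n) ** 0.5
    return paired_se, unpaired_se

placebo, dye = simulate_crossover()
paired_se, unpaired_se = standard_errors(placebo, dye)
print(f"mean difference: {mean(d - p for p, d in zip(placebo, dye)):.2f}")
print(f"paired SE: {paired_se:.2f}, unpaired SE: {unpaired_se:.2f}")
```

Under these assumptions the between-subject spread cancels in the per-subject differences, so the paired standard error should come out far smaller than the unpaired one. That is the practical reason a crossover design should not be described as a test group versus a control group.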
If you don’t understand why that is so, read the articles about the t-test and the F-test. The tests compute a difference in magnitude of response such that, 95% of the time, if the measured effect difference is that large, the null hypothesis (that the responses of all subjects in both groups were drawn from the same distribution) is false.
No, the correct form is:
The tests compute a difference in magnitude of response such that if the null hypothesis is true, then 95% of the time the measured effect is not that large.
The form you quoted is a deadly undergraduate mistake.
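The corrected statement can be checked by simulation. The sketch below is a simplification, not the t-test or F-test itself: it uses a two-sided z-test with a known standard deviation, and the group size (30), standard deviation (5) and baseline rate (17 per hour) are assumptions chosen for illustration. It computes the critical difference, then generates many experiments in which the null hypothesis is true by construction and counts how often the measured difference is at least that large.

```python
import random
from statistics import NormalDist

def critical_difference(sigma, n, alpha=0.05):
    """Two-sided critical value for a difference in group means (z-test, known sigma)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sigma * (2 / n) ** 0.5

def exceed_rate_under_null(sigma=5.0, n=30, trials=10000, seed=0):
    """Fraction of null-true experiments whose |difference| reaches the critical value."""
    rng = random.Random(seed)
    crit = critical_difference(sigma, n)
    hits = 0
    for _ in range(trials):
        # Both groups are drawn from the same distribution: the null is true.
        group_a = sum(rng.gauss(17.0, sigma) for _ in range(n)) / n
        group_b = sum(rng.gauss(17.0, sigma) for _ in range(n)) / n
        if abs(group_a - group_b) >= crit:
            hits += 1
    return hits / trials

print(f"critical difference: {critical_difference(5.0, 30):.2f}")
print(f"exceed rate under the null: {exceed_rate_under_null():.3f}")  # should be close to 0.05
```

The simulation only ever conditions on the null being true, so it says how often a large difference appears given the null. It says nothing about how often the null is false given a large difference, which is what the quoted form claims.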
ADDED: People are making comments proving they don’t understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond differently to food dye.
Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour. When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude that the two distributions (test and control) are different.
If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity. Because they shouted/jumped only 20 times per hour, it failed to prove that food dye causes hyperactivity. That failure to prove is then taken as having proved that food dye does not cause hyperactivity, even though the evidence indicated that food dye causes hyperactivity.
This is wrong. There are reasonable prior distributions for which the observation of a small positive sample difference is evidence for a non-positive population difference. For example, this happens when the prior distribution for the population difference can be roughly factored into a null hypothesis and an alternative hypothesis that predicts a very large positive difference.
In particular, contrary to your claim, the small increase of 3 can be evidence that food dye does not cause hyperactivity if the prior distribution can be factored into a null hypothesis and an alternative hypothesis that predicts a positive response much greater than 3. This is analogous to one of Mattes and Gittelman’s central claims (they claim to have studied children for which the alternative hypothesis predicted a very large response).
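A worked example of this, as a rough sketch under stated assumptions: take the observed difference of 3 from the example, assume a standard error of 2 (so the two-sided 5% critical value is about 4, as in the example), and collapse the prior into two point hypotheses, a null of 0 and an alternative predicting a response of 15. The value 15, the 50/50 prior and the function name `posterior_null` are all hypothetical choices for illustration.

```python
from statistics import NormalDist

def posterior_null(observed, se, alt_effect, prior_null=0.5):
    """Posterior probability of the null (true difference 0) against a point
    alternative (true difference alt_effect), given a normally distributed
    sample difference with standard error se."""
    like_null = NormalDist(0.0, se).pdf(observed)
    like_alt = NormalDist(alt_effect, se).pdf(observed)
    weighted_null = prior_null * like_null
    return weighted_null / (weighted_null + (1 - prior_null) * like_alt)

# Alternative predicts a very large response: observing 3 favours the null.
print(f"alternative = 15: P(null | data) = {posterior_null(3.0, 2.0, 15.0):.6f}")
# Alternative predicts a response of exactly 3: the same data favour the alternative.
print(f"alternative = 3:  P(null | data) = {posterior_null(3.0, 2.0, 3.0):.6f}")
```

The same observation moves the posterior in opposite directions depending on what the alternative predicts, so the direction of the evidence cannot be read off the sign of the sample difference alone.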
If you don’t understand why that is so, read the articles about the t-test and the F-test. The tests compute a difference in magnitude of response such that, 95% of the time, if the measured effect difference is that large, the null hypothesis (that the responses of all subjects in both groups were drawn from the same distribution) is false.
No, the correct form is:
The tests compute a difference in magnitude of response such that if the null hypothesis is true, then 95% of the time the measured effect is not that large.
The form you quoted is a deadly undergraduate mistake.
I read through most of the comments and was surprised that so little was made of this. Thanks, VincentYu. For anyone who could use a more general wording, it’s the difference between:
P(E≥S|H) the probability P of the evidence E being at least as extreme as test statistic S assuming the hypothesis H is true, and
P(H|E) the probability P of the hypothesis H being true given the evidence E.
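A small numeric sketch of how far apart the two quantities can be. The inputs are assumptions for illustration only: a 5% significance threshold, 80% power, and a field in which 90% of the hypotheses tested are true nulls. `prob_null_given_significant` is a hypothetical helper that applies Bayes’ rule.

```python
def prob_null_given_significant(alpha, power, prior_null):
    """P(null is true | result is significant), by Bayes' rule.

    alpha      = P(significant | null true)
    power      = P(significant | null false)
    prior_null = P(null true) before seeing the data
    """
    false_positives = alpha * prior_null
    true_positives = power * (1 - prior_null)
    return false_positives / (false_positives + true_positives)

alpha = 0.05
posterior = prob_null_given_significant(alpha, power=0.8, prior_null=0.9)
print(f"P(significant | null) = {alpha:.2f}")
print(f"P(null | significant) = {posterior:.2f}")
```

With these inputs the first number is 0.05 by construction while the second works out to 0.045 / 0.125 = 0.36, so reading a p-value as the probability that the null is true can be off by a wide margin.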
This is going to be yet another horrible post. I just go meta and personal. Sorry.
I don’t understand how this thread (and a few others like it) on stats can happen; in particular, your second point (re: the basic mistake). It is the single solitary thing any person who knows any stats at all knows. Am I wrong? Maybe ‘knows’ meaning ‘understands’. I seem to recall the same error made by Gwern (and pointed out). I mean the system works in the sense that these comments get upvoted, but it is like... people having strong technical opinions with very high confidence about Shakespeare without being able to write out a sentence. It is not inconceivable the opinions are good (stroke, language, etc.), but it says something very odd about the community that it happens regularly and is not much noticed. My impression is that Less Wrong is insane on statistics, particularly, and some areas of physics (and social aspects of science and philosophy).
I didn’t read the original post, paper, or anything other than some comment by Goetz which seemed to show he didn’t know what a p-value was and had a gigantic mouth. It’s possible I’ve missed something basic. Normally, before concluding a madness in the world, I’d be careful. For me to be right here means madness is very very likely (e.g., if I correctly guess it’s −70 outside without checking any data, I know something unusual about where I live).
It is the single solitary thing any person who knows any stats at all knows.
Many people with statistics degrees or statisticians or statistics professors make the p-value fallacy; so perhaps your standards are too high if LWers merely being as good as statistics professors comes as a disappointment to you.
I seem to recall the same error made by Gwern (and pointed out).
I’ve pointed out the misinterpretation of p-values many times (most recently, by Yvain), and wrote a post with the commonness of the misinterpretation as a major point (http://lesswrong.com/lw/g13/against_nhst/), so I would be a little surprised if I have made that error.
Sorry, Gwern, I may be slandering you, but I thought I noticed it long before that (I’ve been reading, despite my silence). Another thing I have accused you of, in my head, is a failure to appropriately apply a multiple test correction when doing some data exploration for trends in the Less Wrong survey. Again, I may have you misidentified. Such behavior is striking, if true, since it seems to me one of the most basic complaints Less Wrong has about science (somewhat incorrectly).
Edited: Gwern is right (on my misremembering). Either I was skimming and didn’t notice Gwern was quoting or I just mixed corrector with corrected. Sorry about that. In possible recompense: What I would recommend you do for data exploration is decide ahead of time if you have some particularly interesting hypothesis or not. If not and you’re just going to check lots of stuff, then commit to that and the appropriate multiple test correction at the end. That level of correction then also saves your ‘noticing’ something interesting and checking it specifically from being circular (because you were already checking ‘everything’ and correcting appropriately).
Another thing I have accused you of, in my head, is a failure to appropriately apply a multiple test correction when doing some data exploration for trends in the Less Wrong survey.
It’s true I didn’t do any multiple correction for the 2012 survey, but I think you’re simply not understanding the point of multiple correction.
First, ‘Data exploration’ is precisely when you don’t want to do multiple correction, because when data exploration is being done properly, it’s being done as exploration, to guide future work, to discern what signals may be there for followup. But multiple correction controls the false positive rate at the expense of then producing tons of false negatives; this is not a trade-off we want to make in exploration. If you look at the comments, dozens of different scenarios and ideas are being looked at, and so we know in advance that any multiple correction is going to trash pretty much every single result, and so we won’t wind up with any interesting hypotheses at all! Predictably defeating the entire purpose of looking. Why would you do this wittingly? It’s one thing to explore data and find no interesting relationships at all (shit happens), but it’s another thing entirely to set up procedures which nearly guarantee that you’ll ignore any relationships you do find. And which multiple correction, anyway? I didn’t come up with a list of hypotheses and then methodically go through them, I tested things as people suggested them or I thought of them; should I have done a single multiple correction of them all yesterday? (But what if I think of a new hypothesis tomorrow...?)
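The trade-off between false positives and false negatives can be put in numbers. The following is a rough sketch, not anything taken from the survey analysis: it assumes a two-sided z-test, a true standardized effect of 2.8 (which gives roughly 80% power at an uncorrected alpha of 0.05), and a Bonferroni correction across 50 tests. The effect size, the number of tests and the function names are all hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_two_sided(alpha, delta):
    """Power of a two-sided z-test when the true standardized effect is delta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - delta)   # detected in the right direction
    lower_tail = STD_NORMAL.cdf(-z_crit - delta)      # detected in the wrong direction
    return upper_tail + lower_tail

def bonferroni_alpha(alpha, num_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / num_tests

delta = 2.8      # assumed true effect, in standard-error units
num_tests = 50   # assumed number of exploratory comparisons

uncorrected = power_two_sided(0.05, delta)
corrected = power_two_sided(bonferroni_alpha(0.05, num_tests), delta)
print(f"power without correction:   {uncorrected:.2f}")
print(f"power with Bonferroni (50): {corrected:.2f}")
print(f"false-negative rate rises from {1 - uncorrected:.2f} to {1 - corrected:.2f}")
```

Under these assumptions a real effect that would be detected about four times in five without correction is detected less than one time in three after it. Whether that price is worth paying depends on what a false positive costs.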
Second, thresholds for alpha and beta are supposed to be set by decision-theoretic considerations of cost-benefit. A false positive in medicine can be very expensive in lives and money, and hence any exploratory attitude, or undeclared data mining/dredging, is a serious issue (and one I fully agree with Ioannidis on). In those scenarios, we certainly do want to reduce the false positives even if we’re forced to increase the false negatives. But this is just an online survey. It’s done for personal interest, kicks, and maybe a bit of planning or coordination by LWers. It’s also a little useful for rebutting outside stereotypes about intellectual monoculture or homogeneity. In this context, a false positive is not a big deal, and no worse than a false negative. (In fact, rather than sacrifice a disproportionate amount of beta in order to decrease alpha more, we might want to actually increase our alpha!)
This cost-benefit is a major reason why if you look through my own statistical analyses and experiments, I tend to only do multiple correction in cases where I’ve pre-specified my metrics (self-experiments are not data exploration!) and where a false positive is expensive (literally, in the case of supplements, since they cost a non-trivial amount of $ over a lifetime). So in my Zeo experiments, you will see me use multiple correction for melatonin, standing, & 2 Vitamin D experiments (and also in a recent non-public self-experiment); but you won’t see any multiple correction in my exploratory weather analysis.
What I would recommend you do for data exploration is decide ahead of time if you have some particularly interesting hypothesis or not. If not and you’re just going to check lots of stuff, then commit to that and the appropriate multiple test correction at the end.
See above on why this is pointless and inappropriate.
That level of correction then also saves your ‘noticing’ something interesting and checking it specifically from being circular (because you were already checking ‘everything’ and correcting appropriately).
If you were doing it at the end, then this sort of ‘double-testing’ would be a concern as it might lead your “actual” number of tests to differ from your “corrected against” number of tests. But it’s not circular, because you’re not doing multiple correction. The positives you get after running a bunch of tests will not have a very high level of confidence, but that’s why you then take them as your new fixed set of specific hypotheses to run against the next dataset and, if the results are important, then perhaps do multiple correction.
So for example, if I cared that much about the LW survey results from the data exploration, what I should ideally do is collect the n positive results I care about, announce in advance the exact analysis I plan to do with the 2013 dataset, and decide in advance whether and what kind of multiple correction I want to do. The 2012 results using 2012 data suggest n hypotheses, and I would then actually test them with the 2013 data. (As it happens, I don’t care enough, so I haven’t.)
Gwern, I should be able to say that I appreciate the time you took to respond (which is snarky enough), but I am not able to do so. You can’t trust that your response to me is inappropriate and I can’t find any reason to invest myself in proving your response is inappropriate. I’ll agree my comment to you was somewhat inappropriate and while turnabout is fair play (and first provocation warrants an added response), it is not helpful here (whether deliberate or not). Separate from that, I disagree with you (your response is, historically, how people have managed to be wrong a lot). I’ll retire once more.
I believe it was suggested to me when I first asked about the potential value of this place that they could help me with my math.
Nope, I don’t think you have. Not everyone is crazy, but scholarship is pretty atrocious.