but in the spirit of ‘the earth is actually not spherical but an oblate spheroid thus you have been educated stupid and Time has Four Corners!’ Because the standard work has flaws, they feel free to jump to whatever random bullshit they like best.
But you don’t have a complete fossil record, therefore Creationism!
Obviously that’s a problem. This somewhat confirms my comment to Phil: tying the statistical issue to food dyes made his claims harder to accept, because it fit your pattern better than a general statistical argument would have.
But from the numbers he reported, the basic eyeball test of the data leaves me thinking that food dyes may have an effect. Certainly if you take the data alone without priors, I’d conclude that, more likely than not, food dyes have an effect. That’s how I would interpret the 84% significance threshold: probably there is a difference. Do you agree?
Unfortunately, I don’t have JAMA access to the paper to really look at the data, so I’m going by the 84% significance threshold.
I made up the 84% threshold in my example, to show what can happen in the worst case. In this study, what they found was that food dye decreased hyperactivity, but not enough to pass the threshold. (I don’t know what the threshold was or what confidence level it was set for; they didn’t say in the tables. I assume 95%.)
If they had passed the threshold, they would have concluded that food dye affects behavior, but would probably not have published because it would be an embarrassing outcome that both camps would attack.
To be clear, then, you’re not claiming that any evidence in the paper amounts to any kind of good evidence that an effect exists?
You’re making a general argument about the mistaken conclusion of jumping from “failure to reject the null” to a denial that any effect exists.
Yes, I’m making a general argument about that mistaken conclusion. The F-test is especially tricky, because you know you’re going to find some difference between the groups. What difference D would you expect to find if there is in fact no effect? That’s a really hard question, and the F-test dodges it by using the arbitrary but standard 95% confidence level to pick a higher threshold, F. Results between D and F would still support the hypothesis that there is an effect, while results below D would be evidence against that hypothesis. Not knowing what D is, we can’t say whether failure of an F-test is evidence for or against the hypothesis.
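To make that concrete, here’s a quick stdlib-only simulation. All the numbers (effect size, group size, number of trials, critical value) are made up for illustration, not taken from the study: a real effect of 0.3 standard deviations with 20 subjects per group fails a 95% two-sample test most of the time, so “failed to reject the null” is exactly what you’d expect to see even when the effect exists.

```python
import math
import random

def t_stat(a, b):
    """Pooled two-sample t statistic for samples a and b."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (mb - ma) / math.sqrt(sp2 * (1 / na + 1 / nb))

random.seed(0)
N_SIMS, N, TRUE_EFFECT = 2000, 20, 0.3   # effect in SD units (invented)
CRIT = 2.02                              # two-sided 95% critical value, df = 38

detected = 0
for _ in range(N_SIMS):
    control = [random.gauss(0.0, 1.0) for _ in range(N)]
    treated = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N)]
    if abs(t_stat(control, treated)) > CRIT:
        detected += 1

power = detected / N_SIMS
print(f"True effect exists, yet only {power:.0%} of studies reject the null")
```

With these toy numbers the test detects the (real) effect well under half the time, so a non-significant result carries very little information about whether the effect is there.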
I’d add to the basic statistical problem the vast overgeneralization and bad decision theory.
You hit on one part of that, the generalization to the entire population.
People are different.
But even if they’re the same, U-shaped response curves make it unlikely to find a signal: you have to have the Goldilocks amount to show an improvement. People vary over time, going in and out of the Goldilocks range. So when you add something, you’ll be pushing some people into the Goldilocks range and some people out.
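A toy sketch of that point (the “nutrient level” scale, the optimum, and the dose are all invented): give everyone the same dose on an inverted-U response curve and the average effect washes out to roughly zero, even though almost every individual is pushed noticeably toward or away from the optimum.

```python
import random

random.seed(1)
OPT = 10.0  # hypothetical "Goldilocks" level of some nutrient

def benefit(level):
    """Inverted-U response: best at OPT, worse on either side."""
    return -(level - OPT) ** 2

baselines = [random.uniform(5, 13) for _ in range(10_000)]  # people differ
DOSE = 2.0  # the same supplement dose for everyone

changes = [benefit(b + DOSE) - benefit(b) for b in baselines]
mean_change = sum(changes) / len(changes)
mean_abs_change = sum(abs(c) for c in changes) / len(changes)
helped = sum(c > 0 for c in changes) / len(changes)

print(f"average effect: {mean_change:+.2f} (looks like 'no effect')")
print(f"average |individual effect|: {mean_abs_change:.2f}, "
      f"fraction helped: {helped:.0%}")
```

Roughly half the simulated people are helped and half are hurt, so a study measuring the group mean sees nothing, while the average individual effect is large.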
It also comes from multiple paths to the same disease. A disease is a set of observable symptoms, not the varying particular causes of the same symptoms. Of course it’s hard to find the signal in a batch of people clustered into a dozen different underlying causes for the same symptoms.
But the bad decision theory is the worst part, IMO. If you have a chronic problem, a 5% chance of a cure from a low-risk, low-cost intervention is great. But getting a 5% signal out of black-box testing regimes biased against false positives is extremely unlikely, and the bias against interventions that “don’t work” keeps many doctors from trying perfectly safe treatments that have a reasonable chance of working.
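The decision-theory point reduces to a one-line expected-value calculation. Every number below is hypothetical, just to show the shape of the tradeoff:

```python
# Toy expected-value check (all numbers hypothetical): even a 5% cure
# chance beats doing nothing when the intervention is cheap and safe.
P_CURE = 0.05       # chance the intervention works for you
CURE_VALUE = 10_000  # value of resolving the chronic problem (arbitrary units)
COST = 50            # cost of trying the intervention
P_HARM = 0.001       # small risk of a mild side effect
HARM_VALUE = 500     # cost of that side effect if it happens

ev = P_CURE * CURE_VALUE - COST - P_HARM * HARM_VALUE
print(f"expected value of trying: {ev:+.2f}")  # positive => worth a try
```

Under these assumptions the expected value is strongly positive, which is exactly why a testing regime that only green-lights high-confidence population-wide effects throws away good bets.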
The whole outlook is bad. It shouldn’t be “find me a proven cure that works for everyone”. It should be “find me interventions to control the system in a known way.” Get me knobs to turn, and let’s see if any of the knobs work for you.
I believe Knight posted links to fulltext at http://lesswrong.com/lw/h56/the_universal_medical_journal_article_error/8pne
I haven’t looked, but I suspect I would not agree and that you may be making the classic significance misinterpretation.