there is a 12% chance of seeing a correlation that strong by chance
No, a 12% chance of seeing a correlation at least as strong. Confusion about p-values is endemic! Please be super-careful explaining what they mean! (Specifically, in this case, you don’t want people thinking something like P(result | effect is real) = 1 and P(result | effect is false) = .12; I think that would be overstating the evidence.)
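To make “at least as strong” concrete, here is a minimal simulation sketch; the sample size and observed correlation below are made-up stand-ins (the survey’s actual numbers aren’t in this thread), chosen so the two-sided p-value comes out near .12:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r_obs, sims = 100, 0.157, 20_000  # hypothetical values, not from the survey

# Under the null, x and y are independent, so any observed correlation
# is pure chance. Simulate that null many times.
x = rng.standard_normal((sims, n))
y = rng.standard_normal((sims, n))
x -= x.mean(axis=1, keepdims=True)
y -= y.mean(axis=1, keepdims=True)
r_null = (x * y).sum(axis=1) / np.sqrt((x**2).sum(axis=1) * (y**2).sum(axis=1))

# A two-sided p-value of ~0.12 means: ~12% of null draws show a correlation
# AT LEAST as strong as the observed one -- not P(effect is unreal) = 0.12.
print((np.abs(r_null) >= r_obs).mean())
```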
p=.004 is significant in most contexts, but less so when one is comparing 25 questions against one another
The issue isn’t how many questions one is comparing, but rather what is the prior for this specific correlation. I don’t think the prior for caffeine correlating with productivity is that low, and the .004 probably translates to pretty strong evidence. Of course, separately from that, you have to worry whether the correlation represents coffee → productivity causation.
The issue isn’t how many questions one is comparing, but rather what is the prior for this specific correlation.
Good clarifications; thanks. Also, I hadn’t noticed that the p-value for the caffeine/exercise comparison was also small (p=.008); on reflection, I agree with you that one or both correlations is likely to be real. Maybe I really will start drinking coffee again. (Though note that caffeine did not correlate with self-reported procrastination levels, nor with income, nor with happiness, which is otherwise a strong anti-procrastination predictor.)
Huh; I think I was in fact making thinking errors here, even though I understood your points enough to have explained them many times to others. My thought had been that, while it would be better to directly estimate a prior if I could do so accurately, doing so would be hard for two reasons:
Hindsight bias (plus the fact that I wrote no such priors down ahead of time);
Lack of practice with questions generated in this manner. In daily life (or while playing calibration games with trivia cards), questions are selected so that both “yes” and “no” come up as answers quite frequently. Given the prior one should have over a randomly selected such question, I am therefore usually hesitant to assign less than a 5% probability to anything that doesn’t make me think “no way could that possibly happen”, because, when I’ve done calibration practice, that’s what “only a 5% chance” has felt like from the inside. But when one’s questions are generated from a process of automatic comparisons rather than a process of deliberate conjecture, the priors can be lower.
I was hoping, by attending to the number of questions involved, to get a feel for the latter effect. But on reflection it seems you’re right, and I would have done better, in practice, to have thought about the odds that the kinds of comparisons I was actually running would turn out to be true; hindsight bias or not, caffeine helping with System II overrides is clearly in a different hypothesis category than e.g. the sorts of compound/health-value correlations that are sometimes mass-evaluated in medical research.
Steven, or others: if someone were to do a survey of e.g. Berkeley math majors, or others who seem similar to the LW population, what odds would you give on the caffeine/Anne correlation holding up? This might be a good subject for bets and rationality practice.
This might be a good subject for bets and rationality practice.
It sounds like a lot of effort for just one question. It would be easier to have someone transparently pick some science papers that people hadn’t heard of, and then have people guess at what their conclusions were.
This is a good idea; however, it is possible to update fairly strongly on the fact that a paper on the subject in question was published.
Oh, that was sloppy of me; I read your “should I use caffeine again” comment and jumped to the conclusion that the correlation being discussed was between caffeine and self-reported akrasia. The correlation still doesn’t seem out of the question.
My cop-out answer: it depends on sample size, other aspects of the experiment design, and how you construe “holding up”, but probably not much less than 50%.
Suppose, for simplicity, that there is either a real correlation of about this size, or no correlation at all. (This is of course a simplification, since there could also be correlations of smaller or greater size.)
So, if we assign a 5% prior to caffeine in fact helping with the Anne question, and if we assume that the chance of seeing a correlation at least this large, given a real correlation, is 50% (I’m not sure whether this is reasonable, but I’m hoping the observed correlation would be approximately centered around the actual correlation), our posterior should be:
.05 × .5 / (.05 × .5 + .95 × .004) ≈ 87%.
If we assign a 1% prior to caffeine in fact helping with the Anne question, the posterior (under the same simplified assumptions) should be: 56%.
If we assign a 10% prior (under the same simplified assumptions), the posterior should be: 93%.
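For concreteness, a minimal sketch of the same posterior calculation, under the same simplifying assumptions (including treating the p-value of .004 as if it were P(data | no correlation), which the reply below takes issue with):

```python
def posterior(prior, p_data_given_real=0.5, p_data_given_null=0.004):
    """Bayes' rule under the two-hypothesis simplification above."""
    numer = prior * p_data_given_real
    return numer / (numer + (1 - prior) * p_data_given_null)

for prior in (0.01, 0.05, 0.10):
    print(f"prior {prior:.0%} -> posterior {posterior(prior):.0%}")
# prior 1% -> posterior 56%; prior 5% -> 87%; prior 10% -> 93%
```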
I haven’t thought much about whether this is the correct way to formalize/simplify the question, so these numbers may be misleading.
You’re updating on the fact that we observed at least some value, when what we know is we observed exactly that value. I think this overstates the evidence against the null, because all those higher results would have been stronger evidence against the null than the actual result is, and you haven’t told the formalism that those higher results in fact didn’t happen. (ETA: I recommend the Goodman paper linked toward the end of the great-great-grandparent comment, if you haven’t seen it already. It sets upper bounds on the amount of Bayesian evidence you can get from any given p-value. The image with the table has some concrete numbers.)
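For anyone who wants concrete numbers without the table: a sketch of the minimum Bayes factor bound Goodman derives for a normally distributed test statistic, exp(−z²/2), alongside the related −e·p·ln(p) calibration due to Sellke, Bayarri, and Berger (scipy is assumed available):

```python
from math import exp, log, e
from scipy.stats import norm

def min_bayes_factor(p, two_sided=True):
    # Goodman's minimum Bayes factor for a normally distributed test
    # statistic: exp(-z^2 / 2), where z corresponds to the observed p-value.
    z = norm.ppf(1 - (p / 2 if two_sided else p))
    return exp(-z**2 / 2)

def sbb_bound(p):
    # Sellke-Bayarri-Berger calibration: -e * p * ln(p), valid for p < 1/e.
    return -e * p * log(p)

for p in (0.05, 0.008, 0.004):
    print(f"p = {p}: min BF <= {min_bayes_factor(p):.3f}, "
          f"-e*p*ln(p) = {sbb_bound(p):.3f}")
```

Either way, a p-value of .004 can shift the odds against the null by a factor of at most roughly 20–60, not the 1/.004 = 250 that plugging it in directly as P(data | null) would suggest.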
If the results were by chance, that’s one reason why they probably wouldn’t replicate in Berkeley students, but there are other ways. For example, there could be some common factor like age in the LW population that caused both coffee use and correct answers to the Anne question, but that wasn’t important in the Berkeley population. Or they could just fail to replicate by coincidence, with a probability depending on what counts as “fail to replicate”. So even if I were, say, 95% convinced that this result wasn’t a coincidence, I think I still wouldn’t be more than say 70% sure that it would replicate.
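As an illustration of how those numbers can coexist, here is a toy decomposition; every number in it is an illustrative placeholder, not something measured in the thread:

```python
# All numbers are illustrative placeholders, not from the survey.
p_real = 0.95            # confidence the original result wasn't a coincidence
p_rep_given_real = 0.72  # replication can still fail: confounders (e.g. age),
                         # sampling noise, population differences
p_rep_given_fluke = 0.05 # replication chance if the original was a fluke

p_replicate = p_real * p_rep_given_real + (1 - p_real) * p_rep_given_fluke
print(f"{p_replicate:.0%}")  # ~69%, in line with the "not more than 70%" figure
```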