The issue isn’t how many questions one is comparing, but rather what is the prior for this specific correlation.
Good clarifications; thanks. Also, I hadn’t noticed that the p-value for the caffeine/exercise comparison was also small (p=.008); on reflection, I agree with you that one or both correlations is likely to be real. Maybe I really will start drinking coffee again. (Though note that caffeine did not correlate with self-reported procrastination levels, with income, or with happiness, which is otherwise a strong anti-procrastination predictor.)
Huh; I think I was in fact making thinking errors here, even though I understood your points enough to have explained them many times to others. My thought had been that, while it would be better to directly estimate a prior if I could do so accurately, doing so would be hard for two reasons:
1. Hindsight bias (plus the fact that I wrote no such priors down ahead of time);
2. Lack of practice with questions generated in this manner. In daily life (or while playing calibration games with trivia cards), questions are selected so that “yes” and “no” answers each occur at a fairly high frequency. Given the prior one should have over a randomly selected such question, I am therefore usually hesitant to assign less than a 5% probability to anything that doesn’t make me think “no way could that possibly happen”, because, when I’ve done calibration practice, I’ve found that that’s what “only a 5% chance” feels like from the inside. But when one’s questions are generated by a process of automatic comparisons rather than a process of deliberate conjecture, the priors can be lower.
I was hoping, by attending to the number of questions involved, to get a feel for the latter effect. But on reflection it seems you’re right, and I would have done better, in practice, to think about the odds of the kinds of comparisons I was actually running being true; hindsight bias or not, caffeine helping with System II overrides is clearly in a different hypothesis category than, e.g., the sorts of compound/health-outcome correlations that are sometimes mass-evaluated in medical research.
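To make the "hypothesis category" point concrete, here is a minimal sketch of how much the prior matters when interpreting the p=.008 above. It uses the Sellke–Bayarri–Berger bound, an upper bound on the Bayes factor a p-value can provide in favor of a real effect (valid for p < 1/e); the choice of illustrative priors (a deliberately conjectured hypothesis vs. one comparison out of a mass screen) is my own assumption, not anything from the thread.

```python
import math

def max_bayes_factor(p):
    """Sellke-Bayarri-Berger upper bound on the Bayes factor for a real
    effect, given a p-value (valid for p < 1/e): BF <= 1 / (-e * p * ln p)."""
    return 1.0 / (-math.e * p * math.log(p))

def posterior_prob(prior, p):
    """P(effect is real) after seeing p, under the most favorable Bayes
    factor the p-value permits."""
    odds = (prior / (1.0 - prior)) * max_bayes_factor(p)
    return odds / (1.0 + odds)

p = 0.008  # the caffeine/exercise p-value from the discussion above
# Illustrative priors: a hand-picked hypothesis vs. mass-evaluated comparisons.
for prior in (0.5, 0.05, 0.005):
    print(f"prior {prior:>5}: posterior <= {posterior_prob(prior, p):.3f}")
```

The same p-value moves a 50% prior to roughly 90% posterior at best, but leaves a 0.5% prior (the mass-screening regime) below 5%, which is the quantitative version of the point about hypothesis categories.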