A more generous way to think about frequentism (which can be justified by some conditional probability sleight-of-hand) is that the significance of some evidence E is actually the probability that the null hypothesis is true, given E and also some prior distribution that is swept under the rug and (mostly) not under the experimenter’s control. Which is bad, yes, but in many cases the prior distribution is at least close to something reasonable. And there are some cases in which we can somewhat change the prior distribution to reflect our real priors: for example, when choosing to conduct a 1-tailed test rather than a 2-tailed one.
Under this interpretation, it is silly to expect significances to multiply. You’d really be saying something like Pr[H|E1+E2] = Pr[H|E1] Pr[H|E2]. And that’s simply not true: you are double-counting the prior probability Pr[H] when you do this. The frequentist approach is a correct way to combine these probabilities, although this isn’t obvious because nobody actually knows what the frequentist Pr[H] is.
But if you read about two experiments with a p-value of 0.05, and think of them as one experiment with a p-value of 0.0025, you are very, very, very wrong; not just frequentist-wrong but Bayesian-wrong as well.
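For a sense of scale, here is a minimal sketch of one standard frequentist way to combine independent p-values, Fisher's method. Nobody in this thread names that method, so treat it as a swapped-in illustration rather than the procedure the comment has in mind. The function name `fisher_combined_p` is made up for this example, and the two experiments are assumed to be independent.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null, -2 * sum(ln p_i) is chi-squared with 2k degrees of
    freedom. For an even number of degrees of freedom the upper tail has
    a closed form, so only the standard library is needed.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # test statistic / 2
    # Upper tail of chi-squared with 2k d.o.f., evaluated at 2 * half_stat
    tail = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail

naive = 0.05 * 0.05                          # the tempting shortcut
combined = fisher_combined_p([0.05, 0.05])   # assumes independent experiments

print(f"naive product:   {naive:.4f}")
print(f"Fisher combined: {combined:.4f}")
```

Under these assumptions the combined value comes out near 0.017, roughly seven times larger than the naive 0.0025.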
“the significance of some evidence E is actually the probability that the null hypothesis is true, given E”
No frequentist says this. They don’t believe in P(H|E). That’s the explicit basis of the whole philosophy. People who talk about the probability of a hypothesis given the evidence are Bayesians, full stop.
Statistical significance is, albeit in a strange and distorted way, supposed to be about P(E|null hypothesis), and so, yes, two experiments with a p-value of 0.05 should combine to somewhere in the vicinity of p < 0.0025, because it’s about likelihoods, which do multiply, and not posteriors.
While some frequentist methods do use likelihoods, the mapping from likelihood to p-value is non-linear, so multiplying them would still be a mistake, at least as far as I can tell.
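A small numeric check of the non-linearity point, as a sketch: it borrows the 8-or-more-heads-out-of-10 coin example that appears elsewhere in this thread and assumes two identical, independent runs of it. The helper `upper_tail` and the idea of pooling the two runs into one 20-flip experiment are assumptions made for illustration only.

```python
from math import comb

def upper_tail(heads, flips, p=0.5):
    """P(at least `heads` heads in `flips` flips) for a coin with bias p."""
    return sum(comb(flips, k) * p**k * (1 - p)**(flips - k)
               for k in range(heads, flips + 1))

p_single = upper_tail(8, 10)    # one experiment: 8 or more heads out of 10
p_product = p_single ** 2       # naive multiplication of the two p-values
p_pooled = upper_tail(16, 20)   # both runs pooled: 16 or more heads out of 20

print(f"single experiment: {p_single:.4f}")
print(f"naive product:     {p_product:.4f}")
print(f"pooled experiment: {p_pooled:.4f}")
```

Each run alone gives p of about 0.055. The product is about 0.003, while the pooled tail probability is about 0.006, so the pooled p-value is roughly twice what multiplication suggests.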
I’m not saying that frequentists believe this. I’m saying that the frequentist math (which computes Pr[E|H0]) is equivalent to computing Pr[H0|E] with respect to a prior distribution under which Pr[H0]=Pr[E]. Furthermore, this is a reasonable thing to look at, because from that point of view the way statistical significances combine actually makes sense.
Whaa?
Well, we have, in general, Pr[H0|E] = Pr[E|H0] * Pr[H0]/Pr[E]. Frequentists compute Pr[E|H0] instead of Pr[H0|E], but this turns out not to matter if Pr[H0]/Pr[E] cancels, which happens when the above equality holds.
From a certain point of view, this is just mathematical sleight of hand, of course. Also, the “E” is actually some class of outcomes that are grouped together (e.g. all outcomes in which 8 or more coins, out of 10, came up heads). But if we combine sequences of experimental results in the correct way, then this means that the frequentist and Bayesian results differ only by a constant factor (precisely the factor which we assumed, above, to be 1).
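To make the factor Pr[H0]/Pr[E] concrete, here is a rough sketch using the 8-or-more-heads-out-of-10 example. Everything beyond that example is an assumption chosen for illustration: a single alternative hypothesis H1 (a coin with bias 0.8), a 50/50 prior between H0 and H1, and the helper name `upper_tail`.

```python
from math import comb

def upper_tail(heads, flips, p):
    """P(at least `heads` heads in `flips` flips) for a coin with bias p."""
    return sum(comb(flips, k) * p**k * (1 - p)**(flips - k)
               for k in range(heads, flips + 1))

prior_h0 = 0.5   # assumed prior probability that the coin is fair (H0)
p_alt = 0.8      # assumed bias under the single alternative H1

like_h0 = upper_tail(8, 10, 0.5)     # Pr[E|H0], the frequentist quantity
like_h1 = upper_tail(8, 10, p_alt)   # Pr[E|H1]

# Pr[E] by the law of total probability over the two hypotheses
pr_e = prior_h0 * like_h0 + (1 - prior_h0) * like_h1

factor = prior_h0 / pr_e             # Pr[H0] / Pr[E]
posterior_h0 = like_h0 * factor      # Bayes: Pr[H0|E]

print(f"Pr[E|H0]     = {like_h0:.4f}")
print(f"Pr[H0]/Pr[E] = {factor:.4f}")
print(f"Pr[H0|E]     = {posterior_h0:.4f}")
```

With these made-up priors the factor is not 1, so Pr[E|H0] and Pr[H0|E] differ. Changing `prior_h0` or `p_alt` changes the factor, which is the part that the frequentist number leaves unstated.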
Why the heck would the probability of seeing the evidence, conditional on the mix of all hypotheses being considered, exactly equal the prior probability of the null hypothesis?
It wouldn’t. Probably a better way to explain it would have been to factor their ratio out as a constant.
Anyway, I’ve totally messed up explaining this, so I will fold for now and direct you to a completely different argument made elsewhere in the comments which is more worthy of being considered.