Incidentally, Eliezer, I don’t think you’re right about the example at the beginning of the post. The two frequentist tests are asking distinct questions of the data, and there is not necessarily any inconsistency when we ask two different questions of the same data and get two different answers.
Suppose A and B are tossing coins. A and B both get the same string of results—a whole bunch of heads (let’s say 9999) followed by a single tail. But A got this by just deciding to flip a coin 10000 times, while B got it by flipping a coin until the first tail came up. Now suppose they each ask the question “what is the probability that, when doing what I did, one will come up with at most the number of tails I actually saw?”
In A’s case the answer is of course very small, assuming a roughly fair coin; most strings of 10000 flips have many more than one tail. In B’s case the answer is of course 1; B’s method ensures that exactly one tail is seen, no matter what happens. The data was the same, but the questions were different, because of the “when doing what I did” clause (since A and B did different things). Frequentist tests are often like this: they involve some sort of reasoning about hypothetical repetitions of the procedure, and if the procedure differs, the question differs.
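To put rough numbers on the two answers, here is a minimal sketch. It assumes a fair coin (p = 1/2) as the hypothesis under test, which the comment does not state explicitly, and the function names are made up for illustration.

```python
from fractions import Fraction
from math import comb, log10

def p_at_most_k_tails_fixed_n(n, k, p_heads):
    """A's procedure: flip exactly n times; return P(number of tails <= k)."""
    p_tails = 1 - p_heads
    return sum(comb(n, j) * p_tails**j * p_heads**(n - j) for j in range(k + 1))

def p_at_most_k_tails_until_first_tail(k, p_heads):
    """B's procedure: flip until the first tail; return P(number of tails <= k).

    As long as tails is possible (p_heads < 1), the run ends with exactly one tail.
    """
    if p_heads >= 1:
        raise ValueError("B never stops if the coin cannot land tails")
    return Fraction(1) if k >= 1 else Fraction(0)

fair = Fraction(1, 2)  # assumed hypothesis under test, not given in the thread
answer_a = p_at_most_k_tails_fixed_n(10000, 1, fair)
answer_b = p_at_most_k_tails_until_first_tail(1, fair)

# answer_a is far too small for a float, so report its base-10 logarithm instead.
log10_a = log10(answer_a.numerator) - log10(answer_a.denominator)
print(f"A: about 10^{log10_a:.1f}")
print(f"B: {answer_b}")
```

Under that assumption A's answer is 10001 / 2^10000, on the order of 10^-3006, while B's answer is exactly 1.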
If we wanted to restate this in Bayesian terms, we’d have to note that the interpreter knows what the method is, not just what the data is, and the distributions used by a Bayesian interpreter should reflect that. For instance, one would be a pretty dumb Bayesian if one’s prior for B’s method didn’t say you’d get one tail with probability one. The observation that’s causing us to update isn’t “string of data,” it’s “string of data produced by a given physical process,” where the process is different in the two cases.
(I apologize if this has all been mentioned before—I didn’t carefully read all the comments above.)
Now suppose they each ask the question “what is the probability that, when doing what I did, one will come up with at most the number of tails I actually saw?”
That is throwing away data. The evidence that they each observed is the sequence of coin flip results, and the number of tails in that sequence is a partial summary of the data. The reason they get different answers is that the summary throws away more data for B than for A. As you say, B already expected to get exactly one tail, so that summary tells him nothing new and he has no information to update on, while A can recover from this summary the number of heads and only loses information about the order (which cancels out anyway in the likelihood ratios between theories of independent coin flips). But if you calculate the probability that they each see that sequence, you get the same answer for both: p(heads)^9999 * (1 - p(heads)).
That is, the data gathering procedure is needed to interpret a partial summary of the data, but not the complete data.
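A small sketch of that claim, using exact rational arithmetic. The two candidate biases (1/2 and 9999/10000) and the function names are illustrative assumptions, not something from the thread.

```python
from fractions import Fraction

def likelihood_fixed_n(seq, n, p_heads):
    """A's procedure: P(seq) when exactly n flips are made."""
    if len(seq) != n:
        return Fraction(0)  # outside A's support
    heads = seq.count("H")
    return p_heads**heads * (1 - p_heads)**(len(seq) - heads)

def likelihood_until_first_tail(seq, p_heads):
    """B's procedure: P(seq) when flipping stops at the first tail."""
    if not seq.endswith("T") or "T" in seq[:-1]:
        return Fraction(0)  # outside B's support
    return p_heads**(len(seq) - 1) * (1 - p_heads)

observed = "H" * 9999 + "T"
fair, biased = Fraction(1, 2), Fraction(9999, 10000)  # hypothetical theories

# Same probability for the full sequence under either stopping rule.
same_likelihood = all(
    likelihood_fixed_n(observed, 10000, p) == likelihood_until_first_tail(observed, p)
    for p in (fair, biased)
)

# Hence the same likelihood ratio between the two theories.
ratio_a = likelihood_fixed_n(observed, 10000, biased) / likelihood_fixed_n(observed, 10000, fair)
ratio_b = likelihood_until_first_tail(observed, biased) / likelihood_until_first_tail(observed, fair)

print(same_likelihood, ratio_a == ratio_b)
```

Since the likelihood ratio is identical, an update based on the complete sequence comes out the same whichever procedure produced it.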
Sure, the likelihoods are the same in both cases, since A’s and B’s probability distributions assign the same probability to any sequence that is in both of their supports. But the distributions are still different, and various functionals of them are still different, e.g., the number of tails, the moments (if we convert heads and tails to numbers), etc.
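As a rough illustration of one such functional, the sketch below compares the distribution of the number of tails under the two procedures. The fair coin and the small n = 10 are assumptions chosen to keep the exact arithmetic cheap, and the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb

def tails_pmf_fixed_n(n, p_heads):
    """A's procedure: distribution of the number of tails in exactly n flips."""
    p_tails = 1 - p_heads
    return {j: comb(n, j) * p_tails**j * p_heads**(n - j) for j in range(n + 1)}

def tails_pmf_until_first_tail():
    """B's procedure: the run always ends with exactly one tail (given p_heads < 1)."""
    return {1: Fraction(1)}

def mean(pmf):
    return sum(value * prob for value, prob in pmf.items())

def variance(pmf):
    m = mean(pmf)
    return sum((value - m) ** 2 * prob for value, prob in pmf.items())

p = Fraction(1, 2)  # assumed coin bias
n = 10              # small n, for illustration only

pmf_a = tails_pmf_fixed_n(n, p)
pmf_b = tails_pmf_until_first_tail()

mean_a, var_a = mean(pmf_a), variance(pmf_a)
mean_b, var_b = mean(pmf_b), variance(pmf_b)

print("A:", mean_a, var_a)
print("B:", mean_b, var_b)
```

With these numbers A expects 5 tails with variance 5/2, while B expects exactly 1 tail with variance 0, even though both assign the same probability to any sequence they can both produce.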
If you’re a Bayesian, you think any hypothesis worth considering can predict a whole probability distribution, so there’s no reason to worry about these functionals when you can just look at the probability of your whole data set given the hypothesis. If (as in actual scientific practice, at present) you often predict functionals but not the whole distribution, then the difference in the functionals matters. (I admit that the coin example is too basic here, because in any theory about a real coin, we really would have a whole distribution.)
My point is just that there are differences between the two cases. Bayesians don’t think these differences could possibly matter to the sort of hypotheses they are interested in testing, but that doesn’t mean that in principle there can be no reason to differentiate between the two.