For probabilistically-guaranteed methods, there is an epistemic gap—in principle—in going from the properties of such procedures in classes of repeating situations (i.e., pre-data claims about the procedure) to well-warranted claims in the cases at hand (i.e., post-data claims about the world).
Well, if you believe post-data probabilities reflect real knowledge, then that’s a start. Because you can think of pre-data probabilities as more conservative versions of post-data probabilities. That is, if pre-data calculations tell you to be sure of something, you can probably be at least that sure post-data.
The example that’s guiding me here is a confidence interval. When you derive a confidence interval, you’re really calculating the probability that some parameter of interest R will be between two estimators E1 and E2.
P(E1 < R < E2) = 0.95
Post-data, you just calculate E1 and E2 from the data and call that your 95% confidence interval. So you’re still using the pre-data probability that R is between those two estimators.
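To make that concrete, here’s a minimal simulation sketch. The distributional details are my own assumptions, not anything from above: normally distributed data with a known standard deviation, and the standard z-based interval for the mean. The point is just to check the pre-data claim P(E1 < R < E2) = 0.95 by repetition.

```python
import math
import random

def normal_ci(sample, sigma, z=1.96):
    """95% confidence interval (E1, E2) for the mean of normal data
    with known standard deviation sigma."""
    n = len(sample)
    mean = sum(sample) / n
    half = z * sigma / math.sqrt(n)
    return mean - half, mean + half

random.seed(0)
R, sigma, n, trials = 3.0, 1.0, 25, 10_000
covered = 0
for _ in range(trials):
    sample = [random.gauss(R, sigma) for _ in range(n)]
    e1, e2 = normal_ci(sample, sigma)
    covered += e1 < R < e2          # did the interval capture R this time?
coverage = covered / trials          # should come out near the pre-data 0.95
```

Across many repetitions the interval captures R about 95% of the time, which is exactly the pre-data, procedure-level guarantee.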
I know of two precise senses in which the pre-data probabilities are conservative, when you use them in this way.
Sense the first: Let H be the hypothesis that E1<R<E2. H is probably true, so you’re probably going to get evidence in favor of it. The post-data probability, then, will probably be higher than the pre-data probability.
So, epistemically… I don’t know. If you’re doing many experiments, this explains why using pre-data probabilities is a conservative strategy: in most experiments, you’re underestimating the probability that the parameter is between the estimators. Or, you can view this as logical uncertainty about a post-data probability that you don’t know how to calculate: you think that if you did the calculation, it would probably make you more, rather than less sure that the parameter is between the estimators.
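A toy calculation shows the first sense at work. The setup is entirely hypothetical: an event H with pre-data probability 0.95 (standing in for "R lands between E1 and E2"), and a noisy binary signal D that reports whether H holds, correctly with probability 0.8.

```python
# Pre-data probability of H, and the signal's accuracy (both made up).
p_h = 0.95
acc = 0.8

# Bayes' rule for each possible observation.
p_d1 = p_h * acc + (1 - p_h) * (1 - acc)      # P(D says "H true")
post_d1 = p_h * acc / p_d1                    # P(H | D says "H true")
post_d0 = p_h * (1 - acc) / (1 - p_d1)        # P(H | D says "H false")

# Most of the time (p_d1 ~ 0.77) the data favors H and the post-data
# probability (~0.987) exceeds the pre-data 0.95. On average the
# post-data probability equals the pre-data one (a martingale).
avg_post = p_d1 * post_d1 + (1 - p_d1) * post_d0
```

So usually the post-data probability is higher than 0.95, and using the pre-data number is an underestimate; in the minority of cases where the data disfavors H, it’s an overestimate.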
Another precise sense in which the pre-data probabilities are more conservative is that pre-data probability distributions have higher entropy than post-data ones, on average.
Let’s say R and D are random variables. Let H(R) be the entropy of the probability distribution of R, likewise for D. That is,
H(D) = E[-log P(D)]
I hope this notation is clear… see, usually I’d write P(D=d), but when it’s in an expectation operator, I want to make it clear that D is a random variable that the expectation operator is integrating over, so I write things like E[P(D)] (the expected value of P(D=d) when d is randomly selected).
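For a discrete distribution, the definition above is just a weighted sum. A quick sketch (the fair/biased coin example is my own illustration):

```python
import math

def entropy(dist):
    """Shannon entropy H = E[-log P] of a discrete distribution, in nats.
    `dist` maps outcomes to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# A fair coin is maximally uncertain; a biased one less so.
fair = {"heads": 0.5, "tails": 0.5}
biased = {"heads": 0.9, "tails": 0.1}
h_fair = entropy(fair)      # log 2, the maximum for two outcomes
h_biased = entropy(biased)  # strictly smaller
```

Higher entropy means a more spread-out, less committed distribution, which is the sense of "conservative" used below.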
Define the conditional entropy as follows:
H(R | D=d) = E[-log P(R | D=d) | D=d]
The theorem, then, is this:
E[H(R | D)] ≤ H(R)
(I don’t have a free reference on hand, but it’s theorem 9.3.2 in Sheldon Ross’s “A First Course in Probability”)
So, imagine that R is a paRameter and D is some Data. And note that the expectation is not conditional on D; all of this is in the pre-data state of knowledge. So what this theorem means is that, before seeing the data, the expected value of the post-data entropy is below the current entropy.
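The theorem can be checked numerically on any small discrete joint distribution. The particular numbers below are arbitrary, chosen only so that R and D are dependent:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Made-up joint distribution P(R, D): rows index values of R, columns values of D.
joint = [
    [0.30, 0.10],
    [0.05, 0.25],
    [0.10, 0.20],
]

p_r = [sum(row) for row in joint]                         # marginal of R
p_d = [sum(row[j] for row in joint) for j in range(2)]    # marginal of D

h_r = entropy(p_r)  # pre-data entropy H(R)

# E[H(R | D)] = sum over d of P(D=d) * H(R | D=d),
# where H(R | D=d) uses the conditional distribution P(R | D=d).
h_r_given_d = sum(
    p_d[j] * entropy([row[j] / p_d[j] for row in joint])
    for j in range(2)
)
# The theorem says h_r_given_d <= h_r: on average, conditioning on the
# data cannot increase the entropy of the distribution over R.
```

With these numbers, E[H(R | D)] comes out strictly below H(R); equality would hold only if R and D were independent.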
This one’s a little weirder to interpret, but it clearly seems to be saying something relevant. As a statement about doing many independent experiments, it means that the average pre-data distribution entropy is higher than the average post-data distribution entropy, so when you use the pre-data probabilities, you’re taking them from a higher-entropy distribution. That’s a sense in which you could call it a conservative strategy: it tends to use a probability distribution that’s too spread out. As a statement about logical uncertainty, when you haven’t calculated the post-data probabilities, it could mean that your best estimate of the post-data entropy is lower than the entropy of the pre-data distribution. And if your best estimate is close to the truth, you’re using a distribution that’s too spread out, not too concentrated.
So that’s what I’ve got. I think there’s a lot more to be said here. I haven’t read about this topic, I’m just putting together some stuff that I’ve observed incidentally, so I would appreciate a reference. But what it adds up to is that using pre-data probabilities is a conservative strategy.
And the reason that’s important is that conservative strategies can be really useful for science. Sometimes you wanna gather evidence until you’ve got enough that you can publish and say that you’ve proved something with confidence. Conservative calculations can often show what you want to show, which is that your evidence is sufficient.