Are calibration and rational decisions mutually exclusive? (Part one)
I’m planning a two-part sequence with the aim of throwing open the question in the title to the LW commentariat. In this part I’ll briefly go over the concept of calibration of probability distributions and point out a discrepancy between calibration and Bayesian updating.
It’s a tenet of rationality that we should seek to be well-calibrated. That is, suppose that we are called on to give interval estimates for a large number of quantities; we give each interval an associated epistemic probability. We declare ourselves well-calibrated if the relative frequency with which the quantities fall within our specified intervals matches our claimed probability. (The Technical Explanation of Technical Explanations discusses calibration in more detail, although it mostly discusses discrete estimands, while here I’m thinking about continuous estimands.)
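To make the definition concrete, here is a minimal sketch of a calibration check for interval estimates. Everything in it is an illustrative assumption rather than anything from the discussion above: the quantities are draws from a standard normal, the forecaster reports central intervals around zero, and the names `empirical_coverage` and `simulate_forecaster` are made up for this example.

```python
import random
from statistics import NormalDist


def empirical_coverage(intervals, truths):
    """Fraction of true values that fall inside their stated intervals."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    return hits / len(truths)


def simulate_forecaster(width_factor, claimed=0.90, n=20000, seed=0):
    """Coverage achieved by a hypothetical forecaster who claims `claimed`
    probability for each interval.

    Assumption: each quantity is a standard normal draw. A forecaster with
    width_factor=1.0 reports the correct central interval; a smaller factor
    means narrower (overconfident) intervals.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + claimed / 2.0)  # half-width of the correct interval
    half = width_factor * z
    truths = [rng.gauss(0.0, 1.0) for _ in range(n)]
    intervals = [(-half, half)] * n
    return empirical_coverage(intervals, truths)


if __name__ == "__main__":
    print("honest forecaster:       ", simulate_forecaster(1.0))
    print("overconfident forecaster:", simulate_forecaster(0.5))
```

The first forecaster's hit rate should land near the claimed 0.90, which is what being well-calibrated means here; the second claims the same 0.90 but is right noticeably less often.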
Frequentists also produce interval estimates, at least when “random” data is available. A frequentist “confidence interval” is really a function from the data and a user-specified confidence level (a number from 0 to 1) to an interval. The confidence interval procedure is “valid” if in a hypothetical infinite sequence of replications of the experiment, the relative frequency with which the realized intervals contain the estimand is equal to the confidence level. (Less strictly, we may require “greater than or equal” rather than “equal”.) The similarity between valid confidence coverage and well-calibrated epistemic probability intervals is evident.
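The "function from the data and a confidence level to an interval" view can be sketched in a few lines. This is only an illustration under assumptions I am supplying: normally distributed data with a known standard deviation, the textbook z-interval for the mean, and a finite number of simulated replications standing in for the hypothetical infinite sequence. The names `z_interval` and `coverage` are invented for the example.

```python
import math
import random
from statistics import NormalDist, fmean


def z_interval(data, sigma, level):
    """Confidence interval procedure: maps (data, level) to an interval.

    Assumes the data are normal with known standard deviation `sigma`.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z * sigma / math.sqrt(len(data))
    m = fmean(data)
    return m - half, m + half


def coverage(mu=3.0, sigma=2.0, n=10, level=0.95, reps=5000, seed=1):
    """Relative frequency with which realized intervals contain the estimand."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        data = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = z_interval(data, sigma, level)
        hits += lo <= mu <= hi
    return hits / reps


if __name__ == "__main__":
    print(coverage())
```

For this procedure the long-run relative frequency equals the confidence level, so the simulated value should sit close to 0.95 whatever values of `mu` and `sigma` are chosen.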
This similarity suggests an approach for specifying non-informative prior distributions: we require that such priors yield posterior intervals that are also valid confidence intervals in the frequentist sense. This “matching prior” program does not succeed in full generality. There are a few special cases of data distributions where a matching prior exists, but by and large, posterior intervals can at best produce only asymptotically valid confidence coverage. Furthermore, according to my understanding of the material, if your model of the data-generating process contains more than one scalar parameter, you have to pick one “interest parameter” and be satisfied with good confidence coverage for the marginal posterior intervals for that parameter alone. For approximate matching priors with the highest order of accuracy, a different choice of interest parameter usually implies a different prior.
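As a rough illustration of posterior intervals whose frequentist coverage is only approximate, the sketch below computes the coverage of equal-tailed posterior intervals for a binomial proportion under the Jeffreys prior, Beta(1/2, 1/2). The choice of model and prior is mine, not something argued for above, and there is a caveat: binomial data are discrete, so no prior can give exact coverage here. The example therefore shows only that coverage varies with the true parameter and need not equal the nominal level. Posterior quantiles are approximated by Monte Carlo with `random.betavariate` to stay within the standard library; the function names are hypothetical.

```python
import math
import random


def jeffreys_interval(x, n, level=0.95, draws=20000, rng=None):
    """Approximate equal-tailed posterior interval for a binomial proportion.

    Under the Jeffreys prior Beta(1/2, 1/2), the posterior after x successes
    in n trials is Beta(x + 1/2, n - x + 1/2). Quantiles are estimated from
    sorted posterior draws, so the endpoints carry Monte Carlo error.
    """
    rng = rng or random.Random(0)
    samples = sorted(rng.betavariate(x + 0.5, n - x + 0.5) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lo = samples[int(tail * draws)]
    hi = samples[int((1.0 - tail) * draws) - 1]
    return lo, hi


def exact_coverage(n, p, level=0.95, seed=0):
    """Probability, under Binomial(n, p), that the posterior interval contains p.

    Sums the binomial probabilities of every outcome x whose interval covers p.
    """
    rng = random.Random(seed)
    total = 0.0
    for x in range(n + 1):
        lo, hi = jeffreys_interval(x, n, level, rng=rng)
        if lo <= p <= hi:
            total += math.comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5):
        print(p, round(exact_coverage(20, p), 4))
```

Running this for several values of `p` shows the coverage moving around the nominal 0.95 rather than sitting on it.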
The upshot is that we have good reason to think that Bayesian posterior intervals will not be perfectly calibrated in general. I have good justifications, I think, for using the Bayesian updating procedure, even if it means the resulting posterior intervals are not as well-calibrated as frequentist confidence intervals. (And I mean good confidence intervals, not the obviously pathological ones.) But my justifications are grounded in an epistemic view of probability, and no committed frequentist would find them as compelling as I do. However, there is an argument for Bayesian posteriors over confidence intervals that even a frequentist would have to credit. That will be the focus of the second part.