# Are calibration and rational decisions mutually exclusive? (Part one)

I’m planning a two-part sequence with the aim of throwing open the question in the title to the LW commentariat. In this part I’ll briefly go over the concept of calibration of probability distributions and point out a discrepancy between calibration and Bayesian updating.

It’s a tenet of rationality that we should seek to be well-calibrated. That is, suppose that we are called on to give interval estimates for a large number of quantities; we give each interval an associated epistemic probability. We declare ourselves well-calibrated if the relative frequency with which the quantities fall within our specified intervals matches our claimed probability. (The Technical Explanation of Technical Explanations discusses calibration in more detail, although it mostly discusses discrete estimands, while here I’m thinking about continuous estimands.)

Frequentists also produce interval estimates, at least when “random” data is available. A frequentist “confidence interval” is really a function from the data and a user-specified confidence level (a number from 0 to 1) to an interval. The confidence interval procedure is “valid” if in a hypothetical infinite sequence of replications of the experiment, the relative frequency with which the realized intervals contain the estimand is equal to the confidence level. (Less strictly, we may require “greater than or equal” rather than “equal”.) The similarity between valid confidence coverage and well-calibrated epistemic probability intervals is evident.
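To make the "hypothetical sequence of replications" concrete, here is a minimal simulation sketch. It assumes a setting the post does not specify: normally distributed data with a known standard deviation and the textbook z-interval for the mean. The function names and parameter values are illustrative only.

```python
import random
from statistics import NormalDist, fmean

def z_interval(sample, sigma, level):
    """Two-sided interval for a normal mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = fmean(sample)
    return centre - half_width, centre + half_width

def estimate_coverage(true_mean, sigma, n, level, replications, seed=0):
    """Fraction of replicated experiments whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lo, hi = z_interval(sample, sigma, level)
        hits += lo <= true_mean <= hi
    return hits / replications

# Assumed example values: true mean 3, sigma 2, ten observations per experiment.
coverage = estimate_coverage(true_mean=3.0, sigma=2.0, n=10,
                             level=0.95, replications=20000)
print(f"estimated coverage at the 95% level: {coverage:.3f}")
```

With a finite number of replications the estimate only approximates the confidence level, up to simulation error.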

This similarity suggests an approach for specifying non-informative prior distributions, i.e., we require that such priors yield posterior intervals that are also valid confidence intervals in a frequentist sense. This “matching prior” program does not succeed in full generality. There are a few special cases of data distributions where a matching prior exists, but by and large, posterior intervals can at best produce only asymptotically valid confidence coverage. Furthermore, according to my understanding of the material, if your model of the data-generating process contains more than one scalar parameter, you have to pick one “interest parameter” and be satisfied with good confidence coverage for the marginal posterior intervals for that parameter alone. For approximate matching priors with the highest order of accuracy, a different choice of interest parameter usually implies a different prior.
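One case where the contrast can be computed in closed form is a normal mean with known standard deviation. This is a sketch under that assumed model, which the post itself does not use. The flat (improper) prior is a matching prior here, so its posterior intervals have exact coverage. A proper N(prior_mean, prior_sd²) prior stands in for a non-matching prior: it is not an approximate matching prior, and it illustrates only the "asymptotically valid" part. All names and numbers are illustrative.

```python
from math import isinf, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def posterior_interval_coverage(theta, n, sigma, prior_mean, prior_sd, level):
    """Exact frequentist coverage, at a fixed true mean theta, of the central
    posterior interval under a N(prior_mean, prior_sd^2) prior.
    prior_sd = float('inf') gives the flat (improper) prior."""
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    data_precision = n / sigma**2
    prior_precision = 0.0 if isinf(prior_sd) else 1 / prior_sd**2
    w = data_precision / (data_precision + prior_precision)  # weight on the sample mean
    post_sd = 1 / sqrt(data_precision + prior_precision)
    # The interval is w*xbar + (1-w)*prior_mean +/- z*post_sd.
    # Solve for the range of sample means xbar whose interval contains theta.
    shift = (1 - w) * prior_mean
    xbar_lo = (theta - z * post_sd - shift) / w
    xbar_hi = (theta + z * post_sd - shift) / w
    sampling = NormalDist(theta, sigma / sqrt(n))  # sampling distribution of xbar
    return sampling.cdf(xbar_hi) - sampling.cdf(xbar_lo)

# Assumed example: true mean 3, sigma 2, prior centred at 0 with sd 1.
for n in (5, 50, 500, 5000):
    flat = posterior_interval_coverage(3.0, n, 2.0, 0.0, float("inf"), 0.95)
    proper = posterior_interval_coverage(3.0, n, 2.0, 0.0, 1.0, 0.95)
    print(n, round(flat, 4), round(proper, 4))
```

Under the flat prior the coverage equals the nominal level at every sample size. Under the proper prior it depends on where the true mean sits relative to the prior, and approaches the nominal level only as n grows.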

The upshot is that we have good reason to think that Bayesian posterior intervals will not be perfectly calibrated in general. I have good justifications, I think, for using the Bayesian updating procedure, even if it means the resulting posterior intervals are not as well-calibrated as frequentist confidence intervals. (And I mean *good* confidence intervals, not the obviously pathological ones.) But my justifications are grounded in an epistemic view of probability, and no committed frequentist would find them as compelling as I do. However, there is an argument for Bayesian posteriors over confidence intervals that even a frequentist would have to credit. That will be the focus of the second part.

I don’t get it.

I admit my math background is limited to upper-division undergraduate, and I admit I could have tried harder to make sense of the jargon, but after reading this a few times, I really just don’t get what your point is, or even what kind of thing your point is supposed to be.

Suppose the actual frequentist probability of an event is 90%. Your prior distribution for the frequentist probability of the event is uniform. Your Bayesian probability of the event will start at 50% and approach 90%; in the long run, the average will be less than 90%.
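The arithmetic behind this can be sketched with Laplace's rule of succession, which is what a uniform prior on the frequency yields. The function names are illustrative, and the 90% figure is the one from the paragraph above.

```python
from fractions import Fraction

def predictive_probability(successes, trials):
    """P(next event occurs | data) under a uniform, i.e. Beta(1, 1), prior."""
    return Fraction(successes + 1, trials + 2)

def expected_predictive_probability(true_frequency, trials):
    """Average of predictive_probability over Binomial(trials, true_frequency) data.
    Uses E[(K + 1) / (n + 2)] = (n * p + 1) / (n + 2)."""
    return (trials * true_frequency + 1) / (trials + 2)

p = Fraction(9, 10)
print(float(predictive_probability(0, 0)))  # before any data: 0.5
for n in (10, 100, 1000):
    print(n, float(expected_predictive_probability(p, n)))
```

For a true frequency of 9/10 the expected stated probability is (0.9n + 1)/(n + 2), which is below 0.9 for every n and approaches it as n grows.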

If the post is getting at more than this, I understand as little as you do. My answer to the title question was “no, they can’t be” going in, and if the post is trying to say something I haven’t understood, then I hope to convince the author e’s wrong through sheer disagreement.

Try rephrasing your first paragraph when the quantity of interest is not a frequency but, say, Avogadro’s number, and you’re Jean Perrin trying to determine exactly what that number is.

A frequentist would take a probability model for the data you’re generating and give you a confidence interval. A billion scientists repeat your experiments, getting their own data and their own intervals. Among those intervals, the proportion that contain the true value of Avogadro’s number is equal to the confidence (up to sampling error).

A Bayesian would take the same probability model, plus a prior, and combine them using Bayes. Each scientist may have her own prior, and posterior calibration is only guaranteed if (i) all the priors taken as a group were calibrated, or, (ii) everyone is using the matching prior if it exists (these are typically improper, so prior calibration cannot be calculated).
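One simple reading of condition (i) can be simulated. Assume the true values across many estimation problems really are distributed according to the prior. Averaged over those problems, posterior intervals then contain the truth at their stated rate. The normal model, the conjugate prior, and every name below are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def prior_averaged_coverage(n, sigma, prior_mean, prior_sd, level,
                            replications, seed=0):
    """Coverage of central posterior intervals when each true value is itself
    drawn from the prior (normal mean, known sigma, conjugate normal prior)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    data_precision = n / sigma**2
    prior_precision = 1 / prior_sd**2
    w = data_precision / (data_precision + prior_precision)
    post_sd = 1 / sqrt(data_precision + prior_precision)
    hits = 0
    for _ in range(replications):
        theta = rng.gauss(prior_mean, prior_sd)    # truth for this problem
        xbar = rng.gauss(theta, sigma / sqrt(n))   # sample mean of n observations
        post_mean = w * xbar + (1 - w) * prior_mean
        hits += abs(theta - post_mean) <= z * post_sd
    return hits / replications

averaged = prior_averaged_coverage(n=5, sigma=2.0, prior_mean=0.0, prior_sd=1.0,
                                   level=0.95, replications=20000)
print(f"prior-averaged coverage: {averaged:.3f}")
```

This averages over true values. It says nothing about coverage at any single fixed true value, which is the frequentist requirement.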

The short short version of this part of the argument reads:

What Bayesians call calibration, frequentists call valid confidence coverage. Bayesian posterior probability intervals do not have valid confidence coverage in general; outside a few special cases, priors that can guarantee it do not exist.

Please provide an example where frequentists get exact answers and Bayesians get only approximations, all from the same data. This looks highly improbable to me. Or did you mean something else?

No, this is more-or-less what I meant. I equivocate on “exact,” because I regard the Bayesian answer as exactly what one actually wants, and perfect frequentist validity as a secondary consideration. To provide the example you requested, I’ll have to go searching for one of the papers that set off this line of thought—the bloody thing’s not online, so it might take a while.

Could you state your point with math? I don’t understand what you are saying.

You can find some of the math, and pointers into the literature, in this paper

I came to this post via a Google search (hence this late comment). The problem that Cyan’s pointing out—the lack of calibration of Bayesian posteriors—is a real problem, and in fact something I’m facing in my own research currently. Upvoted for raising an important, and under-discussed, issue.

“The upshot is that we have good reason to think that Bayesian posterior intervals will not be perfectly calibrated in general.”

This seems to be the main point of your post; and nothing in the post seems to be connected to it.

The ideas of the post are: calibration seems to me to be equivalent to confidence coverage (second and third paragraphs); in general, Bayesian posterior intervals do not have valid confidence coverage (fourth paragraph). The sentence you quote above follows from these two ideas.

Okay, that helps. My problem is that, on re-reading, I still don’t know what the 4th paragraph means.

Why would anybody want non-informative distributions?

I don’t know what it means for a confidence interval to be asymptotically valid, or why posterior intervals have this effect. This seems like an important point that should be justified.

You lost me entirely.

To have a prior distribution to use when very little is known about the estimand. It’s meant to somehow capture the notion of minimal prior knowledge contributing to the posterior distribution, so that the data drive the conclusions, not the prior.

The confidence coverage of a posterior interval is equal to the posterior probability mass of the interval plus a term which goes to zero as the amount of data increases without bound.
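In symbols, as a sketch: the notation is assumed here, not taken from the post, and no particular rate of convergence is claimed.

```latex
% Assumed notation: X_{1:n} is the data, \theta is the true parameter value, and
% I_{1-\alpha}(X_{1:n}) is a posterior interval with posterior probability mass 1-\alpha.
\Pr_{\theta}\bigl\{\theta \in I_{1-\alpha}(X_{1:n})\bigr\} = (1-\alpha) + r_n(\theta),
\qquad r_n(\theta) \to 0 \ \text{as}\ n \to \infty .
```

The probability on the left is taken over repeated data sets with the parameter held fixed. Exact validity would mean the remainder is zero for every n.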

E.g., a regression with more than one predictor. Each predictor has its own coefficient, so the model of the data-generating process contains more than one scalar parameter.

Is this a standard frequentist idea? Is there a link to a longer explanation somewhere? Well-calibrated priors and well-calibrated likelihood ratios should result in well-calibrated posteriors.

Valid confidence coverage is a standard frequentist idea. Wikipedia’s article on the subject is a good introduction. I’ve added the link to the post.

The problem is exactly: how do you get a well-calibrated prior when you know very little about the question at hand? If your posterior is well-calibrated, your prior must have been as well. So, seek a prior that guarantees posterior calibration. This is the “matching prior” program I described above.

This sounds like Gibbs sampling or expectation maximization. Are Gibbs and/or EM considered Bayesian or frequentist? (And what’s the difference between them?)

Gibbs sampling and EM aren’t relevant to the ideas of this post.

Neither Gibbs sampling nor EM is intrinsically Bayesian or frequentist. EM is just a maximization algorithm useful for certain special cases; the maximized function could be a likelihood or a posterior density. Gibbs sampling is just an MCMC algorithm; usually the target distribution is a Bayesian posterior distribution, but it doesn’t have to be.

You said, “seek a prior that guarantees posterior calibration.” That’s what both EM and Gibbs sampling do, which is why I asked.

You and I have very different understandings of what EM and Gibbs sampling accomplish. Do you have references for your point of view?