# Are calibration and rational decisions mutually exclusive? (Part one)

I’m planning a two-part sequence with the aim of throwing open the question in the title to the LW commentariat. In this part I’ll briefly go over the concept of calibration of probability distributions and point out a discrepancy between calibration and Bayesian updating.

It’s a tenet of rationality that we should seek to be well-calibrated. That is, suppose that we are called on to give interval estimates for a large number of quantities; we give each interval an associated epistemic probability. We declare ourselves well-calibrated if the relative frequency with which the quantities fall within our specified intervals matches our claimed probability. (The Technical Explanation of Technical Explanations discusses calibration in more detail, although it mostly discusses discrete estimands, while here I’m thinking about continuous estimands.)
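To make the definition concrete, here is a minimal simulation sketch (my own toy setup, not from the post): the "quantities" are standard-normal draws, and every stated 90% interval is the central region (−1.645, 1.645). Being well-calibrated means the hit rate matches the claimed 90%.

```python
import random

# Toy calibration check: each quantity is a standard-normal draw, and we
# claim 90% probability for the central interval (-1.645, 1.645).
# Calibration means ~90% of quantities fall inside the intervals we
# labeled "90%".
random.seed(0)
n = 100_000
hits = sum(1 for _ in range(n) if -1.645 < random.gauss(0, 1) < 1.645)
coverage = hits / n
print(f"claimed 0.90, observed {coverage:.3f}")
```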

Frequentists also produce interval estimates, at least when “random” data is available. A frequentist “confidence interval” is really a function from the data and a user-specified confidence level (a number from 0 to 1) to an interval. The confidence interval procedure is “valid” if, in a hypothetical infinite sequence of replications of the experiment, the relative frequency with which the realized intervals contain the estimand is equal to the confidence level. (Less strictly, we may require “greater than or equal” rather than “equal”.) The similarity between valid confidence coverage and well-calibrated epistemic probability intervals is evident.
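A sketch of valid coverage (my own example, with made-up numbers): the textbook 95% interval for a normal mean with known standard deviation, checked over many hypothetical replications of the experiment.

```python
import random, math

# Validity check for the textbook 95% CI for a normal mean with known
# sigma: xbar +/- 1.96 * sigma / sqrt(n).  Over many replications, the
# realized intervals should contain the true mean ~95% of the time.
random.seed(1)
mu, sigma, n_obs, z = 3.0, 2.0, 25, 1.96
reps = 20_000
covered = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n_obs)) / n_obs
    half = z * sigma / math.sqrt(n_obs)
    if xbar - half <= mu <= xbar + half:
        covered += 1
print(covered / reps)  # close to 0.95
```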

This similarity suggests an approach for specifying non-informative prior distributions, i.e., we require that such priors yield posterior intervals that are also valid confidence intervals in a frequentist sense. This “matching prior” program does not succeed in full generality. There are a few special cases of data distributions where a matching prior exists, but by and large, posterior intervals can at best produce only asymptotically valid confidence coverage. Furthermore, according to my understanding of the material, if your model of the data-generating process contains more than one scalar parameter, you have to pick one “interest parameter” and be satisfied with good confidence coverage for the marginal posterior intervals for that parameter alone. For approximate matching priors with the highest order of accuracy, a different choice of interest parameter usually implies a different prior.
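A toy illustration of the mismatch (my own numbers, deliberately using an informative prior rather than a candidate matching prior): with a N(0, 1) prior and one observation x ~ N(θ, 1), the posterior is N(x/2, 1/2). For a fixed true θ = 2, the central 95% posterior interval covers θ only about 78% of the time, so it is not a valid 95% confidence interval.

```python
import random, math

# Prior N(0,1), one observation x ~ N(theta, 1)  =>  posterior N(x/2, 1/2).
# For fixed true theta = 2, check how often the central 95% posterior
# interval actually covers theta under repeated sampling.
random.seed(2)
theta, z = 2.0, 1.96
half = z * math.sqrt(0.5)          # half-width of the 95% posterior interval
reps = 50_000
covered = sum(
    1 for _ in range(reps)
    if abs(random.gauss(theta, 1.0) / 2 - theta) <= half
)
print(covered / reps)  # noticeably below 0.95 (about 0.78)
```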

The upshot is that we have good reason to think that Bayesian posterior intervals will not be perfectly calibrated in general. I have good justifications, I think, for using the Bayesian updating procedure, even if it means the resulting posterior intervals are not as well-calibrated as frequentist confidence intervals. (And I mean good confidence intervals, not the obviously pathological ones.) But my justifications are grounded in an epistemic view of probability, and no committed frequentist would find them as compelling as I do. However, there is an argument for Bayesian posteriors over confidence intervals that even a frequentist would have to credit. That will be the focus of the second part.

• I don’t get it.

I admit my math background is limited to upper-division undergraduate, and I admit I could have tried harder to make sense of the jargon, but after reading this a few times, I really just don’t get what your point is, or even what kind of thing your point is supposed to be.

• Suppose the actual frequentist probability of an event is 90%. Your prior distribution for the frequentist probability of the event is uniform. Your Bayesian probability of the event will start at 50% and approach 90%; in the long run, the average will be less than 90%.

If the post is getting at more than this, I understand as little as you do. My answer to the title question was “no, they can’t be” going in, and if the post is trying to say something I haven’t understood, then I hope to convince the author they’re wrong through sheer disagreement.
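The parent comment’s point can be sketched numerically, assuming (as one reading) that “Bayesian probability” means the posterior predictive probability under a uniform prior, which by Laplace’s rule of succession is (k+1)/(n+2) after k successes in n trials:

```python
import random

# True Bernoulli probability 0.9, uniform prior on p.  The predictive
# probability of the next success after k successes in n trials is
# (k+1)/(n+2) (Laplace's rule): it starts at 1/2 and climbs toward 0.9.
random.seed(3)
p_true = 0.9
k = 0
preds = []
for n in range(200):
    preds.append((k + 1) / (n + 2))     # predictive before trial n+1
    k += random.random() < p_true       # observe trial n+1
print(preds[0], preds[-1])  # 0.5 at the start, near 0.9 at the end
```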

• Try rephrasing your first paragraph when the quantity of interest is not a frequency but, say, Avogadro’s number, and you’re Jean Perrin trying to determine exactly what that number is.

A frequentist would take a probability model for the data you’re generating and give you a confidence interval. A billion scientists repeat your experiments, getting their own data and their own intervals. Among those intervals, the proportion that contain the true value of Avogadro’s number is equal to the confidence level (up to sampling error).

A Bayesian would take the same probability model, plus a prior, and combine them using Bayes’ theorem. Each scientist may have her own prior, and posterior calibration is only guaranteed if (i) all the priors taken as a group were calibrated, or (ii) everyone is using the matching prior if it exists (these are typically improper, so prior calibration cannot be calculated).
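A quick sketch of clause (i) (my own toy model, one observation per scientist): if each scientist’s estimand really is drawn from the common N(0, 1) prior, then averaging over scientists, the 95% posterior intervals do cover the truth 95% of the time. Calibrated priors give calibrated posteriors.

```python
import random, math

# Each "scientist" gets a theta drawn from the prior N(0,1) and one
# observation x ~ N(theta, 1); the posterior is N(x/2, 1/2).  Averaged
# over the prior, the 95% posterior intervals are calibrated.
random.seed(4)
half = 1.96 * math.sqrt(0.5)       # half-width of the 95% posterior interval
reps = 50_000
covered = 0
for _ in range(reps):
    theta = random.gauss(0, 1)     # parameter drawn from the prior itself
    x = random.gauss(theta, 1)     # one observation
    if abs(x / 2 - theta) <= half: # posterior interval covers theta?
        covered += 1
print(covered / reps)  # close to 0.95
```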

• The short short version of this part of the argument reads:

What Bayesians call calibration, frequentists call valid confidence coverage. Bayesian posterior probability intervals do not have valid confidence coverage in general; priors that can guarantee it do not exist.

• Please provide an example where frequentists get exact answers and Bayesians get only approximations, all from the same data. This looks highly improbable to me. Or did you mean something else?

• No, this is more-or-less what I meant. I equivocate on “exact,” because I regard the Bayesian answer as exactly what one actually wants, and perfect frequentist validity as a secondary consideration. To provide the example you requested, I’ll have to go searching for one of the papers that set off this line of thought—the bloody thing’s not online, so it might take a while.

• Could you state your point with math? I don’t understand what you are saying.

• I came to this post via a Google search (hence this late comment). The problem that Cyan’s pointing out (the lack of calibration of Bayesian posteriors) is a real problem, and in fact something I’m facing in my own research currently. Upvoted for raising an important, and under-discussed, issue.

• “The upshot is that we have good reason to think that Bayesian posterior intervals will not be perfectly calibrated in general.”

This seems to be the main point of your post; and nothing in the post seems to be connected to it.

• The ideas of the post are: calibration seems to me to be equivalent to confidence coverage (second and third paragraphs); in general, Bayesian posterior intervals do not have valid confidence coverage (fourth paragraph). The sentence you quote above follows from these two ideas.

• Okay, that helps. My problem is that, on re-reading, I still don’t know what the 4th paragraph means.

This similarity suggests an approach for specifying non-informative prior distributions

Why would anybody want non-informative distributions?

by and large, posterior intervals can at best produce only asymptotically valid confidence coverage.

I don’t know what it means for a confidence interval to be asymptotically valid, or why posterior intervals have this effect. This seems like an important point that should be justified.

if your model of the data-generating process contains more than one scalar parameter, you have to pick one “interest parameter” and be satisfied with good confidence coverage for the marginal posterior intervals for that parameter alone

You lost me entirely.

• Why would anybody want non-informative distributions?

To have a prior distribution to use when very little is known about the estimand. It’s meant to somehow capture the notion of minimal prior knowledge contributing to the posterior distribution, so that the data drive the conclusions, not the prior.

I don’t know what it means for a confidence interval to be asymptotically valid.

The confidence coverage of a posterior interval is equal to the posterior probability mass of the interval plus a term which goes to zero as the amount of data increases without bound.
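A worked example of that vanishing term (my own toy model, not from the matching-prior literature): prior N(0, 1), n observations from N(θ, 1), true θ fixed at 2. The exact frequentist coverage of the central 95% posterior interval can be computed in closed form; it is well below 0.95 for small n and tends to 0.95 as n grows.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Prior N(0,1), n obs from N(theta,1): posterior is
# N(n*xbar/(n+1), 1/(n+1)).  For fixed true theta, the deviation
# D = n*xbar/(n+1) - theta is normal with known mean and sd, so the
# exact coverage of the central 95% posterior interval is computable.
theta, z95 = 2.0, 1.96
for n in (1, 10, 100, 1000):
    c = z95 / math.sqrt(n + 1)            # posterior interval half-width
    mean_d = -theta / (n + 1)             # bias of the posterior mean
    sd_d = math.sqrt(n) / (n + 1)         # sd of the posterior mean
    cov = phi((c - mean_d) / sd_d) - phi((-c - mean_d) / sd_d)
    print(n, round(cov, 4))               # climbs toward 0.95
```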

if your model of the data-generating process contains more than one scalar parameter...

E.g., a regression with more than one predictor. Each predictor has its own coefficient, so the model of the data-generating process contains more than one scalar parameter.

• Is this a standard frequentist idea? Is there a link to a longer explanation somewhere? Well-calibrated priors and well-calibrated likelihood ratios should result in well-calibrated posteriors.

• Valid confidence coverage is a standard frequentist idea. Wikipedia’s article on the subject is a good introduction. I’ve added the link to the post.

The problem is exactly: how do you get a well-calibrated prior when you know very little about the question at hand? If your posterior is well-calibrated, your prior must have been as well. So, seek a prior that guarantees posterior calibration. This is the “matching prior” program I described above.

• This sounds like Gibbs sampling or expectation maximization. Are Gibbs and/​or EM considered Bayesian or frequentist? (And what’s the difference between them?)

• Gibbs sampling and EM aren’t relevant to the ideas of this post.

Neither Gibbs sampling nor EM is intrinsically Bayesian or frequentist. EM is just a maximization algorithm useful for certain special cases; the maximized function could be a likelihood or a posterior density. Gibbs sampling is just an MCMC algorithm; usually the target distribution is a Bayesian posterior distribution, but it doesn’t have to be.

• You said, “seek a prior that guarantees posterior calibration.” That’s what both EM and Gibbs sampling do, which is why I asked.

• You and I have very different understandings of what EM and Gibbs sampling accomplish. Do you have references for your point of view?