Are calibration and rational decisions mutually exclusive? (Part one)

I’m planning a two-part sequence with the aim of throwing open the question in the title to the LW commentariat. In this part I’ll briefly go over the concept of calibration of probability distributions and point out a discrepancy between calibration and Bayesian updating.

It’s a tenet of rationality that we should seek to be well-calibrated. That is, suppose that we are called on to give interval estimates for a large number of quantities; we give each interval an associated epistemic probability. We declare ourselves well-calibrated if the relative frequency with which the quantities fall within our specified intervals matches our claimed probability. (The Technical Explanation of Technical Explanations discusses calibration in more detail, although it mostly discusses discrete estimands, while here I’m thinking about continuous estimands.)
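To make this concrete, here is a toy simulation of my own (the specific setup, a normal-normal model with 90% intervals, is just an illustrative assumption, not anything from the discussion above). An agent with a correct model reports central 90% posterior intervals for many quantities, and we check that the hit frequency matches the claimed probability:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1,000 unknown quantities drawn from N(0, 1); for each we
# see one measurement with N(0, 1) noise and report the central 90%
# interval of the conjugate posterior, which is N(y/2, 1/2) for
# measurement y.
n = 1000
truth = rng.normal(0.0, 1.0, size=n)
y = truth + rng.normal(0.0, 1.0, size=n)

z90 = 1.6448536269514722            # standard normal 95th percentile
post_mean, post_sd = y / 2.0, np.sqrt(0.5)
lo, hi = post_mean - z90 * post_sd, post_mean + z90 * post_sd

# Well-calibrated: the quantities should fall inside the stated
# intervals about 90% of the time, matching the claimed probability.
coverage = np.mean((truth >= lo) & (truth <= hi))
print(f"claimed 90%, observed {coverage:.1%}")
```

Because the agent’s model here matches the true data-generating process, the observed frequency comes out close to 90%; a miscalibrated agent would show a systematic gap.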

Frequentists also produce interval estimates, at least when “random” data is available. A frequentist “confidence interval” is really a function from the data and a user-specified confidence level (a number from 0 to 1) to an interval. The confidence interval procedure is “valid” if, in a hypothetical infinite sequence of replications of the experiment, the relative frequency with which the realized intervals contain the estimand is equal to the confidence level. (Less strictly, we may require “greater than or equal” rather than “equal”.) The similarity between valid confidence coverage and well-calibrated epistemic probability intervals is evident.
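The “infinite sequence of replications” criterion can be sketched by simulation (again a toy example of my own, using the textbook known-variance normal interval): replicate a simple experiment many times and count how often the realized 95% interval contains the fixed estimand.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy replication study: normal data with known sd; the procedure maps
# data and a confidence level to the interval xbar +/- z * sigma/sqrt(n).
mu = 3.0                  # the estimand, fixed across replications
sigma, n_obs = 2.0, 25
n_reps = 20000

data = rng.normal(mu, sigma, size=(n_reps, n_obs))
xbar = data.mean(axis=1)
half = 1.959964 * sigma / np.sqrt(n_obs)   # z-quantile for 95% confidence

# "Valid": the relative frequency of intervals containing the estimand
# should match the stated confidence level.
covered = np.mean((xbar - half <= mu) & (mu <= xbar + half))
print(f"nominal 95%, empirical {covered:.1%}")
```

With enough replications the empirical coverage settles near the nominal 95%, which is exactly the property the quoted definition demands.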

This similarity suggests an approach for specifying non-informative prior distributions, i.e., we require that such priors yield posterior intervals that are also valid confidence intervals in a frequentist sense. This “matching prior” program does not succeed in full generality. There are a few special cases of data distributions where a matching prior exists, but by and large, posterior intervals can at best produce only asymptotically valid confidence coverage. Furthermore, according to my understanding of the material, if your model of the data-generating process contains more than one scalar parameter, you have to pick one “interest parameter” and be satisfied with good confidence coverage for the marginal posterior intervals for that parameter alone. For approximate matching priors with the highest order of accuracy, a different choice of interest parameter usually implies a different prior.
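One classic special case where matching is exact is a normal model with known standard deviation and a flat (improper) prior on the mean: the posterior is N(xbar, sigma²/n), so central posterior intervals coincide with the usual confidence intervals at every level. The check below is my own sketch of that case:

```python
import numpy as np

rng = np.random.default_rng(2)

# Special case with an exact matching prior: normal data, known sd,
# flat prior on the mean.  Posterior: N(xbar, sigma^2 / n).
sigma, n_obs = 2.0, 25
x = rng.normal(3.0, sigma, size=n_obs)
xbar = x.mean()

# Frequentist 95% interval, in closed form.
half = 1.959964 * sigma / np.sqrt(n_obs)
ci = (xbar - half, xbar + half)

# Bayesian central 95% posterior interval, via Monte Carlo draws.
draws = rng.normal(xbar, sigma / np.sqrt(n_obs), size=200_000)
post = tuple(np.quantile(draws, [0.025, 0.975]))

print(ci, post)  # the two intervals agree up to Monte Carlo error
```

The point of the paragraph above is that such exact agreement is the exception: outside a handful of models like this one, posterior intervals match confidence coverage only asymptotically.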

The upshot is that we have good reason to think that Bayesian posterior intervals will not be perfectly calibrated in general. I have good justifications, I think, for using the Bayesian updating procedure, even if it means the resulting posterior intervals are not as well-calibrated as frequentist confidence intervals. (And I mean good confidence intervals, not the obviously pathological ones.) But my justifications are grounded in an epistemic view of probability, and no committed frequentist would find them as compelling as I do. However, there is an argument for Bayesian posteriors over confidence intervals that even a frequentist would have to credit. That will be the focus of the second part.