A Proper Scoring Rule for Confidence Intervals

You probably already know that you can incentivise honest reporting of probabilities using a proper scoring rule like log score, but did you know that you can also incentivize honest reporting of confidence intervals?

To incentivize reporting of a 90% confidence interval, take the score −(S + 20D), where S is the size of your confidence interval, and D is the distance between the true value and the interval. D is 0 whenever the true value is in the interval.

This incentivizes not only giving an interval that contains the true value 90% of the time, but also distributes the remaining 10% equally between overestimates and underestimates.

To keep the lower bound of the interval important, I recommend measuring S and D in log space. So if the true value is T and the interval is (A, B), then S is log(B) − log(A), and D is log(A) − log(T) for overestimates and log(T) − log(B) for underestimates. Of course, you need questions with positive answers to do this.

To do a p confidence interval, take the score −(S + 2D/(1 − p)).

This can be used to make calibration training, using something like Wits and Wagers cards, more fun. I also think it could be turned into an app, if one could get a large list of questions with numerical values.

• EDIT: I originally said you can do this for multiple choice questions, which is wrong. It only works for questions with two answers.

(In a comment, to keep the top level post short.)

One cute way to do calibration for probabilities is to construct a spinner. If you have a true/false question, you can construct a spinner which is divided up according to your probability that each answer is the correct answer.

If you were to then spin the spinner once, and win if it comes up on the correct answer, this would not incentivize constructing the spinner to represent your true beliefs. The best strategy is to put all the mass on the most likely answer.

However, if you spin the spinner twice, and win if either spin lands on the correct answer, you are actually incentivized to make the spinner match your true probabilities!
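A quick numeric check of the two-spin claim (my own sketch, not from the thread): a grid search over spinner settings shows the winning probability is maximized by the honest spinner.

```python
# Probability of winning the two-spin game: you lose only if both
# spins miss, so P(win) = 1 - p*(1-q)**2 - (1-p)*q**2,
# where p is your true belief and q is the spinner's "true" slice.
def win_prob(p, q):
    return 1 - p * (1 - q) ** 2 - (1 - p) * q ** 2

p = 0.7  # your true belief that the answer is "true"
best_q = max((i / 1000 for i in range(1001)), key=lambda q: win_prob(p, q))
print(best_q)  # 0.7 -- the honest spinner wins most often
```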

One reason this game is nice is that it does not require having a correctly specified utility function that you are trying to maximize in expectation. There are only two states, win and lose, and as long as winning is preferred to losing, you should construct your spinner with your true probabilities.

Unfortunately this doesn’t work for the confidence intervals, since they seem to require a score that is not bounded below.

• Two spins only works for two possible answers. Do you need N spins for N answers?

• You are correct. It doesn’t work for more than two answers. I knew that when I thought about this before, but forgot. Corrected above.

I don’t have a nice algorithm for N answers. I tried a bunch of the obvious simple things, and they don’t work.

• I think an algorithm for N outcomes is: spin twice, gain 1 every time you get the answer right but lose 1 if both guesses are the same.

One can “see intuitively” why it works: when we increase the spinner-probability of outcome i by a small delta (imagining that all other probabilities stay fixed, and not worrying about the fact that our sum of probabilities is now 1 + delta) then the spinner-probability of getting the same outcome twice goes up by 2 x delta x p[i]. However, on each spin we get the right answer delta x q[i] more of the time, where q[i] is the true probability of outcome i. Since we’re spinning twice we get the right answer 2 x delta x q[i] more often. These cancel out if and only if p[i] = q[i]. [Obviously some work would need to be done to turn that into a proof...]
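That argument can be sanity-checked numerically (my own sketch, assuming the payoffs above: +1 per correct spin, −1 if both spins land the same). The expected score is 2·Σ qᵢpᵢ − Σ pᵢ², which equals Σ qᵢ² − Σ (pᵢ − qᵢ)², so it is maximized exactly at p = q:

```python
import random

# Expected score of the two-spin rule with spinner probabilities p
# and true outcome probabilities q:
#   E[score] = 2 * sum(q_i * p_i) - sum(p_i ** 2)
def expected_score(p, q):
    return 2 * sum(qi * pi for qi, pi in zip(q, p)) - sum(pi ** 2 for pi in p)

random.seed(0)
q = [0.5, 0.3, 0.2]          # true probabilities over N = 3 outcomes
honest = expected_score(q, q)
for _ in range(1000):        # no random spinner beats the honest one
    r = [random.random() for _ in q]
    p = [x / sum(r) for x in r]
    assert expected_score(p, q) <= honest + 1e-12
```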

• Just to be clear: if you spin twice and both come up right, you’re gaining 2 and then losing 1? (I.e., this is equivalent to what you wrote in an earlier version of the comment?)

• That’s right.

• (Why does the two-spin version work?)

• In a true/false question that is true with probability p, if you assign probability q, your probability of losing is p(1 − q)^2 + (1 − p)q^2. (The probability the answer is true and you spin false twice, plus the probability the answer is false and you spin true twice.)

This probability is minimized when its derivative with respect to q is 0, or at the boundary. This derivative is −2p(1 − q) + 2(1 − p)q, which is 0 when q = p. We now know the minimum is achieved when q is 0, 1, or p. The probability of losing when q = 0 is p. The probability of losing when q = 1 is 1 − p. The probability of losing when q = p is p(1 − p), which is the lowest of the three options.

• Copied without LaTeX:

In a true/false question that is true with probability p, if you assign probability q, your probability of losing is p(1−q)^2 + (1−p)q^2. (The probability the answer is true and you spin false twice, plus the probability the answer is false and you spin true twice.)

This probability is minimized when its derivative with respect to q is 0, or at the boundary. This derivative is −2p(1−q) + 2(1−p)q, which is 0 when q = p. We now know the minimum is achieved when q is 0, 1, or p. The probability of losing when q = 0 is p. The probability of losing when q = 1 is 1−p. The probability of losing when q = p is p(1−p), which is the lowest of the three options.

• This is called either Brier or quadratic scoring, not sure which.

• Not exactly. Its expected value is the same as the expected value of the Brier score, but the score itself is either 0 or 1.

• For some reason, the LaTeX is not rendering for me. I can see it when I edit the comment, but not otherwise.

• The comment has just started rendering for me.

Edit: Oh wait no, you just added another comment without LaTeX.

• Huh, that’s really weird. The server must somehow be choking on the specific LaTeX you posted. Will check it out.

• This is an underappreciated fact! I like how simple the rule is when framed in terms of size and distance.

You mention both the linear and log rules. The log rule has the benefit of being scale-invariant, so your score isn’t affected by the units the answer is measured in, but it can’t deal with negatives and gets overly sensitive around zero. The linear rule doesn’t blow up around zero, is shift-invariant, and can handle negative values fine. The best generic scoring rule would have all these properties.

Turns out (based on Lambert and Shoham, “Eliciting truthful answers to multiple choice questions”) that all scoring rules for symmetric confidence intervals with coverage probability p can be represented (up to affine transformation) as

−(f(B) − f(A)) − (2/(1 − p)) × [(f(A) − f(x)) 1[x < A] + (f(x) − f(B)) 1[x > B]]

where x is the true value, 1[·] is the indicator function, and f is any increasing function. Unsurprisingly, the linear rule uses f(x) = x and the log rule uses f(x) = log(x). If we want scale-invariance on the whole real line, the first thing I’d be tempted to do is use log(x) for positive x and −log(−x) for negative x, except for that pesky bit about going off to −∞ around zero. Let’s paste in a linear portion around zero so the function is increasing everywhere:

f(x) = log(x) + 1 for x ≥ 1, f(x) = x for −1 < x < 1, f(x) = −log(−x) − 1 for x ≤ −1.

Using this f, the score is sensitive to absolute values around zero and sensitive to relative values on both sides of it. Since the rule expects more accuracy around zero, the origin should vary depending on question domain. Like if the question is about dates, accuracy should be the highest around the present year and get less accurate going into the past or future. That suggests we should set the origin at the present year. For temperatures, the origin should probably be room temperature. Are there any other standard domains that should have a non-zero origin? An alternate origin c can be added as a shift everywhere:

f_c(x) = f(x − c).

Not something you’d want to calculate by hand, but if someone implements a calibration app, this has more consistent scores. Going one step further, the scores could be made more interpretable by comparison to a perfectly calibrated reference score: score′ = score − score_ref + B, where score_ref is the expected score for perfectly calibrated intervals (computed under some assumed reference distribution, say), and B is a fixed value chosen to keep plausible scores mostly positive.
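For an app, the generalized rule might look like the sketch below (my own code, not from the comment; the piecewise f, log-like away from zero with a linear portion pasted in on (−1, 1), is one guess at the hybrid described in words above):

```python
import math

def f(x):
    """Increasing, continuous: log-like away from zero, linear near it."""
    if x >= 1:
        return math.log(x) + 1
    if x <= -1:
        return -math.log(-x) - 1
    return x

def score(low, high, x, p=0.9):
    """Generalized interval score -(f(high) - f(low) + 2*D/(1-p)),
    with the miss distance D measured through f."""
    d = max(f(low) - f(x), 0.0, f(x) - f(high))
    return -((f(high) - f(low)) + 2.0 * d / (1.0 - p))
```

A shifted origin c is then just a matter of scoring with f(x − c), per the comment’s final step.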

• I need help figuring out how to use this scoring rule. Please consider the following application.

How much does it cost to mail a letter under 30g in Canada?

I remember when I was a child buying 45c stamps, so it’s likely to be larger than that. It’s been over a decade or so, and assuming a 2% rise in cost per year, then we should be around 60c per stamp. However, we also had big budget cuts to our postal service that even I learned about despite not reading the news. Let’s say that Canada Post increased their prices by 25% to accommodate some shortfall. My estimate is that stamps cost 75c.

What should be my confidence interval? Would I be surprised if a stamp cost a dollar? Not really, but it feels like an upper bound. Would I be surprised if a stamp cost less than 50c? Yes. 60c? Yes. 70c? Hmmm... Assume that I’m well calibrated, so I’m reporting 90% confidence for an interval of stamps costing 70c to 100c.

Answer: Stamps in booklets cost 85c each, individual stamps are 100c each. Because I would always buy stamps in booklets, I will use the 85c figure.

S is the size of my confidence interval: 100 − 70 = 30. D is the distance between the true value and the interval, but is 0 in this case because the true value is in the interval. So my score is −(30 + 20 × 0) = −30.

I’m not really sure what to do with this number, so let’s move to the next paragraph of the post.

In log space: the true value is T = 85 and the interval is (A, B) = (70, 100). Because the true value is contained in the interval, D = 0, and S is log(100) − log(70) ≈ 0.36, for a score of about −0.36.
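Spelling out that log-space arithmetic (a sketch; the numbers come from the stamp example above):

```python
import math

low, high, true_value = 70, 100, 85
s = math.log(high) - math.log(low)   # interval size in log space
d = 0.0                              # 85 lies inside (70, 100)
score = -(s + 20 * d)
print(round(score, 3))  # -0.357
```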

How does this incentivise honest reporting of confidence intervals?

Let’s say that, when I intuited my confidence interval above, I was perturbed that it wasn’t symmetric about my estimate of 75c, so I set it to (50, 100) for aesthetic reasons. In this case, my score would be −(log(100) − log(50)) = −log(2) ≈ −0.69, which is worse than my previous score by a factor of 2.

Let’s say that, when I remembered the price of stamps in my childhood, I was way off and remembered 14c stamps. Then I would believe that stamps should cost around 22c now. (Here I have the feeling of “nothing costs less than a quarter!”, so I would probably reject this estimate.) That would likely anchor me, so that I would set a high confidence on the price being within (20, 25), for a score of

−(S + 20D) = −(log(25) − log(20) + 20 × (log(20) − log(85))) ≈ −(0.22 − 28.9) ≈ 28.7.

Am I trying to maximize this score?

I looked up the answer, and the lowest cost standard delivery is 85c for letters under 30g.

• I messed up, and swapped the words overestimate and underestimate in the 4th paragraph. I fixed it now. Score should always be negative.

This will change the value at the end to log(85) − log(25), or about 1.22, making the score −(0.22 + 20 × 1.22) ≈ −24.7.

This score is a very negative number, so you get punished for having a bad interval, relative to the above.

• The idea is that the two terms in the score balance between two effects: trying to make S as small as possible means making your interval as small as possible, but if you make it too small you’re more likely to use an interval which doesn’t contain the truth. Trying to make D as small as possible means making your interval more likely to contain the truth. The coefficients balance the tradeoff between the two so that the interval you end up with is your 90% confidence interval. (According to Scott; I haven’t verified this personally.)

• I have verified it. I was in the process of writing a (fairly lengthy) reply to Stefan’s comment, including a proof that Scott’s scoring rule does indeed have the property that your expected score (according to your actual beliefs about the quantity you’re estimating) is maximized when the confidence interval you state has (again according to your actual beliefs) a 5% chance that the quantity lies below its lower bound and a 5% chance that the quantity lies above its upper bound … but then something I did (I have no inkling what, though it coincided with some combination of keypresses as I was trying to enter some mathematics) made the page go entirely blank, and I didn’t find any way to get my partially-written comment back again.

Anyway, here’s one way (I don’t guarantee it’s best and it feels like there should be a slicker way) to prove it. Let’s suppose the confidence interval you state is (l, r); consider the derivative w.r.t. either of those bounds (let’s say r, but l is similar) of your expected score. The first term in the score is just l − r, whose derivative w.r.t. r is always −1. The second term, −20D, can be written as an integral; differentiating it w.r.t. r turns out to give you 20 Pr(X > r). (The calculation is easy.) So the derivative is zero only when −1 + 20 Pr(X > r) = 0; that is, when Pr(X > r) = 5%. So if the confidence interval you state doesn’t have the property that you expect to be above it exactly 5% of the time, then this derivative is nonzero and therefore some small change in r increases your expected score.

• Would you mind spelling out the integral part?

• Suppose f is your probability density function for the quantity X you’re interested in.

Then the expectation of D is the integral of D(x)f(x), which equals the integral of [max(0, l−x) + max(0, x−r)]f(x). When we differentiate w.r.t. r, the first term obviously goes away because it’s independent of r, so we get the integral of [d/dr max(0, x−r)]f(x). That derivative is 0 for x < r and −1 for x > r, so this is minus the integral of f(x) from r upwards; in other words it’s −Pr(X > r). Multiplying by D’s coefficient −20 and adding the −1 from the first term of the score, we get d(score)/dr = −1 + 20 Pr(X > r).

The calculation for l is exactly the same but with a change of sign; we end up with d(score)/dl = 1 − 20 Pr(X < l).
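A Monte Carlo check of that derivative argument (my own sketch; the standard-normal belief distribution is an arbitrary assumption): the expected score −(r − l) − 20·E[D] should peak when l and r sit at the 5th and 95th percentiles, roughly ±1.645 for a standard normal.

```python
import random
import statistics

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(200_000)]  # samples from your belief

def expected_score(l, r):
    # Score of interval (l, r) against outcome x: -(r - l) - 20 * D,
    # where D = max(l - x, 0, x - r); averaged over the belief samples.
    return statistics.fmean(-(r - l) - 20 * max(l - x, 0.0, x - r) for x in xs)

best = expected_score(-1.645, 1.645)       # the 5%/95% quantile interval
assert best > expected_score(-1.0, 1.645)  # too narrow on the left
assert best > expected_score(-1.645, 2.5)  # too wide on the right
```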

• Thanks for this reply. The technique of asking what each term of your equation represents is one I have not practiced in some time.

This answer very much helped me to understand the model.

• Thank you for providing an example!

• You’re welcome. Something that I’m trying to improve about how I engage with LessWrong is writing out either a summary of the article (without re-referring to the article) or an explicit example of the concept in the article. My hope is that this will help me to actually grok what we’re discussing.

• I get a dozen ‘refresh to render LaTeX’s here (but refreshing doesn’t fix it).

• This scoring rule has some downsides from a usability standpoint. See Greenberg 2018, a whitepaper prepared as background material for a (forthcoming) calibration training app.

• Incentivising accurate probabilistic predictions is central to any art of rationality. This post gives a significant part of that, and it’s super readable, so I’ve curated it.

(Also, nice move adding extra points in the comments.)

• Is there a way to incentivize reporting a true probability distribution? Say Bob wants Alice to provide her probability distribution over the IQ she’ll get on a test. He is willing to give her a real number as a reward, and he wants to hear her probability distribution over her result. What should he do?

It would be nice if it worked for both discrete and non-discrete probability spaces.

• In the discrete case log scoring still works; it generalizes past the binary case.

That is, if Ω is the set of possible outcomes of the test, Bob elicits from Alice a probability distribution q on Ω, then Alice takes the test and gets some outcome ω ∈ Ω, then Bob rewards Alice log(q(ω)). (This number is unfortunately always negative; you can add a positive constant to it if you want.)

Alice’s expected payoff according to her true probability distribution p is

Σ_ω p(ω) log(q(ω)),

also known as the (negative of the) cross entropy between p and q. And you can do a computation, e.g. with Lagrange multipliers, which will verify that for fixed p, the optimal value of q is q = p. I do this calculation in this blog post.
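A quick numeric confirmation of that claim (my own sketch): the expected log score is never higher than at q = p, by Gibbs’ inequality.

```python
import math
import random

def expected_log_score(p, q):
    # Alice's expected payoff: sum over outcomes of p(w) * log(q(w)).
    return sum(pi * math.log(qi) for pi, qi in zip(p, q))

random.seed(1)
p = [0.6, 0.3, 0.1]
honest = expected_log_score(p, p)   # equals minus the entropy of p
for _ in range(1000):               # no dishonest report does better
    r = [random.random() or 0.5 for _ in p]   # guard against a 0.0 draw
    q = [x / sum(r) for x in r]
    assert expected_log_score(p, q) <= honest + 1e-12
```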

A test isn’t a good example to use because the outcome of the test is under Alice’s control, so she can e.g. throw the test and predict this fact. This procedure is best used to elicit Alice’s prediction of something which she cannot influence in any way.

• How did using LaTeX fail?

• I tried starting with a dollar sign, which brought up a yellow prompt that I couldn’t figure out how to easily exit; hitting Enter just started a new line in the prompt. The only way I’ve found to exit it so far is Ctrl + Enter, which submits the comment with the LaTeX displaying as “refresh to display LaTeX,” and continuing to display that after I refresh.

• Ah, you exit the yellow prompt with Esc, and in the yellow prompt you can type any LaTeX, with a live preview beneath it.

Somewhat surprised that it continued to show “refresh to display LaTeX” even after you refreshed. I never had that happen to me. That might have been a result of you submitting from the inside of the prompt, which I can imagine causing errors.

• Awesome, everything’s fine now.

• Do you have some argument that your proposed formulas are optimal?

• What do you mean by optimal?

If you mean they are proper (i.e. incentivize honest reporting), gjm’s comment gives a quick sketch of a proof.

• [Edit: I’m retracting this comment, as I made some incorrect assumptions about Scott’s claim.] This is wrong. It is well known that the only strictly proper scoring rule that depends only on the probability at the actually occurring value is the logarithmic scoring rule (if there are more than two alternatives), or translations and/or positive scalings of the same. In this case, that would be log(Normal(x | mu, sigma^2)), where x is the value that occurs, and mu and sigma^2 are the mean and variance of the normal distribution that fits the interval you defined at the given confidence level. This may be simplified (up to affine transformation) to

−log(sigma^2) − (x − mu)^2 / sigma^2.

Your scoring rule is not a translation and/or positive scaling of the logarithmic scoring rule.

• Throwing out an attempt to resolve the disagreement; sorry if this isn’t actually what we are disagreeing about:

Am I unknowingly using words that imply that I care about normal distributions? I am imagining getting honest reporting out of an agent trying to maximize expected score, but with arbitrary beliefs. I am only trying to get an honest reporting of the subjective 5th and 95th percentiles, and am not trying to get any other information.

• I’m used to seeing normal (or log-normal) distributions fit to subjective confidence intervals, because the confidence intervals are being used to do some subjective probabilistic analysis. I assumed that was what you were doing, given that you were using the actual attained value x, and not just which of the three possibilities A: (x < left), B: (left < x < right), and C: (right < x) occurred.

Hmmm… you seem to have evaded the theorem about the only strictly proper local scoring rule being the logarithmic score by only seeking to find the confidence interval, but using more information than just the region (A, B, or C) the outcome belongs to.

It would help to see a proof of the claim; do you have a reference or a link to a URL giving the proof?

• I don’t have a reference. gjm’s comment gives a quick sketch.

• Oh, a quick thing that’s not a proof, but that may convince you it is true:

It works exactly the same way as saying that measuring the distance between reported value and true value incentivizes honest reporting of your median (the point such that you think the true value is above it with probability 50%).

• This scoring rule does not depend only on the probability at the actually occurring value. You don’t even report the probability at any value. I am not trying to incentivize reporting of probabilities of specific values; I am trying to incentivize reporting an interval (A, B) such that the person reporting the belief believes the point will lie in it with probability 90%.

Your rule seems to be trying to do something else, but it will not incentivize me giving my confidence interval in cases where my beliefs are not normally distributed.