A note about calibration of confidence

Background

In a recent Slate Star Codex post (http://slatestarcodex.com/2016/01/02/2015-predictions-calibration-results/), Scott Alexander made a number of predictions with associated confidence levels, and then, at the end of the year, scored them to determine how well-calibrated he is. In the comments, however, a controversy arose over how to deal with 50% confidence predictions. As an example, here are some of Scott's predictions, including the three he made at 50% confidence:

|    | Proposition | Scott's Prior | Result |
|----|-------------|---------------|--------|
| A  | Jeb Bush will be the top-polling Republican candidate | P(A) = 50% | A is False |
| B  | Oil will end the year greater than $60 a barrel | P(B) = 50% | B is False |
| C  | Scott will not get any new girlfriends | P(C) = 50% | C is False |
| D  | At least one SSC post in the second half of 2015 will get > 100,000 hits | P(D) = 70% | D is False |
| E  | Ebola will kill fewer people in the second half of 2015 than in the first half | P(E) = 95% | E is True |

Scott goes on to score himself as having made 0 of 3 correct predictions at the 50% confidence level, which looks like significant overconfidence. He addresses this by noting that 3 data points is not much to go by, and that the picture could easily have looked fine if any of those results had turned out differently. His resulting calibration curve is this:

[Figure: Scott Alexander's 2015 calibration curve]

However, the commenters had other objections about the anomaly at 50%. After all, P(A) = 50% implies P(~A) = 50%, so the prediction "I will not get any new girlfriends: 50% confidence" is logically equivalent to "I will get at least 1 new girlfriend: 50% confidence", except that one scores as true and the other as false. The score at 50% therefore seems sensitive only to the particular phrasing chosen, independent of the outcome.

One commenter suggests that close to perfect calibration at 50% confidence can be achieved by choosing whether to represent propositions as positive or negative statements by flipping a fair coin. Another suggests replacing 50% confidence with 50.1% or some other number arbitrarily close to 50%, but not equal to it. Others suggest getting rid of the 50% confidence bin altogether.

Scott recognizes that predicting A and predicting ~A are logically equivalent, and that choosing one or the other is arbitrary. But by including only A in his data set rather than ~A, he creates a problem when P(A) = 50%: the equally arbitrary choice to phrase the prediction as ~A would have changed the calibration results despite being the same prediction.

Symmetry

This conundrum illustrates an important point about these calibration exercises. By convention, Scott phrases all of his propositions as statements to which he assigns a probability of at least 50%, recognizing that he doesn't need to separately calibrate probabilities below 50%: the upper half of the calibration curve captures all the relevant information about his calibration.

This is because the calibration curve is symmetric about the 50% mark, as implied by the relation P(X) = 1 - P(~X) and, equivalently, P(~X) = 1 - P(X).

We can enforce that symmetry by recognizing that when we make the claim that proposition X has probability P(X), we are also simultaneously making the claim that proposition ~X has probability 1-P(X). So we add those to the list of predictions and do the bookkeeping on them too. Since we are making both claims, why not be clear about it in our bookkeeping?
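Here is a minimal sketch of that bookkeeping in code (the tuple representation, function name, and print formatting are my own choices, not anything from Scott's post):

```python
def add_negations(predictions):
    """Given (statement, confidence, came_true) records, append the negated
    statement with confidence 1 - p and the opposite outcome, so that both
    halves of each claim get scored."""
    expanded = list(predictions)
    for statement, p, outcome in predictions:
        expanded.append(("NOT: " + statement, 1.0 - p, not outcome))
    return expanded

# Two of Scott's predictions as (statement, confidence, came_true) tuples:
predictions = [
    ("Jeb Bush will be the top-polling Republican candidate", 0.50, False),
    ("At least one SSC post in the second half of 2015 will get > 100,000 hits", 0.70, False),
]
for statement, p, outcome in add_negations(predictions):
    print(f"{p:.2f}  {outcome}  {statement}")
```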

When we do this, we get the full calibration curve, and the confusion about what to do about 50% probability disappears. Scott’s list of predictions looks like this:

|    | Proposition | Scott's Prior | Result |
|----|-------------|---------------|--------|
| A  | Jeb Bush will be the top-polling Republican candidate | P(A) = 50% | A is False |
| ~A | Jeb Bush will not be the top-polling Republican candidate | P(~A) = 50% | ~A is True |
| B  | Oil will end the year greater than $60 a barrel | P(B) = 50% | B is False |
| ~B | Oil will not end the year greater than $60 a barrel | P(~B) = 50% | ~B is True |
| C  | Scott will not get any new girlfriends | P(C) = 50% | C is False |
| ~C | Scott will get new girlfriend(s) | P(~C) = 50% | ~C is True |
| D  | At least one SSC post in the second half of 2015 will get > 100,000 hits | P(D) = 70% | D is False |
| ~D | No SSC post in the second half of 2015 will get > 100,000 hits | P(~D) = 30% | ~D is True |
| E  | Ebola will kill fewer people in the second half of 2015 than in the first half | P(E) = 95% | E is True |
| ~E | Ebola will kill as many or more people in the second half of 2015 than in the first half | P(~E) = 5% | ~E is False |

You will by now have noticed that there is always an even number of predictions, and that exactly half of them are true and half are false. In most cases, as with E and ~E, that means you get a 95%-likely prediction that is true and a 5%-likely prediction that is false, which is what you would expect. A 50%-likely prediction, however, is always accompanied by another 50% prediction, one of which is true and one of which is false. As a result, it is simply not possible to make a binary prediction at 50% confidence that is out of calibration.
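Here is a short sketch of how the binned curve could then be computed from the expanded list (binning by the exact stated confidence is my own simplification; any reasonable bucketing behaves the same way):

```python
from collections import defaultdict

def calibration_bins(predictions):
    """Group (confidence, came_true) pairs by stated confidence and return the
    observed fraction true in each bin. Once the negations are included, the
    0.5 bin always holds matched true/false pairs and so always sits at 0.5."""
    bins = defaultdict(list)
    for p, outcome in predictions:
        bins[round(p, 2)].append(outcome)
    return {p: sum(outcomes) / len(outcomes) for p, outcomes in sorted(bins.items())}
```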

The resulting calibration curve, applied to Scott’s predictions, looks like this:

[Figure: the full symmetric calibration curve, no error bars]

Sensitivity

By the way, this graph doesn't tell the whole calibration story; as Scott noted, it's still sensitive to how many predictions were made in each bucket. We can add "error bars" showing what would have resulted if Scott had made one more prediction in each bucket, depending on whether that prediction came out true or false. The result is the following graph:

[Figure: the full symmetric calibration curve, with error bars]

Note that the error bars have zero width at the 0.5 point. That's because one additional prediction added to that bucket would have had no effect: its negation lands in the same bucket, so the point is fixed by the inherent symmetry.
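Here is a minimal sketch of that error-bar calculation as I read it (the function and its interface are my own):

```python
def bucket_error_bars(outcomes, is_half_bucket=False):
    """Observed fraction true in one confidence bucket, bracketed by the
    fractions that would result if one more prediction had landed in the bucket
    and come out false (lower) or true (upper). In the 0.5 bucket the extra
    prediction drags its own negation in with it -- one true, one false -- so
    the fraction stays pinned at exactly one half."""
    n = len(outcomes)
    k = sum(outcomes)          # statements in this bucket that came out true
    observed = k / n
    if is_half_bucket:
        return observed, observed, observed
    return k / (n + 1), observed, (k + 1) / (n + 1)

# Example: a 0.7 bucket holding 4 statements, 3 of which came true
print(bucket_error_bars([True, True, True, False]))   # (0.6, 0.75, 0.8)
```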

I believe that this kind of graph does a better job of showing someone’s true calibration. But it’s not the whole story.

Ramifications for scoring calibration (updated)

Clearly, it is not possible to make a binary prediction at 50% confidence that is poorly calibrated. This shouldn't come as a surprise: a prediction of 50% between two choices is the correct prior when you have no information that discriminates between X and ~X. However, that doesn't mean you can improve your ability to make correct predictions just by giving them all 50% confidence and claiming impeccable calibration! An easy way to "cheat" your way into apparently good calibration is to take a large number of predictions you are highly (>99%) confident about, negate a fraction of them, and falsely record a lower confidence for those. If we're going to measure calibration, we need a scoring method that encourages people to write down the probabilities they truly believe, rather than faking low confidence and throwing away information. We want people to claim 50% confidence only when they genuinely have 50% confidence, and our scoring method needs to encourage that.

A first guess would be to look at that graph and do the classic assessment of fit: sum of squared errors. We can sum the squared error of our predictions against the ideal linear calibration curve. If we did this, we would want to make sure we summed all the individual predictions, rather than the averages of the bins, so that the binning process itself doesn’t bias our score.

If we do this, then our overall prediction score can be summarized by one number:

S = \frac{1}{N}\left(\sum_{i=1}^{N}(P(X_i)-X_i)^2 \right )

Here P(X_i) is the assigned confidence in the truth of the i-th proposition X_i, which takes the value 1 if it is True and 0 if it is False. S is the prediction score, and lower is better. Note that because these are binary predictions, the sum of squared errors gives its best expected score when you assign the probabilities you actually believe (i.e., there is no way to "cheat" your way to a better score by reporting false confidence).
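As a sketch, here is that calculation in code; the five-prediction subset from the table above is only illustrative (Scott's full list is what produces the 0.139 quoted below):

```python
def squared_error_score(predictions):
    """Mean squared difference between stated confidence and the 0/1 outcome
    (the Brier score). 0 is perfect, 1 is worst."""
    return sum((p - (1.0 if outcome else 0.0)) ** 2
               for p, outcome in predictions) / len(predictions)

# The five predictions from the table above, as (confidence, came_true) pairs.
# Including the negated statements would not change the average: each negation
# contributes exactly the same squared error as its partner.
subset = [(0.5, False), (0.5, False), (0.5, False), (0.7, False), (0.95, True)]
print(squared_error_score(subset))   # ~0.25 for this small subset
```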

In this case, Scott's score is S = 0.139; much of this comes from the 0.4/0.6 bracket. The worst possible score is S = 1, and the best possible score is S = 0. Attempting to fake perfect calibration by claiming 50% confidence for every prediction, regardless of the information you actually have available, yields S = 0.25 and therefore isn't a particularly good strategy (at least, it won't make you look better-calibrated than Scott).

Several of the commenters pointed out that log scoring is another scoring rule that works better in the general case. Before posting this I ran the calculus to confirm that the least-squares error did encourage an optimal strategy of honest reporting of confidence, but I did have a feeling that it was an ad-hoc scoring rule and that there must be better ones out there.
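For the curious, here is a sketch of that check. Suppose a proposition will turn out true with probability q and I report a confidence of p. The expected squared error for that single prediction is

E\left[(p - X)^2\right] = q(p-1)^2 + (1-q)p^2

and its derivative with respect to p is

\frac{d}{dp}\left[q(p-1)^2 + (1-q)p^2\right] = 2q(p-1) + 2(1-q)p = 2(p-q)

which vanishes only at p = q (and the second derivative is positive), so the expected penalty is minimized exactly when I report the probability I actually believe.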

The logarithmic scoring rule looks like this:

S = \frac{1}{N}\sum_{i=1}^{N}X_i\ln(P(X_i))

Here again X_i is the i-th proposition and has a value of 1 if it is True and 0 if it is False (the sum runs over the full list, negated propositions included). The base of the logarithm is arbitrary, so I've chosen base e as it makes it easier to take derivatives. This scoring method gives a negative number, and the closer to zero the better. The log scoring rule has the same honesty-encouraging properties as the sum of squared errors, plus the additional nice property that it penalizes wrong predictions of 100% or 0% confidence with an appropriate score of minus infinity. When you claim 100% confidence and are wrong, you are infinitely wrong. Don't claim 100% confidence!
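As a sketch, here is that rule in code as I read it: each original prediction is scored by the log of the probability it assigned to whatever actually happened (equivalently, summing X_i ln P(X_i) over the expanded list, with N counted as the number of original predictions), which reproduces the all-50% baseline of about -0.69 quoted below.

```python
import math

def log_score(predictions):
    """Average log score over (confidence, came_true) pairs: the log of the
    probability assigned to whatever actually happened, averaged over the
    original predictions. Closer to zero is better; a wrong 100% claim is
    worth minus infinity."""
    total = 0.0
    for p, outcome in predictions:
        prob_of_outcome = p if outcome else 1.0 - p
        total += math.log(prob_of_outcome) if prob_of_outcome > 0 else float("-inf")
    return total / len(predictions)

# Claiming 50% on everything scores ln(0.5):
print(log_score([(0.5, True), (0.5, False)]))   # ~ -0.69
```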

In this case, Scott's score is calculated to be S = -0.42. For reference, the worst possible score would be minus infinity, and claiming nothing but 50% confidence for every prediction results in a score of S = -0.69. This just goes to show that you can't win by cheating.

Example: Pretend underconfidence to fake good calibration

In an attempt to appear better calibrated than Scott Alexander, I am going to make the following predictions. For clarity I have included the inverse propositions in the list (as those are also predictions that I am making), but placed them at the end so you can see the point I am getting at a bit better.

|    | Proposition | Quoted Prior | Result |
|----|-------------|--------------|--------|
| A  | I will not win the lottery on Monday | P(A) = 50% | A is True |
| B  | I will not win the lottery on Tuesday | P(B) = 66% | B is True |
| C  | I will not win the lottery on Wednesday | P(C) = 66% | C is True |
| D  | I will win the lottery on Thursday | P(D) = 66% | D is False |
| E  | I will not win the lottery on Friday | P(E) = 75% | E is True |
| F  | I will not win the lottery on Saturday | P(F) = 75% | F is True |
| G  | I will not win the lottery on Sunday | P(G) = 75% | G is True |
| H  | I will win the lottery next Monday | P(H) = 75% | H is False |
| ~A | I will win the lottery on Monday | P(~A) = 50% | ~A is False |
| ~B | I will win the lottery on Tuesday | P(~B) = 34% | ~B is False |
| ~C | I will win the lottery on Wednesday | P(~C) = 34% | ~C is False |

Look carefully at this table. I've thrown in a mix of predictions that I will or will not win the lottery on certain days, using my extreme certainty about the results to engineer a specific ratio of correct and incorrect predictions at each confidence level.

To make things even easier for me, I'm not even planning to buy any lottery tickets. Given that information, an honest estimate of the odds of my winning the lottery is astronomically small. The odds of winning the lottery at all are about 1 in 14 million (for the Canadian 649 lottery). I'd have to win by accident (one of my relatives buying me a ticket?). Not only that, but the draw is only held on Wednesdays and Saturdays, which makes most of these scenarios even more implausible, since the lottery corporation would have to hold a draw by mistake.

I am confident I could make at least 1 billion similar statements of this exact nature and get them all right, so my true confidence must be upwards of (100% − 0.0000001%).

If I assemble 50 of these types of strategically-underconfident predictions (and their 50 opposites) and plot them on a graph, here’s what I get:

[Figure: my strategically-underconfident calibration curve. Looks like good calibration...? Not so fast.]

You can see that the problem with cheating doesn’t occur only at 50%. It can occur anywhere!

But here’s the trick: The log scoring algorithm rates me −0.37. If I had made the same 100 predictions all at my true confidence (99.9999999%), then my score would have been −0.000000001. A much better score! My attempt to cheat in order to make a pretty graph has only sabotaged my score.

By the way, what if I had gotten one of those wrong, and actually won the lottery one of those times without even buying a ticket? In that case my score is −0.41 (the wrong prediction had a probability of 1 in 10^9 which is about 1 in e^21, so it’s worth −21 points, but then that averages down to −0.41 due to the 49 correct predictions that are collectively worth a negligible fraction of a point).* Not terrible! The log scoring rule is pretty gentle about being very badly wrong sometimes, just as long as you aren’t infinitely wrong. However, if I had been a little less confident and said the chance of winning each time was only 1 in a million, rather than 1 in a billion, my score would have improved to −0.28, and if I had expressed only 98% confidence I would have scored −0.098, the best possible score for someone who is wrong one in every fifty times.
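For anyone who wants to double-check that arithmetic (see the asterisked footnote at the end), here is the calculation as I understand it, averaged over the 50 predictions:

```python
import math

n = 50                      # fifty strategically-underconfident predictions
one_in_a_billion = 1e-9

# 49 correct at probability (1 - 1e-9), one wrong prediction that had probability 1e-9
print((49 * math.log(1 - one_in_a_billion) + math.log(one_in_a_billion)) / n)   # ~ -0.41

# The same single miss, but having quoted 1-in-a-million or 98% confidence instead
print((49 * math.log(1 - 1e-6) + math.log(1e-6)) / n)    # ~ -0.28
print((49 * math.log(0.98) + math.log(0.02)) / n)        # ~ -0.098
```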

This has another important ramification: if you're going to honestly test your calibration, you shouldn't pick the predictions you'll make. It is easy to improve your score by throwing in a couple of predictions that you are very certain about, like that you won't win the lottery, and by making few predictions that you are genuinely uncertain about. It is fairer to use a list of propositions generated by somebody else, and then pick your probabilities. Scott demonstrates his honesty by making public predictions about a mix of things he was genuinely uncertain about, but if he wanted to cook his way to a better score in the future, he would avoid making any predictions in the 50% category that he wasn't forced to.

Input and comments are welcome! Let me know what you think!

* This result surprises me enough that I would appreciate it if someone in the comments could double-check it on their own. What is the proper score for being right 49 times with (1 minus one-in-a-billion) certainty, but wrong once?