• I don’t think this quite answers your question. As far as I can tell, calibration has not been discussed widely enough for any single method to have become the accepted one.

However, what I would like to point out is that a single-number calibration necessarily discards information, and there is no one true way to decide which information to discard.

Consider two agents. A gets binary questions right 98% of the time, but expects to get them right 99% of the time. B gets binary questions right 51% of the time, but expects to get them right 52% of the time.

In some cases, A and B must be treated as equally calibrated (Zut Allais! is relevant). In some cases, B can be considered much better calibrated, and in almost all cases we don’t care either way, because B’s predictions are almost never useful, whereas A’s almost always are.
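To make the A/B comparison concrete, here is a small sketch of two ways to score the same numbers. The function names and the choice of these two summaries are my own illustration, not a standard method: one takes the absolute gap between anticipated and actual accuracy, the other the ratio of actual to anticipated error rate.

```python
def overconfidence_gap(expected_acc, actual_acc):
    """Absolute gap between anticipated and realized accuracy."""
    return expected_acc - actual_acc


def error_rate_ratio(expected_acc, actual_acc):
    """Realized error rate divided by anticipated error rate."""
    return (1 - actual_acc) / (1 - expected_acc)


# (anticipated accuracy, actual accuracy), taken from the example above
a = (0.99, 0.98)
b = (0.52, 0.51)

gap_a = overconfidence_gap(*a)
gap_b = overconfidence_gap(*b)
ratio_a = error_rate_ratio(*a)
ratio_b = error_rate_ratio(*b)

print(f"gap:   A={gap_a:.3f}  B={gap_b:.3f}")
print(f"ratio: A={ratio_a:.2f}  B={ratio_b:.2f}")
```

By the gap, A and B come out the same (one percentage point each). By the ratio, A makes roughly twice as many errors as it anticipated, while B makes only about 2% more than it anticipated. Which of the two you want depends entirely on what you intend to do with the number.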

Even this is a dramatic simplification, painfully restricting our information about the situation. Perhaps A never has false positives; or maybe B never has false positives! This is extremely relevant to many questions, but can’t be represented in any single-number metric.

No matter what your purpose, domain knowledge matters, and I suspect that calibration does not carry over well from one domain to another. Finding out that you know little history, but are well calibrated about how little you know, will not help you evaluate how reliable your predictions are in your primary field.

Binary questions are usually already horribly under-sampled. We can ask binary questions about history, but it probably matters in the real world whether your answer was 2172 or 1879 if the correct answer was 1880. Ideally, we could provide a probability distribution for the entire range of incorrectness, but in practice, I think the best measure is to report the false positive and false negative rate of an agent on a set of questions along with their own estimates for their performance on those questions. I realize this is four times as many numbers as you want, but you can then condense them however you like, and I really think that the 4-tuple is more than four times more useful than any single-number measure!
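A minimal sketch of that 4-tuple follows. It assumes I am reading "their own estimates" as the agent's estimate of each of the two error rates. `CalibrationReport`, `calibration_report`, and the sample answers are hypothetical names and made-up data for illustration only.

```python
from typing import NamedTuple, Sequence


class CalibrationReport(NamedTuple):
    false_positive_rate: float
    false_negative_rate: float
    estimated_false_positive_rate: float  # the agent's own guess
    estimated_false_negative_rate: float  # the agent's own guess


def calibration_report(
    truths: Sequence[bool],
    answers: Sequence[bool],
    est_fpr: float,
    est_fnr: float,
) -> CalibrationReport:
    """Pair measured error rates with the agent's self-estimates."""
    pairs = list(zip(truths, answers))
    false_pos = sum(1 for t, a in pairs if a and not t)
    false_neg = sum(1 for t, a in pairs if t and not a)
    negatives = sum(1 for t in truths if not t)
    positives = sum(1 for t in truths if t)
    # A rate is undefined when there are no questions of that kind.
    fpr = false_pos / negatives if negatives else float("nan")
    fnr = false_neg / positives if positives else float("nan")
    return CalibrationReport(fpr, fnr, est_fpr, est_fnr)


# Made-up answers to eight binary questions.
truths = [True, True, True, True, False, False, False, False]
answers = [True, True, True, False, False, False, True, True]
report = calibration_report(truths, answers, est_fpr=0.25, est_fnr=0.25)
print(report)
```

Here the agent guessed 25% for both error rates. The measured false negative rate matches that guess, but the measured false positive rate is twice as high, a difference that a single combined accuracy number would hide.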

Do you have a more specific purpose in mind? I’m curious what spurred your question.

• Do you have a more specific purpose in mind? I’m curious what spurred your question.

A prof doing an experiment gave me a bunch of data from calibration tests with demographic identifiers, and I’d like to be able to analyze it to say things like “Old people have better calibration than young people” or “Training in finance improves your calibration”.

• Oh, excellent. I do love data. What is the format (what is the maximum amount of information you have about each individual)?

Given that you already have the data (and presumably you have reason to think that individuals were not trying to game the test?), I suspect the best approach is to graph both accuracy and anticipated accuracy against the chosen demographic. Then, for all your readers who want numbers, compute either the ratio or the difference of those two and publish the PMCC (Pearson product-moment correlation coefficient) of that against the demographic. It’s Frequentist, but it’s also standard practice, and I’ve had papers rejected for not following it...
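As a rough sketch of the numeric half of that suggestion: the records below are invented purely for illustration (I have not seen your data or its format), and I am assuming one row per subject with an age, an actual accuracy, and an anticipated accuracy. The correlation is computed by hand so that nothing beyond the standard library is needed.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical records: (age, actual accuracy, anticipated accuracy)
subjects = [
    (22, 0.60, 0.80),
    (35, 0.65, 0.80),
    (48, 0.70, 0.80),
    (61, 0.75, 0.80),
]

ages = [age for age, _, _ in subjects]
# Positive values mean the subject anticipated doing better than they did.
overconfidence = [anticipated - actual for _, actual, anticipated in subjects]

r = pearson_r(ages, overconfidence)
print(f"PMCC of overconfidence against age: {r:.3f}")
```

In this invented data set overconfidence falls steadily with age, so the coefficient comes out at essentially -1. Real data will be far noisier, and swapping the difference for the ratio is a one-line change to the `overconfidence` list.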

• Leaving them with two separate metrics would allow you to make interesting statements like “financial training increased accuracy, but it also decreased calibration. Subjects overestimated their ability.”