I endorse this (while remarking that both Lumifer and I have—independently, so far as I know—suggested in this discussion that a better approach may be simply to turn the observed prediction results into some sort of smoothed/interpolated curve and plot that rather than the usual bar chart).
Let me make a more concrete suggestion.
Step 0 (need be done only once, but so far as I know never has been): Get a number of experimental subjects with highly varied personalities, intelligence, statistical sophistication, etc. Get them to make a lot of predictions with fine-grained confidence levels. Use this to estimate how much calibration error actually varies with confidence level; this effectively gives you a prior distribution on calibration functions.
Step 1 (given actual calibration data): You’re now trying to estimate a single calibration function. Each prediction result has a corresponding likelihood: if something happened that you gave probability p to, the likelihood is simply f(p), where f is the function you’re trying to estimate; if not, the likelihood is 1-f(p). So the log-likelihood is the sum of log f(p) over successful predictions + the sum of log [1-f(p)] over unsuccessful predictions. Add to that the log of the prior from step 0 and maximize: this gives you the posterior-maximizing calibration function. (You could e.g. pick some space of functions large enough to have good approximations to all plausible calibration functions, and optimize over a parameterization of that space.) You can figure out how confident you should be about the calibration function by sampling from the posterior distribution and looking at the resulting distribution of values at any given point.
If what you have is lots of prediction results at each of some number of confidence levels, then a normal approximation applies and you’re basically doing Gaussian process regression or kriging, which quite cheaply gives you not only a smooth curve but error estimates everywhere; in this case you don’t need an explicit representation of the space of (approximate) permissible calibration functions.
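Here is a minimal sketch of Step 1 in Python, under assumptions that are not part of the suggestion above: the space of calibration functions is taken to be the two-parameter family f(p) = sigmoid(a + b*logit(p)) (perfect calibration is a=0, b=1), the prior from step 0 is replaced by a made-up independent Gaussian on (a, b), and posterior sampling uses a Laplace (normal) approximation around the maximum. The function names and default values are hypothetical.

```python
import numpy as np

def logit(p):
    return np.log(p) - np.log1p(-p)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_calibration_map(p, y, prior_mean=(0.0, 1.0), prior_sd=(1.0, 1.0), iters=50):
    """Posterior-maximizing (a, b) for f(p) = sigmoid(a + b*logit(p)).

    p: stated probabilities; y: 1 if the predicted thing happened, else 0.
    The Gaussian prior is a stand-in (assumption) for the step 0 prior.
    """
    p = np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(p), logit(p)])
    m = np.asarray(prior_mean, dtype=float)
    prec = np.diag(1.0 / np.asarray(prior_sd, dtype=float) ** 2)

    def grad_and_hess(theta):
        f = sigmoid(X @ theta)
        # Gradient and Hessian of: sum log f(p) over successes
        # + sum log[1 - f(p)] over failures + log prior.
        grad = X.T @ (y - f) - prec @ (theta - m)
        hess = -(X.T * (f * (1.0 - f))) @ X - prec
        return grad, hess

    theta = m.copy()
    for _ in range(iters):  # plain Newton steps; the objective is concave
        grad, hess = grad_and_hess(theta)
        theta = theta - np.linalg.solve(hess, grad)

    _, hess = grad_and_hess(theta)
    cov = np.linalg.inv(-hess)          # Laplace approximation to the posterior
    return theta, (cov + cov.T) / 2.0   # symmetrize against round-off

def calibration_band(theta, cov, grid, n_samples=2000, seed=0):
    """Sample calibration curves from the approximate posterior and
    return the MAP curve with a pointwise 5%-95% band on `grid`."""
    grid = np.asarray(grid, dtype=float)
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(theta, cov, size=n_samples)
    X = np.column_stack([np.ones_like(grid), logit(grid)])
    curves = sigmoid(draws @ X.T)       # shape (n_samples, len(grid))
    lo, hi = np.percentile(curves, [5, 95], axis=0)
    return sigmoid(X @ theta), lo, hi
```

Calling `fit_calibration_map(p, y)` and then `calibration_band(theta, cov, np.linspace(0.05, 0.95, 19))` gives a smooth curve plus a band to plot in place of the bar chart. A two-parameter family is probably too small to approximate "all plausible calibration functions"; a spline basis in logit(p) would slot into the same code by adding columns to `X`.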
[EDITED: I wrote 1-log where I meant log 1- and have now fixed this.]
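For the binned case mentioned at the end of Step 1 (many results at each of a few confidence levels), the Gaussian process version can be sketched as follows. Everything specific here is an assumption for illustration: the squared-exponential kernel, the prior mean f(p) = p (perfect calibration), the `length_scale` and `signal_sd` values (which step 0 would really have to supply), and the function names. The normal approximation treats each observed frequency as the true f(p) plus Gaussian noise with binomial variance.

```python
import numpy as np

def rbf_kernel(a, b, length_scale, signal_sd):
    """Squared-exponential covariance between confidence levels a and b."""
    d = a[:, None] - b[None, :]
    return signal_sd ** 2 * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_calibration(levels, hits, counts, grid, length_scale=0.2, signal_sd=0.1):
    """GP regression of observed hit frequency on stated confidence.

    levels: stated confidence levels; hits/counts: successes and totals
    at each level; grid: points at which to evaluate the curve.
    Returns the posterior mean and pointwise standard deviation on grid.
    """
    levels = np.asarray(levels, dtype=float)
    hits = np.asarray(hits, dtype=float)
    counts = np.asarray(counts, dtype=float)
    grid = np.asarray(grid, dtype=float)

    freq = hits / counts
    # Binomial variance of each frequency; the +1/+2 smoothing keeps the
    # variance away from zero when a bin is all hits or all misses.
    smoothed = (hits + 1.0) / (counts + 2.0)
    noise_var = smoothed * (1.0 - smoothed) / counts

    K = rbf_kernel(levels, levels, length_scale, signal_sd) + np.diag(noise_var)
    Ks = rbf_kernel(grid, levels, length_scale, signal_sd)
    Kss = rbf_kernel(grid, grid, length_scale, signal_sd)

    # Model the deviation from perfect calibration, so the prior mean is f(p) = p.
    resid = freq - levels
    mean = grid + Ks @ np.linalg.solve(K, resid)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    sd = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, sd
```

Note that nothing in this sketch forces the curve to stay within [0, 1]; near the extremes one might prefer to do the regression on a logit scale.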