benelliott comments on Looking for information on scoring calibration

• I think part of the trouble is that it's very difficult to comment meaningfully on the calibration of a single estimate without background information.

For example, suppose Alice and Bob each make one prediction, with confidence of 90% and 80% respectively, and both turn out to be right. I'd be happy to say that Alice seems so far to be the better predictor of the two (although I'd be prepared to revise this estimate with more data), but it's much harder for me to say who is better calibrated without some background information about what sort of evidence they were working from.

With that in mind, I don't think you're likely to find something as convenient as log scoring, though there are a couple of less mathematically elegant solutions that only work when you have a reasonably large set of predictions to test for calibration (I don't know if this is rigorous enough to help you). Both of these only work for binary true/false predictions but can probably be generalised to other uses.

You could check what proportion of the time they are right, calculate what their log score would have been had they used that proportion as their confidence for every prediction, and compare it to the score they actually got.
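A minimal sketch of this check, using a made-up set of binary predictions (the confidences and outcomes here are purely illustrative):

```python
import math

def log_score(predictions):
    """Sum of log(probability assigned to what actually happened)."""
    return sum(math.log(p if correct else 1 - p)
               for p, correct in predictions)

# Hypothetical predictions: (stated confidence, was it right?)
preds = [(0.9, True), (0.8, True), (0.7, False), (0.95, True), (0.6, True)]

# Overall hit rate, then the score they'd have got stating it every time.
accuracy = sum(correct for _, correct in preds) / len(preds)
actual = log_score(preds)
flat = log_score([(accuracy, correct) for _, correct in preds])

print(f"actual log score:        {actual:.3f}")
print(f"flat hit-rate log score: {flat:.3f}")
```

If the actual score is well below the flat one, their stated confidences are hurting rather than helping.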

Another approach is to examine what happens to their score when you multiply the log odds of every estimate by a constant. Multiplying by a constant greater than one will move estimates towards 0 and 1 and away from 50%, while a constant less than one will do the opposite. Find the constant which maximises their score: if it's significantly less than 1 they're overconfident, if it's significantly more than 1 they're underconfident, and if it's roughly equal to 1 they're well calibrated.
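The log-odds rescaling can be sketched as follows; the predictor here is a hypothetical one who says 90% but is right only 70% of the time, and a simple grid search stands in for a proper optimiser:

```python
import math

def log_score(predictions):
    """Sum of log(probability assigned to what actually happened)."""
    return sum(math.log(p if correct else 1 - p)
               for p, correct in predictions)

def scale(p, c):
    """Multiply the log odds of p by c and convert back to a probability."""
    return 1 / (1 + math.exp(-c * math.log(p / (1 - p))))

# Hypothetical overconfident predictor: says 90%, right 70% of the time.
preds = [(0.9, True)] * 7 + [(0.9, False)] * 3

# Grid search over constants from 0.10 to 3.00 for the score-maximising one.
best_c = max((c / 100 for c in range(10, 301)),
             key=lambda c: log_score([(scale(p, c), corr)
                                      for p, corr in preds]))

print(f"score-maximising constant: {best_c:.2f}")  # below 1: overconfident
```

Note that `scale(p, 1)` leaves every estimate unchanged, and 50% estimates are fixed points under any constant.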

• You could check what proportion of the time they are right, calculate what their log score would have been had they used that proportion as their confidence for every prediction, and compare it to the score they actually got.

Someone who is perfectly calibrated and doesn't always give the same confidence will have a better log score than someone who makes the same series of guesses using the mean accuracy as the confidence for every one. So the latter can't be used as a gold standard.

• That’s actually intentional. I think that if someone is right 90% of the time in some subjects but only right 60% of the time in others, they are better calibrated if they give the appropriate estimate for each subject than if they just give 75% for everything.
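A quick expected-log-score calculation with the numbers from this example (equal numbers of predictions in the two subjects) supports the point:

```python
import math

def expected_score(true_p, stated_p):
    """Expected log score per prediction when events occur with
    probability true_p but the predictor states stated_p."""
    return (true_p * math.log(stated_p)
            + (1 - true_p) * math.log(1 - stated_p))

# Two subjects: right 90% of the time in one, 60% in the other.
varied = (expected_score(0.9, 0.9) + expected_score(0.6, 0.6)) / 2
flat = (expected_score(0.9, 0.75) + expected_score(0.6, 0.75)) / 2

print(f"state the true rate per subject: {varied:.4f}")
print(f"state 75% for everything:        {flat:.4f}")
```

Stating the appropriate confidence for each subject gives a strictly better expected log score than stating 75% across the board, which is what the log score's properness guarantees.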