Looking for information on scoring calibration
There are lots of scoring rules for probability assessments. Log scoring is popular here, and squared error also works.
But if I understand these correctly, they are combined measurements of both domain ability and calibration. For example, if several people took a test on which they had to estimate their confidence in their answers to certain true-or-false questions about history, then well-calibrated people would have a low squared error, but so would people who know a lot about history.
So (I think) someone who always said 70% confidence and got 70% of the questions right would get a better score (a lower squared error) than someone who always said 60% confidence and got 60% of the questions right, even though they are both equally well calibrated.
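The arithmetic behind that example can be checked with a short sketch. It assumes a hypothetical 10-question test for each person and uses the mean squared error between stated confidence and the 0/1 outcome (lower is better); the function name `brier_score` and the data are made up for illustration.

```python
def brier_score(confidence, outcomes):
    """Mean squared error between a fixed stated confidence and 0/1 outcomes.

    Lower is better. `outcomes` holds 1 for a correct answer, 0 for a wrong one.
    """
    return sum((confidence - o) ** 2 for o in outcomes) / len(outcomes)


# Hypothetical data: each person answers 10 questions.
always_70 = [1] * 7 + [0] * 3  # says 70% every time, gets 7 of 10 right
always_60 = [1] * 6 + [0] * 4  # says 60% every time, gets 6 of 10 right

score_70 = brier_score(0.7, always_70)  # approximately 0.21
score_60 = brier_score(0.6, always_60)  # approximately 0.24

print(score_70, score_60)
```

Both people are perfectly calibrated on this data, yet the one who gets more questions right ends up with the lower squared error, which is the mixing of knowledge and calibration described above.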
The only pure calibration estimates I’ve ever seen are calibration curves in the form of a set of ordered pairs, or those limited to a specific point on the curve (e.g., “if ey says ey’s 90% sure, ey’s only right 60% of the time”). There should be a way to take the area under (or over) the curve to get a single value representing total calibration, but I’m not familiar with the method or whether it’s been done before. Is there an accepted way to get single-number calibration scores separate from domain knowledge?
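To make the "area between the curve and the diagonal" idea concrete, here is a minimal sketch of one possible single-number summary, not necessarily the accepted one: group answers by stated confidence, compare each group's stated confidence with its observed hit rate, and average the absolute gaps weighted by group size. A quantity along these lines is sometimes called expected calibration error. The function name `calibration_gap` and all data are hypothetical, and grouping by exact confidence value is a simplifying assumption; real data would probably need binning.

```python
from collections import defaultdict


def calibration_gap(forecasts):
    """Weighted mean of |stated confidence - observed hit rate|.

    `forecasts` is a list of (stated_confidence, outcome) pairs, where
    outcome is 1 for a correct answer and 0 for a wrong one.
    Assumption: answers are grouped by exact stated confidence.
    """
    buckets = defaultdict(list)
    for confidence, outcome in forecasts:
        buckets[confidence].append(outcome)

    total = len(forecasts)
    gap = 0.0
    for confidence, outcomes in buckets.items():
        hit_rate = sum(outcomes) / len(outcomes)
        gap += (len(outcomes) / total) * abs(confidence - hit_rate)
    return gap


# Hypothetical data, 10 questions per person.
always_70 = [(0.7, 1)] * 7 + [(0.7, 0)] * 3      # calibrated at 70%
always_60 = [(0.6, 1)] * 6 + [(0.6, 0)] * 4      # calibrated at 60%
overconfident = [(0.9, 1)] * 6 + [(0.9, 0)] * 4  # says 90%, right 60% of the time

gap_70 = calibration_gap(always_70)
gap_60 = calibration_gap(always_60)
gap_over = calibration_gap(overconfident)  # approximately 0.3

print(gap_70, gap_60, gap_over)
```

On this data the two calibrated people get the same gap of zero regardless of how many questions they got right, while the overconfident person is penalized. One caveat of this kind of measure is that it says nothing about how informative the stated confidences are.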