You could check what proportion of the time they are right, calculate what their log score would have been if they had used this as their confidence for every prediction, and compare this to the score they actually got.
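A rough sketch of that comparison, in case it helps. It assumes the log score is the sum of the natural log of the probability assigned to what actually happened, and that every stated confidence is strictly between 0 and 1. The function names and the six-prediction record are made up for illustration.

```python
import math

def log_score(confidences, outcomes):
    """Total log score: ln(p) for each correct guess, ln(1 - p) for each wrong one."""
    return sum(math.log(p if correct else 1.0 - p)
               for p, correct in zip(confidences, outcomes))

def flat_baseline_score(outcomes):
    """Log score if every prediction had used the overall hit rate as its confidence."""
    hit_rate = sum(outcomes) / len(outcomes)
    return log_score([hit_rate] * len(outcomes), outcomes)

# Hypothetical record: stated confidence and whether each guess was right.
confidences = [0.9, 0.9, 0.9, 0.6, 0.6, 0.6]
outcomes = [True, True, True, True, False, False]

actual = log_score(confidences, outcomes)
baseline = flat_baseline_score(outcomes)
print(f"actual: {actual:.3f}  flat baseline: {baseline:.3f}")
```

If the actual score comes out above the baseline, the person's varying confidences carried real information; if it comes out below, they would have done better just quoting their overall hit rate.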
Someone who is perfectly calibrated and doesn’t always give the same confidence will have a better log score than someone who makes the same series of guesses but uses the mean accuracy as the confidence for every one of them. So the latter can’t be used as a gold standard.
That’s actually intentional. I think that if someone is right 90% of the time in some subjects but only right 60% of the time in others, they are better calibrated if they give the appropriate estimate for each subject than if they just give 75% for everything.
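To put numbers on the 90% / 60% / 75% case, here is a small sketch of the expected log score per prediction. It assumes half the questions fall in each kind of subject (which is what makes 75% the overall accuracy) and uses natural logs; the helper name is hypothetical.

```python
import math

def expected_log_score(stated, true_rate):
    """Expected per-prediction log score when you say `stated` and are right `true_rate` of the time."""
    return true_rate * math.log(stated) + (1 - true_rate) * math.log(1 - stated)

# Assumption: questions are split evenly between the 90% subjects and the 60% subjects.
per_subject = 0.5 * expected_log_score(0.9, 0.9) + 0.5 * expected_log_score(0.6, 0.6)

# Saying 75% everywhere: right 90% of the time on one half, 60% on the other.
flat = 0.5 * expected_log_score(0.75, 0.9) + 0.5 * expected_log_score(0.75, 0.6)

print(f"per-subject estimates: {per_subject:.3f}")  # about -0.499
print(f"flat 75%:              {flat:.3f}")         # about -0.562
```

Under those assumptions the per-subject estimates score better (closer to zero) than the flat 75%, even though both are calibrated on average.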