This notion of calibratedness seems to have bad properties to me. Consider a situation where I’m trying to guess a distribution for the outcomes of a coin flip with a coin which, my information tells me, lands “heads” 99% of the time. Then a guess of 50% and 50% is “calibrated”, because exactly half of the “50%” predictions I make come out right. But a guess of 49.9% heads and 50.1% tails is horribly calibrated: the “49.9%” predictions come out 99% correct, and the “50.1%” predictions come out 1% correct. So the concept, as defined like this, seems hypersensitive, and therefore not very useful. I think a proper definition must necessarily be in terms of relative entropy, or perhaps in terms of Bayesian posteriors computed from subsets of your information, but I still don’t see how it should work. Sorry if someone already gave a robust definition that I missed.
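To make the hypersensitivity concrete, here is a small simulation sketch (Python; the bucketing scheme and the calibration_table helper are just my illustration of the check described above, not anyone’s official definition):

```python
import random
from collections import defaultdict

random.seed(0)
P_HEADS = 0.99      # the coin my information describes
N_FLIPS = 100_000

def calibration_table(p_heads, p_tails):
    """For each stated probability, report how often the predicted
    outcome actually occurred among predictions given that probability."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _ in range(N_FLIPS):
        outcome = "heads" if random.random() < P_HEADS else "tails"
        # Each flip yields two predictions: "heads" at p_heads, "tails" at p_tails.
        for event, stated_p in (("heads", p_heads), ("tails", p_tails)):
            totals[stated_p] += 1
            hits[stated_p] += (event == outcome)
    return {p: hits[p] / totals[p] for p in totals}

print(calibration_table(0.5, 0.5))      # {0.5: 0.5}  -- looks perfectly "calibrated"
print(calibration_table(0.499, 0.501))  # {0.499: ~0.99, 0.501: ~0.01} -- wildly off
```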
Nick: If you don’t mean expected log probability, then I don’t know what you’re talking about. And if you do, it seems to me that you’re saying that well-calibratedness means that the relative entropy of the “correct” distribution relative to yours is equal to your entropy. But then the uniform prior doesn’t seem well-calibrated; again, consider a coin that lands “heads” 99% of the time. Then your entropy is 1 bit, while the relative entropy of the “correct” distribution is (-log(99%) - log(1%))/2 ≈ 3.3 bits, which is > 2.
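Just to check the arithmetic (a quick Python sketch; logs are base 2 so everything is in bits):

```python
from math import log2

# My guess: the uniform prior over {heads, tails}.
q = {"heads": 0.5, "tails": 0.5}

# "Your entropy": the entropy of the uniform guess.
entropy_q = -sum(p * log2(p) for p in q.values())

# The quantity quoted above: (-log(99%) - log(1%)) / 2.
quoted = (-log2(0.99) - log2(0.01)) / 2

print(entropy_q)  # 1.0
print(quoted)     # ~3.33, which is indeed > 2
```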
Could you give a more precise definition of “calibrated”? Your example of 1⁄37 for each of 37 different possibilities, justified by saying that indeed one of the 37 will happen, seems facile. Do you mean that the “correct” distribution, relative to your guess, has low relative entropy?
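For concreteness, this is the quantity I have in mind, applied to the 99%-heads coin example above (a Python sketch; D(P||Q) here is the standard relative entropy, and the particular numbers are only illustrative):

```python
from math import log2

def rel_entropy(p, q):
    """Relative entropy D(P || Q) in bits: the expected extra surprise
    from using the guess Q when P is actually correct."""
    return sum(p[x] * log2(p[x] / q[x]) for x in p)

correct = {"heads": 0.99, "tails": 0.01}

print(rel_entropy(correct, {"heads": 0.5,   "tails": 0.5}))    # ~0.92 bits
print(rel_entropy(correct, {"heads": 0.499, "tails": 0.501}))  # ~0.92 bits
```

Unlike the bucket-counting test, the two nearly identical guesses get nearly identical scores here, which is the kind of robustness I’d expect from a usable definition.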