Anon: no, I mean the log probability. In your example, the calibratedness will generally be high: - \log 0.499 - H(p) ~= 0.00289 each time you see tails, and - \log 0.501 - H(p) ~= -0.00289 each time you come up heads. It’s continuous.
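To check those numbers concretely, here is a minimal sketch in Python (assuming base-2 logarithms, which is what the quoted ~0.00289 corresponds to):

```python
from math import log2

# Near-fair coin: p(heads) = 0.501, p(tails) = 0.499.
p = {"heads": 0.501, "tails": 0.499}

# Entropy H(p) = -sum_x p(x) log p(x), here in bits.
H = -sum(px * log2(px) for px in p.values())

# Calibration on observing x0: C(x0) = -log p(x0) - H(p).
for x0, px0 in p.items():
    print(f"{x0}: {-log2(px0) - H:+.5f}")
# heads: -0.00288  (slightly underconfident)
# tails: +0.00289  (slightly overconfident)
```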
Let’s be specific. We have H(p) = - \sum_x p(x) \log p(x), where p is some probability distribution over a finite set. If we observe x0, we say the predictor’s calibration is
C(x0) = \sum_x p(x) \log p(x) - \log p(x0) = - \log p(x0) - H(p)
so the expected calibration is 0 by the definition of H(p). The calibration is continuous in p. If \log p(x0) is higher than the expected value of \log p(x) then we are underconfident and C(x0) < 0; if \log p(x0) is lower than expected we are overconfident, and C(x0) > 0.
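A quick numerical sanity check of the zero-mean and sign claims (a sketch using an arbitrary five-outcome distribution; any finite p works the same way):

```python
import random
from math import log2

# Random distribution p over a finite set of 5 outcomes.
w = [random.random() for _ in range(5)]
p = [wi / sum(w) for wi in w]

H = -sum(px * log2(px) for px in p)        # entropy H(p)
C = [-log2(px) - H for px in p]            # calibration C(x) per outcome

# E[C] = sum_x p(x) * C(x) = H(p) - H(p) = 0 (up to rounding).
print(sum(px * cx for px, cx in zip(p, C)))

# Outcomes with log p(x) above its expected value -H(p) get C < 0.
for px, cx in zip(p, C):
    assert (cx < 0) == (log2(px) > -H)
```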
With q(x) = p(x) \delta(x, x0) the non-normalised distribution that puts mass only on x0, we have
C = D(p||q)
so this is a relative entropy of sorts.
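To spell out the algebra (a sketch: it uses \sum_x p(x) = 1, and reads the denominator of the relative-entropy formula as the constant value q(x0) = p(x0) throughout):

C(x0) = \sum_x p(x) \log p(x) - \sum_x p(x) \log p(x0) = \sum_x p(x) \log ( p(x) / p(x0) )

which has the shape of \sum_x p(x) \log ( p(x) / q(x) ) = D(p||q), hence “of sorts”: the literal D(p||q) would diverge where q vanishes.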