Thanks, I really enjoyed this post—this was a novel but persuasive argument for not using binary predictions, and I now feel excited to try it out!
One quibble—When you discuss calculating your calibration, doesn’t this implicitly assume that your mean was accurate? If my mean is very off but my standard deviation is correct, then this method says my standard deviation is way too low. But maybe this is fine because if I have a history of getting the mean wrong I should have a wider distribution?
I’m not sure how to think about the proposed calibration rule—it’s a heuristic to approximate something, but I’m not sure what. Probably the right thing to approximate is “how much money could someone make by betting against me,” which is proportional to something like the KL divergence or the earth mover’s distance, depending on your model of what bets are available.
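As a concrete reference point (this sketch is mine, not the commenter’s, and the betting interpretation is only a heuristic), the KL divergence between two normals has a simple closed form, so “how much money could someone make” is easy to compute under that model:

import math

def kl_normal(mu1, sigma1, mu2, sigma2):
    # KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), in nats
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

# E.g. a forecast of N(0, 1) when the truth is N(2, 1) leaves 2 nats on the table
print(kl_normal(2, 1, 0, 1))  # 2.0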
Anyway, if you’re quite confident about the wrong value, somebody can take a bunch of money from you. You can think of yourself as having overestimated the amount of knowledge you had, and as you suggested, you can think of the correction as uniformly decreasing your knowledge estimate going forward.
Talking about the stdev being “correct” is perfectly sensible if the ground truth is actually normally distributed, but makes less sense as the distribution becomes less normal.
I agree, most things are not normally distributed, and my calibration rule answers how to rescale to a normal. Metaculus uses the CDF of the predicted distribution, which is better if you have lots of predictions. My scheme gives an actionable number faster, by making assumptions that are wrong, but if, like me, you have intervals that seem off by almost a factor of 2, then your problem is not the tails but the entire region :), so the trade-off seems worth it.
You keep claiming this, but I don’t understand why you believe it.
Yes. You can change future μ by being smarter and future σ by being better calibrated; my rule assumes you don’t get smarter and therefore only have to adjust future σ.
If you actually get better at predicting, you could argue you would need to update σ less than the RMSE estimate suggests :)
That’s also how I conceptualize it: you have to change your intervals because you are too stupid to make better predictions. If the predictions were always spot on, then σ should be 0, and my scheme would not make sense.
If you suck like me and get a prediction very close, then I would probably say: that sometimes happens :) Note that I assume the average squared error should be 1, which means most errors are less than 1, because (0² + 2²)/2 = 2 > 1.
I assume you’re making some unspoken assumptions here, because (0² + 2²)/2 = 2 > 1 is not enough to say that. A naive application of Chebyshev’s inequality would just say that E(X²) = 1, E(X) = 0 ⇒ P(|X| ≥ 1) ≤ 1.
To be more concrete: if you were very weird and always ended up forecasting either 0.5 s.d. or 1.1 s.d. away (still with mean 0 and average squared error 1), then you’d find “most” errors are more than 1.
I am making the simple observation that the median error is less than one because the mean squared error is one.
That isn’t a “simple” observation.
Consider an error which is 0.5 22% of the time and 1.1 78% of the time. The squared errors are 0.25 and 1.21. The median error is 1.1 > 1. (The mean squared error is 1.)
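A quick check of that arithmetic (my sketch):

# Mean squared error of the 22%/78% mixture above
print(0.22 * 0.25 + 0.78 * 1.21)  # 0.9988, i.e. ~1, while the median error is 1.1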
Yes, you are right, but under the assumption that the errors are normally distributed, I am right:
If:
p ∼ Bern(0.78)
σ = p·N(0, 1.1) + (1 − p)·N(0, 0.5)
Then the median of σ² ≈ 0.37, which is much less than 1.
proof:
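A minimal simulation check (my sketch, assuming 1.1 and 0.5 are standard deviations):

import numpy as np

# Sample the error mixture: N(0, 1.1) with prob 0.78, N(0, 0.5) otherwise
rng = np.random.default_rng(0)
n = 1_000_000
wide = rng.random(n) < 0.78
errors = np.where(wide, rng.normal(0.0, 1.1, n), rng.normal(0.0, 0.5, n))
print(np.median(errors ** 2))  # ~0.37: the median squared error is well below 1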
Under what assumption?
1/ You aren’t “[assuming] the errors are normally distributed” in what you’ve written above, since a mixture of two normals isn’t normal.
2/ If your assumption is X ∼ N(0, 1) then yes, I agree the median of X² is ~0.45 (although

from scipy import stats
stats.chi2.ppf(.5, df=1)
>>> 0.454936
would have been an easier way to illustrate your point). I think this is actually the assumption you’re making. [Which is a horrible assumption, because if it were true, you would already be perfectly calibrated].
3/ I guess your new claim is “[assuming] the errors are a mixture of normal distributions, centered at 0”, which, okay, fine, that’s probably true; I don’t care enough to check, because it seems a bad assumption to make.
More importantly, there’s a more fundamental problem with your post. You can’t just take some numbers from my post and then put them in a different model and think that’s in some sense equivalent. It’s quite frankly bizarre. The equivalent model would be something like:
p ∼ Bern(0.78)
σ ∼ p·N(1.1, ε) + (1 − p)·N(0.5, ε)
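A sketch of sampling from that model, with a small ε picked arbitrarily (my illustration); as ε → 0 it reproduces the counterexample above:

import numpy as np

rng = np.random.default_rng(0)
n, eps = 1_000_000, 0.01
# sigma itself is random: near 1.1 with prob 0.78, near 0.5 with prob 0.22
p = rng.random(n) < 0.78
sigma = np.where(p, rng.normal(1.1, eps, n), rng.normal(0.5, eps, n))
print(np.mean(sigma ** 2))  # ~1.0: the mean squared error is about 1
print(np.median(sigma))     # ~1.1: yet the median error is above 1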
Our ability to talk past each other is impressive :)
Yes, this is almost the assumption I am making. The general point of the post is to assume that all your predictions follow a normal distribution, with μ as “guessed” and with a σ that is different from what you guessed, and then use X² to get a point estimate for the counterfactual σ you should have used. And as you point out, if the (counterfactual) σ = 1, then the point estimate suggests you are well calibrated.
In the post, the counterfactual σ is σ̂_z.
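For concreteness, here is how I read that point estimate (a sketch under the stated assumptions; the function name and example data are mine):

import numpy as np

def sigma_hat_z(mu, sigma, outcomes):
    # Assuming outcomes ~ N(mu, true_sigma) with mu as guessed, the RMSE of
    # the z-scores is a point estimate of how much to rescale future sigmas.
    z = (np.asarray(outcomes) - np.asarray(mu)) / np.asarray(sigma)
    return np.sqrt(np.mean(z ** 2))

# Intervals off by almost a factor of 2: rescale future sigmas by ~1.8
print(sigma_hat_z(mu=[10, 20, 30], sigma=[1, 1, 1], outcomes=[12, 18, 31.5]))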