Thanks, I really enjoyed this post—this was a novel but persuasive argument for not using binary predictions, and I now feel excited to try it out!
One quibble—When you discuss calculating your calibration, doesn’t this implicitly assume that your mean was accurate? If my mean is very off but my standard deviation is correct, then this method says my standard deviation is way too low. But maybe this is fine because if I have a history of getting the mean wrong I should have a wider distribution?
I’m not sure how to think about the proposed calibration rule—it’s a heuristic to approximate something, but I’m not sure what. Probably the right thing to approximate is “how much money could someone make by betting against me,” which is proportional to something like the KL divergence or the earth mover’s distance, depending on your model of what bets are available.
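As a concrete reference point (this sketch is mine, not the commenter’s, and the betting interpretation is only a heuristic), the KL divergence between two normals has a simple closed form, so “how much money could someone make” is easy to compute under that model:

import math

def kl_normal(mu1, sigma1, mu2, sigma2):
    # KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), in nats
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

# E.g. a forecast of N(0, 1) when the truth is N(2, 1) leaves 2 nats on the table
print(kl_normal(2, 1, 0, 1))  # 2.0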
Anyway, if you’re quite confident about the wrong value, somebody can take a bunch of money from you. You can think of yourself as having overestimated the amount of knowledge you had, and as you suggested, you can think of the correction as uniformly decreasing your knowledge estimate going forward.
Talking about the stdev being “correct” is perfectly sensible if the ground truth is actually normally distributed, but makes less sense as the distribution becomes less normal.
I agree, most things are not normally distributed, and my calibration rule answers how to rescale to a normal. Metaculus uses the CDF of the predicted distribution, which is better if you have lots of predictions. My scheme gives an actionable number faster, by making assumptions that are wrong, but if, like me, you have intervals that seem off by almost a factor of 2, then your problem is not the tails but the entire region :), so the trade-off seems worth it.
You keep claiming this, but I don’t understand why you believe it.
Yes. You can change future μ by being smarter and future σ by being better calibrated; my rule assumes you don’t get smarter and therefore only have to adjust future σ.
If you actually get better at predicting, you could argue you would need to update σ less than the RMSE estimate suggests :)
That’s also how I conceptualize it: you have to change your intervals because you are too stupid to make better predictions. If the predictions were always spot on, then σ should be 0, and my scheme would not make sense.
If you suck like me and get a prediction very close, then I would probably say: that sometimes happens :) Note that I assume the average squared error should be 1, which means most errors are less than 1, because (0² + 2²)/2 = 2 > 1.
I assume you’re making some unspoken assumptions here, because (0² + 2²)/2 = 2 > 1 is not enough to say that. A naive application of Chebyshev’s inequality would just say that E(X²) = 1, E(X) = 0 ⇒ P(|X| ≥ 1) ≤ 1.
To be more concrete: if you were very weird and always ended up forecasting either 0.5 s.d. or 1.1 s.d. away (still with mean 0 and average squared error 1), then you’d find “most” errors are more than 1.
I am making the simple observation that the median error is less than one because the mean squared error is one.
That isn’t a “simple” observation.
Consider an error which is 0.5 22% of the time and 1.1 78% of the time. The squared errors are 0.25 and 1.21. The median error is 1.1 > 1. (The mean squared error is 1.)
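A quick check of that arithmetic (my sketch):

# Mean squared error of the 22%/78% mixture above
print(0.22 * 0.25 + 0.78 * 1.21)  # 0.9988, i.e. ~1, while the median error is 1.1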
Yes, you are right, but under the assumption that the errors are normally distributed, I am right:
If:
p ∼ Bern(0.78)
σ = p·N(0, 1.1) + (1 − p)·N(0, 0.5)
Then the median of σ² ≈ 0.37, which is much less than 1.
proof:
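A minimal simulation check (my sketch, assuming 1.1 and 0.5 are standard deviations):

import numpy as np

# Sample the error mixture: N(0, 1.1) with prob 0.78, N(0, 0.5) otherwise
rng = np.random.default_rng(0)
n = 1_000_000
wide = rng.random(n) < 0.78
errors = np.where(wide, rng.normal(0.0, 1.1, n), rng.normal(0.0, 0.5, n))
print(np.median(errors ** 2))  # ~0.37: the median squared error is well below 1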
Under what assumption?
1/ You aren’t “[assuming] the errors are normally distributed” in what you’ve written above, since a mixture of two normals isn’t normal.
2/ If your assumption is X ∼ N(0, 1) then yes, I agree the median of X² is ~0.45 (although

from scipy import stats
stats.chi2.ppf(.5, df=1)
>>> 0.454936
would have been an easier way to illustrate your point). I think this is actually the assumption you’re making. [Which is a horrible assumption, because if it were true, you would already be perfectly calibrated].
3/ I guess your new claim is “[assuming] the errors are a mixture of normal distributions, centered at 0”, which, okay, fine, that’s probably true; I don’t care enough to check, because it seems a bad assumption to make.
More importantly, there’s a more fundamental problem with your post. You can’t just take some numbers from my post and then put them in a different model and think that’s in some sense equivalent. It’s quite frankly bizarre. The equivalent model would be something like:
p ∼ Bern(0.78)
σ ∼ p·N(1.1, ε) + (1 − p)·N(0.5, ε)
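A sketch of sampling from that model, with a small ε picked arbitrarily (my illustration); as ε → 0 it reproduces the counterexample above:

import numpy as np

rng = np.random.default_rng(0)
n, eps = 1_000_000, 0.01
# sigma itself is random: near 1.1 with prob 0.78, near 0.5 with prob 0.22
p = rng.random(n) < 0.78
sigma = np.where(p, rng.normal(1.1, eps, n), rng.normal(0.5, eps, n))
print(np.mean(sigma ** 2))  # ~1.0: the mean squared error is about 1
print(np.median(sigma))     # ~1.1: yet the median error is above 1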
Our ability to talk past each other is impressive :)
Yes, this is almost the assumption I am making. The general point of the post is to assume that all your predictions follow a normal distribution, with μ as “guessed” and with a σ that is different from what you guessed, and then use X² to get a point estimate for the counterfactual σ you should have used. And as you point out, if the (counterfactual) σ = 1, then the point estimate suggests you are well calibrated.
In the post, the counterfactual σ is σ̂_z.
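For concreteness, here is how I read that point estimate (a sketch under the stated assumptions; the function name and example data are mine):

import numpy as np

def sigma_hat_z(mu, sigma, outcomes):
    # Assuming outcomes ~ N(mu, true_sigma) with mu as guessed, the RMSE of
    # the z-scores is a point estimate of how much to rescale future sigmas.
    z = (np.asarray(outcomes) - np.asarray(mu)) / np.asarray(sigma)
    return np.sqrt(np.mean(z ** 2))

# Intervals off by almost a factor of 2: rescale future sigmas by ~1.8
print(sigma_hat_z(mu=[10, 20, 30], sigma=[1, 1, 1], outcomes=[12, 18, 31.5]))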