I’m not sure I’m convinced by what you say about 50% predictions. Rather than making an explicit counterargument, I will appeal to one of the desirable features you mentioned in your blog post, namely continuity. If your policy is to treat all 50% predictions as being half “right” and half “wrong” (or any other special treatment of this sort) then you introduce a discontinuity as the probability attached to those predictions changes from 0.5 to, say, 0.50001.
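The discontinuity is easy to exhibit concretely. Here is a minimal sketch (my own toy scoring function, not anything from the post) of the special-treatment policy described above:

```python
def counted_correct(p, outcome):
    """Fraction of a prediction counted as 'right' under a policy that
    treats exactly-50% predictions as half right and half wrong.
    p: stated probability of the event; outcome: True/False."""
    if p == 0.5:
        return 0.5  # special case: always counted half right
    predicted_event = p > 0.5
    return 1.0 if predicted_event == outcome else 0.0

# A prediction that turns out wrong: at p = 0.5 it still counts for 0.5,
# but at p = 0.50001 it abruptly counts for 0.0.
print(counted_correct(0.5, False))      # 0.5
print(counted_correct(0.50001, False))  # 0.0
```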
> the 70% range does not appear very special. It looks more like the whole area around 70% to 90% data points has a slight problem
Interesting observation. I’m torn between saying “no, 70% really is special and your graph misleads by drawing attention to the separation between the lines rather than how fast it’s increasing” and saying “yes, you’re right, 70% isn’t special, and the traditional plot misleads by focusing attention on single probabilities”. I think adjudicating between the two comes down to how “smoothly” it’s reasonable to expect calibration errors to vary: is it really plausible that Scott has some sort of weird miscalibration blind spot at 70%, or not? If we had an agreed answer to that question, actually quantifying our expectations of smoothness, then we could use it to draw a smooth estimated-calibration curve, and it seems to me that that would actually be a better solution to the problem.
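One way to make that smoothness expectation operational (a sketch of my own, not something proposed in the thread) is to kernel-smooth the observed hit rates across confidence levels, with the bandwidth encoding how quickly we believe calibration can vary:

```python
import math

def smoothed_calibration(preds, outcomes, bandwidth=0.1):
    """Kernel-smoothed estimate of P(correct | stated confidence).
    preds: stated probabilities; outcomes: 0/1 results.
    Returns a function mapping a confidence level to a smoothed hit rate.
    The Gaussian bandwidth encodes how 'smoothly' we expect
    miscalibration to vary across confidence levels."""
    def estimate(p):
        num = den = 0.0
        for q, y in zip(preds, outcomes):
            w = math.exp(-((p - q) / bandwidth) ** 2 / 2)
            num += w * y
            den += w
        return num / den if den else float("nan")
    return estimate

# Toy data: roughly calibrated at 60% and 80%, with a dip at 70%.
preds = [0.6] * 10 + [0.7] * 10 + [0.8] * 10
outcomes = [1] * 6 + [0] * 4 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
curve = smoothed_calibration(preds, outcomes)
# The smoothed estimate at 0.7 is pulled toward its neighbours.
```

A narrow bandwidth lets a genuine 70% blind spot show through, while a wide one averages it away, so the choice of bandwidth is exactly the disputed smoothness assumption.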
Relatedly: your plot doesn’t (I think) distinguish very clearly between more predictions at a given level of miscalibration, and more miscalibration there. Hmm, perhaps if we used a logarithmic scale for the vertical axis or something it would help with that, but I think that would make it more confusing.
> I’m not sure I’m convinced by what you say about 50% predictions. Rather than making an explicit counterargument, I will appeal to one of the desirable features you mentioned in your blog post, namely continuity. If your policy is to treat all 50% predictions as being half “right” and half “wrong” (or any other special treatment of this sort) then you introduce a discontinuity as the probability attached to those predictions changes from 0.5 to, say, 0.50001.
This is true, but it is also true that the discontinuity is always present somewhere around 0.5 regardless of which method you use (you’ll see it at some point when changing from 0.49999 to 0.50001).
To really get rid of this problem, you’d need a cleverer trick: e.g. you could draw a single unified curve on the full interval from 0 to 1, and then draw another version of it that is its point reflection through the point (0.5, 0.5).
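A sketch of the reflection trick (hypothetical data; here a calibration curve is taken to be a list of (stated probability, observed frequency) points):

```python
def point_reflect(curve_points):
    """Point-reflect a calibration curve through (0.5, 0.5): each point
    (p, f) maps to (1 - p, 1 - f).  Predicting an event at probability p
    is the same claim as predicting its negation at 1 - p, so the
    reflected curve carries the same information, and the pair joins up
    continuously at p = 0.5 rather than needing a special case there."""
    return [(1 - p, 1 - f) for (p, f) in curve_points]

curve = [(0.6, 0.55), (0.7, 0.62), (0.9, 0.85)]
mirror = point_reflect(curve)  # e.g. (0.6, 0.55) maps to (0.4, 0.45)
```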
> is it really plausible that Scott has some sort of weird miscalibration blind spot at 70%, or not?
Yes, this appears to be the crux here. My intuitive prior is against this “single blind spot” theory, but I don’t have any evidence beyond Occam’s razor and what I tend to observe in the statistics of my own prediction results.
> Relatedly: your plot doesn’t (I think) distinguish very clearly between more predictions at a given level of miscalibration, and more miscalibration there.
I’m not sure what exactly you think it doesn’t distinguish. The same proportional difference, but with more predictions, is in fact more evidence for miscalibration (which is what my graph shows).
Yes, but it’s not evidence for more miscalibration, and I think “how miscalibrated?” is usually at least as important a question as “how sure are we of how miscalibrated?”.
Sure. So “how miscalibrated” is simply the proportional difference between the values of the two curves. I.e. if you adjust the scales of graphs to make them the same size, it’s simply how far apart they appear to be visually.
> adjust the scales of graphs to make them the same size
Note that if you have substantially different numbers of predictions at different confidence levels, you will need to do this adjustment within a single graph. That was the point of my remark about maybe using a logarithmic scale on the y-axis. But I still think that would be confusing.
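The two quantities under discussion can be separated explicitly. A sketch with toy numbers and the standard binomial error formula: the estimated *amount* of miscalibration at a confidence level is the gap between observed hit rate and stated probability, while the *evidence* for it is reflected in the standard error, which shrinks as the number of predictions grows:

```python
import math

def bin_summary(p, n_right, n_total):
    """For one confidence level p: the miscalibration estimate
    (observed hit rate minus p) and the binomial standard error of
    that estimate.  More predictions shrink the standard error --
    more *evidence* -- without changing the estimated *amount* of
    miscalibration."""
    rate = n_right / n_total
    stderr = math.sqrt(rate * (1 - rate) / n_total)
    return rate - p, stderr

# Same observed gap (60% hits at stated 70%), different sample sizes:
gap_small, se_small = bin_summary(0.7, 6, 10)
gap_big, se_big = bin_summary(0.7, 60, 100)
```

The ten-point gap at 70% is the same amount of miscalibration in both cases; only our certainty about it differs.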
> it is also true that the discontinuity is always present somewhere around 0.5 regardless of which method you use
No, you don’t always have a discontinuity. You have to throw out predictions at exactly 0.5, but this can be a consequence of a treatment that is continuous as a function of p: you could simply weight predictions so that those close to 0.5 count less. I don’t know whether that is reasonable for your approach, but similar things are forced upon us anyway. For example, if you want to know whether you are overconfident at 0.5 + ε, you need on the order of 1/ε² predictions, since sampling noise shrinks only as the square root of the sample size. It is not just that calibration is impossible to discern at 0.5; it is also difficult to discern near 0.5.
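A minimal sketch of such a continuous weighting (one possible choice of weight function on my part; the thread leaves the actual scheme open):

```python
def weight(p):
    """Weight a prediction by its distance from 50%.  The weight is
    continuous in p and vanishes at p = 0.5, so exactly-50% predictions
    drop out without any discontinuous special case."""
    return abs(2 * p - 1)

print(weight(0.5))      # 0.0
print(weight(0.50001))  # ~0.00002: shrinks smoothly toward zero
print(weight(0.9))
```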
Yes, thank you; I was speaking about a narrower set of options (the ones we were considering).
I don’t currently have an elegant idea for how to do the weighting (though I suspect that, to fit in nicely, it would most likely be done by subtraction rather than multiplication).