A Proper Scoring Rule for Confidence Intervals
You probably already know that you can incentivize honest reporting of probabilities using a proper scoring rule like the log score, but did you know that you can also incentivize honest reporting of confidence intervals?
To incentivize reporting of a 90% confidence interval, take the score −S − 20⋅D, where S is the size of your confidence interval and D is the distance between the true value and the interval. D is 0 whenever the true value is in the interval.
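For concreteness, here's a minimal Python sketch of the rule (the function name and signature are mine, just for illustration):

```python
def interval_score(lower, upper, true_value):
    """Score a reported 90% confidence interval (lower, upper).
    Less negative is better: S penalizes wide intervals,
    and D penalizes intervals that miss the true value."""
    S = upper - lower              # size of the interval
    if true_value > upper:         # underestimate: truth lies above the interval
        D = true_value - upper
    elif true_value < lower:       # overestimate: truth lies below the interval
        D = lower - true_value
    else:                          # truth falls inside the interval
        D = 0.0
    return -S - 20 * D

interval_score(10, 100, 150)  # S = 90, D = 50, score = -1090
```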
This incentivizes not only giving an interval that contains the true value 90% of the time, but also distributing the remaining 10% of misses equally between overestimates and underestimates.
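A quick check of why that works (my derivation, not from the original): holding the lower bound fixed, the expected score as a function of the upper bound U is −(U − L) − 20⋅E[max(T − U, 0)] plus terms that don't involve U, so

\[ \frac{\partial}{\partial U}\,\mathbb{E}[\text{score}] = -1 + 20\,\Pr(T > U), \]

which is zero exactly when Pr(T > U) = 1/20 = 5%. The symmetric argument for L puts another 5% below the interval, so the honest report runs from the 5th to the 95th percentile of your distribution.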
To keep the lower bound of the interval important, I recommend measuring S and D in log space. So if the true value is T and the interval is (L, U), then S is log(U/L), and D is log(T/U) for underestimates and log(L/T) for overestimates. Of course, you need questions with positive answers to do this.
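Here is the same sketch in log space (again the names are mine; note the positivity requirement):

```python
import math

def log_interval_score(lower, upper, true_value):
    """Log-space variant of the 90% interval score.
    Measuring S and D as log-ratios makes the score unit-free,
    so a lower bound near zero is no longer nearly free.
    Assumes 0 < lower <= upper and true_value > 0."""
    S = math.log(upper / lower)           # S = log(U/L)
    if true_value > upper:                # underestimate
        D = math.log(true_value / upper)  # D = log(T/U)
    elif true_value < lower:              # overestimate
        D = math.log(lower / true_value)  # D = log(L/T)
    else:
        D = 0.0
    return -S - 20 * D
```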
To do a P% confidence interval, take the score −S − (200/(100−P))⋅D.
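The general rule drops straight into the same code (a hypothetical helper, following the formula above):

```python
def general_interval_score(lower, upper, true_value, p=90):
    """Score for a reported p% confidence interval: -S - (200 / (100 - p)) * D.
    With p = 90 the coefficient is 200 / 10 = 20, recovering the rule above."""
    S = upper - lower
    # Distance from the true value to the interval; 0 if it falls inside.
    D = max(lower - true_value, true_value - upper, 0.0)
    return -S - (200 / (100 - p)) * D
```

The coefficient 200/(100−P) is what splits the misses evenly: the same derivative argument as before gives Pr(T > U) = (100−P)/200, i.e. half of the remaining (100−P)% in each tail.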
This can be used to make calibration training, using something like Wits and Wagers cards, more fun. I also think it could be turned into an app, if one could get a large list of questions with numerical values.