This is an underappreciated fact! I like how simple the rule is when framed in terms of size and distance.

You mention both the linear and log rules. The log rule has the benefit of being scale-invariant, so your score isn’t affected by the units the answer is measured in, but it can’t deal with negatives and gets overly sensitive around zero. The linear rule doesn’t blow up around zero, is shift-invariant, and can handle negative values fine. The best generic scoring rule would have all these properties.

Turns out (based on Lambert and Shoham, “Eliciting truthful answers to multiple choice questions”) that all scoring rules for symmetric confidence intervals (a,b) with coverage probability 1−α can be represented (up to affine transformation) as

Sα(a,b,x) = g(b) − g(a) + (2/α)⋅(g(a) − g(x))⋅I(x<a) + (2/α)⋅(g(x) − g(b))⋅I(x>b)

where x is the true value, I is the indicator function, and g(⋅) is any increasing function. Unsurprisingly, the linear rule uses g(x)=x and the log rule uses g(x)=log(x). If we want scale-invariance on the whole real line, the first thing I’d be tempted to do is use log(x) for positive x and −log(|x|) for negative x, except for that pesky bit about going off to ±∞ around zero. Let’s paste in a linear portion around zero so the function is increasing everywhere:

g(x) = I(|x|≤10)⋅(x/10) + I(|x|>10)⋅sign(x)⋅log10(|x|)
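For anyone who wants to play with this, here is a minimal Python sketch of the piecewise g(⋅) and an interval score built on it. It assumes the penalty form "transformed width plus (2/α) times the transformed miss distance", so lower is better; that orientation, and the function names `g` and `interval_score`, are my own choices for illustration.

```python
import math


def g(x: float) -> float:
    """Piecewise transform: linear on [-10, 10], signed log10 outside.

    The two pieces meet at |x| = 10, where x/10 and log10(|x|) both equal 1
    in absolute value, so the function is continuous and increasing.
    """
    if abs(x) <= 10:
        return x / 10
    return math.copysign(math.log10(abs(x)), x)


def interval_score(a: float, b: float, x: float, alpha: float) -> float:
    """Score the interval (a, b) against the true value x.

    Assumed orientation: a penalty, so lower is better.
    alpha is one minus the stated coverage (alpha = 0.1 for a 90% interval).
    """
    if a > b:
        raise ValueError("expected a <= b")
    score = g(b) - g(a)  # size of the interval on the transformed scale
    if x < a:
        score += (2 / alpha) * (g(a) - g(x))  # missed below
    elif x > b:
        score += (2 / alpha) * (g(x) - g(b))  # missed above
    return score


if __name__ == "__main__":
    # A 90% interval of (1, 100): one hit and one miss by a factor of ten.
    print(interval_score(1, 100, 50, alpha=0.1))
    print(interval_score(1, 100, 1000, alpha=0.1))
```

Outside the linear zone, missing by a factor of ten costs the same whether the interval was (1, 100) or (100, 10000), which is the scale-invariance the log rule was after.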

Using this g(⋅), the score is sensitive to absolute values around zero and sensitive to relative values on both sides of it. Since the rule expects more accuracy around zero, the origin should vary depending on question domain. If the question is about dates, for example, accuracy should be highest around the present year and fall off going into the past or future, which suggests setting the origin at the present year. For temperatures, the origin should probably be room temperature. Are there any other standard domains that should have a non-zero origin? An alternate origin t can be added as a shift everywhere:

Sα(a,b,x;t) = g(b−t) − g(a−t) + (2/α)⋅(g(a−t) − g(x−t))⋅I(x<a) + (2/α)⋅(g(x−t) − g(b−t))⋅I(x>b)
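A rough sketch of the shifted version in Python, assuming the same penalty orientation (lower is better) and the piecewise g(⋅) from above. The name `shifted_interval_score` and the example origin of 2020 for date questions are hypothetical, picked only to make the example concrete.

```python
import math


def g(x: float) -> float:
    """Linear on [-10, 10], signed log10 outside (the g from the comment)."""
    if abs(x) <= 10:
        return x / 10
    return math.copysign(math.log10(abs(x)), x)


def shifted_interval_score(a: float, b: float, x: float,
                           alpha: float, t: float = 0.0) -> float:
    """Interval score with origin t: every argument of g is shifted by t.

    Assumed orientation: a penalty, so lower is better.
    """
    ga, gb, gx = g(a - t), g(b - t), g(x - t)
    score = gb - ga
    if x < a:
        score += (2 / alpha) * (ga - gx)
    elif x > b:
        score += (2 / alpha) * (gx - gb)
    return score


if __name__ == "__main__":
    origin = 2020  # hypothetical "present year" for date questions
    # Near the origin the score reacts to absolute errors in years...
    print(shifted_interval_score(2015, 2025, 2030, alpha=0.1, t=origin))
    # ...while far from it only the ratio of distances from the origin matters.
    print(shifted_interval_score(1900, 2000, 1950, alpha=0.1, t=origin))
```

The shift only moves where the linear zone sits. Scoring (a, b, x) with origin t is the same as scoring (a−t, b−t, x−t) with origin zero.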

Not something you’d want to calculate by hand, but if someone implements a calibration app, this gives more consistent scores. Going one step further, the scores could be made more interpretable by comparison to a perfectly calibrated reference score: 100+k⋅(Sα(a,b,x)−S∗α) where S∗α is the expected score for perfectly calibrated intervals if, say, x∼N(0,10) and k is a fixed value chosen to keep plausible scores mostly positive.
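Here is one way the reference score could be estimated, as a sketch with several assumptions: the score is in penalty form (lower is better), N(0,10) is read as standard deviation 10, "perfectly calibrated" means the central 1−α interval of that same distribution, and the expectation is approximated by Monte Carlo. The names `reference_score` and `rescaled_score` and the default k = −10 are hypothetical; k is negative here only so that a higher rescaled score means a better interval under the penalty orientation.

```python
import math
import random
from statistics import NormalDist


def g(x: float) -> float:
    """Linear on [-10, 10], signed log10 outside (the g from the comment)."""
    if abs(x) <= 10:
        return x / 10
    return math.copysign(math.log10(abs(x)), x)


def interval_score(a: float, b: float, x: float, alpha: float) -> float:
    """Penalty-form interval score (assumed orientation: lower is better)."""
    score = g(b) - g(a)
    if x < a:
        score += (2 / alpha) * (g(a) - g(x))
    elif x > b:
        score += (2 / alpha) * (g(x) - g(b))
    return score


def reference_score(alpha: float, sigma: float = 10.0,
                    n: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected score of the central
    (1 - alpha) interval when x ~ Normal(0, sigma)."""
    dist = NormalDist(0.0, sigma)
    a = dist.inv_cdf(alpha / 2)
    b = dist.inv_cdf(1 - alpha / 2)
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    total = sum(interval_score(a, b, rng.gauss(0.0, sigma), alpha)
                for _ in range(n))
    return total / n


def rescaled_score(a: float, b: float, x: float, alpha: float,
                   s_star: float, k: float = -10.0) -> float:
    """100 + k * (S - S*); an interval scoring exactly S* gets 100."""
    return 100 + k * (interval_score(a, b, x, alpha) - s_star)


if __name__ == "__main__":
    s_star = reference_score(alpha=0.1)
    print(rescaled_score(-5, 5, 0, alpha=0.1, s_star=s_star))
    print(rescaled_score(-5, 5, 50, alpha=0.1, s_star=s_star))
```

A numerical integral would give S∗α without sampling noise. The Monte Carlo version is just the shortest thing to write down.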
