The ideal thing is to judge Bob as if he were making the same prediction every day until he makes a new one, and log-score all of them when the event is revealed. (That is, if Bob says 75% on January 1st and 60% on February 1st, and then on March 1st the event is revealed to have happened, Bob’s score equals 31*log(.75) + 28*log(.6).) Then Bob’s best strategy is to update his prediction to his actual current estimate as often as possible; past predictions are sunk costs.
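A minimal sketch of that scoring rule, assuming predictions arrive as (date, probability) pairs and using the natural log; the function name and input format are illustrative, not from any standard library:

```python
from datetime import date
from math import log

def time_weighted_log_score(predictions, resolution_date, outcome):
    """Treat each stated probability as repeated daily until replaced,
    then log-score every day once the event resolves.

    predictions: list of (date, p) pairs sorted by date, where p is the
                 stated probability that the event happens.
    resolution_date: date the event is revealed.
    outcome: True if the event happened, False otherwise.
    """
    total = 0.0
    extended = predictions[1:] + [(resolution_date, None)]
    for (start, p), (end, _) in zip(predictions, extended):
        days = (end - start).days          # days this prediction was "live"
        prob_of_actual = p if outcome else 1 - p
        total += days * log(prob_of_actual)
    return total

# Bob's example: 75% on Jan 1, 60% on Feb 1, event revealed true on Mar 1.
score = time_weighted_log_score(
    [(date(2023, 1, 1), 0.75), (date(2023, 2, 1), 0.60)],
    date(2023, 3, 1),
    outcome=True,
)
# score == 31*log(0.75) + 28*log(0.60)
```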
The real-world version is remembering to dock people more for bad predictions the longer they persisted in them. But of course this is hard.
538 did do this with their self-evaluation, which is a good way to try to establish a norm in the domain of model-driven reporting.
Yes, that seems right, if it can be used as the sole criterion and be properly normalized for the time frames and questions involved. There are big second-level Goodhart traps lying in wait if people start to care about this metric.