Comparing groups of forecasters who worked on different question sets using only simple accuracy measures like Brier scores is basically not feasible. You’re right that forecasters can prioritize easier questions and use other hacks.
This post goes into detail on several incentive problems:
https://forum.effectivealtruism.org/posts/ztmBA8v6KvGChxw92/incentive-problems-with-current-forecasting-competitions
I don’t get the impression that platforms like Metaculus or GJP bias their question selection much to achieve better (lower) Brier scores. The difficulty of comparing raw scores is one reason why they typically focus more on their calibration graphs, and on direct comparisons between platforms on the same questions.
All that said, I definitely think we have a lot of room to get better at doing comparisons of forecasting between platforms.
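As a toy illustration of the question-selection point, here is a minimal sketch of why raw Brier scores can't be compared across different question sets. It assumes a perfectly calibrated forecaster and made-up difficulty levels (0.95 for an "easy" question, 0.60 for a "hard" one); the function names are hypothetical and not taken from any platform.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between probability forecasts and 0/1 outcomes.

    Lower is better: 0.0 is perfect, and always saying 50% scores 0.25.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def expected_brier_if_calibrated(p):
    """Expected Brier score for one question when the forecast p is the true probability.

    The event happens with probability p (error 1 - p) and fails with
    probability 1 - p (error p), which simplifies to p * (1 - p).
    """
    return p * (1 - p) ** 2 + (1 - p) * p ** 2


# Illustrative difficulty levels (assumptions, not real platform data).
easy = expected_brier_if_calibrated(0.95)
hard = expected_brier_if_calibrated(0.60)

print(f"calibrated forecaster, easy questions: {easy:.4f}")
print(f"calibrated forecaster, hard questions: {hard:.4f}")
```

Under these assumptions the same, equally calibrated forecaster gets about 0.05 on the easy set and 0.24 on the hard set, so the gap reflects the questions rather than skill.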
I’m less interested in comparing groups of forecasters with each other based on Brier scores than in getting a referendum on forecasting generally.
The forecasting industry has a collective interest in maintaining its reputation for predictive accuracy on general questions. I want to know whether forecasters are in fact accurate on general questions, or whether some of their apparent success rests on cunningly choosing which questions they address.