I have an idea along these lines: adversarial question-asking.
I have a big concern about various forms of forecasting calibration.
Each forecasting team establishes its reputation by showing that its predictions, in aggregate, are well-calibrated and accurate on average.
However, questions are typically posed by a questioner who’s part of the forecasting team. This creates an opportunity for them to ask a lot of softball questions that are easy for an informed forecaster to answer correctly, or at least to calibrate their confidence on.
By advertising their overall level of calibration and average accuracy, they can “dilute away” inaccuracies on hard problems that other people really care about. They gain a reputation for accuracy, yet somehow don’t seem so accurate when we pose a truly high-stakes question to them.
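To make the dilution effect concrete, here is a minimal sketch. It assumes the Brier score as the accuracy measure, and every forecast in it is invented for illustration rather than taken from any real team's record.

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# Hypothetical data: (probability assigned to "yes", actual outcome as 0 or 1).
hard = [(0.8, 0), (0.7, 0), (0.6, 1), (0.3, 1)]  # high-stakes questions, mostly missed
softball = [(0.95, 1)] * 16                      # easy questions, confidently right

hard_only = brier_score(hard)            # about 0.445, worse than always saying 50% (0.25)
blended = brier_score(hard + softball)   # about 0.091, which looks respectable

print(round(hard_only, 3), round(blended, 3))
```

In this toy case the team's record on the four hard questions is worse than a coin flip, yet sixteen softballs are enough to pull the headline number down to something that looks good.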
This problem could be at least partly solved by having an external, adversarial question-asker. Even better would be some sort of mechanical system for generating the questions that forecasters must answer.
For example, imagine that you had a way to extract every objectively answerable question posed by the New York Times in 2021.
Currently, their headline article is “Duty or Party? For Republicans, a Test of Whether to Enable Trump.”
Though it does not state this in so many words, one of the primary questions it raises is whether the Michigan board that certifies vote results will certify Biden’s victory ahead of the Electoral College vote on Dec. 14.
Imagine that one team’s job was to extract such questions from a newspaper. Then they randomly selected a certain number of them each day, and posed them to a team of forecasters.
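The random-selection step could be sketched roughly as follows. The function name, the seed handling, and the placeholder questions are all hypothetical. Only the Michigan question comes from the example above.

```python
import random

def sample_daily_questions(extracted, k, seed):
    """Draw k questions uniformly at random, without replacement, from the day's pool.

    Using a fixed, published seed is an assumption of this sketch: it would let
    outsiders re-run the draw and check that no questions were cherry-picked.
    """
    rng = random.Random(seed)
    return rng.sample(extracted, min(k, len(extracted)))

# Hypothetical pool produced by the extraction team for one day.
pool = [
    "Will the Michigan board certify Biden's victory before Dec. 14?",
    "Placeholder question 2",
    "Placeholder question 3",
    "Placeholder question 4",
]

todays_questions = sample_daily_questions(pool, k=2, seed=20201120)
```

The point of the design is that neither the extraction team nor the forecasters get to decide which questions count toward the track record.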
In this way, the work of superforecasters would be chained to the concerns of the public, rather than spent on questions that may or may not be “hackable.”
To me, this is a critically important and, to my knowledge, totally unexplored question that I would very much like to see treated.
Comparing groups of forecasters who worked on different question sets using only simple accuracy measures like Brier scores is basically not feasible. You’re right that forecasters can prioritize easier questions and do other hacks.
This post goes into detail on several incentive problems:
https://forum.effectivealtruism.org/posts/ztmBA8v6KvGChxw92/incentive-problems-with-current-forecasting-competitions
I don’t get the impression that platforms like Metaculus or GJP bias their questions much to achieve better Brier scores. This is one reason why they typically focus more on their calibration graphs, and on direct question comparisons between platforms.
All that said, I definitely think we have a lot of room to get better at doing comparisons of forecasting between platforms.
I’m less interested in comparing groups of forecasters with each other based on Brier scores than in getting a referendum on forecasting generally.
The forecasting industry has a collective interest in maintaining its reputation for predictive accuracy on general questions. I want to know whether forecasters are in fact accurate on general questions, or whether some of their apparent success rests on cunningly choosing which questions they address.