Resolutions to the Challenge of Resolving Forecasts
One of the biggest challenges in forecasting is formulating clear and resolvable questions, where the resolution reflects the intent of the question. When this doesn’t happen, there is often uncertainty about the way the question will be resolved, leading to uncertainty about what to predict. In this post, I want to discuss this problem and point out that a variety of methods are useful for resolving predictions.
But first, the problem.
What is the Problem?
The OpenPhil / Good Judgment COVID-19 dashboard provides an example. The goal was to predict the number of cases and deaths due to COVID. The text of the questions was “How many X will be reported/estimated as of 31 March 2021?” and the explanation clarified that “The outcome will be determined based on reporting and estimates provided by Johns Hopkins of X...”
Early on, the question was fairly clear: it was about what would happen. As time went on, however, it became clearer that because reporting was based on very limited testing, there would be a significant gap between the reported totals and the estimated totals. Discussions of what total to predict were then partly sidetracked by predictions of whether the reports or the estimates would be used, with a very large gap between the two.
Solving the Problem?
This is a very general problem for forecasting, and various paths towards solutions have been proposed. One key desideratum, however, is clear: whatever resolution criteria are used, they should be explicitly understood beforehand. Which approach to commit to, however, is still up for debate, and I’ll present various approaches in this post.
We can be inflexible with resolution criteria, always specifying exactly what number or fact will be used for the resolution and never changing that. To return to the example above, the COVID prediction could have been limited to, say, the final number displayed on the Johns Hopkins dashboard by the resolution date. If the dashboard is discontinued, the question would use the final number displayed before then, and if the dashboard is modified to no longer display a single number, say, by providing a range, the question would use the final number displayed before the change is introduced.
Of course, this means that even the smallest deviation from what you expected or planned for will lead to the question resolving in a way that does not represent the outcome of interest. Worse, the prediction is now in large part about whether something will trigger the criteria for effectively ending the question early. That means that the prediction is less ambiguous, but also less useful.
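The inflexible rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Snapshot` record and the `resolve_strict` function are hypothetical names, and a day on which the dashboard shows no single number (for example, a range) is represented by `value=None`.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    """One day's dashboard reading (hypothetical record layout).

    `value` is None when the dashboard showed no single number that day.
    """
    day: date
    value: Optional[float]


def resolve_strict(snapshots: Sequence[Snapshot], resolution_date: date) -> Optional[float]:
    """Return the last single number displayed on or before the resolution date.

    The scan stops at the first day without a single number, so anything
    displayed after a format change is ignored. A discontinued dashboard
    simply has no later snapshots. Returns None if nothing was ever shown.
    """
    last = None
    for snap in sorted(snapshots, key=lambda s: s.day):
        if snap.day > resolution_date or snap.value is None:
            break
        last = snap.value
    return last
```

Note how the rule resolves to a stale number as soon as the format changes, which is exactly the weakness discussed above: the forecast becomes partly a forecast about the dashboard.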
An alternative is to try to specify what happens in every case. For example: if a range is presented, or alternative figures are available on an updated dashboard, the highest estimate or figure will be used. If the dashboard is discontinued, the people running it will be asked to provide a final number to resolve the question. If they do not reply, or do not agree on a specific value, a projection of the totals based on a linear regression using the final month of data will be used. This type of resolution requires specifying every possible eventuality, which is sometimes infeasible. It also needs to fall back on some final simple criterion to cover edge cases, and it needs to do so unambiguously.
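The chain of fallbacks in the example above could look roughly like the following. This is a sketch, not a definitive implementation: the function names and argument layout are assumptions made for illustration, and the regression is a plain ordinary least-squares line fitted to whatever final month of data is passed in.

```python
from typing import Optional, Sequence


def linear_projection(days: Sequence[float], totals: Sequence[float], target_day: float) -> float:
    """Fit an ordinary least-squares line to (days, totals) and evaluate it at target_day."""
    n = len(days)
    if n < 2 or n != len(totals):
        raise ValueError("need at least two matching observations")
    mean_x = sum(days) / n
    mean_y = sum(totals) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("days must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, totals))
    slope = sxy / sxx
    return mean_y + slope * (target_day - mean_x)


def resolve_with_fallbacks(
    displayed: Sequence[float],
    maintainer_answer: Optional[float],
    final_month_days: Sequence[float],
    final_month_totals: Sequence[float],
    resolution_day: float,
) -> float:
    """Apply the fallbacks in order (hypothetical argument layout)."""
    # 1. Figures on the dashboard: if several are shown, use the highest.
    if displayed:
        return max(displayed)
    # 2. Dashboard discontinued: use the maintainers' agreed final number, if any.
    if maintainer_answer is not None:
        return maintainer_answer
    # 3. Last resort: linear projection from the final month of data.
    return linear_projection(final_month_days, final_month_totals, resolution_day)
```

Writing the rule down this way makes the ordering explicit, and it also makes it obvious that the last step must always produce an answer.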
As another alternative, Metaculus sometimes chooses to leave a question as “ambiguous” if the data source is discontinued, or if it is later discovered that the resolution as stated doesn’t work for other reasons, for example because a possibility other than those listed occurs. That is undesirable because the forecasters cannot get feedback, no awards are given, forecasters feel like they have wasted their time, and the question that the prediction was supposed to answer ends up giving no information.
Augur, and perhaps other prediction markets, also allow for one of the resolutions to be “ambiguous” (or “Invalid Market”). For example, a question on who was the president of Venezuela in 2020 might have been resolved as “Invalid” given that both Juan Guaidó and Maduro had a claim to the position. Crucially, “ambiguous” resolutions can be traded on (and thus predicted) on Augur. This creates a better incentive than walking away, but in cases where there is a “morally correct” answer, it falls short of ideal.
Resolve with a Probability
One way to make the resolution less problematic when the outcome is ambiguous is to resolve probabilistically. Instead of a yes-or-no question resolving with a binary yes or no, it can resolve with a probability, a confidence interval, or a distribution. This is the approach taken by Polymarket (for binary questions) or Foretold (for continuous questions). We can imagine this as a useful solution if a baseball game is rained out. In such cases, perhaps the rule would be to pick a probability based on the outcomes of past games: with the teams tied, the question resolves at 50%, and with one team up by 3 runs in the 7th inning, it resolves at the percentage of past games won by a team that was up by 3 runs at that point.
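The rained-out game rule can be made concrete with a short sketch. Everything here is an assumption for illustration: the `GameRecord` layout, the function names, and the choice of squared error as the score against a fractional resolution. None of it is taken from how Polymarket or Foretold actually work.

```python
from typing import Iterable, Tuple

# (inning, lead in runs, whether the leading team went on to win).
# Hypothetical record layout for past games.
GameRecord = Tuple[int, int, bool]


def rainout_resolution(history: Iterable[GameRecord], inning: int, lead: int) -> float:
    """Resolve 'will the leading team win?' at the historical win rate.

    A tied game resolves at 0.5. Otherwise the resolution is the fraction
    of past games with the same lead at the same inning that the leading
    team won.
    """
    if lead == 0:
        return 0.5
    outcomes = [won for inn, ld, won in history if inn == inning and ld == lead]
    if not outcomes:
        raise ValueError("no comparable past games")
    return sum(outcomes) / len(outcomes)


def squared_error_score(forecast: float, resolution: float) -> float:
    """Brier-style score against a resolution that may be fractional (lower is better)."""
    return (forecast - resolution) ** 2
```

With four comparable past games, three of them won by the leading team, the question would resolve at 0.75, and a forecaster who said 0.8 would be scored against that figure rather than against 0 or 1.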
Aside: Ambiguity can be good!
As the last two resolution methods indicate, eliminating ambiguity can also greatly reduce the usefulness of a question. An example of this was a question in the original Good Judgment competition, which asked “Will there be a violent confrontation between China and a neighbor in the South China Sea?” The resolution criterion was whether there was a fatal interaction between different countries. The predictions were intended to be about whether military confrontations would occur, but the resolution ended up being about a Chinese fisherman stabbing a South Korean coast guard officer.
Barb Mellers said that the resolution “just reflected the fact that life is very difficult to predict.” I would disagree. I claim that the resolution reflected a failure to align the question’s resolution criteria with its intent, which was predicting increased Chinese aggression. But this is inevitable when questions are concrete, and the metric used is an imperfect proxy. (Don’t worry: I’m not going to talk about Goodhart’s Law yet again. But it’s relevant.)
This is why we might prefer a solution that allows some ambiguity, or at least interpretation, without depending on ambiguous or overly literal resolutions. I know of two such approaches.
One approach for dealing with ambiguous resolutions that still resolves predictions unambiguously is to appeal to an outside authority.
A recent Metaculus question for the 20/20 Insight Prediction Contest asked about a “Democratic majority in the US Senate.” When the Senate ended up tied, with Democratic control due to winning the presidency, the text of the question was cited: “The question resolves positively if Democrats hold 51 seats or more in the Senate according to the official election results.” Since the vice president votes, but does not hold a seat, the technical criteria were not met, despite the result being understood informally as a “Democratic majority.” The Metaculus admins said that they agreed on the question resolution due to the “51 seats or more” language.
Instead of relying narrowly on the wording, however, the contest rules were that in case of any ambiguity, the contest administrators “will consult at least three independent individuals, blind to our hypotheses and to the identity of participants, to make a judgment call in these contested cases.” The question resolution didn’t change, but the process was based on outside advisors.
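The contest rules quoted above do not say how the independent judgments are combined. As a purely illustrative assumption, a simple majority rule over at least three blind judges could be sketched like this (`panel_resolution` is a hypothetical name):

```python
from collections import Counter
from typing import Optional, Sequence


def panel_resolution(votes: Sequence[str], minimum_judges: int = 3) -> Optional[str]:
    """Return the outcome backed by a strict majority of judges, or None if there is none.

    Majority voting is an assumption made for this sketch, not something
    stated in the contest rules.
    """
    if len(votes) < minimum_judges:
        raise ValueError("not enough independent judges")
    top, count = Counter(votes).most_common(1)[0]
    return top if count > len(votes) / 2 else None
```

Even this small sketch has to decide what happens when the judges split three ways, which shows that appealing to an outside authority moves the edge cases rather than removing them.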
Even more extreme, a second approach is that forecasters can be asked to predict what they think the experts will decide. This is instead of predicting a narrow and well specified outcome, and allows for predicting things that are hard to pin down at present.
This is the approach proposed by Jacob Lagerros and Ben Gold for an AI Forecasting Resolution Council, where they propose using a group of experts to resolve otherwise likely-to-become-ambiguous questions. Another example of this is Kleros, a decentralized dispute resolution service. To use it, forecasts could have the provision that they be submitted to Kleros if the resolution is unclear, or perhaps all cases would be resolved that way.
This potentially increases fidelity with the intent of a question, but has costs. First, there are serious disadvantages to the ambiguity, since forecasters are predicting a meta-level outcome. Second, there are both direct and management costs to having experts weigh in on predictions. And lastly, this doesn’t actually avoid the problem of how to resolve the question. It offloads it, albeit in a way that can decrease the costs of figuring out how to decide.
As an interesting application of a similar approach, meta-forecasts have also been proposed as a way to resolve very long term questions. In this setup, we can ask forecasters to predict what a future forecast will be. Instead of predicting the price of gold in 2100, they can predict what another market will predict in 2030, and perhaps that market can itself be similarly predicting a market in 2040, and so on. But this strays somewhat from this post’s purpose, since the eventual resolution is still clear.
In this post, I’ve tried to outline the variety of methods that exist for resolving forecasts. I think this is useful as a reference and starting point for thinking about how to create and resolve forecasts. I also think it’s useful to frame a different problem that I want to discuss in the next post, about the difference between ambiguity and flexibility, and how to allow flexibility without making resolutions as ambiguous.
Thanks to Ozzie Gooen for inspiring the post. Thanks also to Edo Arad, Nuño Sempere, and again, Ozzie, for helpful comments and suggestions.