On both the piece and the question, I feel consistently confused that people keep asking “is long-range forecasting feasible” as a binary in an overly general context, which, as TedSanders mentioned, is trivially false in some cases and trivially true in others.
I get that if you are doing research on things, you’ll probably do research on real-world-esque cases. But if you were trying to prove long-term forecasting feasibility-at-all (which Luke’s post appears to be doing, since it ends on an unsure note about exactly that point), you’d want to start from the easiest case for feasibility: the best superforecaster ever predicting the absolute easiest questions, over and over. This is narrow on forecasters and charitable on difficulty. I’m glad to see Tetlock et al. looking at a narrower group of people this time, but you could go further. And I feel like people are still ignoring difficulty, to the detriment of everyone’s understanding.
If you predict coin tosses, you’re going to get a ROC AUC of .5. Chaos theory says some features have sensitive dependence on initial conditions at too low a resolution for us to track, and that we won’t be able to predict these. Other features are going to sit within basins of attraction that are easy to predict. The curve of AUC over time should absolutely drop off like that, because more features slip out of predictability as time goes on. This should not be surprising! The real question is “which questions are how predictable for which people?” (Evidently not the current questions for the current general forecasting pool.)
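To make that baseline concrete, here’s a minimal sketch (Python; the data-generating processes are invented for illustration and the scoring uses scikit-learn’s roc_auc_score, none of it from the actual study data): forecasting pure coin tosses pins AUC at roughly .5 no matter how good the forecaster is, while a question whose outcome mostly tracks an observable signal leaves plenty of AUC available.

```python
# Minimal sketch, not real study data: AUC on an unpredictable question vs. a
# partially predictable one, scored with scikit-learn's roc_auc_score.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 10_000

# Coin tosses: outcomes carry no usable information, so any forecast lands near .5.
coin_outcomes = rng.integers(0, 2, n)
coin_forecasts = rng.uniform(0, 1, n)
print(roc_auc_score(coin_outcomes, coin_forecasts))   # ~0.50

# A "basin of attraction" question: outcomes mostly follow an observable signal,
# so a forecaster who tracks that signal scores well above .5.
signal = rng.uniform(0, 1, n)
stable_outcomes = (signal + 0.3 * rng.normal(size=n) > 0.5).astype(int)
print(roc_auc_score(stable_outcomes, signal))         # well above .5, roughly .8 here
```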
There are different things you could do to answer that. First, two things NOT to do that I see a lot:
1. Implying that low resolution/AUC is a fault without checking calibration (as I maybe wrongly perceive the above graph or post as doing, but have seen elsewhere in a similar context). If you have good calibration, then a .52 AUC can be fine if you say 50% to most questions and 90% to one question; if you don’t, that 90% is gonna be drowned out in a sea of other wrong 90%s. (There’s a sketch of both of these failure modes right after this list.)
2. Trying to zero out the questions you give to predictors, e.g. “will Tesla produce more or less than [Tesla’s expected production] next year?”. If you’re looking for resolution/AUC, then baselining on a good guess specifically destroys your ability to measure it. (If you ask the best superforecaster to guess whether a series of 80% heads-weighted coin flips comes up with an average of more than .8, they’ll have no resolution, but if you ask them what the average will be, anywhere from 0 to 1, they’ll have high resolution.) It will also hamstring your ability to remove low-information answers if you try subtracting background, as mentioned in the next list.
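Here’s a minimal sketch of both failure modes (Python; every number below is made up for illustration, and scikit-learn’s roc_auc_score does the scoring):

```python
# Minimal sketch with invented numbers, not real forecasting data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# (1) Good calibration, low AUC: 50% on 999 genuine toss-ups and 90% on one
#     lopsided question, with forecasts exactly matching the true rates.
true_p = np.concatenate([np.full(999, 0.5), [0.9]])
forecasts = true_p.copy()                       # perfectly calibrated by construction
outcomes = rng.binomial(1, true_p)
print(roc_auc_score(outcomes, forecasts))       # barely above .5, yet nothing is wrong

# (2) Baselining at a good guess: "will the mean of 100 flips of an 80%-heads
#     coin exceed .8?" is roughly a coin toss even for a perfect forecaster,
#     while the raw mean is easy to beat a naive guess on.
means = rng.binomial(100, 0.8, size=5_000) / 100
print((means > 0.8).mean())                     # close to .5: no resolution to be had
print(np.mean((means - 0.8) ** 2), np.mean((means - 0.5) ** 2))  # informed point forecast wins easily
```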
Some positive options if you’re interested in figuring out what long-term questions are predictable by whom:
1. At the very least, ask questions you expect people to have real information about
2. Ask superforecasters to forecast metadata about questions, like whether people will have any resolution/AUC on subclasses of questions, or how much resolution/AUC differently ranked people will have on subclasses, or whether a prediction market would answer a question better (e.g. if there is narrowly-dispersed hidden information that is very strong). Then you could avoid asking questions that were expected to be unpredictable or wasteful in some other way.
3. Go through and try to find simple features of predictable vs. unpredictable long-term questions
4. Amplify the informational signal by reducing the haze of uncertainty not specific to the thing the question is interested in (mostly important for decade+ predictions). One option is to ask conditionals, e.g. “what percent chance is there that CRISPR-edited babies account for more than 10% of births if no legislation is passed banning the procedure” or something if you know legislation is very difficult to predict; another option is to ask about upstream features, like specifically whether legislation will be passed banning CRISPR. (Had another better idea here but fell asleep and forgot it)
5. Do a sort of anti-funnel plot or other baselining of the distribution over predictors’ predictions. This could look like subtracting the primary-fit beta distribution from the prediction histogram to see if there’s a secondary beta, or looking for higher-order moments or outliers of high credibility, or other signs of a nonrandom prediction distribution that might generalize well (a rough sketch of the beta-subtraction version follows this list). A good filter here is to not anchor them by saying “chances of more than X units” where X is already ~the aggregate mean, but instead make them rederive things (or, to be insidious, provide a faulty anchor and subtract an empirical distribution from around that point). Other tweaked opportunities for baseline subtraction abound.
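A rough sketch of the beta-subtraction version (Python with scipy; the pool of predictions is synthetic and the bin choices are arbitrary, so treat this as the shape of the idea rather than a recipe):

```python
# Minimal sketch on synthetic predictions: fit one beta to the pool's forecasts,
# subtract its density from the histogram, and look for a leftover bump that
# might mark an informed sub-group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical pool: a loosely-clustered uninformed bulk around 50%, plus a
# smaller group clustered tightly near 85% (the "hidden information" case).
preds = np.concatenate([
    rng.beta(6, 6, size=400),     # uninformed bulk
    rng.beta(40, 7, size=100),    # possible informed cluster
])

# Primary fit: a single beta over all predictions, support pinned to [0, 1].
a, b, _, _ = stats.beta.fit(preds, floc=0, fscale=1)

# Histogram minus fitted density: a run of positive residuals far from the
# fitted mode is the secondary structure worth a closer look.
edges = np.linspace(0, 1, 21)
hist, _ = np.histogram(preds, bins=edges, density=True)
centers = (edges[:-1] + edges[1:]) / 2
residual = hist - stats.beta.pdf(centers, a, b)
for c, r in zip(centers, residual):
    print(f"{c:.2f}: {r:+.2f}")   # with these made-up numbers, excess mass should show up near ~.85
```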
If Luke is primarily just interested in whether OpenPhil employees can make long-term forecasts on the kind of thing they forecast on, they shouldn’t be looking at resolution/AUC, just calibration, and making sure it’s still good at reasonably long timescales (a minimal version of that check is sketched below). To bootstrap, it would speed things along if they used their best forecasters to predict metadata: if there are classes of questions that are too unpredictable for them, I’m sure they can figure that out, especially if they spot-interviewed some people about long-term predictions they made.
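By “check calibration at long timescales” I mean something as simple as the following (Python; the forecast records are invented placeholders, and the bin count and horizon cutoff are arbitrary):

```python
# Minimal calibration check by forecast horizon, on made-up records.
import numpy as np

def calibration_table(forecasts, outcomes, n_bins=5):
    """Mean forecast vs. observed frequency within each forecast bin."""
    forecasts, outcomes = np.asarray(forecasts, float), np.asarray(outcomes, float)
    edges = np.linspace(0, 1, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (forecasts >= lo) & (forecasts < hi)
        if mask.any():
            rows.append((f"{lo:.1f}-{hi:.1f}", forecasts[mask].mean(),
                         outcomes[mask].mean(), int(mask.sum())))
    return rows

# Invented records: (forecast, outcome, years until resolution). The question is
# whether forecast and observed frequency still track each other for the
# long-horizon subset, not how sharp the forecasts are.
records = [(0.7, 1, 8), (0.3, 0, 2), (0.8, 1, 6), (0.6, 0, 9), (0.2, 0, 7)]
long_range = [(f, o) for f, o, yrs in records if yrs >= 5]
for row in calibration_table(*zip(*long_range)):
    print(row)
```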