Some scattered thoughts:

1. I think it’s very good to consider many different outside views for a problem. This is why I found section 2.1 of Yudkowsky’s Intelligence Explosion Microeconomics frustrating/a weak man: I think it’s plausibly much better to ensemble a bunch of weak outside views than to use a single brittle outside view.

“Beware the man of one reference class” as they say.
2. One interesting (obvious?) note on base rates that I haven’t seen anybody else point out: across time, you can think of “base rate forecasting” as just taking the zeroth derivative (while linear regression takes the first derivative, etc.).
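To make the derivative analogy concrete, here’s a minimal sketch in Python. The yearly counts are made-up numbers, and framing both forecasts as polynomial fits is just one way to cash out the analogy:

```python
import numpy as np

# Made-up yearly counts of some event, purely for illustration.
years = np.arange(2010, 2020)
counts = np.array([3, 5, 4, 6, 5, 7, 6, 8, 7, 9])

# Degree-0 fit: the best-fitting constant is just the historical mean,
# i.e. the base rate ("zeroth derivative" forecasting).
base_rate = np.polyfit(years, counts, deg=0)

# Degree-1 fit: ordinary linear regression, which also uses the trend
# ("first derivative" forecasting).
trend = np.polyfit(years, counts, deg=1)

print(np.polyval(base_rate, 2020))  # 6.0, the base rate
print(np.polyval(trend, 2020))      # ~9.0, extrapolating the trend
```

In this framing, the choice between base rates and trend extrapolation is just a choice of how many derivatives you trust the data to pin down.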
3.
So which reference class is correct? In my (inside) view as a superforecaster, this is where we turn to a different superforecasting trick, about considering multiple models. As the saying goes, hedgehogs know one reference class, but foxes consult many hedgehogs.
I think while consulting many models is a good reminder, the hard part is choosing which model(s) to use in the end. I think your ensemble of models can often do much better than an unweighted average of all the models you’ve considered, since some models are a) much less applicable, b) much more brittle, c) much less intuitively plausible, or d) much too strongly correlated with the other models you have.
As you’ve illustrated in some examples above, sometimes the final ensemble is composed of practically only one model! (I’ll sketch what I mean by weighted pooling below.)
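The four forecasts and the weights here are entirely hypothetical, and pooling in log-odds space is just one common choice rather than the obviously right one:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

# Four hypothetical probability forecasts for the same question.
probs = np.array([0.10, 0.15, 0.60, 0.62])

# Hypothetical weights: the last two models are near-duplicates of each
# other (d: strongly correlated), so together they get about one model's
# worth of weight, and the first is judged brittle (b) and down-weighted.
weights = np.array([0.2, 0.4, 0.2, 0.2])

unweighted = probs.mean()
weighted = inv_logit(np.average(logit(probs), weights=weights))

print(f"unweighted average:     {unweighted:.2f}")  # 0.37
print(f"weighted log-odds pool: {weighted:.2f}")    # ~0.28
```

The point isn’t the specific numbers, just that an unweighted average quietly double-counts correlated models and over-trusts brittle ones.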
4. I suspect starting with good meta-priors (in this case, good examples of reference classes to start investigating) is a substantial fraction of the battle. Often, you can have good priors even when things are very confusing.
5. One thing I’m interested in is how “complex” you’d expect a reasonably good forecast to be: how many factors go into the final forecast, how complex the interactions between the parameters are, etc. I suspect final forecasts that are “good enough” are often shockingly simple, and the hard part of a forecast is building/extracting a “correct enough” simplified model of reality and getting a small amount of the appropriate data that you actually need.
From Psychology of Intelligence Analysis, as summarized in the forecasting newsletter (emphasis mine):

Once an experienced analyst has the minimum information necessary to make an informed judgment, obtaining additional information generally does not improve the accuracy of his or her estimates. Additional information does, however, lead the analyst to become more confident in the judgment, to the point of overconfidence.
Experienced analysts have an imperfect understanding of what information they actually use in making judgments. They are unaware of the extent to which their judgments are determined by a few dominant factors, rather than by the systematic integration of all available information. Analysts actually use much less of the available information than they think they do.
There is strong experimental evidence, however, that such self-insight is usually faulty. The expert perceives his or her own judgmental process, including the number of different kinds of information taken into account, as being considerably more complex than is in fact the case. Experts overestimate the importance of factors that have only a minor impact on their judgment and underestimate the extent to which their decisions are based on a few major variables. In short, people’s mental models are simpler than they think, and the analyst is typically unaware not only of which variables should have the greatest influence, but also which variables actually are having the greatest influence.
If this theory is correct, or broadly correct, it’d point to human judgmental forecasting being dramatically different from dominant paradigms in statistical machine learning, where more data and more parameters usually improve accuracy.
(I think there may be some interesting analogies with the lottery ticket hypothesis that I’d love to explore more at some point.)
I suspect final forecasts that are “good enough” are often shockingly simple, and the hard part of a forecast is building/extracting a “correct enough” simplified model of reality and getting a small amount of the appropriate data that you actually need.
I think that it’s often true that good forecasts can be simple, but I also think that the gulf between “good enough” and “very good” usually contains a perverse effect, where slightly more complexity makes the model perhaps slightly better in expectation, and far worse at properly estimating variance or accounting for uncertainties outside the model. That means that for the purpose of forecasting, you get much worse (in Brier score) before you get better.
As a concrete example, this is seen when people forecast COVID deaths. They start with a simple linear trend, then say they don’t really think it’s linear, it’s actually exponential, so they roughly adjust their confidence and have appropriate uncertainties around a bad model. Then they get fancier and try using an SIR model that gives “the” answer: the forecaster simulates 100 runs to create a distribution by varying R_0 within a reasonable range. That gives an uncertainty range and a very narrow resulting distribution, one the forecaster is only willing to adjust narrowly, because their model accounts for the obvious sources of variance. Then schools are reopened, or treatment methods improve, or contact rates drop as people see case counts rise, and the model’s assumptions are invalidated in a different way than was expected.
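Here’s a toy version of that failure mode; the numbers and the deliberately crude discrete-time SIR are both just for illustration:

```python
import numpy as np

def sir_deaths(r0, days=200, n=1_000_000, i0=100, gamma=0.1, ifr=0.01,
               beta_drop_day=None, drop=0.5):
    """Crude discrete-time SIR; returns cumulative deaths.

    beta_drop_day models a behavior change (e.g. contact rates falling
    as case counts rise) that the forecaster's ensemble doesn't include.
    """
    s, i = n - i0, i0
    beta = r0 * gamma
    for day in range(days):
        if day == beta_drop_day:
            beta *= drop
        new_inf = beta * s * i / n
        new_rec = gamma * i
        s -= new_inf
        i += new_inf - new_rec
    return ifr * (n - s)  # deaths ~ IFR x cumulative infections

# The forecaster's ensemble: 100 runs varying R_0 within a "reasonable"
# range, holding everything else fixed.
rng = np.random.default_rng(0)
runs = [sir_deaths(r0) for r0 in rng.uniform(2.0, 2.8, size=100)]
lo, hi = np.percentile(runs, [5, 95])

# "Reality": behavior changes on day 40, which no ensemble member allows.
truth = sir_deaths(2.4, beta_drop_day=40)

print(f"90% interval from varying R_0 alone: [{lo:,.0f}, {hi:,.0f}]")
print(f"outcome with the behavior change:    {truth:,.0f}")
```

The ensemble’s interval looks rigorous, but it only quantifies uncertainty within the model; the thing that actually happens sits outside it.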
I think while consulting many models is a good reminder, the hard part is choosing which model(s) to use in the end. I think your ensemble of models can often do much better than an unweighted average of all the models you’ve considered, since some models are a) much less applicable, b) much more brittle, c) much less intuitively plausible, or d) much too strongly correlated with the other models you have.
As I said to Luke in a comment on his link to an excellent earlier post that discusses this, I think there is far more to be said about how to do model fusion, and I agree with the point in his paper that ensembles which simply average models are better than single models, but still worse than actually figuring out what each model tells you.