Epistemic status: perspective derived from following Dan Luu’s output for the last 5 years or so. Trying to vaguely gesture at a few things at once. Please ask questions if you find something confusing.
Dan Luu has written an interesting post analysing the track record of futurists’ predictions. The motivation:
I’ve been reading a lot of predictions from people who are looking to understand what problems humanity will face 10-50 years out (and sometimes longer) in order to work in areas that will be instrumental for the future and wondering how accurate these predictions of the future are. The timeframe of predictions that are so far out means that only a tiny fraction of people making those kinds of predictions today have a track record so, if we want to evaluate which predictions are plausible, we need to look at something other than track record.
The idea behind the approach of this post was to look at predictions from an independently chosen set of predictors (Wikipedia’s list of well-known futurists) whose predictions are old enough to evaluate in order to understand which prediction techniques worked and which ones didn’t work, allowing us to then (mostly in a future post) evaluate the plausibility of predictions that use similar methodologies.
I’m primarily going to address the appendix, particularly the section on Holden Karnofsky’s analysis on the same subject, but the article is interesting reading and I’d recommend going through the whole thing. (I think Dan is evaluating forecasting track records pretty differently from how I would, and I haven’t actually dug into any of the other analysis. On priors I’d expect it to be similar to his analysis of Holden’s work.)
Karnofsky’s evaluation of Kurzweil being “fine” to “mediocre” relies on these two analyses done on LessWrong and then uses a very generous interpretation of the results to conclude that Kurzweil’s predictions are fine. Those two posts rate predictions as true, weakly true, cannot decide, weakly false, or false. Karnofsky then compares the number of true + weakly true to false + weakly false, which is one level of rounding up to get an optimistic result; another way to look at it is that any level other than “true” is false when read as written. This issue is magnified if you actually look at the data and methodology used in the LW analyses.
The specific claim I have an issue with here is “another way to look at it is that any level other than ‘true’ is false when read as written”. Depending on how you want to evaluate it, it’s either technically true but irrelevant, or not even wrong.
In the second post, the author, Stuart Armstrong indirectly noted that there were actually no predictions that were, by strong consensus, very true when he noted that the “most true” prediction had a mean score of 1.3 (1 = true, 2 = weakly true … , 5 = false) and the second highest rated prediction had a mean score of 1.4. Although Armstrong doesn’t note this in the post, if you look at the data, you’ll see that the third “most true” prediction had a mean score of 1.45 and the fourth had a mean score of 1.6, i.e., if you round to the nearest prediction score, only 3 out of 105 predictions score “true” and 32 are >= 4.5 and score “false”. Karnofsky reads Armstrong’s as scoring 12% of predictions true, but the post effectively makes no comment on what fraction of predictions were scored true and the 12% came from summing up the total number of each rating given.
I’m not going to say that taking the mean of each question is the only way one could aggregate the numbers (taking the median or modal values could also be argued for, as well as some more sophisticated scoring function, an extremizing function, etc.), but summing up all of the votes across all questions results in a nonsensical number that shouldn’t be used for almost anything. If every rater rated every prediction or there was a systematic interleaving of who rated what questions, then the number could be used for something (though not as a score for what fraction of predictions are accurate), but since each rater could skip any questions (although people were instructed to start rating at the first question and rate all questions until they stop, people did not do that and skipped arbitrary questions), aggregating the number of each score given is not meaningful and actually gives very little insight into what fraction of questions are true.
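To make the distinction in that quote concrete, here is a minimal Python sketch on made-up ratings. The `ratings` dict and both function names are hypothetical, and none of this is the real survey data. It contrasts pooling every vote across questions with aggregating per question first, here using the mean and the round-to-nearest cutoff of 1.5:

```python
from collections import Counter
from statistics import mean

# Hypothetical ratings (1 = true ... 5 = false); NOT the real survey data.
# The lists differ in length because raters could skip questions.
ratings = {
    "q1": [1, 1, 1, 1, 1, 1, 2, 2],
    "q2": [1, 3, 4, 5],
    "q3": [1, 2, 4, 5, 5],
}


def pooled_fraction_true(ratings):
    """Share of all individual votes equal to 1, pooled across questions."""
    votes = Counter(v for rs in ratings.values() for v in rs)
    return votes[1] / sum(votes.values())


def fraction_questions_true(ratings, cutoff=1.5):
    """Share of questions whose mean rating rounds to 1 (mean < cutoff)."""
    hits = sum(1 for rs in ratings.values() if mean(rs) < cutoff)
    return hits / len(ratings)


print(pooled_fraction_true(ratings))
print(fraction_questions_true(ratings))
```

On this toy data about 47% of individual votes are 1s, yet only one of the three questions has a mean below 1.5. The pooled number is dominated by whichever questions attracted the most raters, which is why it says little about the fraction of questions that are true.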
This seems like a basically accurate description of the methodology used in the 2019 assessment. Stuart Armstrong says in a footnote that removing 4 of the 34 assessors who had gaps in their assessments didn’t change any of the results, but I don’t expect this would address Dan’s primary criticism. I performed my own data cleaning and removed 9 of the assessors who had substantial gaps in their ratings (2 of those remaining still have at least one missing rating). In both cases, the results obtained on mean (<2 and <1.5), median, and mode are identical:
| Criterion | Number of predictions |
|---|---|
| Mean < 2 | 8 |
| Mean < 1.5 | 3 |
| Median = 1 | 6 |
| Mode = 1 | 14 |
(Counts are absolute; divide by the 105 predictions to get fractions.)
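Counts like these can be produced along the following lines. This is a sketch under stated assumptions rather than the exact procedure used: the rater-by-prediction layout with `None` for skipped ratings, the `max_missing` cutoff, the function names, and the choice to count a mode only when 1 is the unique most common grade are all illustrative.

```python
from statistics import mean, median, multimode

# Hypothetical data: one row per rater, one column per prediction.
# None marks a skipped rating. NOT the real survey data.
rows = [
    [1, 2, 5],
    [1, 1, 4],
    [1, 1, None],
    [2, 3, 5],
    [None, None, 1],
]


def drop_sparse_raters(rows, max_missing=0):
    """Keep only raters with at most `max_missing` skipped predictions."""
    return [r for r in rows if sum(v is None for v in r) <= max_missing]


def count_true(rows):
    """Count predictions that look 'true' under each aggregation rule."""
    counts = {"mean < 2": 0, "mean < 1.5": 0, "median == 1": 0, "mode == 1": 0}
    for q in range(len(rows[0])):
        col = [r[q] for r in rows if r[q] is not None]
        if not col:
            continue  # nobody rated this prediction
        counts["mean < 2"] += mean(col) < 2
        counts["mean < 1.5"] += mean(col) < 1.5
        counts["median == 1"] += median(col) == 1
        # Assumption: ties for most common grade do not count as mode 1.
        counts["mode == 1"] += multimode(col) == [1]
    return counts


print(count_true(rows))
print(count_true(drop_sparse_raters(rows)))
```

Running the same summary before and after dropping sparse raters is one way to check whether the cleaning step changes the headline counts.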
I think requiring a mean under 1.5 to decide something is “true” puts too much weight in the hands of outliers who are either interpreting the prediction differently from the rest, or are simply wrong as a matter of fact.
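A toy illustration of that sensitivity, using made-up grades for a single prediction (the numbers are hypothetical, not taken from the assessment data):

```python
from statistics import mean, median, mode

# Hypothetical grades for one prediction: eight raters say "true" (1)
# and two dissenters say "false" (5). NOT taken from the real data.
grades = [1] * 8 + [5] * 2

summary = {
    "mean": mean(grades),
    "median": median(grades),
    "mode": mode(grades),
}
print(summary)
```

Here the mean is 1.8, so the prediction fails the 1.5 cutoff even though the median and the mode are both 1.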
With that said, I think deferring to these aggregate evaluations at all is a mistake. It seems like Dan agrees, though for reasons that I disagree with:
Another fundamental issue with the analysis is that it relies on aggregating votes of a kind from Less Wrong readers and the associated community. As we discussed here, it’s common to see the most upvoted comments in forums like HN, lobsters, LW, etc., be statements that can clearly be seen to be wrong with no specialized knowledge and a few seconds of thought (and an example is given from LW in the link), so why should an aggregation of votes from the LW community be considered meaningful?
I can’t actually look at the paywalled source, but putting aside the accuracy of the “most upvoted comments” on LessWrong, the graders were not randomly selected from those LessWrong users who make heavily upvoted comments. Nearly half of them were publicly named, and many of those have extensively documented their thoughts online. One could, if desired, go read some of their writings and judge their epistemics for oneself.
Of the predictions which had a modal grade of 1, I personally consider 8 of them to be true, and maybe 5-6 to be non-trivial. I think Dan would consider some of them—perhaps even most—to be insufficiently rigorously specified to grade.
Dan seems to have an unusual knack for noticing inconsistencies and $20 bills on the sidewalk. His work sometimes seems to avoid performing inside-view analysis, which can make engaging with it a bit tricky. It does seem to pay off in cases like this. I don’t have time to dig into it right now, but in the appendix he also linked to a post by nostalgebraist addressing the initial Bio Anchors report that seems worth following up on.
Someone in 1960 predicting an unlikely outcome in 2010, and that outcome actually occurring in 2011, is technically “wrong”, but a very different kind of wrong from someone in 1960 predicting an unlikely outcome in 1965 and that outcome still not having occurred at all.
Other people’s evaluations of predictions are not, in fact, especially solid pointers to the truth-value of those predictions. Given the subject of the article I think Dan probably appreciates this point.
Three of the basic aggregations suggested by Dan as being even minimally informative.
The fact that the modal grade for question 47 (about cochlear implants) was a 1, while the mean was ~2.5 (with many 4s and 5s), is mostly an indication that the prediction was underspecified, and that the graders in question had very different ideas in mind of what “very effective” and “widely used” meant (or had similar ideas but didn’t bother looking up the actual numbers).
In the sense that most people were not making similar predictions at the time, and priors on those predictions were probably low.
Quote: “I find it a bit odd that, with all of the commentary of these LW posts, few people spent the one minute (and I mean one minute literally — it took me a minute to read the post, see the comment Armstrong made which is a red flag, and then look at the raw data) it would take to look at the data and understand what the post is actually saying, but as we’ve noted previously, almost no one actually reads what they’re citing.”
For reasons that at least aren’t obviously wrong, though I think forgoing an inside-view opinion while simultaneously delivering an outside-view refutation is not enormously productive.