Beware boasting about non-existent forecasting track records

Imagine a financial pundit who keeps saying, “Something really bad is brewing in the markets and we may be headed for a recession. But we can’t know when recessions will come; nobody can predict them.” Then every time there’s a selloff in the market, they tell everyone “I’ve been saying we were headed for trouble,” taking credit. This doesn’t work as a forecasting track record, and it shouldn’t be thought of that way.

If they want forecaster prestige, their forecasts must be:

  1. Pre-registered,

  2. So unambiguous that people actually agree whether the event “happened”,

  3. With probabilities and numbers so we can gauge calibration,

  4. Numerous enough that success isn’t just a fluke or cherry-picking (a scoring sketch follows this list).
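To make points 3 and 4 concrete, here is a minimal sketch, in Python, of how a batch of pre-registered probabilistic forecasts could be scored once they resolve. The forecasts in it are made up purely for illustration, and this is not any official Metaculus or Tetlock tooling.

```python
# A minimal sketch (hypothetical data, not any official Metaculus/Tetlock tooling)
# of scoring a batch of pre-registered probabilistic forecasts after they resolve.
from collections import defaultdict

# Each record: (stated probability that the event happens, whether it happened).
forecasts = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.4, True), (0.3, False), (0.2, False), (0.1, False),
]

# Brier score: mean squared error between probability and outcome (lower is better).
brier = sum((p - float(outcome)) ** 2 for p, outcome in forecasts) / len(forecasts)
print(f"Brier score: {brier:.3f}")

# Calibration: within each stated-probability bucket, how often did the event occur?
buckets = defaultdict(list)
for p, outcome in forecasts:
    buckets[round(p, 1)].append(outcome)

for p in sorted(buckets):
    outcomes = buckets[p]
    print(f"forecast {p:.0%}: event occurred {sum(outcomes)}/{len(outcomes)} times")
```

With enough resolved forecasts, the score and the per-bucket frequencies start to separate skill from luck; with only a handful, or with vague resolution criteria, neither number means much.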

When Eliezer Yudkowsky talks about forecasting AI, he has several times claimed or implied that he has a great forecasting track record. But a meaningful “forecasting track record” has well-known and very specific requirements, and Eliezer doesn’t meet them.

Here he dunks on the Metaculus predictors over a weak-AGI question, calling their updating “excruciatingly predictable” and describing himself as a sane person with self-respect (implying the Metaculus predictors aren’t):

To be a slightly better Bayesian is to spend your entire life watching others slowly update in excruciatingly predictable directions that you jumped ahead of 6 years earlier so that your remaining life could be a random epistemic walk like a sane person with self-respect.

I wonder if a Metaculus forecast of “what this forecast will look like in 3 more years” would be saner. Is Metaculus reflective, does it know what it’s doing wrong?

He clearly believes he could place forecasts that would show whether or not he is better. Yet he doesn’t.

Some have argued, “but he may not have time to keep up with the trends; forecasting is demanding”. But he’s the one making a claim about relative accuracy! And this is in the domain he says is the most important one of our era. And he already seems to be keeping up with the trends; if so, he could simply submit his distribution.

And here he dunks on Metaculus predictors again:

What strange inputs other people require instead of the empty string, to arrive at conclusions that they could have figured out for themselves earlier; if they hadn’t waited around for an obvious whack on the head that would predictably arrive later. I didn’t update off this.

But he still isn’t transparent about his own forecasts, which prevents a fair comparison.

In another context, Paul Christiano offered to bet Eliezer about AI timelines. This is great: a bet is a tax on bullshit. While it doesn’t produce a nice calibration chart like Metaculus does, it does give information about performance. You would be right to be wary of betting against Bryan Caplan. And to Eliezer’s great credit, he has actually made a related bet with Bryan! EDIT: Note that Eliezer also agreed to this bet with Paul.

But at one point in responding to Paul, Eliezer mentions some nebulous, unscorable debates and claims:

I claim that I came off better than Robin Hanson in our FOOM debate compared to the way that history went. I’d claim that my early judgments of the probable importance of AGI, at all, stood up generally better than early non-Yudkowskian EA talking about that.

Nothing about this is a forecasting track record. These are post-hoc opinions. There are unavoidable reasons we require pre-registration of forecasts, removal of definitional wiggle room, explicit numbers, and a decent sample. This response sounds like the financial pundit claiming he called the recession.

EDIT: Some people seem to think that Eliezer was the unambiguous winner of that debate, and that this therefore counts as part of a forecasting track record. But you can see why it’s far more ambiguous than that in this comment by Paul Christiano.

In this comment, Eliezer said Paul didn’t need to bet him, and that Paul is...lacking a forecasting track record.

I think Paul doesn’t need to bet against me to start producing a track record like this; I think he can already start to accumulate reputation by saying what he thinks is bold and predictable about the next 5 years; and if it overlaps “things that interest Eliezer” enough for me to disagree with some of it, better yet.

But Eliezer himself doesn’t have a meaningful forecasting track record.

In other domains, where we have more practice detecting punditry tactics, we would dismiss such an uninformative “track record”. We’re used to hearing Tetlock talk about ambiguity in political statements. We’re used to hearing about financial pundits like Jim Cramer underperforming the market. But AI timelines are a novel domain.

When giving “AGI timelines”, I’ve heard several EAs claim there is no ambiguity risk in how the forecast would resolve. They think this because the imagery in their heads is dramatic, so we’ll just know whether they were right. No we won’t. This shows wild overconfidence in the scenarios they can imagine, and in how well words can distinguish one outcome from another.

Even “the AGI question” on Metaculus had some major ambiguities that could’ve prevented resolution. Matthew Barnett nicely proposed solutions to clarify them. Many people talking about AI timelines should find this concerning, because they make “predictions” that aren’t defined anywhere near as well as that question. It’s okay for informal discussions to be nebulous. But while nebulous predictions sound informative, it takes years before it becomes obvious that they were meaningless.

So why won’t Eliezer use the ways of Tetlock? He says this:

I consider naming particular years to be a cognitively harmful sort of activity; I have refrained from trying to translate my brain’s native intuitions about this into probabilities, for fear that my verbalized probabilities will be stupider than my intuitions if I try to put weight on them. What feelings I do have, I worry may be unwise to voice; AGI timelines, in my own experience, are not great for one’s mental health, and I worry that other people seem to have weaker immune systems than even my own. But I suppose I cannot but acknowledge that my outward behavior seems to reveal a distribution whose median seems to fall well before 2050.

He suggests that if he used proper forecasting methods, it would hurt people’s mental health. But Eliezer seems willing to frame his message as blatant fearmongering like this. For years he’s been telling people they are doomed, and he often suggests they are intellectually flawed if they don’t agree. To me, he doesn’t come across as sparing me an upsetting truth. He sounds like he’s catastrophizing, which isn’t what I expect from a message tailored for mental health.

I’m not buying speculative infohazard arguments, or other “reasons” to obfuscate. If Eliezer thinks he has detected an imminent world-ending danger to humans, then the best approach would probably be to give a transparent, level-headed assessment.

...for fear that my verbalized probabilities will be stupider than my intuitions if I try to put weight on them

Well, with practice he would improve at verbalizing probabilities, as Tetlock found. And how does he expect to know whether his intuitions are stupid if he doesn’t test them against reality? Sure, doing so would probably make him seem much less prescient. But that’s a good thing, if the result is more objective and real.

And no, his domain eminence isn’t much of an update. The forecasting edge from being a subject-matter expert is generally pretty underwhelming, however special you think AI is. Maybe even less so if we consider him a relatively famous expert. Does anyone predict they can dominate the long-term question leaderboards by having insights and skipping proper forecasting practice? That’s wishful thinking.

One justification I’ve heard: “Shorter-term questions can’t show how good his judgment is about longer-term questions”. This seems like a rationalization. Suppose you have two groups: those who show good calibration on 3-year AI questions, and those who don’t. In many cases, both groups will end up being dart-throwing chimps on 30-year AI questions. But that hardly justifies not even trying to do it properly. And if some forecasters do outperform on the long-term questions, those who were at least calibrated on 3-year questions have a much better chance of being among them than those who never demonstrated that. It’s easy to end up with an outcome where the uncalibrated do even worse than a dart-throwing chimp.

If you would like to have some chance at forecasting AI timelines, here are a couple of paths. 1) Good generalist forecasters can study supplemental domain material. 2) Non-forecaster domain experts can start building a calibration graph of proper forecasts (a sketch of what that looks like follows below). Those are basically the options.
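As a rough illustration of option 2, here is a minimal sketch of turning resolved forecasts into a calibration graph. The data and binning are hypothetical, chosen only to show the shape of the exercise; it is not any particular forecaster’s tooling.

```python
# A minimal sketch of a calibration graph (reliability diagram) built from
# hypothetical resolved forecasts; purely illustrative.
import matplotlib.pyplot as plt

# (stated probability, did the event happen?) for each resolved forecast
resolved = [
    (0.1, False), (0.1, False), (0.2, True), (0.3, False), (0.4, False),
    (0.5, True), (0.6, True), (0.7, True), (0.8, True), (0.9, True),
]

# Group forecasts into probability bins and compute the observed frequency in each.
bins = [(lo / 10, (lo + 2) / 10) for lo in range(0, 10, 2)]  # 0.0-0.2, 0.2-0.4, ...
xs, ys = [], []
for lo, hi in bins:
    in_bin = [outcome for p, outcome in resolved
              if lo <= p < hi or (hi == 1.0 and p == 1.0)]
    if in_bin:
        xs.append((lo + hi) / 2)              # bin midpoint (stated probability)
        ys.append(sum(in_bin) / len(in_bin))  # observed frequency of the event

plt.plot([0, 1], [0, 1], "--", label="perfect calibration")
plt.plot(xs, ys, "o-", label="my forecasts")
plt.xlabel("stated probability")
plt.ylabel("observed frequency")
plt.legend()
plt.show()
```

The point of the graph is simply that the dots sit near the diagonal when someone’s stated probabilities match how often events actually happen, which is feedback you cannot get from unscored pronouncements.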

People who avoid forecasting accountability shouldn’t boast about their forecasting performance. And other people shouldn’t rationalize it for them. I thought Eliezer’s bet with Bryan was great. But before dunking on properly-scored forecasts, he should be transparent: create a public Metaculus profile, place properly-scored forecasts, and start getting feedback.

Thank you to KrisMoore, Linch, Stefan Schubert, Nathan Young, Peterwildeford, Rob Lee, Ruby, and tenthkrige for suggesting changes.