How superforecasting could be manipulated

Introduction

What if Philip Tetlock wanted to game his own superforecasting system? He runs his own consulting firm, and the information he’s selling is not available to the public. How would he do it without violating any of the rules he’s set up for forecasting?

I’m not saying he’s doing this, not at all. I respect Tetlock greatly, admire the work he’s done, and seek to learn from it. In fact, it’s in the spirit he brings to his enterprise that I’m questioning it. After all, his work is based on calling out the wiggle room “expert” commentators give themselves, and trying to get rid of it. I’m examining whether there might yet be a little bit of wiggle room left in the system.

How superforecasting could be manipulated

My understanding is that Tetlock designates the top 2% of participants in his tournaments as superforecasters. He then tracks their accuracy to see whether it holds steady or regresses to the mean, and finds that superforecasters really are better than the average person at predicting complex world events.

In theory, the way the system could be manipulated is in the selection of questions.

Once designated, superforecasters are set apart from regular forecasters. For example, the questions answered at Good Judgment, Tetlock’s professional consulting firm, are not visible to the public. I have to assume they’re not (all) mirrored at Good Judgment Open, his open-to-all platform for forecasting. Hence, there’s no way for even a client to compare how superforecasters do on these questions against how the average Joanne would do.

How could their accuracy be inflated, even while answering unambiguous questions and being evaluated in a purely algorithmic fashion?

By choosing primarily questions that are in fact easy for a professional to answer, yet appear difficult to the typical client.

Here are two versions of the same question:

“Will Donald Trump and Kim Jong Un meet on or before Dec. 31, 2020?”
“Will Donald Trump and Kim Jong Un meet in Seoul on or before Dec. 31, 2020?”

Most readers here are savvy enough to realize that the odds of these two leaders meeting in Seoul are lower than the odds of their meeting at all, and familiar enough with the conjunction fallacy to know that “in Seoul” may nonetheless make the second version seem more plausible to some readers. A professional superforecaster (or a question writer at a global risk analysis firm) would know that Seoul, as the capital of South Korea, is an unlikely venue for the Dear Leader’s next meeting. So by posing this question, you give the superforecasters an easy A.

Include a large number of these seems-hard-but-actually-easy questions, and you can inflate your team’s average accuracy. Include a few genuinely tough questions, like the number of coronavirus cases a month from now, and when your team gets them right, you can trumpet it. When they get other truly uncertain questions wrong, well, you can’t win them all.

I want to reiterate that I am not accusing Tetlock or any other group of forecasters of doing this, consciously or unconsciously. In fact, on their analytics page, Good Judgment advertises that clients can choose the questions:

Clients… can… pose their own questions to the Superforecasters… And, if we don’t have your topic on our current subscription list, our expert question team will help frame your mission-critical questions to get the answers you need.

GJ might do this with perfect integrity. And clients using this service have an incentive to pose questions where an increase in accuracy has genuine value.

But an unscrupulous firm might get a lot of practice at the additional skill of guiding clients to frame questions that appear uncertain but are in fact easy to answer. If the incentives of those hiring the consulting firm diverge from the overall interests of their company, if those purchasing the analytics aren’t directly invested in getting answers to the most pressing questions, if the real object of the transaction is to have reassuringly advanced methodology backing up your report, then heck, maybe it’s in the interest of the client (meaning whichever manager is doing the hiring) to have the consultant spit back a bunch of highly confident answers that almost all turn out to be objectively accurate. It makes it really easy to defend your choice of analytics firm later on.

Testing calibration is not a fix

After Nate Silver sort-of failed to predict Trump’s presidential win, he checked whether his predictions were well-calibrated. Turns out, they are: when he predicts an event has a 70% chance of occurring (rounded to the nearest 5%), it occurs 71% of the time.

If an unscrupulous forecasting firm wanted to hack this, could they?

Let’s say they have a batch of 100 questions. Ninety-nine are easy questions: each event is so unlikely that the superforecasters predict a 1% chance of it occurring. One, though, is a hard question, a genuine coin flip with a 50% chance.

All they have to do is predict that the hard question also has a 1% chance of occurring. Without the hard question, they might have gotten 1 or 2 of the 99 wrong (at a true 1% rate, about one miss is expected), maybe even 3. With the hard question, they now have a 50% chance of getting one more question wrong. Even with some bad luck, when they predict a 1% chance of an event occurring, it actually occurs only 4% of the time. And doesn’t that still look pretty darn good?
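To put rough numbers on that, here is a minimal simulation of the “1%” calibration bin for such a batch. This is my own toy setup, not anything from Tetlock or Good Judgment: ninety-nine questions whose events really do have a 1% chance of happening, plus one genuine coin flip that gets sandbagged and reported at 1%.

```python
import random

random.seed(0)

def one_percent_bin_rate(num_easy=99, include_hard=True, trials=10_000):
    """Average rate at which events *reported* at 1% actually occur.

    Easy questions truly have a 1% chance of happening; the hard
    question is a genuine coin flip but is reported at 1% anyway.
    """
    events = 0
    questions = 0
    for _ in range(trials):
        # Easy questions: true probability matches the reported 1%.
        events += sum(random.random() < 0.01 for _ in range(num_easy))
        questions += num_easy
        if include_hard:
            # Hard question: true probability 50%, reported as 1%.
            events += random.random() < 0.5
            questions += 1
    return events / questions

print(f"Easy questions only:      {one_percent_bin_rate(include_hard=False):.2%}")
print(f"With the sandbagged flip: {one_percent_bin_rate(include_hard=True):.2%}")
```

Averaged over many imaginary batches, the sandbagged coin flip only drags the “1%” bin from about 1% up to about 1.5%, and even a single unlucky batch lands around 3-4%. Either way, the calibration table still looks reassuring.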

Why this matters

First, it suggests to me that if the superforecasting system can be hacked in this way, then we’re back to square one. We are trusting an accuracy assessment of experts that is manipulable. It might be more difficult to manipulate than that of the pundits who never really forecast anything. But at least with the pundits, we know to be suspicious.

The superforecasters still have an aura of Truth about them. It can get to the point where, if our own analysis disagrees with the superforecasters, we’re inclined to doubt ourselves almost automatically, with the same sort of fundamentalism with which some people subscribe to the efficient market hypothesis.

The advantage we amateurs have is that we’re likely to pick questions with high uncertainty and high importance. Why would I bother looking around for questions I can easily guess the answer to, unless I was trying to impress other people with my clairvoyance?

I don’t think it’s productive to go around accusing specific people of using this trick to artificially inflate their accuracy scores. But it’s hard for me to imagine it isn’t already happening as the superforecasting meme spreads.

I’m not sure I have a well-worked-out way to deal with this, but here are some sketches.

If someone posts a list of their predictions, pick the ones on the list that you personally feel most uncertain about. Make a public, dated statement that those are your uncertainties. When the questions are resolved, score their accuracy based only on the questions you had genuine uncertainty about.
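Here is a minimal sketch of what that filtered scoring could look like, with entirely made-up numbers for illustration. It uses the Brier score (mean squared error between the stated probability and the 0/1 outcome, lower is better) as one standard accuracy measure, and compares a forecaster’s score over their full list against their score over only the questions flagged as genuinely uncertain beforehand.

```python
def brier(prob: float, outcome: int) -> float:
    """Squared error between a stated probability and the 0/1 outcome; lower is better."""
    return (prob - outcome) ** 2

# Hypothetical records: (question, forecaster's stated probability,
# outcome, did I flag it as genuinely uncertain before resolution?)
predictions = [
    ("Leaders meet anywhere by year end", 0.85, 1, False),
    ("Leaders meet in Seoul by year end", 0.02, 0, False),
    ("Seems-hard-but-easy question A",    0.03, 0, False),
    ("Seems-hard-but-easy question B",    0.95, 1, False),
    ("200,000 cases by mid-March",        0.03, 1, True),
    ("A genuine coin-flip question",      0.50, 0, True),
]

overall = sum(brier(p, o) for _, p, o, _ in predictions) / len(predictions)
flagged = [(p, o) for _, p, o, f in predictions if f]
uncertain_only = sum(brier(p, o) for p, o in flagged) / len(flagged)

print(f"Brier score over all questions:       {overall:.2f}")        # ~0.20, looks sharp
print(f"Brier score over my uncertain subset: {uncertain_only:.2f}")  # ~0.60, worse than a flat 50% guess
```

On these invented numbers, the forecaster looks impressive over the whole padded list but does worse than a flat 50% guess on the subset that was genuinely uncertain, which is exactly the gap this kind of filtering is meant to expose.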

A better way, of course, would be to submit those questions to a formal forecasting tournament and see whether the median prediction lands closer to 1% or 99% (highly unlikely/likely), or closer to 50% (highly uncertain). But that’s not typically tractable.

Betting is also an OK strategy, but the failure mode there is that if you can make up your own questions, then find an idiot to bet with, you can accomplish the same thing. It only really works if an outsider would respect both people involved. I do respect Warren Buffett’s bet with Protégé Partners LLC about the relative long-term performance of an index fund vs. a hedge fund, because it was high-profile, with not just financial but reputational stakes for both involved.

I do plan, personally, on being on watch against people putting on airs because of the sheer number of correct predictions they’ve made. Based on this, I’m downgrading my heuristic for interpreting superforecaster predictions to “soft evidence” unless I can see their models and evidence, or can find some other way to evaluate whether the overall system they’re participating in is really being run with integrity.

In the end, it’s a choice between believing the reasoning and evidence you can personally evaluate, and accepting a consulting firm’s analysis because of what the CEO tells you about how they run the company. If you’re in a long-term relationship with the consultants, maybe you can get a sense of whether they’re steering you right over the long run. But when superforecasters predicted a 3% chance of 200,000 coronavirus cases by mid-March, maybe it’s time to downgrade our confidence in superforecasters, rather than in ourselves.