# The Stopped Clock Problem

When a low-probability, high-impact event occurs, and the world “got it wrong”, it is tempting to look for the people who did successfully predict it in advance in order to discover their secret, or at least see what else they’ve predicted. Unfortunately, as Wei Dai discovered recently, this tends to backfire.

It may feel a bit counterintuitive, but this is actually fairly predictable: the math backs it up on some reasonable assumptions. First, let’s assume that the topic required unusual levels of clarity of thought not to be sucked into the prevailing (wrong) consensus: say a mere 0.001% of people accomplished this. These people are worth finding, and listening to.

But we must also note that a good chunk of the population are just pessimists. Let’s say, very conservatively, that 0.01% of people predicted the same disaster just because they always predict the most obvious possible disaster. Suddenly the odds are pretty good that anybody you find who successfully predicted the disaster is a crank. The mere fact that they correctly predicted the disaster becomes evidence only of extreme reasoning; it is insufficient to tell whether that reasoning was extremely good or extremely bad. And on balance, most of the time, it’s extremely bad.

Unfortunately, the problem here is not just that the good predictors are buried in a mountain of random others; it’s that the good predictors are buried in a mountain of extremely poor predictors. The result is that the mean prediction of that group is going to be noticeably worse than the prevailing consensus on most questions, not better.
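To put numbers on the argument, here is a minimal Bayes-style sketch. It uses the made-up rates from the text and assumes, purely for illustration, that everyone in both groups predicted this particular disaster and everyone else missed it:

```python
# Toy base-rate calculation with the made-up rates from the text.
# Assumption (for illustration): clear thinkers (0.001% of people) and
# habitual pessimists (0.01%) all predicted the disaster; nobody else did.
p_clear = 0.00001      # fraction of population with genuine insight
p_pessimist = 0.0001   # fraction who always predict the obvious disaster

# Among people who called the disaster correctly, the share with genuine
# insight is just the ratio of the two base rates:
p_good_given_correct = p_clear / (p_clear + p_pessimist)
print(f"P(genuine insight | correct prediction) = {p_good_given_correct:.1%}")
# Roughly 9%: ten-to-one odds that any given successful predictor is a crank.
```

Under these assumptions the correct prediction moves you from 1-in-100,000 odds to about 1-in-11, which is a huge update, yet still leaves the cranks dominating.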

Obviously the 0.001% and 0.01% numbers above are made up; I spent some time looking for real statistics and couldn’t find anything useful. This article claims roughly 1% of Americans are “preppers”, which might be a good indication, except it provides no source and could equally well just be the lizardman constant. Regardless, my point relies mainly on the second group being an order of magnitude or more larger than the first, which seems (to me) fairly intuitively likely to be true. If anybody has real statistics to prove or disprove this, they would be much appreciated.

• There’s also an element of “past performance is not a guarantee of future results”. It’s possible that someone correctly and confidently predicted one thing for exactly the right reasons, and then confidently made an error on the next thing for almost exactly the right reasons.

Likely, even, because the people who are confident about hard questions are more likely to be overconfident than to have superpowers.

• If randomness/noise is a factor, there is also regression to the mean when the luck disappears on the following rounds.

• Wei Dai didn’t choose the people he referred to for being right in hindsight, but because they sounded sensible at the time. Sensible enough to follow before we had hindsight knowledge of how the corona situation would evolve.

• Maybe I misunderstand how this works, but if a correct prediction is made by 1 genius and 100 cranks, making this prediction should still be treated as a smart thing. Because:

• punishing the right answer just feels wrong;

• you are not supposed to perfectly distinguish between geniuses and cranks based on one prediction;

• if you evaluate many different predictions, then the crank will randomly succeed at one and fail at a hundred, resulting in a negative score, while the genius will succeed at many and fail at a few, resulting in a positive score, so now everything works as expected.

It seems like a base-rate fallacy. Assuming that geniuses are generally better at predictions than cranks, the explanation why the difficult correct prediction was made by 1 genius and 100 cranks is that the population contains maybe 10 geniuses and 100,000 cranks, and on a specific hard question, the genius has a 10% chance of success by thinking hard, and the crank has a 0.1% chance of success by choosing a random thing to believe.

But this means that awarding the “correctness point” to the 1 genius and 100 cranks is okay in the long term, because the genius will keep collecting points, but for the crank it was the only point earned in a long time.
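The arithmetic in this comment can be sketched directly, using the comment’s own illustrative numbers:

```python
# Base-rate sketch: 10 geniuses vs. 100,000 cranks, per the comment above.
n_genius, n_crank = 10, 100_000
p_genius, p_crank = 0.10, 0.001  # per-question success rates on a hard question

exp_genius_correct = n_genius * p_genius  # expect ~1 genius to get it right
exp_crank_correct = n_crank * p_crank     # expect ~100 cranks to get it right

posterior = exp_genius_correct / (exp_genius_correct + exp_crank_correct)
print(f"expected correct: {exp_genius_correct:.0f} genius, {exp_crank_correct:.0f} cranks")
print(f"P(genius | correct) = {posterior:.1%}")
# About 1%, even though a genius is 100x more likely than a crank to get it right.
```

The 100x per-question advantage is swamped by the 10,000x population disadvantage, which is exactly the base-rate point.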

• I think your understanding is generally correct. The failure case I see is where people say “this problem was really really really hard, instead of one point, I’m going to award one thousand correctness points to everyone who predicted it”, and then end up surprised that most of those people still turn out to be cranks.
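The multi-round scoring idea from this thread can be sanity-checked with a quick simulation. The per-question success rates here are hypothetical, and scoring is the simple +1 per hit, -1 per miss scheme described above:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

ROUNDS = 100
P_GENIUS, P_CRANK = 0.6, 0.01  # hypothetical per-question success rates

def score(p_correct: float, rounds: int = ROUNDS) -> int:
    """Award +1 for each correct prediction and -1 for each miss."""
    return sum(1 if random.random() < p_correct else -1 for _ in range(rounds))

genius_score, crank_score = score(P_GENIUS), score(P_CRANK)
print(f"genius: {genius_score:+d}, crank: {crank_score:+d}")
# Over many rounds the genius trends positive while the crank trends strongly
# negative, even if the crank happened to nail one spectacular call.
```

A single spectacular hit barely dents the crank’s accumulated misses, which is why scoring over many predictions separates the two groups even when one prediction cannot.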

• You can filter out some of the cranks by checking the forecaster’s reasoning, data, credentials, and track record, by looking for a consensus of similarly-qualified people, and by taking the incentives of the forecasters into account. But this comes with its own problems:

To a non-expert, it’s hard to tell to what degree an expert’s area of specialization overlaps with the question at hand. Is a hospital administrator a trustworthy source of guidance on the risk that a novel coronavirus turns into a pandemic?

To a non-expert, easy questions look hard, and hard questions sometimes look easy. Can we distinguish between the two?

To a non-expert, it’s hard to tell whether an expert consensus is really what it seems, or whether it’s coalition-building by a political faction under the cloak of “objectivity.”

These are just a few examples.

In the end, you have to decide whether it’s easier to check the forecaster’s reasoning or their trustworthiness.

• It seems like a better model to say the strategies are:

1. Does (respectable source) say I should panic?

2. Researching a thing to see how serious it is.

So it’s less “pessimism” and more people trying to ring the alarm earlier. Your perceived high-quality sources may have good track records, but if they’re all correlated (possibly from talking to each other and reaching a consensus), then looking at an independent source gives more information than looking at another one of them.

Are earlier alarms useful? Yes, but they go off more often, so more filtering is needed (since it hasn’t been done beforehand to the same standards).