Raising the forecasting waterline (part 1)

Previously: Raising the waterline; see also: 1001 PredictionBook Nights (LW copy), Techniques for probability estimates

Low waterlines imply that it’s relatively easy for a novice to outperform the competition. (In poker, as discussed in Nate Silver’s book, the “fish” are those who can’t master basic techniques such as folding when they have a poor hand, or calculating even roughly the expected value of a pot.) Does this apply to the domain of making predictions? It’s early days, but it looks as if a smallish set of tools—a conscious status quo bias, respecting probability axioms when considering alternatives, considering reference classes, leaving yourself a line of retreat, detaching from sunk costs, and a few more—can at least place you in a good position.

A bit of backstory

Like perhaps many LessWrongers, my first encounter with the notion of calibrated confidence was “A Technical Explanation of Technical Explanation”. My first serious stab at publicly expressing my own beliefs as quantified probabilities was the Amanda Knox case—an eye-opener, waking me up to how everyday opinions could correspond to degrees of certainty, and how these had consequences. By the following year, I was trying to improve my calibration for work-related purposes, and playing with various Web sites, like PredictionBook or Guessum (now defunct).

Then the Good Judgment Project was announced on Less Wrong. Like several of us, I applied, unexpectedly got in, and started taking forecasting more seriously. (I tend to apply myself somewhat better to learning when there is a competitive element—not an attitude I’m particularly proud of, but being aware of that is useful.)

The GJP is both a contest and an experimental study, in fact a group of related studies: several distinct groups of researchers (1,2,3,4) are being funded by IARPA to each run their own experimental program. Within each, small or large numbers of participants have been recruited, allocated to different experimental conditions, and encouraged to compete with each other (or even, as far as I know, for some experimental conditions, to collaborate with each other). The goal is to make predictions about “world events”—and if possible to get them more right, collectively, than we would individually.1

Tool 1: Favor the status quo

The first hint I got that my approach to forecasting needed more explicit thinking tools was a blog post by Paul Hewitt I came across late in the first season. My scores in that period (summer 2011 to spring 2012) had been decent but not fantastic; I ended up 5th on my team, which itself placed quite modestly in the contest.

Hewitt pointed out that in general, you could do better than most other forecasters by favoring the status quo outcome.2 This may not quite be on the same order of effectiveness as the poker advice to “err on the side of folding mediocre hands more often”, but it makes a lot of sense, at least for the Good Judgment Project (and possibly for many of the questions we might worry about). Many of the GJP questions refer to possibilities that loom large in the media at a given time, that are highly available—in the sense of the availability heuristic. This results in a tendency to favor forecasts of change from the status quo.

For instance, one of the Season 1 questions was “Will Marine Le Pen cease to be a candidate for President of France before 10 April 2012?” (also on PredictionBook). Just because the question is being asked doesn’t mean that you should assign “yes” and “no” equal probabilities of 50%, or even close to 50%, any more than you should assign 50% to the proposition “I will win the lottery”.

Rather, you might start from a relatively low prior probability that anyone who undertakes something as significant as a bid for the national presidency would throw in the towel before the contest even starts. Then, try to find evidence that positively favors a change. In this particular case, there was such evidence: the National Front, of which she was the candidate, consistently reported difficulties rounding up the endorsements required to register a candidate legally. However, only once in the past (1981) had this resulted in their candidate being barred (admittedly a very small sample). It would have been a mistake to weigh that evidence excessively. (I got a good score on that question, compared to the team, but definitely owing to a “home ground advantage” as a French citizen rather than my superior forecasting skills.)
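To make the arithmetic of that update concrete, here is a minimal sketch in odds form. The prior and likelihood ratio below are invented for illustration; the original forecast used no explicit numbers.

```python
# A minimal sketch of "low prior, then update on evidence of change".
# Both numbers below are assumptions for illustration only.

def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Update a probability given a likelihood ratio P(E | A) / P(E | not-A)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Assumed base rate: serious presidential bids are rarely abandoned early.
prior_withdrawal = 0.05

# Evidence: reported trouble gathering endorsements. Suppose, illustratively,
# such reports are 3x as likely in worlds where the candidacy actually fails.
posterior = bayes_update(prior_withdrawal, likelihood_ratio=3.0)
print(f"posterior: {posterior:.2f}")  # ~0.14, still well below 50%
```

Even with evidence that genuinely favors a change, the posterior stays far from 50%; that is the status quo prior doing its work.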

Tool 2: Flip the question around

The next technique I try to apply consistently is respecting the axioms of probability. If the probability of event A is 70%, then the probability of not-A is 30%.

This may strike everyone as obvious… it’s not. In Season 2, several of my team-mates are on record as assigning a 75% probability to the proposition “The number of registered Syrian conflict refugees reported by the UNHCR will exceed 250,000 at any point before 1 April 2013”.

That number was reached today, six months in advance of the deadline. This was clear as early as August. The trend in the past few months has been an increase of 1,000 to 2,000 a day, and the UNHCR have recently provided estimates that this number will eventually reach 700,000. The kicker is that this number is only the count of people who are fully processed by the UNHCR administration and officially in their database; there are tens of thousands more in the camps who only have “appointments to be registered”.

I’ve been finding it hard to understand why my team-mates haven’t been updating to, if not 100%, at least 99%, and why those aren’t seen as the only answers worth considering. At any point in the past few weeks, to state your probability as 85% or 91% (as some have quite recently) was to say, “There is still a one in ten chance that the Syrian conflict will suddenly stop and all these people will go home, maybe next week.”

This is kind of like saying “There is a one in ten chance Santa Claus will be the one distributing the presents this year.” It feels like a huge “clack”.

I can only speculate as to what’s going on there. Queried for a probability, people are translating something like “Sure, A is happening” into a biggish number, and reporting that. They are totally failing to flip the question around and explicitly consider what it would take for not-A to happen. (Perhaps, too, people have been so strongly cautioned, by Tetlock and others, against overconfidence that they reflexively shy away from the extreme numbers.)

Just because you’re expressing beliefs as percentages doesn’t mean that you are automatically applying the axioms of probability. Just because you use “75%” as a shorthand for “I’m pretty sure” doesn’t mean you are thinking probabilistically; you must train the skill of seeing that for some events the complement, “25%”, also counts as “I’m pretty sure”. The axioms are more important than the use of numbers—in fact, for this sort of forecast “91%” strikes me as needlessly precise; increments of 5% are more than enough, away from the extremes.
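A proper scoring rule makes the cost of this timidity concrete. GJP-style tournaments score forecasts with a Brier-type rule; the comparison below is my own illustration using the standard two-outcome Brier score, not a figure from the tournament.

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Two-outcome Brier score: 0 is perfect, 2 is maximally wrong.
    `forecast` is the probability assigned to "yes"; `outcome` is 1 or 0."""
    return (forecast - outcome) ** 2 + ((1 - forecast) - (1 - outcome)) ** 2

# If the refugee count does exceed the threshold (outcome = 1):
for p in (0.75, 0.85, 0.91, 0.99):
    print(f"forecast {p:.0%}: Brier score {brier_score(p, 1):.4f}")
# 75% -> 0.1250, 85% -> 0.0450, 91% -> 0.0162, 99% -> 0.0002
```

When the event is all but certain, each step toward the extreme cuts the penalty by a large factor; the skill lies in recognizing when the evidence actually licenses the extreme number.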

Tool 3: Reference class forecasting

The order in which I’m discussing these “basics of forecasting” reflects not so much their importance as the order in which I tend to run through them when encountering a new question. (This might not be the optimal order, or even very good—but that should matter little if the waterline is indeed low.)

Using reference classes was actually part of the “training package” of the GJP. From the linked post comes the warning that “deciding what’s the proper reference class is not straightforward”. And in fact, this tool only applies in some cases, not systematically. One of our recently closed questions was “Will any government force gain control of the Somali town of Kismayo before 1 November 2012?”. Clearly, you could spend quite a while trying to figure out an appropriate reference class here. (In fact, this question also stands as a counter-example to the “Favor the status quo” tool, and flipping the question around might not have been too useful either. All these tools require some discrimination.)

On the other hand, it came in rather handy in assessing the short-term question we got in late September: “What change will occur in the FAO Food Price Index during September 2012?”—with barely two weeks to go before the FAO was to post the updated index in early October. More generally, it’s a useful tool when you’re asked to make predictions regarding a numerical indicator for which you can observe past data.

The FAO price data can be retrieved as a spreadsheet (.xls download). Our forecast question divided the outcomes into five: A) an increase of 3% or more, B) an increase of less than 3%, C) a decrease of less than 3%, D) a decrease of more than 3%, E) “no change”—meaning a change too small to alter the value rounded to the nearest integer.

It’s not clear from the chart that there is any consistent seasonal variation. A change of 3% would have been about 6.4 points; since August 2011 there had been four month-on-month changes of that magnitude, 3 decreases and 1 increase. Based on that reference class, the probability of a small change (B+C+E) came out to about 2/3. The probability for “no change” (E) came to about 1/12, since the August price was the same as the July price, the only flat month in the sample. The probability for an increase (A+B) was roughly the same as for a decrease (C+D). My first-cut forecast allocated the probability mass as follows: 15/30/30/15/10.
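As a sketch of that counting exercise in code: the index values below are invented stand-ins (the real figures come from the FAO spreadsheet above), but the method, bucketing past month-on-month moves into the question’s five outcomes, is the one just described.

```python
from collections import Counter

# Invented stand-in values for the monthly index; the real figures would
# come from the FAO spreadsheet linked above.
monthly_index = [234.0, 226.5, 228.4, 221.0, 221.0, 223.5,
                 215.2, 216.9, 215.1, 210.1, 204.2, 212.8]

def bucket(prev: float, curr: float, threshold: float = 0.03) -> str:
    """Classify a month-on-month change into the question's five outcomes."""
    change = (curr - prev) / prev
    if round(curr) == round(prev):
        return "E (no change)"   # too small to move the rounded value
    if change >= threshold:
        return "A (+3% or more)"
    if change > 0:
        return "B (up, under 3%)"
    if change > -threshold:
        return "C (down, under 3%)"
    return "D (-3% or more)"

counts = Counter(bucket(a, b) for a, b in zip(monthly_index, monthly_index[1:]))
n = len(monthly_index) - 1
for outcome, k in sorted(counts.items()):
    print(f"{outcome}: {k}/{n} = {k/n:.0%}")
```

The printed frequencies are a first-cut forecast, to be nudged by whatever case-specific evidence you have, as the next paragraph does.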

However, I figured I did need to apply a correction, based on reports of a drought in the US that could lead to some food shortages. I took 10% probability mass from the “decrease” outcomes and allocated it to the “increase” outcomes. My final forecast was 20/35/25/10/10. I didn’t mess around with it any more than that. As it turned out, the actual outcome was B! My score was bettered by only 3 forecasters, out of a total of 9.

Next up: lines of retreat, ditching sunk costs, loss functions

This post has grown long enough, and I still have 3+ tools I want to cover. Stay tuned for Part 2!


1 The GJP is being run by Phil Tetlock, known for his “hedgehog and fox” analysis of forecasting. At that time I wasn’t aware of the competing groups—one of them, DAGGRE, is run by Robin Hanson (of OB fame) among others, which might have made it an appealing alternate choice if I’d known about it.

2 Unfortunately, the experimental condition Paul belonged to used a prediction market where forecasters “bet” virtual money on predictions; this makes it hard to translate the numbers he provides into probabilities. The general point is still interesting.