Statistical Prediction Rules Out-Perform Expert Human Judgments

A parole board considers the release of a prisoner: Will he be violent again? A hiring officer considers a job candidate: Will she be a valuable asset to the company? A young couple considers marriage: Will they have a happy marriage?

The cached wisdom for making such high-stakes predictions is to have experts gather as much evidence as possible, weigh this evidence, and make a judgment. But 60 years of research has shown that in hundreds of cases, a simple formula called a statistical prediction rule (SPR) makes better predictions than leading experts do. Or, more exactly:

When based on the same evidence, the predictions of SPRs are at least as reliable as, and are typically more reliable than, the predictions of human experts for problems of social prediction.1

For example, one SPR developed in 1995 predicts the price of mature Bordeaux red wines at auction better than expert wine tasters do. Reaction from the wine-tasting industry to such wine-predicting SPRs has been “somewhere between violent and hysterical.”

How does the SPR work? This particular SPR is called a proper linear model, which has the form:

P = w1(c1) + w2(c2) + w3(c3) + … + wn(cn)

The model calculates the summed result P, which aims to predict a target property such as wine price, on the basis of a series of cues. Above, cn is the value of the nth cue, and wn is the weight assigned to the nth cue.2

In the wine-predicting SPR, c1 reflects the age of the vintage, and other cues reflect relevant climatic features where the grapes were grown. The weights for the cues were assigned on the basis of a comparison of these cues to a large set of data on past market prices for mature Bordeaux wines.3
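As a rough illustration of the form above, here is a minimal Python sketch of a proper linear model. Everything specific in it is an assumption made for the example: the function names predict and fit_weights, the three cues (a constant, vintage age, growing-season temperature), the numbers, and the use of ordinary least squares to choose the weights. It is not the actual wine model or its data.

```python
from typing import List, Sequence


def predict(weights: Sequence[float], cues: Sequence[float]) -> float:
    """Return P = w1*c1 + w2*c2 + ... + wn*cn."""
    if len(weights) != len(cues):
        raise ValueError("need exactly one weight per cue")
    return sum(w * c for w, c in zip(weights, cues))


def fit_weights(rows: List[List[float]], targets: List[float]) -> List[float]:
    """Choose weights by ordinary least squares (an assumed fitting method).

    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan
    elimination with partial pivoting.
    """
    n = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("cues are linearly dependent; weights are not unique")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(n):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [x - factor * y for x, y in zip(a[k], a[col])]
                b[k] -= factor * b[col]
    return [b[i] / a[i][i] for i in range(n)]


# Made-up past cases: [constant 1.0, vintage age, growing-season temperature].
rows = [
    [1.0, 10.0, 17.0],
    [1.0, 20.0, 16.5],
    [1.0, 30.0, 17.5],
    [1.0, 15.0, 16.0],
    [1.0, 25.0, 18.0],
]
# Made-up "true" weights, used only to generate noise-free toy prices.
true_weights = [2.0, 0.5, 1.5]
targets = [predict(true_weights, r) for r in rows]

fitted = fit_weights(rows, targets)
print("fitted weights:", [round(w, 4) for w in fitted])
print("prediction for a new case:", round(predict(fitted, [1.0, 12.0, 17.2]), 4))
```

Because the toy prices contain no noise, the fitted weights come back essentially equal to the ones used to generate them. With real data the fit would only approximate the past prices, and the rule would be judged by how well it predicts cases it was not fitted on.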

There are other ways to construct SPRs, but rather than survey these details, I will instead survey the incredible success of SPRs.

  • Howard and Dawes (1976) found they can reliably predict marital happiness with one of the simplest SPRs ever conceived, using only two cues: P = [rate of lovemaking] - [rate of fighting]. The reliability of this SPR was confirmed by Edwards & Edwards (1977) and by Thornton (1977).

  • Unstructured interviews reliably degrade the decisions of gatekeepers (e.g. hiring and admissions officers, parole boards, etc.). Gatekeepers (and SPRs) make better decisions on the basis of dossiers alone than on the basis of dossiers and unstructured interviews (Bloom and Brundage 1947; DeVaul et al. 1957; Oskamp 1965; Milstein et al. 1981; Hunter & Hunter 1984; Wiesner & Cronshaw 1988). If you’re hiring, you’re probably better off not doing unstructured interviews.

  • Wittman (1941) constructed an SPR that predicted the success of electroshock therapy for patients more reliably than the medical or psychological staff.

  • Carroll et al. (1988) found an SPR that predicts criminal recidivism better than expert criminologists.

  • An SPR constructed by Goldberg (1968) did a better job of diagnosing patients as neurotic or psychotic than did trained clinical psychologists.

  • SPRs regularly predict academic performance better than admissions officers, whether for medical schools (DeVaul et al. 1957), law schools (Swets, Dawes and Monahan 2000), or graduate school in psychology (Dawes 1971).

  • SPRs predict loan and credit risk better than bank officers (Stillwell et al. 1983).

  • SPRs predict newborns at risk for Sudden Infant Death Syndrome better than human experts do (Lowry 1975; Carpenter et al. 1977; Golding et al. 1985).

  • SPRs are better at predicting who is prone to violence than are forensic psychologists (Faust & Ziskin 1988).

  • Libby (1976) found a simple SPR that predicted firm bankruptcy better than experienced loan officers.

And that is barely scratching the surface.
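The two-cue marital happiness rule in the list above is simple enough to write out directly. The sketch below assumes both rates are counted over the same period (say, events per week). The function name and the three couples are invented for illustration and are not data from Howard and Dawes.

```python
def marital_happiness_score(lovemaking_rate: float, fighting_rate: float) -> float:
    """P = [rate of lovemaking] - [rate of fighting].

    This is a linear model whose two weights are simply +1 and -1.
    """
    return lovemaking_rate - fighting_rate


# Hypothetical couples: (lovemaking per week, fights per week).
couples = {
    "couple A": (3, 1),
    "couple B": (1, 4),
    "couple C": (2, 2),
}

scores = {
    name: marital_happiness_score(love, fight)
    for name, (love, fight) in couples.items()
}

# Rank couples from highest to lowest score.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: P = {score:+}")
```

Under this rule a higher P is read as a prediction of a happier marriage. Nothing is fitted to data here, which is part of what makes it one of the simplest SPRs.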

If this is not amazing enough, consider the fact that even when experts are given the results of SPRs, they still can’t outperform those SPRs (Leli & Filskov 1984; Goldberg 1968).

So why aren’t SPRs in use everywhere? Probably, suggest Bishop & Trout, we deny or ignore the success of SPRs because of deep-seated cognitive biases, such as overconfidence in our own judgments. But if these SPRs work as well as or better than human judgments, shouldn’t we use them?

Robyn Dawes (2002) drew out the normative implications of such studies:

If a well-validated SPR that is superior to professional judgment exists in a relevant decision making context, professionals should use it, totally absenting themselves from the prediction.

Sometimes, being rational is easy. When there exists a reliable statistical prediction rule for the problem you’re considering, you need not waste your brain power trying to make a careful judgment. Just take an outside view and use the damn SPR.4

Notes

1 Bishop & Trout, Epistemology and the Psychology of Human Judgment, p. 27. The definitive case for this claim is made in a 1996 study by Grove & Meehl that surveyed 136 studies yielding 617 comparisons between the judgments of human experts and SPRs (in which humans and SPRs made predictions about the same cases and the SPRs never had more information than the humans). Grove & Meehl found that of the 136 studies, 64 favored the SPR, 64 showed roughly equal accuracy, and 8 favored human judgment. Since these last 8 studies “do not form a pocket of predictive excellence in which [experts] could profitably specialize,” Grove and Meehl speculated that these 8 outliers may be due to random sampling error.

2 Readers of Less Wrong may recognize SPRs as a relatively simple type of expert system.

3 But, see Anatoly_Vorobey’s fine objections.

4 There are occasional exceptions, usually referred to as “broken leg” cases. Suppose an SPR reliably predicts an individual’s movie attendance, but then you learn he has a broken leg. In this case it may be wise to abandon the SPR. The problem is that there is no general rule for when experts should abandon the SPR. When they are allowed to do so, they abandon the SPR far too frequently, and thus would have been better off sticking strictly to the SPR, even for legitimate “broken leg” instances (Goldberg 1968; Sawyer 1966; Leli and Filskov 1984).

References

Bloom & Brundage (1947). “Predictions of Success in Elementary School for Enlisted Personnel”, Personnel Research and Test Development in the Bureau of Naval Personnel, ed. D.B. Stuit, 233-61. Princeton: Princeton University Press.

Carpenter, Gardner, McWeeny, & Emery (1977). “Multistage scoring system for identifying infants at risk of unexpected death”, Arch. Dis. Childh., 53: 606-612.

Carroll, Winer, Coates, Galegher, & Alibrio (1988). “Evaluation, Diagnosis, and Prediction in Parole Decision-Making”, Law and Society Review, 17: 199-228.

Dawes (1971). “A Case Study of Graduate Admissions: Applications of Three Principles of Human Decision-Making”, American Psychologist, 26: 180-88.

Dawes (2002). “The Ethics of Using or Not Using Statistical Prediction Rules in Psychological Practice and Related Consulting Activities”, Philosophy of Science, 69: S178-S184.

DeVaul, Jervey, Chappell, Carver, Short, & O’Keefe (1957). “Medical School Performance of Initially Rejected Students”, Journal of the American Medical Association, 257: 47-51.

Faust & Ziskin (1988). “The expert witness in psychology and psychiatry”, Science, 241: 1143-1144.

Goldberg (1968). “Simple Models or Simple Processes? Some Research on Clinical Judgments”, American Psychologist, 23: 483-96.

Golding, Limerick, & MacFarlane (1985). Sudden Infant Death. Somerset: Open Books.

Edwards & Edwards (1977). “Marriage: Direct and Continuous Measurement”, Bulletin of the Psychonomic Society, 10: 187-88.

Howard & Dawes (1976). “Linear Prediction of Marital Happiness”, Personality and Social Psychology Bulletin, 2: 478-80.

Hunter & Hunter (1984). “Validity and utility of alternate predictors of job performance”, Psychological Bulletin, 96: 72-98.

Leli & Filskov (1984). “Clinical Detection of Intellectual Deterioration Associated with Brain Damage”, Journal of Clinical Psychology, 40: 1435-1441.

Libby (1976). “Man versus model of man: Some conflicting evidence”, Organizational Behavior and Human Performance, 16: 1-12.

Lowry (1975). “The identification of infants at high risk of early death”, Med. Stats. Report, London School of Hygiene and Tropical Medicine.

Milstein, Wilkinson, Burrow, & Kessen (1981). “Admission Decisions and Performance during Medical School”, Journal of Medical Education, 56: 77-82.

Oskamp (1965). “Overconfidence in Case Study Judgments”, Journal of Consulting Psychology, 63: 81-97.

Sawyer (1966). “Measurement and Prediction, Clinical and Statistical”, Psychological Bulletin, 66: 178-200.

Stillwell, Barron, & Edwards (1983). “Evaluating Credit Applications: A Validation of Multiattribute Utility Weight Elicitation Techniques”, Organizational Behavior and Human Performance, 32: 87-108.

Swets, Dawes, & Monahan (2000). “Psychological Science Can Improve Diagnostic Decisions”, Psychological Science in the Public Interest, 1: 1-26.

Thornton (1977). “Linear Prediction of Marital Happiness: A Replication”, Personality and Social Psychology Bulletin, 3: 674-76.

Wiesner & Cronshaw (1988). “A meta-analytic investigation of the impact of interview format and degree of structure on the validity of the employment interview”, Journal of Applied Psychology, 61: 275-290.

Wittman (1941). “A Scale for Measuring Prognosis in Schizophrenic Patients”, Elgin Papers 4: 20-33.