Statistical Prediction Rules Out-Perform Expert Human Judgments
A parole board considers the release of a prisoner: Will he be violent again? A hiring officer considers a job candidate: Will she be a valuable asset to the company? A young couple considers marriage: Will they have a happy marriage?
The cached wisdom for making such high-stakes predictions is to have experts gather as much evidence as possible, weigh this evidence, and make a judgment. But 60 years of research has shown that in hundreds of cases, a simple formula called a statistical prediction rule (SPR) makes better predictions than leading experts do. Or, more exactly:
When based on the same evidence, the predictions of SPRs are at least as reliable as, and are typically more reliable than, the predictions of human experts for problems of social prediction.1
For example, one SPR developed in 1995 predicts the price of mature Bordeaux red wines at auction better than expert wine tasters do. Reaction from the wine-tasting industry to such wine-predicting SPRs has been “somewhere between violent and hysterical.”
How does the SPR work? This particular SPR is called a proper linear model, which has the form:
P = w1(c1) + w2(c2) + w3(c3) + … + wn(cn)
The model calculates the summed result P, which aims to predict a target property such as wine price, on the basis of a series of cues. Above, cn is the value of the nth cue, and wn is the weight assigned to the nth cue.2
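To make the formula concrete, here is a minimal sketch of the weighted sum in Python. The function name `linear_spr` and the example numbers are made up for illustration; they do not come from any study discussed here.

```python
from typing import Sequence


def linear_spr(weights: Sequence[float], cues: Sequence[float]) -> float:
    """Return P = w1*c1 + w2*c2 + ... + wn*cn."""
    if len(weights) != len(cues):
        raise ValueError("need exactly one weight per cue")
    return sum(w * c for w, c in zip(weights, cues))


# Hypothetical weights and cue values, chosen only to show the arithmetic.
example_weights = [0.5, -2.0, 1.0]
example_cues = [10.0, 1.5, 4.0]

# 0.5*10.0 + (-2.0)*1.5 + 1.0*4.0 = 5.0 - 3.0 + 4.0 = 6.0
P = linear_spr(example_weights, example_cues)
```

Once the weights are fixed, making a prediction is nothing more than this sum. All of the work lies in choosing the cues and the weights.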
In the wine-predicting SPR, c1 reflects the age of the vintage, and other cues reflect relevant climatic features where the grapes were grown. The weights for the cues were assigned on the basis of a comparison of these cues to a large set of data on past market prices for mature Bordeaux wines.3
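One common way to assign weights from past data is ordinary least squares. The sketch below assumes that approach. I am not claiming it reproduces the wine study's actual procedure, and the function name `fit_weights`, the cue columns, and every number in it are invented for illustration.

```python
from typing import List


def fit_weights(cue_rows: List[List[float]], targets: List[float]) -> List[float]:
    """Least-squares weights from the normal equations (X^T X) w = X^T y."""
    n = len(cue_rows[0])
    # Augmented matrix [X^T X | X^T y], one row per cue.
    a = [
        [sum(row[i] * row[j] for row in cue_rows) for j in range(n)]
        + [sum(row[i] * y for row, y in zip(cue_rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("cues are linearly dependent; weights are not unique")
        a[col], a[pivot] = a[pivot], a[col]
        scale = a[col][col]
        a[col] = [v / scale for v in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [rv - factor * cv for rv, cv in zip(a[r], a[col])]
    return [a[i][n] for i in range(n)]


# Invented data: each row is [constant 1.0, vintage age, a climate cue].
# The prices were generated from the weights [2.0, 0.5, -1.0] with no noise,
# so the fit should recover approximately those values.
past_cues = [
    [1.0, 10.0, 3.0],
    [1.0, 20.0, 1.0],
    [1.0, 15.0, 4.0],
    [1.0, 30.0, 2.0],
    [1.0, 25.0, 5.0],
]
past_prices = [4.0, 11.0, 5.5, 15.0, 9.5]

weights = fit_weights(past_cues, past_prices)
```

Real price data is noisy, so fitted weights would only approximate any underlying relationship. In practice one would also test the rule on cases that were not used to fit it.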
There are other ways to construct SPRs, but rather than survey these details, I will instead survey the incredible success of SPRs.
Howard and Dawes (1976) found they can reliably predict marital happiness with one of the simplest SPRs ever conceived, using only two cues: P = [rate of lovemaking] - [rate of fighting]. The reliability of this SPR was confirmed by Edwards & Edwards (1977) and by Thornton (1977).
Unstructured interviews reliably degrade the decisions of gatekeepers (e.g. hiring and admissions officers, parole boards). Gatekeepers (and SPRs) make better decisions on the basis of dossiers alone than on the basis of dossiers and unstructured interviews (Bloom & Brundage 1947; DeVaul et al. 1957; Oskamp 1965; Milstein et al. 1981; Hunter & Hunter 1984; Wiesner & Cronshaw 1988). If you’re hiring, you’re probably better off not doing unstructured interviews.
Wittman (1941) constructed an SPR that predicted the success of electroshock therapy for patients more reliably than the medical or psychological staff.
Carroll et al. (1988) found an SPR that predicts criminal recidivism better than expert criminologists.
An SPR constructed by Goldberg (1968) did a better job of diagnosing patients as neurotic or psychotic than did trained clinical psychologists.
SPRs regularly predict academic performance better than admissions officers, whether for medical schools (DeVaul et al. 1957), law schools (Swets, Dawes, & Monahan 2000), or graduate school in psychology (Dawes 1971).
SPRs predict loan and credit risk better than bank officers (Stillwell et al. 1983).
SPRs predict newborns at risk for Sudden Infant Death Syndrome better than human experts do (Lowry 1975; Carpenter et al. 1977; Golding et al. 1985).
SPRs are better at predicting who is prone to violence than are forensic psychologists (Faust & Ziskin 1988).
Libby (1976) found a simple SPR that predicted firm bankruptcy better than experienced loan officers.
And that is barely scratching the surface.
If this is not amazing enough, consider the fact that even when experts are given the results of SPRs, they still can’t outperform those SPRs (Leli & Filskov 1984; Goldberg 1968).
So why aren’t SPRs in use everywhere? Probably, suggest Bishop & Trout, we deny or ignore the success of SPRs because of deep-seated cognitive biases, such as overconfidence in our own judgments. But if these SPRs work as well as or better than human judgments, shouldn’t we use them?
Robyn Dawes (2002) drew out the normative implications of such studies:
If a well-validated SPR that is superior to professional judgment exists in a relevant decision making context, professionals should use it, totally absenting themselves from the prediction.
Sometimes, being rational is easy. When there exists a reliable statistical prediction rule for the problem you’re considering, you need not waste your brain power trying to make a careful judgment. Just take an outside view and use the damn SPR.4
Chapter 2 of Bishop & Trout, Epistemology and the Psychology of Human Judgment
Chapter 40 of (eds.) Gilovich, Griffin, & Kahneman, Heuristics and Biases: The Psychology of Intuitive Judgment
1 Bishop & Trout, Epistemology and the Psychology of Human Judgment, p. 27. The definitive case for this claim is made in a 1996 study by Grove & Meehl that surveyed 136 studies yielding 617 comparisons between the judgments of human experts and SPRs (in which humans and SPRs made predictions about the same cases and the SPRs never had more information than the humans). Grove & Meehl found that of the 136 studies, 64 favored the SPR, 64 showed roughly equal accuracy, and 8 favored human judgment. Since these last 8 studies “do not form a pocket of predictive excellence in which [experts] could profitably specialize,” Grove & Meehl speculated that these 8 outliers may be due to random sampling error.
2 Readers of Less Wrong may recognize SPRs as a relatively simple type of expert system.
3 But, see Anatoly_Vorobey’s fine objections.
4 There are occasional exceptions, usually referred to as “broken leg” cases. Suppose an SPR reliably predicts an individual’s movie attendance, but then you learn he has a broken leg. In this case it may be wise to abandon the SPR. The problem is that there is no general rule for when experts should abandon the SPR. When they are allowed to do so, they abandon the SPR far too frequently, and thus would have been better off sticking strictly to the SPR, even for legitimate “broken leg” instances (Goldberg 1968; Sawyer 1966; Leli and Filskov 1984).
Bloom & Brundage (1947). “Predictions of Success in Elementary School for Enlisted Personnel”, Personnel Research and Test Development in the Bureau of Naval Personnel, ed. D.B. Stuit, 233-61. Princeton: Princeton University Press.
Carpenter, Gardner, McWeeny, & Emery (1977). “Multistage scoring system for identifying infants at risk of unexpected death”, Arch. Dis. Childh., 53: 606-612.
Carroll, Winer, Coates, Galegher, & Alibrio (1988). “Evaluation, Diagnosis, and Prediction in Parole Decision-Making”, Law and Society Review, 17: 199-228.
Dawes (1971). “A Case Study of Graduate Admissions: Applications of Three Principles of Human Decision-Making”, American Psychologist, 26: 180-88.
Dawes (2002). “The Ethics of Using or Not Using Statistical Prediction Rules in Psychological Practice and Related Consulting Activities”, Philosophy of Science, 69: S178-S184.
DeVaul, Jervey, Chappell, Carver, Short, & O’Keefe (1957). “Medical School Performance of Initially Rejected Students”, Journal of the American Medical Association, 257: 47-51.
Faust & Ziskin (1988). “The expert witness in psychology and psychiatry”, Science, 241: 1143-1144.
Goldberg (1968). “Simple Models or Simple Processes? Some Research on Clinical Judgments”, American Psychologist, 23: 483-96.
Golding, Limerick, & MacFarlane (1985). Sudden Infant Death. Somerset: Open Books.
Edwards & Edwards (1977). “Marriage: Direct and Continuous Measurement”, Bulletin of the Psychonomic Society, 10: 187-88.
Howard & Dawes (1976). “Linear Prediction of Marital Happiness”, Personality and Social Psychology Bulletin, 2: 478-80.
Hunter & Hunter (1984). “Validity and utility of alternate predictors of job performance”, Psychological Bulletin, 96: 72-98.
Leli & Filskov (1984). “Clinical Detection of Intellectual Deterioration Associated with Brain Damage”, Journal of Clinical Psychology, 40: 1435–1441.
Libby (1976). “Man versus model of man: Some conflicting evidence”, Organizational Behavior and Human Performance, 16: 1-12.
Lowry (1975). “The identification of infants at high risk of early death”, Med. Stats. Report, London School of Hygiene and Tropical Medicine.
Milstein, Wilkinson, Burrow, & Kessen (1981). “Admission Decisions and Performance during Medical School”, Journal of Medical Education, 56: 77-82.
Oskamp (1965). “Overconfidence in Case Study Judgments”, Journal of Consulting Psychology, 63: 81-97.
Sawyer (1966). “Measurement and Prediction, Clinical and Statistical”, Psychological Bulletin, 66: 178-200.
Stillwell, Barron, & Edwards (1983). “Evaluating Credit Applications: A Validation of Multiattribute Utility Weight Elicitation Techniques”, Organizational Behavior and Human Performance, 32: 87-108.
Swets, Dawes, & Monahan (2000). “Psychological Science Can Improve Diagnostic Decisions”, Psychological Science in the Public Interest, 1: 1–26.
Thornton (1977). “Linear Prediction of Marital Happiness: A Replication”, Personality and Social Psychology Bulletin, 3: 674-76.
Wiesner & Cronshaw (1988). “A meta-analytic investigation of the impact of interview format and degree of structure on the validity of the employment interview”, Journal of Occupational Psychology, 61: 275-290.
Wittman (1941). “A Scale for Measuring Prognosis in Schizophrenic Patients”, Elgin Papers 4: 20-33.