# Statistical Prediction Rules Out-Perform Expert Human Judgments

A parole board considers the release of a prisoner: Will he be violent again? A hiring officer considers a job candidate: Will she be a valuable asset to the company? A young couple considers marriage: Will they have a happy marriage?

The cached wisdom for making such high-stakes predictions is to have experts gather as much evidence as possible, weigh this evidence, and make a judgment. But 60 years of research has shown that in hundreds of cases, a simple formula called a statistical prediction rule (SPR) makes better predictions than leading experts do. Or, more exactly:

When based on the same evidence, the predictions of SPRs are at least as reliable as, and are typically more reliable than, the predictions of human experts for problems of social prediction.1

For example, one SPR developed in 1995 predicts the price of mature Bordeaux red wines at auction better than expert wine tasters do. Reaction from the wine-tasting industry to such wine-predicting SPRs has been “somewhere between violent and hysterical.”

How does the SPR work? This particular SPR is called a proper linear model, which has the form:

P = w1(c1) + w2(c2) + w3(c3) + … + wn(cn)

The model calculates the summed result P, which aims to predict a target property such as wine price, on the basis of a series of cues. Above, cn is the value of the nth cue, and wn is the weight assigned to the nth cue.2

In the wine-predicting SPR, c1 reflects the age of the vintage, and other cues reflect relevant climatic features where the grapes were grown. The weights for the cues were assigned on the basis of a comparison of these cues to a large set of data on past market prices for mature Bordeaux wines.3
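As a rough illustration of how a proper linear model's weights are fitted, here is a minimal sketch in Python. The cue values, weights, and outcomes below are all invented for illustration; they are not Ashenfelter's actual wine data or coefficients.

```python
import numpy as np

# Invented example data: 50 past cases, 3 cues each (e.g. vintage age,
# winter rainfall, growing-season temperature). NOT the real wine data.
rng = np.random.default_rng(0)
n = 50
cues = rng.normal(size=(n, 3))
true_w = np.array([2.0, -1.0, 0.5])                    # hidden "true" weights
price = cues @ true_w + rng.normal(scale=0.1, size=n)  # observed past outcomes

# A proper linear model assigns its weights by fitting past outcomes,
# here with ordinary least squares (plus an intercept column).
X = np.column_stack([cues, np.ones(n)])
w, *_ = np.linalg.lstsq(X, price, rcond=None)

# P = w1*c1 + w2*c2 + w3*c3 + intercept, applied to a new case.
new_case = np.array([1.0, 0.0, 2.0, 1.0])
prediction = new_case @ w
```

The "proper" in "proper linear model" is the point of the sketch: the weights come from optimizing against real outcome data, not from anyone's intuition.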

There are other ways to construct SPRs, but rather than survey these details, I will instead survey the incredible success of SPRs.

• Howard and Dawes (1976) found they can reliably predict marital happiness with one of the simplest SPRs ever conceived, using only two cues: P = [rate of lovemaking] - [rate of fighting]. The reliability of this SPR was confirmed by Edwards & Edwards (1977) and by Thornton (1977).

• Unstructured interviews reliably degrade the decisions of gatekeepers (e.g. hiring and admissions officers, parole boards, etc.). Gatekeepers (and SPRs) make better decisions on the basis of dossiers alone than on the basis of dossiers and unstructured interviews (Bloom & Brundage 1947; DeVaul et al. 1957; Oskamp 1965; Milstein et al. 1981; Hunter & Hunter 1984; Wiesner & Cronshaw 1988). If you’re hiring, you’re probably better off not doing interviews.

• Wittman (1941) constructed an SPR that predicted the success of electroshock therapy for patients more reliably than the medical or psychological staff.

• Carroll et al. (1988) found an SPR that predicts criminal recidivism better than expert criminologists.

• An SPR constructed by Goldberg (1968) did a better job of diagnosing patients as neurotic or psychotic than did trained clinical psychologists.

• SPRs regularly predict academic performance better than admissions officers, whether for medical schools (DeVaul et al. 1957), law schools (Swets, Dawes and Monahan 2000), or graduate school in psychology (Dawes 1971).

• SPRs predict loan and credit risk better than bank officers (Stillwell et al. 1983).

• SPRs predict newborns at risk for Sudden Infant Death Syndrome better than human experts do (Lowry 1975; Carpenter et al. 1977; Golding et al. 1985).

• SPRs are better at predicting who is prone to violence than are forensic psychologists (Faust & Ziskin 1988).

• Libby (1976) found a simple SPR that predicted firm bankruptcy better than experienced loan officers.

And that is barely scratching the surface.
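To see how little machinery some of these rules need, the Howard and Dawes marital-happiness rule from the list above can be written in a couple of lines. The example rates below are invented; only the rule itself, P = [rate of lovemaking] - [rate of fighting], comes from the paper.

```python
def marital_spr(lovemakings_per_month, fights_per_month):
    """Two-cue SPR: predicted happiness score is lovemaking rate minus fighting rate."""
    return lovemakings_per_month - fights_per_month

# A positive score predicts a self-described happy marriage, a negative one unhappy.
print(marital_spr(12, 4))   # 8: predicted happy
print(marital_spr(2, 9))    # -7: predicted unhappy
```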

If this is not amazing enough, consider the fact that even when experts are given the results of SPRs, they still can’t outperform those SPRs (Leli & Filskov 1984; Goldberg 1968).

So why aren’t SPRs in use everywhere? Probably, suggest Bishop & Trout, we deny or ignore the success of SPRs because of deep-seated cognitive biases, such as overconfidence in our own judgments. But if these SPRs work as well as or better than human judgments, shouldn’t we use them?

Robyn Dawes (2002) drew out the normative implications of such studies:

If a well-validated SPR that is superior to professional judgment exists in a relevant decision making context, professionals should use it, totally absenting themselves from the prediction.

Sometimes, being rational is easy. When there exists a reliable statistical prediction rule for the problem you’re considering, you need not waste your brain power trying to make a careful judgment. Just take an outside view and use the damn SPR.4

Notes

1 Bishop & Trout, Epistemology and the Psychology of Human Judgment, p. 27. The definitive case for this claim is made in a 1996 study by Grove & Meehl that surveyed 136 studies yielding 617 comparisons between the judgments of human experts and SPRs (in which humans and SPRs made predictions about the same cases and the SPRs never had more information than the humans). Grove & Meehl found that of the 136 studies, 64 favored the SPR, 64 showed roughly equal accuracy, and 8 favored human judgment. Since these last 8 studies “do not form a pocket of predictive excellence in which [experts] could profitably specialize,” Grove and Meehl speculated that these 8 outliers may be due to random sampling error.

2 Readers of Less Wrong may recognize SPRs as a relatively simple type of expert system.

3 But, see Anatoly_Vorobey’s fine objections.

4 There are occasional exceptions, usually referred to as “broken leg” cases. Suppose an SPR reliably predicts an individual’s movie attendance, but then you learn he has a broken leg. In this case it may be wise to abandon the SPR. The problem is that there is no general rule for when experts should abandon the SPR. When they are allowed to do so, they abandon the SPR far too frequently, and thus would have been better off sticking strictly to the SPR, even for legitimate “broken leg” instances (Goldberg 1968; Sawyer 1966; Leli and Filskov 1984).

References

Bloom & Brundage (1947). “Predictions of Success in Elementary School for Enlisted Personnel”, Personnel Research and Test Development in the Bureau of Naval Personnel, ed. D.B. Stuit, 233-61. Princeton: Princeton University Press.

Carpenter, Gardner, McWeeny, & Emery (1977). “Multistage scoring system for identifying infants at risk of unexpected death”, Arch. Dis. Childh., 53: 606-612.

Carroll, Winer, Coates, Galegher, & Alibrio (1988). “Evaluation, Diagnosis, and Prediction in Parole Decision-Making”, Law and Society Review, 17: 199-228.

Dawes (1971). “A Case Study of Graduate Admissions: Applications of Three Principles of Human Decision-Making”, American Psychologist, 26: 180-88.

Dawes (2002). “The Ethics of Using or Not Using Statistical Prediction Rules in Psychological Practice and Related Consulting Activities”, Philosophy of Science, 69: S178-S184.

DeVaul, Jervey, Chappell, Carver, Short, & O’Keefe (1957). “Medical School Performance of Initially Rejected Students”, Journal of the American Medical Association, 257: 47-51.

Edwards & Edwards (1977). “Marriage: Direct and Continuous Measurement”, Bulletin of the Psychonomic Society, 10: 187-88.

Faust & Ziskin (1988). “The expert witness in psychology and psychiatry”, Science, 241: 1143-1144.

Goldberg (1968). “Simple Models or Simple Processes? Some Research on Clinical Judgments”, American Psychologist, 23: 483-96.

Golding, Limerick, & MacFarlane (1985). Sudden Infant Death. Somerset: Open Books.

Howard & Dawes (1976). “Linear Prediction of Marital Happiness”, Personality and Social Psychology Bulletin, 2: 478-80.

Hunter & Hunter (1984). “Validity and utility of alternate predictors of job performance”, Psychological Bulletin, 96: 72-98.

Leli & Filskov (1984). “Clinical Detection of Intellectual Deterioration Associated with Brain Damage”, Journal of Clinical Psychology, 40: 1435-1441.

Libby (1976). “Man versus model of man: Some conflicting evidence”, Organizational Behavior and Human Performance, 16: 1-12.

Lowry (1975). “The identification of infants at high risk of early death”, Med. Stats. Report, London School of Hygiene and Tropical Medicine.

Milstein, Wildkinson, Burrow, & Kessen (1981). “Admission Decisions and Performance during Medical School”, Journal of Medical Education, 56: 77-82.

Oskamp (1965). “Overconfidence in Case Study Judgments”, Journal of Consulting Psychology, 63: 81-97.

Sawyer (1966). “Measurement and Prediction, Clinical and Statistical”, Psychological Bulletin, 66: 178-200.

Stillwell, Barron, & Edwards (1983). “Evaluating Credit Applications: A Validation of Multiattribute Utility Weight Elicitation Techniques”, Organizational Behavior and Human Performance, 32: 87-108.

Swets, Dawes, & Monahan (2000). “Psychological Science Can Improve Diagnostic Decisions”, Psychological Science in the Public Interest, 1: 1-26.

Thornton (1977). “Linear Prediction of Marital Happiness: A Replication”, Personality and Social Psychology Bulletin, 3: 674-76.

Wiesner & Cronshaw (1988). “A meta-analytic investigation of the impact of interview format and degree of structure on the validity of the employment interview”, Journal of Applied Psychology, 61: 275-290.

Wittman (1941). “A Scale for Measuring Prognosis in Schizophrenic Patients”, Elgin Papers 4: 20-33.

• I’m skeptical, and will now proceed to question some of the assertions made/references cited. Note that I’m not trained in statistics.

Unfortunately, most of the articles cited are not easily available. I would have liked to check the methodology of a few more of them.

For example, one SPR developed in 1995 predicts the price of mature Bordeaux red wines at auction better than expert wine tasters do.

The paper doesn’t actually establish what you say it does. There is no statistical analysis of expert wine tasters, only one or two anecdotal statements of their fury at the whole idea. Instead, the SPR is compared to actual market prices, not to experts’ predictions. I think it’s fair to say that the claim I quoted is overreached.

Now, about the model and its fit to data. Note that the SPR is older than 1995, when the paper was published. The NYTimes article about it which you reference is from 1990 (the paper bizarrely dates it to 1995; I’m not sure what’s going on there).

The fact that there’s a linear model, not specified precisely anywhere in the article, which is a good fit to wine prices for vintages of 1961-1972 (Table 3 in the paper) is not, I think, very significant on its own. To judge the model, we should look at what it predicts for upcoming years. Both the paper and the NYTimes article make two specific predictions. First, the 1986 vintage, claimed to be extolled by experts early on, will prove mediocre because of the weather conditions that year (see Figure 3 in the paper; 1986 is clearly the worst of the ’80s). NYTimes says “When the dust settles, he predicts, it will be judged the worst vintage of the 1980′s, and no better than the unmemorable 1974′s or 1969′s”. The 1995 paper says, more modestly, “We should expect that, in due course, the prices of these wines will decline relative to the prices of most of the other vintages of the 1980s”. Second, the 1989-1990 is predicted to be “outstanding” (paper), “stunningly good” (NYTimes), and, “adjusted for age, will outsell at a significant premium the great 1961 vintage” (NYTimes).

It’s now 16 years later. How do we test these predictions?

First, I’ve stumbled on a different paper from the primary author, Prof. Ashenfelter, from 2007. Published 12 years later than the one you reference, this paper has substantially the same contents, with whole pages copied verbatim from the earlier one. That, by itself, worries me. Even more worrying is the fact that the 1986 prediction, prominent in the 1990 article and the 1995 paper, is completely missing from the 2007 paper (the data below might indicate why). And most worrying of all is the change of language regarding the 1989/1990 prediction. The 1995 paper says about its prediction that the 1989/1990 will turn out to be outstanding, “Many wine writers have made the same predictions in the trade magazines”. The 2007 paper says “Ironically, many professional wine writers did not concur with this prediction at the time. In the years that have followed minds have been changed; and there is now virtually unanimous agreement that 1989 and 1990 are two of the outstanding vintages of the last 50 years.”

Uhm. Right. Well, because the claims aren’t strong enough, they do not exactly contradict each other, but this change leaves a bad taste. I don’t think I should give much trust to these papers’ claims.

The data I could find quickly to test the predictions is here. The prices are broken down by the chateaux, by the vintage year, the packaging (I’ve always chosen BT, i.e. bottle), and the auction year (I’ve always chosen the last year available, typically 2004). Unfortunately, Ashenfelter underspecifies how he came up with the aggregate prices for a given year: he says he chose a package of the best 15 wineries, but doesn’t say which ones or how the prices are combined. I used 5 wineries that are specified as the best in the 2007 paper, and looked up the prices for years 1981-1990. The data is in this spreadsheet. I haven’t tried to statistically analyze it, but even from a quick glance, I think the following is clear. 1986 did not stabilize as the worst year of the 1980s. It is frequently second- or third-best of the decade. It is always much better than either 1984 or 1987, which are supposed to be vastly better according to the 1995 paper’s weather data (see Figure 3). 1989/1990 did turn out well, especially 1990. Still, they’re both nearly always less expensive than 1982, which is again vastly inferior in the weather data (it isn’t even in the best quarter). Overall, I fail to see much correlation between the weather data in the paper for the 1980s, the specific claims about 1986 and 1989/1990, and the market prices as of 2004. I wouldn’t recommend using this SPR to predict market prices.
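One simple way to make this kind of quick glance more quantitative would be a rank correlation between the SPR’s weather-based vintage scores and the later auction prices. The sketch below uses invented numbers purely to show the computation; they are not the actual figures from the paper or the spreadsheet.

```python
import numpy as np

# Invented vintage scores and price indices, NOT the real data.
spr_score  = np.array([0.2, 0.6, 0.1, 0.5, 0.9, 1.0])  # model's weather-based score
price_2004 = np.array([1.0, 0.3, 0.7, 0.2, 0.8, 0.9])  # later auction price index

def ranks(x):
    """Rank of each element, 0 = smallest (assumes no ties)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=int)
    r[order] = np.arange(len(x))
    return r

# Spearman rank correlation = Pearson correlation of the ranks.
rho = np.corrcoef(ranks(spr_score), ranks(price_2004))[0, 1]
# A rho near 1 would vindicate the SPR's ranking; a value near 0 would not.
```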

Now, this was the first example in your post, and I found what I believe to be substantial problems with its methodology and the quality of its SPR. If I were to proceed and examine every example you cite in the same detail, would I encounter many such problems? It’s difficult to tell, but my prediction is “yes”. I anticipate overfitting and shoddy methodology. I anticipate huge influence of the selection bias: the authors that publish these kinds of papers will not publish a paper that says “The experts were better than our SPR”. And finally, I anticipate overreaching claims of wide-reaching applicability of the models, based on papers that actually indicate modest effect in a very specific situation with a small sample size.

I’ve looked at your second example:

Howard and Dawes (1976) found they can reliably predict marital happiness with one of the simplest SPRs ever conceived, using only two cues: P = [rate of lovemaking] - [rate of fighting].

I couldn’t find the original paper, but the results are summarised in Dawes (1979). Looking at it, it turns out that when you say “predict marital happiness”, it really means “predicts one of the partners’ subjective opinion of their marital happiness”, as opposed to e.g. stability of the marriage over time. There’s no indication as to how the partner to question was chosen from each pair (e.g. whether the experimenter knew the rate when they chose). There was very good correlation with binary outcome (happy/unhappy), but when a finer scale of 7 degrees of happiness was used, the correlation was weak: a rate of 0.4. In a follow-up experiment, the correlation rate went up to 0.8, but there the subject looked at the lovemaking/fighting statistics before opining on the degree of happiness, thus contaminating their decision. And even in the earlier experiment, the subject had been recording those lovemaking/fighting statistics in the first place, so it would make sense for them to recall those events when they’re asked to assess whether their marriage is a happy one. Overall, the model is witty and naively appears to be useful, but the suspect methodology and the relatively weak correlation encourage me to discount the analysis.

Finally, the following claim is the single most objectionable one in your post, to my taste:

If you’re hiring, you’re probably better off not doing interviews.

My own experience strongly suggests to me that this claim is inane, and is highly dangerous advice. I’m not able to view the papers you base it on, but if they’re anything like the first and second example, they’re far, far away from convincing me of the truth of this claim, which I in any case strongly suspect to overreach gigantically over what the papers are proving. It may be true, for example, that a very large body of hiring decision-makers in a huge organisation or a state on average make poorer decisions based on their professional judgement during interviews than they would have made based purely on the resume. I can see how this claim might be true, because any such very large body must be largely incompetent. But it doesn’t follow that it’s good advice for you to abstain from interviewing: it would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference. My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).

• If you’re hiring, you’re probably better off not doing interviews.

My own experience strongly suggests to me that this claim is inane, and is highly dangerous advice… My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).

The whole point of this article is that experts often think themselves better than SPRs when actually they perform no better than SPRs on average. Here we have an expert telling us that he thinks he would perform better than an SPR. Why should we be interested?

• Because I didn’t just state a blanket opinion. I dug into the studies, looked for data to test one of them in depth, and found it to be highly flawed. I called into question the methodology employed by the studies, as well as the overgeneralizing and overreaching conclusions they’re drummed up to support. The evidence that at least some studies are flawed and the methodology is shoddy should make you question the universal claim “… actually they perform no better than SPRs on average”. That’s why you should be interested.

My personal experience with interviewing is certainly not as important a piece of evidence against the article as the specific criticisms of the studies. It’s just another anecdotal data point. That’s why I didn’t expand on it as much as I did on the wine study, although I do believe it can be made more convincing through further elucidation.

• Cool, I’ll look into these points.

I made one small change so far. The above article now reads: “Reaction from the wine-tasting industry to such wine-predicting SPRs has been ‘somewhere between violent and hysterical.’”

Also, I’ll post links to the specific papers when I have time to visit UCLA and grab them.

Psychology is not my field, but my understanding is that the ‘interview effect’ for unstructured interviews is a very robust finding across many decades. For more, you can listen to my interview with Michael Bishop. But hey, maybe he’s wrong!

Update 1: If I read the 1995 study correctly, they judged the accuracy of wine tasters by comparing the price of immature wines to those of mature wines, but I’m not sure. The way I phrased that is from Bishop & Trout, and that is how Bishop recalls it, though it’s been several years now since he co-wrote Epistemology and the Psychology of Human Judgment.

• My own experience strongly suggests to me that this claim is inane … it would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference.

What evidence do you have that you are better than average?

My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial

“It is difficult to get a man to understand something, when his salary depends upon his not understanding it!”

• I have heard of one job interview that I felt constituted a useful tool that could not effectively be replaced by resume examination and statistical analysis. A friend of mine got a job working for a company that provides mathematical modeling services for other companies, and his “interview” was a several-hour test in which he had to create a number of mathematical models and then explain to the examiner in layman’s terms how and why the models worked.

Most job interviews are really not a demonstration of job skills and aptitude, and it’s possible to simply bullshit your way through them. On the other hand, if you have a simple and direct way to test the competence of your applicants, then by all means use it.

• That isn’t an interview, it’s a test. Tests are extremely useful. IQ tests are an excellent predictor of job performance, maybe the best one available. Regardless, IQ tests are usually de facto illegal in the US due to disparate impact.

• I put interview in quotes because they called it an interview. Speaking broadly enough, all interviews are tests, but most are unstructured and not very good at examining the relevant predictor variables. All tests are of course not necessarily interviews, but the part where they had applicants explain their processes in layman’s terms might qualify it, at least if you’re generous with your definitions.

Of course, it’s certainly unclear if not outright incorrect to call it an interview, but that was their choice; possibly they felt that subjecting applicants to a “test” rather than an “interview” projected a less positive image.

• I’m most familiar with interviews for programming jobs, where an interview that doesn’t ask the candidate to demonstrate job-specific skills, knowledge and aptitude is nearly worthless. These jobs are also startlingly prone to resume distortion that can make vastly different candidates look similar, especially recent graduates.

Asking for coding samples and calling previous employers, especially if coupled with a request for code solving a new (requested) problem, could potentially replace interviews. However, judging the quality of code still requires a person, so that doesn’t seem to really change things to me.

• That’s what I think of, too, when I hear the phrase “job interview”. Is this not typical outside fields like programming?

• I can confirm that such a “job interview” is not common in medicine. The potential employer generally relies on the credentialing process of the medical establishment. Most physicians, upon completing their training, pass a test demonstrating their ability to regurgitate the teachers’ passwords, and are recommended to the appropriate certification board as “qualified” by their program director; to do otherwise would reflect badly on the program. Also, program directors are loath to remove a resident/fellow during advanced training because some warm body must show up to do the work, or the professor himself/herself might have to fill in. It is difficult to find replacements for upper-level residents; the only common reason such would be available is dismissal/transfer from another program. Consequently, the USA turns out physicians of widely varied skill levels, even though their credentials are similar. In surgical specialities, it is not unusual for a particularly bright individual with all the passwords but very poor technical skills to become a surgical professor.

• My mother has told me an anecdote about a family friend who was a surgeon who had a former student call him while conducting an operation because he couldn’t remember what to do.

• My mother has told me an anecdote about a family friend who was a surgeon who had a former student call him while conducting an operation because he couldn’t remember what to do.

The (rumored) student has my respect. I would expect most surgeons to have too much of an ego to admit to that doubt rather than stumble ahead full of hubris. It would be comforting to know that your surgeon acted as if (as opposed to merely believing that) he cared more about the patient than the immediate perception of status loss. (I wouldn’t care whether that just meant his thought-out anticipation of future status loss for a failed operation overrode his immediate social instincts.)

• “It is difficult to get a man to understand something, when his salary depends upon his not understanding it!”

I don’t think that’s fair, as his job is not being an interviewer, but hiring smart people the company can benefit from.

• Regarding hiring, I think the keyword might be “unstructured”: what makes an interview an “unstructured” interview?

• That’s what I thought too. The definitions I found searching all say that any interview where you decide what to ask and how to interpret the results is “unstructured”. The only “structured” interviews seem to be tests with pre-determined sets of questions, and the candidate’s answers judged by formal criteria.

I’m not sure this division of the “interview-space” is all that useful. I would distinguish three categories:

1. You have an informal chat with me about the nature of the job, my experience, my previous employment, my claims about my aptitude, etc. Your impressions from this chat determine your judgement of my suitability for the job.

2. You ask me to answer questions or perform tasks that demonstrate my aptitude. It’s up to you to choose the tasks, interpret my performance, and guide the whole process.

3. You give me a pre-determined set of questions/tasks that is the same for all candidates. My answers are mechanically interpreted by whether they coincide with the pre-determined set of correct answers.

If I interpret the definitions I could find correctly, 3 is a “structured” interview, and both 1 and 2 are “unstructured”. To my mind, there’s a world of difference between 1 and 2, however. 1 is of very limited utility (I want to say “next to worthless”, but that’d be too presumptuous), and, quite possibly, does no better than deciding on the basis of the resume alone, though I’d still want to see the data to be convinced. 2, when performed by a trained and calibrated interviewer, is, again in my own experience, obviously superior both to 1 and to deciding on the basis of the resume alone. Maybe this is somehow unique to the profession I interview for, but I doubt it.

Suppose there’s research which demonstrates that in some setting type 1 interviews are worse than using the resume alone. I don’t know whether this is the case in the papers cited in this post (I couldn’t read them), but I find it plausible. Suppose then that the conclusions drawn are the universal statements “unstructured interviews reliably degrade the decisions of gatekeepers” and “if you’re hiring, you’re probably better off not doing interviews”. I consider such conclusions to be obviously unsubstantiated, incredibly overreached, and highly dangerous advice.

• The interview example makes sense to me if the usual hiring manager is strongly biased regarding information that is not crucial. A dossier only gives little but important information. In a face-to-face interview various other factors can play a role (often unconsciously), e.g. smell or the ability to return a look.

• More here. Surely that isn’t strong evidence, but it is another indication that if you are not an LW type person then information that is not crucial might alter your perception and subsequent decision when doing face-to-face interviews versus dossier-based ruling.

• Read the Dawes pdf linked in the top post. I can’t speak for the other examples, but that one is solid.

edit: my apologies, re-reading I see you discussed the marriage example. What is your opinion on the graduate rating and Hodgkin’s disease examples?

• that one is solid

Why do you say that? My reaction to that paper was very negative. In large part, it was the anecdotal flavor of the arguments made there, but also because I didn’t see the two things I was specifically looking for:

• Citations of studies in which a linear model was constructed using one set of data, and then compared as to performance against the experts using a different set of data.

• Failing that, some numbers that would convince me that the failure to test models using different data than was used to construct them just doesn’t matter.

Instead, here and in the 1996 study by Grove & Meehl, I find arguments from incredulity, in effect: “Do our critics really think that this really matters? Don’t be absurd!”. I also notice that this ideology is being promoted by a small number of researchers who repeatedly cite each other’s work, and do not cite critics (except as strawmen).

• Like Perplexed, I hated this paper. Of course, it has the very good excuse that it is from 1979. But in 2011, it is sort of expected that you evaluate your model on a second, independent dataset. (My models often crash and burn at this stage.) Did any of these studies do this?
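The out-of-sample check these comments ask for is easy to sketch. Below, a linear model is fitted on one half of a synthetic dataset and scored on the held-out half; the data and the R² scoring are my own illustration, not taken from any of the cited studies.

```python
import numpy as np

# Synthetic data: 200 cases, 4 cues, known linear signal plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.normal(size=200)

# Fit on the first half, evaluate on the independent second half.
X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def r_squared(X, y, w):
    """Fraction of variance in y explained by the linear model w."""
    resid = y - X @ w
    return 1.0 - resid.var() / y.var()

in_sample = r_squared(X_train, y_train, w)
out_of_sample = r_squared(X_test, y_test, w)
# The in-sample score is optimistic because the weights were tuned to that
# same data; the out-of-sample score is the honest estimate of the model.
```

A model that fits its construction data well but collapses on the second dataset is exactly the overfitting the commenters worry about.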

• Also, if I may be permitted to make a more general criticism in response to this post, I would say that while the article appears to be well-researched, it has demonstrated some of the worst problems I commonly notice on this forum. The same goes for the majority of the comments, even though many are knowledgeable and informative. What I have in mind is the fixation on concocting theories about human behavior and society based on various idées fixes and leitmotifs that are part of the intellectual folklore here, while failing to notice issues suggested by basic common sense that are likely to be far more important.

Thus the poster notices that these models are not used in practice despite considerable evidence in their favor, and rushes to propose cognitive biases à la Kahneman & Tversky as the likely explanation. This without even stopping to think of two questions that just scream for attention. First, what is the importance of the fact that just about any issue of sorting out people is nowadays likely to be ideologically charged and legally dangerous? Second, what about the fact that these models are supposed to throw some high-status people out of work, and in a way that makes them look like they've been incompetent all along?

Regardless of whether various hypotheses based on these questions have any merit, the fact that someone could write a post without even giving them the slightest passing attention, offering instead a blinkered explanation involving the standard old LW/OB folklore, and still get upvoted to +40 is, in my opinion, indicative of some severe and widespread biases.

• While this post has +40 upvotes, the majority of the top-voted comments are skeptical of it. I think this represents confusion as to how to upvote, although this is merely a hypothesis. The article surveys a very interesting topic that is right in the sweet spot of interest for this community, and it appears scholarly; however, the conclusions synthesized by the author strike me as naive, and I suspect that's also the conclusion of the majority. Whether it deserves an upvote is debatable. I downvoted.

• I felt the confusion you are talking about. If readers could be expected to read the top-voted replies (RTFC), then the current distribution of votes would be ideal: the interesting article gets some well-deserved attention, and the skeptical replies give a counterbalance. But if readers don't read the comments, then frankly I think this article got too many upvotes when compared to many others.

Offtopic: Is there a meta thread somewhere discussing the semantics of votes? I am happy that we don't use Slashdot's baroque insightful/interesting/funny distinctions, but some consensus about the meaning of +1 would be nice.

• I don't know about a meta-thread, but the rule of thumb I've seen quoted often is "upvote what you want more of; downvote what you want less of." Karma scores are intended, on this view, as an indicator of how many people (net) want more entries like that.

One implication of this view is that a score of 40 isn't "ten times better" than a score of 4; it just means that many more people want to see posts like this than don't want to.

Of course, this view competes with people's entirely predictable tendency to treat karma as an indicator of the entry's (and the user's) overall worth, or as a game to maximize one's score on, or as a form of reward/punishment.

Equally predictably, this predictable but unintended use of karma far far far outweighs the intended use.

• Karma-maximizing is often but not always a good approximation to worth-as-judged-by-community maximizing, which is a good thing to maximize.

• Yes. The question is how significant the gap between "often" and "always" is.

• Though if you have a target audience in mind, it is sometimes worth posting things that will be downvoted by the community at large.

(I've been doing this a lot recently, though I plan on cutting back and regaining some general rationalist credibility.)

• My intent was to summarize the literature on SPRs, not to provide an account of why they are not used more widely. I almost didn't include that sentence at all. Surely, more analysis would be important to have in a post intending to discuss the psychological issues involved in our reaction to SPRs, but that was not my subject.

In pointing to cognitive biases as an explanation, I was merely repeating what Bishop & Trout & Dawes have suggested on the matter, not making up my own explanations in light of LW lore.

In fact, the arrows point the other way. Many of the authors cited in my article worked closely with people like Kahneman who are the original academic sources of much of LW lore.

Edit: I've added a clause about the source of the "cognitive biases" suggestion, in case others are tempted to make the same mistaken assumption as you made.

• First, what is the importance of the fact that just about any issue of sorting out people is nowadays likely to be ideologically charged and legally dangerous? Second, what about the fact that these models are supposed to throw some high-status people out of work, and in a way that makes them look like they've been incompetent all along?

I am not sure what you think the answers to these questions are, but I would say my personal opinion on the matter is that the more ideologically charged and legally dangerous a matter is, the more important accuracy and correctness become—at the expense, if necessary, of strongly held beliefs. I would also say that protecting the reputation for competency enjoyed by high-status people is not an activity that strongly correlates with being right; I predict a small negative correlation, in fact.

Furthermore, there is a selection effect: learning the LW/OB folklore will result in your noticing specific cases of its application, and you are far, far more likely to write a post about that than about any given subject. That is, you see a prevalence of "standard bias explanation" because top-level posters are actively looking for actual cases of bias to discuss.

• The second reason is invalid unless the actor is self-deluding—a smart actor who faces being put out of work would silently adopt an SPR as his decision-making system without admitting to it. Since the superiority of SPRs continues in many fields, either the relevant actors are consistently not smart, performance is not a significant contributing criterion to their success, or they're self-deluding, i.e., overrating their own judgment as the poster stated. I'd guess a combination of the last two.

• Yes, I'd say it's a combination of the last two points, with emphasis on the second-to-last.

The critical question is whether maximizing the accuracy of your judgments is a practical way to get ahead in a given profession. Sometimes that is indeed the case, and in such fields we indeed see tremendous efforts to automate as much expert work as possible, often with great success, as in the electronics industry. But in professions that operate as more tightly-knit guilds, adherence to accepted standards is much more important than any objective metrics of effectiveness. Stepping outside of standard work procedures is often treated as a serious infraction with potentially severe consequences. (Especially if your non-standard methodology fails in some particular case, as it will sooner or later, and you can't cover your ass by claiming that you followed all the standard accepted procedures and having your profession back you up organizationally.)

Now, you could try enhancing your work with decision models in secret. But even then, it's hard to do it in a completely secretive way, and moreover, human minds being what they are, most people can achieve professional success only if they are really sincerely convinced of their expertise and effectiveness. Keeping up a public facade is hard for everyone except a very small minority of people.

• So why aren't SPRs in use everywhere? Probably, we deny or ignore the success of SPRs because of deep-seated cognitive biases, such as overconfidence in our own judgments. But if these SPRs work as well as or better than human judgments, shouldn't we use them?

Without even getting into the concrete details of these models, I'm surprised that nobody so far has pointed out the elephant in the room: in contemporary society, statistical inference about human behavior and characteristics is a topic bearing tremendous political, ideological, and legal weight. [*] Nowadays there exists a firm mainstream consensus that the use of certain sorts of conditional probabilities to make statistical predictions about people is discriminatory and therefore evil, and doing so may result not only in loss of reputation, but also in serious legal consequences. (Note that even if none of the forbidden criteria are built into your decision-making explicitly, that still doesn't let you off the hook—just search for "disparate impact" if you don't know what I'm talking about.)

Now of course, making any prediction about people at all necessarily involves one sort of statistical discrimination or another. The boundaries between the types of statistical discrimination that are considered OK and those that are considered evil and risk legal liability are an arbitrary result of cultural, political, and ideological factors. (They would certainly look strange and arbitrary to someone who isn't immersed in the culture that generated them to the point where they appear common-sensical or at least explicable.) Therefore, while your model may well be accurate in estimating the probability of recidivism, job performance, etc., it's unlikely that it will be able to navigate the social conventions that determine these forbidden lines. A lot of the seemingly absurd and ineffective rituals and regulations in modern business, government, academia, etc. exist exactly for the purpose of satisfying these complex constraints, even if they're not commonly thought of as such.

--

[*] Edit: I missed the comment below in which the commenter Student_UK already raised a similar point.

• If the best way to choose whom to hire is with a statistical analysis of legally forbidden criteria, then keep your reasons secret and shred your work. Is that so hard?

• That doesn't close the loophole, it adds a constraint. And it's only significant for those who both hire enough people to be vulnerable to statistical analysis of their hiring practices, and receive too many bad applicants from protected classes. If it is a significant constraint, you want to find that out from the data, not from guesswork, and apply the minimum legally acceptable correction factor.

Besides, it's not like muggles are a protected class. And if they were? Just keep them from applying in the first place, by building your office somewhere they can't get to. There aren't any legal restrictions on that.

• Besides, it's not like muggles are a protected class. And if they were? Just keep them from applying in the first place, by building your office somewhere they can't get to. There aren't any legal restrictions on that.

You joke, but the world [1] really is choking with inefficient, kludgey workarounds for the legal prohibition of effective employment screening. For example, the entire higher-education market has become, basically, a case of employers passing off tests to universities that they can't legally administer themselves. You're a terrorist if you give an IQ test to applicants, but not if you require a completely irrelevant college degree that requires taking the SAT (or the military's ASVAB or whatever they call it now).

It feels so good to ban discrimination, as long as you don't have to directly face the tradeoff you're making.

[1] Per MattherW's correction, this should read "Western developed economies" instead of "the world"—though I'm sure the phenomenon I've described is more general than the form it takes in the West.

• You say 'the world', but it seems to me you're talking about a region which is a little smaller.

• I'm not sure the correction is that relevant. The US and the EU together make up about 40% of global GDP (PPP).

Several minor economies with nearly identical conditions and restrictions, such as Canada, New Zealand, Australia, South Africa, Norway, Switzerland, add up to another 3% or so. Most states in Latin America have similar legal prohibitions as well; they are not as well enforced, but avoiding them still imposes costs. This is to say nothing of Japan or other developed East Asian economies (though to be fair, losses there are probably much smaller than in the developed West and perhaps even Latin America).

The other half of the world bears a massive opportunity cost due to the described inefficiency in the first half. Converting this loss into number of lives or quality of life is a depressing exercise.

Fortunately that is only a problem if you care about humans.

• Well, I'm in the UK, and there's no law against using IQ-style tests for job applicants here. Is that really the case in the US? (I assume the "You're a terrorist" bit was hyperbole.)

Employers here still often ask for apparently-irrelevant degrees. But admission to university here isn't noticeably based on 'generic' tests like the SAT; it's mostly done on the grades from subject-specific exams. So I doubt employers are treating the degrees as a proxy for SAT-style testing.

• Correction accepted.

• That doesn't close the loophole, it adds a constraint.

Yes, it does close the loophole. You say to conceal the cause (intent to discriminate) and you can get away with as much effect (disproportionate exclusion) as you want. Except the law already specifies that the effect is punishable as well as the cause.

So now the best you can do, assuming the populations are equally competent and suited for the job, is 20% discrimination.

And of course, in the real world, populations usually differ in their suitability for the job. Blacks tend not to have as many CS degrees as whites, for example. So if you are an employer hiring CS degree-holders, you may not be able to get away with any discrimination before you have breached the 20% limit, and may need to discriminate against the non-blacks in order to be compliant.
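The "20% limit" being referenced is the EEOC's four-fifths rule: a selection procedure is presumed to have disparate impact if a protected group's selection rate falls below 80% of the highest group's rate. A minimal sketch of that arithmetic, with made-up applicant counts purely for illustration:

```python
# Hedged sketch: computing the "four-fifths" (80%) disparate-impact ratio.
# The threshold follows the Uniform Guidelines; all counts are invented.

def selection_rate(hired, applicants):
    return hired / applicants

def disparate_impact_ratio(rate_protected, rate_reference):
    # Ratio of the protected group's selection rate to the highest group's
    # selection rate; values below 0.8 flag presumptive adverse impact.
    return rate_protected / rate_reference

rate_a = selection_rate(30, 100)  # reference group: 30% hired (hypothetical)
rate_b = selection_rate(18, 100)  # protected group: 18% hired (hypothetical)

ratio = disparate_impact_ratio(rate_b, rate_a)
print(round(ratio, 2))  # 0.6 -> below 0.8, presumptive adverse impact
```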

Besides, it's not like muggles are a protected class.

I would suspect that if the US Muggle legal system had anything to say about it, they would be. If magical-ness is conferred by genes, then it's violating either the general racial guideline or it's violating recent laws (signed by GWB, IIRC) forbidding employer discrimination based on genetics (in the context of genome sequencing, true, but probably general). If it's not conferred by genes, then there may be a general cultural basis on which to sue (Muggles as disabled because they lack an ability necessary for basic functioning in Wizarding society, perhaps).

• You can put degree requirements on the job advertisement, which should act as a filter on applications, something that can't be caught by the 80% rule.

(Of course, universities tend to use racial criteria for admission in the US, something which, ironically, can be an incentive for companies to discriminate based on race even amongst applicants with CS degrees.)

• The 80% rule is only part of it. Again, racist requirements are an obvious loophole you should expect to have been addressed; you can only get away with a little covert discrimination, if any.

For example, consider a fire department requiring applicants to carry a 100 lb (45 kg) pack up three flights of stairs. The upper-body strength required typically has an adverse impact on women. The fire department would have to show that this requirement is job-related for the position. This typically requires employers to conduct validation studies that address both the Uniform Guidelines and professional standards.

If you add unnecessary requirements as a stealth filter, how do you show the requirements are job-related?

• I thought we were talking about how to use necessary requirements without risking a suit, not how to conceal racial preferences by using cleverly chosen proxy requirements. But it looks like you can't use job-application degree requirements without showing a business need either.

• topynate:

But it looks like you can't use job-application degree requirements without showing a business need either.

The relevant landmark case in U.S. law is the 1971 Supreme Court decision in Griggs v. Duke Power Co. The court ruled that not just testing of prospective employees, but also academic degree requirements that have disparate impact across protected groups, are illegal unless they are "demonstrably a reasonable measure of job performance."

Now of course, "a reasonable measure of job performance" is a vague criterion, which depends on controversial facts as well as subjective opinion. To take only the most notable example, these people would probably say that IQ tests are a reasonable measure of performance for a great variety of jobs, but the present legal precedent disagrees. This situation has given rise to endless reams of case law and a legal minefield that takes experts to navigate.

At the end, as might be expected, what sorts of tests and academic requirements are permitted to different institutions in practice depends on arbitrary custom and the public perception of their status. The de facto rules are only partly codified formally. Thus, to take again the most notable example, the army and the universities are allowed to use what are IQ tests in all but name, which is an absolute taboo for almost any other institution.

• I thought we were talking about how to use necessary requirements without risking a suit, not how to conceal racial preferences by using cleverly chosen proxy requirements.

I wasn't. I was talking about how the obvious loopholes are already closed or have been heavily restricted (even at the cost of false positives), and hence how Quirrel's comments are naive and uninformed.

But it looks like you can't use job-application degree requirements without showing a business need either.

Yes, that doesn't surprise me in the least.

• Just keep them from applying in the first place, by building your office somewhere they can't get to. There aren't any legal restrictions on that.

You really are new here, aren't you?

http://en.wikipedia.org/wiki/Americans_with_Disabilities_Act_of_1990#Title_III_-_Public_Accommodations_.28and_Commercial_Facilities.29

http://en.wikipedia.org/wiki/Zoning

In short, there most certainly ARE legal restrictions on building your office somewhere deliberately selected for its inaccessibility to those with a congenital inability to e.g. teleport, and a lack of teleportation-specific case law would not work in your favor, given the judge's access to statements you've already made.

• In short, there most certainly ARE legal restrictions on building your office somewhere deliberately selected for its inaccessibility to those with a congenital inability to e.g. teleport,

The Americans with Disabilities Act limits what you can build (every building needs ramps and elevators), not where you can build it. Zoning laws are blacklist-based, not whitelist-based, so extradimensional spaces are fine. More commonly, you can easily find office space in locations that poor people can't afford to live near. And in the unlikely event that race or national origin is the key factor, you get to choose which country or city's demographics you want.

A lack of teleportation-specific case law would not work in your favor, given the judge's access to statements you've already made.

This is the identity under which I speak freely and teach defense against the dark arts. This is not the identity under which I buy office buildings and hire minions. If it was, I wouldn't be talking about hiring strategies.

• This is the identity under which I speak freely and teach defense against the dark arts. This is not the identity under which I buy office buildings and hire minions. If it was, I wouldn't be talking about hiring strategies.

Upvoted for having the sense to employ a blindingly obvious strategy that somehow consistently fails to become common sense.

• More commonly, you can easily find office space in locations that poor people can't afford to live near.

But that they could, in principle, walk to and from.

• Besides, it's not like muggles are a protected class. And if they were? Just keep them from applying in the first place, by building your office somewhere they can't get to. There aren't any legal restrictions on that.

My google-fu is not strong enough to find the legal doctrine, but in the US at least, you can be sued for ~implicit discrimination, i.e. if the newspaper you advertise in has a reader population that does not reflect the general population, you're discriminating against the under-represented population.

• i.e. if the newspaper you advertise in has a reader population that does not reflect the general population, you're discriminating against the under-represented population.

…I thought this was a joke. Now… not so sure.

• See the last sentence of my first paragraph above (the one in parentheses).

• An interesting story that I think I remember reading:

One study found that relatively inexperienced psychiatrists were more accurate at diagnosing mental illness than experienced ones. This is because inexperienced psychiatrists stuck closely to checklists rather than relying on their own judgment, and whether or not a diagnosis was considered "accurate" was based on how closely the reported symptoms matched the checklist. ;)

• If we are measuring the accuracy of A vs. B, we are implicitly measuring A against gold standard C, and B against gold standard C. If a better C is not readily available, we may choose to use A or B as an approximation, the choice of which determines our outcome.

Now I wonder:

Are the people that are sympathetic to the hypothesis that computers are better in the cases above (and ignored because of biases) assuming we made the fallacy of using humans as a gold standard?

Are the people that are sympathetic to the hypothesis that humans are better (and ignored because of biases) assuming we made the fallacy of using computers as a gold standard?

The union of which is a lot of upvotes. I can't decide which was meant.

• This is one of the top 3 rated comments on this post. I think you should specify more directly how this anecdote relates to how you interpret the article's intention.

• He should specify where he has read that.

• I don't remember. I may have actually heard one of my parents talking about it instead of reading it. So consider it an urban legend.

• If this is not amazing enough, consider the fact that even when experts are given the results of SPRs, they still can't outperform those SPRs (Leli & Filskov 1985; Goldberg 1968).

Now THAT part is just plain embarrassing. I mean, it's truly a mark of shame upon us if we have a tool that we know works, we are given access to the tool, and we still can't do better than the tool itself, unaided. (EDIT: By "we", I mean "the experts in the relevant fields"… which I guess isn't really a "we" as such, but you know what I mean.)

Anyways, are there any nice online indexes or whatever of SPRs that make it easy to put in a class of problem and have them find an SPR that's been verified to work for that sort of problem?

• Now THAT part is just plain embarrassing. I mean, it's truly a mark of shame upon us if we have a tool that we know works, we are given access to the tool, and we still can't do better than the tool itself, unaided.

Coincidentally, I was planning to write an article "defending" the use of fallacies on Bayesian grounds. A typical passage would go like this:

People say it's fallacious to appeal to authority. However, if you learn that experts believe X, you should certainly update some finite amount in favor of believing X, as experts are, in general, more likely to believe X if it is true than if it is false—even as you may find many exceptions.

Indeed, it would be quite a strange world if experts were consistently wrong about a given subject matter X, thus making their opinions on X into evidence against X, because they would have to persist in this error even while knowing that their entanglement with X means they only have to invert their pronouncements or remain agnostic to improve accuracy.

Well, it seems we actually do live in such a world, where (some classes of) experts make predictable errors, and don't take trivial steps to make their opinions more accurate (and entangled with the subject matter).
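The Bayesian point in the passage above can be made concrete with a two-line update. This is a generic illustration with invented probabilities, not figures from any of the cited studies:

```python
# Hedged sketch: why expert endorsement of X should raise your belief in X
# whenever experts are merely more likely to endorse truths than falsehoods.
# All probabilities below are illustrative, not empirical.

def posterior(prior, p_endorse_given_true, p_endorse_given_false):
    # P(X | expert endorses X) via Bayes' rule.
    num = p_endorse_given_true * prior
    den = num + p_endorse_given_false * (1 - prior)
    return num / den

# Suppose experts endorse true claims 70% of the time, false ones 40%.
p = posterior(0.5, 0.7, 0.4)
print(round(p, 3))  # 0.636 -- belief rises from the 0.5 prior
```

If the conditional probabilities were reversed (experts more likely to endorse falsehoods), the same formula would push the posterior below the prior, which is exactly the "invert their pronouncements" point.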

• Well, experts still do better than non-experts on average (AFAIK); it's just that they seem to totally ignore tools that could let them do a whole lot better, and also apparently can't do much better than the tools themselves, even when they're able to use the tools.

• Making predictable errors isn't the same thing as their opinions being anti-correlated with reality.

• If anybody would like to try some statistical machine learning at home, it's actually not that hard. The tough part is getting a data set. Once that's done, most of the examples in this article are things you could just feed to some software like Weka, press a few buttons, and get a statistical model. BAM!

Let's try an example. Here is some breast cancer diagnostic data, showing a bunch of observations of people with breast cancer (age, size of tumors, etc.) and whether or not the cancer recurred after treatment. Can we predict cancer recurrence?

If you look at it with a decision tree, it turns out that you can get about 70% accuracy by observing two of the several factors that were observed, in a very simple decision procedure. You can do a little better by using something more sophisticated, like a naive Bayes classifier. These show us which factors are the most important, and how.

If you're interested, go ahead and play around. It's pretty easy to get started. Obviously, take everything with a grain of salt, but still, basic machine learning is surprisingly easy.
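To give a sense of how simple such a two-factor decision procedure can be, here is a minimal sketch in plain Python. The cue names and thresholds are invented for illustration; they are not the splits actually learned from the breast-cancer dataset:

```python
# Hedged sketch: a two-question decision procedure in the spirit of the
# decision tree described above. Cues and thresholds are hypothetical.

def predict_recurrence(deg_malig, inv_nodes):
    # Mimics a tiny learned tree: first split on degree of malignancy,
    # then on the number of involved lymph nodes.
    if deg_malig >= 3:
        return "recurrence" if inv_nodes >= 1 else "no-recurrence"
    return "no-recurrence"

cases = [(3, 2), (3, 0), (1, 5), (2, 0)]
print([predict_recurrence(d, n) for d, n in cases])
# ['recurrence', 'no-recurrence', 'no-recurrence', 'no-recurrence']
```

A tool like Weka learns the splits and thresholds from the data; the point is only that the resulting rule can be this small.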

Let me brag a bit. Once in a friendly discussion the following question came up: How to predict for an unknown first name whether it is a male or female name? This was in a context of Hungarian names, as all of us were Hungarians. I had a list of Hungarian first names in digital format. The discussion turned into a bet: I said I can write a program in half an hour that tells with at least 70% precision the sex of a first name it never saw before. I am quite fast with writing small scripts. It wasn't even close: It took me 9 minutes to

• split my sets of 1000 male and 1000 female names into a random 1000-1000 train-test split,

• split each name into character 1-, 2-, and 3-grams. E.g.: Luca was turned into ^L u c a$ ^Lu uc ca$ ^Luc uca$.

• feed the training data into a command line tool to train a maxent model,

• test the accuracy of the model on the unseen test data.

The model reached an accuracy of 90%. In retrospect, this is not surprising at all. Looking into the linear model, the most important feature it identified was whether the name ends with an 'a'. This trivial model alone reaches some 80% precision for Hungarian names, so if I knew this in advance, I could have won the bet in 30 seconds instead of 9 minutes, with the sed command s/a$/a FEMALE/.
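For the curious, the featurization step and the trivial ends-with-'a' baseline can be sketched in a few lines of Python. The names below are toy examples; the real experiment fed n-gram features like these to a maxent model:

```python
# Hedged sketch: character n-gram featurization plus the trivial
# ends-with-'a' baseline described above. Example names are illustrative.

def char_ngrams(name, n_max=3):
    # Pad with boundary markers so prefixes/suffixes become features.
    padded = "^" + name.lower() + "$"
    return [padded[i:i + n] for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]

def ends_with_a(name):
    # The single most informative cue for Hungarian first names.
    return "FEMALE" if name.lower().endswith("a") else "MALE"

print(char_ngrams("Luca")[:6])  # unigrams first: ['^', 'l', 'u', 'c', 'a', '$']
print(ends_with_a("Luca"))      # FEMALE
print(ends_with_a("Gabor"))     # MALE
```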

• These sound like powers I should acquire. Could you drop some further hints on:

• "a command line tool to train a maxent model"

• how you tested the accuracy of the model (tools that let you do that in the remaining minutes, rather than general principles)

• I used Zhang Le's tool. Note that it is a rather obscure thing, not an industry standard like, say, the huge Weka and Mallet packages. It made very easy the tasks you ask for. When I had the train and test data featurized,

maxent -m gender.model train.data

built the model and

maxent -p -m gender.model test.data

told me its accuracy on the test data.

• This is a great article, but it only lists studies where SPRs have succeeded. In fairness, it would be good to know if there were any studies that showed SPRs failing (and also consider publication bias, etc.).

• Definitely.

• My principal problem with this article is that you appear to promote the idea that these SPRs are being ignored for extremely bad reasons, rather than that they were ignored for decent reasons. So when you say 'definitely' here I have a problem that you are compartmentalizing the arguments and not admitting the problems with your post.

Also, I don't think this is a great article, and in proportion to it getting +40 votes I have a poor opinion of this community (or at least its karma system, where 0 should be neutral).

edit: My last paragraph here is excessively dramatic and I retract it.

• Miller,

Does this look like "not admitting the problems with [my] post"?

• It would be more constructive of me if I actually helped find counter-evidence, rather than whinging about your not doing so. I think you've put a lot of effort into updating your position.

• My gut reaction is that this doesn't demonstrate that SPRs are good, just that humans are bad. There are tons of statistical modeling algorithms that are more sophisticated than SPRs.

Unless, of course, SPR is another word for "any statistical modeling algorithm", in which case this is just the claim that statistical machine learning is a good approach, which anyone as Bayesian as the average LessWronger probably agrees with.

• There are tons of statistical modeling algorithms that are more sophisticated than SPRs.

Not in and of itself a good thing. As demonstrated recently, sophisticated statistics can suffice simply to let one tie oneself into a sophisticated knot that's harder to untie. There is a case to be made for promoting the simplest algorithm that outperforms current methods, and SPRs seem to fit this bill.

As for what SPR stands for, the post makes it pretty clear that they are a class of rules that predict a (desired) property using weighted cues (observable properties). I am not familiar enough with statistical modelling to say if that is a shared goal among all algorithms.

• The post gives an example of an SPR that uses weighted cues. But he specifically says

This particular SPR is called a proper linear model,

indicating that there are other types of SPRs, and I currently have no idea what those other types might be.

I agree with you that complicated statistical tests can lead to spurious results; simple statistical tests can also lead to spurious results if the person using them doesn't understand them. I naively associate both of these with "the test was designed to correct against a different type of flaw in experimental design than actually occurred".

When the focus of the statistical test is on accurately modeling a given situation, I think it is less difficult to realize when a model choice makes sense and when it doesn't, so more sophisticated approaches will probably do better, since they come closer to carving reality at its joints. This might be an inferential-distance error on my part, though, since I have training in this area, so errors that I personally can avoid might not be generally avoidable.

• I agree with you for smart people; I do see a lot of value, though, in idiot-proof statistics. Weighted-cue SPRs are almost too simple to screw up.

• Also, while this isn’t su­per-rele­vant, given that I already agree with your claim about peo­ple con­fus­ing them­selves, my im­pres­sion is that the link you gave pre­sents mod­er­ate-to-weak ev­i­dence against this.

I didn’t read the en­tire ar­ti­cle that was linked to dis­cussing the statis­ti­cal anal­y­sis (if there’s a par­tic­u­lar sec­tion you think I should read, please let me know), but my un­der­stand­ing was that in some sense the “ex­per­i­men­tal pro­ce­dure” was the is­sue, not the statis­tics. In other words, Bem con­sid­ered po­ten­tially hun­dreds of hy­pothe­ses about his data, but only re­ported on a few, so that p-val­ues of 0.02 are not su­per-im­pres­sive (since out of 100 hy­pothe­ses we would ex­pect a few to hit that by chance).

Bem’s ex­per­i­ments all ba­si­cally ask “is this coin bi­ased”, which isn’t a very com­pli­cated ques­tion to an­swer. It is the so­phis­ti­cated statis­tics that cor­rects for the flawed pro­ce­dure.

• It wasn’t a very good ex­am­ple at all. I ba­si­cally grepped my mem­ory for “idiot statis­tics” and that one fea­tured strongly. The prob­lem there was not a mi­suse of statis­ti­cal tests, it was a mis­in­ter­pre­ta­tion of the sig­nifi­cance of statis­ti­cal tests.

• Are some SPRs easy to ex­ploit?

• Depends on what you’re mea­sur­ing. I can’t see how it would be ex­ploitable for things like pre­dict­ing wine qual­ity (ac­tu­ally green­hous­ing your grapes to con­trol tem­per­a­ture and rain­fall might just make them bet­ter) but definitely a spe­cific SPR for, say, rat­ing dossiers for hiring would be ex­ploitable if you knew or could guess at which cues it’s us­ing.

• SPRs sound a lot like the Outside View.

• SPRs sound like a method to en­sure a very ac­cu­rate out­side view.

‘Out­side view’, I be­lieve, is a term of Kah­ne­man’s, and is used in the liter­a­ture by lots of these peo­ple who work on SPRs, for ex­am­ple Dawes.

Kah­ne­man be­gins his Edge.org mas­ter class on think­ing by dis­cussing the out­side view.

• Well, SPRs can plausibly outperform average expertise. That’s because most of the expertise is an utter and complete sham.

Take the recidivism example...

The judges, or psychologists, or the like: what in the world makes them experts at predicting criminals? Did they review an unbiased sample of recidivism cases? Did they get any practice, earning marks for predicting which criminals would reoffend? Anything?

A resounding no. They never in their lives did anything that should have earned them expert status on this task. They did other stuff that puts them first on the list when you’re looking for ‘experts’ on a topic for which there are no experts.

They are about as much experts at this task as a court janitor is an expert on law. He too did nothing related to law; he just cleaned the courtroom.

• Does SPR beat pre­dic­tion mar­kets?

• If it did, then you could make a lot of money on a pre­dic­tion mar­ket with enough cash in it. This would cause the mar­ket to give bet­ter an­swers.

• I have two con­cerns about the prac­ti­cal im­ple­men­ta­tion of this sort of thing:

1. It seems like there are cases where if a rule is be­ing used then peo­ple could abuse it. For ex­am­ple, in job ap­pli­ca­tions or ad­mis­sions to med­i­cal schools. A bet­ter un­der­stand­ing of how the rule re­lates to what it pre­dicts would be needed.

If X+Y pre­dicts Z does that mean en­hanc­ing X and Y will up the prob­a­bil­ity of Z? Not nec­es­sar­ily, con­sider the ex­am­ple of happy mar­riages. Will hav­ing more sex make your re­la­tion­ship hap­pier? Or does the rule work be­cause happy cou­ples tend to have more sex?

2. It is not true in every case that we equally value all true beliefs, and equally value all false beliefs. Certain rules might work better if we take into consideration a person’s race, sex, religion, and nationality. But most people find this sort of thing unpalatable because it can lead to the systematic persecution of subgroups, even if it results in more true, and fewer false, beliefs overall. It also might be the case that some of these rules discriminate against groups of people in more subtle ways that won’t be immediately obvious.

Of course nei­ther of these prob­lems mean that there won’t be perfectly good cases where these rules would im­prove de­ci­sion mak­ing a lot.

• Yes, sev­eral of these mod­els look like they’re likely to run into trou­ble of the Good­hart’s law type (“Any ob­served statis­ti­cal reg­u­lar­ity will tend to col­lapse once pres­sure is placed upon it for con­trol pur­poses”).

• Will hav­ing more sex make your re­la­tion­ship hap­pier?

Ob­vi­ously, yes.

• It probably depends somewhat on with whom you are having it.

• True. One of my nodes for “relationship” is consensual; sex that wasn’t would most definitely make the relationship much less happy.

• Well, un­less the qual­ity of the sex is causally linked to the quan­tity, such that hav­ing lots and lots of sex (past a cer­tain thresh­old) makes each in­di­vi­d­ual ses­sion dis­pro­por­tionately worse. This is true for a lot of peo­ple’s libidos.

To put it an­other way: it’s not the fre­quency of the mo­tion in the ocean, but the am­pli­tude of the waves.

• This is true for a lot of peo­ple’s libidos.

But prob­a­bly not true for the quan­tity of sex in al­most all re­la­tion­ships, I would bet.

• Although I agree with you, I feel like I should point out that it is some­what non­sen­si­cal for most re­la­tion­ships to be sub-op­ti­mal in this way. If both par­ties want to have more sex, and they can (oth­er­wise the ques­tion wouldn’t re­ally be valid), but they don’t, that’s a lit­tle weird, don’t you think?

We can talk about op­ti­miz­ing for other things (e.g. ca­reers), but I don’t think that’s re­ally the is­sue, since many cou­ples, when ex­plic­itly told that they would be hap­pier if they had more sex, just start hav­ing more sex, with­out sac­ri­fic­ing any­thing that they end up want­ing back.

• Although I agree with you, I feel like I should point out that it is some­what non­sen­si­cal for most re­la­tion­ships to be sub-op­ti­mal in this way. If both par­ties want to have more sex, and they can (oth­er­wise the ques­tion wouldn’t re­ally be valid), but they don’t, that’s a lit­tle weird, don’t you think?

Weird cer­tainly but this is a kind of weird­ness that hu­mans are no­to­ri­ous for. We are ter­rible hap­piness op­ti­misers. In the case of sex speci­fi­cally hav­ing more of it is not as sim­ple as walk­ing over to the bed­room. For males and fe­males al­ike you can want to be hav­ing more sex, be aware that hav­ing more sex would benefit your re­la­tion­ship and still not be ‘in the mood’ for it. A more in­di­rect ap­proach to the prob­lem of libido and de­sire is re­quired—the sort of thing that hu­mans are not nat­u­rally good at op­ti­mis­ing.

• I agree on ev­ery point. I also think part of this is sim­ply that shared knowl­edge that is not com­mon knowl­edge (un­til ac­knowl­edged be­tween par­ties) is much more difficult to act upon.

I think that “okay, we’re go­ing to have sex now, be­cause it will make us hap­pier” is a lit­tle like “okay, I’m go­ing to the gym now, be­cause it will make me feel bet­ter”, which may be the same thing you meant about be­ing “in the mood”, but I think it’s even harder for sex, be­cause we are per­haps less will­ing to see sex ex­cept as im­me­di­ate grat­ifi­ca­tion.

• I’ve heard more than once that hav­ing more sex on a sched­ule in the hopes of hav­ing chil­dren is a mis­er­able ex­pe­rience for cou­ples with fer­til­ity prob­lems.

I don’t know whether hav­ing more sex in the hopes of be­ing hap­pier (rather than be­cause the peo­ple in­volved want sex more for the fun of it) could have similar side effects.

• It’s fairly com­mon for sex ther­a­pists to recom­mend that cou­ples sched­ule sex and have sex at all (but not only) sched­uled times, on the grounds that peo­ple may not be in the mood at first, but en­joy it any­way. While it may be a mis­er­able ex­pe­rience for a few peo­ple, I doubt that it is mis­er­able in gen­eral (and I’m not sure why it would be).

• It’s cer­tainly pos­si­ble for peo­ple to have akra­sia in re­gards to plea­sure, and schedul­ing can help with that.

I think pos­si­ble prob­lems come in if a part­ner (pos­si­bly both part­ners in the case of fer­til­ity) re­ally doesn’t want to at the mo­ment, but is feel­ing pres­sured.

• Will hav­ing more sex make your re­la­tion­ship hap­pier?

I think it’s safe to say that hav­ing less sex will make the re­la­tion­ship less happy, so there is some causal­ity in­volved.

• Not nec­es­sar­ily, con­sider the ex­am­ple of happy mar­riages. Will hav­ing more sex make your re­la­tion­ship hap­pier?

Yes. Al­most cer­tainly. But there are plenty of other ex­am­ples you could pick from where there is not causal­ity in­volved (and some for which causal­ity is nega­tive).

• Will having more sex make your relationship happier?

Hav­ing more sex will make ME hap­pier. If my wife finds out though…

• Be­sides the le­gal is­sues with dis­crim­i­na­tion and dis­parate im­pact, an­other im­por­tant is­sue here is that jobs that in­volve mak­ing de­ci­sions about peo­ple tend to be high-sta­tus. As a very gen­eral ten­dency, the higher-sta­tus a pro­fes­sion is, the more its prac­ti­tion­ers are likely to or­ga­nize in a guild-like way and re­sist in­tru­sive in­no­va­tions by out­siders—es­pe­cially in­no­va­tions in­volv­ing perfor­mance met­rics that show the cur­rent stan­dards of the pro­fes­sion in a bad light, or even worse, those that threaten a change in the way their work is done that might lower its sta­tus.

Dis­cus­sions of such cases in medicine are a reg­u­lar fea­ture on Over­com­ing Bias, but it ex­ists in a more or less pro­nounced form in any other high-sta­tus pro­fes­sion too. How much it ac­counts for the spe­cific cases dis­cussed in the above ar­ti­cle is a com­plex ques­tion, but this phe­nomenon should cer­tainly be con­sid­ered as a plau­si­ble part of the ex­pla­na­tion.

• Some­times, be­ing ra­tio­nal is easy. When there ex­ists a re­li­able statis­ti­cal pre­dic­tion rule for the prob­lem you’re con­sid­er­ing, you need not waste your brain power try­ing to make a care­ful judg­ment.

Un­for­tu­nately lin­ear mod­els for a lot of situ­a­tions are sim­ply not available. The dozen or so ones in the liter­a­ture are the ex­cep­tion, not the rule.

• And those that ex­ist are not always easy to find.
And those that are found are not always easy to use in in­dus­try (where so­phis­ti­cated com­puter skills are of­ten the things the mar­ket­ing grad taught er­self to do in Ex­cel).

• You speak of incredible success without giving a success rate for the models. The fact that there are a dozen cases where specific models outperformed human reasoning doesn’t prove much.

At the moment you are recommending that other people use SPRs for their decision making, based on “expert judgment”. How about providing us an SPR that tells us for which problems we should use SPRs?

• SPRs can be gamed much more directly than human experts. For example, imagine an SPR in place of all hiring managers. As things stand, with hiring managers, we can guess at what goes into their decision-making and attempt to optimize for it, but because each manager is somewhat different, we can’t know that well. A single SPR that took over for all the managers, or even a couple of very popular ones, would strongly encourage applicants to optimize for the variable most weighted in the equation. Over time this would likely decrease the value of the SPR back to that of a human expert.

This has a name in the liter­a­ture, but I can’t re­mem­ber it at the mo­ment. You see this prob­lem in, for ex­am­ple, the cur­rent ob­ses­sive fo­cus on GDP as the only mea­sure of na­tional well-be­ing. Now that we’ve had that mea­sure for some time, we’re able to have coun­tries whose GDP is im­prov­ing but who suck on lots of other mea­sures, and thus poli­ti­ci­ans who are proud of what they’ve done but who are hated by the peo­ple.

Yes, in some cases, this would cause us to im­prove the SPR to the point where it ac­cu­rately re­flected the qual­ities that go into suc­cess. But that’s not a proven thing.

That said, I’d re­ally like to see a wiki or other at­tempt­ing-to-be-com­plete re­source for find­ing an SPR for any par­tic­u­lar ap­pli­ca­tion. Any­one got one?

• This has a name in the liter­a­ture, but I can’t re­mem­ber it at the moment

Good­hart’s Law

A sin­gle SPR that took over for all the man­agers, or even a cou­ple of very pop­u­lar ones, would strongly en­courage ap­pli­cants to op­ti­mize for the vari­able most weighted in the equa­tion.

W1(Quan­ti­ta­tive skills) + W2(Writ­ten and Oral Com­mu­ni­ca­tion Skills) + W3(Abil­ity to work with loose su­per­vi­sion) + W4(Do­main Ex­per­tise) + W5(So­cial Skills) + W6(Pres­tige Mark­ers)

That said, I’d re­ally like to see a wiki or other at­tempt­ing-to-be-com­plete re­source for find­ing an SPR for any par­tic­u­lar ap­pli­ca­tion. Any­one got one?

No, but I imagine that taking a grab bag of plausible correlates of the desired trait and throwing them into a regression function would be a good first draft. Then iterate.
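As a sketch of that first draft, here is the one-cue version with ordinary least squares done by hand; all the data is fabricated, and a real attempt would regress many cues at once with a proper stats library:

```python
# Fabricated data: one cue (say, a work-sample score) for five past hires,
# and their later performance ratings. Fitting a line to this gives the
# weight and intercept of a one-cue SPR; iterate by adding more cues.
cue = [7.0, 4.0, 8.0, 5.0, 9.0]
perf = [7.5, 6.2, 7.8, 6.6, 8.1]

n = len(cue)
mean_c = sum(cue) / n
mean_p = sum(perf) / n

# Ordinary least squares for a single predictor.
w = sum((c - mean_c) * (p - mean_p) for c, p in zip(cue, perf)) \
    / sum((c - mean_c) ** 2 for c in cue)
b = mean_p - w * mean_c

def spr(c):
    """Predicted performance for a candidate with cue value c."""
    return b + w * c

print(w, b)      # fitted weight and intercept
print(spr(6.0))  # prediction for a hypothetical new candidate
```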

• Correct me if I’m wrong, but the SPR is just a linear model, right? Statistics is an underappreciated field in many walks of life. My own field of speciality, experimental design, is treated with downright suspicion by scientists who have not encountered it before, who find the results counter-intuitive (when they have 4 controllable variables in an experiment they want to vary them one at a time, while the best way is to vary all 4 simultaneously...)

• I also find that counter-in­tu­itive, is there a short ex­pla­na­tion of why?

• I am cu­ri­ous: could you ex­plain why it is bet­ter to vary all 4?

• Briefly: be­cause to do so as­sumes that they do not in­ter­act, and if they DO in­ter­act, you will have gath­ered no in­for­ma­tion on said in­ter­ac­tions.

• That makes sense… if your in­puts are X and Y, and you want to figure out what your out­put f(X,Y) is, it seems like you’ll even­tu­ally have to vary X and Y si­mul­ta­neously in or­der to tell the differ­ence be­tween f(X,Y) = aXY + c and f(X,Y) = aX + bY + c.

• Quite, although usually you’ll have a model f(X,Y) = aXY + bX + cY + d. I’m actually underselling this approach: if we have two variables X and Y, each observable on (-1, 1), and only two observations to spend, then we’re much better off going (X,Y) = (-1,1) and (1,-1) rather than (0,1) and (1,0), because we’re gathering more information.

We always want to design in the location with the most variance, because that’s the hardest place to predict. Given that the model we’re looking at is linear in both the parameters and the variables, we know the places where we get the most variation will be at the extremes. Obviously we have no information if we think there might be some kind of quadratic terms here, but one of the nice things about design for linear models is that you can build your experimentation to iteratively build up information.

Typ­i­cally in an in­dus­trial set­ting we’ll have a few dozen differ­ent fac­tors which we think might af­fect our out­come, so we can de­sign to elimi­nate down to a hand­ful by us­ing a very ba­sic lin­ear model in a screen­ing ex­per­i­ment, then use a more so­phis­ti­cated de­sign called a cen­tral com­pos­ite de­sign.

Now if we want a mechanis­tic model, some­thing based on what we know on the physics of the situ­a­tion (say we have some differ­en­tial equa­tions de­scribing the re­ac­tion), then de­sign­ing be­comes harder, which is where my re­search is.
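The interaction point above can be made concrete with a toy example (the “true” model here is invented): a one-at-a-time design never varies X and Y together, so an interaction term is invisible to it, while the four corners of a 2×2 factorial design recover it directly:

```python
# Hidden "true" response, made up for illustration: interaction coefficient
# a = 2, main effects b = c = 1, no noise.
def response(x, y):
    return 2 * x * y + x + y

# One-at-a-time runs from the centre: x*y is zero in every run, so the
# data carries no information about the interaction at all.
oat = [(-1, 0), (1, 0), (0, -1), (0, 1)]
print([x * y for x, y in oat])  # [0, 0, 0, 0]

# Full 2^2 factorial: the corners of the square. Because the design is
# orthogonal, the interaction estimate is just the xy-weighted average.
corners = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
a_hat = sum(response(x, y) * x * y for x, y in corners) / 4
print(a_hat)  # recovers the true interaction coefficient, 2.0
```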

• While this is promis­ing in­deed, it is wise not to for­get about Op­ti­miza­tion By Proxy that can oc­cur when the thing be­ing op­ti­mised is (or is un­der the con­trol of) an in­tel­li­gent agent.

• The thing that makes me twitch about SPRs is a con­cern that they won’t change when the un­der­ly­ing con­di­tions which cre­ated their data sets change. This doesn’t mean that hu­mans are good at notic­ing that sort of thing, ei­ther. How­ever, it’s at least worth think­ing about which ap­proach is likely to over­shoot worse when some­thing sur­pris­ing hap­pens. Or whether there’s some rea­son to think that the greater usual ac­cu­racy of SPRs leads to enough big­ger re­serves that the oc­ca­sional over­shoot prob­lem (if such are worse than in a non-SPR sys­tem) is com­pen­sated for.

• Hi Luke,

Great post. Will be writ­ing some­thing about the le­gal uses of SPRs in the near fu­ture.

Any­way, the link to the Grove and Meehl study doesn’t seem to work for me. It says the file is dam­aged and can­not be re­paired.

• At­lantic, The Brain on Trial:

In the past, re­searchers have asked psy­chi­a­trists and pa­role-board mem­bers how likely spe­cific sex offen­ders were to re­lapse when let out of prison. Both groups had ex­pe­rience with sex offen­ders, so pre­dict­ing who was go­ing straight and who was com­ing back seemed sim­ple. But sur­pris­ingly, the ex­pert guesses showed al­most no cor­re­la­tion with the ac­tual out­comes. The psy­chi­a­trists and pa­role-board mem­bers had only slightly bet­ter pre­dic­tive ac­cu­racy than coin-flip­pers. This as­tounded the le­gal com­mu­nity.

So re­searchers tried a more ac­tu­ar­ial ap­proach. They set about record­ing dozens of char­ac­ter­is­tics of some 23,000 re­leased sex offen­ders: whether the offen­der had un­sta­ble em­ploy­ment, had been sex­u­ally abused as a child, was ad­dicted to drugs, showed re­morse, had de­viant sex­ual in­ter­ests, and so on. Re­searchers then tracked the offen­ders for an av­er­age of five years af­ter re­lease to see who wound up back in prison. At the end of the study, they com­puted which fac­tors best ex­plained the re­offense rates, and from these and later data they were able to build ac­tu­ar­ial ta­bles to be used in sen­tenc­ing.

Which fac­tors mat­tered? Take, for in­stance, low re­morse, de­nial of the crime, and sex­ual abuse as a child. You might guess that these fac­tors would cor­re­late with sex offen­ders’ re­ci­di­vism. But you would be wrong: those fac­tors offer no pre­dic­tive power. How about an­ti­so­cial per­son­al­ity di­s­or­der and failure to com­plete treat­ment? Th­ese offer some­what more pre­dic­tive power. But among the strongest pre­dic­tors of re­ci­di­vism are prior sex­ual offenses and sex­ual in­ter­est in chil­dren. When you com­pare the pre­dic­tive power of the ac­tu­ar­ial ap­proach with that of the pa­role boards and psy­chi­a­trists, there is no con­test: num­bers beat in­tu­ition. In court­rooms across the na­tion, these ac­tu­ar­ial tests are now used in pre­sen­tenc­ing to mod­u­late the length of prison terms.

• On in­ter­views, I had a great deal of suc­cess hiring for cler­i­cal as­sis­tant po­si­tions by sim­ply get­ting the in­ter­vie­wees to do a sim­ple prob­lem in front of us. It turned out to be a great, re­li­able and easy-to-jus­tify sorter of can­di­dates.

But, of course, it was nei­ther un­struc­tured nor much of an “in­ter­view” as such.

• Again, test not in­ter­view. Their GPA is an av­er­age mea­sure of maybe thou­sands of such sim­ple prob­lems—prob­a­bly on av­er­age more rigor­ously pro­duced, pre­sented, and cor­rected than your prob­lem pre­sented in the in­ter­view.

De­cid­ing based on a test in per­son in­stead of de­cid­ing on a num­ber that rep­re­sents thou­sands of such in­di­vi­d­ual tests smacks of anec­do­tal de­ci­sion-mak­ing.

• Since when did greater rigour and av­er­ag­ing of more prob­lems im­ply greater de­gree of cor­re­la­tion with perfor­mance at one spe­cific job?

I call halo effect here. Greater rigour, a bigger number, more accurate, more corrected: all these really ‘good’ qualities of the GPA value spill over into your feeling of how well it’ll correlate with performance at a specific job, versus a ‘bad’, ill-measured value.

Truth is, say, ill-measured hand size based on eyeballing can easily correlate better with measured finger length than body weight measured using ultra-high-precision scientific scales with accuracy of a milligram (microgram, nanogram, whatever) does. Just because hammer is a tool you build things with, and butter knife is a kitchen utensil, doesn’t make hammer better than butter knife as a screw driver.

• Just be­cause ham­mer is a tool you build things with, and but­ter knife is a kitchen uten­sil, doesn’t make ham­mer bet­ter than but­ter knife as a screw driver.

Well, ac­tu­ally...

But more on point, you’d need to jus­tify that the test you give is more cor­re­lated than GPA with perfor­mance—this is why I sup­port sim­ple pro­gram­ming tests (be­cause they demon­stra­bly are more cor­re­lated than aca­demic in­di­ca­tors) but for a ‘cler­i­cal as­sis­tant’ po­si­tion as de­scribed above, a spe­cific test doesn’t im­me­di­ately spring to mind, and so it’s sus­pect.

• You aren’t usually looking for ‘correlation’; you’re looking to screen out the serial job applicant who can’t do the job they’re applying for (and keeps re-applying to many places). Just ask ’em to do some work similar to what they will be doing, as per LorenzofromOz’s method, and you’ll at least be assured they can do the work. With GPA you won’t be assured of anything whatsoever.

For programming, the simplest, dumbest check works to screen out those entirely incapable, where screening by PhD would not.

http://www.codinghorror.com/blog/2007/02/why-cant-programmers-program.html

PhD might cor­re­late bet­ter with perfor­mance than fizzbuzz does (the lat­ter be­ing a bi­nary test of ex­tremely ba­sic knowl­edge), but PhD does not screen out those who will just waste your time, and fizzbuzz (your per­sonal vari­a­tion of it) does.
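For readers who haven’t seen it, the FizzBuzz screen referenced here asks for roughly the following (any working variant passes):

```python
def fizzbuzz(n):
    """Numbers 1..n, but multiples of 3 become 'Fizz', multiples of 5
    become 'Buzz', and multiples of both become 'FizzBuzz'."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out

print(fizzbuzz(15))
```

The point of the screen is not that this is hard; it’s that anyone who cannot produce something like it is filtered out before they waste interview time.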

• Holy crap… I think I had read about the FizzBuzz thing a while ago, but I didn’t remember the 199-in-200 thing… Would it be possible to sue the institutions issuing those PhDs or something? :-)

• Well, I don’t know what % of the CS-re­lated PhDs can’t do Fiz­zBuzz, maybe the per­centage is rather small. (Also, sue for what? You are not their client. The in­ca­pable dude that was given a de­gree, that’s their client. Your over-val­u­a­tion of this de­gree as ev­i­dence of ca­pa­bil­ity is your own prob­lem)

The is­sue is that, as Joel ex­plains, the job ap­pli­cants are a sam­ple ex­tremely bi­ased to­wards in­com­pe­tence:

http://www.joelonsoftware.com/items/2005/01/27.html

[Though I would think that the in­com­pe­tents with de­grees would be more able to find in­com­pe­tent em­ployer to work at. And PhDs should be able to find a com­pany that hires PhDs for sig­nal­ling rea­sons]

The is­sue with the hiring meth­ods here, is that we eas­ily con­fuse “more ac­cu­rate mea­sure­ment of X” with “stronger cor­re­la­tion to Y”, and “stronger cor­re­la­tion to Y” with hiring bet­ter staff (the one that doesn’t sink your com­pany), usu­ally out of some dra­mat­i­cally differ­ent pop­u­la­tion than the one on which cor­re­la­tion was found.

Furthermore, a ‘correlation’ is such an inexact measure of how a test relates to performance. Comparing correlations is like comparing apples to oranges by weight. The ‘fizzbuzz’-style problems measure performance near the absolute floor level, but with very high reliability. Virtually no-one who fails fizzbuzz is a good hire. Virtually no-one who passes fizzbuzz (a unique fizzbuzz, not the popular one) is completely incapable of programming. Degrees correlate with performance at a higher level, but with very low reliability: there are brilliant people with degrees, there are complete incompetents with degrees, and there are brilliant people and incompetents without degrees.

edit: other ex­am­ple:

Reversing a linked list is a good one, unless the candidate already knows how. See, the issue is that educational institutions don’t teach how to think up a way to reverse a linked list, nor do they test for that. They might teach how to reverse the linked list, then test whether the person can reverse the linked list. Some people learn to think of a way to solve such problems. Some don’t. It’s entirely incidental.

• Un­for­tu­nately, GPAs can lie. You can­not be cer­tain of the qual­ity of the prob­lems and eval­u­a­tion that was av­er­aged to pro­duce the GPA. So run­ning your own test of known difficulty works well to ver­ify what you see on the re­sume.

For ex­am­ple, I have to hire pro­gram­mers. We give all in­com­ing pro­gram­mers a few rel­a­tively easy pro­gram­ming prob­lems as part of the in­ter­view pro­cess be­cause we’ve found that no mat­ter what the re­sume says, it’s pos­si­ble that they ac­tu­ally do not know how to pro­gram.

Good re­sume + good in­ter­view re­sult is a much stronger in­di­ca­tor than good re­sume alone.

• A significant problem is the weighting of certain courses, particularly Advanced Placement ones. A GPA of 3.7, seeming quite respectable to the unaware, can be obtained with work of 83% quality, and that’s assuming the class didn’t offer extra credit.

• I don’t think he is likely to hire pro­gram­mers straight out of high school.

Giving IB/AP/Honors classes extra weight in high school is necessary to offset the additional difficulty of these classes. Otherwise, high school students would have a direct disincentive to take advanced classes.

• Giving IB/AP/Honors classes extra weight in high school is necessary to offset the additional difficulty of these classes. Otherwise, high school students would have a direct disincentive to take advanced classes.

A swift googling brings up this forth­com­ing study of about 900 high schools in Texas:

De­spite con­ven­tional wis­dom to the con­trary, grade weight­ing is not the pri­mary fac­tor driv­ing stu­dents to in­crease their AP course-tak­ing. More­over, a lack of in­sti­tu­tional knowl­edge about the im­por­tance of grade-weight­ing does not have a prac­ti­cally sig­nifi­cant ad­verse im­pact on stu­dents with low his­tor­i­cal par­ti­ci­pa­tion rates in AP, al­though low in­come stu­dents are marginally less re­spon­sive to in­creases in the AP grade weight than oth­ers. The min­i­mal con­nec­tion be­tween AP grade weights and course-tak­ing be­hav­ior may ex­plain why schools tin­ker with their weights, mak­ing changes in the hopes of find­ing the sweet spot that elic­its the de­sired stu­dent AP-tak­ing rates. The re­sults pre­sented here sug­gest that there is no sweet spot and that schools should look el­se­where for ways to in­crease par­ti­ci­pa­tion in rigor­ous courses.

• But there’s still the ad­di­tional in­cen­tive of pres­tige and sig­nal­ling, isn’t there? That should be enough for the se­ri­ous scholar. It’s a sig­nifi­cant prob­lem when non-AP-la­bel­led courses are of­ten passed over for the pur­pose of a cheap grade boost.

• The post men­tions the ex­perts us­ing the re­sults of the SPR. What hap­pens if you re­verse it, and give the SPR the pre­dic­tion of the ex­pert?

• That’s called a ‘boot­strapped’ SPR. It’s one way of build­ing an SPR. And yes, in many cases the SPR ends up be­ing re­li­ably bet­ter than the ex­pert judg­ments that were used to build it.

• I was won­der­ing more how much bet­ter it is than a nor­mal SPR. Also, I won­der what weight it would give to the ex­pert.

• People looking for additional resources on this matter should know that such linear models are often called “multi-attribute utility models” (MAUT), and that they’re discussed extensively in the literature of decision analysis and multi-criteria decision making. They’re also used in choice models in the science of marketing.

The word “statis­ti­cal” in the name used here is a bit of a red her­ring.

• AI sys­tems can gen­er­ally whoop hu­mans when a limited fea­ture set can be dis­cov­ered that cov­ers the span of a large class of ex­am­ples to good effect. The challenge is when you seem­ingly need a new fea­ture for each new ex­am­ple in or­der to differ­en­ti­ate it from the rest of the ex­am­ples in that class. Essen­tially you are say­ing that the prob­lem can be mapped to a sim­ple func­tion. Some prob­lems can.

Let’s imagine we are classifying avian vs. reptile. Our first example might be a gecko, and we might say “well, it’s green”. So “Color is Green” is a cue/feature, and that works coincidentally for a few more examples. Then you get a parrot as an example, and you decide to add “Has a beak”. Then you get the example of a turtle, and so you add “Has a shell”, etc. It seems to me the success of these systems boils down to whether the features can be added at a minimal rate compared to the examples on hand.

Where AIs compete well, they generally beat trained humans fairly marginally on easy (or even most) cases, and then fail miserably at borderline or novel cases. This can make it dangerous to use them if the extreme failures are dangerous.

As to why hu­mans can’t en­sem­ble with the ma­chines, I sus­pect that has mostly to do with the hu­mans not be­ing well-trained to do so.

• A fair point and good cau­tion against turn­ing SPRs into your ham­mer for ev­ery nail, but ir­rele­vant in the case luke­prog is dis­cussing; we already have the ex­pert sys­tem, we already know it works bet­ter than the ex­perts, we just aren’t us­ing it.

• Irrelevant is excessive. When you say ‘system A works better than system B’, this implies that system A should be used and this is clear-cut. But the notion ‘works better’ lacks a rigorous definition. Is the machine better if it wins 90% of the time by 5%, and fails the other 10% of the time by 40%? It’s not as simple as saying .9 × .05 > .1 × .4. The cost of error isn’t necessarily linear.

Now why these sys­tems aren’t used in en­sem­bles with hu­mans is in­deed a great ques­tion. I can imag­ine that in most cases we could also ask ‘why don’t we dou­ble the num­ber of ex­perts who are col­lab­o­rat­ing on a given prob­lem?’ un­der the pre­sump­tion that more minds would likely re­sult in a bet­ter perfor­mance across the board. I wouldn’t be sur­prised if there was a lot of over­lap in the an­swers. Co­or­di­na­tion difficulty is likely high up there. Thus,

con­sider the fact that even when ex­perts are given the re­sults of SPRs, they still can’t out­perform those SPRs

pos­si­bly be­comes the ex­pla­na­tion.

• When you say ‘sys­tem A works bet­ter than sys­tem B’ this im­plies that sys­tem A should be used and this is clear cut. But the no­tion ‘works bet­ter’ lacks a rigor­ous defi­ni­tion.

What? These are generally binary decisions, with a known cost to false positives and false negatives, and known rates of false positives and false negatives. It should be trivial to go from that to a utility-valued error score.
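The trivial conversion being pointed at is just an expected-cost calculation; all the rates and costs below are invented to show the shape of it:

```python
# Expected cost of a binary decision rule, given its error rates and the
# (asymmetric) costs of each error type. All numbers are invented.
def expected_cost(fp_rate, fn_rate, fp_cost, fn_cost):
    return fp_rate * fp_cost + fn_rate * fn_cost

# Suppose the SPR makes more cheap errors but fewer expensive ones:
spr = expected_cost(fp_rate=0.10, fn_rate=0.05, fp_cost=1.0, fn_cost=10.0)
expert = expected_cost(fp_rate=0.08, fn_rate=0.12, fp_cost=1.0, fn_cost=10.0)

print(spr, expert)  # the SPR wins once the costs are made explicit
```

Given known rates and costs, the comparison really is this mechanical; the dispute is over whether those costs are in fact well-defined.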

• You just pre­sumed away my ar­gu­ment. I claim speci­fi­cally that the re­la­tion­ship be­tween var­i­ous classes of er­rors is not well-defined. This can lead to abuse of the term ‘bet­ter’.

Please tell me why I should take that as a pre­sump­tion.

• Be­cause those are the class of prob­lems this post dis­cusses.

From the top of the post:

A pa­role board con­sid­ers the re­lease of a pris­oner: Will he be vi­o­lent again? A hiring officer con­sid­ers a job can­di­date: Will she be a valuable as­set to the com­pany? A young cou­ple con­sid­ers mar­riage: Will they have a happy mar­riage?

The cached wis­dom for mak­ing such high-stakes pre­dic­tions is to have ex­perts gather as much ev­i­dence as pos­si­ble, weigh this ev­i­dence, and make a judg­ment. But 60 years of re­search has shown that in hun­dreds of cases, a sim­ple for­mula called a statis­ti­cal pre­dic­tion rule (SPR) makes bet­ter pre­dic­tions than lead­ing ex­perts do.

• A pa­role board con­sid­ers the re­lease of a pris­oner: Will he be vi­o­lent again?

I think this is the kind of ques­tion that Miller is talk­ing about. Just be­cause a sys­tem is cor­rect more of­ten, doesn’t nec­es­sar­ily mean it’s bet­ter.

For example, if the human experts released more people who went on to commit relatively minor violent offences, while the SPRs did this less often but were more likely to release prisoners who went on to commit murder, then there would be legitimate discussion over whether the SPR is actually better.

I think this is ex­actly what he is talk­ing about when he says

Where AIs compete well, they generally beat trained humans fairly marginally on easy (or even most) cases, and then fail miserably at borderline or novel cases. This can make it dangerous to use them if the extreme failures are costly.

I don’t know whether there is evidence that this is a real effect, but to address it, what you really need to measure is the total utility of outcomes rather than accuracy.

• Yes. You got it, ex­actly.

• No. I’m talk­ing about classes of er­rors.

As in, which is bet­ter?

• A test that re­ports 100 false pos­i­tives for ev­ery 100 false nega­tives for dis­ease X

• A test that re­ports 110 false pos­i­tives for ev­ery 90 false nega­tives for dis­ease X

The cost of a false positive vs. a false negative is not defined automatically. If humans are closer to #1 than #2, and I develop a system like #2, I might define #2 to be better. Then later on down the line I stop talking about how I defined ‘better’, and I just use the word, and no one questions it because hey… better is better, right?
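To make that hidden choice explicit, here is a minimal sketch of a utility-weighted error score; the per-error costs are hypothetical, chosen only for illustration:

```python
# Compare the two tests above by expected cost, given explicit
# (assumed) costs per false positive and per false negative.

def expected_cost(false_pos, false_neg, cost_fp, cost_fn):
    """Utility-weighted error score: lower is better."""
    return false_pos * cost_fp + false_neg * cost_fn

# Test 1: 100 FP per 100 FN.  Test 2: 110 FP per 90 FN.
# Suppose a false negative costs twice as much as a false positive:
test1 = expected_cost(100, 100, cost_fp=1.0, cost_fn=2.0)  # 300.0
test2 = expected_cost(110, 90, cost_fp=1.0, cost_fn=2.0)   # 290.0
# Under these costs test 2 is "better"; reverse the two costs and
# test 1 wins instead. The bare word "better" hides this choice.
```

Whichever cost assignment you pick determines the verdict, which is exactly the point: the comparison is only well-defined once the costs are.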

• Which is more costly, false pos­i­tives or false nega­tives? This is an easy ques­tion to an­swer.

If false pos­i­tives, #1 is bet­ter. If false nega­tives, #2. I re­ally do not see what your point is. Th­ese prob­lems you bring up are eas­ily solved.

• Which is bet­ter: Re­leas­ing a vi­o­lent pris­oner, or keep­ing a harm­less one in­car­cer­ated? If you can find an an­swer that 90% of the pop­u­la­tion agrees on, then I think you’ve done bet­ter than ev­ery poli­ti­cian in his­tory.

That people do NOT agree suggests to me that it’s hardly a trivial question...

• Re­leas­ing a vi­o­lent pris­oner, or keep­ing a harm­less one in­car­cer­ated?

How vi­o­lent, how pre­ventably vi­o­lent, how harm­less, how in­car­cer­ated, how long in­car­cer­ated? For any spe­cific case with these agreed-upon, I am con­fi­dent a su­per­ma­jor­ity would agree.

That people do NOT agree suggests to me that it’s hardly a trivial question...

That people don’t agree suggests one side is comparing releasing a serial killer to incarcerating a drifter in jail a short while, and the other side is comparing releasing a middle-aged man who in a fit of passion struck his adulterous wife to incarcerating Gandhi for the term of his natural life. More generally, they are deciding based on one specific example they have strongly available to them.

As you phrased it, that question is about as answerable as “how long is a piece of string?”.

• Yes. Thank you. Since at least one per­son un­der­stood me, I’m gonna jump off the merry-go-round at this point.

• (For reference, I realize an expert runs into the same issue; I just think it’s unfair to say that the issue is “easily solved”.)

• Many tests have a continuous, adjustable parameter for sensitivity, letting you set the trade-off however you want. In that case, we can refrain from judging the relative badness of false positives and false negatives, and use the ROC area (AUC), which is basically the integral over all such trade-offs. Tests that are going to be combined into a larger predictor are usually measured this way.
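The “integral over all trade-offs” idea can be sketched directly; this is a hypothetical, self-contained computation rather than any particular package’s API. It uses the standard equivalence that the ROC area equals the probability that a randomly chosen positive example outscores a randomly chosen negative one:

```python
def roc_area(scores, labels):
    """ROC area (AUC): the probability that a random positive
    example outscores a random negative one, which equals the
    integral of the ROC curve over all sensitivity trade-offs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 * (p > n) + 0.5 * (p == n)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: a test that ranks every positive above every negative
# gets the maximum score of 1.0; random ranking averages 0.5.
auc = roc_area([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])  # 1.0
```

Because the score sweeps over every possible threshold, it never commits to a particular false-positive/false-negative trade-off.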

Machine learning packages generally let you specify a “cost matrix”, which gives the cost of each possible confusion. For a 2-valued test, it would be a 2x2 matrix with zeroes on the diagonal, and the costs of A->B and B->A errors in the other two spots. For a test with N possible results, the matrix is NxN, with zeroes on the diagonal, and each (row, col) position is the cost of a mistake that confuses the result corresponding to that row with the result corresponding to that column.
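As a sketch of how such a cost matrix gets used (all the numbers here are made up): multiply each confusion count by its cost and sum.

```python
# Hypothetical 2x2 cost matrix: zeroes on the diagonal, off-diagonal
# entries give the cost of confusing one result for the other.
cost = [[0.0, 5.0],   # true A predicted as A, as B
        [1.0, 0.0]]   # true B predicted as A, as B

# Confusion counts from some (made-up) evaluation run.
confusion = [[90, 10],
             [20, 80]]

# Total cost: each kind of error weighted by its matrix entry.
total_cost = sum(confusion[i][j] * cost[i][j]
                 for i in range(2) for j in range(2))
# 10 errors at cost 5.0 plus 20 errors at cost 1.0 = 70.0
```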

• Keep in mind this is in the con­clu­sion of luke­prog’s post:

When there ex­ists a re­li­able statis­ti­cal pre­dic­tion rule for the prob­lem you’re considering

Now,

But the notion ‘works better’ lacks a rigorous definition. Is the machine better if it wins 90% of the time by 5%, and fails the other 10% by 40%? It’s not as simple as saying 0.9 × 0.05 > 0.1 × 0.4. The cost of error isn’t necessarily linear.

If the cost of er­ror isn’t lin­ear, de­ter­mine what func­tion it fol­lows, then use that func­tion in­stead of a lin­ear func­tion to com­pare the rel­a­tive costs, which will tell you which works bet­ter.
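For instance, with the 90%/5% vs. 10%/40% numbers above, a hypothetical quadratic cost function flips the verdict (a sketch under an assumed cost shape, not a claim about any real domain):

```python
def expected_gain(cost):
    """Machine's expected advantage over the human: it wins 90% of
    the time by a 5% margin and loses 10% of the time by 40%."""
    return 0.9 * cost(0.05) - 0.1 * cost(0.40)

linear = lambda e: e          # cost proportional to error size
quadratic = lambda e: e ** 2  # big errors disproportionately bad

# Under linear cost the gain is positive (machine "works better");
# under quadratic cost the gain is negative (the human does).
assert expected_gain(linear) > 0
assert expected_gain(quadratic) < 0
```

Once the cost function is pinned down, the comparison really is mechanical; the disagreement is only about which function applies.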

Ir­rele­vant is ex­ces­sive.

I stand by it. The post is say­ing, given that SPRs work, work bet­ter than ex­perts, and don’t fail where ex­perts don’t, we should use them in­stead of ex­perts. Your points were that SPRs don’t always work, tend not to work in bor­der cases, and might fail in dan­ger­ous cases. The first point is only true in cases this post is not con­cerned with, the sec­ond is equally true of ex­perts and SPRs, and the third is also equally true of ex­perts and SPRs.

• Also, there is an article by Dawes, Faust and Meehl. Despite being published 7 years prior to House of Cards, it contains some information not described in chapter 3 of House of Cards.

For ex­am­ple, the awe­some re­sult by Gold­berg: lin­ear mod­els of hu­man judges were more ac­cu­rate than hu­man judges them­selves:

in cases of dis­agree­ment, the mod­els were more of­ten cor­rect than the very judges on whom they were based.

• Thank you for this article. Some people may react badly to finding that their professional opinion is less accurate than a simple formula, but I get excited instead. It’s such a great opportunity to become more accurate, with such comparatively little effort! I’m particularly interested in the medical SPRs; I aim to be a doctor, and if these will help me be better than the average doctor in many cases, then so be it. I suspect that I’ll have to use them secretly.

• Other re­lated read­ing that I don’t think has been men­tioned yet:

Ian Ayres (cofounder of stickK.com) has a popular book called Super Crunchers that argues this exact thesis. http://www.amazon.com/Super-Crunchers-Thinking-Numbers-Smart/dp/0553805401

A classic is Tetlock’s Expert Political Judgment. http://press.princeton.edu/titles/7959.html

• I think the reason I don’t use statistics more often is the difficulty of getting good data sets; and even when there is good data, there are often ethical problems with following it. For example: Bob lives in America, and is seeking to maximize his happiness. Americans who report high levels of spiritual conviction are twice as likely to report being “very happy” as the least religious. Should he become a devout Christian? There’s evidence that the happiness comes from holding the majority opinion; should he then strive to believe whatever the polls say is the most common belief in his area?

Another ex­am­ple: Bob has three kids; he knows his wife is cheat­ing on him, but he also knows the effect size of di­vorce on child out­comes (de­pres­sion, grades, in­come, sta­bil­ity of fu­ture re­la­tion­ships, etc.) is larger than smok­ing on lung can­cer, as­pirin on heart at­tacks, or cy­closporine on or­gan trans­plants. When do the bad effects of stay­ing in the mar­riage out­weigh the bad effects of split­ting up?

• Bob should not be­come a Chris­tian to be­come hap­pier for the same rea­son that he should not stay away from hos­pi­tals if he’s sick (af­ter all, sick peo­ple are a lot more likely to be in a hos­pi­tal).

• Cosma Shal­izi has a nice bibliog­ra­phy here

60 years of research

I would like to emphasize this part. It’s not just scattered papers back then. Meehl wrote a book surveying the field in 1954.

• Another example of this: the US political models did fantastically in predicting all sorts of outcomes on election day 2012, far exceeding all sorts of pundits or people adjusting the numbers based on gut feelings and assumptions, despite often being pretty simple or tantamount to poll averaging.

• Just felt like saying thank you to lukeprog and all those who commented; this has been a great help to me in deciding what to read next about determining guaranteed values for the service my department performs.

• Hu­mans use more com­plex util­ity func­tions to eval­u­ate some­thing like mar­tial hap­piness. If you train a statis­ti­cal model on a straight nu­meric value for mar­tial hap­piness than the model only op­ti­mizes to­wards that spe­cific as­pect of hap­piness.

A good eval­u­a­tion should test the model that trained on he­do­nis­tic hap­piness rat­ing on some­thing like the like­li­hood of di­vorce.

• I think you mean “mar­i­tal” here. (De­spite the similar­i­ties, love is not a bat­tlefield.)

• Okay, English isn’t my first lan­guage.

• English isn’t my first language

You could eas­ily have made the same typo even if it were; we’re talk­ing about the mere trans­po­si­tion of two ad­ja­cent let­ters.

(Another ex­am­ple: “ca­sual” vs. “causal”, which of­ten trips me up in read­ing.)

• (Another ex­am­ple: “ca­sual” vs. “causal”, which of­ten trips me up in read­ing.)

In Ital­ian that’s even worse, since causale does mean ‘causal’ but ca­suale means ‘ran­dom’.

• (Another ex­am­ple: “ca­sual” vs. “causal”, which of­ten trips me up in read­ing.)

Cool, that means you would get the joke about how “women are in­ter­ested in causal sex”!

• Is there acausal sex? (Would that be, like, hav­ing (phone/​cy­ber)sex with some­one in a differ­ent Teg­mark uni­verse via some form of com­mu­ni­ca­tion built on UDT acausal trade?)

• Acausal sex­ual re­pro­duc­tion is quite plau­si­ble, in a sense. Sup­pose you were a sin­gle woman liv­ing in a so­ciety with ac­cess to so­phis­ti­cated ge­netic en­g­ineer­ing, and you wanted to give birth to a child that was biolog­i­cally yours and not do any un­nat­u­ral op­ti­miz­ing. You could en­vi­sion your ideal mate in de­tail, re­verse-en­g­ineer the ge­net­ics of this man, and then cre­ate a sperm pop­u­la­tion that the man could have pro­duced had he ex­isted. I can eas­ily imag­ine a ge­netic en­g­ineer offer­ing this ser­vice: you walk into the office, de­scribe the man’s phys­i­cal at­tributes, per­son­al­ity, and even life his­tory, and the en­g­ineer does the rest as much as is pos­si­ble (in this so­ciety, we know that a plu­ral­ity of men who played short­stop in Lit­tle League have a cer­tain allele, etc.) The child could grow up and mean­ingfully learn things about the coun­ter­fac­tual father—if you learned that the father was prone to de­pres­sion, that would mean that you should watch out for that as well.

If the mother re­ally wants to, she can take things fur­ther and spec­ify that the man should be the kind of per­son who would have, had he ex­isted, gone through the analo­gous pro­ce­dure (with a sur­ro­gate or ar­tifi­cial womb), and that the coun­ter­fac­tual woman he would have speci­fied would have been her. In this case, we can say that the man and the woman have acausally re­pro­duced.

• Hmm. So the man has man­aged to “acausally re­pro­duce”, fulfill his util­ity func­tion, in spite of not ex­ist­ing. You could go fur­ther and posit an imag­i­nary cou­ple who would have cho­sen each other for the pro­ce­dure—so they suc­ceed in “acausally re­pro­duc­ing”, even though nei­ther of them ex­ists. Then when some­one tries to write a story about the imag­i­nary cou­ple, the child be­comes ob­serv­able to the writer and starts do­ing some re­pro­duc­ing of her own :-)

• My in­ter­pre­ta­tion of acausal sex­ual re­pro­duc­tion would be some­thing more like All You Zom­bies.

• What makes this acausal? That is, when are fu­ture in­puts mod­ify­ing pre­sent re­sults? Or are you us­ing a differ­ent defi­ni­tion of acausal?

• I meant it in the sense of ata’s par­ent com­ment, al­though there is a back­wards ar­row in there: the phe­no­type is de­ter­min­ing the geno­type rather than vice versa.

• That pa­per is ab­solutely brilli­ant! I kept laugh­ing ev­ery time a new clearly log­i­cally rea­soned yet hu­morous de­tail was ex­plored.

• Is there acausal sex? (Would that be, like, hav­ing (phone/​cy­ber)sex with some­one in a differ­ent Teg­mark uni­verse via some form of com­mu­ni­ca­tion built on UDT acausal trade?)

If you’re bas­ing the sex on acausal trade then you should per­haps re­fer to it as acausal pros­ti­tu­tion. Or pos­si­bly acausal mar­riage.

• Si­mu­late agent.

• Check if it tries to do the same for you.

• If it does, build it a body and have sex.

• In a galaxy far far away, an agent simu­lates you, sees you try to do the same for them.

• It clones you and has sex.

Does this fit the bill?

• It’s in­ter­est­ing to me that the proper lin­ear model ex­am­ple is es­sen­tially a stripped down ver­sion of a very sim­ple neu­ral net­work with a lin­ear ac­ti­va­tion func­tion.

• Is that re­ally true? Couldn’t one say that of just about any Tur­ing-com­plete (or less) model of com­pu­ta­tion?

‘Oh, it’s in­ter­est­ing that they are re­ally just a sim­ple unary fixed-length lambda-calcu­lus func­tion with con­stant-value pa­ram­e­ters.’

‘Oh, it’s in­ter­est­ing that they are re­ally just re­stricted petri-nets with bounded branch­ing fac­tors.’

‘Oh, it’s in­ter­est­ing that these are mod­e­lable by finite au­tomata.’

etc. (Plau­si­ble-sound­ing gob­bledy­gook in­cluded to make the point.)

• Yes, sort of, but a) a lin­ear clas­sifier is not a Tur­ing-com­plete model of com­pu­ta­tion, and b) there is a clear re­sem­blance that can be seen by merely glanc­ing at the equa­tions.

• I would ar­gue that neu­rons, neu­ral nets, SPRs, and ev­ery­one else do­ing lin­ear re­gres­sion use those tech­niques be­cause it’s the sim­plest way to ag­gre­gate data.

• I’m skep­ti­cal, and will now pro­ceed to ques­tion some of the as­ser­tions made/​refer­ences cited. Note that I’m not trained in statis­tics.

Un­for­tu­nately, most of the ar­ti­cles cited are not eas­ily available. I would have liked to check the method­ol­ogy of a few more of them.

|For ex­am­ple, one SPR de­vel­oped in 1995 pre­dicts the price of ma­ture Bordeaux red wines at auc­tion bet­ter than ex­pert wine tasters do.

The paper doesn’t actually establish what you say it does. There is no statistical analysis of expert wine tasters, only one or two anecdotal statements of their fury at the whole idea. Instead, the SPR is compared to actual market prices, not to experts’ predictions. I think it’s fair to say that the claim I quoted is an overreach.

Now, about the model and its fit to data. Note that the SPR is older than 1995, when the pa­per was pub­lished. The NYTimes ar­ti­cle about it which you refer­ence is from 1990 (the pa­per bizarrely dates it to 1995; I’m not sure what’s go­ing on there).

The fact that there’s a linear model (not specified precisely anywhere in the article) which is a good fit to wine prices for the 1961-1972 vintages (Table 3 in the paper) is not, I think, very significant on its own. To judge the model, we should look at what it predicts for upcoming years. Both the paper and the NYTimes article make two specific predictions. First, the 1986 vintage, claimed to be extolled by experts early on, will prove mediocre because of the weather conditions that year (see Figure 3 in the paper; 1986 is clearly the worst of the 1980s). NYTimes says: “When the dust settles, he predicts, it will be judged the worst vintage of the 1980′s, and no better than the unmemorable 1974′s or 1969′s”. The 1995 paper says, more modestly: “We should expect that, in due course, the prices of these wines will decline relative to the prices of most of the other vintages of the 1980s”. Second, the 1989 and 1990 vintages are predicted to be “outstanding” (paper), “stunningly good” (NYTimes), and, “adjusted for age, will outsell at a significant premium the great 1961 vintage” (NYTimes).

It’s now 16 years later. How do we test these pre­dic­tions?

First, I’ve stum­bled on a differ­ent pa­per from the pri­mary au­thor, Prof. Ashen­felter, from 2007. Pub­lished 12 years later than the one you refer­ence, this pa­per has sub­stan­tially the same con­tents, with whole pages copied ver­ba­tim from the ear­lier one. That, by it­self, wor­ries me. Even more wor­ry­ing is the fact that the 1986 pre­dic­tion, promi­nent in the 1990 ar­ti­cle and the 1995 pa­per, is com­pletely miss­ing from the 2007 pa­per (the data be­low might in­di­cate why). And most wor­ry­ing of all is the change of lan­guage re­gard­ing the 1989/​1990 pre­dic­tion. The 1995 pa­per says about its pre­dic­tion that the 1989/​1990 will turn out to be out­stand­ing, “Many wine writ­ers have made the same pre­dic­tions in the trade mag­a­z­ines”. The 2007 pa­per says “Iron­i­cally, many pro­fes­sional wine writ­ers did not con­cur with this pre­dic­tion at the time. In the years that have fol­lowed minds have been changed; and there is now vir­tu­ally unan­i­mous agree­ment that 1989 and 1990 are two of the out­stand­ing vin­tages of the last 50 years.”

Uhm. Right. Well, be­cause the claims aren’t strong enough, they do not ex­actly con­tra­dict each other, but this change leaves a bad taste. I don’t think I should give much trust to these pa­pers’ claims.

The data I could find quickly to test the pre­dic­tions is here. The prices are bro­ken down by the chateaux, by the vin­tage year, the pack­ag­ing (I’ve always cho­sen BT—bot­tle), and the auc­tion year (I’ve always cho­sen the last year available, typ­i­cally 2004). Un­for­tu­nately, Ashen­felter un­der­speci­fies how he came up with the ag­gre­gate prices for a given year—he says he chose a pack­age of the best 15 winer­ies, but doesn’t say which ones or how the prices are com­bined. I used 5 winer­ies that are speci­fied as the best in the 2007 pa­per, and looked up the prices for years 1981-1990. The data is in this spread­sheet. I haven’t tried to statis­ti­cally an­a­lyze it, but even from a quick glance, I think the fol­low­ing is clear. 1986 did not sta­bi­lize as the worst year of the 1980s. It is fre­quently sec­ond- or third-best of the decade. It is always much bet­ter than ei­ther 1984 or 1987, which are sup­posed to be vastly bet­ter ac­cord­ing to the 1995 pa­per’s weather data (see Figure 3). 1989/​1990 did turn out well, es­pe­cially 1990. Still, they’re both nearly always less ex­pen­sive than 1982, which is again vastly in­fe­rior in the weather data (it isn’t even in the best quar­ter). Over­all, I fail to see much cor­re­la­tion be­tween the weather data in the pa­per for the 1980s, the spe­cific claims about 1986 and 1989/​1990, and the mar­ket prices as of 2004. I wouldn’t recom­mend us­ing this SPR to pre­dict mar­ket prices.

Now, this was the first ex­am­ple in your post, and I found what I be­lieve to be sub­stan­tial prob­lems with its method­ol­ogy and the qual­ity of its SPR. If I were to pro­ceed and ex­am­ine ev­ery ex­am­ple you cite in the same de­tail, would I en­counter many such prob­lems? It’s difficult to tell, but my pre­dic­tion is “yes”. I an­ti­ci­pate overfit­ting and shoddy method­ol­ogy. I an­ti­ci­pate huge in­fluence of the se­lec­tion bias—the au­thors that pub­lish these kinds of pa­pers will not pub­lish a pa­per that says “The ex­perts were bet­ter than our SPR”. And fi­nally, I an­ti­ci­pate over­reach­ing claims of wide-reach­ing ap­pli­ca­bil­ity of the mod­els, based on pa­pers that ac­tu­ally in­di­cate mod­est effect in a very spe­cific situ­a­tion with a small sam­ple size.

I’ve looked at your sec­ond ex­am­ple:

|Howard and Dawes (1976) found they can re­li­ably pre­dict mar­i­tal hap­piness with one of the sim­plest SPRs ever con­ceived, us­ing only two cues: P = [rate of love­mak­ing] - [rate of fight­ing].

I couldn’t find the original paper, but the results are summarised in Dawes (1979). Looking at it, it turns out that when you say “predict marital happiness”, it really means “predicts one of the partners’ subjective opinion of their marital happiness”, as opposed to e.g. the stability of the marriage over time. There’s no indication of how the partner to question was chosen from each pair (e.g. whether the experimenter knew the rate when they chose). There was very good correlation with the binary outcome (happy/unhappy), but when a finer scale of 7 degrees of happiness was used, the correlation was weak: about 0.4. In a follow-up experiment, the correlation went up to 0.8, but there the subjects looked at the lovemaking/fighting statistics before opining on the degree of happiness, thus contaminating their decision. And even in the earlier experiment, the subjects had been recording those lovemaking/fighting statistics in the first place, so it would make sense for them to recall those events when asked to assess whether their marriage is a happy one. Overall, the model is witty and naively appears to be useful, but the suspect methodology and the relatively weak correlation encourage me to discount the analysis.

Fi­nally, the fol­low­ing claim is the sin­gle most ob­jec­tion­able one in your post, to my taste:

|If you’re hiring, you’re prob­a­bly bet­ter off not do­ing in­ter­views.

My own experience strongly suggests to me that this claim is inane, and is highly dangerous advice. I’m not able to view the papers you base it on, but if they’re anything like the first and second example, they’re far, far away from convincing me of the truth of this claim, which I in any case strongly suspect to overreach gigantically beyond what the papers are proving. It may be true, for example, that a very large body of hiring decision-makers in a huge organisation or a state on average make poorer decisions based on their professional judgement during interviews than they would have made based purely on the resume. I can see how this claim might be true, because any such very large body must be largely incompetent. But it doesn’t follow that it’s good advice for you to abstain from interviewing; it would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference. My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).

• I was think­ing of writ­ing a post about Bishop & Trout when I didn’t see it men­tioned on this site be­fore, but I’m glad you beat me to it. (Among other things, I lent out my copy and so would have difficulty writ­ing up a re­view). It’s a great book.

• Your up­load of Dawes’s “The Ro­bust Beauty of Im­proper Lin­ear Models in De­ci­sion Mak­ing” seems to be bro­ken- at least, I’m not able to ac­cess it.

• at least, I’m not able to ac­cess it.

Me neither.

• Dang. Fixed.

• Wow. I highly recom­mend read­ing the Dawes pdf, it’s illu­mi­nat­ing:

Ex­pert doc­tors coded [vari­ables from] biop­sies of pa­tients with Hodgkin’s dis­ease and then made an over­all rat­ing of the sever­ity of the pro­cess. The over­all rat­ing did not pre­dict the sur­vival time of the 193 pa­tients, all of whom died. (The cor­re­la­tions of sur­vival time with rat­ings was vir­tu­ally 0, some in the wrong di­rec­tion). The vari­ables that the doc­tors coded, how­ever, did pre­dict sur­vival time when they were used in a mul­ti­ple re­gres­sion model.

In sum­mary, proper lin­ear mod­els work for a very sim­ple rea­son. Peo­ple are good at pick­ing out the right pre­dic­tor vari­ables … Peo­ple are bad at in­te­grat­ing in­for­ma­tion from di­verse and in­com­pa­rable sources. Proper lin­ear mod­els are good at such in­te­gra­tion …

He then goes on to show that im­proper lin­ear mod­els still beat hu­man judg­ment. If your re­ac­tion to the top-level post wasn’t en­dorse­ment of statis­ti­cal meth­ods for these prob­lems, this pdf is a bunch more ev­i­dence that you can use to up­date your be­liefs about statis­ti­cal meth­ods of pre­dic­tion.
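Dawes’s “improper” model is easy to sketch: standardize each cue and add them with unit weights, with no fitting at all. The cues and target below are made up purely for illustration:

```python
def mean(xs):
    return sum(xs) / len(xs)

def standardize(xs):
    """Rescale a list to mean 0 and standard deviation 1."""
    m = mean(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / sd for x in xs]

def corr(xs, ys):
    """Pearson correlation: mean product of the two z-scores."""
    zx, zy = standardize(xs), standardize(ys)
    return mean([a * b for a, b in zip(zx, zy)])

# Two hypothetical cues and a target variable.
c1 = [1.0, 2.0, 3.0, 4.0, 5.0]
c2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y  = [1.5, 1.8, 3.4, 3.6, 5.0]

# Improper linear model: unit weights on standardized cues.
prediction = [a + b for a, b in zip(standardize(c1), standardize(c2))]
r = corr(prediction, y)  # high: picking the right cues did the work
```

No regression was run anywhere; the only human contribution was choosing which cues to include, which is exactly the division of labor the quoted summary describes.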

• Peo­ple are good at pick­ing out the right pre­dic­tor vari­ables … Peo­ple are bad at in­te­grat­ing in­for­ma­tion from di­verse and in­com­pa­rable sources.

That is a beau­tiful sum­mary sen­tence, in­ci­den­tally, and I am tak­ing it with me as a short­hand “han­dle” for this whole idea.

I find it works well as a sur­face-level counter for the (alas, still in­ap­pro­pri­ately com­pel­ling) idea that a dumb al­gorithm can’t get more ac­cu­rate re­sults than a smart ob­server.

• Another pos­si­ble metaphor is the pocket calcu­la­tor.

It can find a num­ber for any ex­pres­sion you can put into it, and in most cases it can do it way faster and more ac­cu­rately than a hu­man could. How­ever, that doesn’t make it a re­place­ment for a hu­man. An in­tel­li­gent agent like a hu­man is still needed for the cru­cial part of figur­ing out what ex­pres­sion would be mean­ingful to put into it.

• That is a very helpful metaphor for wrap­ping my head around both the ad­van­tages and limi­ta­tions of SPR, thank you! :)

• I can­not help un­leash­ing an evil laugh when­ever I dis­cover an­other tool to aid in world dom­i­na­tion. Thank you.

• Thinking about it, the main critiques I have of this article are:

• It only lists cases where the SPR ‘outperformed’ expertise, and in most of these the ‘experts’ are just people who never did any proper training (with exercises and testing) to perform the task in question.

• It equates better correlation with “outperforms”, which is not the same thing. Maximum correlation happens when you classify into those with less-than-average risk of recidivism and those with greater-than-average risk. A parole board is not even supposed to work like this, AFAIK.

• If some SPR can ‘outperform’ average HR expertise, it doesn’t mean the SPR outperforms the best expertise. An example where it matters: if you are a software start-up company founder and your expertise is average, your start-up will almost inevitably fail. Only a small percentage succeed: the top 1% or less. You strive to maximize your chances of making it into the top 1%, not into the top 50%.

• What about ethical issues? Race correlates with criminality, for example.

edit: I’m not fully sure at the moment when maximum correlation happens.