# Statistical Prediction Rules Out-Perform Expert Human Judgments

A pa­role board con­sid­ers the re­lease of a pris­oner: Will he be vi­o­lent again? A hiring officer con­sid­ers a job can­di­date: Will she be a valuable as­set to the com­pany? A young cou­ple con­sid­ers mar­riage: Will they have a happy mar­riage?

The cached wis­dom for mak­ing such high-stakes pre­dic­tions is to have ex­perts gather as much ev­i­dence as pos­si­ble, weigh this ev­i­dence, and make a judg­ment. But 60 years of re­search has shown that in hun­dreds of cases, a sim­ple for­mula called a statis­ti­cal pre­dic­tion rule (SPR) makes bet­ter pre­dic­tions than lead­ing ex­perts do. Or, more ex­actly:

When based on the same ev­i­dence, the pre­dic­tions of SPRs are at least as re­li­able as, and are typ­i­cally more re­li­able than, the pre­dic­tions of hu­man ex­perts for prob­lems of so­cial pre­dic­tion.1

For ex­am­ple, one SPR de­vel­oped in 1995 pre­dicts the price of ma­ture Bordeaux red wines at auc­tion bet­ter than ex­pert wine tasters do. Re­ac­tion from the wine-tast­ing in­dus­try to such wine-pre­dict­ing SPRs has been “some­where be­tween vi­o­lent and hys­ter­i­cal.”

How does the SPR work? This par­tic­u­lar SPR is called a proper lin­ear model, which has the form:

P = w1(c1) + w2(c2) + w3(c3) + … + wn(cn)

The model calcu­lates the summed re­sult P, which aims to pre­dict a tar­get prop­erty such as wine price, on the ba­sis of a se­ries of cues. Above, cn is the value of the nth cue, and wn is the weight as­signed to the nth cue.2

In the wine-pre­dict­ing SPR, c1 re­flects the age of the vin­tage, and other cues re­flect rele­vant cli­matic fea­tures where the grapes were grown. The weights for the cues were as­signed on the ba­sis of a com­par­i­son of these cues to a large set of data on past mar­ket prices for ma­ture Bordeaux wines.3
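To make the form above concrete, here is a minimal Python sketch of a proper linear model whose weights are fit to past data. It assumes ordinary least squares as the fitting method, which the post does not specify. Every number is made up for illustration, and the cue choices (an intercept, vintage age, growing-season temperature) and the helper names `fit_weights` and `predict` are hypothetical; this is not the actual wine model.

```python
def fit_weights(cues, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    n = len(cues[0])
    # Augmented matrix [X^T X | X^T y], one row per weight.
    a = [
        [sum(row[i] * row[j] for row in cues) for j in range(n)]
        + [sum(row[i] * y for row, y in zip(cues, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict(weights, cue_values):
    """P = w1*c1 + w2*c2 + ... + wn*cn."""
    return sum(w * c for w, c in zip(weights, cue_values))


# Hypothetical history: each row is [1.0 (intercept), age in years, temperature].
past_cues = [
    [1.0, 10.0, 16.0],
    [1.0, 20.0, 17.0],
    [1.0, 30.0, 15.0],
    [1.0, 15.0, 18.0],
    [1.0, 25.0, 16.5],
]
# Toy prices generated from a known rule so the fit can be checked by eye.
past_prices = [2.0 + 0.5 * row[1] + 3.0 * row[2] for row in past_cues]

weights = fit_weights(past_cues, past_prices)
estimate = predict(weights, [1.0, 12.0, 17.5])
```

Because the toy prices come from a known rule with no noise, the fitted weights recover that rule. Real auction data would be noisy, and the fit would only be approximate.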

There are other ways to con­struct SPRs, but rather than sur­vey these de­tails, I will in­stead sur­vey the in­cred­ible suc­cess of SPRs.

• Howard and Dawes (1976) found they can re­li­ably pre­dict mar­i­tal hap­piness with one of the sim­plest SPRs ever con­ceived, us­ing only two cues: P = [rate of love­mak­ing] - [rate of fight­ing]. The re­li­a­bil­ity of this SPR was con­firmed by Ed­wards & Ed­wards (1977) and by Thorn­ton (1979).

• Unstructured interviews reliably degrade the decisions of gatekeepers (e.g. hiring and admissions officers, parole boards, etc.). Gatekeepers (and SPRs) make better decisions on the basis of dossiers alone than on the basis of dossiers and unstructured interviews (Bloom and Brundage 1947; DeVaul et al. 1957; Oskamp 1965; Milstein et al. 1981; Hunter & Hunter 1984; Wiesner & Cronshaw 1988). If you’re hiring, you’re probably better off not doing interviews.

• Wittman (1941) con­structed an SPR that pre­dicted the suc­cess of elec­troshock ther­apy for pa­tients more re­li­ably than the med­i­cal or psy­cholog­i­cal staff.

• Carroll et al. (1988) found an SPR that predicts criminal recidivism better than expert criminologists.

• An SPR con­structed by Gold­berg (1968) did a bet­ter job of di­ag­nos­ing pa­tients as neu­rotic or psy­chotic than did trained clini­cal psy­chol­o­gists.

• SPRs regularly predict academic performance better than admissions officers, whether for medical schools (DeVaul et al. 1957), law schools (Swets, Dawes and Monahan 2000), or graduate school in psychology (Dawes 1971).

• SPRs predict loan and credit risk better than bank officers (Stillwell et al. 1983).

• SPRs predict newborns at risk for Sudden Infant Death Syndrome better than human experts do (Lowry 1975; Carpenter et al. 1977; Golding et al. 1985).

• SPRs are bet­ter at pre­dict­ing who is prone to vi­o­lence than are foren­sic psy­chol­o­gists (Faust & Ziskin 1988).

• Libby (1976) found a sim­ple SPR that pre­dicted firm bankruptcy bet­ter than ex­pe­rienced loan officers.
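The two-cue marital rule in the list above is the same linear form with the weights fixed at +1 and -1 rather than fit to data. A minimal Python sketch follows. The couples, their rates, and the choice to read a positive score as "predicted happy" are illustrative assumptions and are not taken from the cited studies.

```python
def marital_score(lovemaking_rate, fighting_rate):
    """P = [rate of lovemaking] - [rate of fighting], both over the same period."""
    return lovemaking_rate - fighting_rate


# Hypothetical couples: (lovemaking events, fights) over the same period.
couples = {"A": (12, 3), "B": (4, 9)}

scores = {name: marital_score(love, fight) for name, (love, fight) in couples.items()}

# Assumed cutoff: a positive score is read as "predicted happy".
predicted_happy = {name: score > 0 for name, score in scores.items()}
```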

And that is barely scratch­ing the sur­face.

If this is not amaz­ing enough, con­sider the fact that even when ex­perts are given the re­sults of SPRs, they still can’t out­perform those SPRs (Leli & Filskov 1985; Gold­berg 1968).

So why aren’t SPRs in use ev­ery­where? Prob­a­bly, sug­gest Bishop & Trout, we deny or ig­nore the suc­cess of SPRs be­cause of deep-seated cog­ni­tive bi­ases, such as over­con­fi­dence in our own judg­ments. But if these SPRs work as well as or bet­ter than hu­man judg­ments, shouldn’t we use them?

Robyn Dawes (2002) drew out the nor­ma­tive im­pli­ca­tions of such stud­ies:

If a well-val­i­dated SPR that is su­pe­rior to pro­fes­sional judg­ment ex­ists in a rele­vant de­ci­sion mak­ing con­text, pro­fes­sion­als should use it, to­tally ab­sent­ing them­selves from the pre­dic­tion.

Some­times, be­ing ra­tio­nal is easy. When there ex­ists a re­li­able statis­ti­cal pre­dic­tion rule for the prob­lem you’re con­sid­er­ing, you need not waste your brain power try­ing to make a care­ful judg­ment. Just take an out­side view and use the damn SPR.4

1 Bishop & Trout, Epistemology and the Psychology of Human Judgment, p. 27. The definitive case for this claim is made in a 1996 study by Grove & Meehl that surveyed 136 studies yielding 617 comparisons between the judgments of human experts and SPRs (in which humans and SPRs made predictions about the same cases and the SPRs never had more information than the humans). Grove & Meehl found that of the 136 studies, 64 favored the SPR, 64 showed roughly equal accuracy, and 8 favored human judgment. Since these last 8 studies "do not form a pocket of predictive excellence in which [experts] could profitably specialize," Grove & Meehl speculated that these 8 outliers may be due to random sampling error.

2 Read­ers of Less Wrong may rec­og­nize SPRs as a rel­a­tively sim­ple type of ex­pert sys­tem.

3 But, see Ana­toly_Vorobey’s fine ob­jec­tions.

4 There are oc­ca­sional ex­cep­tions, usu­ally referred to as “bro­ken leg” cases. Sup­pose an SPR re­li­ably pre­dicts an in­di­vi­d­ual’s movie at­ten­dance, but then you learn he has a bro­ken leg. In this case it may be wise to aban­don the SPR. The prob­lem is that there is no gen­eral rule for when ex­perts should aban­don the SPR. When they are al­lowed to do so, they aban­don the SPR far too fre­quently, and thus would have been bet­ter off stick­ing strictly to the SPR, even for le­gi­t­i­mate “bro­ken leg” in­stances (Gold­berg 1968; Sawyer 1966; Leli and Filskov 1984).

• I’m skep­ti­cal, and will now pro­ceed to ques­tion some of the as­ser­tions made/​refer­ences cited. Note that I’m not trained in statis­tics.

Un­for­tu­nately, most of the ar­ti­cles cited are not eas­ily available. I would have liked to check the method­ol­ogy of a few more of them.

For ex­am­ple, one SPR de­vel­oped in 1995 pre­dicts the price of ma­ture Bordeaux red wines at auc­tion bet­ter than ex­pert wine tasters do.

The pa­per doesn’t ac­tu­ally es­tab­lish what you say it does. There is no statis­ti­cal anal­y­sis of ex­pert wine tasters, only one or two anec­do­tal state­ments of their fury at the whole idea. In­stead, the SPR is com­pared to ac­tual mar­ket prices—not to ex­perts’ pre­dic­tions. I think it’s fair to say that the claim I quoted is over­reached.

Now, about the model and its fit to data. Note that the SPR is older than 1995, when the pa­per was pub­lished. The NYTimes ar­ti­cle about it which you refer­ence is from 1990 (the pa­per bizarrely dates it to 1995; I’m not sure what’s go­ing on there).

The fact that there’s a linear model (not specified precisely anywhere in the article) which is a good fit to wine prices for vintages of 1961-1972 (Table 3 in the paper) is not, I think, very significant on its own. To judge the model, we should look at what it predicts for upcoming years. Both the paper and the NYTimes article make two specific predictions. First, the 1986 vintage, claimed to be extolled by experts early on, will prove mediocre because of the weather conditions that year (see Figure 3 in the paper, where 1986 is clearly the worst of the 1980s). NYTimes says “When the dust settles, he predicts, it will be judged the worst vintage of the 1980’s, and no better than the unmemorable 1974’s or 1969’s”. The 1995 paper says, more modestly, “We should expect that, in due course, the prices of these wines will decline relative to the prices of most of the other vintages of the 1980s”. Second, the 1989 and 1990 vintages are predicted to be “outstanding” (paper), “stunningly good” (NYTimes), and, “adjusted for age, will outsell at a significant premium the great 1961 vintage” (NYTimes).

It’s now 16 years later. How do we test these pre­dic­tions?

First, I’ve stum­bled on a differ­ent pa­per from the pri­mary au­thor, Prof. Ashen­felter, from 2007. Pub­lished 12 years later than the one you refer­ence, this pa­per has sub­stan­tially the same con­tents, with whole pages copied ver­ba­tim from the ear­lier one. That, by it­self, wor­ries me. Even more wor­ry­ing is the fact that the 1986 pre­dic­tion, promi­nent in the 1990 ar­ti­cle and the 1995 pa­per, is com­pletely miss­ing from the 2007 pa­per (the data be­low might in­di­cate why). And most wor­ry­ing of all is the change of lan­guage re­gard­ing the 1989/​1990 pre­dic­tion. The 1995 pa­per says about its pre­dic­tion that the 1989/​1990 will turn out to be out­stand­ing, “Many wine writ­ers have made the same pre­dic­tions in the trade mag­a­z­ines”. The 2007 pa­per says “Iron­i­cally, many pro­fes­sional wine writ­ers did not con­cur with this pre­dic­tion at the time. In the years that have fol­lowed minds have been changed; and there is now vir­tu­ally unan­i­mous agree­ment that 1989 and 1990 are two of the out­stand­ing vin­tages of the last 50 years.”

Uhm. Right. Well, be­cause the claims aren’t strong enough, they do not ex­actly con­tra­dict each other, but this change leaves a bad taste. I don’t think I should give much trust to these pa­pers’ claims.

The data I could find quickly to test the pre­dic­tions is here. The prices are bro­ken down by the chateaux, by the vin­tage year, the pack­ag­ing (I’ve always cho­sen BT—bot­tle), and the auc­tion year (I’ve always cho­sen the last year available, typ­i­cally 2004). Un­for­tu­nately, Ashen­felter un­der­speci­fies how he came up with the ag­gre­gate prices for a given year—he says he chose a pack­age of the best 15 winer­ies, but doesn’t say which ones or how the prices are com­bined. I used 5 winer­ies that are speci­fied as the best in the 2007 pa­per, and looked up the prices for years 1981-1990. The data is in this spread­sheet. I haven’t tried to statis­ti­cally an­a­lyze it, but even from a quick glance, I think the fol­low­ing is clear. 1986 did not sta­bi­lize as the worst year of the 1980s. It is fre­quently sec­ond- or third-best of the decade. It is always much bet­ter than ei­ther 1984 or 1987, which are sup­posed to be vastly bet­ter ac­cord­ing to the 1995 pa­per’s weather data (see Figure 3). 1989/​1990 did turn out well, es­pe­cially 1990. Still, they’re both nearly always less ex­pen­sive than 1982, which is again vastly in­fe­rior in the weather data (it isn’t even in the best quar­ter). Over­all, I fail to see much cor­re­la­tion be­tween the weather data in the pa­per for the 1980s, the spe­cific claims about 1986 and 1989/​1990, and the mar­ket prices as of 2004. I wouldn’t recom­mend us­ing this SPR to pre­dict mar­ket prices.

Now, this was the first ex­am­ple in your post, and I found what I be­lieve to be sub­stan­tial prob­lems with its method­ol­ogy and the qual­ity of its SPR. If I were to pro­ceed and ex­am­ine ev­ery ex­am­ple you cite in the same de­tail, would I en­counter many such prob­lems? It’s difficult to tell, but my pre­dic­tion is “yes”. I an­ti­ci­pate overfit­ting and shoddy method­ol­ogy. I an­ti­ci­pate huge in­fluence of the se­lec­tion bias—the au­thors that pub­lish these kinds of pa­pers will not pub­lish a pa­per that says “The ex­perts were bet­ter than our SPR”. And fi­nally, I an­ti­ci­pate over­reach­ing claims of wide-reach­ing ap­pli­ca­bil­ity of the mod­els, based on pa­pers that ac­tu­ally in­di­cate mod­est effect in a very spe­cific situ­a­tion with a small sam­ple size.

I’ve looked at your sec­ond ex­am­ple:

Howard and Dawes (1976) found they can re­li­ably pre­dict mar­i­tal hap­piness with one of the sim­plest SPRs ever con­ceived, us­ing only two cues: P = [rate of love­mak­ing] - [rate of fight­ing].

I couldn’t find the original paper, but the results are summarised in Dawes (1979). Looking at it, it turns out that when you say “predict marital happiness”, it really means “predicts one of the partners’ subjective opinion of their marital happiness”, as opposed to e.g. stability of the marriage over time. There’s no indication as to how the partner to question was chosen from each pair (e.g. whether the experimenter knew the rate when they chose). There was very good correlation with the binary outcome (happy/unhappy), but when a finer scale of 7 degrees of happiness was used, the correlation was weak, at 0.4. In a follow-up experiment, the correlation went up to 0.8, but there the subject looked at the lovemaking/fighting statistics before opining on the degree of happiness, thus contaminating their decision. And even in the earlier experiment, the subject had been recording those lovemaking/fighting statistics in the first place, so it would make sense for them to recall those events when they’re asked to assess whether their marriage is a happy one. Overall, the model is witty and naively appears to be useful, but the suspect methodology and the relatively weak correlation encourage me to discount the analysis.

Fi­nally, the fol­low­ing claim is the sin­gle most ob­jec­tion­able one in your post, to my taste:

If you’re hiring, you’re prob­a­bly bet­ter off not do­ing in­ter­views.

My own experience strongly suggests to me that this claim is inane, and is highly dangerous advice. I’m not able to view the papers you base it on, but if they’re anything like the first and second example, they’re far, far away from convincing me of the truth of this claim, which in any case I strongly suspect overreaches gigantically beyond what the papers prove. It may be true, for example, that a very large body of hiring decision-makers in a huge organisation or a state on average make poorer decisions based on their professional judgement during interviews than they would have made based purely on the resume. I can see how this claim might be true, because any such very large body must be largely incompetent. But it doesn’t follow that it’s good advice for you to abstain from interviewing. It would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference. My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).

• If you’re hiring, you’re prob­a­bly bet­ter off not do­ing in­ter­views.

My own ex­pe­rience strongly sug­gests to me that this claim is inane—and is highly dan­ger­ous ad­vice… My per­sonal ex­pe­rience from in­ter­view­ing many, many can­di­dates for a large com­pany sug­gests that in­ter­view­ing is cru­cial (though I will freely grant that differ­ent kinds of in­ter­views vary wildly in their effec­tive­ness).

The whole point of this article is that experts often think themselves better than SPRs when actually they perform no better than SPRs on average. Here we have an expert telling us that he thinks he would perform better than an SPR. Why should we be interested?

• Be­cause I didn’t just state a blan­ket opinion. I dug into the stud­ies, looked for data to test one of them in depth, and found it to be highly flawed. I called into ques­tion the method­ol­ogy em­ployed by the stud­ies, as well as over­gen­er­al­iz­ing and over­reach­ing con­clu­sions they’re drummed up to sup­port. The ev­i­dence that at least some stud­ies are flawed and the method­ol­ogy is shoddy should make you ques­tion the uni­ver­sal claim ”… ac­tu­ally they perform no bet­ter than SPRs on av­er­age”. That’s why you should be in­ter­ested.

My personal experience with interviewing is certainly not as important a piece of evidence against the article as the specific criticisms of the studies. It’s just another anecdotal data point. That’s why I didn’t expand on it as much as I did on the wine study, although I do believe it can be made more convincing through further elucidation.

• Cool, I’ll look into these points.

I made one small change so far. The above article now reads: “Reaction from the wine-tasting industry to such wine-predicting SPRs has been ‘somewhere between violent and hysterical.’”

Also, I’ll post links to the spe­cific pa­pers when I have time to visit UCLA and grab them.

Psy­chol­ogy is not my field, but my un­der­stand­ing is that the ‘in­ter­view effect’ for un­struc­tured in­ter­views is a very ro­bust find­ing across many decades. For more, you can listen to my in­ter­view with Michael Bishop. But hey, maybe he’s wrong!

Up­date 1: If I read the 1995 study cor­rectly, they judged the ac­cu­racy of wine tasters by com­par­ing the price of im­ma­ture wines to those of ma­ture wines, but I’m not sure. The way I phrased that is from Bishop & Trout, and that is how Bishop re­calls it, though it’s been sev­eral years now since he co-wrote Episte­mol­ogy and the Psy­chol­ogy of Hu­man Judg­ment.

• My own ex­pe­rience strongly sug­gests to me that this claim is inane … it would only fol­low if you be­lieve your­self to be no more com­pe­tent than the av­er­age hiring man­ager in such a body, or in the pa­pers you refer­ence.

What ev­i­dence do you have that you are bet­ter than av­er­age?

My per­sonal ex­pe­rience from in­ter­view­ing many, many can­di­dates for a large com­pany sug­gests that in­ter­view­ing is crucial

“It is difficult to get a man to un­der­stand some­thing, when his salary de­pends upon his not un­der­stand­ing it!”

• I have heard of one job in­ter­view that I felt con­sti­tuted a use­ful tool that could not effec­tively be re­placed by re­sume ex­am­i­na­tion and statis­ti­cal anal­y­sis. A friend of mine got a job work­ing for a com­pany that pro­vides math­e­mat­i­cal mod­el­ing ser­vices for other com­pa­nies, and his “in­ter­view” was a sev­eral hour test to cre­ate a num­ber of math­e­mat­i­cal mod­els, and then ex­plain­ing to the ex­am­iner in lay­man’s terms how and why the mod­els worked.

Most job in­ter­views are re­ally not a demon­stra­tion of job skills and ap­ti­tude, and it’s pos­si­ble to sim­ply bul­lshit your way through them. On the other hand, if you have a sim­ple and di­rect way to test the com­pe­tence of your ap­pli­cants, then by all means use it.

• That isn’t an in­ter­view, it’s a test. Tests are ex­tremely use­ful. IQ tests are an ex­cel­lent pre­dic­tor of job perfor­mance, maybe the best one available. Re­gard­less, IQ tests are usu­ally de facto ille­gal in the US due to dis­parate im­pact.

• I put in­ter­view in quotes be­cause they called it an in­ter­view. Speak­ing broadly enough, all in­ter­views are tests, but most are un­struc­tured and not very good at ex­am­in­ing the rele­vant pre­dic­tor vari­ables. All tests are of course not nec­es­sar­ily in­ter­views, but the part where they had ap­pli­cants ex­plain their pro­cesses in lay­man’s terms might qual­ify it, at least if you’re gen­er­ous with your defi­ni­tions.

Of course, it’s cer­tainly un­clear if not out­right in­cor­rect to call it an in­ter­view, but that was their choice; pos­si­bly they felt that sub­ject­ing ap­pli­cants to a “test” rather than an “in­ter­view” pro­jected a less pos­i­tive image.

• I’m most familiar with interviews for programming jobs, where an interview that doesn’t ask the candidate to demonstrate job-specific skills, knowledge and aptitude is nearly worthless. These jobs are also startlingly prone to resume distortion that can make vastly different candidates look similar, especially recent graduates.

Ask­ing for cod­ing sam­ples and call­ing pre­vi­ous em­ploy­ers, es­pe­cially if cou­pled with a re­quest for code solv­ing a new (re­quested) prob­lem, could po­ten­tially re­place in­ter­views. How­ever, judg­ing the qual­ity of code still re­quires a per­son, so that doesn’t seem to re­ally change things to me.

• That’s what I think of, too, when I hear the phrase “job in­ter­view”. Is this not typ­i­cal out­side fields like pro­gram­ming?

• I can con­firm that such a “job in­ter­view” is not com­mon in medicine. The po­ten­tial em­ployer gen­er­ally re­lies on the cre­den­tial­ing pro­cess of the med­i­cal es­tab­lish­ment. Most physi­ci­ans, upon com­plet­ing their train­ing, pass a test demon­strat­ing their abil­ity to re­gur­gi­tate the teach­ers’ pass­words, and are recom­mended to the ap­pro­pri­ate cer­tifi­ca­tion board as “qual­ified” by their pro­gram di­rec­tor; to do oth­er­wise would re­flect badly on the pro­gram. Also, pro­gram di­rec­tors are loath to re­move a res­i­dent/​fel­low dur­ing ad­vanced train­ing be­cause some warm body must show up to do the work, or the pro­fes­sor him­self/​her­self might have to fill in. It is difficult to find re­place­ments for up­per level res­i­dents; the only com­mon rea­son such would be available is dis­mis­sal/​trans­fer from an­other pro­gram. Con­se­quently, the USA turns out physi­ci­ans of widely varied skill lev­els, even though their cre­den­tials are similar. In sur­gi­cal spe­cial­ities, it is not un­usual for a par­tic­u­larly bright in­di­vi­d­ual with all the pass­words but very poor tech­ni­cal skills to be­come a sur­gi­cal pro­fes­sor.

• My mother has told me an anec­dote about a fam­ily friend who was a sur­geon who had a former stu­dent call him while con­duct­ing an op­er­a­tion be­cause he couldn’t re­mem­ber what to do.

• My mother has told me an anec­dote about a fam­ily friend who was a sur­geon who had a former stu­dent call him while con­duct­ing an op­er­a­tion be­cause he couldn’t re­mem­ber what to do.

The (ru­mored) stu­dent has my re­spect. I would ex­pect most sur­geons to have too much of an ego to ad­mit to that doubt rather than stum­ble ahead full of hubris. It would be com­fort­ing to know that your sur­geon acted as if (as op­posed to merely be­liev­ing that) he cared more about the pa­tient than the im­me­di­ate per­cep­tion of sta­tus loss. (I wouldn’t care whether that just meant his thought out an­ti­ci­pa­tion of fu­ture sta­tus loss for a failed op­er­a­tion over­rode his im­me­di­ate so­cial in­stincts.)

• “It is difficult to get a man to un­der­stand some­thing, when his salary de­pends upon his not un­der­stand­ing it!”

I don’t think it’s fair, as his job is not be­ing an in­ter­viewer, but per­haps hiring smart peo­ple we can benefit from.

• Re­gard­ing hiring, I think the key­word might be “un­struc­tured”—what makes an in­ter­view an “un­struc­tured” in­ter­view?

• That’s what I thought too. The defi­ni­tions I found search­ing all say that any in­ter­view where you de­cide what to ask and how to in­ter­pret the re­sults is “un­struc­tured”. The only “struc­tured” in­ter­views seem to be tests with pre-de­ter­mined sets of ques­tions, and the can­di­date’s an­swers judged by for­mal crite­ria.

I’m not sure this di­vi­sion of the “in­ter­view-space” is all that use­ful. I would dis­t­in­guish three cat­e­gories:

1. You have an informal chat with me about the nature of the job, my experience, my previous employment, my claims about my aptitude, etc. Your impressions from this chat determine your judgement of my suitability for the job.

2. You ask me to an­swer ques­tions or perform tasks that demon­strate my ap­ti­tude. It’s up to you to choose the tasks, in­ter­pret my perfor­mance, and guide the whole pro­cess.

3. You give me a pre-de­ter­mined set of ques­tions/​tasks that is the same for all can­di­dates. My an­swers are me­chan­i­cally in­ter­preted by whether they co­in­cide with the pre-de­ter­mined set of cor­rect an­swers.

If I interpret the definitions I could find correctly, 3 is a “structured” interview, and both 1 and 2 are “unstructured”. To my mind, there’s a world of difference between 1 and 2, however. 1 is of very limited utility (I want to say “next to worthless”, but that’d be too presumptuous), and, quite possibly, does no better than deciding on the basis of the resume alone, though I’d still want to see the data to be convinced. 2, when performed by a trained and calibrated interviewer, is (again, in my own experience) obviously superior both to 1 and to deciding on the basis of the resume alone. Maybe this is somehow unique to the profession I interview for, but I doubt it.

Sup­pose there’s re­search which demon­strates that in some set­ting type 1 in­ter­views are worse than us­ing the re­sume alone. I don’t know whether this is the case in the pa­pers cited in this post (I couldn’t read them), but I find it plau­si­ble. Sup­pose then that the con­clu­sions drawn are the uni­ver­sal state­ments “un­struc­tured in­ter­views re­li­ably de­grade the de­ci­sions of gate­keep­ers” and “if you’re hiring, you’re prob­a­bly bet­ter off not do­ing in­ter­views”. I con­sider such con­clu­sions then to be ob­vi­ously un­sub­stan­ti­ated, in­cred­ibly over­reached, and highly dan­ger­ous ad­vice.

• The interview example makes sense to me if the usual hiring manager is strongly biased by information that is not crucial. A dossier gives only a little information, but it is the important information. In a face-to-face interview various other factors can play a role (often unconsciously), e.g. smell or the ability to return a look.

• More here. Surely that isn’t strong evidence, but it is another indication that if you are not an LW type person, then information that is not crucial might alter your perception and subsequent decision when doing face-to-face interviews versus dossier-based ruling.

• Read the Dawes pdf linked in the top post. I can’t speak for the other ex­am­ples, but that one is solid.

edit: my apolo­gies, re-read­ing I see you dis­cussed the mar­riage ex­am­ple. What is your opinion on the grad­u­ate rat­ing and Hodgkin’s dis­ease ex­am­ples?

• that one is solid

Why do you say that? My re­ac­tion to that pa­per was very nega­tive. In large part, it was the anec­do­tal fla­vor of the ar­gu­ments made there, but also be­cause I didn’t see the two things I was speci­fi­cally look­ing for:

• Ci­ta­tions of stud­ies in which a lin­ear model was con­structed us­ing one set of data, and then com­pared as to perfor­mance against the ex­perts us­ing a differ­ent set of data.

• Failing that, some num­bers that would con­vince me that the failure to test mod­els us­ing differ­ent data than was used to con­struct them just doesn’t mat­ter.

In­stead, here and in the 1996 study by Grove & Meehl, I find ar­gu­ments from in­cre­dulity—in effect: “Do our crit­ics re­ally think that this re­ally mat­ters? Don’t be ab­surd!”. I also no­tice that this ide­ol­ogy is be­ing pro­moted by a small num­ber of re­searchers who re­peat­edly cite each other’s work, and do not cite crit­ics (ex­cept as straw­men).

• Like Per­plexed, I hated this pa­per. Of course, it has the very good ex­cuse that it is from 1979. But in 2011, it is sort of ex­pected that you eval­u­ate your model on a sec­ond, in­de­pen­dent dataset. (My mod­els of­ten crash and burn at this stage.) Did any of these stud­ies do this?
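The check being asked for here can be sketched in a few lines of Python: fit the model on one set of cases, then measure its error on cases it never saw. This is only an illustration on synthetic data with a single cue. The data-generating rule, the 100/100 split, and the helper names `fit_line` and `mean_abs_error` are all assumptions, and nothing here reflects how any of the cited studies were actually run.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single cue."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(slope, intercept, xs, ys):
    """Average absolute gap between predicted and actual outcomes."""
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic cases (assumed rule): outcome = 2 * cue + unit-variance noise.
cases = []
for _ in range(200):
    cue = rng.uniform(0, 10)
    cases.append((cue, 2.0 * cue + rng.gauss(0, 1)))

# Construct the model on the first half, evaluate on the untouched second half.
train, test = cases[:100], cases[100:]
train_x, train_y = [c for c, _ in train], [y for _, y in train]
test_x, test_y = [c for c, _ in test], [y for _, y in test]

slope, intercept = fit_line(train_x, train_y)
train_error = mean_abs_error(slope, intercept, train_x, train_y)
test_error = mean_abs_error(slope, intercept, test_x, test_y)
```

The number that matters is `test_error`. A model that has merely memorized quirks of its construction data tends to look good on `train_error` and noticeably worse on the held-out half.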

• Also, if I may be per­mit­ted to make a more gen­eral crit­i­cism in re­sponse to this post, I would say that while the ar­ti­cle ap­pears to be well-re­searched, it has demon­strated some of the worst prob­lems I com­monly no­tice on this fo­rum. The same goes for the ma­jor­ity of the com­ments, even though many are knowl­edge­able and in­for­ma­tive. What I have in mind is the fix­a­tion on con­coct­ing the­o­ries about hu­man be­hav­ior and so­ciety based on var­i­ous idées fixes and leit­mo­tifs that are parts of the in­tel­lec­tual folk­lore here, while failing to no­tice is­sues sug­gested by ba­sic com­mon sense that are likely to be far more im­por­tant.

Thus the poster no­tices that these mod­els are not used in prac­tice de­spite con­sid­er­able ev­i­dence in their fa­vor, and rushes to pro­pose cog­ni­tive bi­ases à la Kah­ne­man & Tver­sky as the likely ex­pla­na­tion. This with­out even stop­ping to think of two ques­tions that just scream for at­ten­tion. First, what is the im­por­tance of the fact that just about any is­sue of sort­ing out peo­ple is nowa­days likely to be ide­olog­i­cally charged and legally dan­ger­ous? Se­cond, what about the fact that these mod­els are sup­posed to throw some high-sta­tus peo­ple out of work, and in a way that makes them look like they’ve been in­com­pe­tent all along?

Re­gard­less of whether var­i­ous hy­pothe­ses based on these ques­tions have any merit, the fact that some­one could write a post with­out even giv­ing them the slight­est pass­ing at­ten­tion, offer­ing in­stead a blinkered ex­pla­na­tion in­volv­ing the stan­dard old LW/​OB folk­lore, and still get up­voted to +40 is, in my opinion, in­dica­tive of some se­vere and wide­spread bi­ases.

• While this post has +40 upvotes, the majority of the top-voted comments are skeptical of it. I think this represents confusion as to how to upvote, although this is merely a hypothesis. The article surveys a very interesting topic that is right in the sweet spot of interest for this community, and it also appears scholarly. However, the conclusions synthesized by the author strike me as naive, and I suspect that’s also the conclusion of the majority. Whether it deserves an upvote is debatable. I downvoted.

• I felt the con­fu­sion you are talk­ing about. If read­ers could be ex­pected to read the top-voted replies (RTFC), then the cur­rent dis­tri­bu­tion of votes would be ideal: The in­ter­est­ing ar­ti­cle gets some well-de­served at­ten­tion, and the skep­ti­cal replies give a coun­ter­bal­ance. But if read­ers don’t read the com­ments, then frankly I think this ar­ti­cle got too many up­votes when com­pared to many oth­ers.

Off­topic: Is there a meta thread some­where dis­cussing the se­man­tics of votes? I am happy that we don’t use slash­dot’s baroque in­sight­ful/​in­ter­est­ing/​funny dis­tinc­tions, but some con­sen­sus about the mean­ing of +1 would be nice.

• I don’t know about a meta-thread, but the rule of thumb I’ve seen quoted of­ten is “up­vote what you want more of; down­vote what you want less of.” Karma scores are in­tended, on this view, as an in­di­ca­tor of how many peo­ple (net) want more en­tries like that.

One im­pli­ca­tion of this view is that a score of 40 isn’t “ten times bet­ter” than a score of 4, it just means that many more peo­ple want to see posts like this than don’t want to.

Of course, this view com­petes with peo­ple’s en­tirely pre­dictable ten­dency to treat karma as an in­di­ca­tor of the en­try’s (and the user’s) over­all worth, or as a game to max­i­mize one’s score on, or as a form of re­ward/​pun­ish­ment.

Equally pre­dictably, this pre­dictable but un­in­tended use of karma far far far out­weighs the in­tended use.

• Karma-max­i­miz­ing is of­ten but not always a good ap­prox­i­ma­tion to worth-as-judged-by-com­mu­nity max­i­miz­ing, which is a good thing to max­i­mize.

• Yes. The ques­tion is how sig­nifi­cant the gap be­tween “of­ten” and “always” is.

• Though if you have a tar­get au­di­ence in mind, it is some­times worth post­ing things that will be down­voted by the com­mu­nity-at-large.

(I’ve been do­ing this a lot re­cently, though I plan on cut­ting back and re­gain­ing some gen­eral ra­tio­nal­ist cred­i­bil­ity.)

• My intent was to summarize the literature on SPRs, not to provide an account of why they are not used more widely. I almost didn’t include that sentence at all. Surely, more analysis would be important to have in a post intending to discuss the psychological issues involved in our reaction to SPRs, but that was not my subject.

In point­ing to cog­ni­tive bi­ases as an ex­pla­na­tion, I was merely re­peat­ing what Bishop & Trout & Dawes have sug­gested on the mat­ter, not mak­ing up my own ex­pla­na­tions in light of LW lore.

In fact, the ar­rows point the other way. Many of the au­thors cited in my ar­ti­cle worked closely with peo­ple like Kah­ne­man who are the origi­nal aca­demic sources of much of LW lore.

Edit: I’ve added a clause about the source of the “cog­ni­tive bi­ases” sug­ges­tion, in case oth­ers are tempted to make the same mis­taken as­sump­tion as you made.

• First, what is the im­por­tance of the fact that just about any is­sue of sort­ing out peo­ple is nowa­days likely to be ide­olog­i­cally charged and legally dan­ger­ous? Se­cond, what about the fact that these mod­els are sup­posed to throw some high-sta­tus peo­ple out of work, and in a way that makes them look like they’ve been in­com­pe­tent all along?

I am not sure what you think the an­swers to these ques­tions are, but I would say my per­sonal opinion on the mat­ter is that the more ide­olog­i­cally charged and legally dan­ger­ous a mat­ter is, the more im­por­tant ac­cu­racy and cor­rect­ness—at the ex­pense, if nec­es­sary, of strongly-held be­liefs. I would also say that pro­tect­ing the rep­u­ta­tion of com­pe­tency en­joyed by high-sta­tus peo­ple is not an ac­tivity that strongly cor­re­lates with be­ing right; I pre­dict a small nega­tive cor­re­la­tion, in fact.

Furthermore, there is a selection effect: learning the LW/OB folklore will result in you noticing specific cases of their application, and you are far, far more likely to write a post about that than about any given subject. That is, you see a prevalence of “standard bias explanation” because top-level posters are actively looking for actual cases of bias to discuss.

• The second reason is invalid unless the actor is self-deluding: a smart actor who faces being put out of work would silently adopt an SPR as his decision-making system without admitting to it. Since the superiority of SPRs continues in many fields, either the relevant actors are consistently not smart, performance is not a significant contributing criterion to their success, or they’re self-deluding, i.e. overrating their own judgment as the poster stated. I’d guess a combination of the last two.

• Yes, I’d say it’s a com­bi­na­tion of the last two points, with em­pha­sis on the sec­ond last.

The crit­i­cal ques­tion is whether max­i­miz­ing the ac­cu­racy of your judg­ments is a prac­ti­cal way to get ahead in a given pro­fes­sion. Some­times that is in­deed the case, and in such fields we in­deed see tremen­dous efforts to au­to­mate as much ex­pert work as pos­si­ble, of­ten with great suc­cess, as in the elec­tron­ics in­dus­try. But in pro­fes­sions that op­er­ate as more tightly-knit guilds, ad­her­ence to ac­cepted stan­dards is much more im­por­tant than any ob­jec­tive met­rics of effec­tive­ness. Step­ping out­side of stan­dard work pro­ce­dures is of­ten treated as a se­ri­ous in­frac­tion with po­ten­tially se­vere con­se­quences. (Espe­cially if your non-stan­dard method­ol­ogy fails in some par­tic­u­lar case, as it will sooner or later, and you can’t cover your ass by claiming that you fol­lowed all the stan­dard ac­cepted pro­ce­dures and hav­ing your pro­fes­sion back you up or­ga­ni­za­tion­ally.)

Now, you could try enhancing your work with decision models in secret. But even then, it’s hard to do it in a completely secretive way, and moreover, human minds being what they are, most people can achieve professional success only if they are really sincerely convinced of their expertise and effectiveness. Keeping a public facade is hard for everyone except a very small minority of people.

• So why aren’t SPRs in use ev­ery­where? Prob­a­bly, we deny or ig­nore the suc­cess of SPRs be­cause of deep-seated cog­ni­tive bi­ases, such as over­con­fi­dence in our own judg­ments. But if these SPRs work as well as or bet­ter than hu­man judg­ments, shouldn’t we use them?

Without even get­ting into the con­crete de­tails of these mod­els, I’m sur­prised that no­body so far has pointed out the elephant in the room: in con­tem­po­rary so­ciety, statis­ti­cal in­fer­ence about hu­man be­hav­ior and char­ac­ter­is­tics is a topic bear­ing tremen­dous poli­ti­cal, ide­olog­i­cal, and le­gal weight. [*] Nowa­days there ex­ists a firm main­stream con­sen­sus that the use of cer­tain sorts of con­di­tional prob­a­bil­ities to make statis­ti­cal pre­dic­tions about peo­ple is dis­crim­i­na­tory and there­fore evil, and do­ing so may re­sult not only in loss of rep­u­ta­tion, but also in se­ri­ous le­gal con­se­quences. (Note that even if none of the for­bid­den crite­ria are built into your de­ci­sion-mak­ing ex­plic­itly, that still doesn’t leave you off the hook—just search for “dis­parate im­pact” if you don’t know what I’m talk­ing about.)

Now of course, mak­ing any pre­dic­tion about peo­ple at all nec­es­sar­ily in­volves one sort of statis­ti­cal dis­crim­i­na­tion or an­other. The bound­aries be­tween the types of statis­ti­cal dis­crim­i­na­tion that are con­sid­ered OK and those that are con­sid­ered evil and risk le­gal li­a­bil­ity are an ar­bi­trary re­sult of cul­tural, poli­ti­cal, and ide­olog­i­cal fac­tors. (They would cer­tainly look strange and ar­bi­trary to some­one who isn’t im­mersed in the cul­ture that gen­er­ated them to the point where they ap­pear com­mon-sen­si­cal or at least ex­pli­ca­ble.) There­fore, while your model may well be ac­cu­rate in es­ti­mat­ing the prob­a­bil­ity of re­ci­di­vism, job perfor­mance, etc., it’s un­likely that it will be able to nav­i­gate the so­cial con­ven­tions that de­ter­mine these for­bid­den lines. A lot of the seem­ingly ab­surd and in­effec­tive rit­u­als and reg­u­la­tions in mod­ern busi­ness, gov­ern­ment, academia, etc. ex­ist ex­actly for the pur­pose of satis­fy­ing these com­plex con­straints, even if they’re not com­monly thought of as such.

--

[*] Edit: I missed the com­ment be­low in which the com­menter Stu­dent_UK already raised a similar point.

• If the best way to choose who to hire is with a statis­ti­cal anal­y­sis of legally for­bid­den crite­ria, then keep your rea­sons se­cret and shred your work. Is that so hard?

• That doesn’t close the loop­hole, it adds a con­straint. And it’s only sig­nifi­cant for those who both hire enough peo­ple to be vuln­er­a­ble to statis­ti­cal anal­y­sis of their hiring prac­tices, and re­ceive too many bad ap­pli­cants from pro­tected classes. If it is a sig­nifi­cant con­straint, you want to find that out from the data, not from guess­work, and ap­ply the min­i­mum legally ac­cept­able cor­rec­tion fac­tor.

Be­sides, it’s not like mug­gles are a pro­tected class. And if they were? Just keep them from ap­ply­ing in the first place, by build­ing your office some­where they can’t get to. There aren’t any le­gal re­stric­tions on that.

• Be­sides, it’s not like mug­gles are a pro­tected class. And if they were? Just keep them from ap­ply­ing in the first place, by build­ing your office some­where they can’t get to. There aren’t any le­gal re­stric­tions on that.

You joke, but the world [1] really is choking with inefficient, kludgey workarounds for the legal prohibition of effective employment screening. For example, the entire higher education market has become, basically, a case of employers passing off tests to universities that they can’t legally administer themselves. You’re a terrorist if you give an IQ test to applicants, but not if you require a completely irrelevant college degree that requires taking the SAT (or the military’s ASVAB or whatever they call it now).

It feels so good to ban dis­crim­i­na­tion, as long as you don’t have to di­rectly face the trade­off you’re mak­ing.

[1] Per MattherW’s correction, this should read “Western developed economies” instead of “the world”, though I’m sure the phenomenon I’ve described is more general than the form it takes in the West.

• You say ‘the world’, but it seems to me you’re talk­ing about a re­gion which is a lit­tle smaller.

• I’m not sure the cor­rec­tion is that rele­vant. The US and the EU to­gether make up about 40% of global GDP (PPP).

Several minor economies with nearly identical conditions and restrictions, such as Canada, New Zealand, Australia, South Africa, Norway, Switzerland … add up to another 3% or so. Most states in Latin America have similar legal prohibitions as well; they are not as well enforced, but avoiding them still imposes costs. This is to say nothing of Japan or other developed East Asian economies (though to be fair losses are probably much smaller than in the developed West and perhaps even Latin America).

The other half of the world bears a massive opportunity cost due to the mentioned half’s described inefficiency. Converting this loss into number of lives or quality of life is a depressing exercise.

For­tu­nately that is only a prob­lem if you care about hu­mans.

• Well, I’m in the UK, and there’s no law against us­ing IQ-style tests for job ap­pli­cants here. Is that re­ally the case in the US? (I as­sume the “You’re a ter­ror­ist” bit was hy­per­bole.)

Em­ploy­ers here still of­ten ask for ap­par­ently-ir­rele­vant de­grees. But ad­mis­sion to uni­ver­sity here isn’t no­tice­ably based on ‘generic’ tests like the SAT; it’s mostly done on the grades from sub­ject-spe­cific ex­ams. So I doubt em­ploy­ers are treat­ing the de­grees as a proxy for SAT-style test­ing.

• Cor­rec­tion ac­cepted.

• That doesn’t close the loop­hole, it adds a con­straint.

Yes, it does close the loop­hole. You say con­ceal the cause (in­tent to dis­crim­i­nate) and you can get away with as much effect (dis­pro­por­tionate ex­clu­sion) as you want. Ex­cept the law already speci­fies that the effect is pun­ish­able as well as the cause.

So now the best you can do, as­sum­ing the pop­u­la­tions are equally com­pe­tent and suited for the job, is 20% dis­crim­i­na­tion.

And of course, in the real world, pop­u­la­tions usu­ally differ in their suit­abil­ity for the job. Blacks tend not to have as many CS de­grees as whites, for ex­am­ple. So if you are an em­ployer of CS de­grees, you may not be able to get away with any dis­crim­i­na­tion be­fore you have breached the 20% limit, and may need to dis­crim­i­nate against the non-blacks in or­der to be com­pli­ant.
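For readers who have not met the 80% (four-fifths) rule referred to here, the arithmetic behind the "20%" figure can be sketched roughly as follows. This assumes the usual statement of the rule, namely that each group's selection rate should be at least 80% of the highest group's rate; the function names and the applicant counts are hypothetical and chosen only to make the threshold visible.

```python
def selection_rate(hired, applicants):
    """Fraction of applicants from one group who were hired."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    return hired / applicants


def passes_four_fifths(rates, threshold=0.8):
    """True if every group's rate is at least `threshold` times the highest rate."""
    highest = max(rates.values())
    return all(rate >= threshold * highest for rate in rates.values())


# Hypothetical applicant pools, invented purely for illustration.
rates = {
    "group_a": selection_rate(hired=50, applicants=100),
    "group_b": selection_rate(hired=40, applicants=100),
}
print(passes_four_fifths(rates))  # group_b is hired at exactly 80% of group_a's rate

rates["group_b"] = selection_rate(hired=39, applicants=100)
print(passes_four_fifths(rates))  # now just under the 80% line
```

Under these assumptions the first check passes and the second fails, which is the sense in which a 20% gap in selection rates is the most the rule tolerates.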

Be­sides, it’s not like mug­gles are a pro­tected class.

I would sus­pect that if the US Mug­gle le­gal sys­tem had any­thing to say about it, they would be. If mag­i­cal-ness is con­ferred by genes, then it’s vi­o­lat­ing ei­ther the gen­eral racial guideline or it’s vi­o­lat­ing re­cent laws (signed by GWB, IIRC) for­bid­ding em­ployer dis­crim­i­na­tion based on ge­net­ics (in the con­text of genome se­quenc­ing, true, but prob­a­bly gen­eral). If it’s not con­ferred by genes, then there may be a gen­eral cul­tural ba­sis on which to sue (Mug­gles as dis­abled be­cause they lack an abil­ity nec­es­sary for ba­sic func­tion­ing in Wizard­ing so­ciety, per­haps).

• You can put de­gree re­quire­ments on the job ad­ver­tise­ment, which should act as a filter on ap­pli­ca­tions, some­thing that can’t be caught by the 80% rule.

(Of course, uni­ver­si­ties tend to use racial crite­ria for ad­mis­sion in the US, some­thing which, iron­i­cally, can be an in­cen­tive for com­pa­nies to dis­crim­i­nate based on race even amongst ap­pli­cants with CS de­grees.)

• The 80% rule is only part of it. Again, racist requirements are an obvious loophole you should expect to have been addressed; you can only get away with a little covert discrimination if any.

For ex­am­ple, a fire de­part­ment re­quiring ap­pli­cants to carry a 100 lb (50 kg) pack up three flights of stairs. The up­per-body strength re­quired typ­i­cally has an ad­verse im­pact on women. The fire de­part­ment would have to show that this re­quire­ment is job-re­lated for the po­si­tion. This typ­i­cally re­quires em­ploy­ers to con­duct val­i­da­tion stud­ies that ad­dress both the Uniform Guidelines and pro­fes­sional stan­dards.

If you add un­nec­es­sary re­quire­ments as a stealth filter, how do you show the re­quire­ments are job-re­lated?

• I thought we were talk­ing about how to use nec­es­sary re­quire­ments with­out risk­ing a suit, not how to con­ceal racial prefer­ences by us­ing clev­erly cho­sen proxy re­quire­ments. But it looks like you can’t use job ap­pli­ca­tion de­gree re­quire­ments with­out show­ing a busi­ness need ei­ther.

• topy­nate:

But it looks like you can’t use job ap­pli­ca­tion de­gree re­quire­ments with­out show­ing a busi­ness need ei­ther.

The relevant landmark case in U.S. law is the 1971 Supreme Court decision in Griggs v. Duke Power Co. The court ruled that not just testing of prospective employees, but also academic degree requirements that have disparate impact across protected groups are illegal unless they are “demonstrably a reasonable measure of job performance.”

Now of course, “a reasonable measure of job performance” is a vague criterion, which depends on controversial facts as well as subjective opinion. To take only the most notable example, these people would probably say that IQ tests are a reasonable measure of performance for a great variety of jobs, but the present legal precedent disagrees. This situation has given rise to endless reams of case law and a legal minefield that takes experts to navigate.

At the end, as might be ex­pected, what sorts of tests and aca­demic re­quire­ments are per­mit­ted to differ­ent in­sti­tu­tions in prac­tice de­pends on ar­bi­trary cus­tom and the pub­lic per­cep­tion of their sta­tus. The de facto rules are only partly cod­ified for­mally. Thus, to take again the most no­table ex­am­ple, the army and the uni­ver­si­ties are al­lowed to use what are IQ tests in all but name, which is an ab­solute taboo for al­most any other in­sti­tu­tion.

• I thought we were talk­ing about how to use nec­es­sary re­quire­ments with­out risk­ing a suit, not how to con­ceal racial prefer­ences by us­ing clev­erly cho­sen proxy re­quire­ments.

I wasn’t. I was talk­ing about how the ob­vi­ous loop­holes are already closed or have been heav­ily re­stricted (even at the cost of false pos­i­tives), and hence how Quir­rel’s com­ments are naive and un­in­formed.

But it looks like you can’t use job ap­pli­ca­tion de­gree re­quire­ments with­out show­ing a busi­ness need ei­ther.

Yes, that doesn’t sur­prise me in the least.

• Just keep them from ap­ply­ing in the first place, by build­ing your office some­where they can’t get to. There aren’t any le­gal re­stric­tions on that.

You re­ally are new here, aren’t you?

http://en.wikipedia.org/wiki/Americans_with_Disabilities_Act_of_1990#Title_III_-_Public_Accommodations_.28and_Commercial_Facilities.29

http://en.wikipedia.org/wiki/Zoning

In short, there most certainly ARE legal restrictions on building your office somewhere deliberately selected for its inaccessibility to those with a congenital inability to e.g. teleport, and a lack of teleportation-specific case law would not work in your favor, given the judge’s access to statements you’ve already made.

• In short, there most certainly ARE legal restrictions on building your office somewhere deliberately selected for its inaccessibility to those with a congenital inability to e.g. teleport,

The Amer­i­cans with Dis­abil­ities Act limits what you can build (ev­ery build­ing needs ramps and ele­va­tors), not where you can build it. Zon­ing laws are black­list-based, not whitelist-based, so ex­tradi­men­sional spaces are fine. More com­monly, you can eas­ily find office space in lo­ca­tions that poor peo­ple can’t af­ford to live near. And in the un­likely event that race or na­tional ori­gin is the key fac­tor, you get to choose which coun­try or city’s de­mo­graph­ics you want.

A lack of tele­por­ta­tion-spe­cific case law would not work in your fa­vor, given the judge’s ac­cess to state­ments you’ve already made.

This is the iden­tity un­der which I speak freely and teach defense against the dark arts. This is not the iden­tity un­der which I buy office build­ings and hire minions. If it was, I wouldn’t be talk­ing about hiring strate­gies.

• This is the iden­tity un­der which I speak freely and teach defense against the dark arts. This is not the iden­tity un­der which I buy office build­ings and hire minions. If it was, I wouldn’t be talk­ing about hiring strate­gies.

Up voted for hav­ing the sense to em­ploy a blind­ingly ob­vi­ous strat­egy that some­how con­sis­tently fails to be­come com­mon sense.

• More com­monly, you can eas­ily find office space in lo­ca­tions that poor peo­ple can’t af­ford to live near.

But that they could, in prin­ci­ple, walk to and from.

• Be­sides, it’s not like mug­gles are a pro­tected class. And if they were? Just keep them from ap­ply­ing in the first place, by build­ing your office some­where they can’t get to. There aren’t any le­gal re­stric­tions on that.

My google-fu is not strong enough to find the legal doctrine, but in the US at least, you can be sued for ~implicit discrimination, i.e. if the newspaper you advertise in has a reader population that does not reflect the general population, you’re discriminating against the under represented population.

• i.e. if the news­pa­per you ad­ver­tise in has a reader pop­u­la­tion that does not re­flect the gen­eral pop­u­la­tion, you’re dis­crim­i­nat­ing against the un­der rep­re­sented pop­u­la­tion.

...I thought this was a joke. Now… not so sure.

• See the last sen­tence of my first para­graph above (the one in paren­the­ses).

• An in­ter­est­ing story that I think I re­mem­ber read­ing:

One study found that rel­a­tively in­ex­pe­rienced psy­chi­a­trists were more ac­cu­rate at di­ag­nos­ing men­tal ill­ness than ex­pe­rienced ones. This is be­cause in­ex­pe­rienced psy­chi­a­trists stuck closely to check­lists rather than rely on their own judg­ment, and whether or not a di­ag­no­sis was con­sid­ered “ac­cu­rate” was based on how closely the re­ported symp­toms matched the check­list. ;)

• If we are mea­sur­ing the ac­cu­racy of A vs. B, we are im­plic­itly mea­sur­ing A against gold stan­dard C, and B against gold stan­dard C. If a bet­ter C is not read­ily available, we may choose to use A or B as an ap­prox­i­ma­tion, the choice of which de­ter­mines our out­come.

Now I won­der:

Are the peo­ple that are sym­pa­thetic to the hy­poth­e­sis that com­put­ers are bet­ter in the cases above (and ig­nored be­cause of bi­ases) as­sum­ing we made the fal­lacy of us­ing hu­mans as a gold stan­dard?

Are the peo­ple that are sym­pa­thetic to the hy­poth­e­sis that hu­mans are bet­ter (and ig­nored be­cause of bi­ases) as­sum­ing we made the fal­lacy of us­ing com­put­ers as a gold stan­dard?

The union of which is a lot of up­votes. I can’t de­cide which was meant.

• This is one of the top 3 rated com­ments on this post. I think you should spec­ify more di­rectly how this anec­dote re­lates to how you in­ter­pret the ar­ti­cle’s in­ten­tion.

• He should spec­ify where he has read that.

• I don’t re­mem­ber. I may have ac­tu­ally heard one of my par­ents talk­ing about it in­stead of read­ing it. So con­sider it an ur­ban leg­end.

• If this is not amaz­ing enough, con­sider the fact that even when ex­perts are given the re­sults of SPRs, they still can’t out­perform those SPRs (Leli & Filskov 1985; Gold­berg 1968).

Now THAT part is just plain em­bar­rass­ing. I mean, it’s truly a mark of shame upon us if we have a tool that we know works, we are given ac­cess to the tool, and we still can’t do bet­ter than the tool it­self, un­aided. (EDIT: By “we”, I mean “the ex­perts in the rele­vant fields”… which I guess isn’t re­ally a “we” as such, but you know what I mean)

Any­ways, are there any nice on­line in­dexes or what­ever of SPRs that make it easy to put in class of prob­lem and have it find a SPR that’s been ver­ified to work for that sort of prob­lem?

• Now THAT part is just plain em­bar­rass­ing. I mean, it’s truly a mark of shame upon us if we have a tool that we know works, we are given ac­cess to the tool, and we still can’t do bet­ter than the tool it­self, un­aided.

Coin­ci­den­tally, I was plan­ning to write an ar­ti­cle “defend­ing” the use of fal­la­cies on Bayesian grounds. A typ­i­cal pas­sage would go like this:

People say it’s fallacious to appeal to authority. However, if you learn that experts believe X, you should certainly update some finite amount in favor of believing X, as experts are, in general, more likely to believe X if it is true than if it is false, even as you may find many exceptions.

In­deed, it would be quite a strange world if ex­perts were con­sis­tently wrong about a given sub­ject mat­ter X, thus mak­ing their opinions for X into ev­i­dence against X, be­cause they would have to per­sist in this er­ror, even know­ing that their en­tan­gle­ment with X means they only have to in­vert their pro­nounce­ments or re­main ag­nos­tic to im­prove ac­cu­racy.

Well, it seems we ac­tu­ally do live in such a world, where (some classes of) ex­perts make pre­dictable er­rors, and don’t take triv­ial steps to make their opinions more ac­cu­rate (and en­tan­gled with the sub­ject mat­ter).
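The update described in the quoted passage is just Bayes' rule, and a small sketch may make the "finite amount" concrete. The probabilities below are hypothetical numbers picked for illustration, not estimates for any real group of experts, and the function name is made up here.

```python
def posterior(prior, p_assert_if_true, p_assert_if_false):
    """P(X | experts assert X), computed with Bayes' rule."""
    numerator = p_assert_if_true * prior
    evidence = numerator + p_assert_if_false * (1.0 - prior)
    return numerator / evidence


prior = 0.5  # hypothetical starting credence in X

# Experts assert X more often when it is true: belief in X goes up.
print(posterior(prior, p_assert_if_true=0.8, p_assert_if_false=0.4))

# Experts assert X equally often either way: no update at all.
print(posterior(prior, p_assert_if_true=0.6, p_assert_if_false=0.6))

# Experts assert X more often when it is false: their opinion is evidence against X.
print(posterior(prior, p_assert_if_true=0.3, p_assert_if_false=0.6))
```

Only the third case is the "strange world" of anti-correlated experts; making predictable errors while staying in the first case is a different and much milder failure.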

• Well, ex­perts still do bet­ter than non-ex­perts on av­er­age (afaik), just that they seem to to­tally ig­nore tools that could let them do a whole lot bet­ter, and also ap­par­ently can’t do much bet­ter than the tools them­selves, even when they’re able to use the tools.

• Mak­ing pre­dictable er­rors isn’t the same thing as their opinions be­ing anti-cor­re­lated with re­al­ity.

• If any­body would like to try some statis­ti­cal ma­chine learn­ing at home, it’s ac­tu­ally not that hard. The tough part is get­ting a data set. Once that’s done, most of the ex­am­ples in this ar­ti­cle are things you could just feed to some soft­ware like Weka, press a few but­tons, and get a statis­ti­cal model. BAM!

Let’s try an ex­am­ple. Here is some breast can­cer di­ag­nos­tic data, show­ing a bunch of ob­ser­va­tions of peo­ple with breast can­cer (age, size of tu­mors, etc.) and whether or not the can­cer re­oc­curred af­ter treat­ment. Can we pre­dict can­cer re­cur­rence?

If you look at it with a de­ci­sion tree, it turns out that you can get about 70% ac­cu­racy by ob­serv­ing two of the sev­eral fac­tors that were ob­served, in a very sim­ple de­ci­sion pro­ce­dure. You can do a lit­tle bet­ter by us­ing some­thing more so­phis­ti­cated, like a naive Bayes clas­sifier. Th­ese show us what fac­tors are the most im­por­tant, and how.

If you’re in­ter­ested, go ahead and play around. It’s pretty easy to get started. Ob­vi­ously, take ev­ery­thing with a grain of salt, but still, ba­sic ma­chine learn­ing is sur­pris­ingly easy.
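For anyone who would rather see the idea in code than in a GUI, here is a rough sketch of the simplest possible decision tree: a one-level tree (a "decision stump") over categorical cues, written in plain Python. It is not what Weka does internally, and the toy records, cue names and labels are invented for illustration; they are not the breast cancer data.

```python
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels):
    """Fit a one-level decision tree on categorical features.

    For each feature, predict the majority label seen for each of its values,
    and keep the feature whose rule gets the most training rows right.
    """
    best = None
    for feature in rows[0]:
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        rule = {value: majority(group) for value, group in groups.items()}
        correct = sum(rule[row[feature]] == label for row, label in zip(rows, labels))
        if best is None or correct > best[0]:
            best = (correct, feature, rule)
    correct, feature, rule = best
    return {
        "feature": feature,
        "rule": rule,
        "default": majority(labels),
        "train_accuracy": correct / len(rows),
    }


def predict(stump, row):
    """Apply the stump; fall back to the overall majority for unseen values."""
    return stump["rule"].get(row[stump["feature"]], stump["default"])


# Invented toy records (NOT the breast cancer data mentioned above).
rows = [
    {"cue_1": "high", "cue_2": "yes"},
    {"cue_1": "high", "cue_2": "no"},
    {"cue_1": "low", "cue_2": "yes"},
    {"cue_1": "low", "cue_2": "no"},
    {"cue_1": "high", "cue_2": "yes"},
    {"cue_1": "low", "cue_2": "no"},
    {"cue_1": "low", "cue_2": "no"},
]
labels = ["positive", "negative", "positive", "negative",
          "positive", "negative", "negative"]

stump = fit_stump(rows, labels)
print(stump["feature"], stump["train_accuracy"])
print(predict(stump, {"cue_1": "high", "cue_2": "no"}))
```

Note that the accuracy reported here is measured on the training rows themselves; on real data you would want to hold out a test set before believing any number.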

Let me brag a bit. Once in a friendly dis­cus­sion the fol­low­ing ques­tion came up: How to pre­dict for an un­known first name whether it is a male or fe­male name. This was in a con­text of Hun­gar­ian names, as all of us were Hun­gar­i­ans. I had a list of Hun­gar­ian first names in digi­tal for­mat. The dis­cus­sion turned into a bet: I said I can write a pro­gram in half an hour that tells with at least 70% pre­ci­sion the sex of a first name it never saw be­fore. I am quite fast with writ­ing small scripts. It wasn’t even close: It took me 9 min­utes to

• split my sets of 1000 male and 1000 fe­male names into a ran­dom 1000-1000 train-test split,

• split each name into character 1-, 2- and 3-grams. E.g.: Luca was turned into ^L u c a$ ^Lu uc ca$ ^Luc uca$.

• feed the train­ing data into a com­mand line tool to train a max­ent model,

• test the ac­cu­racy of the model on the un­seen test data.

The model reached an accuracy of 90%. In retrospect, this is not surprising at all. Looking into the linear model, the most important feature it identified was whether the name ends with an ‘a’. This trivial model alone reaches some 80% precision for Hungarian names, so if I had known this in advance, I could have won the bet in 30 seconds instead of 9 minutes, with the sed command s/a$/a FEMALE/.
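The featurization step and the trivial baseline can be sketched in a few lines of Python. This is a reconstruction from the Luca example, not the original script: the function names are made up here, and labelling every name that does not end in "a" as MALE is an assumption (the sed one-liner only tags the FEMALE case).

```python
def char_ngrams(name, max_n=3):
    """Character 1- to max_n-grams, with ^ marking the start and $ the end of the name."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(name) - n + 1):
            gram = name[start:start + n]
            if start == 0:
                gram = "^" + gram
            if start + n == len(name):
                gram = gram + "$"
            features.append(gram)
    return features


def baseline_guess(name):
    """The 'ends with a' rule; the MALE fallback is an assumption of this sketch."""
    return "FEMALE" if name.endswith("a") else "MALE"


print(" ".join(char_ngrams("Luca")))
print(baseline_guess("Luca"))
```

The first line printed is the same feature string shown for Luca above; each such line, prefixed with its label, would form one row of the training file.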

• Th­ese sound like pow­ers I should ac­quire. Could you drop some fur­ther hints on:

• “a com­mand line tool to train a max­ent model”

• how you tested the ac­cu­racy of the model (tools that let you do that in the re­main­ing min­utes, rather than gen­eral prin­ci­ples)

• I used Zhang Le’s tool. Note that it is a rather obscure thing, not an industry standard like, say, the huge Weka and Mallet packages. It made the tasks you ask about very easy. Once I had featurized the train and test data,

maxent -m gender.model train.data

built the model and

maxent -p -m gender.model test.data

told me its ac­cu­racy on the test data.

• This is a great ar­ti­cle, but it only lists stud­ies where SPRs have suc­ceeded. In fair­ness, it would be good to know if there were any stud­ies that showed SPRs failing (and also con­sider pub­li­ca­tion bias, etc.).

• Definitely.

• My principal problem with this article is that you appear to promote the idea that these SPRs are being ignored for extremely bad reasons, rather than for decent reasons. So when you say ‘definitely’ here, I have a problem: you are compartmentalizing the arguments and not admitting the problems with your post.

Also, I don’t think this is a great article and in proportion to it getting +40 votes I have a poor opinion of this community (or at least its karma system where 0 should be neutral).

edit: My last para­graph here is ex­ces­sively dra­matic and I re­tract it.

• Miller,

Does this look like “not ad­mit­ting the prob­lems with [my] post”?

• It would be more con­struc­tive of me if I ac­tu­ally helped find counter-ev­i­dence, rather than whing­ing about your not do­ing so. I think you’ve put a lot of effort into up­dat­ing your po­si­tion.

• My gut re­ac­tion is that this doesn’t demon­strate that SPRs are good, just that hu­mans are bad. There are tons of statis­ti­cal mod­el­ing al­gorithms that are more so­phis­ti­cated than SPRs.

Un­less, of course, SPR is an­other word for “any statis­ti­cal mod­el­ing al­gorithm”, in which case this is just the claim that statis­ti­cal ma­chine learn­ing is a good ap­proach, which any­one as Bayesian as the av­er­age LessWronger prob­a­bly agrees with.

• There are tons of statis­ti­cal mod­el­ing al­gorithms that are more so­phis­ti­cated than SPRs.

Not in and of itself a good thing. As demonstrated recently, sophisticated statistics can suffice simply to allow one to confuse oneself in a sophisticated knot that’s harder to untie. There is a case to be made for promoting the simplest algorithm that outperforms current methods, and SPRs seem to fit this bill.

As for what SPR stands for, the post makes it pretty clear that they are a class of rules that pre­dict a (de­sired) prop­erty us­ing weighted cues (ob­serv­able prop­er­ties). I am not fa­mil­iar enough with statis­ti­cal mod­el­ling to say if that is a shared goal among all al­gorithms.
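To make "weighted cues" concrete, here is a minimal sketch of the linear form given in the post, P = w1(c1) + w2(c2) + … + wn(cn). The weights and cue values are hypothetical placeholders, not those of any published SPR, and the function name is invented for this example.

```python
def spr_score(weights, cues):
    """Weighted sum of cue values: P = w1*c1 + w2*c2 + ... + wn*cn."""
    if len(weights) != len(cues):
        raise ValueError("need exactly one weight per cue")
    return sum(w * c for w, c in zip(weights, cues))


# Hypothetical weights and cue values, for illustration only.
weights = [2.0, -1.0, 0.5]
cues = [3.0, 1.0, 4.0]
print(spr_score(weights, cues))  # 2.0*3.0 - 1.0*1.0 + 0.5*4.0
```

In a proper linear model the weights would be fitted to past data, for instance by regression; here they are simply typed in.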

• The post gives an ex­am­ple of an SPR that uses weighted cues. But he speci­fi­cally says

This par­tic­u­lar SPR is called a proper lin­ear model,

in­di­cat­ing that there are other types of SPRs, and I cur­rently have no idea what those other types might be.

I agree with you that complicated statistical tests can lead to spurious results; simple statistical tests can also lead to spurious results if the person using them doesn’t understand them. I naively associate both of these with “the test was designed to correct against a different type of flaw in experimental design than actually occurred”.

When the fo­cus of the statis­ti­cal test is on ac­cu­rately mod­el­ing a given situ­a­tion, I think it is less difficult to re­al­ize when a model choice makes sense and when it doesn’t, so more so­phis­ti­cated ap­proaches will prob­a­bly do bet­ter, since they come closer to carv­ing re­al­ity at its joints. This might be an in­fer­en­tial dis­tance er­ror on my part, though, since I have train­ing in this area, so er­rors that I per­son­ally can avoid might not be gen­er­ally avoid­able.

• I agree with you for smart peo­ple; I do see a lot of value, though, in idiot-proof statis­tics. Weighted-cue SPRs are al­most too sim­ple to screw up.

• Also, while this isn’t su­per-rele­vant, given that I already agree with your claim about peo­ple con­fus­ing them­selves, my im­pres­sion is that the link you gave pre­sents mod­er­ate-to-weak ev­i­dence against this.

I didn’t read the en­tire ar­ti­cle that was linked to dis­cussing the statis­ti­cal anal­y­sis (if there’s a par­tic­u­lar sec­tion you think I should read, please let me know), but my un­der­stand­ing was that in some sense the “ex­per­i­men­tal pro­ce­dure” was the is­sue, not the statis­tics. In other words, Bem con­sid­ered po­ten­tially hun­dreds of hy­pothe­ses about his data, but only re­ported on a few, so that p-val­ues of 0.02 are not su­per-im­pres­sive (since out of 100 hy­pothe­ses we would ex­pect a few to hit that by chance).

Bem’s ex­per­i­ments all ba­si­cally ask “is this coin bi­ased”, which isn’t a very com­pli­cated ques­tion to an­swer. It is the so­phis­ti­cated statis­tics that cor­rects for the flawed pro­ce­dure.
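The "out of 100 hypotheses" point is easy to check numerically. The sketch below assumes, purely for illustration, 100 independent tests of true null hypotheses, each with a 0.02 chance of a false positive; the real hypotheses were presumably neither exactly 100 in number nor independent, and the function names are made up here.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests that reach significance by chance alone."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha):
    """Chance that at least one of n independent null tests reaches significance."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_tests, alpha = 100, 0.02  # illustrative values taken from the comment above
print(expected_false_positives(n_tests, alpha))
print(round(prob_at_least_one(n_tests, alpha), 3))
```

Under these assumptions about two "hits" are expected by chance, and the probability of seeing at least one is well above 80%, which is why a handful of p = 0.02 results is unimpressive without knowing how many hypotheses were examined.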

• It wasn’t a very good ex­am­ple at all. I ba­si­cally grepped my mem­ory for “idiot statis­tics” and that one fea­tured strongly. The prob­lem there was not a mi­suse of statis­ti­cal tests, it was a mis­in­ter­pre­ta­tion of the sig­nifi­cance of statis­ti­cal tests.

• Are some SPRs easy to ex­ploit?

• Depends on what you’re mea­sur­ing. I can’t see how it would be ex­ploitable for things like pre­dict­ing wine qual­ity (ac­tu­ally green­hous­ing your grapes to con­trol tem­per­a­ture and rain­fall might just make them bet­ter) but definitely a spe­cific SPR for, say, rat­ing dossiers for hiring would be ex­ploitable if you knew or could guess at which cues it’s us­ing.

• SPRs sound a lot like the Outside View.

• SPRs sound like a method to en­sure a very ac­cu­rate out­side view.

‘Out­side view’, I be­lieve, is a term of Kah­ne­man’s, and is used in the liter­a­ture by lots of these peo­ple who work on SPRs, for ex­am­ple Dawes.

Kah­ne­man be­gins his Edge.org mas­ter class on think­ing by dis­cussing the out­side view.

• Well, SPRs can plausibly outperform average expertise. That’s because most of the expertise is an utter and complete sham.

Take the recidivism example...

The judges, or psy­chol­o­gists, or the like, what in the world makes them ex­perts on pre­dict­ing the crim­i­nals? Did they read an un­bi­ased sam­ple of re­ci­di­vism? Did they do any prac­tice, earn­ing marks for pre­dict­ing crim­i­nals? Any­thing?

Re­sound­ing no. They never in their lives did any­thing that should have earned them the ex­pert sta­tus on this task. They did other stuff that puts them first on the list when you’re look­ing for ‘ex­perts’ on a topic for which there is no ex­perts.

They are about as much ex­perts on this task as a court jan­i­tor is an ex­pert on law. He too did not do any­thing re­lated to law, he did clean the court­room.

• Does SPR beat pre­dic­tion mar­kets?

• If it did, then you could make a lot of money on a pre­dic­tion mar­ket with enough cash in it. This would cause the mar­ket to give bet­ter an­swers.

• I have two con­cerns about the prac­ti­cal im­ple­men­ta­tion of this sort of thing:

1. It seems like there are cases where if a rule is be­ing used then peo­ple could abuse it. For ex­am­ple, in job ap­pli­ca­tions or ad­mis­sions to med­i­cal schools. A bet­ter un­der­stand­ing of how the rule re­lates to what it pre­dicts would be needed.

If X+Y pre­dicts Z does that mean en­hanc­ing X and Y will up the prob­a­bil­ity of Z? Not nec­es­sar­ily, con­sider the ex­am­ple of happy mar­riages. Will hav­ing more sex make your re­la­tion­ship hap­pier? Or does the rule work be­cause happy cou­ples tend to have more sex?

2. It is not true in every case that we equally value all true beliefs, and equally value all false beliefs. Certain rules might work better if we take into consideration a person’s race, sex, religion and nationality. But most people find this sort of thing unpalatable because it can lead to the systematic persecution of subgroups, even if it results in more true, and fewer false, beliefs overall. It also might be the case that some of these rules discriminate against groups of people in more subtle ways that won’t be immediately obvious.

Of course neither of these problems means that there won’t be perfectly good cases where these rules would improve decision making a lot.

• Yes, sev­eral of these mod­els look like they’re likely to run into trou­ble of the Good­hart’s law type (“Any ob­served statis­ti­cal reg­u­lar­ity will tend to col­lapse once pres­sure is placed upon it for con­trol pur­poses”).

• Will hav­ing more sex make your re­la­tion­ship hap­pier?

Ob­vi­ously, yes.

• It prob­a­bly de­pends some­what on with whom you are hav­ing it.
• True. One of my nodes for “re­la­tion­ship” is con­sen­sual; most definitely in that case it would make the re­la­tion­ship much less happy.

• Well, un­less the qual­ity of the sex is causally linked to the quan­tity, such that hav­ing lots and lots of sex (past a cer­tain thresh­old) makes each in­di­vi­d­ual ses­sion dis­pro­por­tionately worse. This is true for a lot of peo­ple’s libidos.

To put it an­other way: it’s not the fre­quency of the mo­tion in the ocean, but the am­pli­tude of the waves.

• This is true for a lot of peo­ple’s libidos.

But prob­a­bly not true for the quan­tity of sex in al­most all re­la­tion­ships, I would bet.

• Although I agree with you, I feel like I should point out that it is some­what non­sen­si­cal for most re­la­tion­ships to be sub-op­ti­mal in this way. If both par­ties want to have more sex, and they can (oth­er­wise the ques­tion wouldn’t re­ally be valid), but they don’t, that’s a lit­tle weird, don’t you think?

We can talk about op­ti­miz­ing for other things (e.g. ca­reers), but I don’t think that’s re­ally the is­sue, since many cou­ples, when ex­plic­itly told that they would be hap­pier if they had more sex, just start hav­ing more sex, with­out sac­ri­fic­ing any­thing that they end up want­ing back.

• Although I agree with you, I feel like I should point out that it is some­what non­sen­si­cal for most re­la­tion­ships to be sub-op­ti­mal in this way. If both par­ties want to have more sex, and they can (oth­er­wise the ques­tion wouldn’t re­ally be valid), but they don’t, that’s a lit­tle weird, don’t you think?

Weird cer­tainly but this is a kind of weird­ness that hu­mans are no­to­ri­ous for. We are ter­rible hap­piness op­ti­misers. In the case of sex speci­fi­cally hav­ing more of it is not as sim­ple as walk­ing over to the bed­room. For males and fe­males al­ike you can want to be hav­ing more sex, be aware that hav­ing more sex would benefit your re­la­tion­ship and still not be ‘in the mood’ for it. A more in­di­rect ap­proach to the prob­lem of libido and de­sire is re­quired—the sort of thing that hu­mans are not nat­u­rally good at op­ti­mis­ing.

• I agree on ev­ery point. I also think part of this is sim­ply that shared knowl­edge that is not com­mon knowl­edge (un­til ac­knowl­edged be­tween par­ties) is much more difficult to act upon.

I think that “okay, we’re go­ing to have sex now, be­cause it will make us hap­pier” is a lit­tle like “okay, I’m go­ing to the gym now, be­cause it will make me feel bet­ter”, which may be the same thing you meant about be­ing “in the mood”, but I think it’s even harder for sex, be­cause we are per­haps less will­ing to see sex ex­cept as im­me­di­ate grat­ifi­ca­tion.

• I’ve heard more than once that hav­ing more sex on a sched­ule in the hopes of hav­ing chil­dren is a mis­er­able ex­pe­rience for cou­ples with fer­til­ity prob­lems.

I don’t know whether hav­ing more sex in the hopes of be­ing hap­pier (rather than be­cause the peo­ple in­volved want sex more for the fun of it) could have similar side effects.

• It’s fairly com­mon for sex ther­a­pists to recom­mend that cou­ples sched­ule sex and have sex at all (but not only) sched­uled times, on the grounds that peo­ple may not be in the mood at first, but en­joy it any­way. While it may be a mis­er­able ex­pe­rience for a few peo­ple, I doubt that it is mis­er­able in gen­eral (and I’m not sure why it would be).

• It’s cer­tainly pos­si­ble for peo­ple to have akra­sia in re­gards to plea­sure, and schedul­ing can help with that.

I think pos­si­ble prob­lems come in if a part­ner (pos­si­bly both part­ners in the case of fer­til­ity) re­ally doesn’t want to at the mo­ment, but is feel­ing pres­sured.

• Will hav­ing more sex make your re­la­tion­ship hap­pier?

I think it’s safe to say that hav­ing less sex will make the re­la­tion­ship less happy, so there is some causal­ity in­volved.

• Not nec­es­sar­ily, con­sider the ex­am­ple of happy mar­riages. Will hav­ing more sex make your re­la­tion­ship hap­pier?

Yes. Al­most cer­tainly. But there are plenty of other ex­am­ples you could pick from where there is not causal­ity in­volved (and some for which causal­ity is nega­tive).

• Will having more sex make your relationship happier?

Hav­ing more sex will make ME hap­pier. If my wife finds out though…

• Be­sides the le­gal is­sues with dis­crim­i­na­tion and dis­parate im­pact, an­other im­por­tant is­sue here is that jobs that in­volve mak­ing de­ci­sions about peo­ple tend to be high-sta­tus. As a very gen­eral ten­dency, the higher-sta­tus a pro­fes­sion is, the more its prac­ti­tion­ers are likely to or­ga­nize in a guild-like way and re­sist in­tru­sive in­no­va­tions by out­siders—es­pe­cially in­no­va­tions in­volv­ing perfor­mance met­rics that show the cur­rent stan­dards of the pro­fes­sion in a bad light, or even worse, those that threaten a change in the way their work is done that might lower its sta­tus.

Dis­cus­sions of such cases in medicine are a reg­u­lar fea­ture on Over­com­ing Bias, but it ex­ists in a more or less pro­nounced form in any other high-sta­tus pro­fes­sion too. How much it ac­counts for the spe­cific cases dis­cussed in the above ar­ti­cle is a com­plex ques­tion, but this phe­nomenon should cer­tainly be con­sid­ered as a plau­si­ble part of the ex­pla­na­tion.

• Some­times, be­ing ra­tio­nal is easy. When there ex­ists a re­li­able statis­ti­cal pre­dic­tion rule for the prob­lem you’re con­sid­er­ing, you need not waste your brain power try­ing to make a care­ful judg­ment.

Un­for­tu­nately lin­ear mod­els for a lot of situ­a­tions are sim­ply not available. The dozen or so ones in the liter­a­ture are the ex­cep­tion, not the rule.

• And those that ex­ist are not always easy to find.
And those that are found are not always easy to use in in­dus­try (where so­phis­ti­cated com­puter skills are of­ten the things the mar­ket­ing grad taught er­self to do in Ex­cel).

• You speak of incredible success without giving a success rate for the models. The fact that there are a dozen cases where specific models outperformed human reasoning doesn’t prove much.

At the moment you recommend that other people use SPRs for their decision making based on “expert judgment”. How about providing us with an SPR that tells us for which problems we should use SPRs?

• SPRs can be gamed much more directly than human experts. For example, imagine an SPR in place of all hiring managers. In our current situation, with hiring managers, we can guess at what goes into their decision-making and attempt to optimize for it, but because each manager is somewhat different, we can’t know it that well. A single SPR that took over for all the managers, or even a couple of very popular ones, would strongly encourage applicants to optimize for the variable most heavily weighted in the equation. Over time this would likely decrease the value of the SPR back to that of a human expert.

This has a name in the liter­a­ture, but I can’t re­mem­ber it at the mo­ment. You see this prob­lem in, for ex­am­ple, the cur­rent ob­ses­sive fo­cus on GDP as the only mea­sure of na­tional well-be­ing. Now that we’ve had that mea­sure for some time, we’re able to have coun­tries whose GDP is im­prov­ing but who suck on lots of other mea­sures, and thus poli­ti­ci­ans who are proud of what they’ve done but who are hated by the peo­ple.

Yes, in some cases, this would cause us to im­prove the SPR to the point where it ac­cu­rately re­flected the qual­ities that go into suc­cess. But that’s not a proven thing.

That said, I’d re­ally like to see a wiki or other at­tempt­ing-to-be-com­plete re­source for find­ing an SPR for any par­tic­u­lar ap­pli­ca­tion. Any­one got one?

• This has a name in the liter­a­ture, but I can’t re­mem­ber it at the moment

Good­hart’s Law

A sin­gle SPR that took over for all the man­agers, or even a cou­ple of very pop­u­lar ones, would strongly en­courage ap­pli­cants to op­ti­mize for the vari­able most weighted in the equa­tion.

W1(Quan­ti­ta­tive skills) + W2(Writ­ten and Oral Com­mu­ni­ca­tion Skills) + W3(Abil­ity to work with loose su­per­vi­sion) + W4(Do­main Ex­per­tise) + W5(So­cial Skills) + W6(Pres­tige Mark­ers)

That said, I’d re­ally like to see a wiki or other at­tempt­ing-to-be-com­plete re­source for find­ing an SPR for any par­tic­u­lar ap­pli­ca­tion. Any­one got one?

No, but I imag­ine that tak­ing a grab bas­ket of plau­si­ble cor­re­lates of the de­sired trait and throw­ing them into a re­gres­sion func­tion would be a good first draft. Then iter­ate.
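A minimal sketch of that "first draft", assuming made-up cue scores and outcomes. The names `fit_linear_model` and `predict` and all the data are hypothetical, and plain least squares via the normal equations stands in for whatever regression routine you would actually use:

```python
def fit_linear_model(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    # Prepend a constant 1.0 so the first weight acts as the intercept.
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("cues are collinear; weights are not identifiable")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, cues):
    """P = w0 + w1*c1 + w2*c2 + ... for one candidate's cue values."""
    return weights[0] + sum(w * c for w, c in zip(weights[1:], cues))


# Made-up illustration data: two cue scores per past hire and an outcome rating.
past_cues = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
# Outcomes are generated noise-free from known weights so the fit can be checked.
past_outcomes = [2 + 1.0 * a + 0.5 * b for a, b in past_cues]

weights = fit_linear_model(past_cues, past_outcomes)
print([round(w, 3) for w in weights])
print(round(predict(weights, (2, 2)), 3))
```

With real data the outcomes would be noisy, so the "then iterate" step would mean checking predictions on cases the model was not fitted on before adding or dropping cues.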

• Correct me if I’m wrong, but the SPR is just a linear model, right? Statistics is an underappreciated field in many walks of life. My own field of speciality, experimental design, is treated with downright suspicion by scientists who have not encountered it before and who find the results counter-intuitive (when they have 4 controllable variables in an experiment they want to vary them one at a time, while the best way is to vary all 4 simultaneously...).

• I also find that counter-in­tu­itive, is there a short ex­pla­na­tion of why?

• I am cu­ri­ous: could you ex­plain why it is bet­ter to vary all 4?

• Briefly: because varying them one at a time assumes that they do not interact, and if they DO interact, you will have gathered no information on said interactions.

• That makes sense… if your in­puts are X and Y, and you want to figure out what your out­put f(X,Y) is, it seems like you’ll even­tu­ally have to vary X and Y si­mul­ta­neously in or­der to tell the differ­ence be­tween f(X,Y) = aXY + c and f(X,Y) = aX + bY + c.
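A small sketch of that point, with made-up coefficients. Two hypothetical response functions, one additive and one with an added XY term, give identical readings on every run of a one-at-a-time design centred at (0, 0), and only come apart when both inputs are moved together:

```python
def additive(x, y, b=2.0, c=3.0, d=1.0):
    """Hypothetical response with no interaction term."""
    return b * x + c * y + d


def with_interaction(x, y, a=5.0, b=2.0, c=3.0, d=1.0):
    """Same response plus an interaction term a*X*Y."""
    return a * x * y + b * x + c * y + d


# One-at-a-time: move X with Y held at 0, then Y with X held at 0.
one_at_a_time = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
# Full factorial: every combination of the two extreme settings.
factorial = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def distinguishable(design):
    """True if some run in the design gives different readings for the two models."""
    return any(additive(x, y) != with_interaction(x, y) for x, y in design)


print("one-at-a-time can tell them apart:", distinguishable(one_at_a_time))
print("factorial can tell them apart:   ", distinguishable(factorial))
```

In the one-at-a-time runs the product XY is always zero, so no amount of data from that design says anything about the interaction coefficient.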

• Quite, although usually you’ll have a model f(X,Y) = aXY + bX + cY + d. I’m actually underselling this approach, because if I had two variables X and Y which can each be observed in the range (-1,1), and only had two observations to work with, then we’re much better off choosing (X,Y) = (-1,1) and (1,-1) rather than (0,1) and (1,0), because we gather more information.

We always want to design in the location with the most variance, because that’s the hardest place to predict. Given that the model we’re looking at is linear in both the parameters and the variables, we know the places where we get the most variation will be at the extremes. Obviously we have no information if we think there might be some kind of quadratic terms here, but one of the nice things about design for linear models is that you can build your experimentation to iteratively build up information.
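A one-variable illustration of why the extremes are more informative, assuming a simple straight-line fit with independent noise of equal variance at every run. In that setting the variance of the estimated slope is the noise variance divided by the spread of the chosen x values; the function name is hypothetical:

```python
def slope_variance(xs, noise_variance=1.0):
    """Variance of the least-squares slope for a straight-line fit at settings xs.

    Standard result for simple linear regression: sigma^2 / sum((x - mean)^2).
    """
    mean_x = sum(xs) / len(xs)
    spread = sum((x - mean_x) ** 2 for x in xs)
    return noise_variance / spread


at_extremes = slope_variance([-1, 1])  # runs placed at both ends of the range
inside_range = slope_variance([0, 1])  # runs placed at the centre and one end

print("slope variance, runs at -1 and 1:", at_extremes)
print("slope variance, runs at 0 and 1: ", inside_range)
```

Under these assumptions, the same two runs give a slope estimate with a quarter of the variance when they are placed at the extremes.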

Typ­i­cally in an in­dus­trial set­ting we’ll have a few dozen differ­ent fac­tors which we think might af­fect our out­come, so we can de­sign to elimi­nate down to a hand­ful by us­ing a very ba­sic lin­ear model in a screen­ing ex­per­i­ment, then use a more so­phis­ti­cated de­sign called a cen­tral com­pos­ite de­sign.

Now if we want a mechanis­tic model, some­thing based on what we know on the physics of the situ­a­tion (say we have some differ­en­tial equa­tions de­scribing the re­ac­tion), then de­sign­ing be­comes harder, which is where my re­search is.

• While this is promis­ing in­deed, it is wise not to for­get about Op­ti­miza­tion By Proxy that can oc­cur when the thing be­ing op­ti­mised is (or is un­der the con­trol of) an in­tel­li­gent agent.

• The thing that makes me twitch about SPRs is a con­cern that they won’t change when the un­der­ly­ing con­di­tions which cre­ated their data sets change. This doesn’t mean that hu­mans are good at notic­ing that sort of thing, ei­ther. How­ever, it’s at least worth think­ing about which ap­proach is likely to over­shoot worse when some­thing sur­pris­ing hap­pens. Or whether there’s some rea­son to think that the greater usual ac­cu­racy of SPRs leads to enough big­ger re­serves that the oc­ca­sional over­shoot prob­lem (if such are worse than in a non-SPR sys­tem) is com­pen­sated for.

• Hi Luke,

Great post. Will be writ­ing some­thing about the le­gal uses of SPRs in the near fu­ture.

Any­way, the link to the Grove and Meehl study doesn’t seem to work for me. It says the file is dam­aged and can­not be re­paired.

• At­lantic, The Brain on Trial:

In the past, re­searchers have asked psy­chi­a­trists and pa­role-board mem­bers how likely spe­cific sex offen­ders were to re­lapse when let out of prison. Both groups had ex­pe­rience with sex offen­ders, so pre­dict­ing who was go­ing straight and who was com­ing back seemed sim­ple. But sur­pris­ingly, the ex­pert guesses showed al­most no cor­re­la­tion with the ac­tual out­comes. The psy­chi­a­trists and pa­role-board mem­bers had only slightly bet­ter pre­dic­tive ac­cu­racy than coin-flip­pers. This as­tounded the le­gal com­mu­nity.

So re­searchers tried a more ac­tu­ar­ial ap­proach. They set about record­ing dozens of char­ac­ter­is­tics of some 23,000 re­leased sex offen­ders: whether the offen­der had un­sta­ble em­ploy­ment, had been sex­u­ally abused as a child, was ad­dicted to drugs, showed re­morse, had de­viant sex­ual in­ter­ests, and so on. Re­searchers then tracked the offen­ders for an av­er­age of five years af­ter re­lease to see who wound up back in prison. At the end of the study, they com­puted which fac­tors best ex­plained the re­offense rates, and from these and later data they were able to build ac­tu­ar­ial ta­bles to be used in sen­tenc­ing.

Which fac­tors mat­tered? Take, for in­stance, low re­morse, de­nial of the crime, and sex­ual abuse as a child. You might guess that these fac­tors would cor­re­late with sex offen­ders’ re­ci­di­vism. But you would be wrong: those fac­tors offer no pre­dic­tive power. How about an­ti­so­cial per­son­al­ity di­s­or­der and failure to com­plete treat­ment? Th­ese offer some­what more pre­dic­tive power. But among the strongest pre­dic­tors of re­ci­di­vism are prior sex­ual offenses and sex­ual in­ter­est in chil­dren. When you com­pare the pre­dic­tive power of the ac­tu­ar­ial ap­proach with that of the pa­role boards and psy­chi­a­trists, there is no con­test: num­bers beat in­tu­ition. In court­rooms across the na­tion, these ac­tu­ar­ial tests are now used in pre­sen­tenc­ing to mod­u­late the length of prison terms.

• On in­ter­views, I had a great deal of suc­cess hiring for cler­i­cal as­sis­tant po­si­tions by sim­ply get­ting the in­ter­vie­wees to do a sim­ple prob­lem in front of us. It turned out to be a great, re­li­able and easy-to-jus­tify sorter of can­di­dates.

But, of course, it was nei­ther un­struc­tured nor much of an “in­ter­view” as such.

• Again, test not in­ter­view. Their GPA is an av­er­age mea­sure of maybe thou­sands of such sim­ple prob­lems—prob­a­bly on av­er­age more rigor­ously pro­duced, pre­sented, and cor­rected than your prob­lem pre­sented in the in­ter­view.

De­cid­ing based on a test in per­son in­stead of de­cid­ing on a num­ber that rep­re­sents thou­sands of such in­di­vi­d­ual tests smacks of anec­do­tal de­ci­sion-mak­ing.

• Since when did greater rigour and av­er­ag­ing of more prob­lems im­ply greater de­gree of cor­re­la­tion with perfor­mance at one spe­cific job?

I call halo effect here. Greater rigour, big­ger num­ber, more ac­cu­rate, more cor­rected, all com­bined re­ally ‘good’ qual­ities about the GPA value spill over into your feel­ing of how well it’ll cor­re­late with perfor­mance at spe­cific job, ver­sus a ‘bad’ ill mea­sured value.

Truth is, say, ill mea­sured hand size based on eye­bal­ling can eas­ily cor­re­late bet­ter with mea­sured finger length, than body weight mea­sured us­ing ul­tra high pre­ci­sion sci­en­tific scales with ac­cu­racy of a mil­li­gram (micro­gram, nanogram, what­ever). Just be­cause ham­mer is a tool you build things with, and but­ter knife is a kitchen uten­sil, doesn’t make ham­mer bet­ter than but­ter knife as a screw driver.

• Just be­cause ham­mer is a tool you build things with, and but­ter knife is a kitchen uten­sil, doesn’t make ham­mer bet­ter than but­ter knife as a screw driver.

Well, ac­tu­ally...

But more on point, you’d need to jus­tify that the test you give is more cor­re­lated than GPA with perfor­mance—this is why I sup­port sim­ple pro­gram­ming tests (be­cause they demon­stra­bly are more cor­re­lated than aca­demic in­di­ca­tors) but for a ‘cler­i­cal as­sis­tant’ po­si­tion as de­scribed above, a spe­cific test doesn’t im­me­di­ately spring to mind, and so it’s sus­pect.

• You usually aren’t looking for ‘correlation’; you’re looking to screen out the serial job applicant who can’t do the job they’re applying for (and keeps re-applying to many places)… Just ask ’em to do some work similar to what they will be doing, as per LorenzofromOz’s method, and you’ll at least be assured they can do the work. With GPA you won’t be assured of anything whatsoever.

For the pro­gram­ming, the sim­plest dumb­est check works to screen out those en­tirely in­ca­pable, when screen­ing by PhD would not.

http://​​www.cod­inghor­ror.com/​​blog/​​2007/​​02/​​why-cant-pro­gram­mers-pro­gram.html

PhD might cor­re­late bet­ter with perfor­mance than fizzbuzz does (the lat­ter be­ing a bi­nary test of ex­tremely ba­sic knowl­edge), but PhD does not screen out those who will just waste your time, and fizzbuzz (your per­sonal vari­a­tion of it) does.
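For reference, a sketch of the FizzBuzz exercise as it is usually stated: print the numbers from 1 to 100, but print "Fizz" for multiples of three, "Buzz" for multiples of five, and "FizzBuzz" for multiples of both. The function name is just illustrative:

```python
def fizzbuzz(n):
    """Return the FizzBuzz output for a single number."""
    if n % 15 == 0:      # divisible by both 3 and 5
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


for i in range(1, 101):
    print(fizzbuzz(i))
```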

• Holy crap… I think I had read about the Fiz­zBuzz thing a while ago, but I didn’t re­mem­ber about the 199 in 200 thing… Would it be pos­si­ble to sue the in­sti­tu­tions is­su­ing those PhD or some­thing? :-)

• Well, I don’t know what % of the CS-re­lated PhDs can’t do Fiz­zBuzz, maybe the per­centage is rather small. (Also, sue for what? You are not their client. The in­ca­pable dude that was given a de­gree, that’s their client. Your over-val­u­a­tion of this de­gree as ev­i­dence of ca­pa­bil­ity is your own prob­lem)

The is­sue is that, as Joel ex­plains, the job ap­pli­cants are a sam­ple ex­tremely bi­ased to­wards in­com­pe­tence:

http://​​www.joelon­soft­ware.com/​​items/​​2005/​​01/​​27.html

[Though I would think that the in­com­pe­tents with de­grees would be more able to find in­com­pe­tent em­ployer to work at. And PhDs should be able to find a com­pany that hires PhDs for sig­nal­ling rea­sons]

The is­sue with the hiring meth­ods here, is that we eas­ily con­fuse “more ac­cu­rate mea­sure­ment of X” with “stronger cor­re­la­tion to Y”, and “stronger cor­re­la­tion to Y” with hiring bet­ter staff (the one that doesn’t sink your com­pany), usu­ally out of some dra­mat­i­cally differ­ent pop­u­la­tion than the one on which cor­re­la­tion was found.

Furthermore, a ‘correlation’ is such an inexact measure of how a test relates to performance. Comparing correlations is like comparing apples to oranges by weight. The ‘fizzbuzz’ style problems measure performance near the absolute floor level, but with very high reliability. Virtually no-one who fails fizzbuzz is a good hire. Virtually no-one who passes fizzbuzz (a unique fizzbuzz, not the popular one) is completely incapable of programming. The degrees correlate to performance at the higher level, but with very low reliability: there are brilliant people with degrees, there are complete incompetents with degrees, and there are brilliant people and incompetents without degrees.

edit: other ex­am­ple:

Reversing a linked list is a good one unless the candidate already knows how to. See, the issue is that educational institutions don’t teach how to think up a way to reverse a linked list. Nor do they test for that. They might teach how to reverse a linked list, then they might test whether the person can reverse a linked list. Some people learn to think of a way to solve such problems. Some don’t. It’s entirely incidental.
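For readers who have not met the exercise, one common way to do it is sketched below for a singly linked list. The `Node` class and helper names are hypothetical, and this is only one of several acceptable answers:

```python
class Node:
    """A minimal singly linked list node."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node


def reverse(head):
    """Reverse the list in place and return the new head."""
    previous = None
    current = head
    while current is not None:
        following = current.next_node  # remember the rest of the list
        current.next_node = previous   # point this node backwards
        previous = current             # advance both pointers
        current = following
    return previous


def to_list(head):
    """Collect node values into a Python list for easy inspection."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next_node
    return values


print(to_list(reverse(Node(1, Node(2, Node(3))))))
```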

• Un­for­tu­nately, GPAs can lie. You can­not be cer­tain of the qual­ity of the prob­lems and eval­u­a­tion that was av­er­aged to pro­duce the GPA. So run­ning your own test of known difficulty works well to ver­ify what you see on the re­sume.

For ex­am­ple, I have to hire pro­gram­mers. We give all in­com­ing pro­gram­mers a few rel­a­tively easy pro­gram­ming prob­lems as part of the in­ter­view pro­cess be­cause we’ve found that no mat­ter what the re­sume says, it’s pos­si­ble that they ac­tu­ally do not know how to pro­gram.

Good re­sume + good in­ter­view re­sult is a much stronger in­di­ca­tor than good re­sume alone.

• A sig­nifi­cant prob­lem is the weight­ing of cer­tain courses, par­tic­u­larly Ad­vanced Place­ment ones. A GPA of 3.7, seem­ing quite re­spectable to the un­aware, can be ob­tained by work of qual­ity 83%, and that’s as­sum­ing the class didn’t offer ex­tra credit.

• I don’t think he is likely to hire pro­gram­mers straight out of high school.

Giving IB/AP/Honors classes extra weight in high school is necessary to offset the additional difficulty of these classes. Otherwise, high school students would have a direct disincentive to take advanced classes.

• Giving IB/AP/Honors classes extra weight in high school is necessary to offset the additional difficulty of these classes. Otherwise, high school students would have a direct disincentive to take advanced classes.

A swift googling brings up this forth­com­ing study of about 900 high schools in Texas:

De­spite con­ven­tional wis­dom to the con­trary, grade weight­ing is not the pri­mary fac­tor driv­ing stu­dents to in­crease their AP course-tak­ing. More­over, a lack of in­sti­tu­tional knowl­edge about the im­por­tance of grade-weight­ing does not have a prac­ti­cally sig­nifi­cant ad­verse im­pact on stu­dents with low his­tor­i­cal par­ti­ci­pa­tion rates in AP, al­though low in­come stu­dents are marginally less re­spon­sive to in­creases in the AP grade weight than oth­ers. The min­i­mal con­nec­tion be­tween AP grade weights and course-tak­ing be­hav­ior may ex­plain why schools tin­ker with their weights, mak­ing changes in the hopes of find­ing the sweet spot that elic­its the de­sired stu­dent AP-tak­ing rates. The re­sults pre­sented here sug­gest that there is no sweet spot and that schools should look el­se­where for ways to in­crease par­ti­ci­pa­tion in rigor­ous courses.

• But there’s still the ad­di­tional in­cen­tive of pres­tige and sig­nal­ling, isn’t there? That should be enough for the se­ri­ous scholar. It’s a sig­nifi­cant prob­lem when non-AP-la­bel­led courses are of­ten passed over for the pur­pose of a cheap grade boost.

• The post men­tions the ex­perts us­ing the re­sults of the SPR. What hap­pens if you re­verse it, and give the SPR the pre­dic­tion of the ex­pert?

• That’s called a ‘boot­strapped’ SPR. It’s one way of build­ing an SPR. And yes, in many cases the SPR ends up be­ing re­li­ably bet­ter than the ex­pert judg­ments that were used to build it.

• I was won­der­ing more how much bet­ter it is than a nor­mal SPR. Also, I won­der what weight it would give to the ex­pert.

• Peo­ple look­ing for ad­di­tional re­sources on this mat­ter should know that such lin­ear mod­els are of­ten called “multi at­tribute util­ity mod­els” (MAUT), and that they’re dis­cussed ex­ten­sively in the liter­a­ture of de­ci­sion anal­y­sis and multi-crite­ria de­ci­sion mak­ing. They’re also used in choice mod­els in the sci­ence of mar­ket­ing.

The word “statis­ti­cal” in the name used here is a bit of a red her­ring.

• AI sys­tems can gen­er­ally whoop hu­mans when a limited fea­ture set can be dis­cov­ered that cov­ers the span of a large class of ex­am­ples to good effect. The challenge is when you seem­ingly need a new fea­ture for each new ex­am­ple in or­der to differ­en­ti­ate it from the rest of the ex­am­ples in that class. Essen­tially you are say­ing that the prob­lem can be mapped to a sim­ple func­tion. Some prob­lems can.

Let’s imagine we are classifying avian vs. reptile. Our first example might be a gecko, and we might say ‘well it’s green’. So ‘Color is Green’ is a clue/feature and that works coincidentally for a few more examples. Then you get a parrot as an example, and you decide to add ‘Has a beak’. Then you get the example of a turtle, and so you add ‘Has a shell’, etc. It seems to me the success of these systems boils down to whether the features can be added at a minimal rate compared to the examples on hand.

Where AI’s com­pete well gen­er­ally they beat trained hu­mans fairly marginally on easy (or even most) cases, and then fail mis­er­ably at bor­der or novel cases. This can make it dan­ger­ous to use them if the ex­treme failures are dan­ger­ous.

As to why hu­mans can’t en­sem­ble with the ma­chines, I sus­pect that has mostly to do with the hu­mans not be­ing well-trained to do so.

• A fair point and good cau­tion against turn­ing SPRs into your ham­mer for ev­ery nail, but ir­rele­vant in the case luke­prog is dis­cussing; we already have the ex­pert sys­tem, we already know it works bet­ter than the ex­perts, we just aren’t us­ing it.

• Irrelevant is excessive. When you say ‘system A works better than system B’ this implies that system A should be used and this is clear cut. But the notion ‘works better’ lacks a rigorous definition. Is the machine better if it wins 90% of the time by 5%, and fails the other 10% by 40%? It’s not as simple as saying .9 × .05 > .1 × .4. The cost of error isn’t necessarily linear.
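To make the non-linearity point concrete, here is a sketch using the figures above. The quadratic penalty is an assumption chosen only to illustrate a cost that punishes large misses disproportionately; all names are hypothetical:

```python
# (probability, size of shortfall) for each system, relative to the other.
# The machine is ahead by 5% in 90% of cases and behind by 40% in the other 10%.
human_shortfalls = [(0.9, 0.05)]    # cases where the human trails the machine
machine_shortfalls = [(0.1, 0.40)]  # cases where the machine trails the human


def expected_cost(shortfalls, cost):
    """Probability-weighted cost of a system's shortfalls."""
    return sum(p * cost(size) for p, size in shortfalls)


def linear(size):
    """Cost proportional to the size of the miss."""
    return size


def quadratic(size):
    """Assumed cost in which big misses hurt disproportionately."""
    return size ** 2


for name, cost in [("linear", linear), ("quadratic", quadratic)]:
    human = expected_cost(human_shortfalls, cost)
    machine = expected_cost(machine_shortfalls, cost)
    winner = "machine" if machine < human else "human"
    print(f"{name}: human={human:.5f} machine={machine:.5f} -> {winner} looks better")
```

With a linear cost the machine comes out slightly ahead; with the assumed quadratic cost the ranking flips, which is the sense in which "works better" depends on the cost function.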

Now why these sys­tems aren’t used in en­sem­bles with hu­mans is in­deed a great ques­tion. I can imag­ine that in most cases we could also ask ‘why don’t we dou­ble the num­ber of ex­perts who are col­lab­o­rat­ing on a given prob­lem?’ un­der the pre­sump­tion that more minds would likely re­sult in a bet­ter perfor­mance across the board. I wouldn’t be sur­prised if there was a lot of over­lap in the an­swers. Co­or­di­na­tion difficulty is likely high up there. Thus,

con­sider the fact that even when ex­perts are given the re­sults of SPRs, they still can’t out­perform those SPRs

pos­si­bly be­comes the ex­pla­na­tion.

• When you say ‘sys­tem A works bet­ter than sys­tem B’ this im­plies that sys­tem A should be used and this is clear cut. But the no­tion ‘works bet­ter’ lacks a rigor­ous defi­ni­tion.

What? These are generally binary decisions, with a known cost to false positives and false negatives, and known rates of false positives and false negatives. It should be trivial to go from that to a utility-valued error score.

• You just pre­sumed away my ar­gu­ment. I claim speci­fi­cally that the re­la­tion­ship be­tween var­i­ous classes of er­rors is not well-defined. This can lead to abuse of the term ‘bet­ter’.

Please tell me why I should take that as a pre­sump­tion.

• Be­cause those are the class of prob­lems this post dis­cusses.

From the top of the post:

A pa­role board con­sid­ers the re­lease of a pris­oner: Will he be vi­o­lent again? A hiring officer con­sid­ers a job can­di­date: Will she be a valuable as­set to the com­pany? A young cou­ple con­sid­ers mar­riage: Will they have a happy mar­riage?

The cached wis­dom for mak­ing such high-stakes pre­dic­tions is to have ex­perts gather as much ev­i­dence as pos­si­ble, weigh this ev­i­dence, and make a judg­ment. But 60 years of re­search has shown that in hun­dreds of cases, a sim­ple for­mula called a statis­ti­cal pre­dic­tion rule (SPR) makes bet­ter pre­dic­tions than lead­ing ex­perts do.

• A pa­role board con­sid­ers the re­lease of a pris­oner: Will he be vi­o­lent again?

I think this is the kind of ques­tion that Miller is talk­ing about. Just be­cause a sys­tem is cor­rect more of­ten, doesn’t nec­es­sar­ily mean it’s bet­ter.

For ex­am­ple if the hu­man ex­perts al­lowed more peo­ple out who went on to com­mit rel­a­tively minor vi­o­lent offences and the SPRs do this less of­ten, but are more likely to re­lease pris­on­ers who go on to com­mit mur­der then there would be le­gi­t­i­mate dis­cus­sion over whether the SPR is ac­tu­ally bet­ter.

I think this is ex­actly what he is talk­ing about when he says

Where AI’s com­pete well gen­er­ally they beat trained hu­mans fairly marginally on easy (or even most) cases, and then fail mis­er­ably at bor­der or novel cases. This can make it dan­ger­ous to use them if the ex­treme failures are dan­ger­ous.

Whether or not there is ev­i­dence that says this is a real effect I don’t know, but to ad­dress it what you re­ally need to mea­sure is to­tal util­ity of out­comes rather than ac­cu­racy.

• Yes. You got it, ex­actly.

• No. I’m talk­ing about classes of er­rors.

As in, which is bet­ter?

• A test that re­ports 100 false pos­i­tives for ev­ery 100 false nega­tives for dis­ease X

• A test that re­ports 110 false pos­i­tives for ev­ery 90 false nega­tives for dis­ease X

The cost of fp vs. fn is not defined au­to­mat­i­cally. If hu­mans are closer to #1 than #2, and I de­velop a sys­tem like #2, I might define #2 to be bet­ter. Then later on down the line I stop talk­ing about how I defined bet­ter, and I just use the word bet­ter, and no one ques­tions it be­cause hey… bet­ter is bet­ter, right?

• Which is more costly, false pos­i­tives or false nega­tives? This is an easy ques­tion to an­swer.

If false pos­i­tives, #1 is bet­ter. If false nega­tives, #2. I re­ally do not see what your point is. Th­ese prob­lems you bring up are eas­ily solved.
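A sketch of that comparison, using the error counts from the two tests above and made-up per-error costs (the cost values and names are hypothetical):

```python
def total_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Total cost of a test's errors given a cost per error type."""
    return false_positives * cost_fp + false_negatives * cost_fn


test_1 = (100, 100)  # (false positives, false negatives)
test_2 = (110, 90)


def better_test(cost_fp, cost_fn):
    """Return which test has the lower total cost, or 'tie'."""
    c1 = total_cost(*test_1, cost_fp, cost_fn)
    c2 = total_cost(*test_2, cost_fp, cost_fn)
    if c1 == c2:
        return "tie"
    return "#1" if c1 < c2 else "#2"


print("false positives twice as costly:", better_test(cost_fp=2, cost_fn=1))
print("false negatives twice as costly:", better_test(cost_fp=1, cost_fn=2))
print("equal costs:", better_test(cost_fp=1, cost_fn=1))
```

The arithmetic is easy once the two costs are fixed; the disagreement in this thread is about where those costs come from.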

• Which is bet­ter: Re­leas­ing a vi­o­lent pris­oner, or keep­ing a harm­less one in­car­cer­ated? If you can find an an­swer that 90% of the pop­u­la­tion agrees on, then I think you’ve done bet­ter than ev­ery poli­ti­cian in his­tory.

That people do NOT agree suggests to me that it’s hardly a trivial question...

• Re­leas­ing a vi­o­lent pris­oner, or keep­ing a harm­less one in­car­cer­ated?

How vi­o­lent, how pre­ventably vi­o­lent, how harm­less, how in­car­cer­ated, how long in­car­cer­ated? For any spe­cific case with these agreed-upon, I am con­fi­dent a su­per­ma­jor­ity would agree.

That people do NOT agree suggests to me that it’s hardly a trivial question...

That people don’t agree suggests one side is comparing releasing a serial killer to incarcerating a drifter in jail a short while, and the other side is comparing releasing a middle-aged man who in a fit of passion struck his adulterous wife to incarcerating Gandhi for the term of his natural life. More generally, they are deciding based on one specific example they have strongly available to them.

In the state you phrased it, that ques­tion is about as an­swer­able as “how long is a piece of string?”.

• Yes. Thank you. Since at least one per­son un­der­stood me, I’m gonna jump off the merry-go-round at this point.

• (For reference, I realize an expert runs into the same issue; I just think it’s unfair to say that the issue is “easily solved”.)

• Many tests have a continuous, adjustable parameter for sensitivity, letting you set the trade-off however you want. In that case, we can refrain from judging the relative badness of false positives and false negatives, and use ROCA (the area under the ROC curve), which is basically the integral over all such trade-offs. Tests that are going to be combined into a larger predictor are usually measured this way.

Ma­chine learn­ing pack­ages gen­er­ally let you spec­ify a “cost ma­trix”, which is the cost of each pos­si­ble con­fu­sion. For a 2-val­ued test, it would be a 2x2 ma­trix with ze­roes on the di­ag­o­nal, and the cost of A->B and B->A er­rors in the other two spots. For a test with N pos­si­ble re­sults, the ma­trix is NxN, with ze­roes on the di­ag­o­nals, and each (row,col) po­si­tion is the cost of a mis­take that con­fuses the re­sult cor­re­spond­ing to that row with the re­sult cor­re­spond­ing to that column.
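Both ideas fit in a few lines of plain Python. This is only a sketch: the scores, labels, confusion counts and costs below are made up, and `roc_area` and `total_cost` are hypothetical helper names rather than any package’s API. The ROC area is computed from its pairwise definition: the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one.

```python
def roc_area(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def total_cost(confusion, cost):
    """Sum of count * cost over every (row, col) cell of the two matrices."""
    size = len(cost)
    return sum(
        confusion[row][col] * cost[row][col]
        for row in range(size)
        for col in range(size)
    )


# Made-up test scores and true labels (1 = has the condition).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(roc_area(scores, labels))

# Made-up 2x2 example: zeroes on the diagonal, and one kind of confusion
# is assumed to be five times as costly as the other.
confusion = [[50, 10], [5, 35]]
cost = [[0, 1], [5, 0]]
print(total_cost(confusion, cost))
```

The ROC area needs no opinion about costs; the cost matrix is where that opinion gets written down once you do have one.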

• Keep in mind this is in the con­clu­sion of luke­prog’s post:

When there ex­ists a re­li­able statis­ti­cal pre­dic­tion rule for the prob­lem you’re considering

Now,

But the notion ‘works better’ lacks a rigorous definition. Is the machine better if it wins 90% of the time by 5%, and fails the other 10% by 40%? It’s not as simple as saying .9 × .05 > .1 × .4. The cost of error isn’t necessarily linear.

If the cost of er­ror isn’t lin­ear, de­ter­mine what func­tion it fol­lows, then use that func­tion in­stead of a lin­ear func­tion to com­pare the rel­a­tive costs, which will tell you which works bet­ter.
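As a sketch of what that looks like for the 90%/5% versus 10%/40% numbers quoted above: the quadratic cost function here is an arbitrary stand-in for “whatever function the cost actually follows”, and reading “wins by 5%” as an error margin of 0.05 is my interpretation, not something the quoted comment spells out.

```python
def net_benefit(cost):
    """Expected gain minus expected loss for the machine, under `cost`.

    Assumes the machine is ahead by a margin of 0.05 in 90% of cases
    and behind by a margin of 0.40 in the remaining 10%.
    """
    gain = 0.9 * cost(0.05)
    loss = 0.1 * cost(0.40)
    return gain - loss


def linear(margin):
    return margin


def quadratic(margin):
    # Assumed example of a cost that punishes large errors more heavily.
    return margin ** 2


print(net_benefit(linear))     # positive: machine comes out ahead
print(net_benefit(quadratic))  # negative: machine comes out behind
```

Under the linear cost the machine narrowly wins; under the quadratic one it loses. The comparison is mechanical once a cost function is chosen.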

Ir­rele­vant is ex­ces­sive.

I stand by it. The post is say­ing, given that SPRs work, work bet­ter than ex­perts, and don’t fail where ex­perts don’t, we should use them in­stead of ex­perts. Your points were that SPRs don’t always work, tend not to work in bor­der cases, and might fail in dan­ger­ous cases. The first point is only true in cases this post is not con­cerned with, the sec­ond is equally true of ex­perts and SPRs, and the third is also equally true of ex­perts and SPRs.

• Also, there is an article by Dawes, Faust and Meehl. Despite the fact that it was published 7 years prior to House of Cards, it contains some information not described in chapter 3 of House of Cards.

For ex­am­ple, the awe­some re­sult by Gold­berg: lin­ear mod­els of hu­man judges were more ac­cu­rate than hu­man judges them­selves:

in cases of dis­agree­ment, the mod­els were more of­ten cor­rect than the very judges on whom they were based.

• Thank you for this article. Some people may react badly to finding that their professional opinion is less accurate than a simple formula, but I get excited instead. It’s such a great opportunity to become more accurate, with such comparatively little effort! I’m particularly interested in the medical SPRs; I aim to be a doctor, and if these will help me be better than the average doctor in many cases, then so be it. I suspect that I’ll have to use them secretly.

• Other re­lated read­ing that I don’t think has been men­tioned yet:

Ian Ayres (cofounder of stickK.com) has a popular book called Super Crunchers that argues this exact thesis. http://www.amazon.com/Super-Crunchers-Thinking-Numbers-Smart/dp/0553805401

A classic is Tetlock’s Expert Political Judgment. http://press.princeton.edu/titles/7959.html

• I think the reason I don’t use statistics more often is the difficulty of getting good data sets; and even when there is good data, there are often ethical problems with following it. For example: Bob lives in America, and is seeking to maximize his happiness. Americans who report high levels of spiritual conviction are twice as likely to report being “very happy” as the least religious. Should he become a devout Christian? There’s evidence that the happiness comes from holding the majority opinion; should he then strive to believe whatever the polls say is the most common belief in his area?

Another ex­am­ple: Bob has three kids; he knows his wife is cheat­ing on him, but he also knows the effect size of di­vorce on child out­comes (de­pres­sion, grades, in­come, sta­bil­ity of fu­ture re­la­tion­ships, etc.) is larger than smok­ing on lung can­cer, as­pirin on heart at­tacks, or cy­closporine on or­gan trans­plants. When do the bad effects of stay­ing in the mar­riage out­weigh the bad effects of split­ting up?

• Bob should not be­come a Chris­tian to be­come hap­pier for the same rea­son that he should not stay away from hos­pi­tals if he’s sick (af­ter all, sick peo­ple are a lot more likely to be in a hos­pi­tal).

• Cosma Shal­izi has a nice bibliog­ra­phy here

60 years of research

I would like to em­pha­size this part. It’s not just scat­tered pa­pers back then. Meehl wrote a book sur­vey­ing the field in 1955.

• Another ex­am­ple of this: the US poli­ti­cal mod­els did fan­tas­tic in pre­dict­ing all sorts of out­comes on elec­tion day 2012, far ex­ceed­ing all sorts of pun­dits or peo­ple ad­just­ing the num­bers based on gut feel­ings and as­sump­tions, de­spite of­ten be­ing pretty sim­ple or tan­ta­mount to poll av­er­ag­ing.

• Just felt like say­ing thank you to luke­prog and all those who com­mented; this has been a great help to me in de­cid­ing what to read about next re­gard­ing de­ter­mi­na­tion of guaran­teed val­ues for the ser­vice the de­part­ment I work in performs.

• Humans use more complex utility functions to evaluate something like martial happiness. If you train a statistical model on a straight numeric value for martial happiness, then the model only optimizes towards that specific aspect of happiness.

A good evaluation should test the model that was trained on hedonistic happiness ratings against something like the likelihood of divorce.

• I think you mean “mar­i­tal” here. (De­spite the similar­i­ties, love is not a bat­tlefield.)

• Okay, English isn’t my first lan­guage.

• English isn’t my first language

You could eas­ily have made the same typo even if it were; we’re talk­ing about the mere trans­po­si­tion of two ad­ja­cent let­ters.

(Another ex­am­ple: “ca­sual” vs. “causal”, which of­ten trips me up in read­ing.)

• (Another ex­am­ple: “ca­sual” vs. “causal”, which of­ten trips me up in read­ing.)

In Ital­ian that’s even worse, since causale does mean ‘causal’ but ca­suale means ‘ran­dom’.

• (Another ex­am­ple: “ca­sual” vs. “causal”, which of­ten trips me up in read­ing.)

Cool, that means you would get the joke about how “women are in­ter­ested in causal sex”!

• Is there acausal sex? (Would that be, like, hav­ing (phone/​cy­ber)sex with some­one in a differ­ent Teg­mark uni­verse via some form of com­mu­ni­ca­tion built on UDT acausal trade?)

• Acausal sex­ual re­pro­duc­tion is quite plau­si­ble, in a sense. Sup­pose you were a sin­gle woman liv­ing in a so­ciety with ac­cess to so­phis­ti­cated ge­netic en­g­ineer­ing, and you wanted to give birth to a child that was biolog­i­cally yours and not do any un­nat­u­ral op­ti­miz­ing. You could en­vi­sion your ideal mate in de­tail, re­verse-en­g­ineer the ge­net­ics of this man, and then cre­ate a sperm pop­u­la­tion that the man could have pro­duced had he ex­isted. I can eas­ily imag­ine a ge­netic en­g­ineer offer­ing this ser­vice: you walk into the office, de­scribe the man’s phys­i­cal at­tributes, per­son­al­ity, and even life his­tory, and the en­g­ineer does the rest as much as is pos­si­ble (in this so­ciety, we know that a plu­ral­ity of men who played short­stop in Lit­tle League have a cer­tain allele, etc.) The child could grow up and mean­ingfully learn things about the coun­ter­fac­tual father—if you learned that the father was prone to de­pres­sion, that would mean that you should watch out for that as well.

If the mother re­ally wants to, she can take things fur­ther and spec­ify that the man should be the kind of per­son who would have, had he ex­isted, gone through the analo­gous pro­ce­dure (with a sur­ro­gate or ar­tifi­cial womb), and that the coun­ter­fac­tual woman he would have speci­fied would have been her. In this case, we can say that the man and the woman have acausally re­pro­duced.

• Hmm. So the man has man­aged to “acausally re­pro­duce”, fulfill his util­ity func­tion, in spite of not ex­ist­ing. You could go fur­ther and posit an imag­i­nary cou­ple who would have cho­sen each other for the pro­ce­dure—so they suc­ceed in “acausally re­pro­duc­ing”, even though nei­ther of them ex­ists. Then when some­one tries to write a story about the imag­i­nary cou­ple, the child be­comes ob­serv­able to the writer and starts do­ing some re­pro­duc­ing of her own :-)

• My in­ter­pre­ta­tion of acausal sex­ual re­pro­duc­tion would be some­thing more like All You Zom­bies.

• What makes this acausal? That is, when are fu­ture in­puts mod­ify­ing pre­sent re­sults? Or are you us­ing a differ­ent defi­ni­tion of acausal?

• I meant it in the sense of ata’s par­ent com­ment, al­though there is a back­wards ar­row in there: the phe­no­type is de­ter­min­ing the geno­type rather than vice versa.

• That pa­per is ab­solutely brilli­ant! I kept laugh­ing ev­ery time a new clearly log­i­cally rea­soned yet hu­morous de­tail was ex­plored.

• Is there acausal sex? (Would that be, like, hav­ing (phone/​cy­ber)sex with some­one in a differ­ent Teg­mark uni­verse via some form of com­mu­ni­ca­tion built on UDT acausal trade?)

If you’re bas­ing the sex on acausal trade then you should per­haps re­fer to it as acausal pros­ti­tu­tion. Or pos­si­bly acausal mar­riage.

• Si­mu­late agent.

• Check if it tries to do the same for you.

• If it does, build it a body and have sex.

• In a galaxy far far away, an agent simu­lates you, sees you try to do the same for them.

• It clones you and has sex.

Does this fit the bill?

• It’s in­ter­est­ing to me that the proper lin­ear model ex­am­ple is es­sen­tially a stripped down ver­sion of a very sim­ple neu­ral net­work with a lin­ear ac­ti­va­tion func­tion.
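The resemblance can be spelled out in a few lines of Python. This is a sketch, not anyone’s actual model: the weights and cue values are arbitrary, and `proper_linear_model` and `neuron` are hypothetical names. A single artificial neuron with zero bias and the identity activation computes the same weighted sum as P = w1(c1) + w2(c2) + … + wn(cn).

```python
def proper_linear_model(weights, cues):
    """P = w1*c1 + w2*c2 + ... + wn*cn, as in the post."""
    return sum(w * c for w, c in zip(weights, cues))


def neuron(weights, inputs, bias=0.0, activation=lambda z: z):
    """One artificial neuron: activation(weighted sum + bias).

    The defaults (zero bias, identity activation) reduce it to the
    linear model above.
    """
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


weights = [0.5, -1.0, 2.0]  # arbitrary illustrative weights
cues = [1.0, 2.0, 3.0]      # arbitrary illustrative cue values

print(proper_linear_model(weights, cues))
print(neuron(weights, cues))
```

Swapping in a non-linear activation or stacking layers is what takes a network beyond the plain weighted sum.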

• Is that re­ally true? Couldn’t one say that of just about any Tur­ing-com­plete (or less) model of com­pu­ta­tion?

‘Oh, it’s in­ter­est­ing that they are re­ally just a sim­ple unary fixed-length lambda-calcu­lus func­tion with con­stant-value pa­ram­e­ters.’

‘Oh, it’s in­ter­est­ing that they are re­ally just re­stricted petri-nets with bounded branch­ing fac­tors.’

‘Oh, it’s in­ter­est­ing that these are mod­e­lable by finite au­tomata.’

etc. (Plau­si­ble-sound­ing gob­bledy­gook in­cluded to make the point.)

• Yes, sort of, but a) a lin­ear clas­sifier is not a Tur­ing-com­plete model of com­pu­ta­tion, and b) there is a clear re­sem­blance that can be seen by merely glanc­ing at the equa­tions.

• I would ar­gue that neu­rons, neu­ral nets, SPRs, and ev­ery­one else do­ing lin­ear re­gres­sion use those tech­niques be­cause it’s the sim­plest way to ag­gre­gate data.

• I’m skep­ti­cal, and will now pro­ceed to ques­tion some of the as­ser­tions made/​refer­ences cited. Note that I’m not trained in statis­tics.

Un­for­tu­nately, most of the ar­ti­cles cited are not eas­ily available. I would have liked to check the method­ol­ogy of a few more of them.

|For ex­am­ple, one SPR de­vel­oped in 1995 pre­dicts the price of ma­ture Bordeaux red wines at auc­tion bet­ter than ex­pert wine tasters do.

The paper doesn’t actually establish what you say it does. There is no statistical analysis of expert wine tasters, only one or two anecdotal statements of their fury at the whole idea. Instead, the SPR is compared to actual market prices, not to experts’ predictions. I think it’s fair to say that the claim I quoted overreaches.

Now, about the model and its fit to data. Note that the SPR is older than 1995, when the pa­per was pub­lished. The NYTimes ar­ti­cle about it which you refer­ence is from 1990 (the pa­per bizarrely dates it to 1995; I’m not sure what’s go­ing on there).

The fact that there’s a linear model (not specified precisely anywhere in the article) which is a good fit to wine prices for vintages of 1961-1972 (Table 3 in the paper) is not, I think, very significant on its own. To judge the model, we should look at what it predicts for upcoming years. Both the paper and the NYTimes article make two specific predictions. First, the 1986 vintage, claimed to be extolled by experts early on, will prove mediocre because of the weather conditions that year (see Figure 3 in the paper: 1986 is clearly the worst of the 80s). NYTimes says “When the dust settles, he predicts, it will be judged the worst vintage of the 1980’s, and no better than the unmemorable 1974’s or 1969’s”. The 1995 paper says, more modestly, “We should expect that, in due course, the prices of these wines will decline relative to the prices of most of the other vintages of the 1980s”. Second, the 1989-1990 vintage is predicted to be “outstanding” (paper), “stunningly good” (NYTimes), and, “adjusted for age, will outsell at a significant premium the great 1961 vintage” (NYTimes).

It’s now 16 years later. How do we test these pre­dic­tions?

First, I’ve stum­bled on a differ­ent pa­per from the pri­mary au­thor, Prof. Ashen­felter, from 2007. Pub­lished 12 years later than the one you refer­ence, this pa­per has sub­stan­tially the same con­tents, with whole pages copied ver­ba­tim from the ear­lier one. That, by it­self, wor­ries me. Even more wor­ry­ing is the fact that the 1986 pre­dic­tion, promi­nent in the 1990 ar­ti­cle and the 1995 pa­per, is com­pletely miss­ing from the 2007 pa­per (the data be­low might in­di­cate why). And most wor­ry­ing of all is the change of lan­guage re­gard­ing the 1989/​1990 pre­dic­tion. The 1995 pa­per says about its pre­dic­tion that the 1989/​1990 will turn out to be out­stand­ing, “Many wine writ­ers have made the same pre­dic­tions in the trade mag­a­z­ines”. The 2007 pa­per says “Iron­i­cally, many pro­fes­sional wine writ­ers did not con­cur with this pre­dic­tion at the time. In the years that have fol­lowed minds have been changed; and there is now vir­tu­ally unan­i­mous agree­ment that 1989 and 1990 are two of the out­stand­ing vin­tages of the last 50 years.”

Uhm. Right. Well, be­cause the claims aren’t strong enough, they do not ex­actly con­tra­dict each other, but this change leaves a bad taste. I don’t think I should give much trust to these pa­pers’ claims.

The data I could find quickly to test the pre­dic­tions is here. The prices are bro­ken down by the chateaux, by the vin­tage year, the pack­ag­ing (I’ve always cho­sen BT—bot­tle), and the auc­tion year (I’ve always cho­sen the last year available, typ­i­cally 2004). Un­for­tu­nately, Ashen­felter un­der­speci­fies how he came up with the ag­gre­gate prices for a given year—he says he chose a pack­age of the best 15 winer­ies, but doesn’t say which ones or how the prices are com­bined. I used 5 winer­ies that are speci­fied as the best in the 2007 pa­per, and looked up the prices for years 1981-1990. The data is in this spread­sheet. I haven’t tried to statis­ti­cally an­a­lyze it, but even from a quick glance, I think the fol­low­ing is clear. 1986 did not sta­bi­lize as the worst year of the 1980s. It is fre­quently sec­ond- or third-best of the decade. It is always much bet­ter than ei­ther 1984 or 1987, which are sup­posed to be vastly bet­ter ac­cord­ing to the 1995 pa­per’s weather data (see Figure 3). 1989/​1990 did turn out well, es­pe­cially 1990. Still, they’re both nearly always less ex­pen­sive than 1982, which is again vastly in­fe­rior in the weather data (it isn’t even in the best quar­ter). Over­all, I fail to see much cor­re­la­tion be­tween the weather data in the pa­per for the 1980s, the spe­cific claims about 1986 and 1989/​1990, and the mar­ket prices as of 2004. I wouldn’t recom­mend us­ing this SPR to pre­dict mar­ket prices.

Now, this was the first example in your post, and I found what I believe to be substantial problems with its methodology and the quality of its SPR. If I were to proceed and examine every example you cite in the same detail, would I encounter many such problems? It’s difficult to tell, but my prediction is “yes”. I anticipate overfitting and shoddy methodology. I anticipate a huge influence of selection bias: the authors that publish these kinds of papers will not publish a paper that says “The experts were better than our SPR”. And finally, I anticipate overreaching claims of wide-reaching applicability of the models, based on papers that actually indicate a modest effect in a very specific situation with a small sample size.

I’ve looked at your sec­ond ex­am­ple:

|Howard and Dawes (1976) found they can re­li­ably pre­dict mar­i­tal hap­piness with one of the sim­plest SPRs ever con­ceived, us­ing only two cues: P = [rate of love­mak­ing] - [rate of fight­ing].

I couldn’t find the original paper, but the results are summarised in Dawes (1979). Looking at it, it turns out that when you say “predict marital happiness”, it really means “predicts one of the partners’ subjective opinion of their marital happiness”, as opposed to e.g. stability of the marriage over time. There’s no indication as to how the partner to question was chosen from each pair (e.g. whether the experimenter knew the rate when they chose). There was very good correlation with binary outcome (happy/unhappy), but when a finer scale of 7 degrees of happiness was used, the correlation was weak: 0.4. In a follow-up experiment, the correlation went up to 0.8, but there the subject looked at the lovemaking/fighting statistics before opining on the degree of happiness, thus contaminating their decision. And even in the earlier experiment, the subject had been recording those lovemaking/fighting statistics in the first place, so it would make sense for them to recall those events when they’re asked to assess whether their marriage is a happy one. Overall, the model is witty and naively appears to be useful, but the suspect methodology and the relatively weak correlation encourage me to discount the analysis.
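For reference, the rule under discussion is small enough to write out in full. This is a sketch only: the function names and the sample rates are hypothetical, and treating P > 0 as “predicts happy” is my assumption about how the binary outcome would be read off, not something taken from the paper.

```python
def marital_spr(rate_lovemaking, rate_fighting):
    """P = [rate of lovemaking] - [rate of fighting]."""
    return rate_lovemaking - rate_fighting


def predicts_happy(rate_lovemaking, rate_fighting):
    # Assumed decision rule: a positive difference predicts "happy".
    return marital_spr(rate_lovemaking, rate_fighting) > 0


# Made-up monthly counts for two hypothetical couples.
print(marital_spr(12, 4), predicts_happy(12, 4))
print(marital_spr(3, 9), predicts_happy(3, 9))
```

Note that both inputs are self-recorded by the same person who later rates the marriage, which is the contamination worry raised above.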

Fi­nally, the fol­low­ing claim is the sin­gle most ob­jec­tion­able one in your post, to my taste:

|If you’re hiring, you’re prob­a­bly bet­ter off not do­ing in­ter­views.

My own experience strongly suggests to me that this claim is inane, and is highly dangerous advice. I’m not able to view the papers you base it on, but if they’re anything like the first and second example, they’re far, far away from convincing me of the truth of this claim, which I in any case strongly suspect to overreach gigantically over what the papers are proving. It may be true, for example, that a very large body of hiring decision-makers in a huge organisation or a state on average make poorer decisions based on their professional judgement during interviews than they would have made based purely on the resume. I can see how this claim might be true, because any such very large body must be largely incompetent. But it doesn’t follow that it’s good advice for you to abstain from interviewing; it would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference. My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).

• I was think­ing of writ­ing a post about Bishop & Trout when I didn’t see it men­tioned on this site be­fore, but I’m glad you beat me to it. (Among other things, I lent out my copy and so would have difficulty writ­ing up a re­view). It’s a great book.

• Your upload of Dawes’s “The Robust Beauty of Improper Linear Models in Decision Making” seems to be broken; at least, I’m not able to access it.

• at least, I’m not able to ac­cess it.

Nei­ther.

• Dang. Fixed.

• Wow. I highly recom­mend read­ing the Dawes pdf, it’s illu­mi­nat­ing:

Ex­pert doc­tors coded [vari­ables from] biop­sies of pa­tients with Hodgkin’s dis­ease and then made an over­all rat­ing of the sever­ity of the pro­cess. The over­all rat­ing did not pre­dict the sur­vival time of the 193 pa­tients, all of whom died. (The cor­re­la­tions of sur­vival time with rat­ings was vir­tu­ally 0, some in the wrong di­rec­tion). The vari­ables that the doc­tors coded, how­ever, did pre­dict sur­vival time when they were used in a mul­ti­ple re­gres­sion model.

In sum­mary, proper lin­ear mod­els work for a very sim­ple rea­son. Peo­ple are good at pick­ing out the right pre­dic­tor vari­ables … Peo­ple are bad at in­te­grat­ing in­for­ma­tion from di­verse and in­com­pa­rable sources. Proper lin­ear mod­els are good at such in­te­gra­tion …

He then goes on to show that im­proper lin­ear mod­els still beat hu­man judg­ment. If your re­ac­tion to the top-level post wasn’t en­dorse­ment of statis­ti­cal meth­ods for these prob­lems, this pdf is a bunch more ev­i­dence that you can use to up­date your be­liefs about statis­ti­cal meth­ods of pre­dic­tion.
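For anyone who hasn’t met the term: an “improper” model in this sense skips fitting the weights. A minimal sketch of one common variant, unit weighting, is below, assuming each cue is standardized and then simply added or subtracted according to a sign chosen in advance. The data and the helper names `standardize` and `unit_weight_scores` are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a cue to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def unit_weight_scores(cue_columns, signs):
    """Improper linear model: sum of standardized cues with weights of +1 or -1.

    `signs` encodes only the assumed direction of each cue; no weights
    are fitted to outcome data.
    """
    z_columns = [standardize(column) for column in cue_columns]
    n_cases = len(cue_columns[0])
    return [
        sum(sign * z[i] for sign, z in zip(signs, z_columns))
        for i in range(n_cases)
    ]


# Made-up data: four cases, two cues. The first cue is assumed to help,
# the second to hurt.
cues = [[1, 2, 3, 4], [4, 3, 2, 1]]
scores = unit_weight_scores(cues, signs=[+1, -1])
print(scores)
```

The human contribution is picking the cues and their directions, which is the part the quoted passage says people are good at; the adding-up is left to the formula.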

• Peo­ple are good at pick­ing out the right pre­dic­tor vari­ables … Peo­ple are bad at in­te­grat­ing in­for­ma­tion from di­verse and in­com­pa­rable sources.

That is a beau­tiful sum­mary sen­tence, in­ci­den­tally, and I am tak­ing it with me as a short­hand “han­dle” for this whole idea.

I find it works well as a sur­face-level counter for the (alas, still in­ap­pro­pri­ately com­pel­ling) idea that a dumb al­gorithm can’t get more ac­cu­rate re­sults than a smart ob­server.

• Another pos­si­ble metaphor is the pocket calcu­la­tor.

It can find a num­ber for any ex­pres­sion you can put into it, and in most cases it can do it way faster and more ac­cu­rately than a hu­man could. How­ever, that doesn’t make it a re­place­ment for a hu­man. An in­tel­li­gent agent like a hu­man is still needed for the cru­cial part of figur­ing out what ex­pres­sion would be mean­ingful to put into it.

• That is a very helpful metaphor for wrap­ping my head around both the ad­van­tages and limi­ta­tions of SPR, thank you! :)

• I can­not help un­leash­ing an evil laugh when­ever I dis­cover an­other tool to aid in world dom­i­na­tion. Thank you.

• Thinking about it, the main critiques I have of this article are:

• It only lists cases where an SPR ‘outperformed’ expertise. In most of these, the people loosely described as ‘experts’ never had any proper training (with exercises and testing) in the task in question.

• It equates better correlation with “outperforms”. These are not the same thing. The maximum correlation happens when you classify into those with less than average risk of recidivism and those with larger than average risk. A parole board is not even supposed to work like this, AFAIK.

• If some SPR can ‘outperform’ average HR expertise, it doesn’t mean the SPR outperforms the best expertise. An example where it matters: if you are a software start-up founder and your expertise is average, your start-up will almost inevitably fail. Only a small percentage succeed, the top 1% or less. You strive to maximize your chances of making it into the top 1%, not the top 50%.

• What about ethical issues? Race correlates with criminality, for example.

edit: not fully sure at the mo­ment when max­i­mum cor­re­la­tion hap­pens.