Anatoly_Vorobey comments on Statistical Prediction Rules Out-Perform Expert Human Judgments

Anatoly_Vorobey 18 Jan 2011 16:47 UTC
87 points
0
I’m skeptical, and will now proceed to question some of the assertions made/references cited. Note that I’m not trained in statistics.

Unfortunately, most of the articles cited are not easily available. I would have liked to check the methodology of a few more of them.

For example, one SPR developed in 1995 predicts the price of mature Bordeaux red wines at auction better than expert wine tasters do.

The paper doesn’t actually establish what you say it does. There is no statistical analysis of expert wine tasters, only one or two anecdotal statements of their fury at the whole idea. Instead, the SPR is compared to actual market prices, not to experts’ predictions. I think it’s fair to say that the claim I quoted overreaches.

Now, about the model and its fit to data. Note that the SPR is older than 1995, when the paper was published. The NYTimes article about it which you reference is from 1990 (the paper bizarrely dates it to 1995; I’m not sure what’s going on there).

The fact that there’s a linear model (not specified precisely anywhere in the article) which is a good fit to wine prices for the vintages of 1961-1972 (Table 3 in the paper) is not, I think, very significant on its own. To judge the model, we should look at what it predicts for upcoming years. Both the paper and the NYTimes article make two specific predictions. First, the 1986 vintage, claimed to be extolled by experts early on, will prove mediocre because of the weather conditions that year (see Figure 3 in the paper: 1986 is clearly the worst of the 1980s). NYTimes says “When the dust settles, he predicts, it will be judged the worst vintage of the 1980’s, and no better than the unmemorable 1974’s or 1969’s”. The 1995 paper says, more modestly, “We should expect that, in due course, the prices of these wines will decline relative to the prices of most of the other vintages of the 1980s”. Second, the 1989 and 1990 vintages are predicted to be “outstanding” (paper), “stunningly good” (NYTimes), and, “adjusted for age, will outsell at a significant premium the great 1961 vintage” (NYTimes).

It’s now 16 years later. How do we test these predictions?

First, I’ve stumbled on a different paper from the primary author, Prof. Ashenfelter, from 2007. Published 12 years later than the one you reference, this paper has substantially the same contents, with whole pages copied verbatim from the earlier one. That, by itself, worries me. Even more worrying is the fact that the 1986 prediction, prominent in the 1990 article and the 1995 paper, is completely missing from the 2007 paper (the data below might indicate why). And most worrying of all is the change of language regarding the 1989/1990 prediction. The 1995 paper says, about its prediction that the 1989 and 1990 vintages will turn out to be outstanding, “Many wine writers have made the same predictions in the trade magazines”. The 2007 paper says “Ironically, many professional wine writers did not concur with this prediction at the time. In the years that have followed minds have been changed; and there is now virtually unanimous agreement that 1989 and 1990 are two of the outstanding vintages of the last 50 years.”

Uhm. Right. Well, because the claims aren’t strong enough, they do not exactly contradict each other, but this change leaves a bad taste. I don’t think I should give much trust to these papers’ claims.

The data I could find quickly to test the predictions is here. The prices are broken down by the chateaux, by the vintage year, the packaging (I’ve always chosen BT—bottle), and the auction year (I’ve always chosen the last year available, typically 2004). Unfortunately, Ashenfelter underspecifies how he came up with the aggregate prices for a given year—he says he chose a package of the best 15 wineries, but doesn’t say which ones or how the prices are combined. I used 5 wineries that are specified as the best in the 2007 paper, and looked up the prices for years 1981-1990. The data is in this spreadsheet. I haven’t tried to statistically analyze it, but even from a quick glance, I think the following is clear. 1986 did not stabilize as the worst year of the 1980s. It is frequently second- or third-best of the decade. It is always much better than either 1984 or 1987, which are supposed to be vastly better according to the 1995 paper’s weather data (see Figure 3). 1989/1990 did turn out well, especially 1990. Still, they’re both nearly always less expensive than 1982, which is again vastly inferior in the weather data (it isn’t even in the best quarter). Overall, I fail to see much correlation between the weather data in the paper for the 1980s, the specific claims about 1986 and 1989/1990, and the market prices as of 2004. I wouldn’t recommend using this SPR to predict market prices.
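A quick glance could be backed up with a rank correlation between what the weather model implies and what the market paid. The following is only a sketch of that check, not an analysis of the actual spreadsheet: the numbers are placeholders for five unnamed vintages, and the function names `ranks` and `spearman_rho` are my own.

```python
def ranks(values):
    """Return ranks where 1 is the largest value. Ties are rejected so the
    simple Spearman formula below stays exact."""
    if len(set(values)) != len(values):
        raise ValueError("ties are not handled in this sketch")
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman rank correlation for untied data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# PLACEHOLDER values, not taken from the paper or from the price data:
# a model-implied quality index and an auction price for five vintages.
model_quality = [0.9, 0.2, 0.6, 0.4, 0.8]
auction_price = [310, 150, 90, 120, 400]

rho = spearman_rho(model_quality, auction_price)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would mean the model orders the vintages the way the market does; a value near 0 would match the impression of little correlation. With only ten vintages in a decade, any such number would be noisy.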

Now, this was the first example in your post, and I found what I believe to be substantial problems with its methodology and the quality of its SPR. If I were to proceed and examine every example you cite in the same detail, would I encounter many such problems? It’s difficult to tell, but my prediction is “yes”. I anticipate overfitting and shoddy methodology. I anticipate a huge influence of selection bias: authors who publish these kinds of papers will not publish a paper that says “The experts were better than our SPR”. And finally, I anticipate overreaching claims of wide-reaching applicability of the models, based on papers that actually indicate a modest effect in a very specific situation with a small sample size.

I’ve looked at your second example:

Howard and Dawes (1976) found they can reliably predict marital happiness with one of the simplest SPRs ever conceived, using only two cues: P = [rate of lovemaking] - [rate of fighting].

I couldn’t find the original paper, but the results are summarised in Dawes (1979). Looking at it, it turns out that when you say “predict marital happiness”, it really means “predict one partner’s subjective opinion of their marital happiness”, as opposed to e.g. stability of the marriage over time. There’s no indication as to how the partner to question was chosen from each pair (e.g. whether the experimenter knew the rate when they chose). There was very good correlation with the binary outcome (happy/unhappy), but when a finer scale of 7 degrees of happiness was used, the correlation was weak: a coefficient of 0.4. In a follow-up experiment, the correlation went up to 0.8, but there the subject looked at the lovemaking/fighting statistics before opining on the degree of happiness, thus contaminating their decision. And even in the earlier experiment, the subject had been recording those lovemaking/fighting statistics in the first place, so it would make sense for them to recall those events when they’re asked to assess whether their marriage is a happy one. Overall, the model is witty and naively appears to be useful, but the suspect methodology and the relatively weak correlation encourage me to discount the analysis.
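To make the rule and the correlation figures concrete, here is a minimal sketch of the two-cue score and of a Pearson correlation against a 7-point self-rating. The couple data are invented placeholders, not figures from Howard and Dawes or from Dawes (1979), and the names `spr_score` and `pearson_r` are mine.

```python
from math import sqrt


def spr_score(lovemaking_rate, fighting_rate):
    """Two-cue rule quoted above: P = [rate of lovemaking] - [rate of fighting]."""
    return lovemaking_rate - fighting_rate


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# INVENTED placeholder data, for illustration only:
# (lovemaking per month, fights per month, self-rated happiness on a 1..7 scale)
couples = [(12, 3, 6), (4, 9, 2), (8, 8, 5), (10, 1, 5), (2, 6, 3), (6, 2, 4)]

scores = [spr_score(love, fight) for love, fight, _ in couples]
ratings = [rating for _, _, rating in couples]

r = pearson_r(scores, ratings)
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")
```

For scale: a correlation of 0.4 corresponds to r² = 0.16, so the score would account for about 16% of the variance in the ratings, against 64% at r = 0.8.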

Finally, the following claim is the single most objectionable one in your post, to my taste:

If you’re hiring, you’re probably better off not doing interviews.

My own experience strongly suggests to me that this claim is inane, and that it is highly dangerous advice. I’m not able to view the papers you base it on, but if they’re anything like the first and second example, they’re far, far away from convincing me of the truth of this claim, which I in any case strongly suspect to overreach gigantically over what the papers are proving. It may be true, for example, that a very large body of hiring decision-makers in a huge organisation or a state on average make poorer decisions based on their professional judgement during interviews than they would have made based purely on the resume. I can see how this claim might be true, because any such very large body must be largely incompetent. But it doesn’t follow that it’s good advice for you to abstain from interviewing. That would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference. My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).
- bentarm 19 Jan 2011 0:41 UTC
  20 points
  0
  Parent
  
  If you’re hiring, you’re probably better off not doing interviews.
  
  My own experience strongly suggests to me that this claim is inane—and is highly dangerous advice… My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).
  
  The whole point of this article is that experts often think themselves better than SPRs when actually they perform no better than SPRs on average. Here we have an expert telling us that he thinks he would perform better than an SPR. Why should we be interested?
  - Anatoly_Vorobey 19 Jan 2011 11:18 UTC
    27 points
    0
    Parent
    Because I didn’t just state a blanket opinion. I dug into the studies, looked for data to test one of them in depth, and found it to be highly flawed. I called into question the methodology employed by the studies, as well as the overgeneralized and overreaching conclusions they’re drummed up to support. The evidence that at least some studies are flawed and the methodology is shoddy should make you question the universal claim “… actually they perform no better than SPRs on average”. That’s why you should be interested.
    
    My personal experience with interviewing is certainly not as important a piece of evidence against the article as the specific criticisms of the studies. It’s just another anecdotal data point. That’s why I didn’t expand on it as much as I did on the wine study, although I do believe it can be made more convincing through further elucidation.
- shokwave 18 Jan 2011 17:02 UTC
  12 points
  0
  Parent
  
  My own experience strongly suggests to me that this claim is inane … it would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference.
  
  What evidence do you have that you are better than average?
  
  My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial
  
  “It is difficult to get a man to understand something, when his salary depends upon his not understanding it!”
  - Desrtopa 18 Jan 2011 17:26 UTC
    4 points
    0
    Parent
    I have heard of one job interview that I felt constituted a useful tool that could not effectively be replaced by resume examination and statistical analysis. A friend of mine got a job working for a company that provides mathematical modeling services for other companies, and his “interview” was a several-hour test in which he created a number of mathematical models and then explained to the examiner, in layman’s terms, how and why the models worked.
    
    Most job interviews are really not a demonstration of job skills and aptitude, and it’s possible to simply bullshit your way through them. On the other hand, if you have a simple and direct way to test the competence of your applicants, then by all means use it.
    - datadataeverywhere 19 Jan 2011 1:54 UTC
      12 points
      0
      Parent
      I’m most familiar with interviews for programming jobs, where an interview that doesn’t ask the candidate to demonstrate job-specific skills, knowledge and aptitude is nearly worthless. These jobs are also startlingly prone to resume distortion that can make vastly different candidates look similar, especially recent graduates.
      
      Asking for coding samples and calling previous employers, especially if coupled with a request for code solving a new (requested) problem, could potentially replace interviews. However, judging the quality of code still requires a person, so that doesn’t seem to really change things to me.
      - sketerpot 19 Jan 2011 22:34 UTC
        1 point
        0
        Parent
        That’s what I think of, too, when I hear the phrase “job interview”. Is this not typical outside fields like programming?
        retiredurologist 19 Jan 2011 23:17 UTC
        13 points
        0
        Parent
        I can confirm that such a “job interview” is not common in medicine. The potential employer generally relies on the credentialing process of the medical establishment. Most physicians, upon completing their training, pass a test demonstrating their ability to regurgitate the teachers’ passwords, and are recommended to the appropriate certification board as “qualified” by their program director; to do otherwise would reflect badly on the program. Also, program directors are loath to remove a resident/fellow during advanced training because some warm body must show up to do the work, or the professor himself/herself might have to fill in. It is difficult to find replacements for upper level residents; the only common reason such would be available is dismissal/transfer from another program. Consequently, the USA turns out physicians of widely varied skill levels, even though their credentials are similar. In surgical specialities, it is not unusual for a particularly bright individual with all the passwords but very poor technical skills to become a surgical professor.
        Desrtopa 19 Jan 2011 23:22 UTC
        4 points
        0
        Parent
        My mother has told me an anecdote about a family friend who was a surgeon who had a former student call him while conducting an operation because he couldn’t remember what to do.
        wedrifid 19 Jan 2011 23:55 UTC
        39 points
        0
        Parent
        
        My mother has told me an anecdote about a family friend who was a surgeon who had a former student call him while conducting an operation because he couldn’t remember what to do.
        
        The (rumored) student has my respect. I would expect most surgeons to have too much of an ego to admit to that doubt rather than stumble ahead full of hubris. It would be comforting to know that your surgeon acted as if (as opposed to merely believing that) he cared more about the patient than the immediate perception of status loss. (I wouldn’t care whether that just meant his thought-out anticipation of future status loss for a failed operation overrode his immediate social instincts.)
    - knb 19 Jan 2011 2:32 UTC
      10 points
      0
      Parent
      That isn’t an interview, it’s a test. Tests are extremely useful. IQ tests are an excellent predictor of job performance, maybe the best one available. Regardless, IQ tests are usually de facto illegal in the US due to disparate impact.
      - Desrtopa 19 Jan 2011 6:24 UTC
        4 points
        0
        Parent
        I put interview in quotes because they called it an interview. Speaking broadly enough, all interviews are tests, but most are unstructured and not very good at examining the relevant predictor variables. Not all tests are interviews, of course, but the part where they had applicants explain their processes in layman’s terms might qualify it, at least if you’re generous with your definitions.
        
        Of course, it’s certainly unclear if not outright incorrect to call it an interview, but that was their choice; possibly they felt that subjecting applicants to a “test” rather than an “interview” projected a less positive image.
  - Dr_Manhattan 18 Jan 2011 21:21 UTC
    2 points
    0
    Parent
    
    “It is difficult to get a man to understand something, when his salary depends upon his not understanding it!”
    
    I don’t think it’s fair, as his job is not being an interviewer, but perhaps hiring smart people we can benefit from.
- lukeprog 18 Jan 2011 19:14 UTC
  9 points
  0
  Parent
  Cool, I’ll look into these points.
  
  I made one small change so far. The above article now reads: “Reaction from the wine-tasting industry to such wine-predicting SPRs has been ‘somewhere between violent and hysterical.’”
  
  Also, I’ll post links to the specific papers when I have time to visit UCLA and grab them.
  
  Psychology is not my field, but my understanding is that the ‘interview effect’ for unstructured interviews is a very robust finding across many decades. For more, you can listen to my interview with Michael Bishop. But hey, maybe he’s wrong!
  
  Update 1: If I read the 1995 study correctly, they judged the accuracy of wine tasters by comparing the price of immature wines to those of mature wines, but I’m not sure. The way I phrased that is from Bishop & Trout, and that is how Bishop recalls it, though it’s been several years now since he co-wrote Epistemology and the Psychology of Human Judgment.
- CronoDAS 19 Jan 2011 7:41 UTC
  2 points
  0
  Parent
  Regarding hiring, I think the keyword might be “unstructured”—what makes an interview an “unstructured” interview?
  - Anatoly_Vorobey 19 Jan 2011 8:26 UTC
    14 points
    0
    Parent
    That’s what I thought too. The definitions I found searching all say that any interview where you decide what to ask and how to interpret the results is “unstructured”. The only “structured” interviews seem to be tests with pre-determined sets of questions, and the candidate’s answers judged by formal criteria.
    
    I’m not sure this division of the “interview-space” is all that useful. I would distinguish three categories:
    
    1. You have an informal chat with me about the nature of the job, my experience, my previous employment, my claims about my aptitude, etc. Your impressions from this chat determine your judgement of my suitability for the job.
    2. You ask me to answer questions or perform tasks that demonstrate my aptitude. It’s up to you to choose the tasks, interpret my performance, and guide the whole process.
    3. You give me a pre-determined set of questions/tasks that is the same for all candidates. My answers are mechanically interpreted by whether they coincide with the pre-determined set of correct answers.
    
    If I interpret the definitions I could find correctly, 3 is a “structured” interview, and both 1 and 2 are “unstructured”. To my mind, there’s a world of difference between 1 and 2, however. 1 is of very limited utility (I want to say “next to worthless”, but that’d be too presumptuous), and, quite possibly, does no better than deciding on the basis of the resume alone, though I’d still want to see the data to be convinced. 2, when performed by a trained and calibrated interviewer, is (again, in my own experience) obviously superior both to 1 and to deciding on the basis of the resume alone. Maybe this is somehow unique to the profession I interview for, but I doubt it.
    
    Suppose there’s research which demonstrates that in some setting type 1 interviews are worse than using the resume alone. I don’t know whether this is the case in the papers cited in this post (I couldn’t read them), but I find it plausible. Suppose then that the conclusions drawn are the universal statements “unstructured interviews reliably degrade the decisions of gatekeepers” and “if you’re hiring, you’re probably better off not doing interviews”. I consider such conclusions then to be obviously unsubstantiated, incredibly overreached, and highly dangerous advice.
- XiXiDu 18 Jan 2011 18:38 UTC
  1 point
  0
  Parent
  The interview example makes sense to me if the usual hiring manager is strongly biased by information that is not crucial. A dossier gives only a little information, but it is the important information. In a face-to-face interview various other factors can play a role (often unconsciously), e.g. smell or the ability to return a look.
  - XiXiDu 19 Jan 2011 13:11 UTC
    −2 points
    0
    Parent
    More here. Surely that isn’t strong evidence, but it is another indication that if you are not an LW-type person, then information that is not crucial might alter your perception and subsequent decision when doing face-to-face interviews versus dossier-based ruling.
- shokwave 18 Jan 2011 17:03 UTC
  0 points
  0
  Parent
  Read the Dawes pdf linked in the top post. I can’t speak for the other examples, but that one is solid.
  
  edit: my apologies, re-reading I see you discussed the marriage example. What is your opinion on the graduate rating and Hodgkin’s disease examples?
  - Perplexed 19 Jan 2011 3:53 UTC
    9 points
    0
    Parent
    that one is solid
    
    Why do you say that? My reaction to that paper was very negative. In large part, it was the anecdotal flavor of the arguments made there, but also because I didn’t see the two things I was specifically looking for:
    
    1. Citations of studies in which a linear model was constructed using one set of data, and then compared as to performance against the experts using a different set of data.
    2. Failing that, some numbers that would convince me that the failure to test models using different data than was used to construct them just doesn’t matter.
    
    Instead, here and in the 1996 study by Grove & Meehl, I find arguments from incredulity—in effect: “Do our critics really think that this really matters? Don’t be absurd!”. I also notice that this ideology is being promoted by a small number of researchers who repeatedly cite each other’s work, and do not cite critics (except as strawmen).
  - DanielVarga 19 Jan 2011 21:42 UTC
    7 points
    0
    Parent
    Like Perplexed, I hated this paper. Of course, it has the very good excuse that it is from 1979. But in 2011, it is sort of expected that you evaluate your model on a second, independent dataset. (My models often crash and burn at this stage.) Did any of these studies do this?
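
The check asked for in the last two comments, building the model on one set of data and scoring it on another, can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (generated from a made-up line plus noise, not from any of the cited studies), the model has a single predictor, and the helper names `fit_line` and `r_squared` are mine.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are constant; slope is undefined")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """1 - SS_res / SS_tot for the given line on the given data.
    On held-out data this can be negative if the line fits worse than the mean."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# SYNTHETIC data: a made-up linear relation with Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 3.0) for x in xs]

# Build the model on the first 12 points only, as with a short run of vintages.
train_x, train_y = xs[:12], ys[:12]
test_x, test_y = xs[12:], ys[12:]

slope, intercept = fit_line(train_x, train_y)
print("in-sample R^2:    ", round(r_squared(train_x, train_y, slope, intercept), 3))
print("out-of-sample R^2:", round(r_squared(test_x, test_y, slope, intercept), 3))
```

The in-sample figure is the kind of number a "good fit to 1961-1972 prices" reports; the out-of-sample figure is the one that speaks to whether the rule predicts anything.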