Shallow Review of Consistency in Statement Evaluation


Most existing forecasting or evaluation platform questions are very clearly verifiable:

  • “Who will win the next election?”

  • “How many cars will Tesla sell in 2030?”

  • “How many jelly beans are in this jar?”

But many of the questions we care about do not look like this. They might…

  • Be severely underspecified, e.g. “How much should we charge this customer for this vague feature request?”

  • Involve value judgements, e.g. “What is the optimum prison sentence for this convict?”, “How much does this plaintiff deserve for pain and suffering?”

  • Not have a clear stopping point, e.g. “What is the relative effectiveness of AI safety research vs. bio risk research?”

  • Require multiple steps instead of a yes/no or numerical answer, e.g. “What treatment is appropriate for this patient with precancerous cells?”

  • Not have good referents, e.g. “What is the market size for this completely new tech?”

An entity who could answer these questions well would be a very valuable asset. But what does “well” even mean here? We want people to be accurate, of course, but in many cases we also need their predictions/evaluations to be consistent to be actionable. This is especially true when fairness norms are in play, such as in pricing[1] and prison sentencing.

There is a lot of research showing that people make inconsistent evaluations (both with each other and with themselves across time) across a wide variety of fields, even fields that more closely resemble the “easy” questions above (valuing stocks, appraising real estate, sentencing criminals, evaluating job performance, auditing financial statements)[2]. It is even more difficult to consistently evaluate or predict novel questions or low-frequency events, like “Will India use a nuclear weapon on Pakistan by 1/1/20?” or “How much counterfactual value has this organization created?”.

This paper is a shallow review of the literature on how to get entities to make consistent judgements. I want to note up front that a major limitation of this write-up, and of shallow reviews in general, is that I mostly relied on authors’ descriptions of their work and conclusions, rather than verifying their experimental design and conclusions for myself or looking up others’ opinions of the papers. As such, this post should be taken as a description of the state of the literature, not the state of the world.

Speaking of underspecified questions, “how to get consistent answers to complicated questions?” sure is one. I started this research project with a vague sense of an area from Ozzie Gooen; as I iterated, we came up with more specific questions. The following is a list of questions or hooks that came up as we discussed the research:

  1. Overall, what literature is available to answer the question “how to get people to answer messy questions consistently?”

  2. What are the costs of consistency?

  3. How often are evaluations/sentences simply misunderstood by people? Is there a decent science behind understanding and expecting levels of misunderstanding in different cases?

  4. How often are the evaluations doing the wrong things? What are the failure modes? For instance, one failure mode is that they have misjudged the value of some intermediate variables. Maybe that’s all there is?

  5. In what domains are subjective measures likely to be informative, especially about things other than subjective states? (For instance, the subjective measure “I think this work was done at an 8/10” is very different from “I’m feeling an 8/10 now”: both require an intuitive judgement, but in the latter case the intuitive judgement *is* the measure.)

  6. What are the main problems that come up for nonprofit evaluations? Have they found any methods that would be useful to us?

  7. How difficult is it to come up with these composite indexes/linear models? What should we know when attempting them?

  8. Can we have any clever models where evaluators are really just predicting what other evaluators would say?

  9. What are good areas for follow-ups?

Some of these questions were answered in more detail than others, and some were not answerable at all in the time available. Here is what I found.

Methods to Improve Consistency in Evaluations

  • Hold Keynesian beauty contests, in which the goal is to guess what other people will guess, not what you think is true.

    • A single study suggested this improves recall and precision.

    • “What do other people think?” is also a well-known trick for getting people to be honest about opinions over which they expect to receive censure.

  • Use groups instead of individuals (Zhitomirsky-Geffet, Bar-Ilan, and Mark Levene).

    • Configuring groups such that each group has the same variety of expertise allows you to use some non-common knowledge in your estimates (personal guess).

    • For procedures with many iterations (e.g., image labeling), combine multiple predictors with a mathematical model that incorporates varying skill, expertise, and task difficulty level (Welinder et al., Bachrach et al.).

  • Remove extraneous information. Individuals’ estimates are widely affected by extraneous information, even when they themselves view it as extraneous (Grimstad and Jørgensen, Stewart). In the real world this may require a lengthy process of determining what information is extraneous.

  • Force participants to write up models of their thinking (using variables for unknowns), and then evaluate the variables separately (Kahneman, Lovallo, and Sibony).

    • Kahneman suggests 5-6 variables, and absolutely no more than 8 (Knowledge@Wharton).

    • To preserve independence, have individuals write up their models before sharing with the group and coming to consensus.

    • See “Creating Composite Models” below.

  • Let participants know you’ll be asking about their reasoning afterwards (Kahneman, Lovallo, and Sibony).

  • Create reference guides that forecasters can refer to while making an estimate (e.g. “this is what Level 4 teaching looks like, this is what Level 5 teaching looks like”). Better, after they’ve made their estimate, show them the nearest reference and ask how they compare (Penny, Johnson, and Gordon).

    • In the case of novel questions, I speculate that it would be useful to make an imaginary reference chart (“this is what a country that’s 20% likely to launch a nuclear missile in the next year would look like…”).

  • Some evaluations can be broken down into sub-evaluations, in which people tend to agree on the first step but disagree on the second. E.g., they’ll agree on the ordering of the severity of personal injury cases, but translate the severity into wildly different dollar amounts (Sunstein, Kahneman, and Schkade). Or doctors will agree on the severity of a case but not on the patient’s future outcome (Dwyer et al.).

  • Training and retraining. With e.g. educational assessment, this means giving people reference evaluations and then having them practice on a second set of evaluations until they get the right result (Wikipedia, Polin et al.). Even after this was done, evaluators benefited from periodic retraining (Polin et al.).
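
To make the “combine multiple predictors” idea concrete, here is a toy sketch in the spirit of the Welinder et al. / Bachrach et al. models, much simplified: weight each labeler’s vote by how often they agree with the current consensus, then re-vote. The labels and labeler names are entirely hypothetical, and real models estimate skill and task difficulty jointly rather than with a single reweighting pass.

```python
# Hypothetical binary labels: labels[labeler][item].
labels = {
    "a": [1, 1, 0, 1, 0],
    "b": [1, 0, 0, 1, 0],
    "c": [0, 1, 1, 0, 1],  # a labeler who mostly disagrees with the others
}
n_items = 5

def majority(weights):
    """Weighted vote per item: positive score means consensus label 1."""
    out = []
    for i in range(n_items):
        score = sum(weights[r] * (1 if labels[r][i] == 1 else -1) for r in labels)
        out.append(1 if score > 0 else 0)
    return out

# Start with equal weights, then do one reweighting pass: each labeler's
# new weight is their agreement rate with the provisional consensus.
weights = {r: 1.0 for r in labels}
consensus = majority(weights)
weights = {r: sum(labels[r][i] == consensus[i] for i in range(n_items)) / n_items
           for r in labels}
consensus = majority(weights)  # unreliable labelers now count for less
```

One pass is enough here; the published models iterate this kind of update (or fit it with EM) until the consensus and the skill estimates stop changing.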

Creating Composite Models

One idea that came up repeatedly in the business literature was forcing predictors to build (potentially very crude) mathematical models.

Kahneman recommends the following procedure, which he calls creating a “reasoned rule” (summary from Jason Collins):

  1. Select six to eight variables that are distinct and obviously related to the predicted outcome. Assets and revenues (weighted positively) and liabilities (weighted negatively) would surely be included, along with a few other features of loan applications.

  2. Take the data from your set of cases (all the loan applications from the past year) and compute the mean and standard deviation of each variable in that set.

  3. For every case in the set, compute a “standard score” for each variable: the difference between the value in the case and the mean of the whole set, divided by the standard deviation. With standard scores, all variables are expressed on the same scale and can be compared and averaged.

  4. Compute a “summary score” for each case: the average of its variables’ standard scores. This is the output of the reasoned rule. The same formula will be used for new cases, using the mean and standard deviation of the original set and updating periodically.

  5. Order the cases in the set from high to low summary scores, and determine the appropriate actions for different ranges of scores. With loan applications, for instance, the actions might be “the top 10% of applicants will receive a discount” and “the bottom 30% will be turned down.”
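
The steps above can be sketched in a few lines of code. The variable names and numbers here are hypothetical loan applications, not data from Kahneman or Collins:

```python
from statistics import mean, stdev

# Step 1: a handful of distinct, relevant variables per case (toy data).
cases = [
    {"assets": 120, "revenue": 40, "liabilities": 30},
    {"assets": 80,  "revenue": 55, "liabilities": 60},
    {"assets": 200, "revenue": 20, "liabilities": 10},
    {"assets": 50,  "revenue": 70, "liabilities": 90},
]
# Liabilities count against the applicant, so they get a negative weight.
signs = {"assets": +1, "revenue": +1, "liabilities": -1}

# Step 2: per-variable mean and standard deviation over the case set.
stats = {v: (mean(c[v] for c in cases), stdev(c[v] for c in cases))
         for v in signs}

def summary_score(case):
    """Steps 3-4: average of the signed standard scores of each variable."""
    zs = [signs[v] * (case[v] - m) / s for v, (m, s) in stats.items()]
    return sum(zs) / len(zs)

# Step 5: rank cases from high to low summary score, then attach actions
# (e.g. approve the top decile) to ranges of the score.
ranked = sorted(cases, key=summary_score, reverse=True)
```

The equal weighting is deliberate: part of Kahneman’s point is that a crude unit-weighted model is often hard to beat, and it keeps the rule easy to explain.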

Richard H. Moss recommends a similar procedure in his paper on estimating climate change:

  1. For each of the major findings you expect to be developed in your chapter, identify the most important factors and uncertainties that are likely to affect the conclusions. Also specify which important factors/variables are being treated exogenously or fixed, as it will almost always be the case that some important components will be treated in this way when addressing the complex phenomena examined in the TAR.

  2. Document ranges and distributions in the literature, including sources of information on the key causes of uncertainty. Note that it is important to consider the types of evidence available to support a finding (e.g., distinguish findings that are well established through observations and tested theory from those that are not so established).

  3. Given the nature of the uncertainties and state of science, make an initial determination of the appropriate level of precision: is the state of science such that only qualitative estimates are possible, or is quantification possible, and if so, to how many significant digits? As the assessment proceeds, recalibrate the level of precision in response to your assessment of new information.

  4. Quantitatively or qualitatively characterize the distribution of values that a parameter, variable, or outcome may take. First identify the end points of the range that the writing team establishes, and/or any high consequence, low probability outcomes or “outliers.” Particular care needs to be taken to specify what portion of the range is included in the estimate (e.g., this is a 90% confidence interval) and what the range is based on. Then provide an assessment of the general shape (e.g., uniform, bell, bimodal, skewed, symmetric) of the distribution. Finally, provide your assessment of the central tendency of the distribution (if appropriate).

  5. Using the terms described below, rate and describe the state of scientific information on which the conclusions and/or estimates (i.e. from step 4) are based.

  6. Prepare a “traceable account” of how the estimates were constructed that describes the writing team’s reasons for adopting a particular probability distribution, including important lines of evidence used, standards of evidence applied, approaches to combining/reconciling multiple lines of evidence, explicit explanations of methods for aggregation, and critical uncertainties.

  7. OPTIONAL: Use formal probabilistic frameworks for assessing expert judgment (i.e. decision analytic techniques), as appropriate for each writing team.

Costs of Consistency

It is trivial to get 100% consistency: just have everyone guess 0 every time. If you’re feeling fancy, they could guess the base rate. Obviously this would be pointless, because you would learn nothing.

If two individuals are to come up with the same answer to a problem, they can only use information both of them have. This should on average damage the accuracy of the work (if it doesn’t, you have bigger problems). This can be okay in certain circumstances: the penal system sometimes values predictability over getting exactly the right answer, and customers get irate if quoted widely varying prices. But often it is not okay, and false precision is harmful.
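
The trade-off shows up even in a toy example (all numbers hypothetical): a maximally consistent strategy, everyone reporting the base rate, produces zero disagreement but carries no information, while forecasters using private information disagree with each other yet score better.

```python
import statistics

outcomes = [1, 0, 1, 1]  # it rained on days 1, 3, and 4
base_rate = 0.75

# Perfectly consistent strategy: all three forecasters always say the base rate.
consistent = [[base_rate] * 4 for _ in range(3)]
# Informative but inconsistent: each forecaster uses their own information.
informative = [[0.90, 0.20, 0.80, 0.70],
               [0.70, 0.10, 0.90, 0.90],
               [0.95, 0.30, 0.60, 0.80]]

def disagreement(forecasts):
    """Mean across days of the standard deviation between forecasters."""
    return statistics.mean(statistics.stdev(day) for day in zip(*forecasts))

def brier(forecasts):
    """Mean squared error of all forecasts against the outcomes (lower is better)."""
    return statistics.mean((p - o) ** 2
                           for fc in forecasts for p, o in zip(fc, outcomes))
```

Here `disagreement(consistent)` is exactly zero while the informative forecasters achieve the lower Brier score, which is the bias-variance flavor of the point above.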

Measures of Noise in Answers

There’s a robust field of inter-rater reliability statistics, of which The Handbook of Inter-Rater Reliability appears to be the best single source. Due to time constraints and the density of the subject, I did not follow up on this further.
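
As a taste of what those statistics look like, here is one of the simplest, Cohen’s kappa for two raters making categorical judgements. The ratings are made up for illustration; the Handbook covers many more sophisticated measures (weighted kappa, Gwet’s AC1, and so on).

```python
def cohens_kappa(r1, r2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(r1)
    categories = set(r1) | set(r2)
    # Observed agreement: fraction of items the raters label identically.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Two raters who agree on 9 of 10 binary judgements:
rater1 = ["yes"] * 6 + ["no"] * 4
rater2 = ["yes"] * 5 + ["no"] * 5
kappa = cohens_kappa(rater1, rater2)  # 0.8: strong but imperfect agreement
```

The chance-correction is the whole point: 90% raw agreement sounds impressive, but if both raters said “yes” almost every time, much of that agreement would be luck, and kappa would be correspondingly lower.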

Measures of Ambiguity in Questions

I found no data on ambiguity in predictions or statement evaluation. The richest source of related data was on ambiguity in product requirement specifications. There are several systems for measuring ambiguity in natural language, the most prominent of which is LOLITA; other systems exist as well.

I found no data on the costs that ambiguous requirements exact, or on how much of this cost could be avoided with NLP systems. These systems had major types of ambiguity they could not detect, and were not a substitute for human evaluation.

Subjective Judgements

I found very mixed results on whether subjective judgements could replace objective composite measurements, and no obvious trends in which areas were robust to subjective predictions: negative, positive, negative, positive, negative.

Papers tended to assume the objective measurements were more accurate, without considering how they could be tampered with. E.g., in this study of the Denver police, crime rates were not found to be heavily correlated with resident satisfaction. The paper seemed to treat this as a deficit in the residents’ understanding, as opposed to the police department interfering with crime statistics. So perhaps one area where subjective measurements are preferable is where nominally objective measurements are controlled by the institution being measured.

Limitations of This Paper and Future Work

  • Due to time constraints, I had to take papers’ word for their findings. I did not have time to look for replicability or statistical errors, and could only do quick checks of methodology. A future deep dive into any subject covered should include a more skeptical reading of my sources.

  • Most work done on inter-rater reliability is in fields like medicine, teacher evaluations, and image labeling. These involve estimating fairly known things with lots of reference instances. This is a fundamentally different kind of problem than predicting novel, low-probability events; among other differences, it’s harder to generate reference charts and training data.

  • There are many, many studies on inter-rater reliability in narrow fields. Sometimes they contain suggestions for mitigations; usually they do not. Additionally, an overwhelming majority of these studies are on cancer-diagnosis-type problems, not low-frequency-global-event-type problems. I read a few of these and got some value out of them (mostly mitigation techniques, such as asking why someone believed something), but hit diminishing returns after a few papers. A more thorough reading of the “humans are unreliable” genre would probably find more mitigations.

  • There are also many, many studies on using multiple human labelers to do image labeling or NLP tasks, often using mathematical models. I did not have time to dig into the actual models and took the papers’ word for their power. This paper on bootstrapping from zero to known question answers, question difficulty, and IQ assessment of participants looks especially interesting.

Edit 9/16: This review paper, found by DanielFilan, looks even better.

  • A more thorough understanding of the statistics would be useful, perhaps starting with The Handbook of Inter-Rater Reliability or http://inter-rater-reliabil….

  • How do we get the best work out of groups working together? This is a social psychology research project in its own right.

  • There is a lot of information about how to make crowds more accurate, but not about how to make them more consistent.

  • Investigate the bias-variance tradeoff more, especially for human decision making.

  • Books that would be relevant to the questions:

    • Protocol Analysis includes sections on coding verbal reports reliably.

    • Daniel Kahneman is writing a tantalizingly relevant book (Noise) that will not be available for at least a year, possibly more.

    • Emerging Trends in the Development and Application of Composite Indicators

    • Superforecasting

    • The Power of Mathematical Thinking

    • The Power of Intuition (least sure about that one)

    • WISER: Getting Beyond Groupthink to Make Groups Smarter

    • Psychology of Intelligence Analysis

Edit 9/16: Raemon describes this as “Thinking Fast and Slow” for CIA agents.

    • Collective Wisdom: Principles and Mechanisms

    • Dialogue Mapping: Building Shared Understanding of Wicked Problems

    • How to Measure Anything

    • Uncertain Judgements: Eliciting Experts’ Probabilities

Edit 9/16: on skimming, Ruby did not find anything specifically related to consistency.

    • Cambridge Handbook on Expertise

This report was funded by a forecasting infrastructure project managed by Ozzie Gooen, which is itself funded by a grant from the Effective Altruism Long Term Future Fund.

My raw notes are available here.

[1] While companies are typically trying to maximize profits, customers are often extremely sensitive to perceived injustices in pricing, and inconsistencies are perceived as injustices.

[2] List courtesy https://…/2016/10/noise.

9/16/2019: Made various updates based on other people’s research, seen in the comments of this post, related questions, and privately shared write-ups. Thanks to everyone for coming out.