Consider Representative Data Sets

In this article, I consider the standard biases in drawing factual conclusions that are not related to emotional reactions, and describe a simple model summarizing what goes wrong with the reasoning in these cases, which in turn suggests a way of systematically avoiding this kind of problem.

The following model is used to describe the process of getting from a question to a (potentially biased) answer for the purposes of this article. First, you ask yourself a question. Second, in the context of the question, a data set is presented before your mind, either directly, by looking at explicit statements of fact, or indirectly, by associated facts becoming salient to your attention, triggered by the explicit data items or by the question. Third, by considering the data set, you construct an intuitive model of some phenomenon, which allows you to see its properties. And finally, you pronounce the answer, which is read out as one of the properties of the model you've just constructed.

This description is meant to present mental paintbrush handles, to refer to the things you can see introspectively, and the things you could operate consciously if you choose to.

Most of the biases in the considered class may be seen as particular ways in which you pay attention to the wrong data set, one not representative of the phenomenon you model to get to the answer you seek. As a result, the intuitive model becomes systematically wrong, and the answer read out from it comes out biased. Below I review the specific biases, to identify the ways in which things go wrong in each particular case, and then I summarize the classes of reasoning mistakes that play major roles in these biases, and correspondingly the ways of avoiding those mistakes.

Correspondence Bias is a tendency to attribute to a person a disposition to behave in a particular way, based on observing an episode in which that person behaves in that way. The data set that gets considered consists only of the observed episode, while the target model is of the person's behavior in general, in many possible episodes, in many different possible contexts that may influence the person's behavior.

Hindsight bias is a tendency to overestimate the a priori probability of an event that has actually happened. The data set that gets considered overemphasizes the scenario that did happen, while the model that needs to be constructed, of the a priori belief, should be indifferent to which of the options will actually get realized. From this model, you need to read out the probability of the specific event, but which event you'll read out shouldn't figure into the model itself.

Availability bias is a tendency to estimate the probability of an event based on whatever evidence about that event pops into your mind, without taking into account the ways in which some pieces of evidence are more memorable than others, or easier to come by than others. This bias directly consists in considering a mismatched data set that leads to a distorted model and a biased estimate.
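A minimal sketch of how this plays out statistically, assuming an invented population with a 10% rate of dramatic failures and an invented 5x memorability factor for them:

```python
import random

random.seed(0)

# Hypothetical population of events: 10% dramatic failures, 90% quiet successes.
population = ["failure"] * 100 + ["success"] * 900

# Representative recall: every event is equally likely to come to mind.
representative = random.sample(population, 200)

# Availability-biased recall: dramatic failures are, say, 5x more memorable,
# so they are overrepresented in what comes to mind.
weights = [5 if e == "failure" else 1 for e in population]
memorable = random.choices(population, weights=weights, k=200)

def failure_rate(sample):
    return sum(e == "failure" for e in sample) / len(sample)

print("true rate:            ", failure_rate(population))      # 0.10
print("representative recall:", failure_rate(representative))  # close to 0.10
print("memorability-biased:  ", failure_rate(memorable))       # well above 0.10
```

The estimate read out from the biased sample overstates the true frequency, even though every individual recalled item is a genuine fact.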

Planning Fallacy is a tendency to overestimate your efficiency in achieving a task. The data set you consider consists of the simple cached ways in which you go about accomplishing the task, and lacks the unanticipated problems and more complex ways in which the process may unfold. As a result, the model fails to adequately describe the phenomenon, and the answer comes out systematically wrong.
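A minimal sketch of the contrast with the outside view, which counters this bias; the per-step estimates and the reference class of past projects are invented for illustration:

```python
# Inside view: sum of optimistic per-step estimates (hypothetical numbers, in days).
inside_view_steps = [2, 1, 3, 2]  # e.g. "write spec", "implement", "test", "polish"
inside_view_estimate = sum(inside_view_steps)

# Outside view: how long similar past projects actually took (also hypothetical).
reference_class = sorted([12, 9, 20, 15, 11, 25, 14])
outside_view_estimate = reference_class[len(reference_class) // 2]  # median

print(f"inside view:  {inside_view_estimate} days")   # 8 days
print(f"outside view: {outside_view_estimate} days")  # 14 days
```

The inside view is built from the cached, problem-free version of the process; the outside view swaps that data set for one that already contains the unanticipated complications, because the past projects actually ran into them.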

The Logical Fallacy of Generalization from Fictional Evidence consists in drawing real-world conclusions from statements invented and selected for the purpose of writing fiction. The data set is not at all representative of the real world, and in particular of whatever real-world phenomenon you need to understand to answer your real-world question. Considering this data set leads to an inadequate model, and inadequate answers.

Proposing Solutions Prematurely is dangerous because it introduces weak conclusions into the pool of facts you are considering. As a result, the data set you think about becomes weaker, overly tilted towards premature conclusions that are likely to be wrong and that are less representative of the phenomenon you are trying to model than the initial facts you started from, before you came up with the premature conclusions.

Generalization From One Example is a tendency to pay too much attention to the few anecdotal pieces of evidence you have experienced yourself, and to model some general phenomenon based on them. This is a special case of availability bias, and the way in which the mistake unfolds is closely related to correspondence bias and hindsight bias.

Contamination by Priming is a problem that relates to the process of implicitly introducing facts into the attended data set. When you are primed with a concept, facts related to that concept come to mind more easily. As a result, the data set selected by your mind becomes tilted towards elements related to that concept, even if it has no relation to the question you are trying to answer. Your thinking becomes contaminated, shifted in a particular direction. The data set in your focus of attention becomes less representative of the phenomenon you are trying to model, and more representative of the concepts you were primed with.

Knowing About Biases Can Hurt People. When you learn about the biases, you obtain a toolset for constructing new statements of fact. Similarly to what goes wrong when you propose solutions to a hard problem prematurely, you contaminate the data set with weak conclusions: allegations against specific data items that don't add to the understanding of the phenomenon you are trying to model, that distract from considering the question, take away whatever relevant knowledge you had, and in some cases even invert it.

A more general technique for not making these mistakes consists in making sure that the data set you consider is representative of the phenomenon you are trying to understand. The human brain can't automatically correct for a misleading selection of data, so you need to consciously ensure that you get presented with a balanced selection.

The first mistake is the introduction of irrelevant data items. Focus on the problem; don't let distractions get their way. Irrelevant data may find its way into your thoughts covertly, through priming effects you don't even notice. Don't let anything distract you, even if you understand that the distraction isn't related to the problem you are working on. Don't construct the irrelevant items yourself, as byproducts of your activity. Make sure that the data items you consider are actually related to the phenomenon you are trying to understand. To form accurate beliefs about something, you really do have to observe it. Don't think about fictional evidence, and don't think about facts that look superficially relevant to the question but actually aren't, as in the case of hindsight bias and reasoning by surface analogies.

The second mistake is to consider an unbalanced data set, overemphasizing some aspects of the phenomenon and underemphasizing others. The data needs to cover the whole phenomenon in a representative way for the human mind to process it adequately. There are two sides to correcting this imbalance. First, you may take away the excessive data points, deliberately refusing to consider them, so that your mind gets presented with less evidence, but evidence that is more balanced, more representative of what you are trying to understand. This is similar to what happens when you take an outside view, for example, to avoid the planning fallacy. Second, you may generate the correct data items to fill out the rest of the model, from the cluster of evidence you've got. This generation may happen either formally, through using technical models of the phenomenon that allow you to explicitly calculate more facts, or informally, through training your intuition to follow reliable rules for interpreting the specific pieces of evidence as aspects of the whole phenomenon you are studying. Together, these feats constitute expertise in the domain, an art of knowing how to make use of data that would only confuse a naive mind. When discarding evidence to correct the imbalance of data, only the parts you don't possess expertise in need to be thrown away, while the parts that you are ready to process may be kept, making your understanding of the phenomenon stronger.
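A minimal sketch of the first kind of correction, discarding excess data points so that what remains matches known base rates; the categories, counts, and base rates here are invented for illustration:

```python
import random

random.seed(0)

# Recalled evidence, tilted towards the more salient category (hypothetical counts).
recalled = ["dramatic"] * 60 + ["ordinary"] * 40

# Known base rates of the phenomenon (also hypothetical): dramatic cases are rare.
base_rates = {"dramatic": 0.1, "ordinary": 0.9}

def rebalance(sample, base_rates):
    """Discard excess items so that category proportions match the base rates."""
    by_category = {c: [x for x in sample if x == c] for c in base_rates}
    # Largest total size that every category has enough items to support.
    total = min(len(items) / base_rates[c] for c, items in by_category.items())
    balanced = []
    for c, items in by_category.items():
        balanced.extend(random.sample(items, int(total * base_rates[c])))
    return balanced

balanced = rebalance(recalled, base_rates)
print(len(balanced), balanced.count("dramatic") / len(balanced))
# Fewer items overall, but the proportions now roughly match the base rates.
```

Throwing data away feels wasteful, but the smaller balanced set is what an untrained mind can process without being skewed by the overrepresented category.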

The third mistake is to mix reliable evidence with unreliable evidence. The mind can't tell relevant info from fictional irrelevant info, let alone solid relevant evidence from shaky relevant evidence. If you know some facts for sure, and some facts only through indirect unreliable methods, don't consider the latter at all when forming the initial understanding of the phenomenon. Your own untrained intuition generates weak facts about things in which you don't have domain expertise, for example when you spontaneously think up solutions to a hard problem. You get only wild guesses when the data is too thin for your intuition to retain at least minimal reliability once it moves a few steps away from the data. You get weak evidence from applying general heuristics that don't promise exceptional precision, such as knowledge of biases. You get weak evidence from listening to the opinion of the majority, or to virulent memes. However, when you don't have reliable data, you need to start including less reliable evidence in your considerations, but only the best of what you can come up with.
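If you do have to combine evidence of mixed quality, one way to keep the shaky items from dominating is to make their unreliability explicit and weight them accordingly; a minimal sketch with invented numbers, using inverse-variance weighting:

```python
# Several estimates of the same quantity, each with a stated uncertainty
# (standard deviation). All numbers are invented for illustration.
estimates = [
    (10.0, 1.0),   # solid evidence: a direct measurement
    (11.0, 1.5),   # reasonably reliable
    (25.0, 10.0),  # shaky evidence: little more than a guess
]

# Naive averaging treats every item alike, the way an unaided mind tends to.
naive = sum(value for value, _ in estimates) / len(estimates)

# Inverse-variance weighting lets the shaky item contribute very little.
weights = [1 / sd**2 for _, sd in estimates]
weighted = sum(w * value for (value, _), w in zip(estimates, weights)) / sum(weights)

print(f"naive average:        {naive:.2f}")     # pulled towards the guess (~15.3)
print(f"reliability-weighted: {weighted:.2f}")  # stays near the solid evidence (~10.4)
```

The unaided mind performs something closer to the naive average, which is why it is safer to exclude the weakest items outright unless nothing better is available.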

Your thinking shouldn't be contaminated by unrelated facts, shouldn't tumble over from an imbalance in knowledge, and shouldn't get diluted by an abundance of weak conclusions. Instead, your understanding should grow more focused on the relevant details, more comprehensive and balanced, attending to more aspects of the problem, and more technically accurate.

Think representative sets of your best data.