Conjunction Controversy (Or, How They Nail It Down)

Followup to: Conjunction Fallacy

When a single experiment seems to show that subjects are guilty of some horrifying sinful bias—such as thinking that the proposition “Bill is an accountant who plays jazz” has a higher probability than “Bill is an accountant”—people may try to dismiss (not defy) the experimental data. Most commonly, by questioning whether the subjects interpreted the experimental instructions in some unexpected fashion—perhaps they misunderstood what you meant by “more probable”.

Experiments are not beyond questioning; on the other hand, there should always exist some mountain of evidence which suffices to convince you. It’s not impossible for researchers to make mistakes. It’s also not impossible for experimental subjects to be really genuinely and truly biased. It happens. On both sides, it happens. We’re all only human here.

If you think to extend a hand of charity toward experimental subjects, casting them in a better light, you should also consider thinking charitably of scientists. They’re not stupid, you know. If you can see an alternative interpretation, they can see it too. This is especially important to keep in mind when you read about a bias and one or two illustrative experiments in a blog post. Yes, if the few experiments you saw were all the evidence, then indeed you might wonder. But you might also wonder if you’re seeing all the evidence that supports the standard interpretation. Especially if the experiments have dates on them like “1982” and are prefaced with adjectives like “famous” or “classic”.

So! This is a long post. It is a long post because nailing down a theory requires more experiments than the one or two vivid illustrations needed to merely explain. I am going to cite maybe one in twenty of the experiments that I’ve read about, which is maybe a hundredth of what’s out there. For more information, see Tversky and Kahneman (1983) or Kahneman and Frederick (2002), both available online, from which this post is primarily drawn.

Here is (probably) the single most questioned experiment in the literature of heuristics and biases, which I reproduce here exactly as it appears in Tversky and Kahneman (1982):

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Please rank the following statements by their probability, using 1 for the most probable and 8 for the least probable:

(5.2) Linda is a teacher in elementary school.
(3.3) Linda works in a bookstore and takes Yoga classes.
(2.1) Linda is active in the feminist movement. (F)
(3.1) Linda is a psychiatric social worker.
(5.4) Linda is a member of the League of Women Voters.
(6.2) Linda is a bank teller. (T)
(6.4) Linda is an insurance salesperson.
(4.1) Linda is a bank teller and is active in the feminist movement. (T & F)

(The numbers at the start of each line are the mean ranks of each proposition, lower being more probable.)

How do you know that subjects did not interpret “Linda is a bank teller” to mean “Linda is a bank teller and is not active in the feminist movement”? For one thing, dear readers, I offer the observation that most bank tellers, even the ones who participated in anti-nuclear demonstrations in college, are probably not active in the feminist movement. So even under that interpretation, Teller should still rank above Teller & Feminist. You should be skeptical of your own objections, too; else it is disconfirmation bias. But the researchers did not stop with this observation; instead, in Tversky and Kahneman (1983), they created a between-subjects experiment in which either the conjunction or the two conjuncts were deleted. Thus, in the between-subjects version of the experiment, each subject saw either (T&F), or (T), but not both. With a total of five propositions ranked, the mean rank of (T&F) was 3.3 and the mean rank of (T) was 4.4, N=86. Thus, the fallacy is not due solely to interpreting “Linda is a bank teller” to mean “Linda is a bank teller and not active in the feminist movement.”
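The arithmetic behind that observation is worth making concrete. Here is a minimal sketch; the probabilities are invented for illustration, not taken from any survey, and the point is only that the product rule caps the conjunction at the lone conjunct:

```python
# Invented, illustrative probabilities -- not data from the experiments.
p_teller = 0.05                # P(Linda is a bank teller)
p_feminist_given_teller = 0.4  # P(active feminist | bank teller), generously high

# Product rule: P(T & F) = P(T) * P(F | T)
p_teller_and_feminist = p_teller * p_feminist_given_teller

# Since P(F | T) <= 1, the conjunction can never outrank the conjunct.
assert p_teller_and_feminist <= p_teller
print(round(p_teller_and_feminist, 3))  # prints 0.02
```

However generously you estimate feminism among bank tellers, multiplying by a conditional probability can only shrink the number.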

Similarly, the experiment discussed yesterday used a between-subjects design (where each subject only saw one statement) to elicit lower probabilities for “A complete suspension of diplomatic relations between the USA and the Soviet Union, sometime in 1983” versus “A Russian invasion of Poland, and a complete suspension of diplomatic relations between the USA and the Soviet Union, sometime in 1983”.

Another way of knowing whether subjects have misinterpreted an experiment is to ask the subjects directly. Also in Tversky and Kahneman (1983), a total of 103 medical internists (including 37 internists taking a postgraduate course at Harvard, and 66 internists with admitting privileges at New England Medical Center) were given problems like the following:

A 55-year-old woman had pulmonary embolism documented angiographically 10 days after a cholecystectomy. Please rank order the following in terms of the probability that they will be among the conditions experienced by the patient (use 1 for the most likely and 6 for the least likely). Naturally, the patient could experience more than one of these conditions.

  • Dyspnea and hemiparesis

  • Calf pain

  • Pleuritic chest pain

  • Syncope and tachycardia

  • Hemiparesis

  • Hemoptysis

As Tversky and Kahneman note, “The symptoms listed for each problem included one, denoted B, that was judged by our consulting physicians to be nonrepresentative of the patient’s condition, and the conjunction of B with another highly representative symptom denoted A. In the above example of pulmonary embolism (blood clots in the lung), dyspnea (shortness of breath) is a typical symptom, whereas hemiparesis (partial paralysis) is very atypical.”

In indirect tests, the mean ranks of A&B and B respectively were 2.8 and 4.3; in direct tests, they were 2.7 and 4.6. In direct tests, subjects ranked A&B above B from 73% to 100% of the time, with an average of 91%.

The experiment was designed to eliminate, in four ways, the possibility that subjects were interpreting B to mean “only B (and not A)”. First, by carefully wording the instructions: “...the probability that they will be among the conditions experienced by the patient”, plus an explicit reminder, “the patient could experience more than one of these conditions”. Second, by including indirect tests as a comparison. Third, the researchers afterward administered a questionnaire:

In assessing the probability that the patient described has a particular symptom X, did you assume that (check one):
X is the only symptom experienced by the patient?
X is among the symptoms experienced by the patient?

Of the 62 physicians asked this question, 60 checked the second answer.

Fourth and finally, as Tversky and Kahneman write, “An additional group of 24 physicians, mostly residents at Stanford Hospital, participated in a group discussion in which they were confronted with their conjunction fallacies in the same questionnaire. The respondents did not defend their answers, although some references were made to ‘the nature of clinical experience.’ Most participants appeared surprised and dismayed to have made an elementary error of reasoning.”

A further experiment is also discussed in Tversky and Kahneman (1983), in which 93 subjects rated the probability that Bjorn Borg, a strong tennis player, would in the Wimbledon finals “win the match”, “lose the first set”, “lose the first set but win the match”, and “win the first set but lose the match”. The conjunction fallacy was expressed: “lose the first set but win the match” was ranked more probable than “lose the first set”. Subjects were also asked to verify whether various strings of wins and losses would count as extensional examples of each case. Indeed, subjects interpreted the cases as conjunctions, satisfied iff both constituents were satisfied; they did not interpret them as material implications, conditional statements, or disjunctions, and constituent B was not interpreted to exclude constituent A. The genius of this experiment was that the researchers could directly test what subjects thought was the meaning of each proposition, ruling out a very large class of misunderstandings.

Does the conjunction fallacy arise because subjects misinterpret what is meant by “probability”? This can be excluded by offering students bets with payoffs. In addition to the colored dice discussed yesterday, subjects have been asked which possibility they would prefer to bet $10 on in the classic Linda experiment. This did reduce the incidence of the conjunction fallacy, but only to 56% (N=60), which is still more than half the students.

But the ultimate proof of the conjunction fallacy is also the most elegant. In the conventional interpretation of the Linda experiment, subjects substitute judgment of representativeness for judgment of probability: their feeling of similarity between each of the propositions and Linda’s description determines how plausible it feels that each of the propositions is true of Linda. If this central theory is true, then the way in which the conjunction fallacy follows is obvious—Linda more closely resembles a feminist than a feminist bank teller, and more closely resembles a feminist bank teller than a bank teller. Well, that is our theory about what goes on in the experimental subjects’ minds, but how could we possibly know? We can’t look inside their neural circuits—not yet! So how would you construct an experiment to directly test the standard model of the Linda experiment?

Very easily. You just take another group of experimental subjects, and ask them how much each of the propositions “resembles” Linda. This was done—see Kahneman and Frederick (2002)—and the correlation between representativeness and probability was nearly perfect. 0.99, in fact. Here’s the (rather redundant) graph:

[Graph: mean probability rank plotted against mean representativeness rank for the Linda propositions, showing a nearly perfect correlation]

This has been replicated for numerous other experiments. For example, in the medical experiment described above, an independent group of 32 physicians from Stanford University was asked to rank each list of symptoms “by the degree to which they are representative of the clinical condition of the patient”. The correlation between probability rank and representativeness rank exceeded 95% on each of the five tested medical problems.
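That correlation is straightforward to compute. In the sketch below, the probability ranks are the mean ranks from the Linda experiment quoted above; the representativeness ranks are invented stand-ins for illustration (the real ones appear in Kahneman and Frederick, 2002):

```python
# Mean probability ranks from the Linda experiment (quoted above),
# paired with hypothetical representativeness ranks for illustration.
prob_ranks = [5.2, 3.3, 2.1, 3.1, 5.4, 6.2, 6.4, 4.1]
repr_ranks = [5.0, 3.4, 1.9, 3.2, 5.5, 6.1, 6.5, 4.2]  # invented

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(prob_ranks, repr_ranks)
assert r > 0.95  # near-perfect, mirroring the published result
```

If subjects were judging probability by some route unrelated to similarity, there would be no reason for these two rank orderings to line up so tightly.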

Now, a correlation near 1 does not prove that subjects are substituting judgment of representativeness for judgment of probability. But if you want to claim that subjects are doing something else, I would like to hear the explanation for why the correlation comes out so close to 1. It will really take quite a complicated story to explain, not just why the subjects have an elaborate misunderstanding that produces an innocent and blameless conjunction fallacy, but also how it comes out to a completely coincidental correlation of nearly 1 with subjects’ feeling of similarity. Across multiple experimental designs.

And we all know what happens to the probability of complicated stories: they go down when you add details to them.

Really, you know, sometimes people just make mistakes. And I’m not talking about the researchers here.

The conjunction fallacy is probably the single most questioned bias ever introduced, which means that it now ranks among the best replicated. The conventional interpretation has been nearly absolutely nailed down. Questioning, in science, calls forth answers.

I emphasize this because it seems that when I talk about biases (especially to audiences not previously familiar with the field), a lot of people want to be charitable to experimental subjects. But it is not only experimental subjects who deserve charity. Scientists can also be unstupid. Someone else has already thought of your alternative interpretation. Someone else has already devised an experiment to test it. Maybe more than one. Maybe more than twenty.

A blank map is not a blank territory; if you don’t know whether someone has tested it, that doesn’t mean no one has tested it. This is not a hunter-gatherer tribe of two hundred people, where if you do not know a thing, then probably no one in your tribe knows. There are six billion people in the world, and no one can say with certitude that science does not know a thing; there is too much science. Absence of such evidence is only extremely weak evidence of absence. So do not mistake your ignorance of whether an alternative interpretation has been tested for the positive knowledge that no one has tested it. Be charitable to scientists too. Do not say, “I bet what really happened was X”, but ask, “Which experiments discriminated between the standard interpretation versus X?”

If it seems that I am driving this point home with a sledgehammer, well, yes, I guess I am. It does become a little frustrating, sometimes—to know about this overwhelming mountain of evidence from thousands of experiments, while other people have no clue that it exists. After all, if there were other experiments supporting the result, why wouldn’t they have heard of them? It’s a small tribe, after all; surely they would have heard. By the same token, I have to make a conscious effort to remember that other people don’t know about the evidence, and that they aren’t deliberately ignoring it in order to annoy me. Which is why it gets a little frustrating sometimes! We just aren’t built for worlds of six billion people.

I’m not saying, of course, that people should stop asking questions. If you stop asking questions, you’ll never find out about the mountains of experimental evidence. Faith is not understanding, only belief in a password. It is futile to believe in something, however fervently, when you don’t really know what you’re supposed to believe in. So I’m not saying that you should take it all on faith. I’m not saying to shut up. I’m not trying to make you feel guilty for asking questions.

I’m just saying, you should suspect the existence of other evidence, when a brief account of accepted science raises further questions in your mind. Not believe in that unseen evidence, just suspect its existence. The more so if it is a classic experiment with a standard interpretation. Ask a little more gently. Put less confidence in your brilliant new alternative hypothesis. Extend some charity to the researchers, too.

And above all, talk like a pirate. Arr!


Kahneman, D. and Frederick, S. 2002. Representativeness revisited: Attribute substitution in intuitive judgment. Pp. 49-81 in Gilovich, T., Griffin, D. and Kahneman, D., eds. Heuristics and Biases: The Psychology of Intuitive Judgment. New York: Cambridge University Press.

Tversky, A. and Kahneman, D. 1982. Judgments of and by representativeness. Pp. 84-98 in Kahneman, D., Slovic, P. and Tversky, A., eds. Judgment under uncertainty: Heuristics and biases. New York: Cambridge University Press.

Tversky, A. and Kahneman, D. 1983. Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90: 293-315.