Are these cognitive biases, biases?

Continuing my special report on people who don’t think human reasoning is all that bad, I’ll now briefly present some studies claiming that phenomena other researchers have considered signs of faulty reasoning aren’t actually that. I found these in Gigerenzer (2004), which I in turn came across when looking for further work on the Take the Best algorithm.

Before we get to the list—what is Gigerenzer’s exact claim when he lists these previous studies? Well, he’s saying that minds aren’t actually biased, but may make judgments that seem biased in certain environments.

Table 4.1 Twelve examples of phenomena that were first interpreted as “cognitive illusions” but later revalued as reasonable judgments given the environmental structure. [...]

The general argument is that an unbiased mind plus environmental structure (such as unsystematic error, unequal sample sizes, skewed distributions) is sufficient to produce the phenomenon. Note that other factors can also contribute to some of the phenomena. The moral is not that people would never err, but that in order to understand good and bad judgments, one needs to analyze the structure of the problem or of the natural environment.

On to the actual examples. Of the twelve examples referenced, I’ve included three for now.

The False Consensus Effect

Bias description: People tend to imagine that everyone responds the way they do. They tend to see their own behavior as typical. The tendency to exaggerate how common one’s opinions and behavior are is called the false consensus effect. For example, in one study, subjects were asked to walk around on campus for 30 minutes, wearing a sign board that said “Repent!”. Those who agreed to wear the sign estimated that on average 63.5% of their fellow students would also agree, while those who disagreed estimated 23.3% on average.

Counterclaim (Dawes & Mulford, 1996): The correctness of reasoning should not be judged by whether one happens to arrive at the correct result, but by whether one reaches reasonable conclusions given the data one has. Suppose we ask people to estimate whether an urn contains more blue balls or red balls, after allowing them to draw one ball. If one person first draws a red ball, and another person draws a blue ball, then we should expect them to give different estimates. In the absence of other data, you should treat your own preferences as evidence for the preferences of others. Although the actual proportion of people willing to carry a sign saying “Repent!” probably lies somewhere in between the estimates given, those estimates are quite close to the one-third and two-thirds estimates that would arise from a Bayesian analysis with a uniform prior distribution of belief. A study by the authors suggested that people do in fact give their own opinion roughly the right amount of weight.
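To make the Bayesian analysis concrete, here’s a minimal sketch (my own illustration, not from the paper) using Laplace’s rule of succession: with a uniform Beta(1, 1) prior over the proportion of people who would agree, your own choice acts as a single observed sample, and the posterior mean lands exactly on the one-third and two-thirds figures mentioned above.

```python
from fractions import Fraction

def posterior_mean(agree: int, total: int) -> Fraction:
    """Posterior mean of the agreement rate under a uniform Beta(1, 1) prior.

    By Laplace's rule of succession, seeing `agree` agreements out of
    `total` observations gives a posterior mean of (agree + 1) / (total + 2).
    """
    return Fraction(agree + 1, total + 2)

# Treat your own choice as the single observation you have:
print(posterior_mean(1, 1))  # 2/3: someone who agreed to wear the sign
print(posterior_mean(0, 1))  # 1/3: someone who refused
```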

Overconfidence / Underconfidence

Bias description: Present people with binary yes/no questions and ask them to state, on a scale from .5 to 1, how confident they are that they got the answer correct. The mean subjective probability x assigned to the correctness of general knowledge items tends to exceed the proportion of correct answers c, i.e. x − c > 0; people are overconfident. The hard-easy effect says that people tend to be underconfident on easy questions and overconfident on hard ones.
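For concreteness, here’s how such an over/underconfidence score is computed; the responses below are hypothetical numbers of my own, not data from any study.

```python
def overconfidence(confidences, correct):
    """Mean stated confidence minus proportion correct.

    `confidences` holds subjective probabilities (each in [0.5, 1.0]);
    `correct` holds booleans for whether each answer was right.
    A positive result is read as overconfidence, a negative one as
    underconfidence.
    """
    x = sum(confidences) / len(confidences)  # mean subjective probability
    c = sum(correct) / len(correct)          # proportion of correct answers
    return x - c

# Hypothetical responses to ten yes/no questions:
conf = [0.9, 0.8, 0.7, 1.0, 0.6, 0.9, 0.8, 0.5, 0.7, 0.8]
right = [True, True, False, True, False, True, False, True, False, True]
print(overconfidence(conf, right))  # ~0.17: x = .77, c = .60 -> overconfident
```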

Counterclaim (Juslin, Winman & Olsson 2000): The apparent overconfidence and underconfidence effects are caused by a number of statistical phenomena, such as scale-end effects, linear dependency, and regression effects. In particular, the questions in the relevant studies have been selectively drawn in a manner that is unrepresentative of the actual environment, throwing off the participants’ estimates of their own accuracy. Define a “representative” item sample as one coming from a study containing explicit statements that (a) a natural environment had been defined and (b) the items had been generated by random sampling of this environment. Define any study that didn’t describe how the items had been chosen, or that explicitly describes a different procedure, as having a “selected” item sample. A survey of several studies contained 95 independent data points with selected item samples and 35 independent data points with representative item samples, where “independence” means different participant samples (i.e. all data points were between subjects).

For studies with selected item samples, the mean subjective probability was .73 and the actual proportion correct was .64, indicating a clear overconfidence effect. However, for studies with representative item samples, the mean subjective probability was .73 and the proportion correct was .72, indicating close to no overconfidence. The over/underconfidence effect of nearly zero for the representative samples was also not a mere consequence of averaging: for the selected item samples, the mean absolute bias was .10, while for the representative item samples it was .03. Once scale-end effects and linear dependency are controlled for, the remaining hard-easy effect is rather modest.

What does a “representative” sample mean here? If I understood correctly: Imagine that you know that 30% of the people living in a certain city are black, and 70% are white. You’re then presented with questions where you have to guess whether a given inhabitant of the city is black or white. If you don’t have any other information, you know that consistently guessing “white” on every question will get you 70% correct. So when the questionnaire also asks for your calibration, you say that you’re 70% certain of each answer.

Now, assuming that the survey questions had been composed by randomly sampling from all the inhabitants of the city (a “representative” sampling), then you would indeed be correct about 70% of the time and be well-calibrated. But assume that instead, all the people the survey asked about live in a certain neighborhood, which happens to be predominantly black (a “selected” sampling). Now you might have only 40% right answers, while you indicated a confidence of 70%, so the researchers behind the survey mark you as overconfident.
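Here’s a small simulation of this scenario (the city, the 70/30 split, and the 40%-white neighborhood are all just my illustration, not numbers from the paper). The same strategy and the same stated confidence look well-calibrated or overconfident purely depending on how the questions were sampled:

```python
import random

random.seed(0)

def accuracy_of_guessing_white(p_white: float, n_questions: int) -> float:
    """Accuracy of always guessing 'white' when each question concerns a
    person who is white with probability `p_white`."""
    hits = sum(random.random() < p_white for _ in range(n_questions))
    return hits / n_questions

stated_confidence = 0.70  # based on knowing that 70% of the city is white

# Representative sampling: questions drawn from the whole city.
print(accuracy_of_guessing_white(0.70, 10_000))  # ~0.70: well-calibrated

# Selected sampling: questions drawn from a mostly-black neighborhood,
# say 40% white. Same strategy, same stated confidence, but now it looks
# like overconfidence: ~0.40 correct against a stated 0.70.
print(accuracy_of_guessing_white(0.40, 10_000))
```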

Availability Bias

Bias description: We estimate probabilities based on how easily instances are recalled, not based on their actual frequency. In a classic study by Tversky & Kahneman, participants were given five consonants (K, L, N, R, V) and asked to estimate, for each, whether it appeared more frequently as the first or the third letter of a word. Each letter was judged by the participants to occur more frequently as the first letter, even though all five actually occur more frequently as the third letter. This was assumed to be because words starting with a particular letter are more easily recalled than words that have that letter in the third position.

Counterclaim (Sedlmeier, Hertwig & Gigerenzer 1998): Not only does the only attempted replication of Tversky & Kahneman’s result seem to be a single one-page article, the result also seems to be contradicted by a number of studies suggesting that memory is often (though not always) excellent at storing frequency information from various environments. In particular, several authors have documented that participants’ judgments of the frequency of letters and words generally show a remarkable sensitivity to the actual frequencies. The one previous study that did try to replicate the classic experiment failed to do so. It used Tversky & Kahneman’s five consonants, all more frequent in the third position, together with five other consonants that are more frequent in the first position. All five consonants that appear more often in the first position were judged to do so; three of the five consonants that appear more frequently in the third position were also judged to do so.

The classic article did not specify a mechanism for how the availability heuristic might work. The current authors considered four different mechanisms. Availability by number states that, if asked for the proportion in which a certain letter occurs in the first versus a later position in words, one produces words with this letter in the respective positions and uses the produced proportion as an estimate of the actual proportion. Availability by speed states that one produces single words with the letter in each position, and uses the ratio of the retrieval times as an estimate of the actual proportion. The letter class hypothesis notes that the original sample was atypical: most consonants (12 of 20) are in fact more frequent in the first position. This hypothesis assumes that people know whether consonants or vowels are more frequent in which position, and default to that knowledge. The regressed frequencies hypothesis assumes that people do actually have rather good knowledge of the actual frequencies, but that their estimates are regressed towards the mean: low frequencies are overestimated and high frequencies underestimated.
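As a toy illustration of the regressed frequencies hypothesis (the shrinkage factor and the example percentages are assumptions of mine, not fitted values from the paper), estimates can be modeled as the true values pulled part of the way toward the overall mean:

```python
def regressed_estimate(true_pct: float, mean_pct: float = 50.0,
                       shrink: float = 0.6) -> float:
    """Regress a true first-position percentage toward the overall mean.

    A shrink factor below 1 pulls estimates toward `mean_pct`, so low
    true frequencies get overestimated and high ones underestimated.
    """
    return mean_pct + shrink * (true_pct - mean_pct)

# Hypothetical true percentages of a letter occurring in the first
# (rather than the second) position:
for true in (10.0, 30.0, 50.0, 70.0, 90.0):
    print(true, "->", regressed_estimate(true))
# 10 -> 26, 30 -> 38, 50 -> 50, 70 -> 62, 90 -> 74:
# overestimated below the mean, underestimated above it.
```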

After two studies made to calibrate the predictions of the availability hypotheses, three main studies were conducted. In each, the participants were asked whether a certain letter was more frequent in the first or second position of all German words. They were also asked about the proportions of each letter appearing in the first or second position. Study one was a basic replication of the Tversky & Kahneman study, albeit with more letters. Study two was designed to be favorable to the letter class hypothesis: each participant was only given one letter whose frequency to judge instead of several. It was thought that participants may have switched away from a letter class strategy when presented with multiple consonants and vowels. Study three was designed to be favorable to the availability hypotheses, in that the participants were made to first produce words with the letters O, U, N and R in the first and second position (90 seconds per letter) before proceeding as in study one. Despite two of the studies having been explicitly constructed to be favorable to the other hypotheses, the predictions of the regressed frequency hypothesis had the best match to the actual estimates in all three studies. Thus it seems that people are capable of estimating letter frequencies, although in a regressed form.

The authors propose two different explanations for the discrepancy with the classic study. One is that the corpus used by Tversky & Kahneman only covers words at least three letters long, while English has plenty of one- and two-letter words. The participants in the classic study were told to disregard words with fewer than three letters, but it may be that they were unable to properly do so. Alternatively, the discrepancy may have been caused by the use of an unrepresentative sample of letters: had the authors used only consonants that are more frequent in the second position, then they too would have reported that the frequency of those letters in the first position is overestimated. However, a consideration of all the consonants tested shows that the frequency of those in the first position is actually underestimated. This disagrees with the interpretation of Tversky & Kahneman, and implies a regression effect as the main cause.

EDIT: This result doesn’t mean that the availability heuristic would be a myth, of course. It is, AFAIK, true that e.g. biased reporting in the media will throw off people’s conceptions of which events are the most likely. But one probably wouldn’t be too far from the truth in saying that even in that case, the brain is still computing relative frequencies correctly given the information at hand; it’s just that the media reporting is biased. The claim that there are some types of important information for which the mind has particular difficulty assessing relative frequencies correctly, though, doesn’t seem to be as well supported as is sometimes claimed.