Hearsay, Double Hearsay, and Bayesian Updates

Application of: How Much Evidence Does It Take?

(trigger warning: some description of domestic violence)

Summary: I discuss the strengths and weaknesses of one way that the American legal system tries to assess and cope with the unreliability of certain kinds of evidence. After explaining the relevant rules with references to a few recent famous cases and a non-notable case that I’m working on now, I briefly consider whether this part of the evidence code is above or below the sanity waterline, and suggest an incremental improvement.

Recently, I got to the point in my legal career where people are trusting me to write evidentiary briefs, i.e., to argue in front of a judge about what kinds of evidence are reliable enough to be safely presented to a jury. There is an odd division of epistemological labor in the American court system: judges are thought to be better than juries at resisting passionate or manipulative oratory, and juries are thought to be better than judges at resisting bribery and (pre-existing) personal hatred. As a result, potentially inflammatory or unreliable evidence is presented first to a judge, who (much like one of Eliezer’s Confessors) is supposed to sift the exhibit to see if normal people can handle it without losing their tenuous grip on sanity. If and only if the evidence seems safe for ordinary human consumption, the judge will allow the lawyers to argue about that evidence in front of the jury. Otherwise, the evidence sits in a cardboard box in an unheated warehouse, safely away from the eyes of the jury, until it’s time for an appeal.

The Hearsay Rule

By way of a concrete example, one famous recent case featured a recorded 911 call made by a domestic violence victim to the emergency phone operator. The operator asked questions about the location and identity of the person who was accused of beating the caller. The caller answered the questions on tape, explicitly identifying her abuser as Mr. Adrian Martell Davis, and the answers were used first to find and arrest the suspect, and ultimately to convict him. The victim was apparently too intimidated to testify in open court, and so her recorded statement as to the name of her abuser was absolutely necessary to support a conviction—no recording, no conviction. Under the 400-year-old hearsay rule, recorded testimony typically is not allowed to be presented to a jury—courts are concerned that the person giving the recorded statement might be pressured by the police in ways that wouldn’t show up on tape, and that allowing a witness to testify without showing up in court unfairly deprives the defendant of a chance to (a) cross-examine the witness, and (b) have the jury see any facial tics, body language, etc. that undercut the witness’s credibility. In the 911 case, though, the Court faced a straight choice between finding an exception to the hearsay rule and letting an apparent abuser go free.

In making this choice, the US Supreme Court managed to ignore a variety of emotionally salient but epistemologically irrelevant distractions, such as the seriousness of the crime, the relative helplessness of the victim, and the respectability of the 911 operator. Instead, the Court focused on the purpose for which the 911 statements were obtained. If the statements were obtained to help gather information needed to safely resolve an ongoing emergency, they could be used at trial. If the statements, however, were obtained to gather information about a past event, they could *not* be used at trial.

The theory supporting this distinction seems to have been that the right to cross-examine and the right to have the jury see body language are fungible elements of a more general reliability test. A stranger’s assertion, without more, could be true or could be false. It doesn’t count as very much evidence. To turn an assertion into enough evidence to convict someone beyond a reasonable doubt, you need to show that the assertion comes with “indicia of reliability.” Two of these indicia are cross-examination and body language: if a story checks out despite a vigorous unfriendly interview and the peer pressure of having to tell the story while physically in the room with other people from your community, then that’s pretty good evidence. But you might have reasons to believe a story even if you don’t get cross-examination or body language. In the case of the 911 call, one might think that the caller had a strong motive to tell the truth, because if she didn’t, then the police would go looking for the wrong guy, and her abuser would come find her and continue hurting her. Similarly, one might think that the operators had a strong motive to ask fair, non-leading questions, because if they didn’t get the right answer, then the police might show up in the wrong neighborhood or with the wrong expectations, and there could be an unnecessary firefight. Finally, one could argue that a recorded statement made as events were unfolding is inherently more reliable (in some ways) than a narrative given months or years after the event; human memory gets corrupted faster than 8-track tapes.

Some combination of these factors convinced the Court to admit the evidence. Other, very similar cases have been decided differently. Whether they got that particular decision right or wrong, though, the framework of “indicia of reliability” is hard-coded into American evidence law, especially for civil cases. If you want to present evidence to a jury based on a statement that was made outside of court, you have to give at least one reason why the statement is nevertheless reliable.

Double and Triple Hearsay

Here’s where things really get interesting: if your out-of-court statement quotes another out-of-court statement, the evidence is called “double hearsay,” and you need to independently verify each statement. If any link in the chain breaks, the whole document gets excluded. For example, in the case I’m working on now, the defendants want to show the jury a report filled out by California’s Occupational Health and Safety Administration (“OSHA”). The OSHA report is based almost entirely on an accident report form filled out by a private corporation. That report form, in turn, is based almost entirely on an informal interview of the only eyewitness to an accident. So the defendants can use the OSHA report if and only if the OSHA report, the accident report, and the informal interview are all reliable. In symbols: Use ↔ (A ∧ B ∧ C), where A, B, and C are the propositions that the OSHA report, the accident report, and the informal interview, respectively, are reliable.
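The all-or-nothing character of this rule can be sketched in a few lines of Python. This is only an illustration: the function name `admissible` and the reliability rulings in the example are hypothetical, not anything a court has actually decided in this case.

```python
def admissible(link_rulings):
    """Return True only if every link in the hearsay chain was found reliable.

    `link_rulings` maps a description of each out-of-court statement to a
    judge's (hypothetical) yes/no ruling on its reliability.
    """
    return all(link_rulings.values())


# Hypothetical rulings for the three-link chain described above.
chain = {
    "OSHA report (public record exception)": True,
    "accident report form (business record exception)": True,
    "informal eyewitness interview": False,  # assumed ruling, for illustration
}

print(admissible(chain))
```

Under these assumed rulings the single unreliable link is enough to keep the entire OSHA report away from the jury, however solid the other two links are.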

To try to qualify the OSHA report, the defendants are arguing that the OSHA report is reliable under the public record exception to the hearsay rule, meaning that the public officials who prepared it had a stronger interest in accurately reporting public information than they did in the outcome of the accident victim’s private case. To get the accident report form in, the defendants are arguing that it is reliable under the business record exception to the hearsay rule, meaning that the corporate officials who prepared it had a stronger interest in making sure their company had access to accurate information about safety risks than they did in the outcome of any one customer’s lawsuit. As for the informal interview...well, I honestly have no idea how they plan to justify its reliability. But, then again, I’m biased. My professional interest lies in making sure that the whole string of unhelpful quotations stays in a cardboard box in a dank garage, far away from any juries.

Do the Rules Work?

So far, I’ve been pleasantly surprised at how well the American legal system handles some of these challenges. The fact that we have a two-tiered system of evaluating evidence at all is a cut above average—imagine, e.g., the doctor who examines you taking notes on your condition, filtering out any subjective comments you make about how you’re sure it’s just a cold, and reporting only your objective symptoms to a second doctor, who then renders a diagnosis. Or imagine a team of business consultants who interview a Fortune 500 company’s leadership team, and then pass their written notes back to a team at HQ (who has never met the executives) so that HQ can catch any obvious mistakes in reasoning before sending out recommendations. We know, intellectually, that meeting people tends to make us friendlier toward them and more likely to adopt their point of view even if we encounter no Bayesian evidence that increases the plausibility of their opinions, but our institutions rarely take steps to guard against that bias.

I think my biggest criticism of the American evidence code is that it doesn’t account for uncertainty in the model. For instance, if I read the headline on a piece of science journalism saying that (e.g.) coffee consumption reduces the risk of prostate cancer, or that receiving spankings in childhood is negatively correlated with conscientiousness as an adult, there are at least six layers of ‘hearsay’: I might have misunderstood the headline, the headline might have mis-summarized the article, the article might have misquoted the scientist, the scientist might have misinterpreted the recorded data, the recorded data might not faithfully reflect what actually happened during the experiment, and the experiment might not faithfully replicate the real-world conditions that interest us.
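To get a feel for how quickly six layers compound, here is a rough sketch. It assumes, purely for illustration, that each step transmits the information faithfully with some fixed probability and that the steps fail independently of one another; the 90% figure is made up, and the function name `chain_reliability` is hypothetical.

```python
import math


def chain_reliability(step_probabilities):
    """Probability that every step in the chain is faithful.

    Assumes the steps fail independently, which is a simplification.
    """
    return math.prod(step_probabilities)


# Six layers, each assumed (hypothetically) to be 90% reliable.
six_layers = [0.9] * 6
print(round(chain_reliability(six_layers), 3))  # 0.531
```

Even with each individual layer looking fairly trustworthy, the chance that the whole chain held up is barely better than a coin flip under these assumptions.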

Even if I can articulate plausible reasons why each step in the transmission of information was “reliable,” I should be very skeptical that my *model* of the transmission is accurate. I only have to be wrong about one of the six steps for my estimate of the information’s plausibility to be untrustworthy. If the information would only provide a few decibels of evidence even if it were perfectly reliable, then trying to calculate how many points a semi-reliable piece of evidence is worth can fail because of a low signal-to-noise ratio. E.g., suppose I learn that neither the suspect nor the actual criminal was a redhead. I might be absolutely certain of this new piece of information, but that’s still nowhere near enough evidence to support a conviction. If instead I learn that there is probably something like a 60% chance that neither the suspect nor the criminal had red hair, that datum really doesn’t tell me anything at all; the info shouldn’t shift my prior enough for my posterior to be noticeably different.
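As a rough illustration of the signal-to-noise point, the sketch below measures evidence in decibels (10 times the base-10 log of the likelihood ratio) and then discounts it for unreliability. The mixture model, the function names, and all of the numbers are assumptions chosen for illustration: with probability `reliability` the report tracks the facts, and otherwise it is noise that is equally likely whether or not the hypothesis is true.

```python
import math


def decibels(likelihood_ratio):
    """Evidence in decibels: 10 * log10 of the likelihood ratio."""
    return 10 * math.log10(likelihood_ratio)


def discounted_ratio(p_if_true, p_if_false, reliability, p_noise=0.5):
    """Likelihood ratio of a report that is only sometimes faithful.

    Assumed toy model (not part of any evidence code): with probability
    `reliability` the report reflects the facts; otherwise it is noise that
    shows up with probability `p_noise` regardless of the hypothesis.
    """
    numerator = reliability * p_if_true + (1 - reliability) * p_noise
    denominator = reliability * p_if_false + (1 - reliability) * p_noise
    return numerator / denominator


# Hypothetical weak evidence: twice as likely if the hypothesis is true.
full = decibels(discounted_ratio(0.6, 0.3, reliability=1.0))
shaky = decibels(discounted_ratio(0.6, 0.3, reliability=0.6))
print(f"fully reliable: {full:.2f} dB, 60% reliable: {shaky:.2f} dB")
```

In this toy model a datum worth about 3 dB when fully trusted shrinks to under 2 dB once the report is only 60% reliable, and that is before allowing for the possibility that the reliability estimate itself is wrong.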

Although courts are allowed to consider the extent to which an unduly long chain of inferences makes evidence less “trustworthy,” I think that on balance decisions would be more accurate if there were a firm limit—say, three layers—beyond which evidence was simply inadmissible as a matter of law. If A says that B says that C says that D shot someone, then no matter how reliable we think A, B, and C are, we should probably keep that evidence away from the jury unless we can haul at least one of B, C, or D into court to answer cross-examination.