Bayes’ Law is About Multiple Hypothesis Testing

I’ve called outside view the main debiasing technique, and I somewhat stand by that, not only because base-rate neglect can account for a variety of other biases, but also because outside view is about working on the policy level, which you have to do to implement other debiasing strategies.

Nonetheless, I am here today to tell you why the Method of Multiple Working Hypotheses is a central technique. T. C. Chamberlin wrote about it in 1897. More recently, Heuer discusses a very similar technique in Psychology of Intelligence Analysis, which served for a time as the debiasing handbook for the CIA. Heuer called his version Analysis of Competing Hypotheses.

(So, we could call it Method of Multiple Hypotheses (MMH), Analysis of Competing Hypotheses (ACH), or perhaps Analysis of Alternative Hypotheses (AAH); it seems doomed to be abbreviated as some variety of grunt.)

Heuer found that asking people to articulate the assumptions behind their assertions did not work very well—analysts tend to insist that their conclusions follow directly from looking at the data, with no assumptions in between. (It is difficult to see the lens which you use to see!) However, if you instead ask people to compare their conclusions to other possibilities, they start noticing the assumptions which pointed them in one direction rather than another.

In order to make it stick in people’s heads, I want to explain why it is just about inevitable from Bayes’ Law.

Bayes’ Law compares hypotheses to each other in terms of their likelihood ratios, balanced by the priors. Testing a single hypothesis feels meaningful, perhaps because in logical/deterministic cases we sometimes can prove or disprove something on its own. In the general case, though, we have to compare a hypothesis to alternatives to say anything meaningful. It’s much like trying to evaluate a plan in isolation: you can figure out a probability of success, or an expected value, but this is meaningless in isolation. You need to compare it to alternatives to know anything about whether you want to enact the plan. And, not just any alternatives; the best alternatives you can come up with.

Similarly, it only makes sense to evaluate hypotheses by looking at their relative likelihoods in comparison to a number of other hypotheses, and relative prior probabilities.

This is the thing which null hypothesis testing is sweeping under the rug. Null hypothesis testing attempts to fake testing a single hypothesis in isolation by comparing it to a “null” hypothesis which is taken to be the default thing we would believe. This often makes enough sense to not be glaringly terrible, but misrepresents the epistemics. There should not be special hypotheses which we consider “default”.

A common way of writing Bayes’ Law makes it look as if you can judge probability in isolation:

$$P(h|e) = \frac{P(e|h) \, P(h)}{P(e)}$$

Variables ‘h’ and ‘e’ here are supposed to remind us of ‘hypothesis’ and ‘evidence’. It looks like we’re able to evaluate hypothesis $h$ on its own merits. However, another common statement of the law shows some of the complexity by expanding out the denominator:

$$P(h|e) = \frac{P(e|h) \, P(h)}{P(e|h) \, P(h) + P(e|\neg h) \, P(\neg h)}$$

In words: we can judge hypotheses in isolation by multiplying their prior probability $P(h)$ by their likelihood $P(e|h)$. We could call $P(e|h) \, P(h)$ the “goodness” of $h$. This doesn’t give a number which sums to one, though; we have to normalize, by dividing the “goodness” of each hypothesis by the total “goodness” of all hypotheses. The resulting number is between 0 and 1, so it can be a probability; indeed, it is the posterior probability.

However, note that the revised formulation represents alternatives to $h$ simply by the negation, $\neg h$. This is still hiding a lot of complexity. How do we compute the “goodness” of not-$h$? In simple situations, this might be clear. But, to my mind, this invites the same sort of mistakes which can be made in null hypothesis testing: testing against a straw “default” hypothesis, rather than against the strongest alternative hypotheses you can think of.

Yet another common form of Bayes’ Law unpacks this simplification. We consider a family of hypotheses, $h_i$:

$$P(h_i|e) = \frac{P(e|h_i) \, P(h_i)}{\sum_j P(e|h_j) \, P(h_j)}$$

Now we’ve got it: we see the need to enumerate every hypothesis we can in order to test even one hypothesis properly. The previous use of $\neg h$ was just hiding “all the other hypotheses”, and the original denominator, $P(e)$, hid it further still.
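To make the normalization concrete, here is a minimal Python sketch. The function name `posterior`, the hypothesis labels, and every number are made up for illustration. It computes the posterior of the same hypothesis twice: once against a weak "default" alternative only, and once after a strong rival is added to the family.

```python
def posterior(priors, likelihoods):
    """Normalize prior * likelihood ("goodness") over a family of hypotheses."""
    goodness = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(goodness.values())  # the denominator P(e), summed over the family
    return {h: g / total for h, g in goodness.items()}

# Hypothetical numbers, chosen only for illustration.
# Case 1: h is compared against a single weak "default" alternative.
straw = posterior(
    priors={"h": 0.5, "default": 0.5},
    likelihoods={"h": 0.8, "default": 0.1},
)

# Case 2: same prior and likelihood for h, but half of the remaining
# prior mass now goes to a rival that explains the evidence well.
full = posterior(
    priors={"h": 0.5, "default": 0.25, "rival": 0.25},
    likelihoods={"h": 0.8, "default": 0.1, "rival": 0.9},
)

print(round(straw["h"], 3))  # 0.889
print(round(full["h"], 3))   # 0.615
```

Nothing about $h$ itself changed between the two cases; only the set of alternatives did. Under these assumed numbers, that alone moves the posterior from about 0.89 to about 0.62.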

It’s like… optimizing is always about evaluating more and more alternatives so that you can find better and better things. Optimizing for accurate beliefs is different only in that you want to weigh your several options together, rather than keeping only the best one at the end. But, still, how can you expect to find good hypotheses if you’re not generating as many as you can and allowing them to compete on the data?

Heuer tries to get people to do this by telling them to make a grid, with all the hypotheses written on the top and all the significant pieces of evidence written on the side. Rather than figuring out exact likelihood ratios, you can write “+” or “-” to indicate very roughly how well hypotheses match up to evidence:

In fleshing out this fake example, it occurred to me that I had to also include “no data breach” to be able to examine the evidence in favor of the breach. Really, it should be split into more hypotheses (which might give alternative explanations of why Victor knew too much). As we see, the evidence in favor of the breach is actually not as strong as one might think, given the priors against it and the lack of evidence in favor of any particular type of breach. (However, we can also see how rough and potentially misleading simply writing plusses and minuses can be!)

This seems better than nothing, but I can see several problems with it:

  • It is easy to forget the “prior”—I had to lump it in with evidence. In fact, I think Heuer doesn’t put the prior in at all.

  • The chart format makes you think of “compatibility” between hypothesis and evidence in a fairly symmetric way; it doesn’t jump out at you that you’re supposed to be writing $P(e|h)$ rather than $P(h|e)$.
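One way to address both problems is to give the prior its own entry and to put a rough $P(e|h)$ in each cell instead of a plus or minus. The sketch below is my own variation, not Heuer’s procedure. The labels `H1`..`H3`, `E1`, `E2` and all numbers are hypothetical, and it assumes the pieces of evidence are conditionally independent given each hypothesis, which will often be only approximately true.

```python
from math import prod

# Hypothetical priors over three competing hypotheses (must sum to 1).
priors = {"H1": 0.6, "H2": 0.3, "H3": 0.1}

# One row per piece of evidence; each cell is a rough P(e | h),
# read as "how expected is this evidence if the hypothesis were true?"
grid = {
    "E1": {"H1": 0.2, "H2": 0.7, "H3": 0.9},
    "E2": {"H1": 0.5, "H2": 0.5, "H3": 0.1},
}

def grid_posterior(priors, grid):
    """Multiply each prior by its column of likelihoods, then normalize."""
    # Assumption: evidence rows are conditionally independent given h.
    goodness = {
        h: priors[h] * prod(row[h] for row in grid.values())
        for h in priors
    }
    total = sum(goodness.values())
    return {h: g / total for h, g in goodness.items()}

post = grid_posterior(priors, grid)
for h in priors:
    print(h, round(post[h], 3))
```

With these made-up numbers, `H2` ends up ahead of `H1` despite the lower prior. Note also that `E2` does nothing to separate `H1` from `H2`, since both columns have 0.5 in that row; it only counts against `H3`.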

In any case, I think the cognitive gear which Heuer and Chamberlin are pointing at is very important. It is more precise than the common pattern “try very hard to falsify your hypothesis” (though that mental movement may still prove useful), because it isn’t obvious how to try to falsify a hypothesis; coming up with good alternative hypotheses is a necessary step.

When I first read about Heuer’s ACH method, I remember having thoughts along the lines of “this debiases in a lot of different ways!”, but I can no longer recall the biases I thought it covered. Fortunately, cousin_it has recently been thinking about it, and made his own attempt to list implications, which I’ll quote in full:

T.C. Chamberlin’s “Method of Multiple Working Hypotheses”, as discussed by Abram here, is pretty much a summary of LW epistemic rationality. The idea is that you should look at your data, your hypothesis, and the next best hypothesis that fits the data. Some applications:
  • Wason 2-4-6 task: if you receive information that 1-2-3 is okay and 2-4-6 is okay while 3-2-1 isn’t, and your hypothesis is that increasing arithmetic progressions are okay, the next best hypothesis for the same data is that all increasing sequences are okay. That suggests the next experiment to try.

  • Hermione and Harry with the soda: if the soda vanishes when spilled on the robes, and your hypothesis is that the robes are magical, the next best hypothesis is that the soda is magical. That suggests the next experiment to try.

  • Einstein’s arrogance: if you have a hypothesis and you’ve tried many next best hypotheses on the same data, you can be arrogant before seeing new data.

  • Witch trials: if the witch is scared of your questioning, and your hypothesis is that she’s scared because she’s guilty, the next best hypothesis is that she’s scared of being killed. If your data doesn’t favor one over the other, you have no business thinking about such things.

  • Mysterious answers: if you don’t know anything about science, and your hypothesis is that sugar is sweet because its molecule is triangular, the next best hypothesis is that the molecule is square shaped. If your data doesn’t favor one over the other, you have no business thinking about such things.

  • Religion: if you don’t see any miracles, and your hypothesis is that God is hiding, the next best hypothesis is that God doesn’t exist.
And so on. It’s interesting how many ideas this covers.

The way this has entered into my personal thought patterns is: when I’ve come to some solid-seeming conclusion (in my own thoughts or in discussion), make it a principle to list alternatives (until the point where it has more cost than expected benefit). I think this has saved me a month or two of wasted effort on one occasion (though it is possible I would have noticed the problem sooner than that by some other means).

Happy debiasing!