Trust in Bayes

Followup to: Beautiful Probability, Trust in Math

In Trust in Math, I presented an algebraic proof that 1 = 2, which turned out to be—surprise surprise—flawed. Trusting that algebra, correctly used, will not carry you to an absurd result, is not a matter of blind faith. When we see apparent evidence against algebra’s trustworthiness, we should also take into account the massive evidence favoring algebra which we have previously encountered. We should take into account our past experience of seeming contradictions which turned out to be themselves flawed. Based on our inductive faith that we may likely have a similar experience in the future, we look for a flaw in the contrary evidence.

This seems like a dangerous way to think, and it is dangerous, as I noted in “Trust in Math”. But, faced with a proof that 2 = 1, I can’t convince myself that it’s genuinely reasonable to think any other way.

The novice goes astray and says, “The Art failed me.”
The master goes astray and says, “I failed my Art.”

To get yourself to stop saying “The Art failed me”, it’s helpful to know the history of people crying wolf on Bayesian math—to be familiar with seeming paradoxes that have been discovered and refuted. Here an invaluable resource is “Paradoxes of Probability Theory”, Chapter 15 of E. T. Jaynes’s Probability Theory: The Logic of Science (available online).

I’ll illustrate with one of Jaynes’s examples:

If you have a conditional probability distribution P(X|C), the unconditional probability P(X) should be a weighted average of the various P(X|C), and therefore intermediate between the various P(X|C) in value - somewhere between the minimum and the maximum P(X|C). That is: If you flip a coin before rolling a die, and the die is a four-sided die if the coin comes up heads, or ten-sided if the coin comes up tails, then (even without doing an exact calculation) you know that the compound probability of rolling a “1” occupies the range [0.1, 0.25].

Now suppose a two-dimensional array, M cells wide and N cells tall, with positions written (i, j) with i as the horizontal coordinate and j as the vertical coordinate. And suppose a uniform probability distribution over the array: p(i, j) = 1/​MN for all i, j. Finally, let X be the event that i < j. We’ll be asking about P(X).

If we think about just the top row—that is, condition on the information j=N—then the probability of X is p(i < N) = (N − 1)/​M, or 1 if N > M.

If we think about just the bottom row—condition on the information j=1 - then the probability of X is p(i < 1) = 0.

Similarly, if we think about just the rightmost column, condition on i=M, then the probability of X is p(j > M) = (N—M)/​N, or 0 if M > N.

And thinking about the leftmost column, conditioning on i=1, the probability of X is p(j > 1) = (N − 1)/​N.

So for the whole array, the probability of X must be between (N − 1)/​M and 0 (by reasoning about rows) or between (N − 1)/​N and (N—M)/​N (by reasoning about columns).

This is actually correct, so no paradox so far. If the array is 5 x 7, then the probability of X on the top row is 1, the probability of X on the bottom row is 0. The probability of X in the rightmost row is 27, and the probability of X in the leftmost row is 67. The probability of X over the whole array is 47, which obeys both constraints.

But now suppose that the array is infinite. Reasoning about the rows, we see that, for every row, there is a finite number of points where i < j, and an infinite number of points where i >= j. So for every row, the probability of the event X must be 0. Reasoning about the columns, we see in every column a finite number of points where j ⇐ i, and an infinite number of points where i < j. So for every column, the probability of the event X must be 1. This is a paradox, since the compound probability of X must be both a weighted mix of the probability for each row, and a weighted mix of the probability for each column.

If to you this seems like a perfectly reasonable paradox, then you really need to read Jaynes’s “Paradoxes of Probability Theory” all the way through.

In “paradoxes” of algebra, there is always an illegal operation that produces the contradiction. For algebraic paradoxes, the illegal operation is usually a disguised division by zero. For “paradoxes” of probability theory and decision theory, the illegal operation is usually assuming an infinity that has not been obtained as the limit of a finite calculation.

In the case above, the limiting probability of i < j approaches the ratio M /​ N, so just assuming that M and N are “infinite” will naturally produce all sorts of paradoxes in the ratio—it depends on how M and N approach infinity.

It’s all too tempting to just talk about infinities, instead of constructing them. As Jaynes observes, this is a particularly pernicious habit because it may work 95% of the time and then lead you into trouble on the last 5% of occasions—like how the really deadly bugs in a computer program are not those that appear all the time, but those that only appear 1% of the time.

Apparently there was a whole cottage industry in this kind of paradox, where, assuming infinite sets, the marginal probability seemed not to be inside the conditional probabilities of some partition, and this was called “nonconglomerability”. Jaynes again:

“Obviously, nonconglomerability cannot arise from a correct application of the rules of probability on finite sets. It cannot, therefore, occur in an infinite set which is approached as a well-defined limit of a sequence of finite sets. Yet nonconglomerability has become a minor industry, with a large and growing literature. There are writers who believe that it is a real phenomenon, and that they are proving theorems about the circumstances in which it occurs, which are important for the foundations of probability theory.”

We recently ran into a similar problem here on Overcoming Bias: A commenter cited a paper, “An Air-Tight Dutch Book” by Vann McGee, which purports to show that if your utility function is not bounded, then a dutch book can be constructed against you. The paper is gated, but Neel Krishnaswami passed me a copy. A summary of McGee’s argument can also be found in the ungated paper “Bayesianism, infinite decisions, and binding”.

Rephrasing somewhat, McGee’s argument goes as follows. Suppose that you are an expected utility maximizer and your utility function is unbounded in some quantity, such as human lives or proving math theorems. We’ll write $27 to indicate a quantity worth 27 units of utility; by the hypothesis of an unbounded utility function, you can always find some amount of fun that is worth at least 27 units of utility (where the reference unit can be any positive change in the status quo).

Two important notes are that, (1) this does not require your utility function to be linear in anything, just that it grow monotonically and without bound; and (2) your utility function does not have to assign infinite utility to any outcome, just ever-larger finite utilities to ever-larger finite outcomes.

Now for the seeming Dutch Book—a sequence of bets that (McGee argues) a Bayesian will take, but which produce a guaranteed loss.

McGee produces a fair coin, and proceeds to offer us the following bet: We lose $1 (one unit of utility) if the coin comes up “tails” on the first round, and gain $3 (three units of utility) if the coin comes up “heads” on the first round and “tails” on the second round. Otherwise nothing happens—the bet has no payoff. The probability of the first outcome in the bet is 12, so it has an expected payoff of -$.50; and the probability of the second outcome is 14, so it has an expected payoff of +$.75. All other outcomes have no payoff, so the net value is +$0.25. We take the bet.

Now McGee offers us a second bet, which loses $4 if the coin first comes up “tails” on the second round, but pays $9 if the coin first comes up “tails” on the third round, with no consequence otherwise. The probabilities of a fair coin producing a sequence that begins HT or HHT are respectively 14 and 18, so the expected values are -$1.00 and +$1.125. The net expectation is positive, so we take this bet as well.

Then McGee offers us a third bet which loses $10 if the coin first comes up “tails” on the third round, but gains $21 if the coin first comes up “tails” on the fourth round; then a bet which loses $22 if the coin shows “tails” first on round 4 and gains $45 if the coin shows “tails” first on round 5. Etc.

If we accept all these bets together, then we lose $1 no matter when the coin first shows “tails”. So, McGee says, we have accepted a Dutch Book. From which McGee argues that every rational mind must have a finite upper bound on its utility function.

Now, y’know, there’s a number of replies I could give to this. I won’t challenge the possibility of the game, which would be my typical response as an infinite set atheist, because I never actually encounter an infinity. I can imagine living in a universe where McGee actually does have the ability to increase his resources exponentially, and so could actually offer me that series of bets.

But if McGee is allowed to deploy a scenario where the expected value of the infinite sequence does not equal the limit of the expected values of the finite sequences, then why should a good Bayesian’s decision in the infinite case equal the limit of the Bayesian’s decisions in the finite cases?

It’s easy to demonstrate that for every finite N, a Bayesian will accept the first N bets. It’s also easy to demonstrate that for every finite N, accepting the first N bets has a positive expected payoff. The decision in every finite scenario is to accept all N bets—from which you might say that the “limit” decision is to accept all offered bets—and the limit of the expected payoffs of these finite decisions goes to +$.50.

But now McGee wants to talk about the infinite scenario directly, rather than as a limiting strategy that applies to any one of a series of finite scenarios. Jaynes would not let you get away with this at all, but I accept that I might live in an unbounded universe and I might just have to shut up and deal with infinite games. Well, if so, the expected payoff of the infinite scenario does not equal the limit of the expected payoffs of the finite scenarios. One equals -$1, the other equals +$.50.

So there is no particular reason why the rational decision in the infinite scenario should equal the limit of the rational decisions in the finite scenarios, given that the payoff in the infinite scenario does not equal the limit of the payoffs in the finite scenarios.

And from this, McGee wants to deduce that all rational entities must have bounded utility functions? If it turns out that I live in an infinite universe, you can bet that there isn’t any positive real number such that I would decline to have more fun than that.

Arntzenius, Elga, and Hawthorne give a more detailed argument in “Bayesianism, Infinite Decisions, and Binding” that the concept of provable dominance only applies to finite option sets and not infinite option sets. If you show me a compound planning problem with a sub-option X, such that for every possible compound plan I am better off taking X1 than X2, then this shows that a maximal plan must include X1 when the set of possible plans is finite. But when there is no maximal plan, no “optimal” decision—because there are an infinite number of possible plans whose upper bound (if any) isn’t in the set—then proving local dominance obviously can’t show anything about the “optimal” decision. See Arntzenius et. al’s section on “Satan’s Apple” for their full argument.

An even better version of McGee’s scenario, in my opinion, would use a different sequence of bets: -$1 on the first round versus +$6 on the second round; -$6 on the second round and +$20 on the third round; -$20 on the third round and +$56 on the fourth round. Now we’ve picked the sequence so that if you accept all bets up to the Nth bet, your expected value is $N.

So really McGee’s argument can be simplified as follows: Pick any positive integer, and I’ll give you that amount of money. Clearly you shouldn’t pick 1, because 1 is always inferior to 2 and above. Clearly you shouldn’t pick 2, because it’s always inferior to 3 and above. By induction, you shouldn’t pick any number, so you don’t get any money. So (McGee concludes) if you’re really rational, there must be some upper bound on how much you care about anything.

(Actually, McGee’s proposed upper bound doesn’t really solve anything. Once you allow infinite times, you can be put into the same dilemma if I offer you $.50, but then offer to trade $.50 today for $.75 tomorrow, and then, tomorrow, offer to trade $.75 now for $.875 the next day, and so on. Even if my utility function is bounded at $1, this doesn’t save me from problems where the limit of the payoffs of the finite plans doesn’t seem to equal the payoff of the limit of the finite plans. See the comments for further arguments.)

The meta-moral is that Bayesian probability theory and decision theory are math: the formalism provably follows from axioms, and the formalism provably obeys those axioms. When someone shows you a purported paradox of probability theory or decision theory, don’t shrug and say, “Well, I guess 2 = 1 in that case” or “Haha, look how dumb Bayesians are” or “The Art failed me… guess I’ll resort to irrationality.” Look for the division by zero; or the infinity that is assumed rather than being constructed as the limit of a finite operation; or the use of different implicit background knowledge in different parts of the calculation; or the improper prior that is not treated as the limit of a series of proper priors… something illegal.

Trust Bayes. Bayes has earned it.