What makes a probability question "well-defined"? (Part II: Bertrand's Paradox)

(Follow-up to my last post and cross-posted from my new(ish) blog. Sorry for the very long delay; life was crazy.)

I. Introduction

In my last essay, I argued that when we ask for the probability that some event E holds, the very meaning of the question is to ask for our best guess given the information available to us. As such, we can never claim that we have insufficient information to answer a probability question. That is simply the nature of probability. We must give an answer regardless, at least assuming the event E is meaningful at all.

I further argued that when we call a probability question “undefined”, what we really mean is that it’s not clear how to move from the information we do have to a precise numerical answer to the question. But it would be more proper to call such questions computationally difficult, rather than undefined. Let’s apply this reasoning to some more exotic problems, shall we?

II. Bertrand’s Paradox

Bertrand’s Paradox is a famously “undefined” probability problem. The linked video (which I recommend watching) visualizes the problem quite well, so I’ll just briefly describe it in words. Consider choosing a random chord on a circle. What’s the probability that the length of this chord is greater than the side length of an equilateral triangle inscribed in the circle?

To answer this, we need to figure out what we mean by a “random” chord. We haven’t been given any information about the chord’s distribution, but based on what we learned in the last essay, we can use a uniform distribution. The problem lies in what we consider “uniform”. There are many ways to think about this. Three in particular are called out in this paradox.

In the first, we choose two points on the circle uniformly at random, and form their chord. In the second, we choose the midpoint of the chord uniformly at random from within the circle. The third is a bit trickier to state mathematically, but visually is still intuitive. Imagine a circle with a vertical line to the left of it. Gradually shift the line to the right, and consider all the chords it forms as it intersects the circle. We assume each one of these is equally likely. But this only covers vertical chords, so also assume the line is equally likely to approach the circle from any angle. (Mathematically, we say the angle of the chord’s radial line is chosen uniformly at random, and then the chord’s center is also chosen uniformly at random along this line.)

The paradox is that all three of these uniform assignments yield different probability estimates for our question, giving ¹⁄₃, ¹⁄₄, and ¹⁄₂, respectively. Each method sounds reasonable, so what gives? The question is typically considered ill-posed, simply not possessing enough information to yield a definite answer. But that puts us in a bit of a pickle, since I’ve claimed that’s not an option for us.
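If you'd rather see the disagreement than take it on faith, here is a minimal Monte Carlo sketch of the three methods on a unit circle. The function names, sample size, and seed are my own choices for illustration. A chord at distance d from the center has length 2√(1 − d²), and the inscribed triangle's side is √3.

```python
import math
import random

SIDE = math.sqrt(3.0)  # side of the equilateral triangle inscribed in a unit circle


def chord_from_endpoints(rng):
    """Method 1: two endpoints chosen uniformly on the circumference."""
    a = rng.uniform(0.0, 2.0 * math.pi)
    b = rng.uniform(0.0, 2.0 * math.pi)
    return 2.0 * abs(math.sin((a - b) / 2.0))


def chord_from_midpoint(rng):
    """Method 2: midpoint chosen uniformly over the area of the disk."""
    d = math.sqrt(rng.random())  # distance from the center of a uniform point in the disk
    return 2.0 * math.sqrt(1.0 - d * d)


def chord_from_radius(rng):
    """Method 3: midpoint chosen uniformly along a uniformly random radius."""
    d = rng.random()  # the angle of the radius does not affect the chord length
    return 2.0 * math.sqrt(1.0 - d * d)


def estimate(method, n=200_000, seed=0):
    """Fraction of sampled chords longer than the triangle's side."""
    rng = random.Random(seed)
    return sum(method(rng) > SIDE for _ in range(n)) / n


if __name__ == "__main__":
    for method in (chord_from_endpoints, chord_from_midpoint, chord_from_radius):
        print(method.__name__, round(estimate(method), 3))
```

With enough samples the three estimates should settle near ¹⁄₃, ¹⁄₄, and ¹⁄₂, up to sampling noise.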

The follow-up video lays out a potential solution from the book Probability Theory: The Logic of Science by the late, great Bayesian master E. T. Jaynes (from whom I ripped most of the last essay, and this one as well), although the video does not quite do him justice. His full solution is rather technical, but I’ll try to convey the principles behind it. The basic idea is that if two observers are given the same problem and have all the same information, they ought to give the same answer for their probability estimates. If you believe this, then the question as stated has the unambiguous answer of ¹⁄₂.

III. Metalevel principle of indifference

The principle of indifference typically states that, in the absence of any other relevant information, we should assign equal probabilities to all possible events under consideration, just as we have been arguing. But Jaynes suggests that we should view this principle on a higher level; it’s dangerous to apply at the level of events, since it leaves too much room for arbitrary intuition. If we choose a totally random chord, is it really true that each pair of endpoints is as likely as any other? Is it really true that each midpoint is as likely as any other? Is it really true that each radial line is as likely as any other? Modeling the question this way, it’s up to us to decide which individual events we find equally plausible, but who’s to say that our intuition is a good guide?

Instead, Jaynes proposes that we apply the principle of indifference at the level of problems. If two observers are given the same problem, they ought to give the same answer. And then, as Jaynes says, “Every circumstance left unspecified in the statement of a problem defines an invariance property which the solution must have if there is to be any definite solution at all.” Let’s show what this means in a simpler case.

Consider again a die with six faces. We have no information about the relative frequency of each face. As Richard Hamming says in The Art of Probability:

[We] see that the six possible faces are all the same, that they are equivalent, that they are symmetric, that they are interchangeable… If you do not make equal probability assignments to the interchangeable outcomes then you will be embarrassed by interchanging a pair of the non-equal values. Since the six faces of an ideal die are all symmetric… then each face must have the probability of 1/6… When we say symmetric, equivalent, or interchangeable, is it based on perfect knowledge of all the relevant facts, or is it merely saying that we are not aware of any lack thereof? The first case being clearly impossible, we must settle for the second… We will always have to admit that if we examined things more closely we might find reason to doubt the assumed symmetry.

Pay particular attention to: “If you do not make equal probability assignments to the interchangeable outcomes then you will be embarrassed by interchanging a pair of the non-equal values”. Let’s say we tried to separate out our events into {1} and {2, 3, 4, 5, 6} (i.e. we either roll a 1, or something higher). Can we then assign equal probabilities to each event, so that there’s a 50% chance of rolling a 1? Imagine we instead separated the problem into the events {2} and {1, 3, 4, 5, 6} (i.e. we either roll a 2 or we don’t). Each value from 1 to 6 is symmetric, since our information does not distinguish between them. As such, these two problems are identical, and by Jaynes’s extended principle of indifference, we must give the same answer in each case. So we would also have to say 2 has a 50% chance, and 3, and so on. But this is impossible, since the six probabilities would then sum to 3; the only consistent assignment is ¹⁄₆ for each.

So while there might in actuality be some difference between the faces, they are not specified to us, which defines a certain invariance property (i.e. if we swap faces, our answers shouldn’t change), which lets us solve the problem. Jaynes does something very much like this for Bertrand’s Paradox. In particular, notice that we did not specify the size of the circle, nor where in the plane it lies. So if the question is to have an answer at all, it must be invariant to these properties. Or to put it another way, if you posed the same question to two different observers, who were looking at two different-sized circles at different locations, they ought to give the same answer, since they have been given all the same information. This invariance is enough to answer the question.

Jaynes shows that if we try to use either of the first two distributions, and we then apply a change of scale or a translation in space, then our probability for each chord changes; we are “embarrassed” in a similar way to Hamming’s example. Only the third distribution gives a consistent answer, which is ¹⁄₂. (In fact, we only need translational invariance to show this.)
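Here is a rough numerical version of that "embarrassment". It is my own sketch in the spirit of the argument, not Jaynes's derivation. I generate chords of the unit circle by each of the three recipes, treat each chord as a full line, and then ask the same question of a second observer watching a smaller, displaced circle. The small circle's center (0.3, 0.2) and radius 0.4 are arbitrary values I picked so that it sits inside the unit circle. Each line is stored as an angle phi and a signed offset p, meaning the points with x·cos(phi) + y·sin(phi) = p.

```python
import math
import random


def line_endpoints(rng):
    """Recipe 1: line through two uniform points on the unit circle."""
    a = rng.uniform(0.0, 2.0 * math.pi)
    b = rng.uniform(0.0, 2.0 * math.pi)
    return (a + b) / 2.0, math.cos((a - b) / 2.0)


def line_midpoint(rng):
    """Recipe 2: chord midpoint uniform over the area of the unit disk."""
    return rng.uniform(0.0, 2.0 * math.pi), math.sqrt(rng.random())


def line_radius(rng):
    """Recipe 3: chord midpoint uniform along a uniformly random radius."""
    return rng.uniform(0.0, 2.0 * math.pi), rng.random()


def long_chord_fraction(sampler, center, radius, n=200_000, seed=0):
    """Among sampled lines that cut the given circle, the fraction whose
    chord beats that circle's inscribed-triangle side (distance < radius / 2)."""
    rng = random.Random(seed)
    cx, cy = center
    hits = long_hits = 0
    for _ in range(n):
        phi, p = sampler(rng)
        dist = abs(cx * math.cos(phi) + cy * math.sin(phi) - p)
        if dist < radius:
            hits += 1
            if dist < radius / 2.0:
                long_hits += 1
    return long_hits / hits


if __name__ == "__main__":
    for sampler in (line_endpoints, line_midpoint, line_radius):
        big = long_chord_fraction(sampler, (0.0, 0.0), 1.0)
        small = long_chord_fraction(sampler, (0.3, 0.2), 0.4)
        print(sampler.__name__, round(big, 3), round(small, 3))
```

For the third recipe, the two observers should agree on roughly ¹⁄₂. For the first two, the observer at the small circle should get a noticeably different number from the one at the unit circle, at least for this particular choice of small circle.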

IV. What does this tell us about frequency?

Now it’s very important to make one thing clear: we are not assuming the “true” distribution of chords is actually translationally invariant. It might very well be that the distribution changes as we move through the plane. Rather, we are saying that this is the distribution that correctly encodes our ignorance in the problem. This is a common sticking point in the frequentist mindset: “Oh, but you’re just assuming that’s the true distribution, and that’s unjustified!”. But that’s really not what’s happening.

The whole idea of a true, fixed frequency distribution is a strictly frequentist concept in the first place! The confusion comes from a lack of perspective-taking, in which one is unable to see that a fundamentally different claim is being made with this distribution. The only assumption that is occurring is that two observers given the same information ought to give the same answer, a charge which I am happy to accept.

Yet, as Jaynes says:

Nevertheless, we are entitled to claim a definite frequency correspondence… For there is one ‘objective fact’ which has been proved by the above derivation: any rain of straws which does not produce a frequency distribution agreeing with [the result] will necessarily produce different distributions on different circles. This is all we need in order to predict with confidence that the distribution will be observed in any experiment where the ‘region of uncertainty’ is large compared with the circle. For, if we lack the skill to toss straws so that, with certainty, they intersect a given circle, then surely we lack a fortiori the skill consistently to produce different distributions on different circles within this region of uncertainty!

This is to say, of course in principle any distribution could occur if we finely control the chord-generating process. But what if we have little control? If we imagine ourselves trying to toss lines (e.g. straws), there is some “region of uncertainty”; based on our limited skill, we can reasonably guarantee that the straws fall within this region, but we have little control beyond that. If this region is large compared with the circle, then we have little control over where our chord falls on the circle, and we cannot purposely control the distribution. This seems like a quite reasonable model for random chord generation.

But if this is the case, then we also cannot produce different distributions for different circles in the region. Hence, we have translational invariance, and we get a definite physical prediction of ¹⁄₂. Of course, if we do have some control over the process, then we can use our knowledge of that control to form a better estimate. This estimate only tells us what to believe in the case of absolute ignorance.
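To make the "rain of straws" concrete, here is a toy simulation under assumptions that are mine, not Jaynes's. Each straw is treated as an infinite line, its position is a point dropped uniformly in a large square (the region of uncertainty, with a hypothetical half-width of 20), and its direction is uniform. The two circles are arbitrary examples that are small compared with the square.

```python
import math
import random


def straw_rain(circles, half_width=20.0, n=300_000, seed=0):
    """For each (cx, cy, r) circle, return the fraction of straws crossing it
    whose chord is longer than that circle's inscribed-triangle side."""
    rng = random.Random(seed)
    hits = [0] * len(circles)
    long_hits = [0] * len(circles)
    for _ in range(n):
        # A point on the straw, uniform over the region of uncertainty.
        x = rng.uniform(-half_width, half_width)
        y = rng.uniform(-half_width, half_width)
        # Direction of the straw, uniform over all orientations.
        theta = rng.uniform(0.0, math.pi)
        s, c = math.sin(theta), math.cos(theta)
        for i, (cx, cy, r) in enumerate(circles):
            # Perpendicular distance from the circle's center to the straw.
            dist = abs((cx - x) * s - (cy - y) * c)
            if dist < r:
                hits[i] += 1
                if dist < r / 2.0:
                    long_hits[i] += 1
    return [lh / h for lh, h in zip(long_hits, hits)]


if __name__ == "__main__":
    # Two hypothetical circles of different size and position.
    print(straw_rain([(0.0, 0.0, 1.0), (5.0, -3.0, 2.0)]))
```

Both fractions should land close to ¹⁄₂. The match is only approximate, because the square is finite, and it should improve as the square grows relative to the circles.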

V. Buckets and brain scans

In the two videos linked before, the presenter (Grant, of 3Blue1Brown fame) largely comes to the standard conclusion, that the problem is ill-posed. Toward the end of the second video, the interviewer (Brady) insists that the question must have an answer. Grant proposes the following thought experiment: Imagine we have a bucket of dice, all of different shapes and sizes, and we pull one out at random and roll it. What’s the probability we get a five? (I’ll actually consider the question “What’s the mean?”, for convenience.) If we have no idea what dice are in the bucket, should this have a definite answer? It doesn’t seem so. He claims we are in a similar situation with respect to Bertrand’s Paradox, since we don’t know how we’re generating chords.

What can we say in our framework? Here are a few intuition pumps. Our core philosophical argument still seems to apply: we don’t have perfect information, but all that probability means is our best judgement under imperfect information. And as always, even if we did know the distribution of dice, we wouldn’t know, say, which dice are near the top, which is clearly a relevant factor, so we could still say the problem is underdetermined, etc. But all we’re asking for is the best guess given your ignorance, so it doesn’t seem like we can refuse to answer.

Or consider Eliezer Yudkowsky’s admonitions in “I don’t know”. We can’t actually say we have zero knowledge on this question. We know that four is a better guess for the mean than one or a trillion. And there’s always the classic, which Brady uses in the second video: “What if I burn your house down if you refuse to answer?”. Or in this case, we could ask you to guess the mean, then choose a die and roll it, and torture you for a number of minutes equal to the square of your error (which incentivizes you to give your honest best guess of the mean), and it’ll be even worse if you refuse. Then you are forced to give some answer, and you know some answers are better than others, and it’s in your interest to give your honest best guess…
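Why does a squared-error penalty reward reporting the mean? A tiny exact check may help. As a stand-in for whatever you believe about the bucket, I assume a single fair six-sided die, which is purely a hypothetical example, and compare the expected penalty across a grid of candidate guesses.

```python
from fractions import Fraction


def expected_squared_error(guess, outcomes):
    """Expected (outcome - guess)^2 when every outcome is equally likely."""
    return sum((Fraction(o) - guess) ** 2 for o in outcomes) / len(outcomes)


# Hypothetical belief: one fair six-sided die.
faces = [1, 2, 3, 4, 5, 6]
mean = Fraction(sum(faces), len(faces))

# Candidate guesses 1, 1.5, 2, ..., 6 as exact fractions.
guesses = [Fraction(k, 2) for k in range(2, 13)]
best = min(guesses, key=lambda g: expected_squared_error(g, faces))

if __name__ == "__main__":
    print("mean:", mean, "best guess on the grid:", best)
```

The guess with the smallest expected penalty is the mean of the distribution you believe in. That holds in general, because the expected squared error equals the variance plus the squared gap between your guess and the mean.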

In this case, you clearly have some implicit knowledge, but it’s very hard to turn it into a definite numerical answer. This is much like the problem from my last essay, where we tried to estimate the distribution for a die with a mean of 4.5. It’s very hard to convert the information we have into a definite answer, and so we feel intuitively that the question is ill-posed.

But here’s a final intuition pump. Suppose we agree that if I told you the distribution of dice, then the problem would be well-defined. But what if I told you the distribution via a coded message? Then technically you would have all the relevant information, and the problem would be perfectly well-defined, but it might be computationally difficult to answer. Going even more extreme, what if I know the distribution, and I give you a full scan of my brain? Then all the information is at your disposal: if you were a god, you could read off the distribution from this scan. But as a human, you have no idea how to go from this brain scan to a numerical answer.

This, I propose, is exactly the situation you are in with respect to your own brain. There is some best guess given everything you know, but actually extracting it is just totally intractable.

VI. As with all interesting ideas, Eliezer said it first

In Eliezer’s piece When (Not) To Use Probabilities, he writes:

If P != NP and the universe has no source of exponential computing power, then there are evidential updates too difficult for even a superintelligence to compute—even though the probabilities would be quite well-defined, if we could afford to calculate them.

From this he concludes that it is not always wise to use explicit probabilities in your reasoning. There are times when your information is so far from being numerical that trying to turn it into numbers is tantamount to trying to read a probability distribution off a brain scan. In situations like this, making up a number can actually be damaging. There is no way your number can actually be the correct computation, so it’ll really just be a pure fabrication. You may be better off going with your gut.

It’s still true that the degree to which your behavior is not consistent with some probability distribution is the degree to which you will step on your own toes and behave inconsistently. But you may not be able to improve this situation by making up probabilities. If you are forced to take some action, you will simply have to do your best, and there is no guarantee that your best will actually be the correct answer. Such is life.

I’ll leave you with Eliezer’s recommended strategy in Harry Potter and the Methods of Rationality:

One version of the process was to tally hypotheses and list out evidence, make up all the numbers, do the calculation, and then throw out the final answer and go with your brain’s gut feeling after you’d forced it to really weigh everything.
