*(Already know what Laplace’s rule of succession is and just want the proof? Scroll down.)*

Suppose I give you a bag of marbles and tell you that all the marbles are either green or black. You repeatedly reach in, pick out a random marble, and then put it back. You find that out of 100 draws, 70 of the marbles you took out of the bag were green. What’s the probability that the next marble you’ll draw will be green? Or, to put it another way, what’s your best guess (expected value) of the fraction of marbles in the bag that are green?

A decently sensible answer is: 70%. After all, 70 of the 100 marbles you’ve seen were green, and you don’t really have more information than that.

Okay, fair enough. But let’s back up a bit. Suppose that you draw *just one* marble from the bag, and it’s green. What’s the probability that the next marble will be green too? The same reasoning as before would make you say 100%, but that doesn’t make sense. It’s not like you know, after just that one draw, that all the marbles in the bag are for sure green.

Really, I haven’t given you enough information answer the question. What you need is some sort of *prior information* about the marbles in the bag. For example, here are two different assumptions you could make that lead to wildly different answers:

The marbles in the bag are either all green or all black. (In this case, 100% is the right answer.)

The marbles in the bag are chosen to be all green, all black, or

^{50}⁄_{50}, each with probability^{1}⁄_{3}. (Under this assumption, the right answer turns out to be^{5}⁄_{6}.)

Okay, but my question still stands. Suppose that, *in real life*, I give you a bag of marbles, tell you that they’re all green or black, let you drawn one marble (it turns out green), and you have to guess the probability that the next marble you draw is green. What should you guess?

One reasonable assumption you could make is that, when filling the bag with marbles, I chose the fraction of green marbles uniformly at random. That is, I picked a random number between 0% and 100% and decided to make that fraction of all the marbles green. So your prior over the fraction of marbles that are green is uniformly distributed between 0 and 1. Another way to think about this assumption is that it’s the simplest one you could make: you don’t know what your prior distribution should be, but this seems like the simplest one.^{1}

This assumption, called the *uniform prior*, gives pretty reasonable answers! Before drawing any marbles, what’s your guess at the probability that the next marble you draw will be green? Under the uniform prior, the answer is ^{1}⁄_{2}.

What if you draw one marble and it’s green? After getting this information, you Bayes update on your uniform prior, and your posterior distribution is one whose density at probability p is proportional to p. For example, because you drew a green marble, it’s now 9 times as likely that around 90% of the marbles are green than that around 10% of the marbles are green. From here, computing the expected fraction of green marbles in the bag amounts to doing an integral, and if you do this you’ll get ^{2}⁄_{3}. A very sensible answer—much more sensible than the naïve answer of 1.

What about in general? Say you start with a uniform prior over the fraction of green marbles in the bag, and then draw g green marbles out of n total. What’s your posterior probability that the next marble you draw will be green?

The answer turns out to be: (g+1)/(n+2). So at the beginning it’s ^{1}⁄_{2}. Then if you draw a green marble it goes up to ^{2}⁄_{3} (as we just calculated); and symmetrically, if you had drawn a black marble, it would have gone down to ^{1}⁄_{3}. If you draw two green marbles, it goes up to ^{3}⁄_{4}; and as you’d expect, if you drawn one of each color, you’re back to ^{1}⁄_{2}. What about the example we started with, where you drew 70 green marbles out of 100 total? You get ^{71}⁄_{102}, which is around 69.6%. It makes sense that this is close to 70% -- after all, you now have a lot of information about the marbles in the bag! And it also makes sense that it’s a little less than 70%: this reflects the fact that you started out with a uniform prior centered around 50%.

This formula—the (g+1)/(n+2) -- is called **Laplace’s rule of succession**. This rule is pretty useful: for example, forecasters often use this rule when predicting events with small historical sample sizes. It also has a nice intuitive interpretation: **Before you start, pretend you’ve already drawn one black marble and one green marble. Then, when asked about the probability that the next marble will be green, report the historical rate of drawing green marbles (including these two pretend marbles).**

This is an awfully simple result—one that ought to have a really slick proof. And yet, if you google for a proof of Laplace’s rule of succession, you’ll basically just find proofs that do out the integral (sort of like the one we did earlier, but more generalized and therefore more complicated). These proofs don’t tell a nice story; they don’t tell you what’s actually going on.

I’ve wanted a nice proof of the rule of succession for a while now, and the other day I came up with one. The proof even gives an intuition for why the “two extra marbles” interpretation of the rule makes sense!

**Proof of Laplace’s rule of succession**

Let’s state the problem formally, with a coin instead of marbles. Say that you have a coin, and the probability that the coin comes up heads is some p drawn uniformly between 0 and 1 (we will call p the coin’s *headiness*^{2}). You see n flips of the coin, of which h are heads. We want to prove that, given this information, the probability that the next coin will come up heads is (h+1)/(n+2). (Flipping heads corresponds to drawing a green marble in our earlier discussion.)

I encourage you to think about how you might go about proving this fact before reading on—it might make you appreciate the proof more. Or if you want, keep going. I’ll first give the proof—it’s really quick—and then explain how I came up with it.

Consider the following way to simulate the process of flipping a coin n times with headiness drawn uniformly between 0 and 1. First, pick n + 2 points uniformly at random on the circumference of a circle. Pick one of those points at random and call it Z. Pick a different one at random and call it P. Color green the part of the circle which is between Z and P, going clockwise from Z. Now, each of the remaining points corresponds to a flip of the coin: if it’s in the green region of the circle, we say that the coin came up *heads*; otherwise the coin came up *tails*.

*9 random points on a circle (so n = 7), with one randomly labeled Z and another randomly labeled P. In this example, P is 25% of the way around the circle clockwise from Z. The two points in the green region correspond to “heads” flips, and the other five correspond to “tails” flips.*

It’s pretty easy to see that this sampling procedure is the same as choosing the coin’s headiness at random between 0 and 1 and flipping the coin n times. The headiness of the coin corresponds to the clockwise distance from Z to P, i.e. the fraction of the circle that is green. In the example above, the green region is 25% of the whole circle, so each time the coin is flipped there’s a 25% chance that it will land heads-up.

But, regardless of what’s going on inside the random number generator, all *you* get to know are the coin flips. So, you don’t know where Z and P are; all you know is: of the n points besides Z and P, exactly h of them were clockwise on the way from Z to P (in the green region). These h points split up the green region into h + 1 circular arcs. So, another way to summarize the information you have is: the green region consists of h + 1 arcs, out of n + 2 arcs total around the circle.

So from your perspective, what’s the probability that the *next* coin will come up heads? The answer is: exactly (h+1)/(n+2)! That’s because, given what you know, the next randomly chosen point from the circle is equally likely to lie on each of the n + 2 arcs. And the flip outcome will be heads if it lands on one of h + 1 of those arcs (the green ones). That’s it—we’re done!

Notice, by the way, that this proof nicely explains the formulation of the rule of succession in terms of adding two extra flips (one heads and one tails) and then taking the historical rate of heads. That’s because we can identify each arc with its counterclockwise-most endpoint (e.g. the first green arc is identified with Z, and the first black arc is identified with P). If the newly selected point lands in an arc identified with one of the previous heads flips, *or with Z*, it’s in a green arc and so the resulting flip will be heads. If it lands in an arc identified with one of the previous tails flips, *or with P*, it’s in a black arc and so the resulting flip will be tails. In this way, Z is really an honorary heads flip (or green marble, with our earlier formulation) and P is an honorary tails flip (black marble)! *[Thanks to Adam Hesterberg for pointing this out!]*

I’ll talk a bit about how I came up with this proof. The main insight was to stop thinking of the coin flips as binary random variables, and instead think of them as being random numbers between 0 and 1, where all you get to find out about them is whether they’re bigger or smaller than p. I’m not sure what inspired this thought, but possibly computer programming: a common way to generate a random coin flip is by picking a random number between 0 and 1 and then rounding.

I then realized that, just like the other n numbers, the headiness p was *also* selected uniformly at random between 0 and 1. This is nice because it means that we can think of p as just one of n + 1 random numbers. In particular, the problem has now become: “given n + 1 random numbers between 0 and 1, what is the expected value of the (h + 1)-th smallest one?” That’s because p is larger than h of the other n random numbers, and that’s all we know about it.

This felt like a much more tractable problem. I now had an intuition for the answer (h+1)/(n+2). The intuition came from the idea that the n + 1 numbers ought to be—on average—equally spaced on the interval [0, 1]. So for example, if there were two numbers (n = 1), the smaller one would be ^{1}⁄_{3} on average and the larger one would be ^{2}⁄_{3} on average.

But I wasn’t really sure how to make this intuition rigorous. Because while it’s intuitive that the gaps between consecutive numbers should probably be equal, it isn’t *obvious*. Then I came up with the idea of putting the numbers on a circle, where “how large the random number was” became “how far around the circle the point was”. Unlike intervals, circles are nice and symmetric; for a circle, it *is* clear that if you randomly choose k points then for a randomly selected point, the point nearest it going clockwise will on average be 1/k fraction of the way around the circle!

This was basically the end of the proof. The final step was adding the point Z (which stands for “zero”) as a *random* point on the circle, rather than a fixed point (at the top of the circle, say). I’m not sure if this step was necessary, but it resolved a confusion I was having.

Here was the confusion: let’s say we just call the top of the circle “zero”. Now we’re going to put n + 1 points (the n flips and the value p) on the circle. Since there are n + 1 random points, it seems like the expected distance between two consecutive ones ought to be 1/(n + 1) fraction of the way around the circle—including the distance from the last point clockwise to the first point. If that’s right, then the expected value of the distance from zero to the first point is 1/(2(n+1)) of the way around the circle (half the expected distance from the last point to the first point); to the second point it’s and so on. But this would imply that if you flip n coins and they’re all tails, the expected value of p is 1/(2(n + 1)) -- which is not what the rule of succession says!

So, what’s going on—what’s wrong with the previous paragraph—that caused me to add “zero” as a random point of the circle instead of a fixed one? I’ll leave that question for you to think about! *[Edit: Stumped? See **this comment**!]*

^{1. If it bothers you that this prior takes on irrational values, think of the bag as being really big; or think instead of being given a coin that lands heads with some unknown probability between 0 and 1.}

^{2. The technical term for this is “bias”, but I don’t like that because it doesn’t make sense to me that a coin with bias 0 always comes up tails (I feel like “bias” should refer to how far the coin is from being fair).}

This its very similar to another result: at the beginning, before seeing any draws, you believe that at the end of $n$ draws, every possible number of green draws is equally likely, i.e, $\int_0^1 \binom{n}{k}x^{n-k}(1-x)^{k} dx = \frac{1}{n+1}$. The proof: if you draw $n+1$ IID uniform $[0,1]$ random variables, on the one hand, the first one its equally likely to have any particular rank, so the probability it has rank $k+1$ is $\frac{1}{n+1}$. On the other hand, the probability it has rank $k+1$ is exactly the probability that $k$ of the remaining $n$ uniform random variables take a value greater than it, and $n-k$ of the remaining $n$ take a value lesser than it, which equals the integral.

Nice idea, but it’s not a proof.

The gap is here:

the next randomly chosen point from the circle is equally likely to lie on each of the n + 2 arcs.For any given configuration of points, the quoted statement is almost certainly false, since the arcs almost certainly have different lengths. To close the gap, you need to show that the statement becomes true when averaging over all configurations consistent with the observed h out of n Heads.

I’m not conditioning on any configuration of points. I agree it’s false for a given configuration of points, but that’s not relevant here. Instead, I’m saying: number the intervals clockwise from 1 to n + 2, starting with the interval clockwise of Z. Since the n + 2 points were chosen uniformly at random, the interval numbered k1 is just as likely to have the new point as the interval numbered k2, for any k1 and k2. This is a probability over the entire space of outcomes, not for any fixed configuration.

(Or, as you put it, the average probability over all configurations of the probability of landing in a given interval is the same for all intervals. But that’s needlessly complicated.)

I see it now. Good call.

Nice post! I just finished reading a lot about the rule of succession (In Jaynes’ book) and this is a helpful similar-yet-different perspective. Cool circle idea.

Question to the LW community: I made this linkpost by copy-pasting from my my blog and then correcting any bad formatting. Is there an easier way to do this? Also, is there a way to do in-line LaTeX equations here?

I don’t know about linkposts, but there’s an editor here with LaTeX, yup. There are two editor modes: The LessWrong Editor, and the Markdown Editor. Looks like you’re in the markdown editor here. The LessWrong editor doesn’t support footnotes, and there are plenty of other differences, so I recommend against converting this post to LessWrong-editor-format—things might get totally messed up. (Though I’ve only written 6 posts here, none in the Markdown editor, so I’m not actually sure).