Laplace’s rule of succession
Laplace’s rule of succession is a commonly used heuristic in forecasting the occurence of events which have little to no historical precedents. The basic problem setup is this: suppose you flip a coin with an unknown bias times and get heads and tails. We want to infer something about the bias of the coin from this information, where we define to be the probability of the coin coming up heads.
One obvious idea is to estimate by . This works well if we have some decent number of heads and tails, but it fails badly in the edge case where the tosses are very imbalanced. In the extreme case, if we get all heads and no tails, we would be estimating which is obviously a bad estimate: it implies that we’re sure the coin has a probability of coming up heads, which seems like an extreme inference to make from a sequence of heads. After all, if we merely had , we would not expect to see any heads to begin with. An inference that goes beyond this bound seems intuitively unjustified.
The simple version of Laplace’s rule is a heuristic which says we should pretend we got one heads and one tails in excess of the results we have actually obtained and compute the naive estimator under that assumption. In other words, we would estimate . This avoids the pitfall mentioned in the previous paragraph even though it is a biased estimator: for tosses we have
which is not equal to the true value unless . This is unavoidable, because if we want to avoid overconfidence in the case of heads, we have to make the estimator biased in the case ; and by continuity it will also at least be biased in some neighborhood of . It is, however, asymptotically unbiased: this means that as , we have .
It’s easy to see why this is an improvement over the naive estimator for avoiding overconfidence, but it looks more like an ad-hoc fix than an approach with theoretical justification. To give the justification behind Laplace’s rule, I first have to talk about conjugate priors.
In Bayesian inference, we often have the following problem: there is a family of models we’re considering that are parametrized by a vector of parameters and we have a prior over these parameters. Given the prior, we observe some event , and we want to do Bayesian updating to get a posterior by
Here is the proportionality symbol, since I’m omitting the overall normalizing factor in Bayes’ rule for simplicity.
In general this becomes a mess: the posterior can become intractable even after a simple update like this, since in principle there’s no reason for the prior and the likelihood to have anything to do with each other. To simplify Bayesian inference in cases like this, a common trick is to pick a family of priors given the form of the likelihood function such that the posterior will also lie in the same family as the prior. Given the likelihood function, such families of priors are called conjugate priors.
Suppose we try to do this for our coin setup. We take , and the likelihood function of observing a specific sequence of heads and tails will then be . The prior and posterior will be related by
It should be fairly obvious that we can consider a family of distributions in this case. Such density functions can be normalized to give genuine probability densities as long as . If the prior is of this form, all the Bayesian update will do is change its exponents: we’ll go from a prior of the form to a posterior of the form . Thus, as long as our initial prior is of this specific form, we can reduce a Bayesian update to a simple problem of updating two parameters in a straightforward and transparent manner.
The factor we need to normalize the function to get an actual probability density function is called the beta function and the distribution itself is called the beta distribution. We say that the beta distribution is the conjugate prior to the Bernoulli distribution, which is the technical name for the distribution of coin flips.
Properties of the beta distribution
There are two properties of the beta distribution that are important for the inference problem we’ll face: its mode and its mean. I’ll compute both of them here.
The mode is easy. I’ll only do the case here since this case is the hardest. We can ignore the normalization factor and simply maximize by choosing . Differentiating and using the product rule gives
which is equal to zero when the expression in the parantheses is equal to zero. Multiplying through by gives or
This local extremum is a maximum when , so it’s indeed the mode of .
The mean is easy to compute if we know some functional equations satisfied by the beta function, in particular its relationship to the Gamma function, but I won’t assume knowledge of those here and I’ll just say that the mean of turns out to be .
Importantly, the mean and mode of the beta distribution are not equal! This is actually the correct way to understand the necessity for something like Laplace’s rule from a Bayesian perspective, as I’ll explain shortly.
Now that we have the conjugate prior machinery prepared, we can imagine that we start flipping our coin with some prior over its bias . Once we get heads and tails, we update to the posterior . This works fine if we actually have a prior, so part of the moral of Laplace’s rule is that you can account for your prior when doing Bayesian inference for this problem by simply assuming you got more heads and more tails than you actually did get. If we don’t actually have a prior, though, how do we choose one?
This is the problem of uninformative priors, in other words, priors we would choose in a setting in which we know nothing about the coin. In particular, we can’t know that heads is more likely than tails or vice versa, so that suggests our prior should be symmetric: we should have . That reduces us down to one degree of freedom.
The naive version of Laplace’s rule corresponds to taking , which as you can check simply corresponds to a uniform distribution over . Once we do this and observe heads and tails our posterior becomes which has mean and mode . If we’re good Bayesians our subjective probability that the next coin toss will come up heads has to be the mean which fits Laplace’s rule and not the mode . Taking the mode in this context would be a form of maximum a posteriori estimation or maximum likelihood estimation: we have that
so taking the mode of the posterior is equivalent to maximizing the right hand side of this expression, and since the prior is constant for this just means we’re maximizing the likelihood. In this particular case, the mode of the prior is also the maximum likelihood estimate, though in general this would not be true. However, taking the maximum likelihood estimate is not the correct thing to do in this situation, because we really want the mean of the posterior and not its mode. This is the reason why the naive estimator fails in this problem.
We can, however, ask whether is the most sensible choice of prior. Two other popular choices are , called the Jeffreys prior; and , called the Haldane prior. The Haldane prior is technically not a probability density because its integral over the unit interval blows up and therefore can’t be normalized, but we can still use it as a so-called improper prior in Bayesian inference. However, it has the same overconfidence problem discussed at the start of this post, so I’ll instead try to give the motivation for why we might prefer the Jeffreys prior instead.
One of the main reasons to prefer the Jeffreys prior is that it is invariant under reparametrization. There is a kind of arbitrariness to assuming that the most natural distribution is uniform: if we parametrize our coin not by the chance of coming up heads but instead by some other variable such as , taking a uniform distribution over would not be the same thing as taking a uniform distribution over . Therefore we can only really say a uniform distribution is a natural choice if we know that parametrizing the distribution of coin flips by is in some sense the best choice we can make.
On the other hand, the Jeffreys prior is defined explicitly to be invariant under reparametrization: it is proportional to the square root of a quantity (the determinant of the Fisher information matrix) which changes by the square of the Jacobian when we switch parametrizations, so the prior itself transforms by a Jacobian factor, which makes it give a consistent probability distribution regardless of the choice of variable used to parametrize it. You can think of the Jeffreys prior as “respecting the chain rule”.
If we compute the Jeffreys prior for our setup, we find that it is equivalent to taking . In terms of inference, this would mean we would get the estimator
instead of the Laplace rule estimator of
Using it in practice
The rule of succession is best used in practice not by adhering to a fixed procedure for choosing a prior, but by using the principle that the beta distribution is conjugate prior to the Bernoulli distribution in order to represent your prior beliefs about something in a tractable manner and to do Bayesian updating easily and in a way that is interpretable.
If we have strong reasons to be confident in our prior, such as if we’re sampling from a reliable pseudorandom number generator, we would pick large and equal to the probability we’re feeding the PRNG, where how large these parameters are would represent our confidence level. On the other hand, if we’re forecasting something such as the chance that Italy will have a different prime minister in six months, we can get from the past frequency of prime minister changes in Italy, scale them down by some amount to add some uncertainty to the prior to account for the fact that political conditions might have changed, and then update our forecast as new evidence comes in. We can put evidence on the same footing by asking “what is the number of months that the incumbent prime minister would have to remain in office so that the strength of that evidence is roughly comparable to the strength of the evidence I have?”
Like anything else in forecasting this method can be more of an art than a science, but I think the rule of succession gives a useful framework for thinking about these kinds of issues in a Bayesian way.