I’ve taken the data from Jaynes, who uses it as an example for maximum-entropy methods. We’ll use it for several examples, including some of Jaynes’ models.
The most interesting fact about Wolf’s dice data is that some faces come up significantly more often than others. As a quick back-of-the-envelope check, we can see this by calculating the expected number of times a face should come up on one die (), and the standard deviation of this number (). A quick glance at the data shows that the column- and row-totals differ from their expected value by roughly 2 to 6 standard deviations, an error much larger than we’d expect based on random noise.
That, however, is an ad-hoc test. The point of this sequence is to test our hypotheses from first principles, specifically the principles of probability theory. If we want to know which of two or more models is correct, then we calculate the probability of each model given the data.
White Die: Biased or Not?
Let’s start with just Wolf’s white die. Our two models are:
Model 1: all faces equally probable
Model 2: each face i has its own probability (we’ll use a uniform prior on for now, to keep things simple)
Our data is the bottom row in the table above, and our goal is to calculate for each model.
The first step is Bayes’ rule:
Here’s what those pieces each mean:
- is our prior probability for each model—for simplicity, let’s say the two are equally likely a priori.
- is the normalizer, chosen so that the posterior probabilities sum to one: (implicitly, we assume one of the two models is “correct”).
- is computed using
… so the actual work is in calculating .
Model 1: Unbiased
For the first model, is a standard probability problem: given unbiased die rolls, what’s the probability of 1’s, 2’s, etc? This is a multinomial distribution; the answer is
Note the symmetry factor with the factorials: we’re computing the probability of the observed counts, not the probability of a particular string of outcomes, so we have to add up probabilities of all the outcomes with the same counts. The same factor will appear in model 2 as well, for the same reason.
Model 2: Biased
The second model is a bit more complicated. If we knew the probabilities for each face then we could use the multinomial distribution formula, same as model 1. But since we don’t know the ’s, we need to integrate over possible values:
Here is our prior distribution on - in this case uniform, i.e. a dirichlet distribution with parameter . This is a somewhat nontrivial integral: the constraint means that we’re integrating over a five-dimensional surface in six dimensions. Fortunately, other people have already solved this integral in general: it’s the very handy dirichlet-multinomial distribution. With the result is particularly simple; the integral comes out to:
We’ll use this formula quite a bit in the next post. For now, we’re using outcomes, so we get
So the data is around times more likely under this model than under the unbiased-die model.
Probability of Bias
Last step: let’s go back to where we started and compute the posterior probabilities of each model. We plug back into Bayes’ rule. Each model had a prior probability of 0.5, so we compute the normalizer Z as:
A few comments on this result...
First, we have pretty conclusively confirmed that the faces are not equally probable, given this data.
Second, the numbers involved are REALLY BIG—ratios on the order of . This is par for the course: since independent probabilities multiply, probabilities tend to be roughly exponential in the number of data points. One side effect is that, as long as we have a reasonable amount of data, the priors don’t matter much. Even if we’d thought that an unbiased die was a thousand times more likely than a biased die a priori, those huge exponents would have completely swamped our prior.
Playing around with will reveal that the same applies to our prior distribution for : the result is not very sensitive to changes in the prior. A word of caution, however: priors over unknown parameters become more important in high-dimensional problems.
Third, notice that there was no maximization and no extra parameters to twiddle. Once the models were specified, we had zero choices to make, zero degrees of freedom to play with. Any “free parameters”—i.e. - have a prior over which we integrate. That integral is the hard part: as the models get larger and more complicated, we need to evaluate hairy high-dimensional integrals. The problem is not just NP-complete, it’s #P-complete. In practice, approximations are used instead, including the entire range of hypothesis tests and model comparison tests—that’s where maximization enters the picture. We’ll talk about some of those later, especially about when they’re likely to work or not work.
Finally, note that we compared a model which is conceptually “compatible” with the data (i.e. capable of generating roughly the observed frequencies), to one which is conceptually “incompatible” (i.e. the observed frequencies are way outside the expected range). A more interesting case is to consider two models which are both “compatible”. In that case, we’d want to use some kind of complexity penalty to say that the more complex model is less likely—otherwise we’d expect overfit. In the next post, we’ll revisit Wolf’s dice with a couple models from Jaynes, and see how “penalizes” overly-complicated models.