Wolf’s Dice
Around the mid-19th century, Swiss astronomer Rudolf Wolf rolled a pair of dice 20000 times, recording each outcome. Here is his data:
I’ve taken the data from Jaynes, who uses it as an example for maximum-entropy methods. We’ll use it for several examples, including some of Jaynes’ models.
The most interesting fact about Wolf’s dice data is that some faces come up significantly more often than others. As a quick back-of-the-envelope check, we can see this by calculating the expected number of times a face should come up on one die (), and the standard deviation of this number (). A quick glance at the data shows that the column- and row-totals differ from their expected value by roughly 2 to 6 standard deviations, an error much larger than we’d expect based on random noise.
That, however, is an ad-hoc test. The point of this sequence is to test our hypotheses from first principles, specifically the principles of probability theory. If we want to know which of two or more models is correct, then we calculate the probability of each model given the data.
White Die: Biased or Not?
Let’s start with just Wolf’s white die. Our two models are:
- Model 1: all faces equally probable 
- Model 2: each face i has its own probability (we’ll use a uniform prior on for now, to keep things simple) 
Our data is the bottom row in the table above, and our goal is to calculate for each model.
The first step is Bayes’ rule:
Here’s what those pieces each mean:
- is our prior probability for each model—for simplicity, let’s say the two are equally likely a priori.
- is the normalizer, chosen so that the posterior probabilities sum to one: (implicitly, we assume one of the two models is “correct”).
- is computed using
… so the actual work is in calculating .
Model 1: Unbiased
For the first model, is a standard probability problem: given unbiased die rolls, what’s the probability of 1’s, 2’s, etc? This is a multinomial distribution; the answer is
Note the symmetry factor with the factorials: we’re computing the probability of the observed counts, not the probability of a particular string of outcomes, so we have to add up probabilities of all the outcomes with the same counts. The same factor will appear in model 2 as well, for the same reason.
Model 2: Biased
The second model is a bit more complicated. If we knew the probabilities for each face then we could use the multinomial distribution formula, same as model 1. But since we don’t know the ’s, we need to integrate over possible values:
Here is our prior distribution on - in this case uniform, i.e. a dirichlet distribution with parameter . This is a somewhat nontrivial integral: the constraint means that we’re integrating over a five-dimensional surface in six dimensions. Fortunately, other people have already solved this integral in general: it’s the very handy dirichlet-multinomial distribution. With the result is particularly simple; the integral comes out to:
We’ll use this formula quite a bit in the next post. For now, we’re using outcomes, so we get
So the data is around times more likely under this model than under the unbiased-die model.
Probability of Bias
Last step: let’s go back to where we started and compute the posterior probabilities of each model. We plug back into Bayes’ rule. Each model had a prior probability of 0.5, so we compute the normalizer Z as:
so
A few comments on this result...
First, we have pretty conclusively confirmed that the faces are not equally probable, given this data.
Second, the numbers involved are REALLY BIG—ratios on the order of . This is par for the course: since independent probabilities multiply, probabilities tend to be roughly exponential in the number of data points. One side effect is that, as long as we have a reasonable amount of data, the priors don’t matter much. Even if we’d thought that an unbiased die was a thousand times more likely than a biased die a priori, those huge exponents would have completely swamped our prior.
Playing around with will reveal that the same applies to our prior distribution for : the result is not very sensitive to changes in the prior. A word of caution, however: priors over unknown parameters become more important in high-dimensional problems.
Third, notice that there was no maximization and no extra parameters to twiddle. Once the models were specified, we had zero choices to make, zero degrees of freedom to play with. Any “free parameters”—i.e. - have a prior over which we integrate. That integral is the hard part: as the models get larger and more complicated, we need to evaluate hairy high-dimensional integrals. The problem is not just NP-complete, it’s #P-complete. In practice, approximations are used instead, including the entire range of hypothesis tests and model comparison tests—that’s where maximization enters the picture. We’ll talk about some of those later, especially about when they’re likely to work or not work.
Finally, note that we compared a model which is conceptually “compatible” with the data (i.e. capable of generating roughly the observed frequencies), to one which is conceptually “incompatible” (i.e. the observed frequencies are way outside the expected range). A more interesting case is to consider two models which are both “compatible”. In that case, we’d want to use some kind of complexity penalty to say that the more complex model is less likely—otherwise we’d expect overfit. In the next post, we’ll revisit Wolf’s dice with a couple models from Jaynes, and see how “penalizes” overly-complicated models.
I’m confused about something. In reality there are no perfect dice, all dice are biased in some way, intentionally or not. Thus wouldn’t a more realistic approach be something like “Given the dataset, construct the (multidimensional) probability distribution of biases.” Why privilege the “unbiased” hypothesis?
Right. But also we would want to use a prior that favoured biases which were near fair, since we know that Wolf at least thought they were a normal pair of dice.
First answer: this question is a perfect lead-in to the next post, in which we try to figure which physical asymmetries the die had. Definitely read that.
Second answer: In physics, to talk about the force applied by a baseball bat on a baseball, we use a delta function. We don’t actually think that the force is infinite and applied over an infinitesimal time span, but that’s a great approximation for simple calculations. Same in probability: we do actually think that most dice & coins are very close to unbiased. Even if we think there’s some small spread, the delta function distribution (i.e. a delta function right at the unbiased probability) is a great approximation for an “unbiased” real-world die or coin. That’s what the unbiased model is.
Third answer: “Given the dataset, construct the (multidimensional) probability distribution of biases” translates to “calculate P[p|data]”. That is absolutely a valid question to ask. Our models then enter into the prior for p—each model implies a different prior p distribution, so to get the overall prior for p, we’d combine them: P[p]=P[p|model1]P[model1]+P[p|model2]P[model2]. In English: we think the world has some “unbiased” dice, which have outcome frequencies very close to uniform, and some “biased” dice, which could have any frequencies at all. Thus our prior for p looks like a delta function plus some flatter distribution—a mixture of “unbiased” and “biased” dice.
Sure, you use a delta function when you want to make a simplifying assumption. But this post is about questioning the assumption. That’s exactly when you wouldn’t use a delta function. Your third answer flatly contradicts Shminux. No, he does not believe that there are any perfect dice. Sometimes it’s right to contradict people, but if you don’t notice you’re doing it, it’s a sign that you’re the one who is confused.
The third answer was meant to be used in conjunction with the second; that’s what the scare quotes around “unbiased” were meant to convey, along with the phrase “frequencies very close to uniform”. Sorry if that was insufficiently clear.
Also, if we’re questioning (i.e. testing) the assumption, then we still need the assumption around as a hypothesis against which to test. That’s exactly how it’s used in the post.
No, really, it was perfectly clear. The problem is that it was wrong.
Can you clarify why we look at the probability of counts rather than the particular string?
The reason I’m asking is that if a problem has continuous outcomes instead of discrete then we automatically look at the string of outcomes instead of the count (unless we bin the results). Is this just a fundamental difference between continuous and discrete outcomes?
It’s not really important, all that matters is that we’re consistent in which one we use. We have to always include the symmetry factor, or never include it.
In this case, I went with counts because our data does, in fact, consist of counts. Because we’re assuming each die roll is independent, we’d get the same answer if we just made up a string of outcomes with the same counts, and used that as the data instead.