# Wolf’s Dice

Around the mid-19th century, Swiss astronomer Rudolf Wolf rolled a pair of dice 20,000 times, recording each outcome. Here is his data:

I’ve taken the data from Jaynes, who uses it as an example for maximum-entropy methods. We’ll use it for several examples, including some of Jaynes’ models.

The most interesting fact about Wolf’s dice data is that some faces come up significantly more often than others. As a quick back-of-the-envelope check, we can see this by calculating the expected number of times a face should come up on one die ($20000 \times \frac{1}{6} \approx 3333$), and the standard deviation of this number ($\sqrt{20000 \cdot \frac{1}{6} \cdot \frac{5}{6}} \approx 53$). A quick glance at the data shows that the column- and row-totals differ from their expected value by roughly 2 to 6 standard deviations, an error much larger than we’d expect based on random noise.
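For concreteness, here’s that back-of-the-envelope check as a few lines of Python (a sketch; the per-face totals to compare against would come from the table above):

```python
import math

N = 20000  # Wolf rolled each die 20,000 times
p = 1 / 6  # probability of any given face, if the die is fair

expected = N * p                    # expected count per face
sigma = math.sqrt(N * p * (1 - p))  # binomial standard deviation of that count

print(expected)  # ≈ 3333.3
print(sigma)     # ≈ 52.7
```

Any face whose total differs from `expected` by several multiples of `sigma` is suspicious.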

That, however, is an ad-hoc test. The point of this sequence is to test our hypotheses from first principles, specifically the principles of probability theory. If we want to know which of two or more models is correct, then we calculate the probability of each model given the data.

# White Die: Biased or Not?

Let’s start with just Wolf’s white die. Our two models are:

• Model 1: all faces equally probable

• Model 2: each face $i$ has its own probability $p_i$ (we’ll use a uniform prior on $p$ for now, to keep things simple)

Our data is the bottom row in the table above, and our goal is to calculate $P[\text{model}|\text{data}]$ for each model.

The first step is Bayes’ rule:

$$P[\text{model}|\text{data}] = \frac{1}{Z}P[\text{data}|\text{model}]\,P[\text{model}]$$

Here’s what those pieces each mean:

• $P[\text{model}]$ is our prior probability for each model - for simplicity, let’s say the two are equally likely a priori.
• $Z$ is the normalizer, chosen so that the posterior probabilities sum to one: $Z = \sum_{\text{model}} P[\text{data}|\text{model}]\,P[\text{model}]$ (implicitly, we assume one of the two models is “correct”).
• $P[\text{data}|\text{model}]$ is the probability which each model assigns to the observed data

… so the actual work is in calculating $P[\text{data}|\text{model}]$.

## Model 1: Unbiased

For the first model, $P[\text{data}|\text{model 1}]$ is a standard probability problem: given $N = 20000$ unbiased die rolls, what’s the probability of $n_1$ 1’s, $n_2$ 2’s, etc.? This is a multinomial distribution; the answer is

$$P[\text{data}|\text{model 1}] = \frac{N!}{n_1!\,n_2!\cdots n_6!}\left(\frac{1}{6}\right)^N$$

Note the symmetry factor with the factorials: we’re computing the probability of the observed counts, not the probability of a particular string of outcomes, so we have to add up probabilities of all the outcomes with the same counts. The same factor will appear in model 2 as well, for the same reason.
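As a sketch, the model-1 likelihood can be evaluated in log space (factorials of thousands overflow ordinary floats), using `math.lgamma`; the counts below are hypothetical placeholders, not Wolf’s actual data, just to show the shape of the calculation:

```python
import math

def log_p_data_model1(counts):
    """log P[data | model 1]: multinomial likelihood with every face at 1/6,
    including the symmetry factor N! / (n_1! ... n_6!)."""
    N = sum(counts)
    log_sym = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_sym + N * math.log(1 / 6)

# Hypothetical counts (NOT Wolf's data), each set summing to 20000:
near_uniform = [3333, 3333, 3333, 3333, 3333, 3335]
skewed = [2900, 3100, 3300, 3400, 3500, 3800]

# Near-uniform counts are far more probable under the unbiased model:
print(log_p_data_model1(near_uniform) > log_p_data_model1(skewed))  # True
```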

## Model 2: Biased

The second model is a bit more complicated. If we knew the probabilities $p_i$ for each face then we could use the multinomial distribution formula, same as model 1. But since we don’t know the $p_i$’s, we need to integrate over possible values:

$$P[\text{data}|\text{model 2}] = \int_p \frac{N!}{\prod_i n_i!}\prod_i p_i^{n_i}\, dP[p]$$

Here $dP[p]$ is our prior distribution on $p$ - in this case uniform, i.e. a Dirichlet distribution with parameter $\alpha = (1, \dots, 1)$. This is a somewhat nontrivial integral: the constraint $\sum_i p_i = 1$ means that we’re integrating over a five-dimensional surface in six dimensions. Fortunately, other people have already solved this integral in general: it’s the very handy Dirichlet-multinomial distribution. With $\alpha_i = 1$ the result is particularly simple; the integral comes out to:

$$P[\text{data}|\text{model 2}] = \frac{N!\,5!}{(N+5)!}$$

We’ll use this formula quite a bit in the next post. For now, we’re using $N = 20000$ outcomes, so we get

$$P[\text{data}|\text{model 2}] = \frac{20000!\,5!}{20005!} \approx 3.7 \times 10^{-20}$$
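The quantity $N!\,5!/(N+5)!$ for $N = 20000$ can be evaluated directly in log space; a quick sketch:

```python
import math

N = 20000
# log of N! * 5! / (N + 5)!, via log-gamma to avoid overflowing floats:
log_p = math.lgamma(N + 1) + math.lgamma(6) - math.lgamma(N + 6)
p = math.exp(log_p)
print(p)  # ≈ 3.7e-20
```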

So the data is astronomically more likely under this model than under the unbiased-die model: the likelihood ratio works out to many tens of orders of magnitude.

## Probability of Bias

Last step: let’s go back to where we started and compute the posterior probabilities of each model. We plug back into Bayes’ rule. Each model had a prior probability of 0.5, so we compute the normalizer Z as:

$$Z = \frac{1}{2}P[\text{data}|\text{model 1}] + \frac{1}{2}P[\text{data}|\text{model 2}]$$

so

$$P[\text{model 2}|\text{data}] = \frac{\frac{1}{2}P[\text{data}|\text{model 2}]}{Z} \approx 1$$

with $P[\text{model 1}|\text{data}] = 1 - P[\text{model 2}|\text{data}]$ correspondingly tiny.
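Because the likelihoods themselves underflow ordinary floats, this posterior is best computed in log space. A minimal sketch (the function name is mine, and the log-likelihood values below are placeholders, not Wolf’s actual numbers):

```python
import math

def posterior_model2(log_p1, log_p2, prior1=0.5, prior2=0.5):
    """P[model 2 | data] via Bayes' rule, computed stably in log space."""
    a = math.log(prior1) + log_p1   # log of prior1 * P[data | model 1]
    b = math.log(prior2) + log_p2   # log of prior2 * P[data | model 2]
    m = max(a, b)
    log_z = m + math.log(math.exp(a - m) + math.exp(b - m))  # log of normalizer Z
    return math.exp(b - log_z)

# Placeholder log-likelihoods: data e^100 times more likely under model 2.
print(posterior_model2(-150.0, -50.0))  # ≈ 1.0
print(posterior_model2(-50.0, -50.0))   # ≈ 0.5
```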

A few comments on this result...

First, we have pretty conclusively confirmed that the faces are not equally probable, given this data.

Second, the numbers involved are REALLY BIG - likelihoods on the order of $10^{-20}$ and smaller, and ratios of many tens of orders of magnitude. This is par for the course: since independent probabilities multiply, probabilities tend to be roughly exponential in the number of data points. One side effect is that, as long as we have a reasonable amount of data, the priors don’t matter much. Even if we’d thought that an unbiased die was a thousand times more likely than a biased die a priori, those huge exponents would have completely swamped our prior.

Playing around with the Dirichlet parameter $\alpha$ will reveal that the same applies to our prior distribution for $p$: the result is not very sensitive to changes in the prior. A word of caution, however: priors over unknown parameters become more important in high-dimensional problems.
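To play with this directly, here’s a sketch of the general symmetric-Dirichlet version of the model-2 likelihood, so the prior parameter $\alpha$ can be varied (the function name and the counts are mine; the counts are hypothetical placeholders, not Wolf’s data):

```python
import math

def log_dirichlet_multinomial(counts, alpha):
    """log P[counts | biased model] with a symmetric Dirichlet(alpha) prior on p."""
    N, k = sum(counts), len(counts)
    out = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)  # symmetry factor
    out += math.lgamma(k * alpha) - math.lgamma(N + k * alpha)
    out += sum(math.lgamma(n + alpha) - math.lgamma(alpha) for n in counts)
    return out

counts = [3100, 3200, 3300, 3400, 3500, 3500]  # hypothetical, sums to 20000
for alpha in (0.5, 1.0, 2.0, 10.0):
    print(alpha, log_dirichlet_multinomial(counts, alpha))
```

For counts like these, varying $\alpha$ moves the log-likelihood by only a few units - tiny compared to the hundred-plus-unit gaps between models.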

Third, notice that there was no maximization and no extra parameters to twiddle. Once the models were specified, we had zero choices to make, zero degrees of freedom to play with. Any “free parameters” - i.e. $p$ - have a prior over which we integrate. That integral is the hard part: as the models get larger and more complicated, we need to evaluate hairy high-dimensional integrals. The problem is not just NP-complete, it’s #P-complete. In practice, approximations are used instead, including the entire range of hypothesis tests and model comparison tests - that’s where maximization enters the picture. We’ll talk about some of those later, especially about when they’re likely to work or not work.

Finally, note that we compared a model which is conceptually “compatible” with the data (i.e. capable of generating roughly the observed frequencies), to one which is conceptually “incompatible” (i.e. the observed frequencies are way outside the expected range). A more interesting case is to consider two models which are both “compatible”. In that case, we’d want to use some kind of complexity penalty to say that the more complex model is less likely - otherwise we’d expect overfitting. In the next post, we’ll revisit Wolf’s dice with a couple models from Jaynes, and see how the model-comparison math “penalizes” overly-complicated models.

• I’m confused about something. In reality there are no perfect dice; all dice are biased in some way, intentionally or not. Thus wouldn’t a more realistic approach be something like “Given the dataset, construct the (multidimensional) probability distribution of biases”? Why privilege the “unbiased” hypothesis?

• Right. But also we would want to use a prior that favoured biases which were near fair, since we know that Wolf at least thought they were a normal pair of dice.

• First answer: this question is a perfect lead-in to the next post, in which we try to figure out which physical asymmetries the die had. Definitely read that.

Second answer: In physics, to talk about the force applied by a baseball bat on a baseball, we use a delta function. We don’t actually think that the force is infinite and applied over an infinitesimal time span, but that’s a great approximation for simple calculations. Same in probability: we do actually think that most dice & coins are very close to unbiased. Even if we think there’s some small spread, the delta function distribution (i.e. a delta function right at the unbiased probability) is a great approximation for an “unbiased” real-world die or coin. That’s what the unbiased model is.

Third answer: “Given the dataset, construct the (multidimensional) probability distribution of biases” translates to “calculate $P[p|\text{data}]$”. That is absolutely a valid question to ask. Our models then enter into the prior for $p$ - each model implies a different prior distribution, so to get the overall prior for $p$, we’d combine them: $P[p] = \sum_{\text{model}} P[p|\text{model}]\,P[\text{model}]$. In English: we think the world has some “unbiased” dice, which have outcome frequencies very close to uniform, and some “biased” dice, which could have any frequencies at all. Thus our prior for $p$ looks like a delta function plus some flatter distribution - a mixture of “unbiased” and “biased” dice.

• Sure, you use a delta function when you want to make a simplifying assumption. But this post is about questioning the assumption. That’s exactly when you wouldn’t use a delta function. Your third answer flatly contradicts Shminux. No, he does not believe that there are any perfect dice. Sometimes it’s right to contradict people, but if you don’t notice you’re doing it, it’s a sign that you’re the one who is confused.

• The third answer was meant to be used in conjunction with the second; that’s what the scare quotes around “unbiased” were meant to convey, along with the phrase “frequencies very close to uniform”. Sorry if that was insufficiently clear.

Also, if we’re questioning (i.e. testing) the assumption, then we still need the assumption around as a hypothesis against which to test. That’s exactly how it’s used in the post.

• No, really, it was perfectly clear. The problem is that it was wrong.

• > Note the symmetry factor with the factorials: we’re computing the probability of the observed counts, not the probability of a particular string of outcomes, so we have to add up probabilities of all the outcomes with the same counts.

Can you clarify why we look at the probability of counts rather than the particular string?

The reason I’m asking is that if a problem has continuous outcomes instead of discrete, then we automatically look at the string of outcomes instead of the count (unless we bin the results). Is this just a fundamental difference between continuous and discrete outcomes?

• It’s not really important; all that matters is that we’re consistent in which one we use. We have to always include the symmetry factor, or never include it.

In this case, I went with counts because our data does, in fact, consist of counts. Because we’re assuming each die roll is independent, we’d get the same answer if we just made up a string of outcomes with the same counts, and used that as the data instead.
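A tiny sketch of that consistency point: as long as the symmetry factor appears in both likelihoods (or in neither), it cancels out of the model comparison. The toy counts here are made up purely for illustration:

```python
import math

counts = [2, 3, 1]  # toy data: 2 ones, 3 twos, 1 three, on a 3-sided "die"
N, k = sum(counts), len(counts)

# Symmetry factor: number of outcome strings sharing these counts.
log_sym = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)

# Likelihoods of one particular STRING of outcomes:
log_p1_string = N * math.log(1 / k)  # unbiased model
log_p2_string = (math.lgamma(k) - math.lgamma(N + k)
                 + sum(math.lgamma(n + 1) for n in counts))  # uniform-prior biased model

# Likelihoods of the COUNTS (string likelihood plus symmetry factor):
log_p1_counts = log_sym + log_p1_string
log_p2_counts = log_sym + log_p2_string

# Either way, the log likelihood-ratio between the two models is identical:
print(log_p2_string - log_p1_string)
print(log_p2_counts - log_p1_counts)
```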