Wolf’s Dice

Around the mid-19th century, Swiss astronomer Rudolf Wolf rolled a pair of dice 20000 times, recording each outcome. Here is his data:

I’ve taken the data from Jaynes, who uses it as an example for maximum-entropy methods. We’ll use it for several examples, including some of Jaynes’ models.

The most interesting fact about Wolf’s dice data is that some faces come up significantly more often than others. As a quick back-of-the-envelope check, we can see this by calculating the expected number of times a face should come up on one die ($20000/6 \approx 3333$) and the standard deviation of this number ($\sqrt{20000 \cdot \frac{1}{6} \cdot \frac{5}{6}} \approx 53$). A quick glance at the data shows that the column- and row-totals differ from their expected values by roughly 2 to 6 standard deviations, an error much larger than we’d expect based on random noise.
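
For concreteness, here is a minimal sketch of that back-of-the-envelope check in Python. It is not code from the original analysis; the only inputs are the 20000-roll total and the fair-die face probability of 1/6.

```python
# Back-of-the-envelope check: expected count per face and its standard deviation
# for 20000 rolls of a fair die (each face count is binomial with p = 1/6).
N = 20000
p = 1 / 6

expected = N * p                    # ~3333.3 occurrences of each face
sigma = (N * p * (1 - p)) ** 0.5    # ~52.7, the binomial standard deviation

print(f"expected count per face: {expected:.1f}")
print(f"standard deviation:      {sigma:.1f}")
# A face total that is 2 to 6 standard deviations off is roughly 105 to 316 counts
# away from the expected ~3333.
```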

That, however, is an ad-hoc test. The point of this sequence is to test our hypotheses from first principles, specifically the principles of probability theory. If we want to know which of two or more models is correct, then we calculate the probability of each model given the data.

White Die: Biased or Not?

Let’s start with just Wolf’s white die. Our two models are:

  • Model 1: all faces equally probable

  • Model 2: each face $i$ has its own probability $p_i$ (we’ll use a uniform prior on $p$ for now, to keep things simple)

Our data is the bottom row in the table above, and our goal is to calculate $P[\text{model}|\text{data}]$ for each model.

The first step is Bayes’ rule:

$$P[\text{model}|\text{data}] = \frac{1}{Z}\, P[\text{data}|\text{model}]\, P[\text{model}]$$

Here’s what those pieces each mean:

  • $P[\text{model}]$ is our prior probability for each model—for simplicity, let’s say the two are equally likely a priori.
  • $Z$ is the normalizer, chosen so that the posterior probabilities sum to one: $Z = \sum_{\text{models}} P[\text{data}|\text{model}]\, P[\text{model}]$ (implicitly, we assume one of the two models is “correct”).
  • $P[\text{data}|\text{model}]$ is computed using the model itself…

… so the actual work is in calculating $P[\text{data}|\text{model}]$.

Model 1: Unbiased

For the first model, $P[\text{data}|\text{model}]$ is a standard probability problem: given $N = 20000$ unbiased die rolls, what’s the probability of $n_1$ 1’s, $n_2$ 2’s, etc.? This is a multinomial distribution; the answer is

$$P[\text{data}|\text{unbiased}] = \frac{N!}{n_1!\, n_2! \cdots n_6!} \left(\frac{1}{6}\right)^{N}$$

Note the symmetry factor with the factorials: we’re computing the probability of the observed counts, not the probability of a particular string of outcomes, so we have to add up the probabilities of all the outcomes with the same counts. The same factor will appear in model 2 as well, for the same reason.
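
As a sketch of how one might evaluate that probability numerically (not code from the original post): the factorials involved are enormous, so everything is done in log space via scipy’s log-gamma function. The counts below are placeholder values summing to 20000, not Wolf’s actual white-die totals; substitute the real numbers from the table above.

```python
import numpy as np
from scipy.special import gammaln  # log-gamma: gammaln(n + 1) == log(n!)

def log_multinomial_likelihood(counts, probs):
    """log P[counts | probs] under a multinomial distribution,
    including the symmetry factor N! / (n_1! ... n_6!)."""
    counts = np.asarray(counts, dtype=float)
    probs = np.asarray(probs, dtype=float)
    N = counts.sum()
    log_symmetry_factor = gammaln(N + 1) - gammaln(counts + 1).sum()
    return log_symmetry_factor + (counts * np.log(probs)).sum()

# Placeholder counts summing to 20000 -- substitute Wolf's white-die counts here.
counts = [3133, 3483, 3053, 3033, 3643, 3655]
log_p_unbiased = log_multinomial_likelihood(counts, np.full(6, 1 / 6))
print(log_p_unbiased)  # log of an astronomically small probability
```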

Model 2: Biased

The second model is a bit more complicated. If we knew the probabilities $p_i$ for each face, then we could use the multinomial distribution formula, same as model 1. But since we don’t know the $p_i$’s, we need to integrate over possible values:

$$P[\text{data}|\text{biased}] = \int_p \frac{N!}{n_1! \cdots n_6!} \prod_i p_i^{n_i}\; P[p]\; dp$$

Here $P[p]$ is our prior distribution on $p$; in this case it’s uniform, i.e. a Dirichlet distribution with parameter $\alpha_i = 1$. This is a somewhat nontrivial integral: the constraint $\sum_i p_i = 1$ means that we’re integrating over a five-dimensional surface in six dimensions. Fortunately, other people have already solved this integral in general: it’s the very handy Dirichlet-multinomial distribution. With $\alpha_i = 1$ the result is particularly simple; the integral comes out to:

$$P[\text{data}|\text{biased}] = \frac{N!\,(6-1)!}{(N+6-1)!} = \frac{N!\, 5!}{(N+5)!}$$

We’ll use this formula quite a bit in the next post. For now, we’re using $N = 20000$ outcomes, so we get

$$P[\text{data}|\text{biased}] = \frac{20000!\, 5!}{20005!} \approx 3.7 \times 10^{-20}$$
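
Here is a matching sketch for model 2 (again, not the original post’s code), using the general Dirichlet-multinomial formula and checking it against the $\alpha_i = 1$ special case $N!\,5!/(N+5)!$. The counts are the same placeholders as before, not Wolf’s data.

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_multinomial(counts, alpha):
    """log P[counts | alpha]: the multinomial likelihood integrated over a
    Dirichlet(alpha) prior on the face probabilities p."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    N, A = counts.sum(), alpha.sum()
    log_symmetry_factor = gammaln(N + 1) - gammaln(counts + 1).sum()
    log_integral = (gammaln(A) - gammaln(N + A)
                    + (gammaln(counts + alpha) - gammaln(alpha)).sum())
    return log_symmetry_factor + log_integral

counts = [3133, 3483, 3053, 3033, 3643, 3655]   # placeholder counts, not Wolf's data
log_p_biased = log_dirichlet_multinomial(counts, np.ones(6))

# With alpha_i = 1 this equals log(N! * 5! / (N+5)!), regardless of the counts:
N = 20000
print(log_p_biased, gammaln(N + 1) + gammaln(6) - gammaln(N + 6))  # both ~ -44.7
```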

So the data is many orders of magnitude more likely under this model than under the unbiased-die model.

Probability of Bias

Last step: let’s go back to where we started and compute the posterior probabilities of each model. We plug $P[\text{data}|\text{model}]$ back into Bayes’ rule. Each model had a prior probability of 0.5, so we compute the normalizer $Z$ as:

$$Z = 0.5\, P[\text{data}|\text{unbiased}] + 0.5\, P[\text{data}|\text{biased}]$$

so

$$P[\text{unbiased}|\text{data}] = \frac{0.5\, P[\text{data}|\text{unbiased}]}{Z} \approx 0, \qquad P[\text{biased}|\text{data}] = \frac{0.5\, P[\text{data}|\text{biased}]}{Z} \approx 1$$
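
In code, this last step is just a normalization; the sketch below does it in log space (via logsumexp) and reuses log_p_unbiased and log_p_biased from the earlier snippets.

```python
import numpy as np
from scipy.special import logsumexp

def posterior_probs(log_likelihoods, priors):
    """P[model | data] from log P[data | model] and prior P[model],
    normalized in log space: log Z = logsumexp(log-likelihood + log-prior)."""
    log_unnormalized = np.asarray(log_likelihoods) + np.log(priors)
    log_Z = logsumexp(log_unnormalized)
    return np.exp(log_unnormalized - log_Z)

# log_p_unbiased and log_p_biased come from the sketches above.
print(posterior_probs([log_p_unbiased, log_p_biased], [0.5, 0.5]))
# -> nearly all of the posterior mass lands on the biased model
```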

A few comments on this result...

First, we have pretty conclusively confirmed that the faces are not equally probable, given this data.

Second, the numbers involved are REALLY BIG—likelihood ratios spanning many orders of magnitude. This is par for the course: since independent probabilities multiply, probabilities tend to be roughly exponential in the number of data points. One side effect is that, as long as we have a reasonable amount of data, the priors don’t matter much. Even if we’d thought that an unbiased die was a thousand times more likely than a biased die a priori, those huge exponents would have completely swamped our prior.
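
To put a rough number on that (a sketch reusing the values and the posterior_probs helper from the snippets above): a 1000:1 prior shifts the log-odds by only $\ln 1000 \approx 6.9$, which is tiny next to the shift coming from the data.

```python
import numpy as np

# A 1000:1 prior against bias moves the log-odds by only ln(1000) ~ 6.9 ...
print(np.log(1000))
# ... while the data moves them by the log Bayes factor, which is much larger here:
print(log_p_biased - log_p_unbiased)

# So even under that skewed prior, the posterior still lands almost entirely on bias:
print(posterior_probs([log_p_unbiased, log_p_biased], [1000 / 1001, 1 / 1001]))
```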

Playing around with $\alpha$ will reveal that the same applies to our prior distribution for $p$: the result is not very sensitive to changes in the prior. A word of caution, however: priors over unknown parameters become more important in high-dimensional problems.
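
Here is a sketch of that experiment, reusing log_dirichlet_multinomial and the placeholder counts from above; alpha is the symmetric Dirichlet parameter, with $\alpha_i = 1$ being the uniform prior used so far.

```python
import numpy as np

# Vary the symmetric Dirichlet parameter and watch how little the answer moves:
for a in [0.5, 1.0, 2.0, 5.0]:
    log_p = log_dirichlet_multinomial(counts, np.full(6, a))
    print(f"alpha = {a:3.1f}:  log P[data|biased] = {log_p:.1f}")
# The log-likelihood shifts by only a few units across this range, far less than
# the gap separating the biased model from the unbiased one.
```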

Third, notice that there was no maximization and no extra parameters to twiddle. Once the models were specified, we had zero choices to make, zero degrees of freedom to play with. Any “free parameters”—i.e. $p$—have a prior over which we integrate. That integral is the hard part: as the models get larger and more complicated, we need to evaluate hairy high-dimensional integrals. The problem is not just NP-complete, it’s #P-complete. In practice, approximations are used instead, including the entire range of hypothesis tests and model comparison tests—that’s where maximization enters the picture. We’ll talk about some of those later, especially about when they’re likely to work or not work.

Finally, note that we compared a model which is conceptually “compatible” with the data (i.e. capable of generating roughly the observed frequencies) to one which is conceptually “incompatible” (i.e. the observed frequencies are way outside the expected range). A more interesting case is to consider two models which are both “compatible”. In that case, we’d want to use some kind of complexity penalty to say that the more complex model is less likely—otherwise we’d expect overfitting. In the next post, we’ll revisit Wolf’s dice with a couple of models from Jaynes, and see how $P[\text{data}|\text{model}]$ “penalizes” overly-complicated models.