Priors as Mathematical Objects

Followup to: “Inductive Bias”

What exactly is a “prior”, as a mathematical object? Suppose you’re looking at an urn filled with red and white balls. When you draw the very first ball, you haven’t yet had a chance to gather much evidence, so you start out with a rather vague and fuzzy expectation of what might happen—you might say “fifty/fifty, even odds” for the chance of getting a red or white ball. But you’re ready to revise that estimate for future balls as soon as you’ve drawn a few samples. So then this initial probability estimate, 0.5, is not repeat not a “prior”.

An introduction to Bayes’s Rule for confused students might refer to the population frequency of breast cancer as the “prior probability of breast cancer”, and the revised probability after a mammography as the “posterior probability”. But in the scriptures of Deep Bayesianism, such as Probability Theory: The Logic of Science, one finds a quite different concept—that of prior information, which includes e.g. our beliefs about the sensitivity and specificity of mammography exams. Our belief about the population frequency of breast cancer is only one small element of our prior information.

In my earlier post on inductive bias, I discussed three possible beliefs we might have about an urn of red and white balls, which will be sampled without replacement:

  • Case 1: The urn contains 5 red balls and 5 white balls;

  • Case 2: A random number was generated between 0 and 1, and each ball was selected to be red (or white) at this probability;

  • Case 3: A monkey threw balls into the urn, each with a 50% chance of being red or white.

In each case, if you ask me—before I draw any balls—to estimate my marginal probability that the fourth ball drawn will be red, I will respond “50%”. And yet, once I begin observing balls drawn from the urn, I reason from the evidence in three different ways:

  • Case 1: Each red ball drawn makes it less likely that future balls will be red, because I believe there are fewer red balls left in the urn.

  • Case 2: Each red ball drawn makes it more plausible that future balls will be red, because I will reason that the random number was probably higher, and that the urn is hence more likely to contain mostly red balls.

  • Case 3: Observing a red or white ball has no effect on my future estimates, because each ball was independently selected to be red or white at a fixed, known probability.

Suppose I write a Python program to reproduce my reasoning in each of these scenarios. The program will take in a record of balls observed so far, and output an estimate of the probability that the next ball drawn will be red. It turns out that the only necessary information is the count of red balls seen and white balls seen, which we will respectively call R and W. So each program accepts inputs R and W, and outputs the probability that the next ball drawn is red:

  • Case 1: return (5 - R)/(10 - R - W) # Number of red balls remaining / total balls remaining

  • Case 2: return (R + 1)/(R + W + 2) # Laplace’s Law of Succession

  • Case 3: return 0.5
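For concreteness, the three one-liners above can be fleshed out into a runnable sketch. The function names here are my own; the post itself specifies only the formulas. Using exact fractions avoids floating-point noise:

```python
from fractions import Fraction

def prob_red_case1(R, W):
    # Urn known to hold 5 red and 5 white balls, sampled without replacement:
    # red balls remaining / total balls remaining.
    return Fraction(5 - R, 10 - R - W)

def prob_red_case2(R, W):
    # Unknown red-fraction drawn uniformly from [0, 1]:
    # Laplace's Law of Succession.
    return Fraction(R + 1, R + W + 2)

def prob_red_case3(R, W):
    # Each ball independently red with a fixed, known probability of 1/2.
    return Fraction(1, 2)

# Before any draws, all three programs agree on 50%:
print(prob_red_case1(0, 0), prob_red_case2(0, 0), prob_red_case3(0, 0))

# After one red ball, they diverge: case 1 lowers its estimate,
# case 2 raises it, case 3 is unmoved.
print(prob_red_case1(1, 0), prob_red_case2(1, 0), prob_red_case3(1, 0))
```

After one red draw the three predictors return 4/9, 2/3, and 1/2 respectively, matching the three patterns of reasoning described above.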

These programs are correct so far as they go. But unfortunately, probability theory does not operate on Python programs. Probability theory is an algebra of uncertainty, a calculus of credibility, and Python programs are not allowed in the formulas. It is like trying to add 3 to a toaster oven.

To use these programs in the probability calculus, we must figure out how to convert a Python program into a more convenient mathematical object—say, a probability distribution.

Suppose I want to know the combined probability that the sequence observed will be RWWRR, according to program 2 above. Program 2 does not have a direct faculty for returning the joint or combined probability of a sequence, but it is easy to extract anyway. First, I ask what probability program 2 assigns to observing R, given that no balls have been observed. Program 2 replies “1/2”. Then I ask the probability that the next ball is R, given that one red ball has been observed; program 2 replies “2/3”. The second ball is actually white, so the joint probability so far is 1/2 * 1/3 = 1/6. Next I ask for the probability that the third ball is red, given that the previous observation is RW; this is summarized as “one red and one white ball”, and the answer is 1/2. The third ball is white, so the joint probability for RWW is 1/12. For the fourth ball, given the previous observation RWW, the probability of redness is 2/5, and the joint probability goes to 1/30. We can write this as p(RWWR|RWW) = 2/5, which means that if the sequence so far is RWW, the probability assigned by program 2 to the sequence continuing with R and forming RWWR equals 2/5. And then p(RWWRR|RWWR) = 1/2, and the combined probability is 1/60.
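The hand computation above is just the chain rule applied step by step, so it can be mechanized: walk along the sequence, multiplying in each conditional probability the predictor assigns. A minimal sketch (the helper name `seq_prob` is my own):

```python
from fractions import Fraction

def laplace(R, W):
    # Program 2: Laplace's Law of Succession.
    return Fraction(R + 1, R + W + 2)

def seq_prob(seq, prob_red):
    """Joint probability of an observation string like 'RWWRR',
    obtained by multiplying the predictor's successive conditionals."""
    p = Fraction(1)
    R = W = 0
    for ball in seq:
        p_red = prob_red(R, W)
        p *= p_red if ball == 'R' else 1 - p_red
        if ball == 'R':
            R += 1
        else:
            W += 1
    return p

print(seq_prob('RWWRR', laplace))  # 1/60, matching the hand computation
```

The intermediate joints also match the text: 1/6 after RW, 1/12 after RWW, 1/30 after RWWR.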

We can do this with every possible sequence of ten balls, and end up with a table of 1024 entries. This table of 1024 entries constitutes a probability distribution over sequences of observations of length 10, and it says everything the Python program had to say (about 10 or fewer observations, anyway). Suppose I have only this probability table, and I want to know the probability that the third ball is red, given that the first two balls drawn were white. I need only sum over the probability of all entries beginning with WWR, and divide by the probability of all entries beginning with WW.
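Both steps—building the 1024-entry table and answering a conditional query by summing over it—are short enough to sketch directly (again using program 2; the names are mine):

```python
from fractions import Fraction
from itertools import product

def laplace(R, W):
    # Program 2: Laplace's Law of Succession.
    return Fraction(R + 1, R + W + 2)

def seq_prob(seq):
    # Joint probability of a sequence: product of successive conditionals.
    p = Fraction(1)
    R = W = 0
    for ball in seq:
        p_red = laplace(R, W)
        p *= p_red if ball == 'R' else 1 - p_red
        R += ball == 'R'
        W += ball == 'W'
    return p

# The full table: 2**10 = 1024 entries, one per length-10 sequence.
table = {}
for tup in product('RW', repeat=10):
    s = ''.join(tup)
    table[s] = seq_prob(s)

assert sum(table.values()) == 1  # it really is a probability distribution

# P(third ball red | first two white) = P(entries starting WWR) / P(entries starting WW)
p_wwr = sum(p for s, p in table.items() if s.startswith('WWR'))
p_ww = sum(p for s, p in table.items() if s.startswith('WW'))
print(p_wwr / p_ww)  # 1/4, matching Laplace's rule applied to R=0, W=2
```

The summation recovers exactly what the original program would have said—(0 + 1)/(0 + 2 + 2) = 1/4—which is the sense in which the table says everything the program had to say.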

We have thus transformed a program that computes the probability of future events given past experiences, into a probability distribution over sequences of observations.

You wouldn’t want to do this in real life, because the Python program is ever so much more compact than a table with 1024 entries. The point is not that we can turn an efficient and compact computer program into a bigger and less efficient giant lookup table; the point is that we can view an inductive learner as a mathematical object, a distribution over sequences, which readily fits into standard probability calculus. We can take a computer program that reasons from experience and think about it using probability theory.

Why might this be convenient? Say that I’m not sure which of these three scenarios best describes the urn—I think it’s about equally likely that each of the three cases holds true. How should I reason from my actual observations of the urn? If you think about the problem from the perspective of constructing a computer program that imitates my inferences, it looks complicated—we have to juggle the relative probabilities of each hypothesis, and also the probabilities within each hypothesis. If you think about it from the perspective of probability theory, the obvious thing to do is to add up all three distributions with weightings of 1/3 apiece, yielding a new distribution (which is in fact correct). Then the task is just to turn this new distribution into a computer program, which turns out not to be difficult.
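Turning the mixed distribution back into a predictor works out to weighting each program’s next-ball estimate by how well that program predicted the sequence so far—i.e., by 1/3 times the probability it assigned to the observations. A sketch under that reading, for sequences shorter than ten balls (the helper names are mine):

```python
from fractions import Fraction

def p1(R, W):
    # Case 1: 5 red and 5 white, sampled without replacement.
    return Fraction(5 - R, 10 - R - W)

def p2(R, W):
    # Case 2: Laplace's Law of Succession.
    return Fraction(R + 1, R + W + 2)

def p3(R, W):
    # Case 3: fixed, known probability of 1/2.
    return Fraction(1, 2)

def seq_prob(seq, prob_red):
    # Joint probability of a sequence under one predictor (chain rule).
    p, R, W = Fraction(1), 0, 0
    for ball in seq:
        pr = prob_red(R, W)
        p *= pr if ball == 'R' else 1 - pr
        R += ball == 'R'
        W += ball == 'W'
    return p

def mixture_prob_red(seq):
    """P(next ball red | seq) under the equal-weight mixture:
    each hypothesis's prediction, weighted by (1/3) * P(seq | hypothesis)."""
    num = den = Fraction(0)
    R, W = seq.count('R'), seq.count('W')
    for prob in (p1, p2, p3):
        w = Fraction(1, 3) * seq_prob(seq, prob)
        den += w
        num += w * prob(R, W)
    return num / den

print(mixture_prob_red(''))      # 1/2 before any observations
print(mixture_prob_red('RRRR'))  # pulled above 1/2, mostly by hypothesis 2
```

Note how the weights do the hypothesis-juggling automatically: after four red balls, case 2 has assigned the observations the highest probability, so its optimistic prediction dominates the mixture.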

So that is what a prior really is—a mathematical object that represents all of your starting information plus the way you learn from experience.