# Priors as Mathematical Objects

Followup to: “Inductive Bias”

What exactly is a “prior”, as a mathematical object? Suppose you’re looking at an urn filled with red and white balls. When you draw the very first ball, you haven’t yet had a chance to gather much evidence, so you start out with a rather vague and fuzzy expectation of what might happen—you might say “fifty/fifty, even odds” for the chance of getting a red or white ball. But you’re ready to revise that estimate for future balls as soon as you’ve drawn a few samples. So then this initial probability estimate, 0.5, is not repeat not a “prior”.

An introduction to Bayes’s Rule for confused students might refer to the population frequency of breast cancer as the “prior probability of breast cancer”, and the revised probability after a mammography as the “posterior probability”. But in the scriptures of Deep Bayesianism, such as Probability Theory: The Logic of Science, one finds a quite different concept—that of prior information, which includes e.g. our beliefs about the sensitivity and specificity of mammography exams. Our belief about the population frequency of breast cancer is only one small element of our prior information.

In my earlier post on inductive bias, I discussed three possible beliefs we might have about an urn of red and white balls, which will be sampled without replacement:

• Case 1: The urn contains 5 red balls and 5 white balls;

• Case 2: A random number was generated between 0 and 1, and each ball was selected to be red (or white) at this probability;

• Case 3: A monkey threw balls into the urn, each with a 50% chance of being red or white.

In each case, if you ask me—before I draw any balls—to estimate my marginal probability that the fourth ball drawn will be red, I will respond “50%”. And yet, once I begin observing balls drawn from the urn, I reason from the evidence in three different ways:

• Case 1: Each red ball drawn makes it less likely that future balls will be red, because I believe there are fewer red balls left in the urn.

• Case 2: Each red ball drawn makes it more plausible that future balls will be red, because I will reason that the random number was probably higher, and that the urn is hence more likely to contain mostly red balls.

• Case 3: Observing a red or white ball has no effect on my future estimates, because each ball was independently selected to be red or white at a fixed, known probability.

Suppose I write a Python program to reproduce my reasoning in each of these scenarios. The program will take in a record of balls observed so far, and output an estimate of the probability that the next ball drawn will be red. It turns out that the only necessary information is the count of red balls seen and white balls seen, which we will respectively call R and W. So each program accepts inputs R and W, and outputs the probability that the next ball drawn is red:

• Case 1: return (5 - R)/(10 - R - W) # Number of red balls remaining / total balls remaining

• Case 2: return (R + 1)/(R + W + 2) # Laplace’s Law of Succession

• Case 3: return 0.5
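As a concrete sketch, the three programs can be written out as runnable Python (the function names case1, case2, case3 are mine, not from the original post):

```python
def case1(R, W):
    """Urn known to hold 5 red and 5 white balls, drawn without replacement."""
    return (5 - R) / (10 - R - W)

def case2(R, W):
    """Unknown bias drawn uniformly from [0, 1]: Laplace's Law of Succession."""
    return (R + 1) / (R + W + 2)

def case3(R, W):
    """Each ball independently red with known probability 1/2."""
    return 0.5
```

After a single red ball the three programs already disagree: case1(1, 0) returns 4/9, case2(1, 0) returns 2/3, and case3(1, 0) returns 0.5.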

These programs are correct so far as they go. But unfortunately, probability theory does not operate on Python programs. Probability theory is an algebra of uncertainty, a calculus of credibility, and Python programs are not allowed in the formulas. It is like trying to add 3 to a toaster oven.

To use these programs in the probability calculus, we must figure out how to convert a Python program into a more convenient mathematical object—say, a probability distribution.

Suppose I want to know the combined probability that the sequence observed will be RWWRR, according to program 2 above. Program 2 does not have a direct faculty for returning the joint or combined probability of a sequence, but it is easy to extract anyway. First, I ask what probability program 2 assigns to observing R, given that no balls have been observed. Program 2 replies “1/2”. Then I ask the probability that the next ball is R, given that one red ball has been observed; program 2 replies “2/3”. The second ball is actually white, so the joint probability so far is 1/2 * 1/3 = 1/6. Next I ask for the probability that the third ball is red, given that the previous observation is RW; this is summarized as “one red and one white ball”, and the answer is 1/2. The third ball is white, so the joint probability for RWW is 1/12. For the fourth ball, given the previous observation RWW, the probability of redness is 2/5, and the joint probability goes to 1/30. We can write this as p(RWWR|RWW) = 2/5, which means that if the sequence so far is RWW, the probability assigned by program 2 to the sequence continuing with R and forming RWWR equals 2/5. And then p(RWWRR|RWWR) = 1/2, and the combined probability is 1/60.
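This hand computation can be mechanized with a small helper (a sketch; joint_prob and laplace are my names, not from the post):

```python
def laplace(R, W):
    """Program 2: Laplace's Law of Succession."""
    return (R + 1) / (R + W + 2)

def joint_prob(sequence, rule):
    """Chain rule: multiply the rule's successive next-ball probabilities."""
    p, R, W = 1.0, 0, 0
    for ball in sequence:
        p_red = rule(R, W)
        p *= p_red if ball == 'R' else 1.0 - p_red
        R += ball == 'R'
        W += ball == 'W'
    return p
```

Here joint_prob('RWWRR', laplace) returns 1/60, matching the hand computation above.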

We can do this with every possible sequence of ten balls, and end up with a table of 1024 entries. This table of 1024 entries constitutes a probability distribution over sequences of observations of length 10, and it says everything the Python program had to say (about 10 or fewer observations, anyway). Suppose I have only this probability table, and I want to know the probability that the third ball is red, given that the first two balls drawn were white. I need only sum over the probability of all entries beginning with WWR, and divide by the probability of all entries beginning with WW.
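Here is one way to build that table and answer such queries, as a sketch using program 2 (all names are mine):

```python
from itertools import product

def laplace(R, W):
    return (R + 1) / (R + W + 2)

def joint_prob(sequence):
    p, R, W = 1.0, 0, 0
    for ball in sequence:
        p_red = laplace(R, W)
        p *= p_red if ball == 'R' else 1.0 - p_red
        R += ball == 'R'
        W += ball == 'W'
    return p

# The 1024-entry table: probability of every length-10 sequence.
table = {''.join(s): joint_prob(s) for s in product('RW', repeat=10)}

def prob_next_red(prefix):
    """P(next ball is red | prefix), computed from the table alone."""
    extended = sum(p for s, p in table.items() if s.startswith(prefix + 'R'))
    seen = sum(p for s, p in table.items() if s.startswith(prefix))
    return extended / seen
```

Computed this way, prob_next_red('WW') recovers 1/4, exactly what program 2 would output directly for R = 0, W = 2.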

We have thus transformed a program that computes the probability of future events given past experiences, into a probability distribution over sequences of observations.

You wouldn’t want to do this in real life, because the Python program is ever so much more compact than a table with 1024 entries. The point is not that we can turn an efficient and compact computer program into a bigger and less efficient giant lookup table; the point is that we can view an inductive learner as a mathematical object, a distribution over sequences, which readily fits into standard probability calculus. We can take a computer program that reasons from experience and think about it using probability theory.

Why might this be convenient? Say that I’m not sure which of these three scenarios best describes the urn—I think it’s about equally likely that each of the three cases holds true. How should I reason from my actual observations of the urn? If you think about the problem from the perspective of constructing a computer program that imitates my inferences, it looks complicated—we have to juggle the relative probabilities of each hypothesis, and also the probabilities within each hypothesis. If you think about it from the perspective of probability theory, the obvious thing to do is to add up all three distributions with weightings of 1/3 apiece, yielding a new distribution (which is in fact correct). Then the task is just to turn this new distribution into a computer program, which turns out not to be difficult.
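The resulting mixture program is short. A sketch, assuming equal 1/3 starting weights as in the text (function names are mine):

```python
def case1(R, W):
    return (5 - R) / (10 - R - W)

def case2(R, W):
    return (R + 1) / (R + W + 2)

def case3(R, W):
    return 0.5

RULES = (case1, case2, case3)

def rule_joint(sequence, rule):
    """Probability the rule assigns to the whole observed sequence."""
    p, R, W = 1.0, 0, 0
    for ball in sequence:
        p_red = rule(R, W)
        p *= p_red if ball == 'R' else 1.0 - p_red
        R += ball == 'R'
        W += ball == 'W'
    return p

def mixture_next_red(sequence):
    """P(next is red) under the equal-weighted mixture of the three cases.

    Each case's weight is proportional to the probability it assigned to
    the sequence observed so far: Bayesian updating of the case weights.
    """
    weights = [rule_joint(sequence, rule) / 3 for rule in RULES]
    total = sum(weights)
    R, W = sequence.count('R'), sequence.count('W')
    return sum(w * rule(R, W) for w, rule in zip(weights, RULES)) / total
```

Before any draws it returns 0.5; after RRR it returns about 0.62, between case 1's 2/7 and case 2's 4/5, because the evidence has shifted weight toward case 2.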

So that is what a prior really is—a mathematical object that represents all of your starting information plus the way you learn from experience.

• I’m confused when you say that the prior represents all your starting information plus the way you learn from experience. Isn’t the way you learn from experience fixed, in this framework? Given that you are using Bayesian methods, so that the idea of a prior is well defined, doesn’t that already tell you how you will learn from experience?

• Hal, with a poor prior, “Bayesian updating” can lead to learning in the wrong direction or to no learning at all. Bayesian updating guarantees a certain kind of consistency, but not correctness. (If you have five city maps that agree with each other, they might still disagree with the city.) You might think of Bayesian updating as a kind of lower level of organization—like a computer chip that runs programs, or the laws of physics that run the computer chip—underneath the activity of learning. If you start with a maxentropy prior that assigns equal probability to every sequence of observations, and carry out strict Bayesian updating, you’ll still never learn anything; your marginal probabilities will never change as a result of the Bayesian updates. Conversely, if you somehow had a good prior but no Bayesian engine to update it, you would stay frozen in time and no learning would take place. To learn you need a good prior and an updating engine. Taking a picture requires a camera, light—and also time.
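The maxentropy point is easy to check numerically. A sketch, fixing sequences of length 10 (names are mine):

```python
from itertools import product

N = 10  # sequence length; every sequence gets probability 2**-N

def maxent_next_red(prefix):
    """P(next is red | prefix) under the uniform distribution over sequences."""
    seqs = [''.join(s) for s in product('RW', repeat=N)]
    num = sum(2.0**-N for s in seqs if s.startswith(prefix + 'R'))
    den = sum(2.0**-N for s in seqs if s.startswith(prefix))
    return num / den
```

However many red balls have been observed, maxent_next_red returns 0.5: strict Bayesian updating on this prior never moves the marginal probabilities.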

This probably deserves its own post.

• Another thing I don’t fully understand is the process of “updating” a prior. I’ve seen different flavors of Bayesian reasoning described. In some, we start with a prior, get some information and update the probabilities. This new probability distribution now serves as our prior for interpreting the next incoming piece of information, which then causes us to further update the prior. In other interpretations, the priors never change; they are always considered the initial probability distribution. We then use those prior probabilities plus our sequence of observations since then to make new interpretations and predictions. I gather that these can be considered mathematically identical, but do you think one or the other is a more useful or helpful way to think of it?

In this example, you start off with uncertainty about which process put in the balls, so we give 1/3 probability to each. But then as we observe balls coming out, we can update this prior. Once we see 6 red balls, for example, we can completely eliminate Case 1, which put in 5 red and 5 white. We can think of our prior as our information about the ball-filling process plus the current state of the urn, and this can be updated after each ball is drawn.
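This update over the three cases can be written directly. A sketch starting from equal 1/3 priors (names are mine):

```python
def case1(R, W):
    return (5 - R) / (10 - R - W)

def case2(R, W):
    return (R + 1) / (R + W + 2)

def case3(R, W):
    return 0.5

def seq_prob(sequence, rule):
    """Probability the rule assigns to the observed sequence (chain rule)."""
    p, R, W = 1.0, 0, 0
    for ball in sequence:
        p_red = rule(R, W)
        p *= p_red if ball == 'R' else 1.0 - p_red
        R += ball == 'R'
        W += ball == 'W'
    return p

def case_posteriors(sequence):
    """Posterior over the three cases, starting from equal 1/3 priors."""
    joints = [seq_prob(sequence, rule) for rule in (case1, case2, case3)]
    total = sum(joints)
    return [j / total for j in joints]
```

After six red balls, case_posteriors('RRRRRR') puts exactly zero weight on Case 1: its sixth factor is (5 - 5)/(10 - 5) = 0, just as the comment describes.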

• Hal,

You are being a bad boy. In his earlier discussion Eliezer made it clear that he did not approve of this terminology of “updating priors.” One has posterior probability distributions. The prior is what one starts with. However, Eliezer has also been a bit confusing with his occasional use of such language as a “prior learning.” I repeat, agents learn, not priors, although in his view of the post-human computerized future, maybe it will be computerized priors that do the learning.

The only way one is going to get “wrong learning,” at least somewhat asymptotically, is if the dimensionality is high and the support is disconnected. Eliezer is right that if one starts off with a prior that is far enough off, one might well have “wrong learning,” at least for a while. But, unless the conditions I just listed hold, eventually the learning will move in the right direction and head towards the correct answer, or probability distribution; at least, that is what Bayes’ Theorem asserts.

OTOH, the reference to “deep Bayesianism” raises another issue, that of fundamental subjectivism. There is a deep divide among Bayesians between the ones who are ultimately classical frequentists, but who argue that Bayesian methods are a superior way of getting to the true objective distribution, and the deep subjectivist Bayesians. For the latter, there are no ultimately “true” probability distributions. We are always estimating something derived from our subjective priors as updated by more recent information, wherever those priors came from.

Also, saying that a prior should be the known probability distribution, say of cancer victims, assumes that this probability is somehow known. The prior is always subject to how much information the assumer of a prior has when they begin their process of estimation.

• Eliezer,

Just to be clear . . . going back to your first paragraph, that 0.5 is a prior probability for the outcome of one draw from the urn (that is, for the random variable that equals 1 if the ball is red and 0 if the ball is white). But, as you point out, 0.5 is not a prior probability for the series of ten draws. What you’re calling a “prior” would typically be called a “model” by statisticians. Bayesians traditionally divide a model into likelihood, prior, and hyperprior, but as you implicitly point out, the dividing line between these is not clear: ultimately, they’re all part of the big model.

• Barkley, I think you may be regarding likelihood distributions as fixed properties held in common by all agents, whereas I am regarding them as variables folded into the prior—if you have a probability distribution over sequences of observables, it implicitly includes beliefs about parameters and likelihoods. Where agents disagree about prior likelihood functions, not just prior parameter probabilities, their beliefs may trivially fail to converge.

Andrew’s point may be particularly relevant here—it may indeed be that statisticians call what I am talking about a “model”. (Although in some cases, like the Laplace’s Law of Succession inductor, I think they might call it a “model class”?) Jaynes, however, would have called it our “prior information”, and he would have written “the probability of A, given that we observe B” as p(A|B,I), where I stands for all our prior beliefs including parameter distributions and likelihood distributions. While we may often want to discriminate between different models and model classes, it makes no sense to talk about discriminating between “prior informations”—your prior information is everything you start out with.

• Eliezer, I am very interested in the Bayesian approach to reasoning you’ve outlined on this site; it’s one of the more elegant ideas I’ve ever run into.

I am a bit confused, though, about to what extent you are using math directly when assessing truth claims. If I asked you, for example, “what probability do you assign to the proposition ‘global warming is anthropogenic’?” (say), would you tell me a number?

Or is this mostly about conceptually understanding that P(effect|~cause) needs to be taken into account?

If it’s a number, what’s your heuristic for getting there (i.e., deciding on a prior probability & all the other probabilities)?

If there’s a post that goes into that much detail, I haven’t seen it yet, though your explanations of Bayes’s theorem generally are brilliant.

• My reason for writing this is not to correct Eliezer. Rather, I want to expand on his distinction between prior information and prior probability. Pages 87-89 of Probability Theory: The Logic of Science by E. T. Jaynes (2004 reprint with corrections, ISBN 0 521 59271 2) are dense with important definitions and principles. The quotes below are from there, unless otherwise indicated.

Jaynes writes the fundamental law of inference as

```
P(H|DX) = P(H|X) P(D|HX) / P(D|X)         (4.3)
```

which the reader may be more used to seeing as

```
P(H|D) = P(H) P(D|H) / P(D)
```

where

```
H = some hypothesis to be tested
D = the data under immediate consideration
X = all other information known
```

X is the misleadingly-named ‘prior information’, which represents all the information available other than the specific data D that we are considering at the moment. “This includes, at the very least, all its past experiences, from the time it left the factory to the time it received its current problem.”—Jaynes p. 87, referring to a hypothetical problem-solving robot. It seems to me that in practice, X ends up being a representation of a subset of all prior experience, attempting to discard only what is irrelevant to the problem. In real human practice, that representation may be wrong and may need to be corrected.

“ … to our robot, there is no such thing as an ‘absolute’ probability; all probabilities are necessarily conditional on X at the least.” “Any probability P(A|X) which is conditional on X alone is called a prior probability. But we caution that ‘prior’ … does not necessarily mean ‘earlier in time’ … the distinction is purely a logical one; any information beyond the immediate data D of the current problem is by definition ‘prior information’.”

“Indeed, the separation of the totality of the evidence into two components called ‘data’ and ‘prior information’ is an arbitrary choice made by us, only for our convenience in organizing a chain of inferences.” Please note his use of the word ‘evidence’.

Sampling theory, which is the basis of many treatments of probability, “ … did not need to take any particular note of the prior information X, because all probabilities were conditional on H, and so we could suppose implicitly that the general verbal prior information defining the problem was included in H. This is the habit of notation that we have slipped into, which has obscured the unified nature of all inference.”

“From the start, it has seemed clear how one determines numerical values of sampling probabilities¹ [e.g. P(D|H)], but not what determines prior probabilities [AKA ‘priors’, e.g. P(H|X)]. In the present work we shall see that this is only an artifact of the unsymmetrical way of formulating problems, which left them ill-posed. One could see clearly how to assign sampling probabilities because the hypothesis H was stated very specifically; had the prior information X been specified equally well, it would have been equally clear how to assign prior probabilities.”

Jaynes never gives up on that X notation (though the letter may differ); he never drops it for convenience.

“When we look at these problems on a sufficiently fundamental level and realize how careful one must be to specify prior information before we have a well-posed problem, it becomes clear that … exactly the same principles are needed to assign either sampling probabilities or prior probabilities …” That is, P(H|X) should be calculated. Keep your copy of Kendall and Stuart handy.

I think priors should not be cheaply set from an opinion, whim, or wish. “ … it would be a big mistake to think of X as standing for some hidden major premise, or some universally valid proposition about Nature.”

The prior information has impact beyond setting prior probabilities (priors). It informs the formulation of the hypotheses, of the model, and of “alternative hypotheses” that come to mind when the data seem to be showing something really strange. For example, data that seem to strongly support psychokinesis may cause a skeptic to bring up a hypothesis of fraud, whereas a career psychic researcher may not do so. (see Jaynes pp. 122-125)

I say, be alert for misinformation, biases, and wishful thinking in your X. Discard everything that is not evidence.

I’m pretty sure the free version of Probability Theory: The Logic of Science is offline. You can preview the book here: http://books.google.com/books?id=tTN4HuUNXjgC&printsec=frontcover&dq=Probability+Theory:+The+Logic+of+Science&cd=1#v=onepage&q&f=false

FOOTNOTES

1. There are massive compendiums of methods for sampling distributions, such as

• Feller (An Introduction to Probability Theory and its Applications, Vol. 1, J. Wiley & Sons, New York, 3rd edn 1968; and Vol. 2, J. Wiley & Sons, New York, 2nd edn 1971) and

• Kendall and Stuart (The Advanced Theory of Statistics: Volume 1, Distribution Theory, Macmillan, New York 1977).

Be familiar with what is in them.

Edited 05/05/2010 to put in the actual references.

• Then the task is just to turn this new distribution into a computer program, which turns out not to be difficult.

Can someone please provide a hint as to how?

• Here’s some Python code to calculate a prior distribution from a rule for assigning probability to the next observation.

A “rule” is represented as a function that takes as a first argument the next observation (like “R”) and as a second argument all previous observations (a string like “RRWR”). I included some example rules at the end.

EDIT: oh man, what happened to my line spacing? my indents? jeez.

EDIT2: here’s a dropbox link: https://www.dropbox.com/s/16n01acrauf8h7g/prior_producer.py

```
from functools import reduce

def prod(sequence):
    """Product equivalent of python's "sum"."""
    # An initializer of 1 makes this safe for empty sequences.
    return reduce(lambda a, b: a * b, sequence, 1)

def sequence_prob(rule, sequence):
    """Probability of a sequence like "RRWR" using the given rule for
    computing the probability of the next observation.

    To put it another way: computes the joint probability mass function."""
    return prod([rule(sequence[i], sequence[:i])
                 for i in range(len(sequence))])

def number2sequence(number, length):
    """Convert a number like 5 into a sequence like WWRWR.

    The sequence corresponds to the binary digit representation of the
    number: 5 --> 00101 --> WWRWR

    This is convenient for listing all sequences of a given length."""
    binary_representation = bin(number)[2:]
    seq_end = binary_representation.replace('1', 'R').replace('0', 'W')

    if len(seq_end) > length:
        raise ValueError('no sequence of length {} with number {}'
                         .format(length, number))

    # Now add W's to the beginning to make it the right length -
    # like adding 0's to the beginning of a binary number
    return ''.join('W' for i in range(length - len(seq_end))) + seq_end

def prior(rule, n):
    """Generate a joint probability distribution from the given rule over
    all sequences of length n. Doesn't feed the rule any background
    knowledge, so it's a prior distribution."""
    sequences = [number2sequence(i, n) for i in range(2**n)]
    return [(seq, sequence_prob(rule, seq)) for seq in sequences]
```

And here are some examples of functions that can be used as the “rule” arguments.

```
def laplaces_rule(next, past):
    R = past.count('R')
    W = past.count('W')
    if R + W != len(past):
        raise ValueError('knowledge is not just of red and white balls')
    red_prob = (R + 1) / (R + W + 2)
    if next == 'R':
        return red_prob
    elif next == 'W':
        return 1 - red_prob
    else:
        raise ValueError('can only predict whether next will be red or white')

def antilaplaces_rule(next, past):
    return 1 - laplaces_rule(next, past)
```
• So just to be clear. There are two things: the prior probability, which is the value P(H|I), and the background information, which is I. So P(H|D,I_1) is different from P(H|D,I_2) because they are updates using the same data and the same hypothesis, but with different partial background information; they are both, however, posterior probabilities. And the prior P(H|I_1) may be equal to P(H|I_2) even if I_1 and I_2 are radically different and produce updates in opposite directions given the same data. P(H|I) is still called the prior probability, but it is something very different from the background information, which is essentially just I.

Is this right? Let me be more specific.

Let’s say my prior information is case 1; then P(second ball is R | first ball is R & case1) = 4/9.

If my prior information was case 2, then P(second ball is R | first ball is R & case2) = 2/3 [by the rule of succession],

and P(first ball is R | case1) = 50% = P(first ball is R | case2).

This is why different prior information can make you learn in different directions, even if two prior informations produce the same prior probability?

Please let me know if I am making any sort of mistake. Or if I got it right, either way.

• You got it right. The three different cases correspond to different joint distributions over sequences of outcomes. Prior information that one of the cases obtains amounts to picking one of these distributions (of course, one can also have weighted combinations of these distributions if there is uncertainty about which case obtains). It turns out that in this example, if you add together the probabilities of all the sequences that have a red ball in the second position, you will get 0.5 for each of the three distributions. So equal prior probabilities. But even though the terms sum to 0.5 in all three cases, the individual terms will not be the same. For instance, prior information of case 1 would assign a different probability to RRRRR (0.004) than prior information of case 2 (0.167).

So the prior information is a joint distribution over sequences of outcomes, while the prior probability of the hypothesis is (in this example at least) a marginal distribution calculated from this joint distribution. Since multiple joint distributions can give you the same marginal distribution for some random variable, different prior information can correspond to the same prior probability.

When you restrict attention to those sequences that have a red ball in the first position, and now add together the (appropriately renormalized) joint probabilities of sequences with a red ball in the second position, you don’t get the same number with all three distributions. This corresponds to the fact that the three distributions are associated with different learning rules.
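The marginal claim is easy to check numerically. A sketch recomputing the marginals and joint probabilities directly (names are mine):

```python
def case1(R, W):
    return (5 - R) / (10 - R - W)

def case2(R, W):
    return (R + 1) / (R + W + 2)

def case3(R, W):
    return 0.5

def seq_prob(sequence, rule):
    """Probability the rule assigns to the observed sequence (chain rule)."""
    p, R, W = 1.0, 0, 0
    for ball in sequence:
        p_red = rule(R, W)
        p *= p_red if ball == 'R' else 1.0 - p_red
        R += ball == 'R'
        W += ball == 'W'
    return p

for rule in (case1, case2, case3):
    # Marginal P(second ball is red): sum over both values of the first ball.
    marginal = seq_prob('RR', rule) + seq_prob('WR', rule)
    assert abs(marginal - 0.5) < 1e-12  # the same 0.5 for all three cases

# But the joint probabilities of one specific sequence differ:
p1 = seq_prob('RRRRR', case1)  # 1/252, about 0.004
p2 = seq_prob('RRRRR', case2)  # 1/6, about 0.167
```

Equal marginals, unequal joints: exactly the distinction between prior probability and prior information drawn above.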

• No, really, I really want help. Please help me understand if I am confused, and settle my anxiety if I am not confused.

• One can update one’s beliefs about one’s existing beliefs and the ways in which one learns from experience too – click.

• Under standard assumptions about the drawing process, you only need 10 numbers, not 1024: P(the urn initially contained ten white balls), P(the urn initially contained nine white balls and one red one), P(the urn initially contained eight white balls and two red ones), and so on through P(one white ball and nine red ones). (P(ten red balls) equals 1 minus everything else.) P(RWRWWRWRWW) is then P(4R, 6W) divided by the appropriate binomial coefficient.
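That compression works because each of these processes is exchangeable: every ordering with the same number of red balls gets the same probability. A sketch (the helper names are mine, hypothetical):

```python
from math import comb

def seq_prob_from_counts(count_probs, sequence):
    """P(sequence) from the 10-number representation.

    count_probs[k] = P(the drawn sequence contains exactly k red balls).
    Under exchangeability, each of the C(n, k) orderings is equally likely,
    so divide the count probability by the binomial coefficient.
    """
    k = sequence.count('R')
    n = len(sequence)
    return count_probs[k] / comb(n, k)

# Example with Case 3 (independent fair flips): k reds occur with
# binomial probability C(10, k) / 2**10.
count_probs = [comb(10, k) / 2**10 for k in range(11)]
```

With these count_probs, seq_prob_from_counts(count_probs, 'RWRWWRWRWW') gives 1/1024, as it must when every length-10 sequence is equally likely.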

• So then this initial probability estimate, 0.5, is not repeat not a “prior”.

This really confuses me. Considering the Universe in your example, which consists only of the urn with the balls, wouldn’t one of the prior hypotheses (e.g. case 2) be a prior and have all the necessary information to compute the lookup table?

In other words, aren’t the three following equivalent in the urn-with-balls universe?

1. Hypothesis 2 + Bayesian updating

2. Python program 2

3. The lookup table generated from program 2 + a procedure for calculating conditional probability (e.g. if you want to know the probability that the third ball is red, given that the first two balls drawn were white)

• Unless I am misunderstanding you, yes, that’s precisely the point.

I don’t understand why you are confused, though. None of these are, after all, numbers in (0,1), which would not contain any information as to how you would go about doing your updates given more evidence.