# Looking for proof of conditional probability

From what I understand, the Kolmogorov axioms make no mention of conditional probability. That is simply defined. If I really want to show how probability actually works, I’m not going to argue “by definition”. Does anyone know a modified form that uses simpler axioms than P(A|B) = P(A∩B)/P(B)?
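For a concrete handle on the formula under discussion, here is a minimal Python sketch (the two-dice setup is an invented illustration, not from the thread) that computes P(A|B) = P(A∩B)/P(B) by direct counting on a finite equiprobable space:

```python
from fractions import Fraction

# Toy sample space: two fair six-sided dice, all 36 outcomes equally likely.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    """P(E) on a finite equiprobable space: |E| / |Omega|."""
    return Fraction(len([w for w in omega if event(w)]), len(omega))

def cond_prob(a, b):
    """P(A|B) = P(A∩B) / P(B), the standard (Kolmogorov) definition."""
    return prob(lambda w: a(w) and b(w)) / prob(b)

# P(first die shows 6 | the sum is 8): the outcomes with sum 8 are
# (2,6), (3,5), (4,4), (5,3), (6,2), and only (6,2) has first die 6.
p = cond_prob(lambda w: w[0] == 6, lambda w: w[0] + w[1] == 8)
print(p)  # 1/5
```

On a finite space, restricting the count to outcomes in B and dividing is all the definition does.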

• Your question doesn’t make any sense to me. I don’t know what it means to “prove” a definition. Did you mean to ask for an (informal) argument that the concept is useful?

My confusion is compounded by the fact that I find P(A|B) = P(A∩B)/P(B) pretty self-explanatory. What seems to be the difficulty?

• The definition is equivalent to having an axiom that states that P(A|B) = P(A∩B)/P(B). That’s not that difficult a concept, but it’s still more advanced than axioms tend to be. Compare it to the other three. It’s like Euclid’s fifth postulate.

• But it’s not an axiom; it’s a definition.

It bothers me that you seem to be under the impression that the equation represents some kind of substantive claim. It doesn’t; it’s just the establishment of a shorthand notation. (It bothers me even more that other commenters don’t seem to be noticing that you’re suffering from a confusion about this.)

A reasonable question to ask might be: “why is the quantity P(A∩B)/P(B) interesting enough to be worth having a shorthand notation for?” But that isn’t what you asked, and the answer wouldn’t consist of a “proof”, so despite its being the closest non-confused question to yours I’m not yet sure whether an attempt to answer it would be helpful to you.

• If you simply view P(A|B) = P(A∩B)/P(B) as a shorthand, with “P(A|B)” as just an arbitrary symbol, then you’re right: it needs no more explanation. But we don’t consider P(A|B) to be just an arbitrary symbol; we think it has a specific meaning, which is “the probability of A given B”. And we think that “P(A∩B)/P(B)” has been chosen to equal “P(A|B)” because it has the properties we feel “the probability of A given B” should have.

I think DanielLC is asking why it is specifically P(A∩B)/P(B), and not some other formula, that has been chosen to correspond with the intuitive notion of “the probability of A given B”.

• In that case, it’s no wonder that I’m having trouble relating, because I didn’t understand what “the probability of A given B” meant until somebody told me it was P(A∩B)/P(B).

There is a larger point here:

But we don’t consider P(A|B) to be just an arbitrary symbol; we think it has a specific meaning, which is “the probability of A given B”. And we think that “P(A∩B)/P(B)” has been chosen to equal “P(A|B)” because it has the properties we feel “the probability of A given B” should have.

In my opinion, an important part of learning to think mathematically is learning not to think like this. That is, not to think of symbols as having a mysterious “meaning” apart from their formal definitions.

This is what causes some people to have trouble accepting that 0.999… = 1: they don’t understand that the question of what 0.999… “is” is simply a matter of definition, and not some mysterious empirical fact.

Paradoxically, this is a way in which my lack of “mathematical ability” is a kind of mathematical ability in its own right, because I often don’t have these mysterious “intuitions” that other people seem to, and thus for me it tends to be second nature that the formal definition of something is what the thing is. For other people, I suppose, thinking this way is a kind of skill they have to consciously learn.

• If I have a set of axioms, and I derive theorems from them, then anything that these axioms are true about, all the theorems are also true about. For example, suppose we took Euclid’s first four postulates and derived a bunch of theorems from them. These postulates are true if you use them to describe figures on a plane, so the theorems are also true about those figures. This also works if it’s on a sphere. It’s not that a “point” means a spot on a plane, or two opposite spots on a sphere; it’s just that the reasoning for abstract points applies to physical models.

Statistics isn’t just those axioms. You might be able to find something else that those axioms apply to. If you do, every statistical theorem will also apply. It still wouldn’t be statistics. Statistics is a specific application. P(A|B) represents something in this application. P(A|B) always equals P(A∩B)/P(B). We can find this out the same way we figured out that P(∅) always equals zero. It’s just that the latter is more obvious than the former, and we may be able to derive the former from something else equally obvious.

• That is, not to think of symbols as having a mysterious “meaning” apart from their formal definitions.

Pure formalism is useful for developing new math, but math cannot be applied to real problems without the assignment of meaning to the variables and equations. Most people are more interested in using math than in what amounts to intellectual play, as enjoyable and potentially useful as that can be. Note that I tend to be more of a formalist myself, which is why I mentioned in an old comment on HN that I tend to learn math concepts fairly easily, but have trouble applying them.

• I find this set of answers being top rated quite disturbing, to be honest.

There are several people in the same main thread pointing out that:

a) There are ways to define it that would obviously violate basic intuition, and hence the disconnection of the definition from intuition does have limits.

b) There are intuitive arguments for the formula that may amount to a proof of it, and hence the whole objection is unfounded.

• I agree with the OP: simply defining a probability concept doesn’t by itself map it to our intuitions about it. For example, if we defined P(A|B) = P(AB)/2P(B), it wouldn’t correspond to our intuitions, and here’s why.

Intuitively, P(A|B) is the probability of A happening if we know that B already happened. In other words, the entirety of the elementary outcome space we’re taking into consideration now are those outcomes that correspond to B. Of those remaining elementary outcomes, the only ones that can lead to A are those that lie in AB. Their measure in absolute terms is equal to P(AB); however, their measure in relation to the elementary outcomes in B is equal to P(AB)/P(B).

Thus, P(A|B) is P(A) as it would be if the only elementary outcomes in existence were those yielding B. P(B) here is a normalizing coefficient: if we were evaluating the conditional probability of A in relation to a set of exhaustive and mutually exclusive experimental outcomes, as is done in Bayesian reasoning, dividing by P(B) means renormalizing the elementary outcome space after B is fixed.
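The renormalization described above can be sketched directly in Python (the weighted four-outcome space is a made-up example): conditioning discards the outcomes outside B and rescales the survivors so their masses again sum to 1, and dividing by P(B) is exactly that rescaling.

```python
from fractions import Fraction

# An unequal-weight finite space: outcome -> probability mass.
space = {"a": Fraction(1, 2), "b": Fraction(1, 4),
         "c": Fraction(1, 8), "d": Fraction(1, 8)}

B = {"a", "c"}          # the conditioning event
A = {"c", "d"}          # the event of interest

# Conditioning: keep only the outcomes in B, then renormalize so the
# surviving masses sum to 1.  Dividing by P(B) is that renormalization.
pB = sum(space[w] for w in B)
restricted = {w: space[w] / pB for w in B}
p_A_given_B = sum(restricted[w] for w in A & B)

# The same number via the shorthand P(A∩B)/P(B):
assert p_A_given_B == sum(space[w] for w in A & B) / pB
print(p_A_given_B)  # 1/5
```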

• Basically, P(A|B) = 0 when A and B are disjoint, and P(A|C)/P(B|C) = P(A)/P(B) when A and B are subsets of C?

It’s better, but it’s still not that good. I have a sneaking suspicion that that’s the best I can do, though.

• Now, a hopefully intuitive explanation of independent events.

By definition, A is independent from B if P(A|B) = P(A), or equivalently P(AB) = P(A)P(B). What does it mean in terms of measures?

It is easy to prove that if A is independent from B, then A is also independent from ~B: P(A|~B) = P(A~B)/P(~B) = (P(A) − P(AB))/(1 − P(B)) = (P(A) − P(A)P(B))/(1 − P(B)) = P(A).

Therefore, A is independent from B iff P(A) = P(AB)/P(B) = P(A~B)/P(~B), which implies that P(AB)/P(A~B) = P(B)/P(~B).

Geometrically, it means that A intersects B and ~B with subsets of measures proportionate to the measures of B and ~B. So if P(B) = 1/4, then 1/4 of A lies in B, and the remaining 3/4 in ~B. And if B and ~B are equally likely, then A lies in equal shares in both.

And from an information-theoretic perspective, this geometric interpretation means that knowing whether B or ~B happened gives us no information about the relative likelihood of A, since it will be equally likely to occur in the renormalized outcome space either way.
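A quick numeric check of the claims above, on an invented two-fair-dice example: if A and B are independent, then conditioning on B or on ~B leaves the probability of A unchanged, and A splits across B and ~B in the ratio P(B) : P(~B).

```python
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]  # two fair dice

def prob(event):
    return Fraction(len([w for w in omega if event(w)]), len(omega))

def cond(a, b):
    return prob(lambda w: a(w) and b(w)) / prob(b)

A = lambda w: w[0] == 6        # first die shows 6
B = lambda w: w[1] % 2 == 0    # second die is even
notB = lambda w: not B(w)

# Independence: conditioning on B or on ~B changes nothing.
assert cond(A, B) == cond(A, notB) == prob(A) == Fraction(1, 6)

# A's mass splits across B and ~B in the ratio P(B) : P(~B).
split = prob(lambda w: A(w) and B(w)) / prob(lambda w: A(w) and notB(w))
assert split == prob(B) / prob(notB)
```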

• I feel like independence really is just a definition, or at least something close to it. I guess P(A|B) = P(A|~B) might be better. Independence is just another way of saying that A is just as likely regardless of B.

• P(A|B) = P(A|~B) is equivalent to the classic definition of independence, and intuitively it means that “whether B happens or not, it doesn’t affect the likelihood of A happening”.

I guess that since other basic probability concepts are defined in terms of set operations (union and intersection), and independence lacks a similarly obvious explanation in terms of sets and measure, I wanted to find one.

• and P(A|C)/P(B|C) = P(A)/P(B) when A and B are subsets of C?

When A is a subset of C, P(A|C) = P(A).

• Um, no?

• ...Oops, yes, said that without thinking. But this

Basically, P(A|B) = 0 when A and B are disjoint, and P(A|C)/P(B|C) = P(A)/P(B) when A and B are subsets of C?

is correct.

• Syntactically, adding conditional probability doesn’t do anything new, besides serving as shorthand for expressions that look like “P(A∩B)/P(B)”.

The issue you’re recognizing is that, semantically, conditional probabilities should mean something and behave coherently with the system of probability you’ve built up to this point.

This is not really a question of probability theory but instead a question of interpretations of probability. As such, it has a very philosophical feel, and I’m not sure it’s going to have the solid answers you’re looking for.

• Short answer: The Kolmogorov axioms are just mathematical. They have nothing inherently to do with the real world. P(A|B) = P(A∩B)/P(B) is the definition of P(A|B). There is a compelling argument that the beliefs of a rational agent should obey the Kolmogorov axioms, with P(A|B) corresponding to the degree of belief in A when B is known.

Long answer: I have a sequence of posts coming up.

• There is a compelling argument that the beliefs of a rational agent should obey the Kolmogorov axioms, with P(A|B) corresponding to the degree of belief in A when B is known.

Are you thinking of this one, or something else?

• I was thinking of the Dutch book argument others have mentioned. But I think you may have misunderstood my point. The original poster has summed up what I wanted to say better than I could:

If I have a set of axioms, and I derive theorems from them, then anything that these axioms are true about, all the theorems are also true about. For example, suppose we took Euclid’s first four postulates and derived a bunch of theorems from them. These postulates are true if you use them to describe figures on a plane, so the theorems are also true about those figures. This also works if it’s on a sphere. It’s not that a “point” means a spot on a plane, or two opposite spots on a sphere; it’s just that the reasoning for abstract points applies to physical models.

Statistics isn’t just those axioms. You might be able to find something else that those axioms apply to. If you do, every statistical theorem will also apply. It still wouldn’t be statistics. Statistics is a specific application. P(A|B) represents something in this application. P(A|B) always equals P(A∩B)/P(B). We can find this out the same way we figured out that P(∅) always equals zero. It’s just that the latter is more obvious than the former, and we may be able to derive the former from something else equally obvious.

I agree with the first paragraph, but the second seems confused. We want to show that P(A|B), defined as P(A∩B)/P(B), tells us how much weight to assign A given B. DanielLC seems to be looking for an a priori mathematical proof of this, but this is futile. We’re trying to show that there is a correspondence between the laws of probability and something in the real world (the optimal beliefs of agents), so we have to mention properties of the real world in our arguments.

• I think you need the common-sense axioms P(B|B) = 1 and P(A|all possibilities) = P(A). Given these, the Venn diagram explanation is pretty straightforward.

• A Venn diagram is not a proof.

• A series of Venn diagrams, with text explanation, is a perfectly fine proof. The pictures can all be translated into statements about sets. The question is what axioms it starts with about P(A|B).

• Umm... I don’t know how rigorous this explanation is, but it might lead you in the right direction... because if you consider the Venn diagram with probability spaces A and B, the probability space of A within B is given by the overlap of the two circles, or P(A∩B). Then you get the probability of landing in that space out of all the space in B... as in, the probability that if you choose circle B, you land in the overlap between A and B.

That’s probably not what you were looking for, but I hope it helps.

• I’m not sure if this is what you’re looking for, but I believe one way you can derive it is via a Dutch book argument.

• ...which is done explicitly in Jay Kadane’s free text, starting on page 29.

• Snapping pointers: direct link. Commercial use forbidden (probably the reason for the pointer chain).

• Ah, thanks. (It’s done semi-explicitly right on the wiki page, though. Or at least an effectively general example is set up and the form of the proof is described. (I.e., the only way that there wouldn’t automatically be a solution to the equations to Dutch-book you would be if the system had a determinant of zero, and requiring that forces the standard rule for conditional probability.))

• I don’t think there’s any simple way to state that with math.

• Why is P(A∩B)/P(B) called conditional probability? Or, let’s turn it the other way round (which is your question): why would conditional probability be given by P(A∩B)/P(B)? I think I was able to develop a proof; see below. Of course, double-checking by others would be required.

First, I would define conditional probability as the “probability of A knowing that B occurs”, which is meaningful and I guess everybody would agree on (see also Wikipedia).

Starting from there, the “probability of A knowing that B occurs” means the probability of A in a restricted space where B is known to occur. This restricted space is simply B. Now, in this restricted space (forget about the larger space Ω for a minute), we want to know the probability of A. Well, the occurrence of A in B is given by A∩B. In case of equiprobability of elementary events, the probability of A∩B in B is card(A∩B)/card(B). Now, this can be rewritten as card(A∩B)/card(Ω) × card(Ω)/card(B) = P(A∩B)/P(B), where P is the probability in the original space Ω. The latter result is Kolmogorov’s definition of conditional probability!

Note that we can also think in terms of areas, writing that the probability of A∩B in B is the area of A∩B divided by the area of B: area(A∩B)/area(B), which can be rewritten the same way, and which works without requiring equiprobability (and the probability space could also be non-discrete).
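The counting argument above can be sketched with a deck of cards (an invented illustration): card(A∩B)/card(B) and P(A∩B)/P(B) are literally the same number once both counts are divided by card(Ω).

```python
from fractions import Fraction

# Deck of 52 cards as the sample space, all equally likely.
ranks = "A23456789TJQK"
suits = "shdc"
omega = [r + s for r in ranks for s in suits]

A = {c for c in omega if c[0] == "A"}   # event: the card is an ace
B = {c for c in omega if c[1] == "h"}   # event: the card is a heart

# Restricted-space count card(A∩B)/card(B) ...
lhs = Fraction(len(A & B), len(B))
# ... equals P(A∩B)/P(B), each probability being count/card(Ω):
rhs = Fraction(len(A & B), len(omega)) / Fraction(len(B), len(omega))
assert lhs == rhs
print(lhs)  # 1/13
```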

• I’m sorry, but the comments on this post all seem to miss the point. Bayes’ theorem can be proven from basic logic; look at places like Khan Academy, or Lukeprog’s Intuitive Explanation of Yudkowsky’s Intuitive Explanation. Once you understand that, the Kolmogorov axioms will be obvious. It’s not assumed.

• Even Wikipedia notes that Cox’s theorem makes another approach possible; that seems like the place to start looking if you want a mathematical proof. So I think Larks came close to the right question (though it may or may not address your concerns).

Cox and Jaynes show that we can start by requiring probability, or the logic of uncertainty, to have certain features. For example, our calculations should have a type of consistency such that it shouldn’t matter to our final answer if we write P(A∩B) or P(B∩A). This, together with the other requirements, ultimately tells us that:

P(A∩B) = P(B)P(A|B) = P(A)P(B|A)

which immediately gives us a possible justification for both the Kolmogorov definition and Bayes’ theorem.
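A small numeric check of the product rule above, on an invented two-dice example: both factorizations of P(A∩B) agree, and rearranging them gives Bayes’ theorem.

```python
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]  # two fair dice

def prob(event):
    return Fraction(len([w for w in omega if event(w)]), len(omega))

def cond(a, b):
    return prob(lambda w: a(w) and b(w)) / prob(b)

A = lambda w: w[0] + w[1] >= 10   # sum is at least 10
B = lambda w: w[0] == 5           # first die shows 5

# The symmetric product rule the consistency requirement forces:
both = prob(lambda w: A(w) and B(w))
assert both == prob(B) * cond(A, B) == prob(A) * cond(B, A)

# Rearranging gives Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A).
assert cond(B, A) == cond(A, B) * prob(B) / prob(A)
```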

• Maybe you’d find it more intuitive that P(A∩B) = P(A|B)·P(B)?