# Conditional Independence, and Naive Bayes

Previously I spoke of mutual information between X and Y, I(X;Y), which is the amount by which the sum of the entropies of the marginal distributions, H(X) + H(Y), exceeds the entropy of the joint probability distribution, H(X,Y).

I gave the ex­am­ple of a vari­able X, hav­ing eight states 1..8 which are all equally prob­a­ble if we have not yet en­coun­tered any ev­i­dence; and a vari­able Y, with states 1..4, which are all equally prob­a­ble if we have not yet en­coun­tered any ev­i­dence. Then if we calcu­late the marginal en­tropies H(X) and H(Y), we will find that X has 3 bits of en­tropy, and Y has 2 bits.

How­ever, we also know that X and Y are both even or both odd; and this is all we know about the re­la­tion be­tween them. So for the joint dis­tri­bu­tion (X,Y) there are only 16 pos­si­ble states, all equally prob­a­ble, for a joint en­tropy of 4 bits. This is a 1-bit en­tropy defect, com­pared to 5 bits of en­tropy if X and Y were in­de­pen­dent. This en­tropy defect is the mu­tual in­for­ma­tion—the in­for­ma­tion that X tells us about Y, or vice versa, so that we are not as un­cer­tain about one af­ter hav­ing learned the other.
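To make the bookkeeping concrete, here is a small Python sketch. The enumeration of states is my own construction of the example above: it counts the 16 equally probable joint states and recovers the 1-bit entropy defect.

```python
from collections import Counter
from math import log2

# The example's joint states: X in 1..8, Y in 1..4, constrained to
# "both even or both odd"; the 16 surviving pairs are equally likely.
states = [(x, y) for x in range(1, 9) for y in range(1, 5) if x % 2 == y % 2]

def entropy(outcomes):
    """Shannon entropy, in bits, of a list of equally weighted outcomes."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return -sum((c / total) * log2(c / total) for c in counts.values())

H_X = entropy([x for x, y in states])   # 3 bits: X is still uniform over 1..8
H_Y = entropy([y for x, y in states])   # 2 bits: Y is still uniform over 1..4
H_XY = entropy(states)                  # 4 bits: 16 equally probable pairs
I_XY = H_X + H_Y - H_XY                 # 1 bit of mutual information
```

The entropy defect falls straight out of the counts: the marginals are unchanged, but the joint distribution has only 16 states instead of 32.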

Sup­pose, how­ever, that there ex­ists a third vari­able Z. Z has two states, “even” and “odd”, perfectly cor­re­lated to the even­ness or odd­ness of (X,Y). In fact, we’ll sup­pose that Z is just the ques­tion “Are X and Y even or odd?”

If we have no ev­i­dence about X and Y, then Z it­self nec­es­sar­ily has 1 bit of en­tropy on the in­for­ma­tion given. There is 1 bit of mu­tual in­for­ma­tion be­tween Z and X, and 1 bit of mu­tual in­for­ma­tion be­tween Z and Y. And, as pre­vi­ously noted, 1 bit of mu­tual in­for­ma­tion be­tween X and Y. So how much en­tropy for the whole sys­tem (X,Y,Z)? You might naively ex­pect that

H(X,Y,Z) = H(X) + H(Y) + H(Z) - I(X;Z) - I(Z;Y) - I(X;Y)

but this turns out not to be the case.

The joint sys­tem (X,Y,Z) only has 16 pos­si­ble states—since Z is just the ques­tion “Are X & Y even or odd?”—so H(X,Y,Z) = 4 bits.

But if you calcu­late the for­mula just given, you get

(3 + 2 + 1 − 1 − 1 − 1)bits = 3 bits = WRONG!

Why? Be­cause if you have the mu­tual in­for­ma­tion be­tween X and Z, and the mu­tual in­for­ma­tion be­tween Z and Y, that may in­clude some of the same mu­tual in­for­ma­tion that we’ll calcu­late ex­ists be­tween X and Y. In this case, for ex­am­ple, know­ing that X is even tells us that Z is even, and know­ing that Z is even tells us that Y is even, but this is the same in­for­ma­tion that X would tell us about Y. We dou­ble-counted some of our knowl­edge, and so came up with too lit­tle en­tropy.

The cor­rect for­mula is (I be­lieve):

H(X,Y,Z) = H(X) + H(Y) + H(Z) - I(X;Z) - I(Z;Y) - I(X;Y | Z)

Here the last term, I(X;Y | Z), means, “the in­for­ma­tion that X tells us about Y, given that we already know Z”. In this case, X doesn’t tell us any­thing about Y, given that we already know Z, so the term comes out as zero—and the equa­tion gives the cor­rect an­swer. There, isn’t that nice?

“No,” you cor­rectly re­ply, “for you have not told me how to calcu­late I(X;Y|Z), only given me a ver­bal ar­gu­ment that it ought to be zero.”

We calcu­late I(X;Y|Z) just the way you would ex­pect. I(X;Y) = H(X) + H(Y) - H(X,Y), so:

I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z)

And now, I sup­pose, you want to know how to calcu­late the con­di­tional en­tropy? Well, the origi­nal for­mula for the en­tropy is:

H(S) = -Sum i: p(Si)*log2(p(Si))

If we then learned a new fact Z0, our re­main­ing un­cer­tainty about S would be:

H(S|Z0) = -Sum i: p(Si|Z0)*log2(p(Si|Z0))

So if we’re go­ing to learn a new fact Z, but we don’t know which Z yet, then, on av­er­age, we ex­pect to be around this un­cer­tain of S af­ter­ward:

H(S|Z) = -Sum j: (p(Zj) * Sum i: p(Si|Zj)*log2(p(Si|Zj)))

And that’s how one calcu­lates con­di­tional en­tropies; from which, in turn, we can get the con­di­tional mu­tual in­for­ma­tion.
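Applying these formulas to the running even/odd example, a Python sketch (reusing the same hypothetical state enumeration as before) shows that conditioning on Z leaves 2 bits of uncertainty in X, 1 bit in Y, and 3 bits in the pair, so the conditional mutual information comes out to exactly zero:

```python
from collections import Counter
from math import log2

# Joint states from the running example: X in 1..8, Y in 1..4, same parity,
# with Z = "even"/"odd" computed from them; all 16 (x, y) pairs equally likely.
states = [(x, y, "even" if x % 2 == 0 else "odd")
          for x in range(1, 9) for y in range(1, 5) if x % 2 == y % 2]

def entropy(outcomes):
    counts = Counter(outcomes)
    n = len(outcomes)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def conditional_entropy(pairs):
    """H(S|Z) for a list of (s, z) outcomes: Sum j: p(Zj) * H(S | Z=Zj)."""
    n = len(pairs)
    by_z = Counter(z for s, z in pairs)
    total = 0.0
    for z_j, count in by_z.items():
        total += (count / n) * entropy([s for s, z in pairs if z == z_j])
    return total

H_X_given_Z = conditional_entropy([(x, z) for x, y, z in states])        # 2 bits
H_Y_given_Z = conditional_entropy([(y, z) for x, y, z in states])        # 1 bit
H_XY_given_Z = conditional_entropy([((x, y), z) for x, y, z in states])  # 3 bits
I_XY_given_Z = H_X_given_Z + H_Y_given_Z - H_XY_given_Z                  # 0 bits
```

So I(X;Y|Z) = 2 + 1 - 3 = 0, and the corrected formula gives H(X,Y,Z) = 3 + 2 + 1 - 1 - 1 - 0 = 4 bits, as it should.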

There are all sorts of an­cillary the­o­rems here, like:

H(X|Y) = H(X,Y) - H(Y)

and

if I(X;Z) = 0 and I(Y;X|Z) = 0 then I(X;Y) = 0

but I’m not go­ing to go into those.

“But,” you ask, “what does this have to do with the na­ture of words and their hid­den Bayesian struc­ture?”

I am just so un­speak­ably glad that you asked that ques­tion, be­cause I was plan­ning to tell you whether you liked it or not. But first there are a cou­ple more pre­limi­nar­ies.

You will re­mem­ber—yes, you will re­mem­ber—that there is a du­al­ity be­tween mu­tual in­for­ma­tion and Bayesian ev­i­dence. Mu­tual in­for­ma­tion is pos­i­tive if and only if the prob­a­bil­ity of at least some joint events P(x, y) does not equal the product of the prob­a­bil­ities of the sep­a­rate events P(x)*P(y). This, in turn, is ex­actly equiv­a­lent to the con­di­tion that Bayesian ev­i­dence ex­ists be­tween x and y:

I(X;Y) > 0 =>
P(x,y) != P(x)*P(y)
P(x,y) /​ P(y) != P(x)
P(x|y) != P(x)

If you’re con­di­tion­ing on Z, you just ad­just the whole deriva­tion ac­cord­ingly:

I(X;Y | Z) > 0 =>
P(x,y|z) != P(x|z)*P(y|z)
P(x,y|z) /​ P(y|z) != P(x|z)
(P(x,y,z) /​ P(z)) /​ (P(y, z) /​ P(z)) != P(x|z)
P(x,y,z) /​ P(y,z) != P(x|z)
P(x|y,z) != P(x|z)

Which last line reads “Even know­ing Z, learn­ing Y still changes our be­liefs about X.”

Con­versely, as in our origi­nal case of Z be­ing “even” or “odd”, Z screens off X from Y—that is, if we know that Z is “even”, learn­ing that Y is in state 4 tells us noth­ing more about whether X is 2, 4, 6, or 8. Or if we know that Z is “odd”, then learn­ing that X is 5 tells us noth­ing more about whether Y is 1 or 3. Learn­ing Z has ren­dered X and Y con­di­tion­ally in­de­pen­dent.
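A quick numerical check of the screening-off claim, again using my own enumeration of the example's states: once we condition on Z = "even", further learning that Y = 4 leaves the distribution over X untouched.

```python
from collections import Counter

# X in 1..8, Y in 1..4, both even or both odd, all 16 pairs equally likely.
states = [(x, y) for x in range(1, 9) for y in range(1, 5) if x % 2 == y % 2]

def p_x_given(condition):
    """Distribution over X among the states satisfying the given condition."""
    matching = [x for x, y in states if condition(x, y)]
    n = len(matching)
    return {x: c / n for x, c in sorted(Counter(matching).items())}

# Knowing Z = "even" narrows X to {2, 4, 6, 8}, each with probability 1/4:
given_z_even = p_x_given(lambda x, y: x % 2 == 0)
# Additionally learning Y = 4 changes nothing: same distribution over X.
given_z_even_and_y4 = p_x_given(lambda x, y: x % 2 == 0 and y == 4)

assert given_z_even == given_z_even_and_y4  # Z screens off Y from X
```

This is just P(x|y,z) = P(x|z), verified by counting.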

Con­di­tional in­de­pen­dence is a hugely im­por­tant con­cept in prob­a­bil­ity the­ory—to cite just one ex­am­ple, with­out con­di­tional in­de­pen­dence, the uni­verse would have no struc­ture.

To­day, though, I only in­tend to talk about one par­tic­u­lar kind of con­di­tional in­de­pen­dence—the case of a cen­tral vari­able that screens off other vari­ables sur­round­ing it, like a cen­tral body with ten­ta­cles.

Let there be five variables U, V, W, X, Y; and moreover, suppose that for every pair of these variables, one variable is evidence about the other. If you select U and W, for example, then learning U=U1 will tell you something you didn't know before about the probability that W=W1.

An un­man­age­able in­fer­en­tial mess? Ev­i­dence gone wild? Not nec­es­sar­ily.

Maybe U is “Speaks a lan­guage”, V is “Two arms and ten digits”, W is “Wears clothes”, X is “Poi­son­able by hem­lock”, and Y is “Red blood”. Now if you en­counter a thing-in-the-world, that might be an ap­ple and might be a rock, and you learn that this thing speaks Chi­nese, you are li­able to as­sess a much higher prob­a­bil­ity that it wears clothes; and if you learn that the thing is not poi­son­able by hem­lock, you will as­sess a some­what lower prob­a­bil­ity that it has red blood.

Now some of these rules are stronger than oth­ers. There is the case of Fred, who is miss­ing a finger due to a vol­cano ac­ci­dent, and the case of Bar­ney the Baby who doesn’t speak yet, and the case of Irv­ing the IRCBot who emits sen­tences but has no blood. So if we learn that a cer­tain thing is not wear­ing clothes, that doesn’t screen off ev­ery­thing that its speech ca­pa­bil­ity can tell us about its blood color. If the thing doesn’t wear clothes but does talk, maybe it’s Nude Nel­lie.

This makes the case more in­ter­est­ing than, say, five in­te­ger vari­ables that are all odd or all even, but oth­er­wise un­cor­re­lated. In that case, know­ing any one of the vari­ables would screen off ev­ery­thing that know­ing a sec­ond vari­able could tell us about a third vari­able.

But here, we have de­pen­den­cies that don’t go away as soon as we learn just one vari­able, as the case of Nude Nel­lie shows. So is it an un­man­age­able in­fer­en­tial in­con­ve­nience?

Fear not! for there may be some sixth vari­able Z, which, if we knew it, re­ally would screen off ev­ery pair of vari­ables from each other. There may be some vari­able Z—even if we have to con­struct Z rather than ob­serv­ing it di­rectly—such that:

p(u|v,w,x,y,z) = p(u|z)
p(v|u,w,x,y,z) = p(v|z)
p(w|u,v,x,y,z) = p(w|z)

Per­haps, given that a thing is “hu­man”, then the prob­a­bil­ities of it speak­ing, wear­ing clothes, and hav­ing the stan­dard num­ber of fingers, are all in­de­pen­dent. Fred may be miss­ing a finger—but he is no more likely to be a nud­ist than the next per­son; Nude Nel­lie never wears clothes, but know­ing this doesn’t make it any less likely that she speaks; and Baby Bar­ney doesn’t talk yet, but is not miss­ing any limbs.

This is called the “Naive Bayes” method, be­cause it usu­ally isn’t quite true, but pre­tend­ing that it’s true can sim­plify the liv­ing daylights out of your calcu­la­tions. We don’t keep sep­a­rate track of the in­fluence of clothed-ness on speech ca­pa­bil­ity given finger num­ber. We just use all the in­for­ma­tion we’ve ob­served to keep track of the prob­a­bil­ity that this thingy is a hu­man (or al­ter­na­tively, some­thing else, like a chim­panzee or robot) and then use our be­liefs about the cen­tral class to pre­dict any­thing we haven’t seen yet, like vuln­er­a­bil­ity to hem­lock.

Any ob­ser­va­tions of U, V, W, X, and Y just act as ev­i­dence for the cen­tral class vari­able Z, and then we use the pos­te­rior dis­tri­bu­tion on Z to make any pre­dic­tions that need mak­ing about un­ob­served vari­ables in U, V, W, X, and Y.
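As a sketch of that bookkeeping, here is a minimal Naive Bayes classifier in Python. The classes, features, priors, and likelihoods are all made-up illustrative numbers, not anything measured or taken from the post:

```python
# Hypothetical priors and per-class likelihoods, invented for illustration.
priors = {"human": 0.5, "robot": 0.3, "chimp": 0.2}
# P(feature = True | class):
likelihoods = {
    "human": {"speaks": 0.95, "clothes": 0.99, "red_blood": 0.999},
    "robot": {"speaks": 0.80, "clothes": 0.05, "red_blood": 0.001},
    "chimp": {"speaks": 0.01, "clothes": 0.02, "red_blood": 0.999},
}

def posterior(observed):
    """P(class | observed), multiplying likelihoods as if independent given class."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature, value in observed.items():
            p = likelihoods[cls][feature]
            score *= p if value else (1 - p)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

def predict(observed, feature):
    """Predict an unobserved feature via the posterior on the central class."""
    post = posterior(observed)
    return sum(post[cls] * likelihoods[cls][feature] for cls in post)

post = posterior({"speaks": True, "clothes": False})
```

With these particular invented numbers, a thing that talks but wears no clothes comes out mostly "robot" (think Irving the IRCBot), so red blood is predicted to be unlikely; all the observed evidence flows through the single class variable.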

Sound fa­mil­iar? It should: As a mat­ter of fact, if you use the right kind of neu­ral net­work units, this “neu­ral net­work” ends up ex­actly, math­e­mat­i­cally equiv­a­lent to Naive Bayes. The cen­tral unit just needs a lo­gis­tic thresh­old—an S-curve re­sponse—and the weights of the in­puts just need to match the log­a­r­ithms of the like­li­hood ra­tios, etcetera. In fact, it’s a good guess that this is one of the rea­sons why lo­gis­tic re­sponse of­ten works so well in neu­ral net­works—it lets the al­gorithm sneak in a lit­tle Bayesian rea­son­ing while the de­sign­ers aren’t look­ing.
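The claimed equivalence is easy to exhibit for two classes: the Naive Bayes posterior equals a logistic (sigmoid) unit whose bias is the prior log-odds and whose input weights are the log likelihood ratios. The probabilities below are arbitrary illustrative numbers:

```python
from math import log, exp

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

# Hypothetical two-class model: P(feature present | class).
prior_human = 0.5
p_given_human = {"speaks": 0.95, "clothes": 0.99}
p_given_other = {"speaks": 0.40, "clothes": 0.05}

def naive_bayes_posterior(observed):
    """Direct Naive Bayes: multiply prior by per-feature likelihoods."""
    ph, po = prior_human, 1 - prior_human
    for f, v in observed.items():
        ph *= p_given_human[f] if v else 1 - p_given_human[f]
        po *= p_given_other[f] if v else 1 - p_given_other[f]
    return ph / (ph + po)

def logistic_unit(observed):
    """Same computation as a single logistic 'neuron'."""
    t = log(prior_human / (1 - prior_human))  # bias = prior log-odds
    for f, v in observed.items():
        if v:   # weight = log likelihood ratio of the observation
            t += log(p_given_human[f] / p_given_other[f])
        else:
            t += log((1 - p_given_human[f]) / (1 - p_given_other[f]))
    return sigmoid(t)

obs = {"speaks": True, "clothes": False}
assert abs(naive_bayes_posterior(obs) - logistic_unit(obs)) < 1e-9
```

The identity is just algebra: P(human|e) = 1 / (1 + P(other,e)/P(human,e)) = sigmoid(log-odds), and the log-odds decompose into a sum of per-feature terms precisely because Naive Bayes assumes the features independent given the class.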

Just because someone is presenting you with an algorithm that they call a "neural network" with buzzwords like "scruffy" and "emergent" plastered all over it, proclaiming proudly that they have no idea how the learned network works—well, don't assume that their little AI algorithm really is Beyond the Realms of Logic. For this paradigm of adhockery, if it works, will turn out to have Bayesian structure; it may even be exactly equivalent to an algorithm of the sort called "Bayesian".

Even if it doesn’t look Bayesian, on the sur­face.

And then you just know that the Bayesi­ans are go­ing to start ex­plain­ing ex­actly how the al­gorithm works, what un­der­ly­ing as­sump­tions it re­flects, which en­vi­ron­men­tal reg­u­lar­i­ties it ex­ploits, where it works and where it fails, and even at­tach­ing un­der­stand­able mean­ings to the learned net­work weights.

Dis­ap­point­ing, isn’t it?

• Nice post. How­ever:

Please /​please/​ don’t use the “+” sign like that! H(X+Y+Z) should be H(X,Y,Z). “So for the joint dis­tri­bu­tion X+Y there are” should be “So for the joint dis­tri­bu­tion of X and Y there are” etc. I was skim­ming your post, mi­s­un­der­stood your mean­ing en­tirely and started won­der­ing if you had made a mis­take un­til I went back and no­ticed that some of your “+”s meant “X and Y” rather than “the value of X plus the value of Y”. (So for ex­am­ple, when I read “”“ Z has two states, “even” and “odd”, perfectly cor­re­lated to the even­ness or odd­ness of X+Y. In fact, we’ll sup­pose that Z is just the ques­tion “Are X+Y even or odd?” “”” I thought “golly, ‘Are’ X+Y even or odd? Must be a gram­mar mis­take.”) Now granted, it was easy to tell what you re­ally meant af­ter I slowly read through the in­tro­duc­tory para­graph, but please change it, be­cause:

• It will con­fuse math­e­mat­i­ci­ans who are skim­ming be­cause they’ve seen things like this before

• You use the other sense of “+” in your equa­tions, and there’s no ex­cuse for op­er­a­tor over­load­ing if it can be avoided.

• The books I’ve read just say “XY” or “X,Y” if they need a name for the joint vari­able.

• It's not just non-standard, but used inconsistently. You say H(X+Y+Z) in one place and H(X,Y) in another.

• This is called the “Naive Bayes” method, be­cause it usu­ally isn’t quite true, but pre­tend­ing that it’s true can sim­plify the liv­ing daylights out of your calcu­la­tions.

Ha­haha! First thing on LessWrong to re­ally make me laugh out loud :) Good stuff.

[Edit: That’s the laugh­ter of agree­ment and ap­proval of a fun writ­ing style; I should be more ex­plicit on the in­ter­net, given the per­ni­cious amounts of sar­casm that gets tossed around.]

• To the down­vote, in case it wasn’t clear, I was laugh­ing be­cause I agree with the post, and be­cause “sim­plify the liv­ing daylights out of your calcu­la­tions” is just an awe­some phrase. I laugh at things I agree with way more than things I don’t, be­cause the former things ac­tu­ally make me happy. (And the lat­ter kind of laugh­ter, on the rare oc­ca­sion that it hap­pens, I keep to my­self.)

But if the down­vote was for ir­rele­vance, fair enough. I wouldn’t mind be­ing told that ex­press­ing ap­pre­ci­a­tion of writ­ing style alone is frowned upon.

• I wouldn’t mind be­ing told that ex­press­ing ap­pre­ci­a­tion of writ­ing style alone is frowned upon.

It’s a mat­ter of how it’s done. The more an­a­lytic and de­scrip­tive it is of what was good and how it worked, the bet­ter a re­ac­tion it’s likely to get.

I would guess that this was down­voted by some­one mis­read­ing it as an at­tack, in­ter­pret­ing the laugh­ing as con­sid­er­ing it worth ridicule.

• I wouldn’t mind be­ing told that ex­press­ing ap­pre­ci­a­tion of writ­ing style alone is frowned upon

It is frowned upon by some peo­ple, but not by all—cer­tainly not by me. See dis­cus­sion here.

• Thanks for the back­ground… I think for a com­pro­mise, I might just stick to ex­press­ing laugh­ter when I hap­pen to have some­thing of con­tent to say along with it :)

• My God! For months I thought that your posts had no pur­pose, but to see it wrapped up like this is beau­tiful.

• Two thoughts, which aren’t re­ally co­her­ent or in­formed enough to be called ques­tions.

If naive Bayes == neural network, and you need 3 layers of neurons to make a general classifier (hyperplanes → convex hulls → arbitrary hypershapes), do you need 3 layers of naive Bayes?

Would an al­gorithm for in­duc­ing names be: poke around look­ing for things that have mu­tual in­for­ma­tion but are not screened off, and in­duce a name that screens them off. When you find names that have mu­tual in­for­ma­tion, de­cide whether they ought to be merged (a clump of all the pieces has mu­tual in­for­ma­tion) or clus­tered un­der a hi­er­ar­chi­cally higher name (oth­er­wise).

• Fair enough—I was just wor­ried that peo­ple wouldn’t un­der­stand that (X,Y) meant the joint dis­tri­bu­tion.

Not sure if I’ll get a chance to fix this tonight, I’m at the AGI-08 con­fer­ence now and may end up in time/​sleep de­pri­va­tion mode.

• Possible (?) typo: "V is 'Two arms and ten digits'" could indeed be meant as "V is 'Two arms and ten fingers'"

• I’m con­fused :/​

H(X,Y,Z) = H(X,(Y,Z))
= H(X) + H(Y,Z) - I(X;(Y,Z))
= H(X) + (H(Y) + H(Z) - I(Y;Z)) - I(X;(Y,Z))
= H(X) + (H(Y) + H(Z) - I(Y;Z)) - (I(X;Y) + I(X;Z) - I(X;Y;Z))
= H(X) + H(Y) + H(Z) - I(X;Y) - I(Y;Z) - I(X;Z) + I(X;Y;Z)

By the in­clu­sion-ex­clu­sion prin­ci­ple, no?

• This was what I ex­pected to see, and I be­lieve it’s equiv­a­lent to H(X,Y,Z) = H(X) + H(Y) + H(Z) - I(X;Z) - I(Z;Y) - I(X;Y | Z)

It appears that Z is very artificially constructed—Z is exactly I(X;Y) in the example. Therefore, H(X,Y) = H(X,Y,Z). Since the term I(X;Y | Z) is the mutual information between X and Y given Z, that's just 0. There's no new mutual information between X and Y that isn't already in Z. So I believe that we could replace it with +I(X;Y) - I(X;Y;Z), and get inclusion-exclusion.

• This is neat. Two things though:

First a minor math­e­mat­i­cal no­ta­tion quib­ble/​re­quest, speci­fi­cally about minus signs. Could you move them as much to the out­side of a term as pos­si­ble? What I mean is, for in­stance, you’ve got the fol­low­ing: H(S) = Sum i: p(Si)*-log2(p(Si))

Per­son­ally, I think I’d find H(S) = -Sum i: p(Si)*log2(p(Si)) eas­ier to parse. Prob­a­bly just due to what I’m used to see­ing in math­e­mat­i­cal no­ta­tion, but I seem to see the minus in the mid­dle “be­fore” I see the mul­ti­pli­ca­tion, so my ini­tial re­ac­tion is “wait, what? sub­tract­ing those? huh?” be­fore the rest of my brain catches up and sees it’s ac­tu­ally the ex­pected mul­ti­pli­ca­tion by a negated term, just that the nega­tion has been placed in an un­usual spot.

Maybe that’s just me though.

Se­cond, just won­der­ing if there’s a sim­ple gen­eral prin­ci­ple for know­ing when, given a bunch of mu­tu­ally cor­re­lated vari­ables, we can con­struct a cen­tral vari­able like the above that shields ev­ery­thing from ev­ery­thing else (ei­ther com­pletely or at least to a good ex­tent. (Ob­vi­ously this is tied to how well stuff clusters in con­ceptspace, but I meant if there’s any rel­a­tively sim­ple math­e­mat­i­cal crite­ria)

• Fixed.

• You forgot to correct an instance of this:

The joint sys­tem (X,Y,Z) only has 16 pos­si­ble states—since Z is just the ques­tion “Are X+Y even or odd?”—so H(X,Y,Z) = 4 bits.

I find it par­tic­u­larly dis­turb­ing be­cause in this in­stance, X+Y is always even...

• “If the thing doesn’t wear clothes but does talk, maybe it’s Nude Nel­lie.”

A bet­ter ex­pla­na­tion is prob­a­bly a par­rot.