Conditional Independence, and Naive Bayes

Previously I spoke of mutual information between X and Y, I(X;Y), which is the difference between the sum of the entropies of the marginal distributions, H(X) + H(Y), and the entropy of the joint probability distribution, H(X,Y).

I gave the example of a variable X, having eight states 1..8 which are all equally probable if we have not yet encountered any evidence; and a variable Y, with states 1..4, which are all equally probable if we have not yet encountered any evidence. Then if we calculate the marginal entropies H(X) and H(Y), we will find that X has 3 bits of entropy, and Y has 2 bits.

However, we also know that X and Y are both even or both odd; and this is all we know about the relation between them. So for the joint distribution (X,Y) there are only 16 possible states, all equally probable, for a joint entropy of 4 bits. This is a 1-bit entropy defect, compared to 5 bits of entropy if X and Y were independent. This entropy defect is the mutual information—the information that X tells us about Y, or vice versa, so that we are not as uncertain about one after having learned the other.
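
If you'd like to check that arithmetic rather than take my word for it, here is a small Python sketch (the helper function and variable names are my own, purely for illustration) that builds the joint distribution and computes the entropies directly:

```python
from collections import Counter
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return sum(-p * log2(p) for p in dist.values() if p > 0)

# Joint distribution: X in 1..8, Y in 1..4, X and Y share parity,
# with all 16 allowed (x, y) pairs equally probable.
joint = {(x, y): 1/16 for x in range(1, 9) for y in range(1, 5) if x % 2 == y % 2}

# Marginal distributions of X and Y.
p_x, p_y = Counter(), Counter()
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

H_X, H_Y, H_XY = entropy(p_x), entropy(p_y), entropy(joint)
print(H_X, H_Y, H_XY)    # 3.0, 2.0, 4.0 bits
print(H_X + H_Y - H_XY)  # I(X;Y) = 1.0 bit
```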

Suppose, however, that there exists a third variable Z. Z has two states, “even” and “odd”, perfectly correlated to the evenness or oddness of (X,Y). In fact, we’ll suppose that Z is just the question “Are X and Y even or odd?”

If we have no evidence about X and Y, then Z itself necessarily has 1 bit of entropy on the information given. There is 1 bit of mutual information between Z and X, and 1 bit of mutual information between Z and Y. And, as previously noted, 1 bit of mutual information between X and Y. So how much entropy for the whole system (X,Y,Z)? You might naively expect that

H(X,Y,Z) = H(X) + H(Y) + H(Z) - I(X;Z) - I(Z;Y) - I(X;Y)

but this turns out not to be the case.

The joint system (X,Y,Z) only has 16 possible states—since Z is just the question “Are X & Y even or odd?”—so H(X,Y,Z) = 4 bits.

But if you calculate the formula just given, you get

(3 + 2 + 1 − 1 − 1 − 1) bits = 3 bits = WRONG!

Why? Because if you have the mutual information between X and Z, and the mutual information between Z and Y, that may include some of the same mutual information that we’ll calculate exists between X and Y. In this case, for example, knowing that X is even tells us that Z is even, and knowing that Z is even tells us that Y is even, but this is the same information that X would tell us about Y. We double-counted some of our knowledge, and so came up with too little entropy.

The correct formula is (I believe):

H(X,Y,Z) = H(X) + H(Y) + H(Z) - I(X;Z) - I(Z;Y) - I(X;Y | Z)

Here the last term, I(X;Y | Z), means, “the information that X tells us about Y, given that we already know Z”. In this case, X doesn’t tell us anything about Y, given that we already know Z, so the term comes out as zero—and the equation gives the correct answer. There, isn’t that nice?

“No,” you correctly reply, “for you have not told me how to calculate I(X;Y|Z), only given me a verbal argument that it ought to be zero.”

We calculate I(X;Y|Z) just the way you would expect. I(X;Y) = H(X) + H(Y) - H(X,Y), so:

I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z)

And now, I suppose, you want to know how to calculate the conditional entropy? Well, the original formula for the entropy is:

H(S) = Sum i: p(Si)*-log2(p(Si))

If we then learned a new fact Z0, our remaining uncertainty about S would be:

H(S|Z0) = Sum i: p(Si|Z0)*-log2(p(Si|Z0))

So if we’re go­ing to learn a new fact Z, but we don’t know which Z yet, then, on av­er­age, we ex­pect to be around this un­cer­tain of S af­ter­ward:

H(S|Z) = Sum j: (p(Zj) * Sum i: p(Si|Zj)*-log2(p(Si|Zj)))

And that’s how one calcu­lates con­di­tional en­tropies; from which, in turn, we can get the con­di­tional mu­tual in­for­ma­tion.

There are all sorts of ancillary theorems here, like:

H(X|Y) = H(X,Y) - H(Y)

and

if I(X;Z) = 0 and I(Y;X|Z) = 0 then I(X;Y) = 0

but I’m not go­ing to go into those.

“But,” you ask, “what does this have to do with the na­ture of words and their hid­den Bayesian struc­ture?”

I am just so un­speak­ably glad that you asked that ques­tion, be­cause I was plan­ning to tell you whether you liked it or not. But first there are a cou­ple more pre­limi­nar­ies.

You will re­mem­ber—yes, you will re­mem­ber—that there is a du­al­ity be­tween mu­tual in­for­ma­tion and Bayesian ev­i­dence. Mu­tual in­for­ma­tion is pos­i­tive if and only if the prob­a­bil­ity of at least some joint events P(x, y) does not equal the product of the prob­a­bil­ities of the sep­a­rate events P(x)*P(y). This, in turn, is ex­actly equiv­a­lent to the con­di­tion that Bayesian ev­i­dence ex­ists be­tween x and y:

I(X;Y) > 0 =>
P(x,y) != P(x)*P(y)
P(x,y) / P(y) != P(x)
P(x|y) != P(x)

If you’re con­di­tion­ing on Z, you just ad­just the whole deriva­tion ac­cord­ingly:

I(X;Y | Z) > 0 =>
P(x,y|z) != P(x|z)*P(y|z)
P(x,y|z) / P(y|z) != P(x|z)
(P(x,y,z) / P(z)) / (P(y,z) / P(z)) != P(x|z)
P(x,y,z) / P(y,z) != P(x|z)
P(x|y,z) != P(x|z)

Which last line reads “Even knowing Z, learning Y still changes our beliefs about X.”

Conversely, as in our original case of Z being “even” or “odd”, Z screens off X from Y—that is, if we know that Z is “even”, learning that Y is in state 4 tells us nothing more about whether X is 2, 4, 6, or 8. Or if we know that Z is “odd”, then learning that X is 5 tells us nothing more about whether Y is 1 or 3. Learning Z has rendered X and Y conditionally independent.
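
Here is a quick numerical sketch of that screening-off on the same parity example (the helper function is my own, purely for illustration):

```python
from fractions import Fraction

# Joint distribution over (X, Y): the 16 equal-parity pairs, each with probability 1/16.
joint = {(x, y): Fraction(1, 16)
         for x in range(1, 9) for y in range(1, 5) if x % 2 == y % 2}

def prob(pred):
    """Probability of the event defined by pred(x, y)."""
    return sum(p for (x, y), p in joint.items() if pred(x, y))

# Unconditionally, learning Y = 4 is evidence about X:
print(prob(lambda x, y: x == 2))                                         # P(X=2) = 1/8
print(prob(lambda x, y: x == 2 and y == 4) / prob(lambda x, y: y == 4))  # P(X=2 | Y=4) = 1/4

# But conditioned on Z = "even", learning Y = 4 adds nothing more about X.
# (Note that Y = 4 already entails Z = even, so the second conditional is the same event.)
print(prob(lambda x, y: x == 2 and x % 2 == 0) / prob(lambda x, y: x % 2 == 0))  # P(X=2 | Z=even) = 1/4
print(prob(lambda x, y: x == 2 and y == 4) / prob(lambda x, y: y == 4))          # P(X=2 | Y=4, Z=even) = 1/4
```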

Conditional independence is a hugely important concept in probability theory—to cite just one example, without conditional independence, the universe would have no structure.

Today, though, I only intend to talk about one particular kind of conditional independence—the case of a central variable that screens off other variables surrounding it, like a central body with tentacles.

Let there be five variables U, V, W, X, Y; and moreover, suppose that for every pair of these variables, one variable is evidence about the other. If you select U and W, for example, then learning U=U1 will tell you something you didn’t know before about the probability that W=W1.

An unmanageable inferential mess? Evidence gone wild? Not necessarily.

Maybe U is “Speaks a language”, V is “Two arms and ten digits”, W is “Wears clothes”, X is “Poisonable by hemlock”, and Y is “Red blood”. Now if you encounter a thing-in-the-world, that might be an apple and might be a rock, and you learn that this thing speaks Chinese, you are liable to assess a much higher probability that it wears clothes; and if you learn that the thing is not poisonable by hemlock, you will assess a somewhat lower probability that it has red blood.

Now some of these rules are stronger than others. There is the case of Fred, who is missing a finger due to a volcano accident, and the case of Barney the Baby who doesn’t speak yet, and the case of Irving the IRCBot who emits sentences but has no blood. So if we learn that a certain thing is not wearing clothes, that doesn’t screen off everything that its speech capability can tell us about its blood color. If the thing doesn’t wear clothes but does talk, maybe it’s Nude Nellie.

This makes the case more interesting than, say, five integer variables that are all odd or all even, but otherwise uncorrelated. In that case, knowing any one of the variables would screen off everything that knowing a second variable could tell us about a third variable.

But here, we have dependencies that don’t go away as soon as we learn just one variable, as the case of Nude Nellie shows. So is it an unmanageable inferential inconvenience?

Fear not! for there may be some sixth variable Z, which, if we knew it, really would screen off every pair of variables from each other. There may be some variable Z—even if we have to construct Z rather than observing it directly—such that:

p(u|v,w,x,y,z) = p(u|z)
p(v|u,w,x,y,z) = p(v|z)
p(w|u,v,x,y,z) = p(w|z)

Perhaps, given that a thing is “human”, then the probabilities of it speaking, wearing clothes, and having the standard number of fingers, are all independent. Fred may be missing a finger—but he is no more likely to be a nudist than the next person; Nude Nellie never wears clothes, but knowing this doesn’t make it any less likely that she speaks; and Baby Barney doesn’t talk yet, but is not missing any limbs.

This is called the “Naive Bayes” method, because it usually isn’t quite true, but pretending that it’s true can simplify the living daylights out of your calculations. We don’t keep separate track of the influence of clothed-ness on speech capability given finger number. We just use all the information we’ve observed to keep track of the probability that this thingy is a human (or alternatively, something else, like a chimpanzee or robot) and then use our beliefs about the central class to predict anything we haven’t seen yet, like vulnerability to hemlock.

Any observations of U, V, W, X, and Y just act as evidence for the central class variable Z, and then we use the posterior distribution on Z to make any predictions that need making about unobserved variables in U, V, W, X, and Y.
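
Here is a minimal Python sketch of that scheme. The classes, feature names, and every probability below are illustrative numbers I made up, not anything measured; the point is only the shape of the computation: observations update the posterior on the central class, and predictions about unobserved features are routed entirely through that posterior.

```python
from math import prod

# Prior over the central class variable Z (made-up numbers).
priors = {'human': 0.5, 'chimpanzee': 0.25, 'robot': 0.25}

# P(feature is true | class), assumed conditionally independent given the class (made-up numbers).
likelihoods = {
    'human':      {'speaks': 0.95, 'wears_clothes': 0.99, 'ten_digits': 0.98,
                   'hemlock_poisonable': 0.99, 'red_blood': 0.999},
    'chimpanzee': {'speaks': 0.01, 'wears_clothes': 0.02, 'ten_digits': 0.95,
                   'hemlock_poisonable': 0.99, 'red_blood': 0.999},
    'robot':      {'speaks': 0.50, 'wears_clothes': 0.10, 'ten_digits': 0.30,
                   'hemlock_poisonable': 0.01, 'red_blood': 0.01},
}

def posterior(observations):
    """P(class | observed features), treating features as independent given the class."""
    unnorm = {}
    for c, prior in priors.items():
        unnorm[c] = prior * prod(
            likelihoods[c][f] if value else 1 - likelihoods[c][f]
            for f, value in observations.items())
    total = sum(unnorm.values())
    return {c: p / total for c, p in unnorm.items()}

def predict(feature, observations):
    """P(unobserved feature | observations), routed entirely through the class variable."""
    post = posterior(observations)
    return sum(post[c] * likelihoods[c][feature] for c in post)

# Observe that the thing speaks and wears clothes; predict hemlock vulnerability.
obs = {'speaks': True, 'wears_clothes': True}
print(posterior(obs))
print(predict('hemlock_poisonable', obs))
```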

Sound familiar? It should:

[Figure: the “Blegg2” diagram from Neural Categories, showing a central category node connected to each of the observable blegg/rube features.]

As a matter of fact, if you use the right kind of neural network units, this “neural network” ends up exactly, mathematically equivalent to Naive Bayes. The central unit just needs a logistic threshold—an S-curve response—and the weights of the inputs just need to match the logarithms of the likelihood ratios, etcetera. In fact, it’s a good guess that this is one of the reasons why logistic response often works so well in neural networks—it lets the algorithm sneak in a little Bayesian reasoning while the designers aren’t looking.
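
Here is a small sketch of that equivalence for a binary class and binary features; the probabilities are made-up placeholders, and the claim being checked is just that the Naive Bayes posterior equals a sigmoid of a weighted sum whose weights are log likelihood ratios:

```python
from math import log, exp

prior_1 = 0.3           # P(class = 1)  (made-up number)
p1 = [0.9, 0.2, 0.7]    # P(feature_i = 1 | class = 1)  (made-up numbers)
p0 = [0.4, 0.6, 0.1]    # P(feature_i = 1 | class = 0)  (made-up numbers)
x  = [1, 0, 1]          # observed binary features

def nb_posterior(x):
    """Direct Naive Bayes posterior P(class = 1 | x)."""
    def lik(ps):
        out = 1.0
        for p, xi in zip(ps, x):
            out *= p if xi else 1 - p
        return out
    a = prior_1 * lik(p1)
    b = (1 - prior_1) * lik(p0)
    return a / (a + b)

# The same computation as a single logistic ("S-curve") unit: the weights are
# log likelihood ratios, and the bias absorbs the prior odds plus the
# feature-absent likelihood ratios.
weights = [log(p1i / p0i) - log((1 - p1i) / (1 - p0i)) for p1i, p0i in zip(p1, p0)]
bias = (log(prior_1 / (1 - prior_1))
        + sum(log((1 - p1i) / (1 - p0i)) for p1i, p0i in zip(p1, p0)))

def logistic_unit(x):
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 / (1 + exp(-z))    # sigmoid of the weighted sum

print(nb_posterior(x), logistic_unit(x))   # identical posteriors
```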

Just because someone is presenting you with an algorithm that they call a “neural network” with buzzwords like “scruffy” and “emergent” plastered all over it, disclaiming proudly that they have no idea how the learned network works—well, don’t assume that their little AI algorithm really is Beyond the Realms of Logic. For this paradigm of adhockery, if it works, will turn out to have Bayesian structure; it may even be exactly equivalent to an algorithm of the sort called “Bayesian”.

Even if it doesn’t look Bayesian, on the surface.

And then you just know that the Bayesians are going to start explaining exactly how the algorithm works, what underlying assumptions it reflects, which environmental regularities it exploits, where it works and where it fails, and even attaching understandable meanings to the learned network weights.

Disappointing, isn’t it?