Mutual Information, and Density in Thingspace

Suppose you have a system X that can be in any of 8 states, which are all equally probable (relative to your current state of knowledge), and a system Y that can be in any of 4 states, all equally probable.

The entropy of X, as defined yesterday, is 3 bits; we’ll need to ask 3 yes-or-no questions to find out X’s exact state. The entropy of Y, as defined yesterday, is 2 bits; we have to ask 2 yes-or-no questions to find out Y’s exact state. This may seem obvious since 2³ = 8 and 2² = 4, so 3 questions can distinguish 8 possibilities and 2 questions can distinguish 4 possibilities; but remember that if the possibilities were not all equally likely, we could use a more clever code to discover Y’s state using e.g. 1.75 questions on average. In this case, though, X’s probability mass is evenly distributed over all its possible states, and likewise Y, so we can’t use any clever codes.
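
(If you would like to check these figures numerically, here is a minimal Python sketch; the entropy helper below is just yesterday's definition written out, and its name is merely illustrative.)

    from math import log2

    def entropy(probs):
        # Shannon entropy in bits: -sum of p * log2(p), skipping zero-probability states
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy([1/8] * 8))              # X: 8 equally probable states -> 3.0 bits
    print(entropy([1/4] * 4))              # Y: 4 equally probable states -> 2.0 bits
    print(entropy([1/2, 1/4, 1/8, 1/8]))   # yesterday's uneven Y -> 1.75 bits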

What is the entropy of the combined system (X,Y)?

You might be tempted to answer, “It takes 3 questions to find out X, and then 2 questions to find out Y, so it takes 5 questions total to find out the state of X and Y.”

But what if the two variables are entangled, so that learning the state of Y tells us something about the state of X?

In particular, let’s suppose that X and Y are either both odd, or both even.

Now if we receive a 3-bit message (ask 3 questions) and learn that X is in state 5, we know that Y is in state 1 or state 3, but not state 2 or state 4. So the single additional question “Is Y in state 3?”, answered “No”, tells us the entire state of (X,Y): X=X5, Y=Y1. And we learned this with a total of 4 questions.

Conversely, if we learn that Y is in state 4 using two questions, it will take us only an additional two questions to learn whether X is in state 2, 4, 6, or 8. Again, four questions to learn the state of the joint system.

The mutual information of two variables is defined as the difference between the sum of the entropies of the independent systems and the entropy of the joint system: I(X;Y) = H(X) + H(Y) - H(X,Y).

Here there is one bit of mutual information between the two systems: Learning X tells us one bit of information about Y (cuts down the space of possibilities from 4 to 2, a factor-of-2 decrease in the volume) and learning Y tells us one bit of information about X (cuts down the possibility space from 8 to 4).
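
(Here is a small, purely illustrative Python sketch that enumerates the sixteen equally probable parity-matched pairs and computes these quantities directly:)

    from math import log2

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # X in 1..8, Y in 1..4, constrained to be both odd or both even;
    # all allowed pairs are equally probable.
    pairs = [(x, y) for x in range(1, 9) for y in range(1, 5) if x % 2 == y % 2]
    joint = {pair: 1 / len(pairs) for pair in pairs}   # 16 pairs, each with probability 1/16

    h_xy = entropy(joint.values())
    h_x = entropy([sum(p for (x, y), p in joint.items() if x == i) for i in range(1, 9)])
    h_y = entropy([sum(p for (x, y), p in joint.items() if y == j) for j in range(1, 5)])

    print(h_x, h_y, h_xy)      # 3.0 2.0 4.0 bits
    print(h_x + h_y - h_xy)    # I(X;Y) = 1.0 bit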

What about when probability mass is not evenly distributed? Yesterday, for example, we discussed the case in which Y had the probabilities 1/2, 1/4, 1/8, 1/8 for its four states. Let us take this to be our probability distribution over Y, considered independently—if we saw Y, without seeing anything else, this is what we’d expect to see. And suppose the variable Z has two states, 1 and 2, with probabilities 3/8 and 5/8 respectively.

Then if and only if the joint distribution of Y and Z is as follows, there is zero mutual information between Y and Z:

Z1Y1: 3/16  Z1Y2: 3/32  Z1Y3: 3/64  Z1Y4: 3/64
Z2Y1: 5/16  Z2Y2: 5/32  Z2Y3: 5/64  Z2Y4: 5/64

This distribution obeys the law:

P(Y,Z) = P(Y)P(Z)

For example, P(Z1Y2) = P(Z1)P(Y2) = 3/8 * 1/4 = 3/32.

And observe that we can recover the marginal (independent) probabilities of Y and Z just by looking at the joint distribution:

P(Y1) = total probability of all the different ways Y1 can happen
= P(Z1Y1) + P(Z2Y1)
= 3/16 + 5/16
= 1/2.
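
(As an illustrative sketch, not anything from the original text: the following Python builds the joint table from the stated marginals and then recovers those marginals by summation.)

    from fractions import Fraction as F

    p_y = {"Y1": F(1, 2), "Y2": F(1, 4), "Y3": F(1, 8), "Y4": F(1, 8)}
    p_z = {"Z1": F(3, 8), "Z2": F(5, 8)}

    # Independence: every cell is the product of its marginals, P(Z,Y) = P(Z)P(Y).
    joint = {(z, y): p_z[z] * p_y[y] for z in p_z for y in p_y}

    print(joint[("Z1", "Y2")])                                  # 3/32, as in the table
    print(sum(p for (z, y), p in joint.items() if y == "Y1"))   # recovers P(Y1) = 1/2
    print(sum(p for (z, y), p in joint.items() if z == "Z1"))   # recovers P(Z1) = 3/8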

So, just by inspecting the joint distribution, we can determine whether the marginal variables Y and Z are independent; that is, whether the joint distribution factors into the product of the marginal distributions; whether, for all Y and Z, P(Y,Z) = P(Y)P(Z).

This last is significant because, by Bayes’s Rule:

P(Yi,Zj) = P(Yi)P(Zj)
P(Yi,Zj)/P(Zj) = P(Yi)
P(Yi|Zj) = P(Yi)

In English, “After you learn Zj, your belief about Yi is just what it was before.”

So when the distribution factorizes—when P(Y,Z) = P(Y)P(Z)—this is equivalent to “Learning about Y never tells us anything about Z or vice versa.”

From which you might suspect, correctly, that there is no mutual information between Y and Z. Where there is no mutual information, there is no Bayesian evidence, and vice versa.
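
(A short, illustrative check of this equivalence: conditioning on either value of Z leaves every P(Yi) exactly where it was.)

    from fractions import Fraction as F

    p_y = {"Y1": F(1, 2), "Y2": F(1, 4), "Y3": F(1, 8), "Y4": F(1, 8)}
    p_z = {"Z1": F(3, 8), "Z2": F(5, 8)}
    joint = {(z, y): p_z[z] * p_y[y] for z in p_z for y in p_y}

    for z in p_z:
        for y in p_y:
            posterior = joint[(z, y)] / p_z[z]   # P(Yi|Zj) = P(Yi,Zj) / P(Zj)
            assert posterior == p_y[y]           # learning Zj leaves our belief about Yi unchanged
    print("Zero Bayesian evidence between Y and Z.")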

Suppose that in the distribution YZ above, we treated each possible combination of Y and Z as a separate event—so that the distribution YZ would have a total of 8 possibilities, with the probabilities shown—and then we calculated the entropy of the distribution YZ the same way we would calculate the entropy of any distribution:

-[3/16 log2(3/16) + 3/32 log2(3/32) + 3/64 log2(3/64) + … + 5/64 log2(5/64)]

You would end up with the same total you would get if you separately calculated the entropy of Y plus the entropy of Z. There is no mutual information between the two variables, so our uncertainty about the joint system is not any less than our uncertainty about the two systems considered separately. (I am not showing the calculations, but you are welcome to do them; and I am not showing the proof that this is true in general, but you are welcome to Google on “Shannon entropy” and “mutual information”.)
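
(The calculation, as an illustrative Python sketch, for anyone who would rather not do it by hand:)

    from math import log2

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    p_y = [1/2, 1/4, 1/8, 1/8]
    p_z = [3/8, 5/8]
    joint = [py * pz for py in p_y for pz in p_z]   # the 8 cells of the factorized table

    print(entropy(joint))                # about 2.704 bits
    print(entropy(p_y) + entropy(p_z))   # 1.75 + about 0.954 = the same 2.704 bits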

What if the joint distribution doesn’t factorize? For example:

Z1Y1: 12/64  Z1Y2: 8/64  Z1Y3: 1/64  Z1Y4: 3/64
Z2Y1: 20/64  Z2Y2: 8/64  Z2Y3: 7/64  Z2Y4: 5/64

If you add up the joint probabilities to get marginal probabilities, you should find that P(Y1) = 1/2, P(Z1) = 3/8, and so on—the marginal probabilities are the same as before.

But the joint probabilities do not always equal the product of the marginal probabilities. For example, the probability P(Z1Y2) equals 8/64, where P(Z1)P(Y2) would equal 3/8 * 1/4 = 6/64. That is, the probability of running into Z1Y2 together is greater than you’d expect based on the probabilities of running into Z1 or Y2 separately.

Which in turn implies:

P(Z1Y2) > P(Z1)P(Y2)
P(Z1Y2)/P(Y2) > P(Z1)
P(Z1|Y2) > P(Z1)

Since there’s an “unusually high” probability for P(Z1Y2)—defined as a probability higher than the marginal probabilities would indicate by default—it follows that observing Y2 is evidence which increases the probability of Z1. And by a symmetrical argument, observing Z1 must favor Y2.

As there are at least some values of Y that tell us about Z (and vice versa), there must be mutual information between the two variables; and so you will find—I am confident, though I haven’t actually checked—that calculating the entropy of YZ yields less total uncertainty than the sum of the independent entropies of Y and Z. H(Y,Z) = H(Y) + H(Z) - I(Y;Z), with all quantities necessarily nonnegative.
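
(Here is that check, written as an illustrative sketch: it recovers the marginals, then compares the joint entropy with the sum of the separate entropies.)

    from math import log2

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # The non-factorizing joint distribution, in 64ths.
    joint = {
        ("Z1", "Y1"): 12/64, ("Z1", "Y2"): 8/64, ("Z1", "Y3"): 1/64, ("Z1", "Y4"): 3/64,
        ("Z2", "Y1"): 20/64, ("Z2", "Y2"): 8/64, ("Z2", "Y3"): 7/64, ("Z2", "Y4"): 5/64,
    }

    p_y = {y: sum(p for (z, y2), p in joint.items() if y2 == y) for y in ("Y1", "Y2", "Y3", "Y4")}
    p_z = {z: sum(p for (z2, y), p in joint.items() if z2 == z) for z in ("Z1", "Z2")}

    h_y, h_z, h_yz = entropy(p_y.values()), entropy(p_z.values()), entropy(joint.values())
    print(p_y["Y1"], p_z["Z1"])   # 0.5 and 0.375: the same marginals as before
    print(h_y + h_z - h_yz)       # I(Y;Z), about 0.04 bits: small but strictly positive
    print(h_yz, h_y + h_z)        # the joint entropy is less than the sum of the parts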

(I digress here to remark that the symmetry of the expression for the mutual information shows that Y must tell us as much about Z, on average, as Z tells us about Y. I leave it as an exercise to the reader to reconcile this with anything they were taught in logic class about how, if all ravens are black, being allowed to reason Raven(x)->Black(x) doesn’t mean you’re allowed to reason Black(x)->Raven(x). How different seem the symmetrical probability flows of the Bayesian, from the sharp lurches of logic—even though the latter is just a degenerate case of the former.)

“But,” you ask, “what has all this to do with the proper use of words?”

In Empty Labels and then Replace the Symbol with the Substance, we saw the technique of replacing a word with its definition—the example being given:

All [mortal, ~feathers, bipedal] are mortal.
Socrates is a [mortal, ~feathers, bipedal].
Therefore, Socrates is mortal.

Why, then, would you even want to have a word for “human”? Why not just say “Socrates is a mortal featherless biped”?

Because it’s helpful to have shorter words for things that you encounter often. If your code for describing single properties is already efficient, then there will not be an advantage to having a special word for a conjunction—like “human” for “mortal featherless biped”—unless things that are mortal and featherless and bipedal are found more often than the marginal probabilities would lead you to expect.

In efficient codes, word length corresponds to probability—so the code for Z1Y2 will be just as long as the code for Z1 plus the code for Y2, unless P(Z1Y2) > P(Z1)P(Y2), in which case the code for the word can be shorter than the codes for its parts.
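
(A rough numerical illustration, using the idealized code length of -log2(p) bits for an event of probability p, with the numbers taken from the second table above; the helper name is merely illustrative.)

    from math import log2

    def ideal_code_length(p):
        return -log2(p)   # idealized length in bits for an event of probability p

    p_z1, p_y2 = 3/8, 1/4
    p_z1y2 = 8/64   # from the non-factorizing table; independence would give 6/64

    print(ideal_code_length(p_z1) + ideal_code_length(p_y2))   # about 3.42 bits for the two parts
    print(ideal_code_length(p_z1y2))                           # exactly 3.0 bits for the joint "word"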

And this in turn corresponds exactly to the case where we can infer some of the properties of the thing from seeing its other properties. It must be more likely than the default that featherless bipedal things will also be mortal.

Of course the word “human” really describes many, many more properties—when you see a human-shaped entity that talks and wears clothes, you can infer whole hosts of biochemical and anatomical and cognitive facts about it. To replace the word “human” with a description of everything we know about humans would require us to spend an inordinate amount of time talking. But this is true only because a featherless talking biped is far more likely than default to be poisonable by hemlock, or have broad nails, or be overconfident.

Having a word for a thing, rather than just listing its properties, is a more compact code precisely in those cases where we can infer some of those properties from the other properties. (With the exception perhaps of very primitive words, like “red”, that we would use to send an entirely uncompressed description of our sensory experiences. But by the time you encounter a bug, or even a rock, you’re dealing with nonsimple property collections, far above the primitive level.)

So having a word “wiggin” for green-eyed black-haired people is more useful than just saying “green-eyed black-haired person”, precisely when:

  1. Green-eyed people are more likely than average to be black-haired (and vice versa), meaning that we can probabilistically infer green eyes from black hair or vice versa; or

  2. Wiggins share other properties that can be inferred at greater-than-default probability. In this case we have to separately observe the green eyes and black hair; but then, after observing both these properties independently, we can probabilistically infer other properties (like a taste for ketchup).

One may even consider the act of defining a word as a promise to this effect. Telling someone, “I define the word ‘wiggin’ to mean a person with green eyes and black hair”, by Gricean implication, asserts that the word “wiggin” will somehow help you make inferences / shorten your messages.

If green eyes and black hair have no greater than default probability to be found together, nor does any other property occur at greater than default probability along with them, then the word “wiggin” is a lie: The word claims that certain people are worth distinguishing as a group, but they’re not.

In this case the word “wiggin” does not help describe reality more compactly—it is not defined by someone sending the shortest message—it has no role in the simplest explanation. Equivalently, the word “wiggin” will be of no help to you in doing any Bayesian inference. Even if you do not call the word a lie, it is surely an error.

And the way to carve reality at its joints is to draw your boundaries around concentrations of unusually high probability density in Thingspace.