Entropy, and Short Codes

Suppose you have a system X that’s equally likely to be in any of 8 possible states:

{X1, X2, X3, X4, X5, X6, X7, X8}

There’s an extraordinarily ubiquitous quantity—in physics, mathematics, and even biology—called entropy; and the entropy of X is 3 bits. This means that, on average, we’ll have to ask 3 yes-or-no questions to find out X’s value. For example, someone could tell us X’s value using this code:

X1: 001 X2: 010 X3: 011 X4: 100
X5: 101 X6: 110 X7: 111 X8: 000

So if I asked “Is the first symbol 1?” and heard “yes”, then asked “Is the second symbol 1?” and heard “no”, then asked “Is the third symbol 1?” and heard “no”, I would know that X was in state 4.
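If you like, the three-question game can be played out in code. (A quick sketch in Python; the code table is the one given above, and the names `CODE` and `identify` are mine, not anything standard.)

```python
# The 3-bit code for X from the table above.
CODE = {"X1": "001", "X2": "010", "X3": "011", "X4": "100",
        "X5": "101", "X6": "110", "X7": "111", "X8": "000"}

def identify(answers):
    """Turn three yes/no answers to 'Is symbol i a 1?' into a state of X."""
    bits = "".join("1" if a else "0" for a in answers)
    return next(state for state, word in CODE.items() if word == bits)

# The example in the text: yes, no, no -> bits "100" -> state 4.
print(identify([True, False, False]))  # X4
```

Three answers always pin down exactly one of the 8 states, which is just what “3 bits of entropy” means here.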

Now suppose that the system Y has four possible states with the following probabilities:

Y1: 1/2 (50%) Y2: 1/4 (25%) Y3: 1/8 (12.5%) Y4: 1/8 (12.5%)

Then the entropy of Y would be 1.75 bits, meaning that we can find out its value by asking 1.75 yes-or-no questions.

What does it mean to talk about asking one and three-fourths of a question? Imagine that we designate the states of Y using the following code:

Y1: 1 Y2: 01 Y3: 001 Y4: 000

First you ask, “Is the first symbol 1?” If the answer is “yes”, you’re done: Y is in state 1. This happens half the time, so 50% of the time, it takes 1 yes-or-no question to find out Y’s state.

Suppose that instead the answer is “No”. Then you ask, “Is the second symbol 1?” If the answer is “yes”, you’re done: Y is in state 2. Y is in state 2 with probability 1/4, and each time Y is in state 2 we discover this fact using two yes-or-no questions, so 25% of the time it takes 2 questions to discover Y’s state.

If the answer is “No” twice in a row, you ask “Is the third symbol 1?” If “yes”, you’re done and Y is in state 3; if “no”, you’re done and Y is in state 4. The 1/8 of the time that Y is in state 3, it takes three questions; and the 1/8 of the time that Y is in state 4, it takes three questions.

(1/2 * 1) + (1/4 * 2) + (1/8 * 3) + (1/8 * 3)
= 0.5 + 0.5 + 0.375 + 0.375
= 1.75.
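The same arithmetic, as a one-line check in Python (the probabilities and question counts are the ones from the walkthrough above):

```python
# Expected number of yes/no questions: each state's probability times
# the number of questions it takes to identify that state.
probs   = {"Y1": 1/2, "Y2": 1/4, "Y3": 1/8, "Y4": 1/8}
lengths = {"Y1": 1,   "Y2": 2,   "Y3": 3,   "Y4": 3}

expected = sum(probs[s] * lengths[s] for s in probs)
print(expected)  # 1.75
```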

The general formula for the entropy of a system S is the sum, over all Si, of −p(Si)*log2(p(Si)).

For example, the log (base 2) of 1/8 is −3. So −(1/8 * −3) = 0.375 is the contribution of state Y4 to the total entropy: 1/8 of the time, we have to ask 3 questions.
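The formula is a one-liner to transcribe. (A sketch in Python; `entropy` is my name for it, and the zero-probability guard is just the usual convention that 0*log(0) contributes nothing.)

```python
import math

def entropy(probs):
    """Entropy in bits: the sum over states of -p * log2(p)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 -- the system Y
print(entropy([1/8] * 8))             # 3.0  -- the system X
```

Note that for Y, the entropy exactly equals the expected number of yes-or-no questions computed above; that is what makes the code for Y perfect.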

You can’t always devise a perfect code for a system, but if you have to tell someone the state of arbitrarily many copies of S in a single message, you can get arbitrarily close to a perfect code. (Google “arithmetic coding” for a simple method.)

Now, you might ask: “Why not use the code 10 for Y4, instead of 000? Wouldn’t that let us transmit messages more quickly?”

But if you use the code 10 for Y4, then when someone answers “Yes” to the question “Is the first symbol 1?”, you won’t know yet whether the system state is Y1 (1) or Y4 (10). In fact, if you change the code this way, the whole system falls apart—because if you hear “1001”, you don’t know if it means “Y4, followed by Y2” or “Y1, followed by Y3.”
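You can see the ambiguity directly with a brute-force parser. (A sketch in Python; `parses` is a hypothetical helper of mine that finds every way to segment a bit string into codewords.)

```python
def parses(bits, code):
    """Return every way to split `bits` into a sequence of codewords."""
    if not bits:
        return [[]]
    results = []
    for state, word in code.items():
        if bits.startswith(word):
            results += [[state] + rest
                        for rest in parses(bits[len(word):], code)]
    return results

prefix_code   = {"Y1": "1", "Y2": "01", "Y3": "001", "Y4": "000"}
modified_code = {"Y1": "1", "Y2": "01", "Y3": "001", "Y4": "10"}

print(parses("1001", prefix_code))    # one parse: ['Y1', 'Y3']
print(parses("1001", modified_code))  # two parses: Y1+Y3 and Y4+Y2
```

With the original code, no codeword is a prefix of another, so every bit string has at most one reading; swap in 10 for Y4 and “1001” becomes genuinely ambiguous.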

The moral is that short words are a conserved resource.

The key to creating a good code—a code that transmits messages as compactly as possible—is to reserve short words for things that you’ll need to say frequently, and use longer words for things that you won’t need to say as often.

When you take this art to its limit, the length of the message you need to describe something corresponds exactly or almost exactly to the negative log of its probability: more probable things get shorter descriptions. This is the Minimum Description Length or Minimum Message Length formalization of Occam’s Razor.

And so even the labels that we use for words are not quite arbitrary. The sounds that we attach to our concepts can be better or worse, wiser or more foolish. Even apart from considerations of common usage!

I say all this, because the idea that “You can X any way you like” is a huge obstacle to learning how to X wisely. “It’s a free country; I have a right to my own opinion” obstructs the art of finding truth. “I can define a word any way I like” obstructs the art of carving reality at its joints. And even the sensible-sounding “The labels we attach to words are arbitrary” obstructs awareness of compactness. Prosody too, for that matter—Tolkien once observed what a beautiful sound the phrase “cellar door” makes; that is the kind of awareness it takes to use language like Tolkien.

The length of words also plays a nontrivial role in the cognitive science of language:

Consider the words “recliner”, “chair”, and “furniture”. Recliner is a more specific category than chair; furniture is a more general category than chair. But the vast majority of chairs have a common use—you use the same sort of motor actions to sit down in them, and you sit down in them for the same sort of purpose (to take your weight off your feet while you eat, or read, or type, or rest). Recliners do not depart from this theme. “Furniture”, on the other hand, includes things like beds and tables which have different uses, and call up different motor functions, from chairs.

In the terminology of cognitive psychology, “chair” is a basic-level category.

People have a tendency to talk, and presumably think, at the basic level of categorization—to draw the boundary around “chairs”, rather than around the more specific category “recliner”, or the more general category “furniture”. People are more likely to say “You can sit in that chair” than “You can sit in that recliner” or “You can sit in that furniture”.

And it is no coincidence that the word for “chair” contains fewer syllables than either “recliner” or “furniture”. Basic-level categories, in general, tend to have short names; and nouns with short names tend to refer to basic-level categories. Not a perfect rule, of course, but a definite tendency. Frequent use goes along with short words; short words go along with frequent use.

Or as Douglas Hofstadter put it, there’s a reason why the English language uses “the” to mean “the” and “antidisestablishmentarianism” to mean “antidisestablishmentarianism” instead of antidisestablishmentarianism other way around.