Mutual Information, and Density in Thingspace

Suppose you have a system X that can be in any of 8 states, which are all equally probable (relative to your current state of knowledge), and a system Y that can be in any of 4 states, all equally probable.

The entropy of X, as defined yesterday, is 3 bits; we’ll need to ask 3 yes-or-no questions to find out X’s exact state. The entropy of Y, as defined yesterday, is 2 bits; we have to ask 2 yes-or-no questions to find out Y’s exact state. This may seem obvious since 2^3 = 8 and 2^2 = 4, so 3 questions can distinguish 8 possibilities and 2 questions can distinguish 4 possibilities; but remember that if the possibilities were not all equally likely, we could use a more clever code to discover Y’s state using e.g. 1.75 questions on average. In this case, though, X’s probability mass is evenly distributed over all its possible states, and likewise Y, so we can’t use any clever codes.
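This is easy to check numerically. Here is a minimal Python sketch (mine, not part of the original argument), using only the standard entropy formula H = -Σ p log2(p); the distributions are the ones described above, including yesterday’s uneven distribution that averages 1.75 questions.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum of p * log2(p)."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1/8] * 8))             # H(X): 3.0 bits
print(entropy([1/4] * 4))             # H(Y): 2.0 bits
print(entropy([1/2, 1/4, 1/8, 1/8])) # yesterday's uneven Y: 1.75 bits
```
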

What is the entropy of the combined system (X,Y)?

You might be tempted to answer, “It takes 3 questions to find out X, and then 2 questions to find out Y, so it takes 5 questions total to find out the state of X and Y.”

But what if the two variables are entangled, so that learning the state of Y tells us something about the state of X?

In particular, let’s suppose that X and Y are either both odd, or both even.

Now if we receive a 3-bit message (ask 3 questions) and learn that X is in state 5, we know that Y is in state 1 or state 3, but not state 2 or state 4. So the single additional question “Is Y in state 3?”, answered “No”, tells us the entire state of (X,Y): X=X5, Y=Y1. And we learned this with a total of 4 questions.

Conversely, if we learn that Y is in state 4 using two questions, it will take us only an additional two questions to learn whether X is in state 2, 4, 6, or 8. Again, four questions to learn the state of the joint system.

The mutual information of two variables is defined as the difference between the entropy of the joint system and the entropy of the independent systems: I(X;Y) = H(X) + H(Y) - H(X,Y).

Here there is one bit of mutual information between the two systems: Learning X tells us one bit of information about Y (cuts down the space of possibilities from 4 to 2, a factor-of-2 decrease in the volume) and learning Y tells us one bit of information about X (cuts down the possibility space from 8 to 4).
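To make this concrete in code (a sketch, not from the post itself): the parity constraint means only 16 of the 32 conceivable (X,Y) pairs are possible, each equally likely, so H(X,Y) is 4 bits and the mutual information comes out to exactly 1 bit.

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

h_x = entropy([1/8] * 8)    # 3 bits for X alone
h_y = entropy([1/4] * 4)    # 2 bits for Y alone

# X and Y share parity, so only 8 * 4 / 2 = 16 joint states are
# possible, each with probability 1/16: H(X,Y) = 4 bits.
h_xy = entropy([1/16] * 16)

print(h_x + h_y - h_xy)     # I(X;Y) = 1.0 bit
```
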

What about when probability mass is not evenly distributed? Yesterday, for example, we discussed the case in which Y had the probabilities 1/2, 1/4, 1/8, 1/8 for its four states. Let us take this to be our probability distribution over Y, considered independently—if we saw Y, without seeing anything else, this is what we’d expect to see. And suppose the variable Z has two states, 1 and 2, with probabilities 3/8 and 5/8 respectively.

Then if and only if the joint distribution of Y and Z is as follows, there is zero mutual information between Y and Z:

Z1Y1: 3/16 Z1Y2: 3/32 Z1Y3: 3/64 Z1Y4: 3/64
Z2Y1: 5/16 Z2Y2: 5/32 Z2Y3: 5/64 Z2Y4: 5/64

This distribution obeys the law:

P(Y,Z) = P(Y)P(Z)

For example, P(Z1Y2) = P(Z1)P(Y2) = 3/8 * 1/4 = 3/32.

And observe that we can recover the marginal (independent) probabilities of Y and Z just by looking at the joint distribution:

P(Y1) = total probability of all the different ways Y1 can happen
= P(Z1Y1) + P(Z2Y1)
= 3/16 + 5/16
= 1/2.

So, just by inspecting the joint distribution, we can determine whether the marginal variables Y and Z are independent; that is, whether the joint distribution factors into the product of the marginal distributions; whether, for all Y and Z, P(Y,Z) = P(Y)P(Z).
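As a quick sanity check (my own sketch, not part of the original argument), we can build the factorized joint distribution from the marginals, then recover the marginals again by summing rows and columns of the joint:

```python
p_y = [1/2, 1/4, 1/8, 1/8]   # P(Y1..Y4)
p_z = [3/8, 5/8]             # P(Z1), P(Z2)

# Independent joint distribution: P(Zj, Yi) = P(Zj) * P(Yi)
joint = [[pz * py for py in p_y] for pz in p_z]

# Recover the marginals from the joint alone
marg_y = [sum(row[i] for row in joint) for i in range(len(p_y))]
marg_z = [sum(row) for row in joint]

print(marg_y)  # [0.5, 0.25, 0.125, 0.125]
print(marg_z)  # [0.375, 0.625]
```

(These particular probabilities are all exact in binary floating point, which is why the recovered marginals match to the last digit.)
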

This last is significant because, by Bayes’s Rule:

P(Yi,Zj) = P(Yi)P(Zj)
P(Yi,Zj)/P(Zj) = P(Yi)
P(Yi|Zj) = P(Yi)

In English, “After you learn Zj, your belief about Yi is just what it was before.”

So when the distribution factorizes—when P(Y,Z) = P(Y)P(Z)—this is equivalent to “Learning about Y never tells us anything about Z, or vice versa.”

From which you might suspect, correctly, that there is no mutual information between Y and Z. Where there is no mutual information, there is no Bayesian evidence, and vice versa.

Suppose that in the distribution YZ above, we treated each possible combination of Y and Z as a separate event—so that the distribution YZ would have a total of 8 possibilities, with the probabilities shown—and then we calculated the entropy of the distribution YZ the same way we would calculate the entropy of any distribution:

-(3/16 log2(3/16) + 3/32 log2(3/32) + 3/64 log2(3/64) + … + 5/64 log2(5/64))

You would end up with the same total you would get if you separately calculated the entropy of Y plus the entropy of Z. There is no mutual information between the two variables, so our uncertainty about the joint system is not any less than our uncertainty about the two systems considered separately. (I am not showing the calculations, but you are welcome to do them; and I am not showing the proof that this is true in general, but you are welcome to Google on “Shannon entropy” and “mutual information”.)
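If you would rather not do the arithmetic by hand, here is a small Python check (mine, not the post’s) that the entropy of the factorized joint distribution equals H(Y) + H(Z), up to floating-point rounding:

```python
from math import log2, isclose

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p_y = [1/2, 1/4, 1/8, 1/8]
p_z = [3/8, 5/8]

# All 8 joint events Zj,Yi of the factorized distribution
joint = [pz * py for pz in p_z for py in p_y]

print(isclose(entropy(joint), entropy(p_y) + entropy(p_z)))  # True
```
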

What if the joint distribution doesn’t factorize? For example:

Z1Y1: 12/64 Z1Y2: 8/64 Z1Y3: 1/64 Z1Y4: 3/64
Z2Y1: 20/64 Z2Y2: 8/64 Z2Y3: 7/64 Z2Y4: 5/64

If you add up the joint probabilities to get marginal probabilities, you should find that P(Y1) = 1/2, P(Z1) = 3/8, and so on—the marginal probabilities are the same as before.

But the joint probabilities do not always equal the product of the marginal probabilities. For example, the probability P(Z1Y2) equals 8/64, where P(Z1)P(Y2) would equal 3/8 * 1/4 = 6/64. That is, the probability of running into Z1Y2 together is greater than you’d expect based on the probabilities of running into Z1 or Y2 separately.

Which in turn implies:

P(Z1Y2) > P(Z1)P(Y2)
P(Z1Y2)/P(Y2) > P(Z1)
P(Z1|Y2) > P(Z1)

Since there’s an “unusually high” probability for P(Z1Y2)—defined as a probability higher than the marginal probabilities would indicate by default—it follows that observing Y2 is evidence which increases the probability of Z1. And by a symmetrical argument, observing Z1 must favor Y2.

As there are at least some values of Y that tell us about Z (and vice versa), there must be mutual information between the two variables; and so you will find—I am confident, though I haven’t actually checked—that calculating the entropy of YZ yields less total uncertainty than the sum of the independent entropies of Y and Z. H(Y,Z) = H(Y) + H(Z) - I(Y;Z), with all quantities necessarily nonnegative.
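Doing the “I haven’t actually checked” check numerically (my sketch, with the distribution laid out as two rows of four): the marginals are unchanged, but the joint entropy now falls short of H(Y) + H(Z) by a small positive amount—the mutual information.

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# The entangled joint distribution above (in 64ths)
joint = [12/64, 8/64, 1/64, 3/64,    # Z1 paired with Y1..Y4
         20/64, 8/64, 7/64, 5/64]    # Z2 paired with Y1..Y4

p_y = [12/64 + 20/64, 8/64 + 8/64, 1/64 + 7/64, 3/64 + 5/64]
p_z = [sum(joint[:4]), sum(joint[4:])]

mi = entropy(p_y) + entropy(p_z) - entropy(joint)
print(mi > 0)   # True: H(Y,Z) < H(Y) + H(Z)
```
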

(I digress here to remark that the symmetry of the expression for the mutual information shows that Y must tell us as much about Z, on average, as Z tells us about Y. I leave it as an exercise to the reader to reconcile this with anything they were taught in logic class about how, if all ravens are black, being allowed to reason Raven(x)->Black(x) doesn’t mean you’re allowed to reason Black(x)->Raven(x). How different seem the symmetrical probability flows of the Bayesian, from the sharp lurches of logic—even though the latter is just a degenerate case of the former.)

“But,” you ask, “what has all this to do with the proper use of words?”

In Empty Labels and then Replace the Symbol with the Substance, we saw the technique of replacing a word with its definition—the example being given:

All [mortal, ~feathers, bipedal] are mortal.
Socrates is a [mortal, ~feathers, bipedal].
Therefore, Socrates is mortal.

Why, then, would you even want to have a word for “human”? Why not just say “Socrates is a mortal featherless biped”?

Because it’s helpful to have shorter words for things that you encounter often. If your code for describing single properties is already efficient, then there will not be an advantage to having a special word for a conjunction—like “human” for “mortal featherless biped”—unless things that are mortal and featherless and bipedal, are found more often than the marginal probabilities would lead you to expect.

In efficient codes, word length corresponds to probability—so the code for Z1Y2 will be just as long as the code for Z1 plus the code for Y2, unless P(Z1Y2) > P(Z1)P(Y2), in which case the code for the word can be shorter than the codes for its parts.
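In an ideal code, a word for an event of probability p costs about -log2(p) bits. A tiny illustration of the claim above (a sketch under that idealization, ignoring rounding to whole bits), using the entangled distribution from earlier:

```python
from math import log2

def codelength(p):
    return -log2(p)   # ideal length in bits for an event of probability p

# From the entangled distribution: P(Z1) = 3/8, P(Y2) = 1/4, P(Z1Y2) = 8/64
two_parts = codelength(3/8) + codelength(1/4)   # ~3.415 bits for the parts
joint_word = codelength(8/64)                   # 3.0 bits for one joint word
print(joint_word < two_parts)                   # True: the joint word wins
```
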

And this in turn corresponds exactly to the case where we can infer some of the properties of the thing, from seeing its other properties. It must be more likely than the default that featherless bipedal things will also be mortal.

Of course the word “human” really describes many, many more properties—when you see a human-shaped entity that talks and wears clothes, you can infer whole hosts of biochemical and anatomical and cognitive facts about it. To replace the word “human” with a description of everything we know about humans would require us to spend an inordinate amount of time talking. But this is true only because a featherless talking biped is far more likely than default to be poisonable by hemlock, or have broad nails, or be overconfident.

Having a word for a thing, rather than just listing its properties, is a more compact code precisely in those cases where we can infer some of those properties from the other properties. (With the exception perhaps of very primitive words, like “red”, that we would use to send an entirely uncompressed description of our sensory experiences. But by the time you encounter a bug, or even a rock, you’re dealing with nonsimple property collections, far above the primitive level.)

So having a word “wiggin” for green-eyed black-haired people, is more useful than just saying “green-eyed black-haired person”, precisely when:

1. Green-eyed people are more likely than average to be black-haired (and vice versa), meaning that we can probabilistically infer green eyes from black hair or vice versa; or

2. Wiggins share other properties that can be inferred at greater-than-default probability. In this case we have to separately observe the green eyes and black hair; but then, after observing both these properties independently, we can probabilistically infer other properties (like a taste for ketchup).

One may even consider the act of defining a word as a promise to this effect. Telling someone, “I define the word ‘wiggin’ to mean a person with green eyes and black hair”, by Gricean implication, asserts that the word “wiggin” will somehow help you make inferences / shorten your messages.

If green eyes and black hair have no greater than default probability to be found together, nor does any other property occur at greater than default probability along with them, then the word “wiggin” is a lie: the word claims that certain people are worth distinguishing as a group, but they’re not.

In this case the word “wiggin” does not help describe reality more compactly—it is not defined by someone sending the shortest message—it has no role in the simplest explanation. Equivalently, the word “wiggin” will be of no help to you in doing any Bayesian inference. Even if you do not call the word a lie, it is surely an error.

And the way to carve reality at its joints, is to draw your boundaries around concentrations of unusually high probability density in Thingspace.

• Even an optimal language would not be one designed to minimize average message length, because some messages are more urgent than others, even if relatively uncommon; e.g., messages about tigers.

• I’m wondering: if a combination is so rare as to be odd, is it worth naming? E.g. wigger, or wangster. Wouldn’t it be useful precisely because we don’t expect it?


• So, hold on, if you wrote this in 2008, why the hell did you keep writing this blog instead of publishing at least one of what were eventually numerous papers on information-theoretic clustering with mutual-information measurements? Some of those didn’t even come out until 2012 or 2014 or so, so it’s not like you wouldn’t have had time to publish a solid revision to MI-clustering if you came up with a good algorithm.

• Just so you know, there are two columns of Y subscript 3s in the first joint distribution.

• This typo is still there.

Then if and only if the joint distribution of Y and Z is as follows, there is zero mutual information between Y and Z:

Z1Y1: 3/16 Z1Y2: 3/32 Z1Y3: 3/64 Z1Y3: 3/64

Z2Y1: 5/16 Z2Y2: 5/32 Z2Y3: 5/64 Z2Y3: 5/64

Fourth column has misnumbered subscripts.

• Green-eyed people are more likely than average to be black-haired (and vice versa), meaning that we can probabilistically infer green eyes from black hair or vice versa

There is nothing in the mind that is not first in the census.

• Have a look at the caption here

That’s what happens to you when you insist on being the exception to the rule!

• Since we are resting all our language construction and reasoning on thingspace, there are a few things that need to be defined.

What is the distance metric for thingspace? How is thingspace extended?

• You’ve forgotten one important caveat in the phrase “And the way to carve reality at its joints, is to draw your boundaries around concentrations of unusually high probability density in Thingspace.” The important caveat is: ‘boundaries around where concentrations of unusually high probability density lie, to the best of our knowledge and belief’. All the imperfections in categorisation in existing languages come from that limitation. Other problems in categorisation, like those of Antonio in ‘The Merchant of Venice’, or those of the founding fathers who wrote that it is ‘self-evident that all men were created equal’ but at the same time were slave owners, do not come from language problems in categorisation—they would have acknowledged that Shylock or the slaves were human—but from different types of cognitive compromise. Apart from that, it’s an intellectually satisfying approach, and you might, if you persevere, end up with a poor relation to an existing language. Why a poor relation? Because it would lack nuance, ambiguity, and redundancy, which are the roots of poetry. It would also lack words for the surprising but significant improbable phenomenon. Like genius, or albino. Then again, once you get around to saying you will have words for significant low hills of probability, the whole argument blows away. Bon courage.

• “The important caveat is: ‘boundaries around where concentrations of unusually high probability density lie, to the best of our knowledge and belief’.”

I would call the above an instance of the Mind Projection Fallacy, as you seem to be assuming a probability density that is a property of the physical world, and which we are trying to ascertain. But probabilities are properties of our minds (or ideal, perfectly rational minds), not of the exterior world, and a probability distribution is simply an entity to describe our state of information; it is “the best of our knowledge and belief”.

• This is a brilliant essay. One of the best in the sequences, I think.

• Erratum: In the first example of the YZ joint distribution, the last column should list Z1Y4 and Z2Y4 instead of Z1Y3 and Z2Y3.

• Having a word [...] is a more compact code precisely in those cases where we can infer some of those properties from the other properties. (With the exception perhaps of very primitive words, like “red” [...]).

Remember that mutual information is symmetric. If some things have the property of being red, then “red” has the property of being a property of those things. Saying “blood is red” is really saying “remember that visual experience that you get when you look at certain roses, apples, peppers, lipsticks, and English buses and phone booths? The same happens with blood.” If I give you the list above, can you find (“infer”) more red things? Then “red” is a good word.

But do note that this is a dual sense to the one in which “human” is a good word. Most of the properties of humans are statistically necessary for being human: remove any one of them, and the thing is much less likely to be human. “Human” is a good word because these properties are positively correlated. On the other hand, most of the red things are statistically sufficient for being red: take any one of them, and the thing is much more likely to be red. “Red” is a good word because these things are negatively correlated—they are a bunch of distinct things with a shared aspect.

• Gordon,

I’d hope they weren’t so hopelessly ‘overtrained’ that they wouldn’t be able to step back from their P’s and parentheses and ask themselves whether they really think that a black object cannot be a raven.

If it’s a raven, it’s black. If it ain’t black, it ain’t a raven.

• I agree that it makes no sense, but as I was writing the comment I figured I would take you down the wrong path of what someone might naively think and then correct it. I think that someone who was overly trained in logic and not in probability might assume that if Raven(x)-->Black(x) being true leads to P(B|R) = 1, they might reason that since the reverse implication Black(x)-->Raven(x) is false, it leads to P(R|B) = 0. But based on the comments above, maybe only an ancient Greek philosopher would be inclined to make such a mistake.

• Gordon, I fixed the Z1/Y2 swap.

“Vice versa” seems to have been interpreted ambiguously, so I substituted “doesn’t mean you’re allowed to reason Black(x)->Raven(x)”, which was what I meant.

Gordon, the whole business about P(R|B) = 0 makes no sense to me, and I suspect that it makes no sense even in principle. “If we learn that something is black, we know it cannot possibly be a raven”?

• “Vice versa” would be the contrapositive, which is NonBlack(x)->NonRaven(x), which is true iff R(x)->B(x) is true, no?

• I’ve no doubt got the wrong end of the stick here, but why P(R|B)=0? Surely the probability that a black thing is a raven is nonzero?

• Hopefully not taking away anyone’s fun here, but to reconcile Raven(x)->Black(x) but not vice versa, what this statement wants to say, letting P(R) and P(B) be the probabilities of raven and black, respectively, is P(R|B)=0 and P(B|R)=1, which gives us that

P(R|B) = 0
P(RB)/P(B) = 0
P(RB) = 0

and

P(B|R) = 1
P(BR)/P(R) = 1
P(BR) = P(R)

But of course this leads to a contradiction (P(RB) and P(BR) are the same quantity, so P(R) would have to be 0), so it can’t really be true that Black(x)-/->Raven(x), can it? Sure it can, because what is really meant by the failed implication (-/->) is not P(R|B) = 0 but P(R|B) < 1. But in logic we often forget this, because anything with a probability less than 1 is assigned a truth value of false.

Logic has its value, since sometimes you want to prove something is true 100% of the time, but this is generally only possible in pure mathematics. If you try to do it elsewhere you’ll get exceptions (e.g. albino ravens). So leave logic to mathematicians; you should use Bayesian inference.

• I believe you made a slight typo, Eli.

You said: “Since there’s an “unusually high” probability for P(Z1Y2)—defined as a probability higher than the marginal probabilities would indicate by default—it follows that observing Z1 is evidence which increases the probability of Y2. And by a symmetrical argument, observing Y2 must favor Z1.”

But I think what you meant was “Since there’s an “unusually high” probability for P(Z1Y2)—defined as a probability higher than the marginal probabilities would indicate by default—it follows that observing Y2 is evidence which increases the probability of Z1. And by a symmetrical argument, observing Z1 must favor Y2.”

Nothing you said was untrue, but the implication of what you wrote doesn’t match up with the example you actually gave just above that text.

• While it is true that you don’t need a metric to draw a boundary, I personally need a metric to be able to envision high concentrations of probability density.

A concentration implies a region, which implies a metric space. While your sphering of the space normalises it somewhat and deals with part of the trouble, it still skips over the question of metric space. For example, is (2, 2, 2) closer to (1, 1, 1) than (4, 1, 1)? If those were coordinates of positions in three-dimensional space you would want to use the Euclidean metric, i.e. d = ((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)^(1/2); or that might not be appropriate, and you would have to use city-block distances and put them equally far away (if they were average energy usage, weight, and how many copies of the gene for green eyes it had).

• Will, thingspace may not need a distance metric, depending on how you draw your boundaries, which are not necessarily surfaces containing volumes of constant density. For example, a class in Naive Bayes / a neural network of type 2 also slices up thingspace. More about this shortly. But if you’re interested in the general topic, I believe that in the field of statistical learning, for algorithms that actually do depend on distance metrics, the standard cheap trick is to “sphere” the space by making the standard deviation equal 1 in all directions. An ad-hoc technique but apparently a useful one, though it has all the flaws you would expect.

tcpkac, see Kenneway’s response.

• tcpkac wrote: ‘boundaries around where concentrations of unusually high probability density lie, to the best of our knowledge and belief’

The “probability density” is already the best of our knowledge and belief, unless Eliezer has converted to frequentism.

• tcpkac: The important caveat is: ‘boundaries around where concentrations of unusually high probability density lie, to the best of our knowledge and belief’. All the imperfections in categorisation in existing languages come from that limitation.

This strikes me as a rather bold statement, but “to the best of our knowledge and belief” might be fuzzy enough to make it true. Some specific factors that distort our language (and consequently our thinking) might be:

• Probability shifts in thingspace invalidating previously useful clusterings. Natural languages need time to adapt, and dictionary writers tend to be conservative.

• Cognitive biases that distort our perception of thingspace. Very on topic here, I suppose. ^_^

• Manipulation (intended and unintended). Humans treat articulations from other humans as evidence. That can go so far that authentic contrary evidence is explained away using confirmation bias.

Other problems in categorisation, [...] do not come from language problems in categorisation, [...] but from different types of cognitive compromise.

Well, lack of consistency in important matters seems to me to be a rather bad sign.

It would also lack words for the surprising but significant improbable phenomenon. Like genius, or albino. Then again, once you get around to saying you will have words for significant low hills of probability, the whole argument blows away.

I don’t think so. Once the most significant hills have been named, we go on and name the next significant hills. We just choose longer names.

• We have a thousand words for sorrow: http://rhhardin.home.mindspring.com/sorrow.txt

I don’t know if that affects the theory.

(computer clustering a short distance down paths of a thesaurus)

• Including: “twitter”, “altruism”, “trust”, “start” and “curiosity”, apparently?

• We’ll ignore the existence of albino ravens for the sake of argument.