Superexponential Conceptspace, and Simple Words

Thingspace, you might think, is a rather huge space. Much larger than reality, for where reality only contains things that actually exist, Thingspace contains everything that could exist.

Actually, the way I “defined” Thingspace to have dimensions for every possible attribute—including correlated attributes like density and volume and mass—Thingspace may be too poorly defined to have anything you could call a size. But it’s important to be able to visualize Thingspace anyway. Surely, no one can really understand a flock of sparrows if all they see is a cloud of flapping cawing things, rather than a cluster of points in Thingspace.

But as vast as Thingspace may be, it doesn’t hold a candle to the size of Conceptspace.

“Concept”, in machine learning, means a rule that includes or excludes examples. If you see the data 2:+, 3:-, 14:+, 23:-, 8:+, 9:- then you might guess that the concept was “even numbers”. There is a rather large literature (as one might expect) on how to learn concepts from data… given random examples, given chosen examples… given possible errors in classification… and most importantly, given different spaces of possible rules.
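
A concept, in this sense, is nothing more than a membership test. A minimal sketch in Python, using the toy data above:

    # A concept as a rule: a predicate that accepts or rejects examples.
    data = [(2, True), (3, False), (14, True), (23, False), (8, True), (9, False)]

    def is_even(n):
        return n % 2 == 0

    # The guess "even numbers" fits every labeled example seen so far.
    print(all(is_even(n) == label for n, label in data))  # True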

Suppose, for example, that we want to learn the concept “good days on which to play tennis”. The possible attributes of Days are:

Sky: {Sunny, Cloudy, Rainy}
AirTemp: {Warm, Cold}
Humidity: {Normal, High}
Wind: {Strong, Weak}

We’re then presented with the following data, where + indicates a positive example of the concept, and - indicates a negative classification:

+ Sky: Sunny; AirTemp: Warm; Humidity: High; Wind: Strong.
- Sky: Rainy; AirTemp: Cold; Humidity: High; Wind: Strong.
+ Sky: Sunny; AirTemp: Warm; Humidity: High; Wind: Weak.

What should an algorithm infer from this?

A machine learner might represent one concept that fits this data as follows:

Sky: ?; AirTemp: Warm; Humidity: High; Wind: ?

In this format, to determine whether this concept accepts or rejects an example, we compare element-by-element: ? accepts anything, but a specific value accepts only that specific value.

So the concept above will accept only Days with AirTemp=Warm and Humidity=High, but the Sky and the Wind can take on any value. This fits both the negative and the positive classifications in the data so far—though it isn’t the only concept that does so.

We can also simplify the above concept representation to {?, Warm, High, ?}.
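
That element-by-element test is simple enough to write down directly. A minimal sketch in Python (my own illustration, not anything from the literature):

    def matches(concept, example):
        # "?" accepts any value; a specific value accepts only that value.
        return all(c == "?" or c == e for c, e in zip(concept, example))

    concept = ("?", "Warm", "High", "?")
    print(matches(concept, ("Sunny", "Warm", "High", "Strong")))  # True
    print(matches(concept, ("Rainy", "Cold", "High", "Strong")))  # False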

Without going into details, the classic algorithm would be:

  • Maintain the set of the most general hypotheses that fit the data—those that positively classify as many examples as possible, while still fitting the facts.

  • Maintain another set of the most specific hypotheses that fit the data—those that negatively classify as many examples as possible, while still fitting the facts.

  • Each time we see a new negative example, we strengthen all the most general hypotheses as little as possible, so that the new set is again as general as possible while fitting the facts.

  • Each time we see a new positive example, we relax all the most specific hypotheses as little as possible, so that the new set is again as specific as possible while fitting the facts.

  • We continue until we have only a single hypothesis left. This will be the answer if the target concept was in our hypothesis space at all.

In the case above, the set of most general hypotheses would be {?, Warm, ?, ?} and {Sunny, ?, ?, ?}, while the set of most specific hypotheses is the single member {Sunny, Warm, High, ?}.

Any other concept you can find that fits the data will be strictly more specific than one of the most general hypotheses, and strictly more general than the most specific hypothesis.

(For more on this, I recommend Tom Mitchell’s Machine Learning, from which this example was adapted.)
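
If you want to see those boundary sets fall out of the data, here is a rough sketch in Python. It is not Mitchell’s incremental candidate-elimination bookkeeping; it just brute-forces the 108-hypothesis space and picks out the maximal and minimal consistent members, which gives the same G and S sets for a space this small:

    from itertools import product

    # Attribute values from the tennis example; "?" is the wildcard slot.
    VALUES = [
        ("Sunny", "Cloudy", "Rainy"),   # Sky
        ("Warm", "Cold"),               # AirTemp
        ("Normal", "High"),             # Humidity
        ("Strong", "Weak"),             # Wind
    ]

    def matches(hyp, example):
        # "?" accepts anything; a specific value accepts only itself.
        return all(h == "?" or h == x for h, x in zip(hyp, example))

    def consistent(hyp, data):
        # Accepts every positive example and rejects every negative one.
        return all(matches(hyp, x) == label for x, label in data)

    def as_general(h1, h2):
        # True if h1 accepts everything h2 accepts.
        return all(a == "?" or a == b for a, b in zip(h1, h2))

    data = [
        (("Sunny", "Warm", "High", "Strong"), True),
        (("Rainy", "Cold", "High", "Strong"), False),
        (("Sunny", "Warm", "High", "Weak"), True),
    ]

    # Enumerate all 4*3*3*3 = 108 template hypotheses; keep the consistent ones.
    hypotheses = product(*[vals + ("?",) for vals in VALUES])
    survivors = [h for h in hypotheses if consistent(h, data)]

    # G: survivors with no strictly more general survivor.
    g_set = [h for h in survivors
             if not any(as_general(o, h) and o != h for o in survivors)]
    # S: survivors with no strictly more specific survivor.
    s_set = [h for h in survivors
             if not any(as_general(h, o) and o != h for o in survivors)]

    print(g_set)  # [('Sunny', '?', '?', '?'), ('?', 'Warm', '?', '?')]
    print(s_set)  # [('Sunny', 'Warm', 'High', '?')]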

Now you may notice that the format above cannot represent all possible concepts. E.g. “Play tennis when the sky is sunny or the air is warm”. That fits the data, but in the concept representation defined above, there’s no quadruplet of values that describes the rule.

Clearly our machine learner is not very general. Why not allow it to represent all possible concepts, so that it can learn with the greatest possible flexibility?

Days are composed of these four variables, one variable with 3 values and three variables with 2 values. So there are 3*2*2*2 = 24 possible Days that we could encounter.

The format given for representing Concepts allows us to require any of these values for a variable, or leave the variable open. So there are 4*3*3*3 = 108 concepts in that representation. For the most-general/most-specific algorithm to work, we need to start with the most specific hypothesis “no example is ever positively classified”. If we add that, it makes a total of 109 concepts.

Is it suspicious that there are more possible concepts than possible Days? Surely not: After all, a concept can be viewed as a collection of Days. A concept can be viewed as the set of days that it classifies positively, or isomorphically, the set of days that it classifies negatively.

So the space of all possible concepts that classify Days is the set of all possible sets of Days, whose size is 2^24 = 16,777,216.
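
The arithmetic, if you want to check it:

    days = 3 * 2 * 2 * 2             # possible Days
    restricted = 4 * 3 * 3 * 3 + 1   # ?-template concepts, plus "reject everything"
    unrestricted = 2 ** days         # every possible set of Days is a concept
    print(days, restricted, unrestricted)  # 24 109 16777216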

This complete space includes all the concepts we have discussed so far. But it also includes concepts like “Positively classify only the examples {Sunny, Warm, High, Strong} and {Sunny, Warm, High, Weak} and reject everything else” or “Negatively classify only the example {Rainy, Cold, High, Strong} and accept everything else.” It includes concepts with no compact representation, just a flat list of what is and isn’t allowed.

That’s the problem with trying to build a “fully general” inductive learner: it can’t learn concepts until it has seen every possible example in the instance space.

If we add on more attributes to Days—like the Water temperature, or the Forecast for tomorrow—then the number of possible days will grow exponentially in the number of attributes. But this isn’t a problem with our restricted concept space, because you can narrow down a large space using a logarithmic number of examples.

Let’s say we add the Water: {Warm, Cold} attribute to days, which will make for 48 possible Days and 325 possible concepts. Let’s say that each Day we see is, usually, classified positive by around half of the currently-plausible concepts, and classified negative by the other half. Then when we learn the actual classification of the example, it will cut the space of compatible concepts in half. So it might only take 9 examples (2^9 = 512) to narrow 325 possible concepts down to one.

Even if Days had forty binary attributes, it should still only take a manageable amount of data to narrow down the possible concepts to one. 64 examples, if each example is classified positive by half the remaining concepts. Assuming, of course, that the actual rule is one we can represent at all!
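
A quick check of both numbers, granting the idealized assumption that every example cuts the surviving concepts roughly in half:

    import math

    # 325 restricted concepts over five-attribute Days: nine halvings suffice.
    print(math.ceil(math.log2(325)))          # 9, since 2^9 = 512 >= 325

    # Forty binary attributes: 3^40 + 1 concepts in the ?-template representation.
    print(math.ceil(math.log2(3 ** 40 + 1)))  # 64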

If you want to think of all the possibilities, well, good luck with that. The space of all possible concepts grows superexponentially in the number of attributes.

By the time you’re talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible concepts is past two-to-the-trillionth-power. To narrow down that superexponential concept space, you’d have to see over a trillion examples before you could say what was In, and what was Out. You’d have to see every possible example, in fact.

That’s with forty binary attributes, mind you. 40 bits, or 5 bytes, to be classified simply “Yes” or “No”. 40 bits implies 2^40 possible examples, and 2^(2^40) possible concepts that classify those examples as positive or negative.
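
For a sense of scale, a small sketch that counts the digits of 2^(2^40) rather than constructing it, since the number itself would never fit in memory:

    import math

    examples = 2 ** 40                 # possible 40-bit examples
    print(f"{examples:,}")             # 1,099,511,627,776 -- past a trillion

    # Decimal digits of 2^(2^40), the number of possible concepts.
    digits = math.floor(examples * math.log10(2)) + 1
    print(f"{digits:,}")               # roughly 331 billion digits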

So, here in the real world, where objects take more than 5 bytes to describe and a trillion examples are not available and there is noise in the training data, we only even think about highly regular concepts. A human mind—or the whole observable universe—is not nearly large enough to consider all the other hypotheses.

From this perspective, learning doesn’t just rely on inductive bias, it is nearly all inductive bias—when you compare the number of concepts ruled out a priori, to those ruled out by mere evidence.

But what has this (you inquire) to do with the proper use of words?

It’s the whole reason that words have intensions as well as extensions.

In yesterday’s post, I concluded:

The way to carve reality at its joints, is to draw boundaries around concentrations of unusually high probability density.

I deliberately left out a key qualification in that (slightly edited) statement, because I couldn’t explain it until today. A better statement would be:

The way to carve reality at its joints, is to draw simple boundaries around concentrations of unusually high probability density in Thingspace.

Otherwise you would just gerrymander Thingspace. You would create really odd noncontiguous boundaries that collected the observed examples, examples that couldn’t be described in any shorter message than your observations themselves, and say: “This is what I’ve seen before, and what I expect to see more of in the future.”

In the real world, nothing above the level of molecules repeats itself exactly. Socrates is shaped a lot like all those other humans who were vulnerable to hemlock, but he isn’t shaped exactly like them. So your guess that Socrates is a “human” relies on drawing simple boundaries around the human cluster in Thingspace. Rather than, “Things shaped exactly like [5-megabyte shape specification 1] and with [lots of other characteristics], or exactly like [5-megabyte shape specification 2] and [lots of other characteristics], …, are human.”

If you don’t draw simple boundaries around your experiences, you can’t do inference with them. So you try to describe “art” with intensional definitions like “that which is intended to inspire any complex emotion for the sake of inspiring it”, rather than just pointing at a long list of things that are, or aren’t art.

In fact, the above statement about “how to carve reality at its joints” is a bit chicken-and-eggish: You can’t assess the density of actual observations, until you’ve already done at least a little carving. And the probability distribution comes from drawing the boundaries, not the other way around—if you already had the probability distribution, you’d have everything necessary for inference, so why would you bother drawing boundaries?

And this suggests another—yes, yet another—reason to be suspicious of the claim that “you can define a word any way you like”. When you consider the superexponential size of Conceptspace, it becomes clear that singling out one particular concept for consideration is an act of no small audacity—not just for us, but for any mind of bounded computing power.

Presenting us with the word “wiggin”, defined as “a black-haired green-eyed person”, without some reason for raising this particular concept to the level of our deliberate attention, is rather like a detective saying: “Well, I haven’t the slightest shred of support one way or the other for who could’ve murdered those orphans… not even an intuition, mind you… but have we considered John Q. Wiffleheim of 1234 Norkle Rd as a suspect?”