A Word to the Wise is Sufficient because the Wise Know So Many Words

Collect Ontologies

An ontology is a way of bucketing reality. For example, Russians distinguish голубой (light blue) from синий (dark blue) whereas Anglos bucket both into “blue”. An ontology is an implicit set of priors about the world. If you never bucket people according to skin color then you will be bad at predicting who prefers chopsticks over forks. If you always bucket people according to skin color then you will miss out on human universals.

It takes an untrained neural network a long time to distinguish venomous snakes from nonvenomous snakes. It will take you much less time if you treat coral snakes and milk snakes as separate species, even though they look similar to each other.

Different ontologies are appropriate for different contexts. Zero-sum models of the world work great when planning adversarial strategies. Zero-sum models of the world are counterproductive when attempting to put together a family dinner. The more different ontologies you know, the faster you can process new data.

The more ontologies you know, the more information it takes to distinguish between them. In practice this doesn’t matter, because the entropy required to distinguish between ontologies is so tiny.

The (Cheap) Price

How many ontologies can a person learn? Words aren’t ontologies, but I think the number of words a person knows provides a reasonable Fermi estimate of the number of ontologies bouncing around our heads. A normal person knows perhaps 32,768 words. If we treat each word as a distinct binary ontology of equal prior probability, then it takes log₂(32,768) = 15 bits of entropy to figure out which ontology to use.

The entropy of English text is estimated at 2.3 bits per letter. It therefore takes approximately 6 letters (15 bits ÷ 2.3 bits per letter ≈ 6.5) to figure out which word a person is trying to say. Technically, this is just a roundabout way of saying the average English word is about 6 letters long, but my point is that even with tens of thousands of ontologies to draw from, it takes only a tiny amount of entropy (on the order of one word) to find the needle in the haystack. A word to the wise is literally sufficient.
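To make the arithmetic explicit, here is a minimal sketch of the Fermi estimate, using only the numbers quoted above (the 32,768-word vocabulary and 2.3 bits per letter):

```python
import math

vocabulary_size = 32_768   # assumed vocabulary of a typical adult
bits_per_letter = 2.3      # estimated entropy of English text

# Bits needed to pick one word/ontology out of a uniform prior.
bits_to_identify = math.log2(vocabulary_size)        # 15.0 bits

# Letters of English text carrying that much entropy.
letters_needed = bits_to_identify / bits_per_letter  # ~6.5 letters

print(f"{bits_to_identify:.1f} bits ≈ {letters_needed:.1f} letters")
```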

That’s assuming all ontologies are equally probable a priori. If we allow for unequal prior probability distributions, then the median entropy cost is less than a single word. The more ontologies you learn, the smarter you become, because each ontology contains a heavy Bayesian prior you can apply to new situations you encounter. The extreme informational efficiency of pre-learned ontologies is how you get good at small data.
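As a rough sanity check on that claim, here is a minimal sketch that assumes a Zipfian frequency distribution over the same 32,768-word vocabulary; Zipf’s law is my assumption here, used only to illustrate how an unequal prior pushes the median cost below one word:

```python
import math

vocabulary_size = 32_768
bits_per_letter = 2.3

# Zipfian prior: p(rank k) proportional to 1/k (an assumed stand-in for
# real word-frequency statistics).
weights = [1.0 / k for k in range(1, vocabulary_size + 1)]
total = sum(weights)
probabilities = [w / total for w in weights]

# Walk down the ranks until half of all usage is covered; the word at that
# point is the median word *by usage*, and its surprisal is the median
# entropy cost of naming an ontology.
cumulative = 0.0
for rank, p in enumerate(probabilities, start=1):
    cumulative += p
    if cumulative >= 0.5:
        median_bits = -math.log2(p)
        break

print(f"median cost: {median_bits:.1f} bits "
      f"≈ {median_bits / bits_per_letter:.1f} letters of English")
# Roughly 10.5 bits ≈ 4.6 letters -- under one average word, versus the
# 15 bits the uniform prior requires.
```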
