How to pick your categories

Note: this is intended to be a friendly math post, so apologies to anyone for whom this is all old hat. I’m deliberately staying elementary for the benefit of people who are new to the ideas. There are no proofs: this is long enough as it is.

Related: Where to Draw the Boundary, The Cluster Structure of Thingspace, Disguised Queries.

Here’s a rather deep problem in philosophy: how do we come up with categories? What’s the difference between a horror movie and a science fiction movie? Or the difference between a bird and a mammal? Are there such things as “natural kinds,” or are all such ideas arbitrary?

We can frame this in a slightly more mathematical way as follows. Objects in real life (animals, moving pictures, etc.) are enormously complicated and have many features and properties. You can think of this as a very high-dimensional space, one dimension for each property, with each object having a value for each property. A grayscale picture, for example, has a brightness value for each pixel. A text document has a count for every word (the word “flamingo” might have been used 7 times, for instance.) A multiple-choice questionnaire has an answer for each question. Each object is a point in a high-dimensional featurespace. To judge which objects are similar to each other, we look at how close their points are in featurespace. For example, two pictures that differ at only one pixel should turn out to be similar.
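To make the featurespace picture concrete, here is a toy sketch in Python (assuming NumPy; the word list and counts are invented for illustration): each document becomes a vector of word counts, and similarity is just distance between those vectors.

```python
import numpy as np

# A tiny made-up vocabulary; each document is a vector of word counts.
# vocabulary = ["flamingo", "galaxy", "murder", "spaceship"]
documents = np.array([
    [7, 0, 0, 0],   # a birdwatching essay
    [0, 5, 0, 4],   # a science-fiction story
    [0, 4, 1, 5],   # another science-fiction story
])

# Euclidean distance in featurespace: the two science-fiction stories
# come out much closer to each other than to the birdwatching essay.
print(np.linalg.norm(documents[1] - documents[2]))  # about 1.7
print(np.linalg.norm(documents[0] - documents[1]))  # about 9.5
```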

We could then start to form categories if the objects form empirical clusters in featurespace. If some animals have wings and hollow bones and feathers, and some animals have none of those things but give milk and bear live young, it makes sense to distinguish birds from mammals. If empirical clusters actually exist, then there’s nothing arbitrary about the choice of categories—the categories are appropriate to the data!

There are a number of mathematical techniques for assigning categories; all of them are basically attacking the same problem, and in principle should all agree with each other and identify the “right” categories. But in practice they have different strengths and weaknesses, in computational efficiency, robustness to noise, and ability to classify accurately. This field is incredibly useful—this is how computers do image and speech recognition, this is how natural language processing works, this is how they sequence your DNA. It also, I hope, will yield insights into how people think and perceive.

Clustering techniques

These techniques attempt to find clusters in the observations directly. A common example is the K-means algorithm. The goal here is, given a set of observations x_1, ..., x_n, to partition them into k sets S_1, ..., S_k so as to minimize the within-cluster sum of squared distances:

argmin_S  Σ_{i=1}^{k}  Σ_{x ∈ S_i}  ||x − m_i||²

where m_i is the mean of the points in cluster S_i.

The standard algorithm is to pick k means at random, assign each point to the cluster with the closest mean, and then move each cluster’s mean to the centroid (average) of the points currently assigned to it. Then we iterate, possibly reassigning points to different clusters at each step, until the assignments stop changing. This is usually very fast in practice, though its worst-case running time can be very slow.
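Here is a minimal sketch of that procedure, assuming NumPy is available (a production implementation, such as scikit-learn’s KMeans, handles initialization and convergence more carefully):

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving means."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as the initial means.
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to the cluster with the nearest mean.
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each mean to the centroid of the points assigned to it.
        new_means = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):
            break  # assignments have stabilized
        means = new_means
    return labels, means
```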

There are many, many clustering algorithms. They vary in the choice of distance metric (it doesn’t have to be Euclidean, we could take taxicab distances or Hamming distances or something else). There’s also something called hierarchical clustering, which outputs a tree of clusters.
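For instance, here is a brief sketch of hierarchical clustering with a non-Euclidean metric, assuming SciPy is available (the Hamming metric and the tiny binary dataset are just illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Binary feature vectors, e.g. answers to five yes/no questions.
X = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
])

# Pairwise Hamming distances, then an average-linkage cluster tree.
tree = linkage(pdist(X, metric="hamming"), method="average")

# Cut the tree into two flat clusters; the first two rows end up together.
print(fcluster(tree, t=2, criterion="maxclust"))
```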

Linear dimensionality reduction techniques

Here’s another way to think about this problem: perhaps, of all the possible features that could distinguish two objects, most of the variation is in only a few features. You have a high-dimensional feature space, but in fact all the points are lying in a much lower-dimensional space. (Maybe, for instance, once you’ve identified what color a flower is, how big it is, and how many petals it has, you’ve almost completely identified the flower.) We’d like to know which coordinate axes explain the data well.

There are a number of methods for doing this—I’ll mention a classic one, singular value decomposition (SVD). For any m x n matrix M, we have a factorization of the form

M = U Σ V*

where U and V are orthogonal (unitary, in the complex case) and Σ is diagonal with non-negative entries. The columns of U are the eigenvectors of MM*, the columns of V are the eigenvectors of M*M, and the diagonal elements of Σ (called singular values) are the square roots of the nonzero eigenvalues of both M*M and MM*.

Now, if we want a low-rank approximation of M (that is, an approximation in which every point lies on a low-dimensional hyperplane), all we have to do is truncate Σ to keep only the k largest singular values, setting the rest to zero. Intuitively, these are the dimensions that account for most of the variation in the data matrix M. The approximate matrix M’ = U Σ′ V* can be shown to be the closest possible rank-k approximation to M, in the least-squares (Frobenius-norm) sense.

In other words, if you have high-dimensional data, and you suspect that only two coordinates explain all the data—maybe you have a questionnaire, and you think that age and sex explain all the patterns in answering—something like SVD can identify which those coordinates are. In a sense, dimensionality reduction can identify the factors that are worth looking at.
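A minimal sketch of that truncation, assuming a real data matrix and NumPy (library routines such as scikit-learn’s TruncatedSVD do the same thing at scale):

```python
import numpy as np

def rank_k_approximation(M, k):
    """Keep only the k largest singular values of M and rebuild the matrix."""
    # Full SVD: M = U @ diag(s) @ Vt, with singular values sorted largest first.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Chop Sigma down to its k largest entries and multiply back together.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Applied to a questionnaire matrix (one row per respondent, one column per question), the k = 2 approximation keeps only the two directions of answer-variation that explain the most, whatever they turn out to be.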

Nonlinear dimensionality reduction techniques

An algorithm like SVD is linear—the approximation it spits out lies entirely on a vector subspace of your original vector space (a line, a plane, a hyperplane of some dimension.) Sometimes, though, that’s a bad idea. What if you have a cloud of data that actually lies on a circle? Or some other curvy shape? SVD will get it wrong.

One interesting tweak on this process is manifold learning—if we suspect that the data lies on a low-dimensional but possibly curvy shape, we try to identify the manifold, just as in SVD we tried to identify the subspace. There are a lot of algorithms for doing this. One of my favorites is the Laplacian eigenmap.

Here, we look at each data point as a node on a graph; if we have lots of data (and we usually do) the graph is a sort of mesh approximation of the smooth manifold the data lies on. We construct a sparse, weighted adjacency matrix: it’s N x N, where N is the number of data points. Matrix elements are zero if they correspond to two points that are far apart, but if they correspond to nearby points we put in the heat-kernel weight e^(−||x−y||²/t) (for some scale parameter t). Then we form the graph Laplacian of this weight matrix (the diagonal matrix of row sums, minus the weight matrix itself) and look at its eigenvectors. The eigenvectors with the smallest nonzero eigenvalues are used as coordinates to embed the data into Euclidean space. The reason this works, roughly, is that we’re approximating a Laplace operator on the manifold: two points are close together if a diffusion moving along the graph would travel between them quickly. It’s a way of mapping the graph into a lower-dimensional space such that points that are close on the graph are close in the embedding.
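Here is a minimal sketch of that recipe, assuming NumPy and SciPy; the dense weight matrix, the k-nearest-neighbor rule, and the bandwidth t are all simplifying choices (scikit-learn’s SpectralEmbedding is a more practical route):

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(X, n_components=2, t=1.0, n_neighbors=10):
    """Embed the rows of X using eigenvectors of an approximate graph Laplacian."""
    n = len(X)
    # Pairwise squared distances between data points.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Heat-kernel weights, kept only between each point and its nearest neighbors.
    W = np.exp(-sq_dists / t)
    neighbors = np.argsort(sq_dists, axis=1)[:, 1:n_neighbors + 1]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), n_neighbors), neighbors.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)  # symmetrize the neighborhood graph
    # Graph Laplacian L = D - W, with D the diagonal degree matrix.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Generalized eigenproblem L f = lambda D f, eigenvalues in ascending order;
    # skip the constant eigenvector belonging to eigenvalue zero.
    vals, vecs = eigh(L, D)
    return vecs[:, 1:n_components + 1]
```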

Good nonlinear dimensionality reduction techniques can identify data that lies on curvy shapes, like circles and spirals and the “Swiss roll” (a rolled-up plane), much better than linear dimensionality reduction techniques can.
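As a toy check (reusing the hypothetical laplacian_eigenmap sketch above, with scikit-learn assumed available for the sample generator), the two recovered coordinates should roughly unroll the sheet, which no linear projection can do:

```python
from sklearn.datasets import make_swiss_roll

# 1000 points lying on a rolled-up plane in 3-D.
X, _ = make_swiss_roll(n_samples=1000, noise=0.05)

# Two nonlinear coordinates for the same points.
embedding = laplacian_eigenmap(X, n_components=2, t=5.0, n_neighbors=10)
print(embedding.shape)  # (1000, 2)
```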

What’s the moral?

Once upon a time, the search engine Yahoo tried to categorize all the sites on the web according to a pre-set classification system. Science, sports, entertainment, and so on. It was phenomenally unpopular. The content that grew online often didn’t fit the categories. People didn’t want to surf the internet with the equivalent of the Dewey Decimal System.

These days, to some degree, we know better. Amazon.com doesn’t recommend books based on pre-set categories, giving you a horror book if you liked a horror book before. It recommends a book based on the choices of other customers who liked the same books you like. Evidently, they have a big matrix somewhere, one column for every customer and one row for every product, and quite possibly they’re running some sort of graph diffusion on it. They let the categories emerge organically from the data. If a new genre is emerging, they don’t have to scurry around trying to add a new label for it; it’ll show up automatically.
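Purely to illustrate the flavor of that kind of computation (Amazon’s actual system isn’t public, and the tiny purchase matrix below is made up), here is item-to-item similarity computed straight from a customer-by-product matrix:

```python
import numpy as np

# Rows are products, columns are customers; 1 means "bought it".
purchases = np.array([
    [1, 1, 0, 1],   # product A
    [1, 1, 0, 0],   # product B
    [0, 0, 1, 1],   # product C
])

# Cosine similarity between products, based on who bought them.
norms = np.linalg.norm(purchases, axis=1, keepdims=True)
similarity = (purchases @ purchases.T) / (norms * norms.T)

# Product A's nearest neighbor is B: customers who bought A also bought B,
# so B is what gets recommended -- no pre-set genre labels anywhere.
print(similarity[0])  # similarity of A to [A, B, C]
```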

This suggests a sort of rationalist discipline: categories should always be organic. Humans like to categorize, and categories can be very useful. But not every set of categories is efficient at describing the variety of actual observations. Biologists used to have a kingdom called Monera, consisting of all the prokaryotes (single-celled organisms without a nucleus); it was a horrible grab bag, because those organisms turned out to be very different from each other genetically. After analyzing genomes, they decided there were actually three domains, Bacteria, Archaea, and Eukarya (the only one that includes multicellular life.) In a real, non-arbitrary way, this is a better kind of categorization, and Monera was a bad category.

Sometimes it seems that researchers don’t always pay attention to the problem of choosing categories and axes well. For example, I once saw a study of autism that did the following: created a questionnaire that rated the respondent’s “empathizing” and “systematizing” qualities, found that autistics were less “empathizing” and more “systematizing” than non-autistics, and concluded that autism was defined by more systematizing and less empathizing. This is a classic example of privileging one axis over others—what if autistics and non-autistics also differ in some other way? How do you know that you’ve chosen an efficient way to define that category? Wouldn’t you have to go look?

If you say “There are two kinds of people in the world,” but you look around and lots of people don’t fit your binary, then maybe your binary is bad. If you want to know whether your set of categories is good, go look—see if the data actually clusters that way. There’s still a lot of debate about which mathematical techniques are best for defining categories, but it is a field where science has actually made progress. It’s not all arbitrary. When you play Twenty Questions with the universe, some questions are more useful than others.

References:

Wikipedia is generally very good on this subject, and the wiki page on singular value decomposition contains the proof that it actually works.

The paper “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation” by Mikhail Belkin and Partha Niyogi does a much better job of explaining Laplacian eigenmaps. Some other nonlinear dimensionality reduction techniques: Isomap, Locally Linear Embedding, and diffusion maps.