The median and mode use less information than the mean does

Epistemic status: Unsure. I had this in drafts from a year ago, and am posting it for Goodhart day. (Though it’s April 1, all the arguments and statements in this post are things I think are true, with no jokes). I’m interested in arguments against this thesis, and especially interested in thoughts on the question at the end: does the distribution-summarizer corresponding to the $L^3$ or $L^4$ minimizers use more information than the mean (the $L^2$ minimizer)?


The mean, median, and mode are the “big 3” location parameters that most people have heard of. But they can have very different properties, and those differences are related to the fact that the mean uses more information than the median, and the median uses more information than the mode.

Refresher

The mean, median, and mode measure the location of a probability distribution. For the Gaussian distribution, they are all the same, but this isn’t the case in general. Here’s an example of a Gamma distribution where the three differ:

[Figure: density of a gamma(shape = 2, rate = 1) distribution, where the mean, median, and mode all differ.]

The mean corresponds to the middle of the distribution when weighted by frequency.

The median corresponds to the middle of the distribution, without using the weights. The median is the vertical line that splits a distribution such that 50% of the probability mass is on the left and 50% on the right.

The mode is the value at which the distribution is highest—the location of the peak.

Different amounts of information usage

The median is preserved under a larger set of changes to the data than the mean is. Really, this is often why people use the median: outliers don’t knock it around as much as they do the mean. But that ability to resist being knocked around (“robustness”) is the same as the ability to ignore information. The mean’s sensitivity is sometimes seen as a liability (and sometimes is a liability), but being sensitive here is the same thing as reacting more to the data. And it’s good to react to the data: the mean can distinguish between the three datasets plotted below, but the median can’t. If all you had was the median, you wouldn’t know these three datasets were different; if all you had was the mean, you would:

Top: Some data.
Middle: The data from the top plot, but with the values right of the median shifted 4 to the left.
Bottom: The data from the middle plot, but with the values left of the median shifted and scrambled with noise.
Since no data crossed the green median line, it didn’t move.

Similar arguments show why the mode uses even less information than the median: you can shift and scramble around any data that isn’t at the peak of the distribution, and you’ll still have the same mode. You can even pass data across the mode without changing it, unlike with the median.
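Here’s a small numerical sketch of that robustness ordering (my own toy example with made-up Poisson count data, not the datasets in the plots above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up integer count data, so the mode is well defined.
x = rng.poisson(lam=4, size=1_000)

def summarize(data):
    mode = np.bincount(data).argmax()  # most frequent value
    return np.mean(data), np.median(data), mode

print(summarize(x))

# Shift every value strictly right of the median further right by 10.
# No point crosses the median line, so the median (and the mode, which
# sits at or below the median here) stays put; the mean jumps.
y = np.where(x > np.median(x), x + 10, x)
print(summarize(y))

# Move the values in the far left tail all the way across the mode.
# The pile at the peak is still the biggest pile, so the mode stays put.
z = np.where(x == 0, 8, x)
print(summarize(z))
```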

So there’s a trade-off between robustness and information.

Using p-distances to reason about distribution summarizers

We can talk about location parameters in terms of different distance metrics, called “$L^p$ distances”. An $L^p$ distance measures the distance between two data vectors. If we call these vectors $x$ and $y$, and each has $n$ elements, the general form is $\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$. For different $p$, this can easily produce different distances, even for the same $x$ and $y$. So each different $p$ represents a different way of measuring distance.
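As a concrete sketch of that general form (the function name and example vectors here are just mine, for illustration):

```python
import numpy as np

def lp_distance(x, y, p):
    """(sum_i |x_i - y_i|^p)^(1/p) for two equal-length vectors."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1 / p)

x = [0.0, 1.0, 5.0]
y = [1.0, 1.0, 1.0]
print(lp_distance(x, y, 1))  # 5.0   (Manhattan distance)
print(lp_distance(x, y, 2))  # ~4.12 (Euclidean distance)
```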

(These things show up all over the place in math, under different names: I first saw them in linear algebra as the $L^1$, $L^2$, and $L^\infty$ norms. Then later I saw them in optimization, where the $L^1$ norm was being called the Manhattan/Taxicab distance, and $L^2$ the Euclidean distance (the $\sqrt{\sum_i (x_i - y_i)^2}$ form will look familiar to some). If any of those look familiar, you’ve heard of $L^p$ norms.)

The reason $L^p$ distances matter here is: the different location parameters minimize different $L^p$ distances. The mean is the single number $c$ that minimizes the $L^2$ distance $\left( \sum_i |x_i - c|^2 \right)^{1/2}$. The median is the number $c$ that minimizes the $L^1$ distance $\sum_i |x_i - c|$, and the mode is the $c$ that minimizes the $L^0$ “distance” $\sum_i |x_i - c|^0$ (which, with the convention $0^0 = 0$, just counts how many data points are not equal to $c$).
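Here’s a quick numerical sanity check of those three claims, brute-forcing the minimizing $c$ over a fine grid (made-up integer data; the mode is checked with a simple count rather than the grid, since the $L^0$ cost only notices exact equality):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=4, size=501)  # made-up integer data, odd length

candidates = np.linspace(x.min(), x.max(), 20_001)

def lp_cost(c, p):
    # Distance between the data vector and the constant vector (c, ..., c);
    # the 1/p root is monotone, so dropping it doesn't change the argmin.
    return np.sum(np.abs(x - c) ** p)

for p in (2, 1):
    best = candidates[np.argmin([lp_cost(c, p) for c in candidates])]
    print(f"L{p} minimizer ~ {best:.3f}")

print("mean   =", x.mean())                  # should match the L2 minimizer
print("median =", np.median(x))              # should match the L1 minimizer
print("mode   =", np.bincount(x).argmax())   # the L0 minimizer
```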

(Proof for the $L^2$ case: Say the scalar $c$ minimizes $\left( \sum_i (x_i - c)^2 \right)^{1/2}$. Since square root is a monotonic function, that’s the same as minimizing $\sum_i (x_i - c)^2$. One way to minimize something is to take the derivative and set it to 0, so let’s do that:

$$\frac{d}{dc} \sum_i (x_i - c)^2 = \sum_i -2(x_i - c) = 0$$ [taking the derivative and setting it to 0]

$$\sum_i x_i - \sum_i c = \sum_i x_i - nc = 0$$ [dropping the irrelevant $-2$, and separating $\sum_i c$ into $nc$]

Rearranging, $nc = \sum_i x_i$, which means $c = \frac{1}{n} \sum_i x_i$. The right-hand side is the form of the mean, so if $c$ minimizes the $L^2$ distance, then $c = \bar{x}$.)

So… if the $L^2$ minimizer uses more information from $x$ than the $L^1$ minimizer does, can we do better than the mean? Maybe the $L^4$ minimizer uses yet more information? (I’m skipping $L^3$ for reasons. [Edit, a year later: I’ve forgotten these reasons]). To investigate, we’ll first need the form of $c$ that minimizes the $L^4$ distance, just like we found $c = \bar{x}$ for $L^2$ in the proof above. We’re minimizing $\sum_i (x_i - c)^4$; setting the derivative to 0 means finding $c$ such that $\sum_i (x_i - c)^3 = 0$. [At this point, writing this a year ago, I got stuck, and now I am posting this anyway.]
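Even without a closed form, nothing stops us from finding the $L^4$ minimizer numerically. Here’s a sketch (my own addition, with made-up gamma data) that just minimizes $\sum_i (x_i - c)^4$ directly:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.gamma(shape=2, scale=1, size=10_000)  # a skewed, made-up dataset

# Brute-force the L4 minimizer: the c with the smallest sum((x - c)^4).
res = minimize_scalar(lambda c: np.sum((x - c) ** 4),
                      bounds=(x.min(), x.max()), method="bounded")

print("L4 minimizer ~", res.x)
print("mean (L2)    =", x.mean())
print("median (L1)  =", np.median(x))
```

For right-skewed data like this I’d expect the $L^4$ minimizer to land to the right of the mean, since larger exponents weight the long tail more heavily (in the $p \to \infty$ limit you get the midrange, $(\min + \max)/2$).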

Another approach to the same question: We compute how much Shannon information is in each parameter. For a fixed parameter estimate (mean / median / mode), we can compare how many different data vectors would give that same estimate. For example, we can ask: Given a fixed mean estimate $\bar{x}$, for how many different $x$ is it true that $\bar{x}$ minimizes the $L^2$ distance to $x$? It turns out there are way more such $x$ for (median, $L^1$) than there are for (mean, $L^2$). To see this, think of a distribution of some count data. Since the median is the point for which 50% of the data is on either side, moving around any value located to the right of the median, as long as it doesn’t cross the median, doesn’t change the median! You could take a distribution of medical costs with median $40,000, and move a bunch of points that were around $50,000 up to $100,000, or $1 million, or really as far as you want, for as many points as you want. Compare to the mean, where moving a bunch of points into the millions will strongly tug the mean to the right.

In terms of the domain and range of these averaging functions: the mean is a function from some set $X$ (data) to some set $M$ (possible means), the median is a function from the same $X$ to another set $M'$ (possible medians), and $M$ is bigger than $M'$. Is the set for the $L^4$ average (call it $M''$) bigger than $M$? I’m interested in thoughts on this!
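Here’s one way to poke at that last question empirically (my own sketch, not from the original draft): enumerate every multiset of 3 values from {0, …, 9} as a tiny stand-in for $X$, and count how many distinct means, medians, and $L^4$ minimizers they produce.

```python
import numpy as np
from itertools import combinations_with_replacement

def l4_minimizer(xs):
    # d/dc sum((x_i - c)^4) = 0  <=>  sum((x_i - c)^3) = 0,
    # a cubic in c with a single real root (the sum is monotone in c).
    x = np.asarray(xs, dtype=float)
    coeffs = [-len(x), 3 * x.sum(), -3 * np.sum(x**2), np.sum(x**3)]
    roots = np.roots(coeffs)
    return roots[np.argmin(np.abs(roots.imag))].real

datasets = list(combinations_with_replacement(range(10), 3))
medians = {round(float(np.median(d)), 6) for d in datasets}
means   = {round(float(np.mean(d)), 6) for d in datasets}
l4s     = {round(float(l4_minimizer(d)), 6) for d in datasets}

# How many distinct values can each summary take over these 220 datasets?
print(len(medians), len(means), len(l4s))
```

For this toy case the median can only take 10 distinct values (the middle element) and the mean 28 (sums 0 through 27, divided by 3); the question is whether the count for the $L^4$ minimizer comes out larger still.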