The median and mode use less information than the mean does

Epistemic status: Unsure. I had this in drafts from a year ago, and am posting it for Goodhart day. (Though it's April 1, all the arguments and statements in this post are things I think are true, with no jokes.) I'm interested in arguments against this thesis, and especially interested in thoughts on the question at the end: do the distribution summarizers corresponding to the $L_3$ or $L_4$ minimizers use more information than the mean (the $L_2$ minimizer)?
The mean, median, and mode are the “big 3” location parameters that most people have heard of. But they can have very different properties, and these differences are tied to the fact that the mean uses more information than the median, and the median uses more information than the mode.
Refresher
The mean, median, and mode measure the location of a probability distribution. For the Gaussian distribution, they are all the same, but this isn’t the case in general. Here’s an example of a Gamma distribution where the three differ:
The mean is the center of mass of the distribution: the middle, when each value is weighted by how often it occurs.
The median is the middle of the distribution in the ordering sense, ignoring how far away values are: it's the vertical line that splits the distribution so that 50% of the probability mass is on the left and 50% on the right.
The mode is the value where the distribution is highest, i.e. its peak.
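The plot itself isn't reproduced here, but a minimal numeric sketch (in Python, assuming scipy is available; the shape and scale values are arbitrary choices of mine) shows the three coming apart for a right-skewed Gamma:

```python
from scipy import stats

# An arbitrary right-skewed Gamma: shape k = 2, scale theta = 2.
k, theta = 2.0, 2.0
dist = stats.gamma(a=k, scale=theta)

print(dist.mean())        # 4.0   (k * theta)
print(dist.ppf(0.5))      # ~3.36 (median; no simple closed form)
print((k - 1) * theta)    # 2.0   (mode, valid for shape >= 1)
# mode < median < mean, as expected for a right-skewed distribution
```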
Different amounts of information usage
The median is preserved under a larger set of changes to the data than the mean is. Really, this is often why people use the median: outliers don't knock it around as much as they do the mean. But that ability to resist being knocked around ("robustness") is the same as the ability to ignore information. The mean's sensitivity is sometimes seen as a liability (and sometimes is a liability), but being sensitive here is the same thing as reacting more to the data. And reacting to the data is good: the plots below show three datasets that the mean can tell apart but the median can't. If all you had was the median, you wouldn't know these three datasets were different; if all you had was the mean, you would.
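These aren't the datasets from the plots, but here is a minimal sketch of the same point with made-up numbers:

```python
import numpy as np

# Three made-up datasets with identical medians but different means.
a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 2, 3, 4, 50])
c = np.array([1, 2, 3, 400, 5000])

for data in (a, b, c):
    print(np.median(data), np.mean(data))
# medians: 3.0, 3.0, 3.0   means: 3.0, 12.0, 1081.2
```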
Similar arguments show why the mode uses even less information than the median: you can shift and scramble around any data that isn't at the peak of the distribution (as long as you don't pile it up into a new, taller peak), and you'll still have the same mode. You can even pass data across the mode without changing it, unlike with the median.
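A small sketch of passing a point across the mode, again with made-up numbers (using the standard library's statistics module):

```python
from statistics import median, mode

# Made-up data whose peak (mode) is at 2.
before = [1, 2, 2, 2, 5, 7, 9, 11]
after  = [6, 2, 2, 2, 5, 7, 9, 11]   # the 1 has been moved across the mode, up to 6

print(mode(before), mode(after))      # 2 2     -- the mode doesn't notice
print(median(before), median(after))  # 3.5 5.5 -- the median does
```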
So there's a trade-off between robustness and information.
Using p-distances to reason about distribution summarizers
We can talk about location parameters in terms of different distance metrics, called “$L_p$ distances”. An $L_p$ distance measures the distance between two data vectors. If we call these vectors $x$ and $y$, and each has $n$ elements, the general form is $L_p(x,y) = \left(\sum_{i=1}^n |x_i - y_i|^p\right)^{1/p}$. For different $p$, this can easily produce different distances, even for the same $x$ and $y$. So each different $p$ represents a different way of measuring distance.
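A direct transcription into code (the function name is mine, not anything standard):

```python
import numpy as np

def lp_distance(x, y, p):
    """The L_p distance between two equal-length data vectors."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p))

x = [0.0, 1.0, 2.0]
y = [3.0, 1.0, 0.0]
print(lp_distance(x, y, 1))   # 5.0
print(lp_distance(x, y, 2))   # ~3.61 (square root of 13)
```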
(These things show up all over the place in math, under different names: I first saw them in linear algebra as $\|x-y\|$, $\|x-y\|_2$, and $\|x-y\|_\infty$. Then later I saw them in optimization, where the $L_1$ norm was being called the Manhattan/Taxicab distance, and $L_2$ the Euclidean distance: the form $\sqrt{\sum_{i=1}^n (x_i - y_i)^2}$ will look familiar to some. If any of those look familiar, you've heard of $L_p$ norms.)
The reason $L_p$ distances matter here is: the different location parameters minimize different $L_p$ distances. The mean is the single number $\mu$ that minimizes the $L_2$ distance $L_2(x,\mu) = \sqrt{\sum_{i=1}^n (x_i - \mu)^2}$. The median is the number $\tilde{m}$ that minimizes $L_1(x,\tilde{m})$, and the mode is the $\hat{m}$ that minimizes the “$L_0$ distance”, which just counts how many of the $x_i$ differ from $\hat{m}$ (the $p \to 0$ limit of the sum $\sum_{i=1}^n |x_i - \hat{m}|^p$).
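A quick numeric check of the $L_1$ and $L_2$ claims, with made-up data and a brute-force grid search rather than anything clever:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])             # made-up data
candidates = np.linspace(x.min(), x.max(), 10_001)   # candidate location values

# L_1 and L_2 distances between the data and each constant candidate.
l1 = np.abs(x[:, None] - candidates[None, :]).sum(axis=0)
l2 = np.sqrt(((x[:, None] - candidates[None, :]) ** 2).sum(axis=0))

print(candidates[l1.argmin()], np.median(x))  # ~2.0, 2.0
print(candidates[l2.argmin()], np.mean(x))    # ~3.6, 3.6
```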
(Proof for the $L_2$ case: Say the scalar $\beta$ minimizes $\sqrt{\sum_{i=1}^n (x_i - \beta)^2}$. Since square root is a monotonic function, that's the same as $\beta$ minimizing $\sum_{i=1}^n (x_i - \beta)^2$. One way to minimize something is to take the derivative and set it to 0, so let's do that:
$\frac{d}{d\beta} \sum_{i=1}^n (x_i - \beta)^2 = -2 \sum_{i=1}^n (x_i - \beta) = 0$ [taking the derivative and setting it to 0]
$\sum_{i=1}^n x_i - \sum_{i=1}^n \beta = \sum_{i=1}^n x_i - n\beta = 0$ [dropping the irrelevant $-2$, and separating $\beta$]
Rearranging, $\sum_{i=1}^n x_i = n\beta$, which means $\beta = \frac{1}{n} \sum_{i=1}^n x_i$. The right-hand side is the form of the mean, so if $\beta$ minimizes the $L_2$ distance, then $\beta = \mu$.)
So… if the $L_2$ minimizer uses more information from $x$ than the $L_1$ minimizer does, can we do better than the mean? Maybe the $L_4$ minimizer uses yet more information? (I'm skipping $L_3$ for reasons. [Edit, a year later: I've forgotten these reasons]). To investigate, we'll first need the form of $\eta$ that minimizes $L_4(x,\eta)$, just like we found for $\beta$ in the proof above. We're minimizing $\sum_{i=1}^n (x_i - \eta)^4$; setting the derivative to 0 means finding $\eta$ such that $-4 \sum_{i=1}^n (x_i - \eta)^3 = 0$. [At this point, writing this a year ago, I got stuck, and now I am posting this anyway.]
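I don't have the closed form either, but the minimizer is easy to find numerically. A sketch with made-up data (using scipy's scalar minimizer):

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])  # made-up data

def l4_loss(eta):
    # The (monotonically equivalent) L_4 objective: sum of fourth powers.
    return np.sum((x - eta) ** 4)

eta = minimize_scalar(l4_loss).x
print(np.mean(x), eta)   # 3.6 vs. roughly 5.0
```

In this made-up example the $L_4$ minimizer gets tugged toward the outlier even harder than the mean does, which at least looks like "reacting to the data more".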
Another approach to the same question: we compute how much Shannon information is in each parameter. For a fixed parameter estimate $\mu$ / $\tilde{m}$ / $\hat{m}$, we can compare how many different data vectors $x$ would give that same estimate. For example, we can ask: given a fixed mean estimate $\mu$, for how many different $x$ is it true that $\mu$ minimizes $L_2(x,\mu)$? It turns out there are way more such $x$ for $(\tilde{m}, L_1)$ than there are for $(\mu, L_2)$. To see this, think of a distribution of cost data. Since the median is the point with 50% of the data on either side, moving around any value $x_i$ located to the right of the median, as long as it doesn't pass $\tilde{m}$, doesn't change the median! You could take a distribution of medical costs with median $40,000, and move a bunch of points that were around $50,000 up to $100,000, or $1 million, or really as far as you want, for as many points as you want. Compare to the mean, where moving a bunch of points into the millions will strongly tug the mean to the right. In terms of the domain and range of these averaging functions: the mean is a function from some set $X$ (data) to some set $M$ (possible means), the median is a function from the same $X$ to another set $\tilde{M}$, and $M$ is bigger than $\tilde{M}$. Is the set for the $L_4$ average (call it $M'$) bigger than $M$? I'm interested in thoughts on this!
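For concreteness, here is the medical-cost point in code, with made-up values whose median is $40,000:

```python
import numpy as np

costs = np.array([10_000, 30_000, 40_000, 50_000, 55_000])
moved = np.array([10_000, 30_000, 40_000, 1_000_000, 2_000_000])  # two points pushed way out

print(np.median(costs), np.median(moved))  # 40000.0 40000.0  -- the median can't tell them apart
print(np.mean(costs), np.mean(moved))      # 37000.0 616000.0 -- the mean can
```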