Similar arguments show why the mode uses even less information than the median: you can shift and scramble any data that isn’t at the peak of the distribution, and you’ll still have the same mode. You can even move data across the mode without changing it, unlike with the median.
This characterization doesn’t seem to quantify information entirely.
If the data has, say, a mode of 3 (the number 3) with three 3s, and the runner-up is 4 with two 4s, then adding a couple of 4s makes 4 the mode, even though the median and mean barely change (suppose there are a few dozen items on the list in total).
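This instability is easy to check directly. A quick sketch (the filler values here are made up to pad the list out to a few dozen items, matching the scenario: three 3s, two 4s):

```python
import statistics

# Three 3s, two 4s, and thirty filler values that each appear once.
data = [3, 3, 3, 4, 4] + list(range(10, 40))

print(statistics.mode(data))    # 3 is the most common value
print(statistics.median(data), statistics.mean(data))

# Adding just two more 4s flips the mode...
data += [4, 4]
print(statistics.mode(data))    # now 4

# ...while the median and mean barely move.
print(statistics.median(data), statistics.mean(data))
```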
The probabilistic mode (a definition I just made up):
Pick an element at random, and call that the mode. Now the expected distribution is such that its max is the classical mode. By ditching the discontinuous nature of the mode, the question shifts to:
how informative is this (now)
or the old question
what is this useful for?
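A minimal sketch of this randomized estimator, assuming the data is a finite list: a single uniform draw from the data is distributed according to the empirical distribution, so the estimator’s most likely outcome is exactly the classical mode.

```python
import random
import statistics
from collections import Counter

random.seed(0)
data = [3, 3, 3, 4, 4, 5]

# The "probabilistic mode": a single uniform draw from the data.
def probabilistic_mode(xs):
    return random.choice(xs)

# Any single draw can be anything, but over many draws the estimator's
# empirical distribution peaks at the classical mode.
draws = Counter(probabilistic_mode(data) for _ in range(10_000))
print(draws.most_common(1)[0][0], statistics.mode(data))  # both are 3 here
```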
So… if the L2 minimizer (the mean) uses more information from x than the L1 minimizer (the median) does, can we do better than the mean? Maybe the L4 minimizer uses yet more information? (I’m skipping L3 for reasons. [Edit, a year later: I’ve forgotten these reasons].)
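To see what an L4 minimizer actually does, here is a sketch: for p ≥ 1 the function c ↦ Σ|xᵢ − c|^p is convex, so a ternary search finds its minimizer. On skewed data, L1 recovers the median, L2 the mean, and L4 gets pulled even harder toward outliers.

```python
def lp_minimizer(xs, p, iters=200):
    """Minimize sum(|x - c|**p) over c by ternary search (convex for p >= 1)."""
    f = lambda c: sum(abs(x - c) ** p for x in xs)
    lo, hi = min(xs), max(xs)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

data = [1, 2, 3, 4, 100]      # one big outlier
print(lp_minimizer(data, 1))  # ~3, the median
print(lp_minimizer(data, 2))  # ~22, the mean
print(lp_minimizer(data, 4))  # pulled further toward the outlier than the mean
```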
What’s L3?
Another approach to the same question: We compute how much Shannon information is in each parameter.
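One crude way to make that concrete (a simulation sketch, just one possible operationalization): draw many small datasets from a fixed distribution, compute each statistic, and estimate the Shannon entropy of the statistic’s distribution. A statistic that varies more finely with the data carries more bits.

```python
import math
import random
import statistics
from collections import Counter

random.seed(0)

def entropy_bits(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def statistic_entropy(stat, trials=20_000, n=5):
    # Small datasets of n dice rolls; how many bits does each statistic carry?
    vals = [stat([random.randint(1, 6) for _ in range(n)]) for _ in range(trials)]
    return entropy_bits(vals)

for name, stat in [("mode", statistics.mode),
                   ("median", statistics.median),
                   ("mean", statistics.mean)]:
    print(name, round(statistic_entropy(stat), 2))
```

On this toy setup the mean, which can take 26 distinct values for 5 dice, comes out with more entropy than the mode, which can only take 6 — consistent with the intuition above, though this is only one way to cash out “information in a parameter.”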
Good.