The idea that the mean or average is a good measure of central tendency of a distribution, or a good estimator, is so familiar that we forget it requires justification. For Normal distributions, it is the lowest-MSE estimator, the maximum likelihood estimator, and an unbiased estimator, but this isn’t true of all distributions. For a skewed, long-tailed distribution, for example, the median can be a better estimator of central tendency.
Is it correct to say that the mean is a good estimator whenever the variance is finite? If so, maybe I should have added that assumption to the post.
I wonder how to think about that in the case of entropy, which you thought about analyzing. Differential entropy can also be infinite, for example. But the Cauchy distribution, which you mention, has infinite variance but finite differential entropy, at least.
1. Sorry, I haven’t figured out the equation editor yet.
You can press Cmd+4 to type inline LaTeX formulas, and Cmd+m for standalone LaTeX formulas! Hope that helps.
In principle, you can estimate entropy directly from the probability distribution non-parametrically as $H(Y) = -\sum_y P(y) \log_2 P(y)$, where the sum runs over all possible values of $Y$ and $P(y)$ is the probability that $Y$ takes the value $y$ (for continuous $Y$, the sum becomes an integral over the probability density function, PDF). Likewise, you can estimate the mutual information directly from the joint probability distribution between X and Y, the equation for which I won’t try to write out here without an equation editor.
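In code, though, the plug-in computations for the discrete case are easy to write down. Here is a minimal sketch, assuming the true probability tables are available as arrays (the function names are mine, purely for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x, y]."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)  # marginal distribution of X
    py = joint.sum(axis=0)  # marginal distribution of Y
    # identity: I(X;Y) = H(X) + H(Y) - H(X,Y)
    return entropy(px) + entropy(py) - entropy(joint.ravel())

# toy example: two correlated binary variables
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(entropy(joint.sum(axis=0)))  # H(Y) = 1.0 bit
print(mutual_information(joint))   # about 0.278 bits
```

Using the identity I(X;Y) = H(X) + H(Y) - H(X,Y) sidesteps writing out the double sum explicitly.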
Note: After writing the next paragraph, I noticed that you made essentially the same points further below in your answer, but I’m still keeping my paragraph here for completeness.
I was more wondering whether we can estimate them from data, where we don’t get the ground-truth values for the probabilities that appear in the formulas for entropy and mutual information, at least not directly. If we have lots of data, then we can approximate a PDF, that is true, but I’m not aware of a way of doing so that is entirely principled or works without regularity assumptions. As an example, let’s say we want to estimate the conditional entropy H(Y∣X) (a replacement for the “remaining variance” in my post) for continuous X and Y. I think in this case, if all sampled x-values differ from each other, you could in principle conclude that there is no uncertainty in Y conditional on X at all, since you observe only one Y-value for each X-value. But that would be severe overfitting, similar to what you’d expect in my section titled “When you have lots of data” for continuous X.
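To make that overfitting failure mode concrete, here is a toy sketch of my own (the function name is just illustrative): a naive plug-in estimate of H(Y∣X) that treats every distinct x-value as its own bin, applied to continuous data where all sampled x-values are unique.

```python
import numpy as np
from collections import Counter

def plugin_conditional_entropy(x, y):
    """Naive plug-in estimate of H(Y|X) in bits, treating every distinct x-value as its own bin."""
    n = len(x)
    h = 0.0
    for xv, count in Counter(x).items():
        ys = [yi for xi, yi in zip(x, y) if xi == xv]
        py = np.array(list(Counter(ys).values()), dtype=float) / len(ys)
        h += (count / n) * -np.sum(py * np.log2(py))
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # continuous X: every sampled value is distinct
y = rng.normal(size=200)  # Y is independent of X, so knowing X should tell us nothing about Y
print(plugin_conditional_entropy(x, y))  # 0.0 -- the estimator "sees" no remaining uncertainty
```

Even though Y is independent of X here, the estimate comes out as exactly zero bits, i.e. it claims X removes all uncertainty about Y, which is exactly the overfitting described above.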
Maybe it would be interesting to analyze the conditional entropy case for non-continuous distributions where variance makes less sense.
I think from my point of view we’re largely in agreement, thanks for your further elaborations!
Is it correct to say that the mean is a good estimator whenever the variance is finite?
Well, yes, in the sense that the law of large numbers applies, i.e.
$$\lim_{n \to \infty} \Pr\{\,|\bar{x} - E[X]| < \varepsilon\,\} = 1 \qquad \forall\, \varepsilon > 0$$
The condition for that to hold is actually weaker. If all the $x_i$ are not only drawn from the same distribution but are also independent, the existence of a finite $E[X]$ is necessary and sufficient for the sample mean to converge in probability to $E[X]$ as $n$ goes to infinity, if I understand the theorem correctly (I can’t prove that yet, though; the proof with finite variance is easy). If the $x_i$ aren’t independent, the necessary condition is still weaker than finite variance, but it’s cumbersome and impractical, so finite variance is fine, I guess.
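A quick way to see the role of the finite-mean condition (a simulation sketch, not a proof): compare running sample means for a distribution with a finite mean against a standard Cauchy, whose mean doesn’t exist.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

exp_draws = rng.exponential(scale=1.0, size=n)  # Exponential(1): finite mean (= 1)
cauchy_draws = rng.standard_cauchy(size=n)      # standard Cauchy: no finite mean

exp_means = np.cumsum(exp_draws) / np.arange(1, n + 1)      # running sample means
cauchy_means = np.cumsum(cauchy_draws) / np.arange(1, n + 1)

for k in (1_000, 10_000, 100_000):
    print(k, exp_means[k - 1], cauchy_means[k - 1])
# the exponential column settles near 1; the Cauchy column keeps wandering
```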
But that isn’t quite enough to always justify using the sample mean as an estimator in practice, is it? As foodforthought says, for a normal distribution it’s simultaneously the lowest-MSE estimator, the maximum likelihood estimator, and an unbiased estimator, but that’s not true for other distributions.
A quick example: suppose we want to determine the parameter $p$ of a Bernoulli random variable, i.e. “a coin”. The prior distribution over $p$ is uniform; we flip the coin $n = 10$ times and use the sample success rate, $k/n$, i.e. the mean, i.e. the maximum likelihood estimate. Per simulation, the mean squared error $E[(k/n - p)^2]$ is about 0.0167. However, if we use $\frac{k+1}{n+2}$ instead, the mean squared error drops to 0.0139 (code).
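The linked code isn’t reproduced here, but a simulation along the lines described might look like this (my own sketch, assuming a uniform prior over $p$ and $n = 10$ flips):

```python
import numpy as np

rng = np.random.default_rng(0)
trials, n = 1_000_000, 10

p = rng.uniform(size=trials)  # draw p from the uniform prior
k = rng.binomial(n, p)        # number of successes in n flips for each p

mse_mle = np.mean((k / n - p) ** 2)               # estimator k/n (sample mean / MLE)
mse_post = np.mean(((k + 1) / (n + 2) - p) ** 2)  # estimator (k+1)/(n+2)

print(mse_mle)   # about 1/60, i.e. roughly 0.0167
print(mse_post)  # about 1/72, i.e. roughly 0.0139
```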
Honestly though, all of this seems like frequentist cockamamie to me. We can’t escape prior distributions; we may as well stop pretending that they don’t exist. Just calculate a posterior and do whatever you want with it. E.g., how did I come up with the $\frac{k+1}{n+2}$ example? Well, it’s the expected value of the posterior beta distribution for $p$ when the prior is uniform, so it also gives a lower MSE.
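Spelling out that last step: with a uniform (i.e. $\mathrm{Beta}(1,1)$) prior and $k$ successes in $n$ flips, the posterior and its mean are

$$p \mid k \;\sim\; \mathrm{Beta}(k+1,\; n-k+1), \qquad E[p \mid k] = \frac{k+1}{(k+1)+(n-k+1)} = \frac{k+1}{n+2}.$$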