Is it correct to say that the mean is a good estimator whenever the variance is finite?
Well, yes, in the sense that the law of large numbers applies, i.e.
$$\lim_{n\to\infty}\Pr\{|\bar{x}-E[X]|<\varepsilon\}=1\quad\forall\,\varepsilon>0$$
The condition for that to hold is actually weaker than finite variance. If the $x_i$ are not only drawn from the same distribution but are also independent, then the existence of a finite $E[X]$ is necessary and sufficient for the sample mean to converge in probability to $E[X]$ as $n\to\infty$, if I understand the theorem correctly (I can't prove that yet, though; the proof with a finite variance is easy). If the $x_i$ aren't independent, the conditions required are still weaker than finite variance, but they're cumbersome and impractical, so finite variance is fine, I guess.
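To see the finite-mean-but-infinite-variance case in action, here's a minimal sketch (the Pareto distribution is just my illustrative choice, not anything from the question): with shape $\alpha=1.5$ it has mean $\alpha/(\alpha-1)=3$ but infinite variance, and the sample mean still drifts toward 3, just slowly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pareto with shape alpha = 1.5: finite mean alpha/(alpha-1) = 3, infinite variance.
# The LLN still applies, but without a finite variance the convergence is slow.
alpha = 1.5
true_mean = alpha / (alpha - 1)

for n in (10**3, 10**5, 10**7):
    # numpy's pareto() is the Lomax form; adding 1 gives the classical Pareto on [1, inf).
    x = rng.pareto(alpha, size=n) + 1
    print(f"n = {n:>8}: |sample mean - {true_mean}| = {abs(x.mean() - true_mean):.4f}")
```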
But that alone isn't always enough to justify using the sample mean as an estimator in practice. As foodforthought says, for a normal distribution it is simultaneously the lowest-MSE estimator, the maximum likelihood estimator, and an unbiased estimator, but that isn't true for other distributions.
A quick example: suppose we want to determine the parameter $p$ of a Bernoulli random variable, i.e. "a coin". The prior distribution over $p$ is uniform; we flip the coin $n=10$ times and use the sample success rate $k/n$ (where $k$ is the number of successes), i.e. the mean, i.e. the maximum likelihood estimate. In simulation the mean squared error $E[(k/n-p)^2]$ is about 0.0167. However, if we use $(k+1)/(n+2)$ instead, the mean squared error drops to 0.0139 (code).
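The linked code isn't reproduced here, but a minimal simulation of the kind that produces those two numbers could look like this (the seed and number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 1_000_000

# Draw p from the uniform prior, then flip the coin n times for each p.
p = rng.uniform(size=trials)
k = rng.binomial(n, p)

mse_mle = np.mean((k / n - p) ** 2)                # sample success rate, the MLE
mse_post = np.mean(((k + 1) / (n + 2) - p) ** 2)   # posterior mean under the uniform prior

print(f"MSE of k/n:         {mse_mle:.4f}")   # about 1/60 = 0.0167
print(f"MSE of (k+1)/(n+2): {mse_post:.4f}")  # about 1/72 = 0.0139
```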
Honestly, though, all of this seems like frequentist cockamamie to me. We can't escape prior distributions; we may as well stop pretending that they don't exist. Just calculate a posterior and do whatever you want with it. E.g., how did I come up with the $(k+1)/(n+2)$ example? It's the expected value of the posterior Beta distribution for $p$ under a uniform prior, and the posterior mean is exactly the estimator that minimizes the expected squared error under that prior, which is why it gives a lower MSE.
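For completeness, the one-line derivation behind that estimator: with a uniform (i.e. $\mathrm{Beta}(1,1)$) prior and $k$ successes in $n$ flips, the posterior is

$$p \mid k \sim \mathrm{Beta}(k+1,\,n-k+1), \qquad E[p \mid k] = \frac{k+1}{(k+1)+(n-k+1)} = \frac{k+1}{n+2}.$$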