In case it hasn’t crossed your mind, I personally think it’s helpful to start in the setting of estimating the true mean $\mu$ of data $x_1, \dots, x_n$. A very natural choice of estimator for $\mu$ is the sample mean of the $x_i$, which I’ll denote $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$. This can equivalently be formulated as the minimizer over $c$ of $\sum_{i=1}^n (x_i - c)^2$.
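A quick numeric sanity check of that equivalence (just a sketch; the toy data and the grid search are arbitrary illustrations, not part of the argument):

```python
import statistics

# Toy data; any finite sample works.
x = [2.0, 3.5, 1.0, 4.0, 2.5]

def sse(c, xs):
    """Sum of squared deviations: sum_i (x_i - c)^2."""
    return sum((xi - c) ** 2 for xi in xs)

# Scan a fine grid of candidate values c and take the minimizer.
grid = [i / 1000 for i in range(0, 5001)]  # c in [0, 5]
c_star = min(grid, key=lambda c: sse(c, x))

# The minimizer coincides with the sample mean (up to grid resolution).
print(c_star, statistics.mean(x))
```

(Of course, setting the derivative $-2\sum_i (x_i - c)$ to zero gives $c = \bar{x}$ directly; the grid search is just a check.)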
Others have mentioned the normal distribution, but this feels secondary to me. Here’s why—let’s say $x_i = \mu + \sigma \varepsilon_i$, where $\varepsilon_i \sim F$ for a known continuous probability distribution $F$ with mean 0 and variance 1, and $\mu, \sigma$ are unknown. So the distribution of each $x_i$ has mean $\mu$ and variance $\sigma^2$ (and assume the $x_i$ are independent).
What must $F$ be for the sample mean to be the maximum likelihood estimator of $\mu$? Gauss proved that it must be the standard normal distribution, and intuitively it’s not hard to see why its density would have to be of the form $f(\varepsilon) \propto e^{-k\varepsilon^2}$.
So from this perspective, MSE is a generalization of taking the sample mean, and asking the linear model to have Gaussian errors is exactly what’s needed to formally justify MSE through maximum likelihood.
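To see the Gaussian/MLE direction concretely, here’s a sketch checking that the Gaussian log-likelihood in $\mu$ (with $\sigma$ held fixed) peaks at the sample mean; the data and the value of `sigma` are arbitrary illustrations:

```python
import math
import statistics

x = [2.0, 3.5, 1.0, 4.0, 2.5]
sigma = 1.3  # any fixed scale; treated as known here

def gauss_loglik(mu, xs, s):
    """Log-likelihood of i.i.d. N(mu, s^2) data."""
    return sum(-0.5 * math.log(2 * math.pi * s * s)
               - (xi - mu) ** 2 / (2 * s * s) for xi in xs)

xbar = statistics.mean(x)

# The log-likelihood at the sample mean beats any other candidate mu.
for mu in [xbar - 1.0, xbar - 0.1, xbar + 0.1, xbar + 1.0]:
    assert gauss_loglik(xbar, x, sigma) > gauss_loglik(mu, x, sigma)
```

This works for any fixed $\sigma$, since maximizing the Gaussian log-likelihood in $\mu$ is the same as minimizing $\sum_i (x_i - \mu)^2$.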
Replace the sample mean with the sample median and you get mean absolute error.
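The same grid-search sketch verifies that claim numerically (again, toy data and grid are arbitrary illustrations):

```python
import statistics

x = [2.0, 3.5, 1.0, 4.0, 2.5]

def sad(c, xs):
    """Sum of absolute deviations: sum_i |x_i - c|."""
    return sum(abs(xi - c) for xi in xs)

grid = [i / 1000 for i in range(0, 5001)]  # c in [0, 5]
c_star = min(grid, key=lambda c: sad(c, x))

# The minimizer coincides with the sample median.
print(c_star, statistics.median(x))
```

(With an odd number of points the minimizer is exactly the median; with an even number, any point between the two middle values minimizes the sum.)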
Is there no way to salvage it via a Nash bargaining argument if the odds are different? Or at least, deal with scenarios where you have x:1 and 0:1 odds (i.e. you can only bet on heads)?