As usual, being a bayesian makes everything extraordinarily clear. The mean-squared-error loss is just the negative logarithm of your data likelihood P(y_1, …, y_n | α, β) ∝ ∏_i exp(−(y_i − α·x_i − β)² / (2σ²)) under the assumption of gaussian-distributed data, so “minimizing the mean-squared loss” is completely equivalent to an MLE with gaussian errors. Any other loss you might want to compute directly implies an assumption about the data distribution, and vice versa. If you have reason to believe that your data might not be normally distributed around an x-dependent mean… then don’t use a mean-squared loss.
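(A minimal numeric sketch of that equivalence, not from the original comment: the data, the known σ, and the use of scipy’s generic optimizer are all made up for illustration. The point is just that the two objectives have the same argmin.)

```python
# Check numerically that minimizing MSE and maximizing the gaussian
# likelihood recover the same (alpha, beta) for a linear model.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)  # true alpha=2, beta=1
sigma = 0.5  # assumed known, for illustration

def mse(params):
    alpha, beta = params
    return np.mean((y - alpha * x - beta) ** 2)

def neg_log_likelihood(params):
    # -log prod_i exp(-(y_i - alpha*x_i - beta)^2 / (2 sigma^2)),
    # dropping the constant normalisation term
    alpha, beta = params
    return np.sum((y - alpha * x - beta) ** 2) / (2 * sigma ** 2)

fit_mse = minimize(mse, x0=[0.0, 0.0]).x
fit_mle = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x
print(fit_mse, fit_mle)  # agree up to optimizer tolerance
```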
This approach also makes lots of regularisation techniques transparent. Typically regularisation corresponds to applying some prior over the weights/parameters of the model you’re fitting. E.g. L2 norm (aka ridge, aka weight decay) regularisation corresponds exactly to taking a Gaussian prior on the weights and finding the Maximum A Posteriori (MAP) estimate rather than the Maximum Likelihood one.
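(Again a sketch rather than anything from the original comment: hypothetical data, plus the standard correspondence λ = σ²/τ², i.e. the ridge penalty is the ratio of noise variance to prior variance, so a tighter prior on the weights means heavier shrinkage.)

```python
# Ridge's closed-form solution vs. the MAP estimate under a
# N(0, tau^2 I) prior on the weights and gaussian noise of variance sigma^2.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
sigma, tau = 0.3, 1.0
y = X @ w_true + rng.normal(scale=sigma, size=100)

lam = sigma**2 / tau**2  # the ridge penalty implied by the prior

# Ridge: argmin ||y - Xw||^2 + lam * ||w||^2, in closed form
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

def neg_log_posterior(w):
    # -log [ likelihood * prior ], constants dropped
    return (np.sum((y - X @ w) ** 2) / (2 * sigma**2)
            + np.sum(w**2) / (2 * tau**2))

w_map = minimize(neg_log_posterior, x0=np.zeros(3)).x
print(w_ridge, w_map)  # identical up to optimizer tolerance
```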
See your β there? You assume that people remember to “control for bias” before they apply tools that assume Gaussian error.
That is indeed what I should have remembered about the implications of “we can often assume an approximately normal distribution” from my statistics course ~15 years ago. But then I saw people complaining about sensitivity to outliers in one direction, and I failed to make the connection until I dug deeper into my reasoning.