Why square errors?

Mean squared error (MSE) is a common metric for comparing the performance of models in linear regression or machine learning. But optimization based on L2-norm metrics can exaggerate bias in our models in the name of lowering variance. So why do we square errors so often instead of using the absolute value?

Follow my journey of deconfusion about the popularity of MSE.

[Figure: MSE vs MAE fits on data with an outlier. MSE is sensitive to outliers.]

My original intuition was that optimizing mean absolute error (MAE) ought to always create “better” models: closer to the underlying reality, capturing the “true meaning” of the data, and more robust to outliers. If that intuition were true, would it mean that I think most researchers are lazy and use an obviously incorrect metric purely for convenience?

MSE is differentiable, while MAE needs some obscure linear programming to find the (not necessarily unique) solutions. But in practice, we would just call a different method in our programming language. So reaching for least squares (e.g., via the Levenberg–Marquardt algorithm) shouldn’t look like a mere convenience; there should be fundamental reasons why smart people want to use it.
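To make the “just call a different method” point concrete, here is a minimal sketch of my own (with made-up data, not from any paper) of both fits on a small dataset containing one outlier. The L2 fit uses ordinary least squares; the L1 fit is posed as the standard linear program with slack variables, which is the “obscure linear programming” hiding behind an MAE fit.

```python
# A sketch, not a definitive implementation: fitting y = a*x + b by least
# squares and by least absolute deviations on made-up data with one outlier.
import numpy as np
from scipy.optimize import linprog

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.1, 1.9, 3.2, 3.9, 15.0])   # the last point is an outlier

# MSE / L2: ordinary least squares has a closed-form solution (here via polyfit).
a_l2, b_l2 = np.polyfit(x, y, deg=1)

# MAE / L1: minimize sum_i t_i subject to |y_i - (a*x_i + b)| <= t_i,
# with decision variables [a, b, t_1, ..., t_n].
n = len(x)
c = np.concatenate([[0.0, 0.0], np.ones(n)])     # objective: sum of slacks t_i
A_line = np.column_stack([x, np.ones(n)])        # A_line @ [a, b] = predicted y
A_ub = np.block([[ A_line, -np.eye(n)],          # (a*x_i + b) - y_i <= t_i
                 [-A_line, -np.eye(n)]])         # y_i - (a*x_i + b) <= t_i
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * 2 + [(0, None)] * n    # a, b free; slacks non-negative
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
a_l1, b_l1 = res.x[:2]

print(f"L2 fit: y = {a_l2:.2f}x + {b_l2:.2f}")   # slope dragged up by the outlier
print(f"L1 fit: y = {a_l1:.2f}x + {b_l1:.2f}")   # stays close to the bulk of the data
```

Switching between the two really is one function call away, so convenience alone does not explain why squared errors dominate.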

Another explanation could be that MSE is easier to teach and to explain what’s happening “inside”, and thus STEM students are more familiar with the benefits of MSE / L2. In toy examples, if you want to fit one line through an ambiguous set of points, minimizing MSE will give you that one intuitive line in the middle, while MAE will tell you that there is an infinite number of equally good solutions to your linear regression problem:

[Figure: MSE vs MAE on an ambiguous set of points. MAE might not give you a unique solution.]
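The non-uniqueness is easy to reproduce. Here is a small construction of my own (not the figure above): four points at the corners of a unit square, for which several different lines achieve exactly the same MAE, while only the middle line minimizes MSE.

```python
# A toy construction of my own: four points at the corners of a unit square.
import numpy as np

x = np.array([0.0, 0.0, 1.0, 1.0])
y = np.array([0.0, 1.0, 0.0, 1.0])

def mse(a, b):
    return np.mean((y - (a * x + b)) ** 2)

def mae(a, b):
    return np.mean(np.abs(y - (a * x + b)))

candidates = [(0.0, 0.5),   # the horizontal line through the middle
              (1.0, 0.0),   # the rising diagonal
              (-1.0, 1.0)]  # the falling diagonal

for a, b in candidates:
    print(f"y = {a:+.1f}x + {b:.1f}:  MSE = {mse(a, b):.3f}  MAE = {mae(a, b):.3f}")

# All three lines (and infinitely many in between) share the minimal MAE of 0.5,
# but only the middle line reaches the minimal MSE of 0.25.
```

Which of the equally good lines an L1 solver returns depends on the solver, which is exactly the ambiguity the figure hints at.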

I see this as an advantage of MAE or L1: it reflects the ambiguity of reality (“don’t use point estimates of linear regression parameters when multiple lines would fit the data equally well, use those damned error bars and/or collect more data”). But I also see that other people might understand this as an advantage of MSE or L2 (“I want to gain insights and some reasonable prediction, an over-simplified fast model is enough, without unnecessary complications”).

Previously, I also blamed Stein’s Paradox on L2, but that does not seem to be true, so other people probably formed their intuitions about L1 and L2 without imaginary evidence against either of them. If both norms are equally good starting points for simplifying reality into models, then computational performance becomes a bigger factor in the judgment that L2 is “better”.

However, “Root mean square error (RMSE) or mean absolute error (MAE)?” by Chai and Draxler (2014) captures the biggest gap in my previous understanding—we have to make an assumption (or measurement) about the error distribution and choose our metric(s) accordingly:

Condensing a set of error values into a single number, either the RMSE or the MAE, removes a lot of information. The best statistics metrics should provide not only a performance measure but also a representation of the error distribution. The MAE is suitable to describe uniformly distributed errors. Because model errors are likely to have a normal distribution rather than a uniform distribution, the RMSE is a better metric to present than the MAE for such a type of data.

And for many problems, small errors are more likely than large errors, without an inherent bias in the direction of the error (or we can control for the bias), so a normal distribution is a reasonable assumption.
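One simple way to see the link between metric and error distribution (a simulation sketch of my own, not from the paper): when estimating a plain location parameter, the MSE minimizer is the sample mean and the MAE minimizer is the sample median. Under Gaussian errors the mean is the more precise estimator; under heavier-tailed Laplace errors the median wins.

```python
# A simulation sketch of my own: mean (MSE minimizer) vs. median (MAE minimizer)
# as location estimators under Gaussian and under heavier-tailed Laplace errors.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_trials = 100, 20_000

for name, draw in [("gaussian", rng.standard_normal),
                   ("laplace", lambda size: rng.laplace(size=size))]:
    samples = draw(size=(n_trials, n_samples))            # true location is 0
    err_mean = np.sqrt(np.mean(samples.mean(axis=1) ** 2))
    err_median = np.sqrt(np.mean(np.median(samples, axis=1) ** 2))
    print(f"{name:8s}: RMSE of mean = {err_mean:.4f}, RMSE of median = {err_median:.4f}")

# Expected pattern: the mean is tighter under Gaussian errors (~0.100 vs ~0.125),
# while the median is tighter under Laplace errors (~0.100 vs ~0.141).
```

Which estimator, and hence which error metric, comes out ahead is decided by the error distribution, not by the metric in isolation.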

When we are not sure (which we never are), we can also use more than one metric to guide our understanding; evaluating our models by one over-simplified number shouldn’t be enough.