“Statistical Bias”

(Part one in a series on “statistical bias”, “inductive bias”, and “cognitive bias”.)

“Bias” as used in the field of statistics refers to directional error in an estimator. Statistical bias is error you cannot correct by repeating the experiment many times and averaging together the results.

The famous bias-variance decomposition states that the expected squared error is equal to the squared directional error, or bias, plus the expected squared random error, or variance. The law of large numbers implies that you can reduce variance, not bias, by repeating the experiment many times and averaging the results.

An experiment has some randomness in it, so if you repeat the experiment many times, you may get slightly different data each time; and if you run a statistical estimator over the data, you may get a slightly different estimate each time. In classical statistics, we regard the true value of the parameter as a constant, and the experimental estimate as a random variable. The bias is the systematic, or average, difference between these two values; the variance is the leftover random component.
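To make the two components concrete, here is a minimal sketch in Python; the particular numbers (a true value of 200, a bias of −20, noise with standard deviation 10) are borrowed from the example below and are purely illustrative.

```python
import random

# A minimal sketch of the setup (the numbers are taken from the
# Emperor-of-China example below, not anything canonical): the estimator's
# answer is the true value, plus a fixed systematic offset (bias), plus
# Gaussian noise (variance).
TRUE_VALUE = 200.0   # the constant parameter we are trying to estimate
BIAS = -20.0         # systematic offset of the estimator
NOISE_SD = 10.0      # standard deviation of the random component

def run_experiment():
    """One repetition of the experiment: a single noisy, biased estimate."""
    return TRUE_VALUE + BIAS + random.gauss(0.0, NOISE_SD)

estimates = [run_experiment() for _ in range(100_000)]
mean_estimate = sum(estimates) / len(estimates)

# The average estimate settles near TRUE_VALUE + BIAS (about 180), not near
# TRUE_VALUE: repetition averages away the noise but not the systematic error.
print(mean_estimate)
```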

Let’s say you have a repeatable experiment intended to estimate, for example, the height of the Emperor of China. In fact, the Emperor’s height is 200 cm. Suppose that every single American believes, without variation, that the Emperor’s height is 180 cm. Then if you poll a random American and ask “How tall is the Emperor of China?”, the answer is always “180 cm”, the error is always −20 cm, and the squared error is always 400 (I shall omit the units on squared errors). But now suppose that Americans have normally distributed beliefs about the Emperor’s height, with mean belief 180 cm and standard deviation 10 cm. You conduct two independent repetitions of the poll; one American says “190 cm” and the other says “170 cm”, with errors of −10 cm and −30 cm respectively, and squared errors of 100 and 900. The average error is −20 cm, as before, but the average squared error is (100 + 900) / 2 = 500. So even though the average (directional) error didn’t change as a result of adding noise to the experiments, the average squared error went up.
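The arithmetic of that two-person poll, spelled out in a few lines of Python (nothing here beyond the numbers already given above):

```python
true_height = 200            # cm, the Emperor's actual height
answers = [190, 170]         # cm, the two polled Americans

errors = [a - true_height for a in answers]                      # [-10, -30]
squared_errors = [e ** 2 for e in errors]                        # [100, 900]

mean_error = sum(errors) / len(errors)                           # -20.0
mean_squared_error = sum(squared_errors) / len(squared_errors)   # 500.0

print(mean_error, mean_squared_error)
```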

Although in one case the random perturbation happened to push the American in the correct direction (the one who answered 190 cm, which is closer to the true value of 200 cm), the other American was pushed further from the truth, answering 170 cm. Since these are equal and opposite deviations, the average answer did not change. But since the square grows faster than linearly, the larger error corresponded to a disproportionately larger squared error, and the average squared error went up.

Furthermore, the new average squared error of 500 is exactly the square of the directional error (−20 cm) plus the square of the random error (a standard deviation of 10 cm): 400 + 100 = 500.

In the long run, the above result is universal and exact: if the true value is a constant X and the estimator is Y, then E[(X - Y)^2] = (X - E[Y])^2 + E[(E[Y] - Y)^2]. Expected squared error = squared bias + variance of the estimator. This is the bias-variance decomposition.
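One way to see why this holds: expand the square around E[Y] and note that the cross term vanishes, since X and E[Y] are constants and E[E[Y] - Y] = 0. In the same notation as above, the algebra is:

```latex
\begin{aligned}
E\big[(X - Y)^2\big]
  &= E\Big[\big((X - E[Y]) + (E[Y] - Y)\big)^2\Big] \\
  &= (X - E[Y])^2 \;+\; 2\,(X - E[Y])\,E\big[E[Y] - Y\big] \;+\; E\big[(E[Y] - Y)^2\big] \\
  &= \underbrace{(X - E[Y])^2}_{\text{squared bias}} \;+\; \underbrace{E\big[(E[Y] - Y)^2\big]}_{\text{variance}}
\end{aligned}
```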

If we averaged together the two Americans above, we would get an average estimate of 180 cm, with a squared error of 400, which is less than the average squared error of the two polls taken individually (500), but still erroneous.

If the true value is a constant X and the estimator is Y, then by averaging many estimates together we converge toward the expected value of the estimator, E[Y], by the law of large numbers; subtracting this from X leaves a squared error of (X - E[Y])^2, which is the bias term of the bias-variance decomposition. If your estimator is all over the map and highly sensitive to noise in the experiment, then by repeating the experiment many times and averaging, you can recover the expected value of your estimator, and so you are left with only the systematic error of that estimator, not the random noise that varies from experiment to experiment. That’s what the law of large numbers is good for.
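If it helps, here is a small simulation of that point, under the same assumed setup as the Emperor example (each poll samples one belief from a normal distribution with mean 180 and standard deviation 10; the true height is 200): averaging more answers drives the variance term toward zero, but the squared error of the averaged estimate levels off at the squared bias of 400.

```python
import random

# Sketch of the law-of-large-numbers point (illustrative setup only).
# Averaging n answers leaves an expected squared error of roughly
# bias^2 + variance/n = 400 + 100/n, which approaches the bias term
# alone as n grows; the 400 never goes away.
TRUE, MEAN_BELIEF, SD = 200.0, 180.0, 10.0
random.seed(0)

def squared_error_of_average(n):
    """Poll n Americans, average their answers, return the squared error."""
    answers = [random.gauss(MEAN_BELIEF, SD) for _ in range(n)]
    estimate = sum(answers) / n
    return (TRUE - estimate) ** 2

for n in (1, 2, 10, 100):
    # Average the squared error over many repetitions of the whole poll.
    mse = sum(squared_error_of_average(n) for _ in range(20_000)) / 20_000
    print(n, round(mse, 1))   # roughly 500, 450, 410, 401
```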