The Fallacy of Large Numbers

I’ve been seeing this a lot lately, and I don’t think it’s been written about here before.

Let’s start with a motivating example. Suppose you have a fleet of 100 cars (or horses, or people, or whatever). For any given car, on any given day, there’s a 3% chance that it’ll be out for repairs (or sick, or attending grandmothers’ funerals, or whatever). For simplicity’s sake, assume all failures are uncorrelated. How many cars can you afford to offer to customers each day? Take a moment to think of a number.

Well, 3% failure means 97% success. So we expect 97 to be available and can afford to offer 97. Does that sound good? Take a moment to answer.

Well, maybe not so good. Sometimes we’ll get unlucky. And not being able to deliver on a contract is painful. Maybe we should reserve 4 and only offer 96. Or maybe we’ll play it very safe and reserve twice the expected number: 6 in reserve, 94 for customers. But is that overkill? Take note of what you’re thinking now.

The likelihood of having more than 4 unavailable is 18%. The likelihood of having more than 6 unavailable is 3.1%. About once a month. Even reserving 8, requiring 9 failures to get you in trouble, gets you in trouble 0.3% of the time. More than once a year. Reserving 9, three times the expected number, gets the risk down to 0.087%, or a little less often than once every three years. A number we can finally feel safe with.
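These are just binomial tail probabilities, and they’re cheap to check. A minimal sketch in Python (assuming, as above, that failures are independent):

```python
from math import comb

n, p = 100, 0.03  # fleet size, per-car daily failure probability

def prob_more_than(k):
    """P(more than k cars unavailable) when the count is Binomial(n, p)."""
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

for reserve in (4, 6, 8, 9):
    print(f"reserve {reserve}: {prob_more_than(reserve):.3%} chance of falling short")
# Prints roughly 18.2%, 3.1%, 0.32%, and 0.087%.
```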

So much for expected values. What happened to the Law of Large Numbers? Short answer: 100 isn’t large.

The Law of Large Numbers states that for sufficiently large samples, the results look like the expected value (for any reasonable definition of “like”).

The Fallacy of Large Numbers states that your numbers are sufficiently large.

This doesn’t just apply to expected values. It also applies to looking at a noisy signal and handwaving that the noise will average away with repeated measurements. Before you can say something like that, you need to look at how many measurements, and how much noise, and crank out a lot of calculations. This variant is particularly tricky because you often don’t have numbers on how much noise there is, making it hard to do the calculation. When the calculation is hard, the handwave is more tempting. That doesn’t make it more accurate.
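For a feel of what that calculation involves, here’s a minimal sketch with made-up numbers. It assumes independent noise with a known standard deviation, which is precisely the number you often don’t have:

```python
from math import sqrt, ceil

# Made-up numbers: per-measurement noise and the precision you actually need.
sigma = 5.0    # standard deviation of a single measurement
target = 0.5   # how small the noise in the average has to be

# Averaging n independent measurements only shrinks the noise like sigma / sqrt(n) ...
def noise_after_averaging(n):
    return sigma / sqrt(n)

# ... so the measurements needed grow with the *square* of the improvement you want.
n_needed = ceil((sigma / target) ** 2)

print(f"after 10 measurements: noise ~ {noise_after_averaging(10):.2f}")  # ~1.58, still 3x the target
print(f"measurements needed to hit {target}: {n_needed}")                 # 100
```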

I don’t know of any general tools for saying when statistical approximations become safe. The best thing I know is to spot-check like I did above. Brute-forcing combinatorics sounds scary, but Wolfram Alpha can be your friend (as above). So can Python, which has native bignum support. Python has a reputation for being slow at number crunching, but with n < 1000 and a modern CPU it usually doesn’t matter.
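For instance, here’s one way to brute-force the fleet calculation from above with exact rational arithmetic: `math.comb` returns exact integers of arbitrary size, `Fraction` keeps every probability exact, and the whole thing still runs in a blink:

```python
from fractions import Fraction
from math import comb

n = 100
p = Fraction(3, 100)  # the 3% failure probability, kept as an exact rational

def tail(k):
    """Exact P(more than k failures out of n) -- no floating point anywhere."""
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

exact = tail(9)
print(exact)         # an exact (and enormous) fraction
print(float(exact))  # ~0.00087, matching the figure above
```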

One warning sign is if your tools were developed in a very different context than where you’re using them. Some approximations were invented for dealing with radioactive decay, where n resembles Avogadro’s Number. Applying these tools to the American population is risky. Some were developed for the American population. Applying them to students in your classroom is risky.
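To pick a concrete example of such an approximation: the normal approximation to the binomial is perfectly safe when n is enormous, but at n = 100 with a 3% rate it understates the tail we cared about above by roughly an order of magnitude. A quick sketch:

```python
from math import comb, erf, sqrt

n, p = 100, 0.03
mu, sd = n * p, sqrt(n * p * (1 - p))  # mean 3, standard deviation ~1.71

def exact_tail(k):
    """Exact P(more than k failures) for a Binomial(n, p) count."""
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_tail(k):
    """Normal approximation to the same tail, with the usual continuity correction."""
    z = (k + 0.5 - mu) / sd
    return 0.5 * (1 - erf(z / sqrt(2)))

print(f"exact:  {exact_tail(9):.6f}")   # ~0.000870
print(f"normal: {normal_tail(9):.6f}")  # ~0.000070 -- more than ten times too optimistic
```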

Another danger is that your dataset can shrink. If you’ve validated your tools for your entire dataset, and then thrown out some datapoints and divided the rest along several axes, don’t be surprised if some of your data subsets are now too small for your tools.
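A back-of-the-envelope sketch, with all the numbers made up:

```python
# All numbers hypothetical: a dataset that starts out comfortably large ...
total_rows = 5000
after_filtering = int(total_rows * 0.8)  # drop ~20% as outliers / bad records

# ... then gets divided along a few innocuous-looking axes.
regions, months, groups = 4, 12, 2
cells = regions * months * groups

print(after_filtering)          # 4000 -- still sounds like plenty
print(cells)                    # 96 subsets
print(after_filtering / cells)  # ~42 rows per subset, before any uneven splitting
```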

This fallacy is related to “assuming events are uncorrelated” and “assuming distributions are normal”. It’s a special case of “choosing statistical tools based on how easy they are to use, whether they’re applicable to your use case or not”.