The usefulness of correlations

I sometimes wonder just how useful probability and statistics are. There is the theoretical argument that Bayesian probability is the fundamental method of correct reasoning, and that logical reasoning is just the limit as p=0 or 1 (although that never seems to be applied at the meta-level: what is the probability that Bayes’ Theorem is true?), but today I want to consider the practice.

Casinos, lotteries, and quantum mechanics: no problem. The information required for deterministic measurement is simply not available, by adversarial design in the first two cases, and by we know not what in the third. Insurance: by definition, this only works when it’s impossible to predict the catastrophes insured against. No-one will offer insurance against a risk that will happen, and no-one will buy it for a risk that won’t. Randomised controlled trials are the gold standard of medical testing; but over on OB Robin Hanson points out from time to time that the marginal dollar of medical spending has little effectiveness. And we don’t actually know how a lot of treatments work. Quality control: test a random sample from your production run and judge the whole batch from the results. Fine—it may be too expensive to test every widget, or impossible if the test is destructive. But wherever someone is doing statistical quality control of how accurately you’re filling jam jars with the weight of jam it says on the label, someone else will be thinking about how to weigh every single one, and how to make the filling process more accurate. (And someone else will be trying to get the labelling regulations amended to let you sell the occasional 15-ounce pound of jam.)

But when you can make real measurements, that’s the way to go. Here is a technical illustration.

Prof. Sagredo has assigned a problem to his two students Simplicio and Salviati: “X is difficult to measure accurately. Predict it in some other way.”

Simplicio collects some experimental data consisting of a great many pairs (X,Y) and with high confidence finds a correlation of 0.6 between X and Y. So given the value y of Y, his best prediction for the value of X is 0.6y. [Edit: that formula is mistaken. The regression line for predicting X from Y is X = (ca/b)Y, assuming the means have been normalised to zero, where a and b are the standard deviations of X and Y respectively and c is their correlation. For the Y = X + D1 model below, a/b is equal to c, so the best prediction is c²y = 0.36y.]
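For concreteness, here is a minimal sketch (Python with NumPy; the sample size and seed are arbitrary, and the noise scale is the d1 value that will be derived later in the post) of estimating that regression slope from a large set of (X,Y) pairs:

```python
import numpy as np

# Simulated (X, Y) pairs; the noise scale 4/3 is chosen so that corr(X, Y) = 0.6.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = x + rng.normal(scale=4/3, size=x.size)

# Least-squares slope for predicting X from Y: cov(X, Y)/var(Y) = ca/b.
slope = np.cov(x, y)[0, 1] / np.var(y, ddof=1)
print(slope)   # ≈ 0.36 = c², not 0.6
```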

Salviati instead tries to measure X, and finds a variable Z which is experimentally found to have a good chance of lying close to X. Let us suppose that the standard deviation of Z-X is 10% that of X.

How do these two approaches compare?

A correlation of 0.6 is generally considered pretty high in psychology and social science, especially if it’s established with p=0.001 to be above, say, 0.5. So Simplicio is quite pleased with himself.

A measurement whose range of error is 10% of the range of the thing measured is about as bad as it could be and still be called a measurement. (One might argue that any sort of entanglement whatever is a measurement, but one would be wrong.) It’s a rubber tape measure. By that standard, Salviati is doing rather badly.

In effect, Simplicio is trying to predict someone’s weight from their height, while Salviati is putting them on a (rather poor) weighing machine (and both, presumably, are putting their subjects on a very expensive and accurate weighing machine to obtain their true weights).

So we are comparing a good correlation with a bad measurement. How do they stack up? Let us suppose that the underlying reality is that Y = X + D1 and Z = X + D2, where X, D1, and D2 are normally distributed and uncorrelated (and causally unrelated, which is a stronger condition). I’m choosing the normal distribution because it’s easy to calculate exact numbers, but I don’t believe the conclusions would be substantially different for other distributions.

For convenience, assume the variables are normalised to all have mean zero, and let X, D1, and D2 have standard deviations 1, d1, and d2 respectively.
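If you want to play with the numbers, here is a minimal simulation sketch of this model (Python with NumPy; the function name and seed are mine):

```python
import numpy as np

def simulate(n, d1, d2, seed=0):
    """Draw n samples of (X, Y, Z): X standard normal, Y = X + D1, Z = X + D2,
    with D1 ~ N(0, d1²) and D2 ~ N(0, d2²) independent of X and of each other."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = x + rng.normal(scale=d1, size=n)
    z = x + rng.normal(scale=d2, size=n)
    return x, y, z

# Example run with the parameter values derived in the next two paragraphs:
x, y, z = simulate(1_000_000, d1=4/3, d2=0.1)
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x, z)[0, 1])   # ≈ 0.6 and ≈ 0.995
```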

Z-X is D2, so d2 = 0.1. The correlation between Z and X is c(X,Z) = cov(X,Z)/(sd(X)sd(Z)) = 1/sqrt(1 + d2²) = 0.995.

The correlation between X and Y is c(X,Y) = 1/sqrt(1 + d1²) = 0.6, so d1 = 1.333.
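A couple of lines of arithmetic (Python again, purely as a check of these two values):

```python
import numpy as np

d2 = 0.1
print(1 / np.sqrt(1 + d2**2))     # c(X,Z) ≈ 0.995

c_xy = 0.6
print(np.sqrt(1 / c_xy**2 - 1))   # d1 ≈ 1.333 (exactly 4/3)
```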

We immediately see something suspicious here. Even a terrible measurement yields a sky-high correlation. Or put the other way round, if you’re bothering to measure correlations, your data are rubbish. Even this “good” correlation gives a signal to noise ratio of less than 1. But let us proceed to calculate the mutual informations. How much do Y and Z tell you about X, separately or together?

For the bivariate normal distribution, the mutual information between variables A and B with correlation c is lg(I), where lg is the binary logarithm and I = sd(A)/sd(A|B). (The denominator here—the standard deviation of A conditional on the value of B—happens to be independent of the particular value of B for this distribution.) This works out to 1/sqrt(1 - c²). So the mutual information is -lg(sqrt(1 - c²)).

             corr.   mut. inf.
Simplicio    0.6     0.3219
Salviati     0.995   3.3291
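Those figures come straight from the formula above; a quick check in code (the function name is mine):

```python
import numpy as np

def mutual_information_bits(c):
    """Mutual information between two jointly normal variables with correlation c."""
    return -0.5 * np.log2(1 - c**2)

for name, c in [("Simplicio", 0.6), ("Salviati", 1 / np.sqrt(1.01))]:
    print(name, round(mutual_information_bits(c), 4))
# Simplicio 0.3219
# Salviati 3.3291
```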

What can you do with one third of a bit? If Simplicio tries to predict just the sign of X from the sign of Y, he will be right only 70% of the time (i.e. cos⁻¹(-c(X,Y))/π). Salviati will be right 96.8% of the time. Salviati’s estimate will even be in the right decile 89% of the time, while on that task Simplicio can hardly do better than chance. So even a good correlation is useless as a measurement.
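The sign-prediction percentages are easy to verify by simulation under the same model assumptions; a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)
y = x + rng.normal(scale=4/3, size=n)   # Simplicio's predictor
z = x + rng.normal(scale=0.1, size=n)   # Salviati's measurement

# Empirical probability of getting the sign of X right, next to the analytic value arccos(-c)/pi.
for name, w, c in [("Simplicio", y, 0.6), ("Salviati", z, 1 / np.sqrt(1.01))]:
    print(name, np.mean(np.sign(w) == np.sign(x)), np.arccos(-c) / np.pi)
# Simplicio ≈ 0.70, Salviati ≈ 0.97
```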

Simplicio and Salviati show their results to Prof. Sagredo. Simplicio can’t figure out how Salviati did so much better without taking measurements on thousands of samples. Salviati seemed to just think about the problem and come up with a contraption out of nowhere that did the job, without doing a single statistical test. “But at least,” says Simplicio, “you can’t throw away my 0.3219, it all adds up!” Sagredo points out that it literally does not add up. The information gained about X from Y and Z together is not 0.3219 + 3.3291 = 3.6510 bits. The correct result is found from the standard deviation of X conditional on both Y and Z, which is sqrt(1/(1 + 1/d1² + 1/d2²)). The information gained is then lg(sqrt(1 + 1/d1² + 1/d2²)) = 0.5*lg(101.5625) = 3.3331. The extra information over knowing just Z is only 0.0040, or 1/250 of a bit, because nearly all of Simplicio’s information is already included in Salviati’s.
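The arithmetic of that last step can be reproduced directly; a short sketch under the same assumptions:

```python
import numpy as np

d1, d2 = 4/3, 0.1
info_y  = 0.5 * np.log2(1 + 1/d1**2)              # ≈ 0.3219 bits from Y alone
info_z  = 0.5 * np.log2(1 + 1/d2**2)              # ≈ 3.3291 bits from Z alone
info_yz = 0.5 * np.log2(1 + 1/d1**2 + 1/d2**2)    # ≈ 3.3331 bits from Y and Z together
print(info_y + info_z)                            # 3.6510, which is not the right answer
print(info_yz, info_yz - info_z)                  # 3.3331, and the ≈ 0.0040 increment over Z alone
```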

Sagredo tells Simplicio to go away and come up with some real data.