Correlation Does in Fact Imply Causation
(Cross-posted from my personal website)
Epistemic status: I’ve become slowly convinced of this broad point over the last handful of years, but pretty much every time I mention it in conversation people think it’s wrong. I wrote this in part looking for counterexamples (and to explain the point clearly enough that targeted counterexamples can be identified).
Perhaps the single most-quoted phrase about statistics1 is that ‘correlation does not imply causation.’ It’s a phrase I’ve spoken hundreds of times, even after the ideas that resulted in this essay were broadly developed. It’s often a useful educational tool for beginner-level students, and it’s convenient as a shorthand description of a failure of scientific reasoning that’s disturbingly common: just because A correlates with B, it doesn’t mean that A causes B. The classic example is that ice cream sales correlate with violent crime rates, but that doesn’t mean ice cream fuels crime — and of course this is true, and anyone still making base-level errors is well-served by that catchphrase ‘correlation does not imply causation’.
The thing is, our catchphrase is wrong — correlation does in fact imply2 causation. More precisely, if things are correlated, there exists a relatively short causal chain linking those things, with confidence one minus the p-value of the correlation. Far too many smart people think the catchphrase is literally true, and end up dismissing correlation as uninteresting. It’s of course possible for things to be correlated by chance, in the same way that it’s possible to flip a coin and get 10 heads in a row3, but as sample size increases this becomes less and less likely; that’s the whole point of calculating the p-value when testing for correlation. In other words, there are only two explanations for a correlation: coincidence or causation.
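To put a number on how fast coincidence dies off with sample size, here is a minimal simulation, a sketch of my own using only the Python standard library (the `pearson` and `frac_spurious` helpers are mine, not from any package): it counts how often two fully independent Gaussian variables show |r| > 0.5 purely by chance, at a small and a larger sample size.

```python
import random
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def frac_spurious(n, trials=500, threshold=0.5, seed=0):
    """How often two *independent* Gaussian samples of size n show |r| > threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        hits += abs(pearson(xs, ys)) > threshold
    return hits / trials

small = frac_spurious(10)   # tiny samples: chance correlations above 0.5 are common
large = frac_spurious(300)  # larger samples: they essentially vanish
print(small, large)
```

The exact fractions depend on the seed, but the qualitative gap is robust: coincidental correlation is a small-sample phenomenon.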
Let’s return to the ice cream example. It doesn’t take long to guess what’s really going on here: warm weather causes both the spike in violent crime (via increased occupancy of public space and increased irritability) and the craving for a cold treat. So no, ice cream does not cause violent crime. But they are causally linked, through a quite short causal pathway. There are three possible architectures for the pathway: A causes B, B causes A, and C causes both, either directly or indirectly4.
I would hate to push anyone back to the truly naive position that A correlating with B means A causes B, but let’s not say false things: correlation does in fact imply causation5; it just doesn’t show you which direction that causation flows.
Why do I care about correcting this phrase? Two reasons — it is bad as a community to have catchphrases that are factually false, and “correlation does not imply causation” can and has been used for dark arts before. Rather famously, Ronald Fisher spent decades arguing that there was insufficient evidence to conclude that smoking causes lung cancer—because correlation does not imply causation. The tobacco industry was grateful. Meanwhile, the correlation was telling us exactly what we should have been doing: not dismissing it, but designing experiments to determine which of the three causal architectures explained it. The answer, of course, was the obvious one. Correlation was trying to tell us something, and we spent decades pretending it wasn’t allowed to.
Notes
1. This one strikes closest to my heart as a longtime XKCD fan. Randall is almost gesturing at the point I make in this essay, but not quite. At the risk of thinking too hard about a joke (surely not a sin for this particular comic), the key flaw here is the tiny sample size; this isn’t even correlation, the p-value is 0.5. If 1,000 people take a statistics class and we survey them before and after, then we could get a meaningful, statistically-robust correlation here — and unfortunately it would probably be the case that taking the class makes people more likely to believe this phrase.
2. I’m using ‘imply’ in an empirical rather than logical sense — it’s not that correlation proves causation the way a mathematical proof does, but that it provides evidence for causation, with strength proportional to sample size.
3. p=0.00195, being generous and taking the two-tailed value.
4. That “indirectly” is pointing at a fourth option, actually an infinite set of options: C causes A and D which causes B, C causes A and E which causes D which causes B, etc. I’m not including these because it’s natural to consider those as variants on C causes both. As an analogy: if one pushes over the first domino, did that cause the last domino to fall? A pedant might argue the actual cause of the last domino falling was the penultimate domino falling on it, and in some cases that precision can be useful, but most of the time it’s natural to just say the person who pushed the first domino caused the last one to fall over. In practice the causal chain is probably pretty short, because interesting correlations tend to be well below one, and after a few intermediates the correlation strength drops below the noise threshold of detection.
5. With the evidentiary strength you would expect based on the p-value of the correlation. Coincidence is always a possibility, but becomes pretty unlikely for correlations with a large sample size.
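Footnote 4’s last claim, that correlation strength decays along a causal chain, can be sanity-checked with a toy model. This is my own sketch under an assumed linear Gaussian chain, where each link keeps correlation r with its parent; in that model the end-to-end correlation should come out near r^k for k links.

```python
import random
import statistics

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def chain_corr(r, links, n=4000, seed=1):
    """corr(X0, X_last) for a causal chain X0 -> X1 -> ... with per-link correlation r."""
    rng = random.Random(seed)
    first, last = [], []
    for _ in range(n):
        x0 = x = rng.gauss(0, 1)
        for _ in range(links):
            # each link keeps correlation r with its parent and unit variance overall
            x = r * x + (1 - r * r) ** 0.5 * rng.gauss(0, 1)
        first.append(x0)
        last.append(x)
    return pearson(first, last)

print(chain_corr(0.8, 1))  # about 0.8: one strong link
print(chain_corr(0.8, 5))  # about 0.8**5, roughly 0.33: five links, the signal decays
```

This is why a detectable correlation usually points at a short chain: unless every link is near-perfect (the DNA case), a few intermediates push the end-to-end correlation into the noise.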
Some counterexamples:
Two independent Wiener processes (Brownian motion) have an expected correlation of zero, but the distribution of correlations is wide. You can easily see correlations of magnitude above 0.8 between two independent realisations of the standard Wiener process, independently of sample size. The problem is the strong autocorrelation of the Wiener process, which drastically broadens the distribution of the correlation coefficient.
The Moon is slowly receding from the Earth; the Andromeda galaxy is approaching the Milky Way. Therefore the respective distances between them have a large negative correlation. But there is no causal connection. More generally, any two time series, each of which exhibits a monotonic trend over time, will have a substantial correlation.
Control systems often exhibit large correlations between variables with no direct causal connection (and near-zero correlation between variables that do have a direct causal connection). Control systems are ubiquitous in the life sciences, social sciences, and technology. The causal relationships within them are cyclic, putting them outside the scope of Reichenbach’s principle and Pearl-style causal analysis.
(1) and (2) are well-known to people who analyse time series data, and there are standard methods (prewhitening and detrending respectively) for dealing with them. While (3) gets a mention from time to time, none of the papers I have seen on extending causal analysis to cyclic systems have (IMO) made much progress.
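Counterexample (1) is easy to reproduce, and prewhitening by first differences fixes it. A standard-library sketch of mine (the helper names are my own): two independent random walks frequently show a large spurious Pearson correlation, while the same test on their increments behaves honestly.

```python
import random
import statistics

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def random_walk(n, rng):
    x, path = 0.0, []
    for _ in range(n):
        x += rng.gauss(0, 1)
        path.append(x)
    return path

def diff(series):
    """First differences: recovers the independent increments underlying a walk."""
    return [b - a for a, b in zip(series, series[1:])]

rng = random.Random(0)
trials, n = 200, 500
raw_hits = diff_hits = 0
for _ in range(trials):
    w1, w2 = random_walk(n, rng), random_walk(n, rng)
    raw_hits += abs(pearson(w1, w2)) > 0.5               # spuriously large, often
    diff_hits += abs(pearson(diff(w1), diff(w2))) > 0.5  # essentially never
print(raw_hits / trials, diff_hits / trials)
```

Note that increasing n does nothing to the raw fraction: the autocorrelation, not the sample size, is the problem.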
Thank you for providing counterexamples! Quite useful, and it convinces me that the broadest version of this is false. And apologies for the slow reply—the linked papers took a while to absorb (as you may guess, I’m not a mathematician).
1. I tried for a while, but couldn’t properly follow this paper. From reading the abstract, going through the figures, your summary, and discussing it with Claude (Opus 4.6), I think the core issue here is the same problem as point 2 - the measures are not independent. My wife’s phrasing here is that “there’s a causal link between position at time t and position at time t+1” (in this case + noise, in the case of point 2 plus velocity times time between measurements).
2. Thank you—slightly embarrassed I didn’t find this one myself. I think the issue here is due to repeated measurement—the moon-earth distance now and at t+1s are obviously connected. I think the missing criterion here is independence—repeatedly measuring the same thing to see evolution over time breaks it. This requires some narrowing of the core thesis—my proposed added text is “I’m talking about correlations that survive proper statistical scrutiny—where the p-value is computed correctly for the data structure at hand. A naive Pearson correlation between two autocorrelated time series, or two monotonic trends, can look impressive while meaning nothing, but that’s not a real correlation any more than a loaded die produces a real test of probability. The well-known tools for handling this (differencing, detrending, cointegration tests, correcting for effective sample size) exist precisely because statisticians already understand that autocorrelation inflates apparent correlation. My claim applies to correlations that remain after you’ve done the statistics right.”
I think the claim “correlation of independent measures implies causation” is still interesting and surprising (probably not to this crowd who point out plenty of more rigorous prior work, but to most biologists at least), though a bit less exciting than my original claim. In particular, it’s less exciting because it doesn’t have a bulletproof checklist for “doing the statistics right”, which may or may not be possible to make.
3. This paper I could understand. I’ve worked through all their examples, and don’t think any are counterexamples to my point. Indeed, causation can exist without correlation (as you say, very common in biology and technology); I’ve never claimed otherwise. The paper shows some examples where correlation is high despite the causal path containing intermediates with lower correlation, which is intriguing, but I believe every correlation in the paper is explained by a causal path that links the two things correlated. As you say, not necessarily a direct causal connection, but still a “relatively short causal chain linking those things”—an indirect causal connection.
My favorite quote from the paper was “the simple maxim that ‘correlation does not imply causation’ having been superceded by methods such as those set out in [9, 14], and in shorter form in [10]”. Good to see that others (Pearl in addition to Reichenbach) have made something like the point I’m aiming to make here, as well as the restriction from time-series data that you brought to my attention in (2) - “cited limit attention to systems whose causal connections form a directed acyclic graph, together with certain further restrictions, and also do not consider dynamical systems or time series data.”
So far I don’t think cyclic systems cause any correlations between causally-unlinked variables, but the fact that people doing this formally haven’t solved it makes me hesitant to make any claims as an outsider to the field.
Thanks! I’ve been convinced of the general falsity of Reichenbach’s principle.
When we start introducing time into the mix I think it can be helpful to be somewhat more particular about how we define variables in a causal setting. When you have limited time resolution measurements of a quantity over time you can view it as a single “variable” that is involved in a causal cycle. But you could also “unroll” this cycle in time and view the quantity at each time as a separate variable. If you do this it seems to me like Pearl-style causal analysis generally holds. Even if X and Y have a monotonic pattern over time with prior time points of each series causing its later time points but no causal interaction between them, the X(t)s and Y(t)s aren’t correlated with each other, and the pattern over time is fully explained by the causal structure. This makes sense in a Pearl-style analysis because you would have an SCM for the series that looks something like X(t) = f(X(t-1), X(t-2), …), which is treating the X(t)s as separate variables. The correlation over time of X and Y isn’t a correlation between variables in this model; it involves mixing different variables together. If we treat them separately it seems like the Pearl-style analysis still works and makes no prediction errors, and in fact has the advantage of being robust to potential interventions.
Although I won’t claim to understand all the claims and concepts fully, I’ve found this paper to be interesting and helpful in this regard, and to me some of the concepts seem to have connections to the perspective I offer above.
Pearl-style SCMs assume that every single node in a graph is ontologically independent, which makes unrolled models as suggested not particularly great.
From a paper co-authored by Pearl himself:
( https://commonsensereasoning.org/2005/hopkins.pdf )
I haven’t been active in causality research in about 5 years, but I’m not aware of any good solutions to the time problem. I do know there are proposals for models that make improvements for causality involving sets of related variables, e.g.: platelet models. I think our own work on counterfactual probabilistic programming has a pretty strong basis, although the philosophy is fairly abridged in the paper.
tl;dr: The basic version of this is Reichenbach’s principle and is well-known. A lot of holes show up under a more advanced lens.
Other than the p-value part (would like to see your reasoning there—p-values are not probabilities, so this reads like a type error to me), this is Reichenbach’s principle ( https://plato.stanford.edu/entries/physics-Rpcc/ ): a correlation between A and B implies that either A causes B, B causes A, or there’s a common cause for both. (Or you are conditioning on something caused by both; c.f. Berkson’s paradox.)
Reichenbach’s principle is trivially true in Pearl-style causal networks, but philosophers argue that it’s false for the causal network of the whole universe. I haven’t found the short versions of the arguments convincing, but you can read about them in the linked SEP article.
There are a few things to keep in mind:
As the size of a causal network grows, the set of correlations grows far faster than the set of causal relationships, until almost all correlations become spurious. While there may be a causal chain in common, I’ll point out that the correlation between the set of nucleotides used by humans and by jellyfish is explained by an extremely long causal chain. Generally, common causes that are very far away lead to very weak correlations, but they can still be strong if each link in the causal chain is strong, as in DNA. So you might want to say “if the correlation is detectable with relatively low precision, then the causal chain linking them is probably short.” But if you are only willing to use low precision to detect correlations, then you run into the approximate unfaithfulness condition below.
A causal network is faithful if all correlations between variables can be explained by their relation in the causal network. In a deterministic setting such as within a program, causal networks generally are not faithful, so that you can have two variables on opposite sides of the program whose value is correlated for no good reason. In a more general setting, it’s easy to show that non-faithful networks are extremely unlikely under mild conditions, but Caroline Uhler showed that approximately unfaithful networks—networks which are close to an unfaithful network—are quite likely.
Thanks for a long and detailed comment! tl;dr—I wish I’d kept some of the more detailed footnotes from an earlier draft of the essay.
My phrasing there was insufficiently precise. You’re right to call out that (1- p-value) isn’t the probability of coincidence, because that’s ignoring the prior. I should have gone with something vaguer like “with the evidentiary strength you would expect based on the p-value of the correlation”.
I do more or less interpret p-values as probabilities, but in a world of theory rather than the real world—the summary I generally give is “probability of getting a result this extreme or more under the null hypothesis”, does that still seem like a type error?
I came across this SEP article while researching the essay, thinking surely I couldn’t be the first to think of the idea. Ultimately I cut the section about it because a) I found the SEP article fairly long and confusing and thought I could explain it more simply without referencing it and b) having discussed this point with dozens of people, including many with degrees in math and physics, exactly zero have ever brought up Reichenbach’s common cause, which led me to believe it wasn’t very well known (I’d guess at least 2 OOMs fewer people are familiar with that compared to the classic injunction “correlation does not imply causation”). That said, you’ve convinced me to add the footnote referencing Reichenbach’s common cause back in, thank you.
This is another point featured in an earlier version of the essay, inspired by the spurious correlations charts. Correlations between thousands of variables indeed grow extremely quickly; to me this is just an argument for correctly adjusting for multiple comparisons—pretty much all the p-values there are reported as <0.01, but with tens of thousands of comparisons of course you’d find a ton of those by chance and need to adjust for that—my claim is that very few of those would remain interesting after such an adjustment.
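A toy version of that adjustment, as a sketch of my own (the large-sample normal approximation for the correlation p-value is an assumption I’m making for simplicity): run a couple thousand null correlations, count how many pass p < 0.01 raw, then how many survive a Bonferroni correction.

```python
import math
import random
import statistics

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

rng = random.Random(0)
n, m = 200, 2000  # sample size per pair, number of independent pairs tested
pvals = []
for _ in range(m):
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rng.gauss(0, 1) for _ in range(n)]
    z = abs(pearson(xs, ys)) * math.sqrt(n)    # approximately N(0, 1) under the null
    pvals.append(math.erfc(z / math.sqrt(2)))  # two-tailed p, normal approximation
raw_hits = sum(p < 0.01 for p in pvals)             # roughly 1% pass by chance
bonferroni_hits = sum(p < 0.01 / m for p in pvals)  # almost none survive correction
print(raw_hits, bonferroni_hits)
```

With 2,000 null tests you expect about 20 “significant” results at the raw threshold, which is exactly the spurious-correlations-chart phenomenon; the corrected threshold wipes them out.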
This is the point about dominoes in footnote 4; I can Necker-cube between the two views of it being an extremely long causal chain or a very short one, “they share common ancestry”. I find the shorter view generally easier to reason about and it’s the one I use most of the time.
Can you recommend any non-technical summary of or examples of non-faithful networks? The closest semantic match I found after a couple iterations of search was this paper, but it’s denser than I prefer. The core point
To dive into the weeds a bit: The phrasing “with the evidentiary strength you would expect based on the p-value of the correlation” works, but the issue with p-values is much stronger than “that’s ignoring the prior.” There’s another type error there. p-values and priors do not mix. A p-value is a supremum over probabilities from a space of hypotheses. Suppose you have two possible null hypotheses: one generates your data with probability 10%, but you assign 0.000000000001% confidence to this hypothesis. The other generates your data with probability 1%, and you assign 99.999999999999% confidence to this hypothesis. Then you want your p-value to be approximately 0.01. But it’s actually 0.1. Sorry. p-values are fundamentally frequentist, not Bayesian; frequentism rejects the idea that you can express confidence as a probability.
Not surprised. The mathematics of causality is still not part of a standard stats curriculum. I can’t fully blame them—I dove deep into this area in my first year of grad school, in 2015, and concluded the field is still fairly primitive. But a disproportionate number of people here are familiar with it. (Indeed, I was excited to dive into causality in grad school when I saw an excuse to do so precisely because of my exposure to the field’s existence through LessWrong.)
Indeed, just as I could describe the causal chain linking humans and jellyfish in a few words. I would like there to be a formal way to render both DNA and dominoes as a short causal chain. But I don’t have one.
Think this got cut off.
A classic example from Pearl’s 2009 book: A and B are fair 0/1 coins, and C is their xor. Then the sets {A, C} and {B, C} each have pairwise independence, even though there are causal links A->C and B->C.
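The example is small enough to verify by exact enumeration rather than simulation; this is my own sketch, with an ad hoc `cov` helper over the four equally likely outcomes.

```python
from itertools import product

# the four equally likely outcomes of (A, B, C) with C = A xor B
outcomes = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

def cov(i, j, rows):
    """Covariance of components i and j over equally weighted rows."""
    e = lambda f: sum(f(r) for r in rows) / len(rows)
    return e(lambda r: r[i] * r[j]) - e(lambda r: r[i]) * e(lambda r: r[j])

print(cov(0, 2, outcomes), cov(1, 2, outcomes))  # 0.0 0.0: A,C and B,C are uncorrelated
# conditioning on B = 0 makes C identical to A, exposing the causal link A -> C:
print(cov(0, 2, [r for r in outcomes if r[1] == 0]))  # 0.25
```

Conditioning on the third variable is what reveals the dependence, which is also why collider-hunting algorithms like PC lean on exactly this pattern.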
Thanks for a detailed reply!
I guess this is what I get for my statistics-knowledge being an odd mongrel mix of Bayesian vibes gathered from LessWrong and frequentist stats from biology grad school. You seem much more knowledgeable on the formalizations so I trust you’re right that they can’t formally mix, but informally to me it seems like there must be some way to mix them—fundamentally, this is just “extraordinary claims require extraordinary evidence”. To use an example from my day job: if I’m testing whether knocking out a candidate gene that I think will increase tryptophan in a plant tissue does indeed increase the tryptophan, p < 0.05 feels like a fine cutoff. If I’m testing whether that same KO causes my plants to communicate with me telepathically, I’d be crazy to tell anyone about my results unless I was seeing p < 0.001 in at least two independent experiments.
The difference in required p-value threshold to me seems to come down to the prior. Perhaps there’s no formal framework that combines them, but empirically I think that’s what I’m doing.
You seem to be correct here, but it strikes me as strange because quantifying the evidence for causality was one of the central themes of most of my stats classes (with names like “intro statistics” and “statistical design of experiments”).
Possibly the synthesis here is that most of what non-math people learn doesn’t qualify as real math—I only took the super basic stuff that doesn’t get into the actual math of causality (despite minoring in math in undergrad and taking more math than most in a bio PhD).
Thinking about it a bit more, I think I’m happy to bite the bullet here and say the causal chain is long (though well-approximated by both of our very short descriptions), and that this is one of the exceptions to the point in footnote 4 that real-world correlations are generally well below 1. The causal chain is quite long, but DNA replication is extremely good—something like 10^-7 errors per base of DNA per generation—so the per-link fidelity is pretty dang close to 1 (my guess is domino setups by hobbyists also have failure rates under 10^-3). I’m sure we could find a handful more examples of things with long chains of very high correlation, but not that many—pretty few in biology are over 0.95.
Thanks, that was fun to work out, and appropriately simple for a humble biologist. In the 2x2 matrix of “corr or not” and “causal linkage or not”, this is the opposite square to the one I’m looking for—causal but not correlated. I agree that such things happen (they happen a lot in biology due to regulatory feedback loops), and I now see that this is indeed a non-faithful network.
Is there a similar toy example for “corr but no causal linkage”, e.g. “two variables on opposite sides of the program whose value is correlated for no good reason”? I spent a while asking Claude Opus 4.6 for one and it couldn’t come up with any (it came up with a toy where two variables would always be the same constant, but correlation there is undefined so I don’t count it).
As a Pearl-style half-Bayesian (see here) who chose the path of logic over stats more than a decade ago, I am not at all a good representative of the frequentist school. My impression though is that they would say these kinds of biases and assumptions live outside the domain of stats.
A Bayesian would just tell you to use Bayesian techniques such as credible intervals and maximum a posteriori estimators in lieu of p-values, which I have mostly not studied. I believe Bayesian techniques as a whole are conceptually simpler but computationally more difficult than frequentist ones.
Well spoken!
So first, if you’re talking about literal correlation coefficients, you can come up with examples where Y is a deterministic function of X, but their correlation coefficient is 0. Correlation coefficients measure linear relationships. Pretty sure you can do it using a piecewise-linear function with just two pieces, and probably also with a simple parabola.
(Just searched: yes you can. See page 4 here.)
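The parabola case can also be checked exactly on a tiny symmetric sample (my own sketch, with a hand-rolled `pearson` helper): Y is a deterministic function of X, yet the Pearson coefficient comes out exactly zero because the relationship has no linear component.

```python
import statistics

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

xs = [-2, -1, 0, 1, 2]    # symmetric around zero
ys = [x * x for x in xs]  # Y is a deterministic function of X
print(pearson(xs, ys))    # exactly 0.0
```

The cross-products cancel in pairs by symmetry, so this is exact rather than approximate.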
When we speak of correlation, I think it’s more helpful to consider mutual information and conditional mutual information, which capture all dependencies, not merely linear ones. (At least theoretically—convergence properties of conditional mutual information estimation are terrible, and a reason I gave up on my first causality project in grad school.)
Mutual information between two constants is defined. But it’s always 0. So that’s still a non-example.
I confess that I actually had to look up how to get spurious correlations in a causal network. Here’s an example showing you can do it even without determinism.
(Source: https://www.researchgate.net/figure/An-example-of-unfaithful-network-applying-d-separation-rules-to-the-DAG-would-indicate_fig15_392864356 )
....and then I realized this is still the opposite of what you’re looking for. Oops. I should have just gone to sleep.
Okay, you might be right about the spurious correlations not being possible in Pearl-style causal graphs.
What I recall from my investigations into this in 2015: (1) I was mostly concerned with trying to apply the PC algorithm for causal structure discovery to programs. PC essentially operates by looking for colliders, i.e.: independent variables that become dependent when conditioned on a mutual child. So spurious non-correlations were a bigger problem for me. (2) I found an old comment of mine where I mentioned having to condition on events of probability 0 to apply the do-calculus.
I feel like this is using the word somewhat differently than is meant by the phrase you are discussing. I’ve always interpreted “correlation doesn’t imply causation” to mean that if X and Y are correlated you can’t necessarily say that X → Y or Y → X (probably based on some prior about the direction like timing), not that correlation is somehow completely unrelated to causation.
Congrats on always interpreting the phrase that way! Of the folks I’ve discussed this with in person, every one of them had the interpretation that things can be correlated without either one causing the other or a common cause.
I would be surprised if “correlation doesn’t imply causation” would have become so popular if most people interpreted it as strictly expanding the options from X → Y or Y → X to include common cause (and now might ask a stats-professor friend to run a poll to see which of these interpretations most people hold in their intro stats class), but I think your interpretation is correct. I certainly don’t want to attack a strawman version of the phrase, but if >30% of people interpret it that way I’ll conclude it’s not a strawman.