# Confound it! Correlation is (usually) not causation! But why not?

It is widely understood that statistical correlation between two variables ≠ causation. But despite this admonition, people are routinely overconfident in claiming correlations to support particular causal interpretations and are surprised by the results of randomized experiments, suggesting that they are biased & systematically underestimating the prevalence of confounds/common-causation. I speculate that in realistic causal networks or DAGs, the number of possible correlations grows faster than the number of possible causal relationships. So confounds really are that common, and since people do not think in DAGs, the imbalance also explains overconfidence.

Full article: http://www.gwern.net/Causality

• Hi, I will put responses to your comment in the original thread here. I will do them slightly out of order.

I’m afraid I don’t understand you here. If we draw an arrow from A to B, either as a causal or Bayesian net, because we’ve observed correlation or causation (maybe we actually randomized A for once), how can there not be a relationship in any underlying reality and there actually be an ‘independence’ and the graph be ‘unfaithful’?

A Bayesian network is a statistical model. A statistical model is a set of joint distributions (under some restrictions). A Bayesian network model of a DAG G with vertices X1,...,Xk is a set of joint distributions that Markov factorize according to this DAG. This set will include distributions of the form p(x1,...,xk) = p(x1) ⋯ p(xk) which (trivially!) factorize with respect to any DAG including G, but which also have additional independences between any Xi and Xj even if G has an edge between Xi and Xj.

When we are talking about trying to learn a graph from a particular dataset, we are talking about a particular joint distribution in the set (in the model). If we happen to observe a dependence between Xi and Xj in the data then of course the corresponding edge will be “real”—in the particular distribution that generated the data. I am just saying the DAG corresponds to a set rather than any specific distribution for any particular dataset, and makes no universally quantified statements over the set about dependence, only about independence.

The same comment applies to causal models—but we aren’t talking about just an observed joint anymore. The dichotomy between a “causal structure” and a causal model (a set of causal structures) still applies. A causal model only makes universally quantified statements about independences in the “causal structures” in its set.

I tried to read that [Sander’s Festschrift essay], but I think I didn’t understand too much of it or its connection to this topic.

I will try to clarify this (assuming you are ok w/ interventions). Your question is “why is correlation usually not causation?”

One way you proposed to think about it is combinatorial for all pairwise relationships—if we look at all possible DAGs of n vertices, then you conjectured that the number of “pairwise causal relationships” is much smaller than the number of “pairwise associative relationships.” I think your conjecture is basically correct, and can be reduced to counting certain types of paths in DAGs. Specifically, pairwise causal relationships just correspond to directed paths, and pairwise associative relationships (assuming we aren’t conditioning on anything) correspond to marginally d-connected paths, which is a much larger set—so there are many more of them. However, I have not worked out the exact combinatorics, in part because even counting DAGs isn’t easy.

Another way to look at it, which is what Sander did in his essay, is to see how often we can reduce causal relationships to associative relationships. What I mean by that is that if we are interested in a particular pairwise causal relationship, say whether X affects Y, which we can study by looking at p(y | do(x)), then as we know in general we will not be able to say anything by looking at p(y | x). This is because in general p(y | do(x)) is not equal to p(y | x). But in some DAGs it is! And in other DAGs p(y | do(x)) is not equal to p(y | x), but is equal to some other function of observed data. If we can express p(y | do(x)) as a function of observed data this is very nice because we don’t need to run a randomized trial to obtain p(y | do(x)), we can just do an observational study. When people “adjust for confounders” what they are trying to do is express p(y | do(x)) as a function \sum_c p(y | x,c) p(c) of the observed data, for some set C.
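The adjustment formula can be made concrete with a toy binary model. The graph (C → X, C → Y, X → Y) and every probability below are made up purely for illustration:

```python
# Hypothetical binary model: C confounds X and Y (C -> X, C -> Y, X -> Y).
# All CPT numbers are invented for illustration.
p_c = {0: 0.5, 1: 0.5}                     # p(C = c)
p_x1_c = {0: 0.2, 1: 0.8}                  # p(X = 1 | C = c)
p_y1_xc = {(0, 0): 0.1, (0, 1): 0.5,       # p(Y = 1 | X = x, C = c)
           (1, 0): 0.4, (1, 1): 0.8}

def p_y1_do_x(x):
    """Interventional: p(Y=1 | do(X=x)) = sum_c p(Y=1 | x, c) p(c)."""
    return sum(p_y1_xc[(x, c)] * p_c[c] for c in (0, 1))

def p_y1_given_x(x):
    """Observational: p(Y=1 | X=x) = sum_c p(Y=1 | x, c) p(c | x)."""
    p_x_c = {c: (p_x1_c[c] if x == 1 else 1 - p_x1_c[c]) for c in (0, 1)}
    p_x = sum(p_x_c[c] * p_c[c] for c in (0, 1))
    return sum(p_y1_xc[(x, c)] * p_x_c[c] * p_c[c] / p_x for c in (0, 1))

# The naive conditional overstates the effect because C raises both X and Y.
print(p_y1_do_x(1), p_y1_given_x(1))
```

Here the adjusted answer p(Y=1 | do(X=1)) = 0.6 while the naive p(Y=1 | X=1) = 0.72: conditioning is not intervening, because C shifts X and Y together.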

So the question is, how often can we reduce p(y | do(x)) to some function of observed data (a weaker notion of “causation might be some sort of association if we massage the data enough”). It turns out, not surprisingly, that if we pick certain causal DAGs G containing X and Y (possibly with hidden variables), there will not be any function of the observed data equal to p(y | do(x)). What that means is that there exist two causal structures consistent with G which disagree on p(y | do(x)) but agree on the observed joint density. So the mapping from causal structures (which tell you what causal relationships there are) to joint distributions (which tell you what associative relationships there are) is many to one in general.

It will thus generally (but not always, given some assumptions) be the case that a causal model will contain causal structures which disagree about the p(y | do(x)) of interest, but agree on the joint distribution. So there is just not enough information in the joint distribution to get causality. To get around this, we need assumptions on our causal model to prevent this. What Sander is saying is that the assumptions we need to equate p(y | do(x)) with some function of the observed data are generally quite unrealistic in practice.

Another interesting combinatorial question here is: if we pick a pair X,Y, and then pick a DAG (w/ hidden variables potentially) at random, how likely is p(y | do(x)) to be some function of the observed joint (that is, there is “some sense” in which causation is a type of association)? Given a particular such DAG and X,Y I have a poly-time algorithm that will answer YES/NO, which may prove helpful.

It might help if I describe a concrete way to test my claim using just causal networks.

I understand what you are saying, but I don’t like your specific proposal because it is conflating two separate issues—a combinatorial issue (if we had infinite data, we would still have many more associative than causal relationships) and a statistical issue (at finite samples it might be hard to detect independences). I think we can do an empirical investigation of asymptotic behavior by just path counting, and avoid statistical issues (and issues involving “unfaithful” or “nearly unfaithful” (faithful but hard to tell at finite samples) distributions).

Nerd sniping question:

What is “\sum_{G a DAG w/ n vertices} \sum_{r is a directed path in G} 1” as a function of n?

What is “\sum_{G a DAG w/ n vertices} \sum_{r is a marginally d-connected path in G} 1” as a function of n?

A path is marginally d-connected if it does not contain a collider, i.e. a subpath of the form → * ←.
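For small n both sums can be brute-forced; here is a sketch that enumerates every labeled DAG and counts paths as ordered vertex sequences (so each undirected route is counted in both directions):

```python
from itertools import combinations, permutations, product

def is_acyclic(edges, n):
    """Kahn's algorithm: True iff the directed edge set is a DAG."""
    indeg = {v: 0 for v in range(n)}
    for _, b in edges:
        indeg[b] += 1
    stack = [v for v in range(n) if indeg[v] == 0]
    seen = 0
    while stack:
        v = stack.pop()
        seen += 1
        for a, b in edges:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    stack.append(b)
    return seen == n

def all_dags(n):
    """Yield every labeled DAG on n vertices as a set of directed edges."""
    pairs = list(combinations(range(n), 2))
    # Each unordered pair gets: no edge, or an edge in one of two directions.
    for choice in product((None, 0, 1), repeat=len(pairs)):
        edges = set()
        for (a, b), c in zip(pairs, choice):
            if c == 0:
                edges.add((a, b))
            elif c == 1:
                edges.add((b, a))
        if is_acyclic(edges, n):
            yield edges

def path_totals(n):
    """Sum over all DAGs of (# directed paths, # collider-free paths)."""
    directed = dconn = 0
    for edges in all_dags(n):
        adj = edges | {(b, a) for a, b in edges}
        for k in range(2, n + 1):
            for seq in permutations(range(n), k):
                if not all((seq[i], seq[i + 1]) in adj for i in range(k - 1)):
                    continue  # not a path in the skeleton
                if all((seq[i], seq[i + 1]) in edges for i in range(k - 1)):
                    directed += 1
                # marginally d-connected: no collider  -> v <-  along the path
                if not any((seq[i], seq[i + 1]) in edges
                           and (seq[i + 2], seq[i + 1]) in edges
                           for i in range(k - 2)):
                    dconn += 1
    return directed, dconn

for n in (2, 3, 4):
    print(n, path_totals(n))
```

Every directed path is collider-free, and so is its reversal, so the d-connected total is always at least twice the directed total; the gap widens with n.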

Edit: I realized this might be confusing, so I will clarify something. I mentioned above that within a given causal model (a set of causal structures) the mapping from causal structures (elements of a “causal model” set) to joint distributions (elements of a “statistical model consistent with a causal model” set) is in general many to one. That is, if our causal model is of a DAG A → B ← H → A (H not observed), then there exist two causal structures in this model that disagree on p(b | do(a)), but agree on p(a,b) (observed marginal density).

In addition, the mapping from causal models (sets) to statistical models (sets) consistent with a given causal model is also many to one. That is, the following two causal models A → B → C and A ← B ← C both map onto a statistical model which asserts that A is independent of C given B. This issue is different from what I was talking about. In both causal models above, we can obtain p(y | do(x)) for any Y,X from { A, B, C } as a function of observed data. For example p(c | do(a)) = p(c | a) in A → B → C, and p(c | do(a)) = p(c) in A ← B ← C. So in some sense the mapping from causal structures to joint distributions is one to one in DAGs with all nodes observed. We just don’t know which mapping to apply if we just look at a joint distribution, because we can’t tell different causal models apart. That is, these two distinct causal models are observationally indistinguishable given the data (both imply the same statistical model with the same independence). To tell these models apart we need to perform experiments, e.g. in a gene network try to knock out A, and see if C changes.

• (How many different DAGs are possible if you have 600 nodes? Apparently, >2^600.)

Naively, I would expect it to be closer to 600^600 (the number of possible directed graphs with 600 nodes).

And in fact, it is some complicated thing that seems to scale much more like n^n than like 2^n: http://en.wikipedia.org/wiki/Directed_acyclic_graph#Combinatorial_enumeration

• There’s an asymptotic approximation in the OEIS: a(n) ~ n! · 2^(n(n−1)/2) / (M·p^n), with M and p constants. So log(a(n)) = O(n^2), as opposed to log(2^n) = O(n), log(n!) = O(n log(n)), log(n^n) = O(n log(n)).
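The exact counts behind that asymptotic (OEIS A003024) follow Robinson's recurrence a(n) = \sum_{k=1..n} (−1)^(k+1) · C(n,k) · 2^(k(n−k)) · a(n−k), which is easy to compute directly:

```python
from math import comb

def num_dags(n):
    """Number of DAGs on n labeled vertices (OEIS A003024), via Robinson's recurrence."""
    a = [1]  # a(0) = 1 (the empty graph)
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

# 1, 3, 25, 543, 29281, 3781503, ...: log2 of these grows like n^2.
for n in range(1, 7):
    print(n, num_dags(n))
```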

• It appears I’ve accidentally nerd-sniped everyone! I was just trying to give an idea that it was really really big. (I had done some googling for the exact answer but they all seemed rather complicated, and rather than try and get an exact answer wrong, just give a lower bound.)

• If we allow cycles, then there are three possibilities for an edge between a pair of vertices in a directed graph: no edge, or an arrow in either direction. Since a graph of n vertices has n choose 2 pairs, the total number of DAGs of n vertices has an upper bound of 3^(n choose 2). This is much smaller than n^n.

edit: the last sentence is wrong.

Gwern, thanks for writing more, I will have more to say later.

• Since a graph of n vertices has n choose 2 pairs, the total number of DAGs of n vertices has an upper bound of 3^(n choose 2). This is much smaller than n^n.

It is much larger. 3^(n choose 2) = ((√3)^(n−1))^n, and (√3)^(n−1) is much larger than n.

3^(10 choose 2) is about 10^21.

Since the nodes of these graphs are all distinguishable, there is no need to factor out by graph isomorphism, so 3^(n choose 2) is the exact number.

• The precise asymptotic is λ · n! · 2^(n choose 2) · ω^(−n), as shown on page 4 of this article. Here λ and ω are constants between 1 and 2.

• That’s the number of all directed graphs, some of which certainly have cycles.

• That’s the number of all directed graphs, some of which certainly have cycles.

So it is. 3^(n choose 2) >> n^n stands though.

A lower bound for the number of DAGs can be found by observing that if we drop the directedness of the edges, there are 2^(n choose 2) undirected graphs on a set of n distinguishable vertices, and each of these corresponds to at least 1 DAG. Therefore there are at least that many DAGs, and 2^(n choose 2) is also much larger than n^n.
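The size comparison is easy to sanity-check numerically by comparing logarithms (to avoid astronomically large integers):

```python
from math import comb, log

# Natural logs of 3^(n choose 2), 2^(n choose 2), and n^n for a few n.
for n in (10, 100, 600):
    ln_3_choose = comb(n, 2) * log(3)   # log of 3^(n choose 2)
    ln_2_choose = comb(n, 2) * log(2)   # log of 2^(n choose 2)
    ln_n_to_n = n * log(n)              # log of n^n
    print(n, round(ln_3_choose), round(ln_2_choose), round(ln_n_to_n))
# Both (n choose 2)-exponent bounds grow like n^2, dwarfing n^n (whose log is n·log n).
```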

• Yup you are right, re: what is larger.

• You’re missing a 4th possibility. A & B are not meaningfully linked. This is very important when dealing with large sets of variables. Your measure of correlation will have a certain percentage of false positives, and discounting the possibility of false positives is important. If the probability of false positives is 1/X you should expect one false correlation for every X comparisons.

XKCD provides an excellent example: jelly beans.
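The expected-false-positive arithmetic is simple; using the xkcd setup (20 jelly-bean colors each tested at α = 0.05) as the example:

```python
alpha = 0.05   # per-test false-positive rate
m = 20         # number of independent comparisons (the 20 jelly-bean colors)

expected_false_positives = alpha * m        # one spurious "link" expected
p_at_least_one = 1 - (1 - alpha) ** m       # chance of at least one false positive

print(expected_false_positives, round(p_at_least_one, 3))
```

So even with no real effect anywhere, a headline-grabbing "green jelly beans linked to acne" result turns up about 64% of the time.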

• And we can’t explain away all of this low success rate as the result of illusory correlations being thrown up by the standard statistical problems with findings such as small n, sampling error (A & B just happened to sync together due to randomness), selection bias, publication bias, etc. I’ve read about those problems at length, and despite knowing about all that, there still seems to be a problem: correlation too often ≠ causation.

• I’m pointing out that your list isn’t complete, and not considering this possibility when we see a correlation is irresponsible. There are a lot of apparent correlations, and your three possibilities provide no means to reject false positives.

• You are fighting the hypothetical. In the least convenient possible world where no dataset is smaller than a petabyte and no one has ever heard of sampling error, would you magically be able to spin the straw of correlation into the gold of causation? No. Why not? That’s what I am discussing here.

• I suggest you move that point closer to the list of 3 possibilities—I too read that list and immediately thought, “...and also coincidence.”

The quote you posted above (“And we can’t explain away...”) is an unsupported assertion—a correct one in my opinion, but it really doesn’t do enough to direct attention away from false positive correlations. I suggest that you make it explicit in the OP that you’re talking about a hypothetical in which random coincidences are excluded from the start. (Upvoted the OP FWIW.)

(Also, if I understand it correctly, Ramsey theory suggests that coincidences are inevitable even in the absence of sampling error.)

• I agree with gwern’s decision to separate statistical issues from issues which arise even with infinite samples. Statistical issues are also extremely important, and deserve careful study; however, we should divide and conquer complicated subjects.

• I also agree—I’m recommending that he make that split clearer to the reader by addressing it up front.

• I see. I really didn’t expect this to be such an issue and come up in both the open thread & Main… I’ve tried rewriting the introduction a bit. If people still insist on getting snagged on that, I give up.

• I’m pointing out that your list isn’t complete,

It ends with “etc.” for Pete’s sake!

• ...no it doesn’t?

• So, um … how do we assess the likelihood of causation, assuming we can’t conduct an impromptu experiment on the spot?

• The keywords are ‘causal discovery,’ ‘structure learning.’ There is a large literature.

• “how else could this correlation happen if there’s no causal connection between A & B‽”

The main way to correct for this bias toward seeing causation where there is only correlation follows from this introspection: be more imaginative about how it could happen (other than by direct causation).

[The causation bias (does it have a name?) seems to express the availability bias. So, the corrective is to increase the availability of the other possibilities.]

• Maybe. I tend to doubt that eliciting a lot of alternate scenarios would eliminate the bias.

We might call it ‘hyperactive agent detection’, borrowing a page from the etiology of religious belief: https://en.wikipedia.org/wiki/Agent_detection which, now that I think about it, might stem from the same underlying belief—that things must have clear underlying causes. In one context, it gives rise to belief in gods; in another, to interpreting statistical findings like correlation as causation.

• stem from the same underlying belief—that things must have clear underlying causes

Hmm, a very interesting idea.

Related to the human tendency to find patterns in everything, maybe?

• Yes. Even more generally… might be an over-application of Occam’s razor: insisting everything be maximally simple? It’s maximally simple when A and B correlate to infer that one of them causes the other (instead of postulating a C common cause); it’s maximally simple to explain inexplicable events as due to a supernatural agent (instead of postulating a universe of complex underlying processes whose full explication fills up libraries without end and is still poorly understood).

• That sounds more like a poor understanding of Occam’s razor. Complex ontologically basic processes are not simpler than a handful of strict mathematical rules.

• Of course it’s (normatively) wrong. But if that particular error is what’s going on in people’s heads, it’ll manifest as a different pattern of errors (and hence useful interventions) than an availability bias: availability bias will be cured by forcing generation of scenarios, but a preference for oversimplification will cause the error even if you lay out the various scenarios on a silver platter, because the subject will still prefer the maximally simple version where A->B rather than A<-C->B.

• Yes. Even more generally… might be an over-application of Occam’s razor: insisting everything be maximally simple?

That is another aspect, I think, but I’d probably consider the underlying drive to be not the desire for simplicity but the desire for the world to make sense. To support this let me point out another universal human tendency—the yearning for stories, narratives that impose some structure on the surrounding reality (and these maps do not seek to match the territory as well as they can) and so provide the illusion of understanding and control.

In other words, humans are driven to always have some understandable map of the world around them—any map, even a pretty bad one. The lack of any map, the lack of understanding (even if false) of what’s happening, is well known to lead to severe stress and general unhappiness.

• The causation bias (does it have a name?)

Seems to me like a special case of privileging the hypothesis?

• A critical mistake in the lead analysis is the false assumption that where there is a causal relation between two variables, they will be correlated. This ignores that causes often cancel out. (Of course, not perfectly, but enough to make raw correlation a generally poor guide to causality.)

I think you have a fundamentally mistaken epistemology, gwern: you don’t see that correlations only support causality when they are predicted by a causal theory.

• If two variables are d-separated given a third, there is no partial correlation between the two, and the converse holds for almost all probability distributions consistent with the causal model. This is a theorem (Pearl 1.2.4). It’s true that not all causal effects are identifiable from statistical data, but there are general rules for determining which effects in a model are identifiable (e.g., front-door and back-door criteria).
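The easy direction of that theorem can be sketched in a linear-Gaussian chain A → B → C with made-up standardized path coefficients: A and C are d-separated given B, and the partial correlation vanishes exactly even though the marginal correlation does not:

```python
# Standardized path coefficients for the chain A -> B -> C (invented numbers).
r_ab, r_bc = 0.8, 0.5

# In a linear-Gaussian chain, correlation multiplies along the path.
r_ac = r_ab * r_bc

# Partial correlation of A and C given B.
partial_ac_b = (r_ac - r_ab * r_bc) / ((1 - r_ab**2) * (1 - r_bc**2)) ** 0.5

print(r_ac, partial_ac_b)  # nonzero marginal correlation, exactly zero partial correlation
```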

Therefore I don’t see how something like “causes often cancel out” could be true. Do you have any mathematical evidence?

I see nothing of this “fundamentally mistaken epistemology” that you claim to see in gwern’s essay.

• Causes do cancel out in some structures, and Nature does not select randomly (e.g. evolution might select for cancellation for homeostasis reasons). So the argument that most models are faithful is not always convincing.

This is a real issue, a causal version of a related issue in statistics where two types of statistical dependence cancel out such that there is a conditional independence in the data, but the underlying phenomena are related.
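Cancellation is easy to exhibit in a linear SEM. The coefficients below are invented so the two causal paths from A to B cancel exactly; by Wright's path-tracing rules (A exogenous, independent unit-variance noise), cov(A, B) is the sum of the path products:

```python
# Linear SEM:  A -> B directly, and A -> M -> B.
#   M = 1.0 * A + noise_M
#   B = 1.0 * A + (-1.0) * M + noise_B
b_direct = 1.0          # A -> B
b_am, b_mb = 1.0, -1.0  # A -> M, M -> B

# cov(A, B) = sum over directed paths of the products of their coefficients.
cov_ab = b_direct + b_am * b_mb

print(cov_ab)  # 0.0: A and B are uncorrelated although A causes B along two paths
# Yet do(A) with M held fixed shifts B by b_direct = 1.0, so the causal effect is real;
# the distribution is "unfaithful" to the graph.
```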

I don’t think gwern has a mistaken epistemology, however, because this issue exists. The issue just makes causal (and statistical) inference harder.