Causal Diagrams and Causal Models

Suppose a general-population survey shows that people who exercise less weigh more. You don’t have any known direction of time in the data—you don’t know which came first, the increased weight or the diminished exercise. And you didn’t randomly assign half the population to exercise less; you just surveyed an existing population.

The statis­ti­ci­ans who dis­cov­ered causal­ity were try­ing to find a way to dis­t­in­guish, within sur­vey data, the di­rec­tion of cause and effect—whether, as com­mon sense would have it, more obese peo­ple ex­er­cise less be­cause they find phys­i­cal ac­tivity less re­ward­ing; or whether, as in the virtue the­ory of metabolism, lack of ex­er­cise ac­tu­ally causes weight gain due to di­v­ine pun­ish­ment for the sin of sloth.

[Diagram: Weight → Exercise]  vs.  [Diagram: Exercise → Weight]

The usual way to re­solve this sort of ques­tion is by ran­dom­ized in­ter­ven­tion. If you ran­domly as­sign half your ex­per­i­men­tal sub­jects to ex­er­cise more, and af­ter­ward the in­creased-ex­er­cise group doesn’t lose any weight com­pared to the con­trol group [1], you could rule out causal­ity from ex­er­cise to weight, and con­clude that the cor­re­la­tion be­tween weight and ex­er­cise is prob­a­bly due to phys­i­cal ac­tivity be­ing less fun when you’re over­weight [3]. The ques­tion is whether you can get causal data with­out in­ter­ven­tions.

For a long time, the con­ven­tional wis­dom in philos­o­phy was that this was im­pos­si­ble un­less you knew the di­rec­tion of time and knew which event had hap­pened first. Among some philoso­phers of sci­ence, there was a be­lief that the “di­rec­tion of causal­ity” was a mean­ingless ques­tion, and that in the uni­verse it­self there were only cor­re­la­tions—that “cause and effect” was some­thing un­ob­serv­able and un­defin­able, that only un­so­phis­ti­cated non-statis­ti­ci­ans be­lieved in due to their lack of for­mal train­ing:

“The law of causal­ity, I be­lieve, like much that passes muster among philoso­phers, is a relic of a by­gone age, sur­viv­ing, like the monar­chy, only be­cause it is er­ro­neously sup­posed to do no harm.”—Ber­trand Rus­sell (he later changed his mind)

“Beyond such dis­carded fun­da­men­tals as ‘mat­ter’ and ‘force’ lies still an­other fetish among the in­scrutable ar­cana of mod­ern sci­ence, namely, the cat­e­gory of cause and effect.”—Karl Pearson

The fa­mous statis­ti­cian Fisher, who was also a smoker, tes­tified be­fore Congress that the cor­re­la­tion be­tween smok­ing and lung can­cer couldn’t prove that the former caused the lat­ter. We have rem­nants of this type of rea­son­ing in old-school “Cor­re­la­tion does not im­ply cau­sa­tion”, with­out the now-stan­dard ap­pendix, “But it sure is a hint”.

This skep­ti­cism was over­turned by a sur­pris­ingly sim­ple math­e­mat­i­cal ob­ser­va­tion.

Let’s say there are three vari­ables in the sur­vey data: Weight, how much the per­son ex­er­cises, and how much time they spend on the In­ter­net.

For sim­plic­ity, we’ll have these three vari­ables be bi­nary, yes-or-no ob­ser­va­tions: Y or N for whether the per­son has a BMI over 25, Y or N for whether they ex­er­cised at least twice in the last week, and Y or N for whether they’ve checked Red­dit in the last 72 hours.

Now let’s say our gath­ered data looks like this:

Over­weight Ex­er­cise In­ter­net #
Y Y Y 1,119
Y Y N 16,104
Y N Y 11,121
Y N N 60,032
N Y Y 18,102
N Y N 132,111
N N Y 29,120
N N N 155,033

And lo, merely by eye­bal­ling this data -

(which is to­tally made up, so don’t go ac­tu­ally be­liev­ing the con­clu­sion I’m about to draw)

- we now re­al­ize that be­ing over­weight and spend­ing time on the In­ter­net both cause you to ex­er­cise less, pre­sum­ably be­cause ex­er­cise is less fun and you have more al­ter­na­tive things to do, but ex­er­cis­ing has no causal in­fluence on body weight or In­ter­net use.

“What!” you cry. “How can you tell that just by in­spect­ing those num­bers? You can’t say that ex­er­cise isn’t cor­re­lated to body weight—if you just look at all the mem­bers of the pop­u­la­tion who ex­er­cise, they clearly have lower weights. 10% of ex­er­cisers are over­weight, vs. 28% of non-ex­er­cisers. How could you rule out the ob­vi­ous causal ex­pla­na­tion for that cor­re­la­tion, just by look­ing at this data?”


There’s a wee bit of math involved. It’s simple math—the part we’ll use doesn’t involve solving equations or complicated proofs—but we do have to introduce a wee bit of novel math to explain how the heck we got there from here.

Let me start with a ques­tion that turned out—to the sur­prise of many in­ves­ti­ga­tors in­volved—to be highly re­lated to the is­sue we’ve just ad­dressed.

Sup­pose that earth­quakes and bur­glars can both set off bur­glar alarms. If the bur­glar alarm in your house goes off, it might be be­cause of an ac­tual bur­glar, but it might also be be­cause a minor earth­quake rocked your house and trig­gered a few sen­sors. Early in­ves­ti­ga­tors in Ar­tifi­cial In­tel­li­gence, who were try­ing to rep­re­sent all high-level events us­ing prim­i­tive to­kens in a first-or­der logic (for rea­sons of his­tor­i­cal stu­pidity we won’t go into) were stymied by the fol­low­ing ap­par­ent para­dox:

  • If you tell me that my burglar alarm went off, I infer a burglar, which I will represent in my first-order-logical database using a theorem ⊢ ALARM → BURGLAR. (The symbol “⊢” is called “turnstile” and means “the logical system asserts that”.)

  • If an earthquake occurs, it will set off burglar alarms. I shall represent this using the theorem ⊢ EARTHQUAKE → ALARM, or “earthquake implies alarm”.

  • If you tell me that my alarm went off, and then further tell me that an earthquake occurred, it explains away my burglar alarm going off. I don’t need to explain the alarm by a burglar, because the alarm has already been explained by the earthquake. I conclude there was no burglar. I shall represent this by adding a theorem which says ⊢ (EARTHQUAKE & ALARM) → NOT BURGLAR.

Which rep­re­sents a log­i­cal con­tra­dic­tion, and for a while there were at­tempts to de­velop “non-mono­tonic log­ics” so that you could re­tract con­clu­sions given ad­di­tional data. This didn’t work very well, since the un­der­ly­ing struc­ture of rea­son­ing was a ter­rible fit for the struc­ture of clas­si­cal logic, even when mu­tated.

Just chang­ing cer­tain­ties to quan­ti­ta­tive prob­a­bil­ities can fix many prob­lems with clas­si­cal logic, and one might think that this case was like­wise eas­ily fixed.

Namely, just write a prob­a­bil­ity table of all pos­si­ble com­bi­na­tions of earth­quake or ¬earth­quake, bur­glar or ¬bur­glar, and alarm or ¬alarm (where ¬ is the log­i­cal nega­tion sym­bol), with the fol­low­ing en­tries:

Bur­glar Earthquake Alarm %
b e a .000162
b e ¬a .0000085
b ¬e a .0151
b ¬e ¬a .00168
¬b e a .0078
¬b e ¬a .002
¬b ¬e a .00097
¬b ¬e ¬a .972

Us­ing the op­er­a­tions of marginal­iza­tion and con­di­tion­al­iza­tion, we get the de­sired rea­son­ing back out:

Let’s start with the probability of a burglar given an alarm, p(burglar|alarm). By the law of conditional probability,

p(burglar|alarm) = p(alarm & burglar) / p(alarm)

i.e. the relative fraction of cases where there’s an alarm and a burglar, within the set of all cases where there’s an alarm.

The table doesn’t directly tell us p(alarm & burglar)/p(alarm), but by the law of marginal probability,

p(alarm & burglar) = p(alarm & burglar & earthquake) + p(alarm & burglar & ¬earthquake) = .000162 + .0151 = .015262

Similarly, to get the probability of an alarm going off, p(alarm), we add up all the different sets of events that involve an alarm going off—entries 1, 3, 5, and 7 in the table:

p(alarm) = .000162 + .0151 + .0078 + .00097 = .024032

So the en­tire set of calcu­la­tions looks like this:

  • If I hear a burglar alarm, I conclude there was probably (63%) a burglar: p(burglar|alarm) = .015262 / .024032 ≈ 0.63.

  • If I learn about an earthquake, I conclude there was probably (80%) an alarm: p(alarm|earthquake) = (.000162 + .0078) / (.000162 + .0000085 + .0078 + .002) ≈ 0.80.

  • I hear about an alarm and then hear about an earthquake; I conclude there was probably (98%) no burglar: p(¬burglar|alarm & earthquake) = .0078 / (.000162 + .0078) ≈ 0.98.

Thus, a joint prob­a­bil­ity dis­tri­bu­tion is in­deed ca­pa­ble of rep­re­sent­ing the rea­son­ing-be­hav­iors we want.
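Here is a minimal sketch of those two operations in Python, assuming only the made-up joint table above (the helper names are purely illustrative):

```python
# Joint distribution over (burglar, earthquake, alarm), copied from the table above.
joint = {
    (True,  True,  True):  0.000162,
    (True,  True,  False): 0.0000085,
    (True,  False, True):  0.0151,
    (True,  False, False): 0.00168,
    (False, True,  True):  0.0078,
    (False, True,  False): 0.002,
    (False, False, True):  0.00097,
    (False, False, False): 0.972,
}

def p(event):
    """Marginalization: add up every entry of the joint table where the event holds."""
    return sum(prob for (b, e, a), prob in joint.items() if event(b, e, a))

def cond(hypothesis, evidence):
    """Conditionalization: p(H|E) = p(H & E) / p(E)."""
    return p(lambda b, e, a: hypothesis(b, e, a) and evidence(b, e, a)) / p(evidence)

print(cond(lambda b, e, a: b,     lambda b, e, a: a))        # p(burglar | alarm)               ≈ 0.63
print(cond(lambda b, e, a: a,     lambda b, e, a: e))        # p(alarm | earthquake)            ≈ 0.80
print(cond(lambda b, e, a: not b, lambda b, e, a: a and e))  # p(¬burglar | alarm & earthquake) ≈ 0.98
```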

So is our prob­lem solved? Our work done?

Not in real life or real Ar­tifi­cial In­tel­li­gence work. The prob­lem is that this solu­tion doesn’t scale. Boy howdy, does it not scale! If you have a model con­tain­ing forty bi­nary vari­ables—alert read­ers may no­tice that the ob­served phys­i­cal uni­verse con­tains at least forty things—and you try to write out the joint prob­a­bil­ity dis­tri­bu­tion over all com­bi­na­tions of those vari­ables, it looks like this:

.0000000000112 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
.000000000000034 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN
.00000000000991 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYNY
.00000000000532 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYNN
.000000000145 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYNYY

(1,099,511,627,776 en­tries)

This isn’t merely a stor­age prob­lem. In terms of stor­age, a trillion en­tries is just a ter­abyte or three. The real prob­lem is learn­ing a table like that. You have to de­duce 1,099,511,627,776 float­ing-point prob­a­bil­ities from ob­served data, and the only con­straint on this gi­ant table is that all the prob­a­bil­ities must sum to ex­actly 1.0, a prob­lem with 1,099,511,627,775 de­grees of free­dom. (If you know the first 1,099,511,627,775 num­bers, you can de­duce the 1,099,511,627,776th num­ber us­ing the con­straint that they all sum to ex­actly 1.0.) It’s not the stor­age cost that kills you in a prob­lem with forty vari­ables, it’s the difficulty of gath­er­ing enough ob­ser­va­tional data to con­strain a trillion differ­ent pa­ram­e­ters. And in a uni­verse con­tain­ing sev­enty things, things are even worse.

So in­stead, sup­pose we ap­proached the earth­quake-bur­glar prob­lem by try­ing to spec­ify prob­a­bil­ities in a for­mat where… never mind, it’s eas­ier to just give an ex­am­ple be­fore stat­ing ab­stract rules.

First let’s add, for pur­poses of fur­ther illus­tra­tion, a new vari­able, “Re­ces­sion”, whether or not there’s a de­pressed econ­omy at the time. Now sup­pose that:

  • The prob­a­bil­ity of an earth­quake is 0.01.

  • The probability of a recession at any given time is 0.33 (or 1/3).

  • The prob­a­bil­ity of a bur­glary given a re­ces­sion is 0.04; or, given no re­ces­sion, 0.01.

  • An earth­quake is 0.8 likely to set off your bur­glar alarm; a bur­glar is 0.9 likely to set off your bur­glar alarm. And—we can’t com­pute this model fully with­out this info—the com­bi­na­tion of a bur­glar and an earth­quake is 0.95 likely to set off the alarm; and in the ab­sence of ei­ther bur­glars or earth­quakes, your alarm has a 0.001 chance of go­ing off any­way.

p(r) .33
p(¬r) .67
p(a|be) .95
p(a|b¬e) .9
p(a|¬be) .797
p(a|¬b¬e) .001
p(¬a|be) .05
p(¬a|b¬e) .1
p(¬a|¬be) .203
p(¬a|¬b¬e) .999
p(e) .01
p(¬e) .99
p(b|r) .04
p(b|¬r) .01
p(¬b|r) .96
p(¬b|¬r) .99

According to this model, if you want to know “The probability that an earthquake occurs”—just the probability of that one variable, without talking about any others—you can directly look up p(e) = .01. On the other hand, if you want to know the probability of a burglar striking, you have to first look up the probability of a recession (.33), and then p(b|r) and p(b|¬r), and sum up p(b|r)*p(r) + p(b|¬r)*p(¬r) to get a net probability of .04*.33 + .01*.67 ≈ .02 = p(b), a 2% probability that a burglar is around at some random time.

If we want to compute the joint probability of four values for all four variables—for example, the probability that there is no earthquake and no recession and a burglar and the alarm goes off—this causal model computes this joint probability as the product:

p(¬e, ¬r, b, a) = p(¬e) * p(¬r) * p(b|¬r) * p(a|b,¬e) = .99 * .67 * .01 * .9 ≈ .006

In general, to go from a causal model to a probability distribution, we compute, for each setting of all the variables, the product

p(x1, x2, …, xn) = p(x1|parents(x1)) * p(x2|parents(x2)) * … * p(xn|parents(xn))

multiplying together the conditional probability of each variable given the values of its immediate parents. (If a node has no parents, the probability table for it has just an unconditional probability, like “the chance of an earthquake is .01”.)
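As a concrete sketch of that product rule, here is the four-variable model written out in Python, using the numbers from the bullet list above (with 0.8 for the earthquake-only alarm case, as in the prose; the helper names are just illustrative):

```python
# Parent-conditional tables for the model: Recession -> Burglar -> Alarm <- Earthquake.
p_recession  = 0.33
p_earthquake = 0.01
p_burglar_given_recession = {True: 0.04, False: 0.01}
p_alarm_given_burglar_earthquake = {
    (True, True): 0.95, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.001,
}

def given(p_true, value):
    """Probability that a binary variable takes `value`, when its chance of being True is p_true."""
    return p_true if value else 1.0 - p_true

def joint(r, e, b, a):
    """p(r, e, b, a) = p(r) * p(e) * p(b|r) * p(a|b,e): each node conditioned only on its parents."""
    return (given(p_recession, r)
            * given(p_earthquake, e)
            * given(p_burglar_given_recession[r], b)
            * given(p_alarm_given_burglar_earthquake[(b, e)], a))

# p(burglar) = p(b|r)p(r) + p(b|not-r)p(not-r) ≈ 0.02, as computed in the text:
print(sum(joint(r, e, True, a) for r in (True, False)
                               for e in (True, False)
                               for a in (True, False)))

# p(no earthquake, no recession, burglar, alarm) = .99 * .67 * .01 * .9 ≈ 0.006:
print(joint(False, False, True, True))
```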

This is a causal model be­cause it cor­re­sponds to a world in which each event is di­rectly caused by only a small set of other events, its par­ent nodes in the graph. In this model, a re­ces­sion can in­di­rectly cause an alarm to go off—the re­ces­sion in­creases the prob­a­bil­ity of a bur­glar, who in turn sets off an alarm—but the re­ces­sion only acts on the alarm through the in­ter­me­di­ate cause of the bur­glar. (Con­trast to a model where re­ces­sions set off bur­glar alarms di­rectly.)

[Diagram: Recession → Burglar → Alarm ← Earthquake]  vs.  [Diagram with a direct Recession → Alarm arrow]

The first diagram implies that once we already know whether or not there’s a burglar, we don’t learn anything more about the probability of a burglar alarm, if we find out that there’s a recession:

p(alarm | burglar, recession) = p(alarm | burglar)

This is a fun­da­men­tal illus­tra­tion of the lo­cal­ity of causal­ity—once I know there’s a bur­glar, I know ev­ery­thing I need to know to calcu­late the prob­a­bil­ity that there’s an alarm. Know­ing the state of Bur­glar screens off any­thing that Re­ces­sion could tell me about Alarm—even though, if I didn’t know the value of the Bur­glar vari­able, Re­ces­sions would ap­pear to be statis­ti­cally cor­re­lated with Alarms. The pre­sent screens off the past from the fu­ture; in a causal sys­tem, if you know the ex­act, com­plete state of the pre­sent, the state of the past has no fur­ther phys­i­cal rele­vance to com­put­ing the fu­ture. It’s how, in a sys­tem con­tain­ing many cor­re­la­tions (like the re­ces­sion-alarm cor­re­la­tion), it’s still pos­si­ble to com­pute each vari­able just by look­ing at a small num­ber of im­me­di­ate neigh­bors.

Con­straints like this are also how we can store a causal model—and much more im­por­tantly, learn a causal model—with many fewer pa­ram­e­ters than the naked, raw, joint prob­a­bil­ity dis­tri­bu­tion.

Let’s illus­trate this us­ing a sim­plified ver­sion of this graph, which only talks about earth­quakes and re­ces­sions. We could con­sider three hy­po­thet­i­cal causal di­a­grams over only these two vari­ables:

p(r) 0.03
p(¬r) 0.97
p(e) 0.29
p(¬e) 0.71

p(E&R)=p(E)p(R)

p(e) 0.29
p(¬e) 0.71
p(r|e) 0.15
p(¬r|e) 0.85
p(r|¬e) 0.03
p(¬r|¬e) 0.97

p(E&R) = p(E)p(R|E)

p(r) 0.03
p(¬r) 0.97
p(e|r) 0.24
p(¬e|r) 0.76
p(e|¬r) 0.09
p(¬e|¬r) 0.91

p(E&R) = p(R)p(E|R)

Let’s consider the first hypothesis—that there are no causal arrows connecting earthquakes and recessions. If we build a causal model around this diagram, it has 2 real degrees of freedom—a degree of freedom for saying that the probability of an earthquake is, say, 29% (and hence that the probability of not-earthquake is necessarily 71%), and another degree of freedom for saying that the probability of a recession is 3% (and hence the probability of not-recession is constrained to be 97%).

On the other hand, the full joint probability distribution would have 3 degrees of freedom—a free choice of p(earthquake&recession), a choice of p(earthquake&¬recession), a choice of p(¬earthquake&recession), and then a constrained p(¬earthquake&¬recession) which must be equal to 1 minus the sum of the other three, so that all four probabilities sum to 1.0.

By the pigeonhole principle (you can’t fit 3 pigeons into 2 pigeonholes) there must be some joint probability distributions which cannot be represented in the first causal structure. This means the first causal structure is falsifiable; there’s survey data we can get which would lead us to reject it as a hypothesis. In particular, the first causal model requires:

p(E&R) = p(E) * p(R)

or equivalently

p(E|R) = p(E)

or equivalently

p(R|E) = p(R)

which is a con­di­tional in­de­pen­dence con­straint—it says that learn­ing about re­ces­sions doesn’t tell us any­thing about the prob­a­bil­ity of an earth­quake or vice versa. If we find that earth­quakes and re­ces­sions are highly cor­re­lated in the ob­served data—if earth­quakes and re­ces­sions go to­gether, or earth­quakes and the ab­sence of re­ces­sions go to­gether—it falsifies the first causal model.

For ex­am­ple, let’s say that in your state, an earth­quake is 0.1 prob­a­ble per year and a re­ces­sion is 0.2 prob­a­ble. If we sup­pose that earth­quakes don’t cause re­ces­sions, earth­quakes are not an effect of re­ces­sions, and that there aren’t hid­den aliens which pro­duce both earth­quakes and re­ces­sions, then we should find that years in which there are earth­quakes and re­ces­sions hap­pen around 0.02 of the time. If in­stead earth­quakes and re­ces­sions hap­pen 0.08 of the time, then the prob­a­bil­ity of a re­ces­sion given an earth­quake is 0.8 in­stead of 0.2, and we should much more strongly ex­pect a re­ces­sion any time we are told that an earth­quake has oc­curred. Given enough sam­ples, this falsifies the the­ory that these fac­tors are un­con­nected; or rather, the more sam­ples we have, the more we dis­be­lieve that the two events are un­con­nected.

On the other hand, we can’t tell apart the sec­ond two pos­si­bil­ities from sur­vey data, be­cause both causal mod­els have 3 de­grees of free­dom, which is the size of the full joint prob­a­bil­ity dis­tri­bu­tion. (In gen­eral, fully con­nected causal graphs in which there’s a line be­tween ev­ery pair of nodes, have the same num­ber of de­grees of free­dom as a raw joint dis­tri­bu­tion—and 2 nodes con­nected by 1 line are “fully con­nected”.) We can’t tell if earth­quakes are 0.1 likely and cause re­ces­sions with 0.8 prob­a­bil­ity, or re­ces­sions are 0.2 likely and cause earth­quakes with 0.4 prob­a­bil­ity (or if there are hid­den aliens which on 6% of years show up and cause earth­quakes and re­ces­sions with prob­a­bil­ity 1).
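To see the indistinguishability concretely, here is a small sketch that fills in the unstated conditional probabilities (chosen so the marginals match the 0.1 and 0.2 from the earlier example) and checks that both directed models yield exactly the same joint distribution:

```python
# Model A: Earthquake -> Recession.  p(e) = 0.1, p(r|e) = 0.8,
# with p(r|not-e) chosen so that the overall p(r) comes out to 0.2.
p_e = 0.1
p_r_given_e = {True: 0.8, False: (0.2 - 0.1 * 0.8) / 0.9}   # ≈ 0.133

# Model B: Recession -> Earthquake.  p(r) = 0.2, p(e|r) = 0.4,
# with p(e|not-r) chosen so that the overall p(e) comes out to 0.1.
p_r = 0.2
p_e_given_r = {True: 0.4, False: (0.1 - 0.2 * 0.4) / 0.8}   # = 0.025

def given(p_true, value):
    return p_true if value else 1.0 - p_true

for e in (True, False):
    for r in (True, False):
        model_a = given(p_e, e) * given(p_r_given_e[e], r)
        model_b = given(p_r, r) * given(p_e_given_r[r], e)
        print(e, r, round(model_a, 4), round(model_b, 4))   # the two joints are identical
```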

With larger universes, the difference between causal models and joint probability distributions becomes a lot more striking. If we’re trying to reason about a million binary variables connected in a huge causal model, and each variable could have up to four direct ‘parents’—four other variables that directly exert a causal effect on it—then the total number of free parameters would be at most… 16 million! (Each variable needs one conditional probability for each of the 2^4 = 16 possible joint settings of its four binary parents, so a million variables need at most 16 million numbers.)

The number of free parameters in a raw joint probability distribution over a million binary variables would be 2^1,000,000. Minus one.
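The arithmetic behind those two counts, as a short sketch:

```python
import math

n_vars = 1_000_000
max_parents = 4

# Sparse causal model: one conditional probability for each of the 2**4 = 16
# settings of a node's (at most) four binary parents, for each of a million nodes.
causal_parameters = n_vars * 2 ** max_parents          # 16,000,000

# Raw joint distribution: one entry per assignment, i.e. 2**1,000,000 of them, minus one
# for the sum-to-1 constraint. We only print how many decimal digits that count has.
joint_parameter_digits = math.floor(n_vars * math.log10(2)) + 1

print(causal_parameters)          # 16000000
print(joint_parameter_digits)     # 301030 -- the parameter count itself is a 301,030-digit number
```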

So causal mod­els which are less than fully con­nected—in which most ob­jects in the uni­verse are not the di­rect cause or di­rect effect of ev­ery­thing else in the uni­verse—are very strongly falsifi­able; they only al­low prob­a­bil­ity dis­tri­bu­tions (hence, ob­served fre­quen­cies) in an in­finites­i­mally tiny range of all pos­si­ble joint prob­a­bil­ity ta­bles. Causal mod­els very strongly con­strain an­ti­ci­pa­tion—dis­al­low al­most all pos­si­ble pat­terns of ob­served fre­quen­cies—and gain mighty Bayesian ad­van­tages when these pre­dic­tions come true.

To see this effect at work, let’s con­sider the three vari­ables Re­ces­sion, Bur­glar, and Alarm.

Alarm Bur­glar Re­ces­sion %
Y Y Y .012
N Y Y .0013
Y N Y .00287
N N Y .317
Y Y N .003
N Y N .000333
Y N N .00591
N N N .654

All three variables seem correlated to each other when considered two at a time. For example, if we consider Recessions and Alarms, they should seem correlated because recessions cause burglars which cause alarms. If we learn there was an alarm, for example, we conclude it’s more probable that there was a recession. So since all three variables are correlated, can we distinguish between, say, these three causal models?

[Diagram 1: Recession → Burglar → Alarm.  Diagram 2: Alarm → Burglar ← Recession.  Diagram 3: Recession → Alarm → Burglar.]

Yes we can! Among these causal models, the prediction which only the first model makes, which is not shared by either of the other two, is that once we know whether a burglar is there, we learn nothing more about whether there was an alarm by finding out that there was a recession, since recessions only affect alarms through the intermediary of burglars:

p(alarm | burglar, recession) = p(alarm | burglar)

But the third model, in which recessions directly cause alarms, which only then cause burglars, does not have this property. If I know that a burglar has appeared, it’s likely that an alarm caused the burglar—but it’s even more likely that there was an alarm if there was also a recession around to cause the alarm! So the third model predicts:

p(alarm | burglar, recession) > p(alarm | burglar)

And in the sec­ond model, where alarms and re­ces­sions both cause bur­glars, we again don’t have the con­di­tional in­de­pen­dence. If we know that there’s a bur­glar, then we think that ei­ther an alarm or a re­ces­sion caused it; and if we’re told that there’s an alarm, we’d con­clude it was less likely that there was a re­ces­sion, since the re­ces­sion had been ex­plained away.

(This may seem a bit clearer by con­sid­er­ing the sce­nario B->A<-E, where bur­glars and earth­quakes both cause alarms. If we’re told the value of the bot­tom node, that there was an alarm, the prob­a­bil­ity of there be­ing a bur­glar is not in­de­pen­dent of whether we’re told there was an earth­quake—the two top nodes are not con­di­tion­ally in­de­pen­dent once we con­di­tion on the bot­tom node.)

On the other hand, we can’t tell the difference between:

[Diagram: Recession → Burglar → Alarm]

vs.

[Diagram: Alarm → Burglar → Recession]

vs.

[Diagram: Burglar → Recession and Burglar → Alarm]

us­ing only this data and no other vari­ables, be­cause all three causal struc­tures pre­dict the same pat­tern of con­di­tional de­pen­dence and in­de­pen­dence—three vari­ables which all ap­pear mu­tu­ally cor­re­lated, but Alarm and Re­ces­sion be­come in­de­pen­dent once you con­di­tion on Bur­glar.
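Here is a minimal sketch checking both halves of that claim against the made-up frequencies in the table above: Recession and Alarm look correlated on their own, but the correlation is screened off once we condition on Burglar.

```python
# Observed frequencies over (alarm, burglar, recession), from the table above.
freq = {
    (True,  True,  True):  0.012,
    (False, True,  True):  0.0013,
    (True,  False, True):  0.00287,
    (False, False, True):  0.317,
    (True,  True,  False): 0.003,
    (False, True,  False): 0.000333,
    (True,  False, False): 0.00591,
    (False, False, False): 0.654,
}

def p(event):
    return sum(q for (a, b, r), q in freq.items() if event(a, b, r))

def cond(hypothesis, evidence):
    return p(lambda a, b, r: hypothesis(a, b, r) and evidence(a, b, r)) / p(evidence)

# Alarm and Recession look correlated on their own...
print(cond(lambda a, b, r: a, lambda a, b, r: r))           # p(alarm | recession)  ≈ 0.045
print(cond(lambda a, b, r: a, lambda a, b, r: not r))       # p(alarm | ¬recession) ≈ 0.013
# ...but once we condition on Burglar, Recession tells us nothing more about Alarm:
print(cond(lambda a, b, r: a, lambda a, b, r: b and r))     # p(alarm | burglar, recession)  ≈ 0.90
print(cond(lambda a, b, r: a, lambda a, b, r: b and not r)) # p(alarm | burglar, ¬recession) ≈ 0.90
```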

Being able to read off patterns of conditional dependence and independence is an art known as “D-separation”, and if you’re good at it you can glance at a diagram like this...

[Diagram: Season → Sprinkler, Season → Rain, Sprinkler → Wet, Rain → Wet, Wet → Slippery]

...and see that, once we already know the Season, whether the Sprinkler is on and whether it is Raining are conditionally independent of each other—if we’re told that it’s Raining we conclude nothing about whether or not the Sprinkler is on. But if we then further observe that the sidewalk is Slippery, then Sprinkler and Rain become conditionally dependent once more, because if the Sidewalk is Slippery then it is probably Wet, and this can be explained by either the Sprinkler or the Rain but probably not both, i.e. if we’re told that it’s Raining we conclude that it’s less likely that the Sprinkler was on.
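To watch that pattern fall out of actual numbers, here is a sketch of the Season/Sprinkler/Rain/Wet/Slippery network; the graph structure is the one just described, but the probabilities are invented purely for illustration:

```python
from itertools import product

# Invented probabilities for: Season -> {Sprinkler, Rain}, {Sprinkler, Rain} -> Wet, Wet -> Slippery.
p_dry_season = 0.5
p_sprinkler_given_season = {True: 0.7, False: 0.1}     # keyed by "is it the dry season?"
p_rain_given_season      = {True: 0.1, False: 0.6}
p_wet_given_sprinkler_rain = {(True, True): 0.99, (True, False): 0.9,
                              (False, True): 0.9, (False, False): 0.01}
p_slippery_given_wet     = {True: 0.8, False: 0.05}

def given(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(season, sprinkler, rain, wet, slippery):
    return (given(p_dry_season, season)
            * given(p_sprinkler_given_season[season], sprinkler)
            * given(p_rain_given_season[season], rain)
            * given(p_wet_given_sprinkler_rain[(sprinkler, rain)], wet)
            * given(p_slippery_given_wet[wet], slippery))

def p(event):
    return sum(joint(*w) for w in product((True, False), repeat=5) if event(*w))

def cond(hypothesis, evidence):
    return p(lambda *w: hypothesis(*w) and evidence(*w)) / p(evidence)

season_known         = lambda se, sp, ra, we, sl: se
season_and_rain      = lambda se, sp, ra, we, sl: se and ra
season_and_slippery  = lambda se, sp, ra, we, sl: se and sl
season_slippery_rain = lambda se, sp, ra, we, sl: se and sl and ra
sprinkler_on         = lambda se, sp, ra, we, sl: sp

# Given the Season, learning that it's Raining tells us nothing about the Sprinkler:
print(cond(sprinkler_on, season_known))         # 0.70
print(cond(sprinkler_on, season_and_rain))      # 0.70
# But once we also see the sidewalk is Slippery, Rain explains the wetness away,
# and the Sprinkler becomes less probable:
print(cond(sprinkler_on, season_and_slippery))  # ≈ 0.93
print(cond(sprinkler_on, season_slippery_rain)) # ≈ 0.72
```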


Okay, back to the obe­sity-ex­er­cise-In­ter­net ex­am­ple. You may re­call that we had the fol­low­ing ob­served fre­quen­cies:

Over­weight Ex­er­cise In­ter­net #
Y Y Y 1,119
Y Y N 16,104
Y N Y 11,121
Y N N 60,032
N Y Y 18,102
N Y N 132,111
N N Y 29,120
N N N 155,033

Do you see where this is go­ing?

“Er,” you re­ply, “Maybe if I had a calcu­la­tor and ten min­utes… you want to just go ahead and spell it out?”

Sure! First, we marginal­ize over the ‘ex­er­cise’ vari­able to get the table for just weight and In­ter­net use. We do this by tak­ing the 1,119 peo­ple who are YYY, over­weight and Red­dit users and ex­er­cis­ing, and the 11,121 peo­ple who are over­weight and non-ex­er­cis­ing and Red­dit users, YNY, and adding them to­gether to get 12,240 to­tal peo­ple who are over­weight Red­dit users:

Over­weight In­ter­net #
Y Y 12,240
Y N 76,136
N Y 47,222
N N 287,144

“And then?”

Well, that sug­gests that the prob­a­bil­ity of us­ing Red­dit, given that your weight is nor­mal, is the same as the prob­a­bil­ity that you use Red­dit, given that you’re over­weight. 47,222 out of 334,366 nor­mal-weight peo­ple use Red­dit, and 12,240 out of 88,376 over­weight peo­ple use Red­dit. That’s about 14% ei­ther way.

“And so we con­clude?”

Well, first we con­clude it’s not par­tic­u­larly likely that us­ing Red­dit causes weight gain, or that be­ing over­weight causes peo­ple to use Red­dit:

If ei­ther of those causal links ex­isted, those two vari­ables should be cor­re­lated. We shouldn’t find the lack of cor­re­la­tion or con­di­tional in­de­pen­dence that we just dis­cov­ered.
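Here is the same check done mechanically on the made-up survey counts, confirming that Internet use is (nearly) independent of weight while weight is clearly dependent on exercise (the 10% vs. 28% from earlier):

```python
# Made-up survey counts over (overweight, exercise, internet), from the table above.
counts = {
    (True,  True,  True):  1_119,
    (True,  True,  False): 16_104,
    (True,  False, True):  11_121,
    (True,  False, False): 60_032,
    (False, True,  True):  18_102,
    (False, True,  False): 132_111,
    (False, False, True):  29_120,
    (False, False, False): 155_033,
}

def n(event):
    """Number of survey respondents for whom the event holds."""
    return sum(c for (o, e, i), c in counts.items() if event(o, e, i))

# Marginalizing over Exercise: Internet use looks the same in both weight groups...
print(n(lambda o, e, i: i and o) / n(lambda o, e, i: o))            # p(internet | overweight) ≈ 0.139
print(n(lambda o, e, i: i and not o) / n(lambda o, e, i: not o))    # p(internet | normal)     ≈ 0.141

# ...whereas weight is clearly not independent of exercise:
print(n(lambda o, e, i: o and e) / n(lambda o, e, i: e))            # p(overweight | exercise)  ≈ 0.10
print(n(lambda o, e, i: o and not e) / n(lambda o, e, i: not e))    # p(overweight | ¬exercise) ≈ 0.28
```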

Next, imag­ine that the real causal graph looked like this:

In this graph, ex­er­cis­ing causes you to be less likely to be over­weight (due to the virtue the­ory of metabolism), and ex­er­cis­ing causes you to spend less time on the In­ter­net (be­cause you have less time for it).

But in this case we should not see that the groups who are/​aren’t over­weight have the same prob­a­bil­ity of spend­ing time on Red­dit. There should be an out­sized group of peo­ple who are both nor­mal-weight and non-Red­di­tors (be­cause they ex­er­cise), and an out­sized group of non-ex­er­cisers who are over­weight and Red­dit-us­ing.

So that causal graph is also ruled out by the data, as are oth­ers like:

Leav­ing only this causal graph:

Which says that weight and In­ter­net use ex­ert causal effects on ex­er­cise, but ex­er­cise doesn’t causally af­fect ei­ther.

All this dis­ci­pline was in­vented and sys­tem­atized by Judea Pearl, Peter Spirtes, Thomas Verma, and a num­ber of other peo­ple in the 1980s and you should be quite im­pressed by their ac­com­plish­ment, be­cause be­fore then, in­fer­ring causal­ity from cor­re­la­tion was thought to be a fun­da­men­tally un­solv­able prob­lem. The stan­dard vol­ume on causal struc­ture is Causal­ity by Judea Pearl.

Causal models (with specific probabilities attached) are sometimes known as “Bayesian networks” or “Bayes nets”, since they were invented by Bayesians and make use of Bayes’s Theorem. They have all sorts of neat computational advantages which are far beyond the scope of this introduction—e.g. in many cases you can split up a Bayesian network into parts, put each of the parts on its own computer processor, and then update on three different pieces of evidence at once using a neatly local message-passing algorithm in which each node talks only to its immediate neighbors, and when all the updates have finished propagating, the whole network has settled into the correct state. For more on this see Judea Pearl’s Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, which is the original book on Bayes nets and still the best introduction I’ve personally happened to read.


[1] Some­what to my own shame, I must ad­mit to ig­nor­ing my own ob­ser­va­tions in this de­part­ment—even af­ter I saw no dis­cernible effect on my weight or my mus­cu­la­ture from aer­o­bic ex­er­cise and strength train­ing 2 hours a day 3 times a week, I didn’t re­ally start be­liev­ing that the virtue the­ory of metabolism was wrong [2] un­til af­ter other peo­ple had started the skep­ti­cal dog­pile.

[2] I should men­tion, though, that I have con­firmed a per­sonal effect where eat­ing enough cook­ies (at a con­ven­tion where no pro­tein is available) will cause weight gain af­ter­ward. There’s no other dis­cernible cor­re­la­tion be­tween my carbs/​pro­tein/​fat al­lo­ca­tions and weight gain, just that eat­ing sweets in large quan­tities can cause weight gain af­ter­ward. This ad­mit­tedly does bear with the straight-out virtue the­ory of metabolism, i.e., eat­ing plea­surable foods is sin­ful weak­ness and hence pun­ished with fat.

[3] Or there might be some hid­den third fac­tor, a gene which causes both fat and non-ex­er­cise. By Oc­cam’s Ra­zor this is more com­pli­cated and its prob­a­bil­ity is pe­nal­ized ac­cord­ingly, but we can’t ac­tu­ally rule it out. It is ob­vi­ously im­pos­si­ble to do the con­verse ex­per­i­ment where half the sub­jects are ran­domly as­signed lower weights, since there’s no known in­ter­ven­tion which can cause weight loss.


Main­stream sta­tus: This is meant to be an in­tro­duc­tion to com­pletely bog-stan­dard Bayesian net­works, causal mod­els, and causal di­a­grams. Any de­par­tures from main­stream aca­demic views are er­rors and should be flagged ac­cord­ingly.

Part of the se­quence Highly Ad­vanced Episte­mol­ogy 101 for Beginners

Next post: “Stuff That Makes Stuff Happen”

Previous post: “The Fabric of Real Things”