# Causal Diagrams and Causal Models

Suppose a general-population survey shows that people who exercise less weigh more. You don't have any known direction of time in the data—you don't know which came first, the increased weight or the diminished exercise. And you didn't randomly assign half the population to exercise less; you just surveyed an existing population.

The statis­ti­ci­ans who dis­cov­ered causal­ity were try­ing to find a way to dis­t­in­guish, within sur­vey data, the di­rec­tion of cause and effect—whether, as com­mon sense would have it, more obese peo­ple ex­er­cise less be­cause they find phys­i­cal ac­tivity less re­ward­ing; or whether, as in the virtue the­ory of metabolism, lack of ex­er­cise ac­tu­ally causes weight gain due to di­v­ine pun­ish­ment for the sin of sloth.

*[Two causal diagrams: Overweight → Exercise, vs. Exercise → Overweight.]*

The usual way to re­solve this sort of ques­tion is by ran­dom­ized in­ter­ven­tion. If you ran­domly as­sign half your ex­per­i­men­tal sub­jects to ex­er­cise more, and af­ter­ward the in­creased-ex­er­cise group doesn’t lose any weight com­pared to the con­trol group [1], you could rule out causal­ity from ex­er­cise to weight, and con­clude that the cor­re­la­tion be­tween weight and ex­er­cise is prob­a­bly due to phys­i­cal ac­tivity be­ing less fun when you’re over­weight [3]. The ques­tion is whether you can get causal data with­out in­ter­ven­tions.

For a long time, the con­ven­tional wis­dom in philos­o­phy was that this was im­pos­si­ble un­less you knew the di­rec­tion of time and knew which event had hap­pened first. Among some philoso­phers of sci­ence, there was a be­lief that the “di­rec­tion of causal­ity” was a mean­ingless ques­tion, and that in the uni­verse it­self there were only cor­re­la­tions—that “cause and effect” was some­thing un­ob­serv­able and un­defin­able, that only un­so­phis­ti­cated non-statis­ti­ci­ans be­lieved in due to their lack of for­mal train­ing:

“The law of causal­ity, I be­lieve, like much that passes muster among philoso­phers, is a relic of a by­gone age, sur­viv­ing, like the monar­chy, only be­cause it is er­ro­neously sup­posed to do no harm.”—Ber­trand Rus­sell (he later changed his mind)

“Beyond such dis­carded fun­da­men­tals as ‘mat­ter’ and ‘force’ lies still an­other fetish among the in­scrutable ar­cana of mod­ern sci­ence, namely, the cat­e­gory of cause and effect.”—Karl Pearson

The fa­mous statis­ti­cian Fisher, who was also a smoker, tes­tified be­fore Congress that the cor­re­la­tion be­tween smok­ing and lung can­cer couldn’t prove that the former caused the lat­ter. We have rem­nants of this type of rea­son­ing in old-school “Cor­re­la­tion does not im­ply cau­sa­tion”, with­out the now-stan­dard ap­pendix, “But it sure is a hint”.

This skep­ti­cism was over­turned by a sur­pris­ingly sim­ple math­e­mat­i­cal ob­ser­va­tion.

Let’s say there are three vari­ables in the sur­vey data: Weight, how much the per­son ex­er­cises, and how much time they spend on the In­ter­net.

For sim­plic­ity, we’ll have these three vari­ables be bi­nary, yes-or-no ob­ser­va­tions: Y or N for whether the per­son has a BMI over 25, Y or N for whether they ex­er­cised at least twice in the last week, and Y or N for whether they’ve checked Red­dit in the last 72 hours.

Now let’s say our gath­ered data looks like this:

| Overweight | Exercise | Internet | # |
|:---:|:---:|:---:|---:|
| Y | Y | Y | 1,119 |
| Y | Y | N | 16,104 |
| Y | N | Y | 11,121 |
| Y | N | N | 60,032 |
| N | Y | Y | 18,102 |
| N | Y | N | 132,111 |
| N | N | Y | 29,120 |
| N | N | N | 155,033 |

And lo, merely by eye­bal­ling this data -

(which is to­tally made up, so don’t go ac­tu­ally be­liev­ing the con­clu­sion I’m about to draw)

- we now re­al­ize that be­ing over­weight and spend­ing time on the In­ter­net both cause you to ex­er­cise less, pre­sum­ably be­cause ex­er­cise is less fun and you have more al­ter­na­tive things to do, but ex­er­cis­ing has no causal in­fluence on body weight or In­ter­net use.

“What!” you cry. “How can you tell that just by in­spect­ing those num­bers? You can’t say that ex­er­cise isn’t cor­re­lated to body weight—if you just look at all the mem­bers of the pop­u­la­tion who ex­er­cise, they clearly have lower weights. 10% of ex­er­cisers are over­weight, vs. 28% of non-ex­er­cisers. How could you rule out the ob­vi­ous causal ex­pla­na­tion for that cor­re­la­tion, just by look­ing at this data?”

There's a wee bit of math involved. It's simple math—the part we'll use doesn't involve solving equations or complicated proofs—but we do have to introduce a wee bit of novel math to explain how the heck we got there from here.

Let me start with a ques­tion that turned out—to the sur­prise of many in­ves­ti­ga­tors in­volved—to be highly re­lated to the is­sue we’ve just ad­dressed.

Sup­pose that earth­quakes and bur­glars can both set off bur­glar alarms. If the bur­glar alarm in your house goes off, it might be be­cause of an ac­tual bur­glar, but it might also be be­cause a minor earth­quake rocked your house and trig­gered a few sen­sors. Early in­ves­ti­ga­tors in Ar­tifi­cial In­tel­li­gence, who were try­ing to rep­re­sent all high-level events us­ing prim­i­tive to­kens in a first-or­der logic (for rea­sons of his­tor­i­cal stu­pidity we won’t go into) were stymied by the fol­low­ing ap­par­ent para­dox:

• If you tell me that my burglar alarm went off, I infer a burglar, which I will represent in my first-order-logical database using a theorem ⊢ ALARM → BURGLAR. (The symbol "⊢" is called "turnstile" and means "the logical system asserts that".)

• If an earthquake occurs, it will set off burglar alarms. I shall represent this using the theorem ⊢ EARTHQUAKE → ALARM, or "earthquake implies alarm".

• If you tell me that my alarm went off, and then further tell me that an earthquake occurred, it explains away my burglar alarm going off. I don't need to explain the alarm by a burglar, because the alarm has already been explained by the earthquake. I conclude there was no burglar. I shall represent this by adding a theorem which says ⊢ (EARTHQUAKE & ALARM) → NOT BURGLAR.

Which rep­re­sents a log­i­cal con­tra­dic­tion, and for a while there were at­tempts to de­velop “non-mono­tonic log­ics” so that you could re­tract con­clu­sions given ad­di­tional data. This didn’t work very well, since the un­der­ly­ing struc­ture of rea­son­ing was a ter­rible fit for the struc­ture of clas­si­cal logic, even when mu­tated.

Just chang­ing cer­tain­ties to quan­ti­ta­tive prob­a­bil­ities can fix many prob­lems with clas­si­cal logic, and one might think that this case was like­wise eas­ily fixed.

Namely, just write a prob­a­bil­ity table of all pos­si­ble com­bi­na­tions of earth­quake or ¬earth­quake, bur­glar or ¬bur­glar, and alarm or ¬alarm (where ¬ is the log­i­cal nega­tion sym­bol), with the fol­low­ing en­tries:

Bur­glar Earthquake Alarm %
b e a .000162
b e ¬a .0000085
b ¬e a .0151
b ¬e ¬a .00168
¬b e a .0078
¬b e ¬a .002
¬b ¬e a .00097
¬b ¬e ¬a .972

Us­ing the op­er­a­tions of marginal­iza­tion and con­di­tion­al­iza­tion, we get the de­sired rea­son­ing back out:

Let’s start with the prob­a­bil­ity of a bur­glar given an alarm, p(bur­glar|alarm). By the law of con­di­tional prob­a­bil­ity,

$p(b|a) = \frac{p(ab)}{p(a)}$

i.e. the rel­a­tive frac­tion of cases where there’s an alarm and a bur­glar, within the set of all cases where there’s an alarm.

The table doesn't directly tell us p(alarm & burglar)/p(alarm), but by the law of marginal probability,

$p(ab) = p(abe) + p(ab\neg e) = .000162 + .0151 = .0153$

Similarly, to get the prob­a­bil­ity of an alarm go­ing off, p(alarm), we add up all the differ­ent sets of events that in­volve an alarm go­ing off—en­tries 1, 3, 5, and 7 in the table.

So the en­tire set of calcu­la­tions looks like this:

• If I hear a bur­glar alarm, I con­clude there was prob­a­bly (63%) a bur­glar.

$p(b|a) = \frac{p(ab)}{p(a)} = \frac{.0153}{.000162 + .0151 + .0078 + .00097} = .63$

• If I learn about an earth­quake, I con­clude there was prob­a­bly (80%) an alarm.

$p(a|e) = \frac{p(ae)}{p(e)} = \frac{.000162 + .0078}{.000162 + .0078 + .0000085 + .002} = .8$

• I hear about an alarm and then hear about an earth­quake; I con­clude there was prob­a­bly (98%) no bur­glar.

$p(\neg b|ae) = \frac{p(ae\neg b)}{p(ae)} = \frac{p(ae\neg b)}{p(aeb) + p(ae\neg b)} = \frac{.0078}{.000162 + .0078} = .98$

Thus, a joint prob­a­bil­ity dis­tri­bu­tion is in­deed ca­pa­ble of rep­re­sent­ing the rea­son­ing-be­hav­iors we want.
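To make the marginalization and conditionalization steps concrete, here is a minimal Python sketch (the table layout and helper names are my own illustration, not part of the original post) that recovers the three numbers above from the joint table:

```python
# Joint distribution over (burglar, earthquake, alarm),
# copied from the table above; keys are (b, e, a) truth values.
joint = {
    (True,  True,  True):  .000162,
    (True,  True,  False): .0000085,
    (True,  False, True):  .0151,
    (True,  False, False): .00168,
    (False, True,  True):  .0078,
    (False, True,  False): .002,
    (False, False, True):  .00097,
    (False, False, False): .972,
}

def p(event):
    """Marginalization: sum the joint over all worlds where `event` holds."""
    return sum(prob for (b, e, a), prob in joint.items() if event(b, e, a))

# Conditionalization: p(X|Y) = p(X & Y) / p(Y).
print(p(lambda b, e, a: b and a) / p(lambda b, e, a: a))  # ≈ 0.635, the "63%" above
print(p(lambda b, e, a: a and e) / p(lambda b, e, a: e))  # ≈ 0.80
print(p(lambda b, e, a: (not b) and a and e)
      / p(lambda b, e, a: a and e))                       # ≈ 0.98
```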

So is our prob­lem solved? Our work done?

Not in real life or real Ar­tifi­cial In­tel­li­gence work. The prob­lem is that this solu­tion doesn’t scale. Boy howdy, does it not scale! If you have a model con­tain­ing forty bi­nary vari­ables—alert read­ers may no­tice that the ob­served phys­i­cal uni­verse con­tains at least forty things—and you try to write out the joint prob­a­bil­ity dis­tri­bu­tion over all com­bi­na­tions of those vari­ables, it looks like this:

```
.0000000000112    YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
.000000000000034  YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYN
.00000000000991   YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYNY
.00000000000532   YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYNN
.000000000145     YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYNYY
…                 …
```

(1,099,511,627,776 en­tries)

This isn’t merely a stor­age prob­lem. In terms of stor­age, a trillion en­tries is just a ter­abyte or three. The real prob­lem is learn­ing a table like that. You have to de­duce 1,099,511,627,776 float­ing-point prob­a­bil­ities from ob­served data, and the only con­straint on this gi­ant table is that all the prob­a­bil­ities must sum to ex­actly 1.0, a prob­lem with 1,099,511,627,775 de­grees of free­dom. (If you know the first 1,099,511,627,775 num­bers, you can de­duce the 1,099,511,627,776th num­ber us­ing the con­straint that they all sum to ex­actly 1.0.) It’s not the stor­age cost that kills you in a prob­lem with forty vari­ables, it’s the difficulty of gath­er­ing enough ob­ser­va­tional data to con­strain a trillion differ­ent pa­ram­e­ters. And in a uni­verse con­tain­ing sev­enty things, things are even worse.

So in­stead, sup­pose we ap­proached the earth­quake-bur­glar prob­lem by try­ing to spec­ify prob­a­bil­ities in a for­mat where… never mind, it’s eas­ier to just give an ex­am­ple be­fore stat­ing ab­stract rules.

First let’s add, for pur­poses of fur­ther illus­tra­tion, a new vari­able, “Re­ces­sion”, whether or not there’s a de­pressed econ­omy at the time. Now sup­pose that:

• The prob­a­bil­ity of an earth­quake is 0.01.

• The probability of a recession at any given time is 0.33 (or 1/3).

• The prob­a­bil­ity of a bur­glary given a re­ces­sion is 0.04; or, given no re­ces­sion, 0.01.

• An earth­quake is 0.8 likely to set off your bur­glar alarm; a bur­glar is 0.9 likely to set off your bur­glar alarm. And—we can’t com­pute this model fully with­out this info—the com­bi­na­tion of a bur­glar and an earth­quake is 0.95 likely to set off the alarm; and in the ab­sence of ei­ther bur­glars or earth­quakes, your alarm has a 0.001 chance of go­ing off any­way.

p(e) = 0.01    p(¬e) = 0.99
p(r) = 0.33    p(¬r) = 0.67
p(b|r) = 0.04    p(¬b|r) = 0.96    p(b|¬r) = 0.01    p(¬b|¬r) = 0.99
p(a|be) = 0.95    p(¬a|be) = 0.05
p(a|b¬e) = 0.9    p(¬a|b¬e) = 0.1
p(a|¬be) = 0.797    p(¬a|¬be) = 0.203
p(a|¬b¬e) = 0.001    p(¬a|¬b¬e) = 0.999

According to this model, if you want to know "The probability that an earthquake occurs"—just the probability of that one variable, without talking about any others—you can directly look up p(e) = .01. On the other hand, if you want to know the probability of a burglar striking, you have to first look up the probability of a recession (.33), and then p(b|r) and p(b|¬r), and sum up p(b|r)*p(r) + p(b|¬r)*p(¬r) to get a net probability of .04*.33 + .01*.67 ≈ .02 = p(b), a 2% probability that a burglar is around at some random time.

If we want to compute the joint probability that all four variables take particular values—for example, the probability that there is no earthquake and no recession and a burglar and the alarm goes off—this causal model computes this joint probability as the product:

$p(\neg e)p(\neg r)p(b|\neg r)p(a|b\neg e) = .99 * .67 * .01 * .9 = .006 = 0.6\%$

In gen­eral, to go from a causal model to a prob­a­bil­ity dis­tri­bu­tion, we com­pute, for each set­ting of all the vari­ables, the product

$p(\mathbf{X}=\mathbf{x})= \prod_i p(X_i=x_i|\mathbf{PA_i}=\mathbf{pa_i})$

mul­ti­ply­ing to­gether the con­di­tional prob­a­bil­ity of each vari­able given the val­ues of its im­me­di­ate par­ents. (If a node has no par­ents, the prob­a­bil­ity table for it has just an un­con­di­tional prob­a­bil­ity, like “the chance of an earth­quake is .01”.)
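As a concrete check on this factorization, here is a minimal Python sketch of the Recession/Earthquake/Burglar/Alarm model (variable names are my own illustration); it reproduces both the 0.6% joint probability and the 2% marginal for Burglar computed above:

```python
from itertools import product

p_r = 0.33   # p(recession); no parents
p_e = 0.01   # p(earthquake); no parents
p_b = {True: 0.04, False: 0.01}                 # p(burglar | recession?)
p_a = {(True, True): 0.95, (True, False): 0.9,  # p(alarm | burglar?, earthquake?)
       (False, True): 0.797, (False, False): 0.001}

def bern(p_true, value):
    """Probability that a binary node with p(True)=p_true takes `value`."""
    return p_true if value else 1 - p_true

def joint(r, e, b, a):
    """Product of each node's probability given the values of its parents."""
    return (bern(p_r, r) * bern(p_e, e)
            * bern(p_b[r], b) * bern(p_a[(b, e)], a))

print(joint(r=False, e=False, b=True, a=True))  # ≈ 0.00597, the 0.6% above

# Marginal p(burglar): sum the joint over all settings of the other variables.
p_burglar = sum(joint(r, e, True, a)
                for r, e, a in product([True, False], repeat=3))
print(round(p_burglar, 3))                      # ≈ 0.02
```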

This is a causal model be­cause it cor­re­sponds to a world in which each event is di­rectly caused by only a small set of other events, its par­ent nodes in the graph. In this model, a re­ces­sion can in­di­rectly cause an alarm to go off—the re­ces­sion in­creases the prob­a­bil­ity of a bur­glar, who in turn sets off an alarm—but the re­ces­sion only acts on the alarm through the in­ter­me­di­ate cause of the bur­glar. (Con­trast to a model where re­ces­sions set off bur­glar alarms di­rectly.)

*[Two causal diagrams: one in which Recession acts on Alarm only through Burglar, vs. one in which Recession → Alarm directly.]*

The first di­a­gram im­plies that once we already know whether or not there’s a bur­glar, we don’t learn any­thing more about the prob­a­bil­ity of a bur­glar alarm, if we find out that there’s a re­ces­sion:

$p(a|b) = p(a|br)$

This is a fun­da­men­tal illus­tra­tion of the lo­cal­ity of causal­ity—once I know there’s a bur­glar, I know ev­ery­thing I need to know to calcu­late the prob­a­bil­ity that there’s an alarm. Know­ing the state of Bur­glar screens off any­thing that Re­ces­sion could tell me about Alarm—even though, if I didn’t know the value of the Bur­glar vari­able, Re­ces­sions would ap­pear to be statis­ti­cally cor­re­lated with Alarms. The pre­sent screens off the past from the fu­ture; in a causal sys­tem, if you know the ex­act, com­plete state of the pre­sent, the state of the past has no fur­ther phys­i­cal rele­vance to com­put­ing the fu­ture. It’s how, in a sys­tem con­tain­ing many cor­re­la­tions (like the re­ces­sion-alarm cor­re­la­tion), it’s still pos­si­ble to com­pute each vari­able just by look­ing at a small num­ber of im­me­di­ate neigh­bors.

Con­straints like this are also how we can store a causal model—and much more im­por­tantly, learn a causal model—with many fewer pa­ram­e­ters than the naked, raw, joint prob­a­bil­ity dis­tri­bu­tion.

Let’s illus­trate this us­ing a sim­plified ver­sion of this graph, which only talks about earth­quakes and re­ces­sions. We could con­sider three hy­po­thet­i­cal causal di­a­grams over only these two vari­ables:

p(r) = 0.03    p(¬r) = 0.97
p(e) = 0.29    p(¬e) = 0.71

## p(E&R) = p(E)p(R)

p(e) = 0.29    p(¬e) = 0.71
p(r|e) = 0.15    p(¬r|e) = 0.85    p(r|¬e) = 0.03    p(¬r|¬e) = 0.97

## p(E&R) = p(E)p(R|E)

p(r) = 0.03    p(¬r) = 0.97
p(e|r) = 0.24    p(¬e|r) = 0.76    p(e|¬r) = 0.09    p(¬e|¬r) = 0.91

## p(E&R) = p(R)p(E|R)

Let's consider the first hypothesis—that there are no causal arrows connecting earthquakes and recessions. If we build a causal model around this diagram, it has 2 real degrees of freedom—a degree of freedom for saying that the probability of an earthquake is, say, 29% (and hence that the probability of not-earthquake is necessarily 71%), and another degree of freedom for saying that the probability of a recession is 3% (and hence the probability of not-recession is constrained to be 97%).

On the other hand, the full joint probability distribution would have 3 degrees of freedom—a free choice of p(earthquake&recession), a choice of p(earthquake&¬recession), a choice of p(¬earthquake&recession), and then a constrained p(¬earthquake&¬recession) which must be equal to 1 minus the sum of the other three, so that all four probabilities sum to 1.0.

By the pi­geon­hole prin­ci­ple (you can’t fit 3 pi­geons into 2 pi­geon­holes) there must be some joint prob­a­bil­ity dis­tri­bu­tions which can­not be rep­re­sented in the first causal struc­ture. This means the first causal struc­ture is falsifi­able; there’s sur­vey data we can get which would lead us to re­ject it as a hy­poth­e­sis. In par­tic­u­lar, the first causal model re­quires:

$p(er) = p(e)p(r)$

or equivalently

$p(e|r) = p(e)$

or equivalently

$p(r|e) = p(r)$

which is a con­di­tional in­de­pen­dence con­straint—it says that learn­ing about re­ces­sions doesn’t tell us any­thing about the prob­a­bil­ity of an earth­quake or vice versa. If we find that earth­quakes and re­ces­sions are highly cor­re­lated in the ob­served data—if earth­quakes and re­ces­sions go to­gether, or earth­quakes and the ab­sence of re­ces­sions go to­gether—it falsifies the first causal model.

For ex­am­ple, let’s say that in your state, an earth­quake is 0.1 prob­a­ble per year and a re­ces­sion is 0.2 prob­a­ble. If we sup­pose that earth­quakes don’t cause re­ces­sions, earth­quakes are not an effect of re­ces­sions, and that there aren’t hid­den aliens which pro­duce both earth­quakes and re­ces­sions, then we should find that years in which there are earth­quakes and re­ces­sions hap­pen around 0.02 of the time. If in­stead earth­quakes and re­ces­sions hap­pen 0.08 of the time, then the prob­a­bil­ity of a re­ces­sion given an earth­quake is 0.8 in­stead of 0.2, and we should much more strongly ex­pect a re­ces­sion any time we are told that an earth­quake has oc­curred. Given enough sam­ples, this falsifies the the­ory that these fac­tors are un­con­nected; or rather, the more sam­ples we have, the more we dis­be­lieve that the two events are un­con­nected.
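The arithmetic in that example, as a small sketch:

```python
p_quake, p_recession = 0.1, 0.2

# If earthquakes and recessions are causally unconnected, the model requires:
print(p_quake * p_recession)   # 0.02: expected frequency of both in one year

# But if we instead observe both together 0.08 of the time:
print(0.08 / p_quake)          # 0.8 = p(recession | earthquake), not 0.2
```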

On the other hand, we can’t tell apart the sec­ond two pos­si­bil­ities from sur­vey data, be­cause both causal mod­els have 3 de­grees of free­dom, which is the size of the full joint prob­a­bil­ity dis­tri­bu­tion. (In gen­eral, fully con­nected causal graphs in which there’s a line be­tween ev­ery pair of nodes, have the same num­ber of de­grees of free­dom as a raw joint dis­tri­bu­tion—and 2 nodes con­nected by 1 line are “fully con­nected”.) We can’t tell if earth­quakes are 0.1 likely and cause re­ces­sions with 0.8 prob­a­bil­ity, or re­ces­sions are 0.2 likely and cause earth­quakes with 0.4 prob­a­bil­ity (or if there are hid­den aliens which on 6% of years show up and cause earth­quakes and re­ces­sions with prob­a­bil­ity 1).

With larger uni­verses, the differ­ence be­tween causal mod­els and joint prob­a­bil­ity dis­tri­bu­tions be­comes a lot more strik­ing. If we’re try­ing to rea­son about a mil­lion bi­nary vari­ables con­nected in a huge causal model, and each vari­able could have up to four di­rect ‘par­ents’ - four other vari­ables that di­rectly ex­ert a causal effect on it—then the to­tal num­ber of free pa­ram­e­ters would be at most… 16 mil­lion!

The number of free parameters in a raw joint probability distribution over a million binary variables would be $2^{1,000,000}$. Minus one.
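A quick sanity check on those two parameter counts (a sketch, assuming the at-most-four-parents bound above):

```python
import math

# Causal model: each of 1,000,000 binary nodes has at most 4 binary parents,
# so its conditional table holds at most 2**4 = 16 probabilities.
print(2**4 * 1_000_000)                    # 16,000,000 free parameters, at most

# Raw joint distribution: 2**1,000,000 - 1 free parameters. Far too big to
# print; count its decimal digits instead.
print(int(1_000_000 * math.log10(2)) + 1)  # ≈ 301,030 digits
```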

So causal mod­els which are less than fully con­nected—in which most ob­jects in the uni­verse are not the di­rect cause or di­rect effect of ev­ery­thing else in the uni­verse—are very strongly falsifi­able; they only al­low prob­a­bil­ity dis­tri­bu­tions (hence, ob­served fre­quen­cies) in an in­finites­i­mally tiny range of all pos­si­ble joint prob­a­bil­ity ta­bles. Causal mod­els very strongly con­strain an­ti­ci­pa­tion—dis­al­low al­most all pos­si­ble pat­terns of ob­served fre­quen­cies—and gain mighty Bayesian ad­van­tages when these pre­dic­tions come true.

To see this effect at work, let’s con­sider the three vari­ables Re­ces­sion, Bur­glar, and Alarm.

| Alarm | Burglar | Recession | % |
|:---:|:---:|:---:|---:|
| Y | Y | Y | .012 |
| N | Y | Y | .0013 |
| Y | N | Y | .00287 |
| N | N | Y | .317 |
| Y | Y | N | .003 |
| N | Y | N | .000333 |
| Y | N | N | .00591 |
| N | N | N | .654 |

All three vari­ables seem cor­re­lated to each other when con­sid­ered two at a time. For ex­am­ple, if we con­sider Re­ces­sions and Alarms, they should seem cor­re­lated be­cause re­ces­sions cause bur­glars which cause alarms. If we learn there was an alarm, for ex­am­ple, we con­clude it’s more prob­a­ble that there was a re­ces­sion. So since all three vari­ables are cor­re­lated, can we dis­t­in­guish be­tween, say, these three causal mod­els?

$p(rba) = p(r)p(b|r)p(a|b)$

$p(rba) = p(r)p(a)p(b|ra)$

$p(rab) = p(r)p(a|r)p(b|a)$

Yes we can! Among these causal mod­els, the pre­dic­tion which only the first model makes, which is not shared by ei­ther of the other two, is that once we know whether a bur­glar is there, we learn noth­ing more about whether there was an alarm by find­ing out that there was a re­ces­sion, since re­ces­sions only af­fect alarms through the in­ter­me­di­ary of bur­glars:

$p(a|br) = p(a|b)$

But the third model, in which re­ces­sions di­rectly cause alarms, which only then cause bur­glars, does not have this prop­erty. If I know that a bur­glar has ap­peared, it’s likely that an alarm caused the bur­glar—but it’s even more likely that there was an alarm, if there was a re­ces­sion around to cause the alarm! So the third model pre­dicts:

$p(a|br) \neq p(a|b)$

And in the sec­ond model, where alarms and re­ces­sions both cause bur­glars, we again don’t have the con­di­tional in­de­pen­dence. If we know that there’s a bur­glar, then we think that ei­ther an alarm or a re­ces­sion caused it; and if we’re told that there’s an alarm, we’d con­clude it was less likely that there was a re­ces­sion, since the re­ces­sion had been ex­plained away.

(This may seem a bit clearer by con­sid­er­ing the sce­nario B->A<-E, where bur­glars and earth­quakes both cause alarms. If we’re told the value of the bot­tom node, that there was an alarm, the prob­a­bil­ity of there be­ing a bur­glar is not in­de­pen­dent of whether we’re told there was an earth­quake—the two top nodes are not con­di­tion­ally in­de­pen­dent once we con­di­tion on the bot­tom node.)
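Here is a minimal sketch that checks the distinguishing prediction numerically against the Recession/Burglar/Alarm table above (the helper names are my own): Alarm and Recession are correlated on their own, but conditioning on Burglar screens Recession off from Alarm, which is exactly what the first model requires:

```python
# Joint table over (alarm, burglar, recession), copied from above.
joint = {
    (True,  True,  True):  .012,   (False, True,  True):  .0013,
    (True,  False, True):  .00287, (False, False, True):  .317,
    (True,  True,  False): .003,   (False, True,  False): .000333,
    (True,  False, False): .00591, (False, False, False): .654,
}

def cond(target, given):
    """p(target | given); both are predicates over (alarm, burglar, recession)."""
    num = sum(v for k, v in joint.items() if target(*k) and given(*k))
    den = sum(v for k, v in joint.items() if given(*k))
    return num / den

# Alarm and Recession are correlated when Burglar is unknown...
print(cond(lambda a, b, r: a, lambda a, b, r: r))            # ≈ 0.045
print(cond(lambda a, b, r: a, lambda a, b, r: not r))        # ≈ 0.013

# ...but knowing Burglar screens Recession off: p(a|br) = p(a|b¬r).
print(cond(lambda a, b, r: a, lambda a, b, r: b and r))      # ≈ 0.90
print(cond(lambda a, b, r: a, lambda a, b, r: b and not r))  # ≈ 0.90
```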

On the other hand, we can’t tell the differ­ence be­tween:

*[Three causal diagrams over these variables: Recession → Burglar → Alarm, vs. Alarm → Burglar → Recession, vs. Recession ← Burglar → Alarm.]*

us­ing only this data and no other vari­ables, be­cause all three causal struc­tures pre­dict the same pat­tern of con­di­tional de­pen­dence and in­de­pen­dence—three vari­ables which all ap­pear mu­tu­ally cor­re­lated, but Alarm and Re­ces­sion be­come in­de­pen­dent once you con­di­tion on Bur­glar.

Be­ing able to read off pat­terns of con­di­tional de­pen­dence and in­de­pen­dence is an art known as “D-sep­a­ra­tion”, and if you’re good at it you can glance at a di­a­gram like this...

...and see that, once we already know the Sea­son, whether the Sprin­kler is on and whether it is Rain­ing are con­di­tion­ally in­de­pen­dent of each other—if we’re told that it’s Rain­ing we con­clude noth­ing about whether or not the Sprin­kler is on. But if we then fur­ther ob­serve that the side­walk is Slip­pery, then Sprin­kler and Rain be­come con­di­tion­ally de­pen­dent once more, be­cause if the Side­walk is Slip­pery then it is prob­a­bly Wet and this can be ex­plained by ei­ther the Sprin­kler or the Rain but prob­a­bly not both, i.e. if we’re told that it’s Rain­ing we con­clude that it’s less likely that the Sprin­kler was on.

Okay, back to the obe­sity-ex­er­cise-In­ter­net ex­am­ple. You may re­call that we had the fol­low­ing ob­served fre­quen­cies:

| Overweight | Exercise | Internet | # |
|:---:|:---:|:---:|---:|
| Y | Y | Y | 1,119 |
| Y | Y | N | 16,104 |
| Y | N | Y | 11,121 |
| Y | N | N | 60,032 |
| N | Y | Y | 18,102 |
| N | Y | N | 132,111 |
| N | N | Y | 29,120 |
| N | N | N | 155,033 |

Do you see where this is go­ing?

“Er,” you re­ply, “Maybe if I had a calcu­la­tor and ten min­utes… you want to just go ahead and spell it out?”

Sure! First, we marginal­ize over the ‘ex­er­cise’ vari­able to get the table for just weight and In­ter­net use. We do this by tak­ing the 1,119 peo­ple who are YYY, over­weight and Red­dit users and ex­er­cis­ing, and the 11,121 peo­ple who are over­weight and non-ex­er­cis­ing and Red­dit users, YNY, and adding them to­gether to get 12,240 to­tal peo­ple who are over­weight Red­dit users:

| Overweight | Internet | # |
|:---:|:---:|---:|
| Y | Y | 12,240 |
| Y | N | 76,136 |
| N | Y | 47,222 |
| N | N | 287,144 |

“And then?”

Well, that sug­gests that the prob­a­bil­ity of us­ing Red­dit, given that your weight is nor­mal, is the same as the prob­a­bil­ity that you use Red­dit, given that you’re over­weight. 47,222 out of 334,366 nor­mal-weight peo­ple use Red­dit, and 12,240 out of 88,376 over­weight peo­ple use Red­dit. That’s about 14% ei­ther way.
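Spelled out as a minimal sketch over the made-up counts (the helper names are my own):

```python
# Survey counts keyed by (overweight, exercised, internet); made-up data.
counts = {
    (True,  True,  True):  1119,   (True,  True,  False): 16104,
    (True,  False, True):  11121,  (True,  False, False): 60032,
    (False, True,  True):  18102,  (False, True,  False): 132111,
    (False, False, True):  29120,  (False, False, False): 155033,
}

def n(pred):
    """Count the survey respondents satisfying `pred`."""
    return sum(v for k, v in counts.items() if pred(*k))

# Marginalizing out Exercise: p(Internet | Overweight) vs. p(Internet | not).
for overweight in (True, False):
    frac = (n(lambda o, x, i: o == overweight and i)
            / n(lambda o, x, i: o == overweight))
    print(overweight, round(frac, 3))
# True 0.138, False 0.141 — about 14% either way, as claimed.
```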

“And so we con­clude?”

Well, first we con­clude it’s not par­tic­u­larly likely that us­ing Red­dit causes weight gain, or that be­ing over­weight causes peo­ple to use Red­dit:

If ei­ther of those causal links ex­isted, those two vari­ables should be cor­re­lated. We shouldn’t find the lack of cor­re­la­tion or con­di­tional in­de­pen­dence that we just dis­cov­ered.

Next, imag­ine that the real causal graph looked like this:

In this graph, ex­er­cis­ing causes you to be less likely to be over­weight (due to the virtue the­ory of metabolism), and ex­er­cis­ing causes you to spend less time on the In­ter­net (be­cause you have less time for it).

But in this case we should not see that the groups who are/​aren’t over­weight have the same prob­a­bil­ity of spend­ing time on Red­dit. There should be an out­sized group of peo­ple who are both nor­mal-weight and non-Red­di­tors (be­cause they ex­er­cise), and an out­sized group of non-ex­er­cisers who are over­weight and Red­dit-us­ing.

So that causal graph is also ruled out by the data, as are oth­ers like:

Leav­ing only this causal graph:

Which says that weight and In­ter­net use ex­ert causal effects on ex­er­cise, but ex­er­cise doesn’t causally af­fect ei­ther.

All this dis­ci­pline was in­vented and sys­tem­atized by Judea Pearl, Peter Spirtes, Thomas Verma, and a num­ber of other peo­ple in the 1980s and you should be quite im­pressed by their ac­com­plish­ment, be­cause be­fore then, in­fer­ring causal­ity from cor­re­la­tion was thought to be a fun­da­men­tally un­solv­able prob­lem. The stan­dard vol­ume on causal struc­ture is Causal­ity by Judea Pearl.

Causal models (with specific probabilities attached) are sometimes known as "Bayesian networks" or "Bayes nets", since they were invented by Bayesians and make use of Bayes's Theorem. They have all sorts of neat computational advantages which are far beyond the scope of this introduction—e.g., in many cases you can split up a Bayesian network into parts, put each of the parts on its own computer processor, and then update on three different pieces of evidence at once using a neatly local message-passing algorithm in which each node talks only to its immediate neighbors; when all the updates are finished propagating, the whole network has settled into the correct state. For more on this, see Judea Pearl's Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, which is the original book on Bayes nets and still the best introduction I've personally happened to read.

[1] Some­what to my own shame, I must ad­mit to ig­nor­ing my own ob­ser­va­tions in this de­part­ment—even af­ter I saw no dis­cernible effect on my weight or my mus­cu­la­ture from aer­o­bic ex­er­cise and strength train­ing 2 hours a day 3 times a week, I didn’t re­ally start be­liev­ing that the virtue the­ory of metabolism was wrong [2] un­til af­ter other peo­ple had started the skep­ti­cal dog­pile.

[2] I should mention, though, that I have confirmed a personal effect where eating enough cookies (at a convention where no protein is available) will cause weight gain afterward. There's no other discernible correlation between my carbs/protein/fat allocations and weight gain, just that eating sweets in large quantities can cause weight gain afterward. This admittedly does fit with the straight-out virtue theory of metabolism, i.e., eating pleasurable foods is sinful weakness and hence punished with fat.

[3] Or there might be some hid­den third fac­tor, a gene which causes both fat and non-ex­er­cise. By Oc­cam’s Ra­zor this is more com­pli­cated and its prob­a­bil­ity is pe­nal­ized ac­cord­ingly, but we can’t ac­tu­ally rule it out. It is ob­vi­ously im­pos­si­ble to do the con­verse ex­per­i­ment where half the sub­jects are ran­domly as­signed lower weights, since there’s no known in­ter­ven­tion which can cause weight loss.

Main­stream sta­tus: This is meant to be an in­tro­duc­tion to com­pletely bog-stan­dard Bayesian net­works, causal mod­els, and causal di­a­grams. Any de­par­tures from main­stream aca­demic views are er­rors and should be flagged ac­cord­ingly.

Part of the se­quence Highly Ad­vanced Episte­mol­ogy 101 for Beginners

Next post: "Stuff That Makes Stuff Happen"

Previous post: "The Fabric of Real Things"

• Hi Eliezer,

The fa­mous statis­ti­cian Fischer, who was also a smoker, tes­tified be­fore Congress that the cor­re­la­tion be­tween smok­ing and lung can­cer couldn’t prove that the former caused the lat­ter.

Fisher was speci­fi­cally wor­ried about hid­den com­mon causes. Fisher was also the one who brought the con­cept of a ran­dom­ized ex­per­i­ment into statis­tics. Fisher was “cur­mud­geony,” but it is not quite fair to use him as an ex­em­plar of the “keep causal­ity out of our statis­tics” camp.

Causal mod­els (with spe­cific prob­a­bil­ities at­tached) are some­times known as “Bayesian net­works” or “Bayes nets”, since they were in­vented by Bayesi­ans and make use of Bayes’s The­o­rem.

Graph­i­cal causal mod­els and Bayesian net­works are not the same thing (this is a com­mon mis­con­cep­tion). A dis­tri­bu­tion that fac­tor­izes ac­cord­ing to a DAG is a Bayesian net­work (this is just a prop­erty of a dis­tri­bu­tion—noth­ing about causal­ity). You can fur­ther say that a graph­i­cal model is causal if an ad­di­tional set of prop­er­ties holds. For ex­am­ple, you can (loosely) say that in a causal model all par­ents are “di­rect causes.” If you want to say that for­mally, you would talk about the trun­cated fac­tor­iza­tion and do(.). Without in­ter­ven­tions there is no in­ter­ven­tion­ist causal model.

All this dis­ci­pline was in­vented and sys­tem­atized by Judea Pearl, Peter Spirtes, Thomas Verma, and a num­ber of other peo­ple in the 1980s and you should be quite im­pressed by their ac­com­plish­ment, be­cause be­fore then, in­fer­ring causal­ity from cor­re­la­tion was thought to be a fun­da­men­tally un­solv­able prob­lem.

I of­ten find my­self in a weird po­si­tion of hav­ing to point folks to peo­ple other than Pearl. I think it’s com­mend­able that you looked into other peo­ple in the field. The big early names in causal­ity that you did not men­tion are Haavelmo (1950s) and Se­wall Wright (this guy is amaz­ing—he figured so many things out cor­rectly in the 1920s). Spe­cial cases of non-causal graph­i­cal mod­els (Is­ing mod­els, Hid­den Markov mod­els, etc.), along with fac­tor­iza­tions and prop­a­ga­tion al­gorithms were known long be­fore Pearl in other com­mu­ni­ties.

P.S. Since I am in Cali.: if folks at SI are in­ter­ested in new de­vel­op­ments on the “learn­ing causal struc­ture from data” front, I could be bribed by the Cheese­board to come by and give a talk.

• Hi Eliezar,

(it’s Eliezer)

• Done, sorry.

• In fair­ness, a lot of peo­ple seem to pro­nounce it Eliezar for some rea­son.

• The statis­ti­ci­ans who dis­cov­ered causal­ity were try­ing to find a way to dis­t­in­guish, within sur­vey data, the di­rec­tion of cause and effect—whether, as com­mon sense would have it, more obese peo­ple ex­er­cise less be­cause they find phys­i­cal ac­tivity less re­ward­ing; or whether, as in the virtue the­ory of metabolism, lack of ex­er­cise ac­tu­ally causes weight gain due to di­v­ine pun­ish­ment for the sin of sloth.

I recom­mend that Eliezer edit this post to re­move this kind of provo­ca­tion. The na­ture of the ac­tual ra­tio­nal­ity mes­sage in this post is such that peo­ple are likely to link to it in the fu­ture (in­deed, I found it via an ex­ter­nal link my­self). It even seems like some­thing that may be in­tended to be part of a se­quence. As it stands I ex­pect many fu­ture refer­ences to be de­railed and also ex­pect to see this crop up promi­nently in lists of rea­sons to not take Eliezer’s blog posts se­ri­ously. And, frankly, this rea­son would be a heck of a lot bet­ter than most oth­ers that are usu­ally pro­vided by de­trac­tors.

• Maybe the “main­stream sta­tus” sec­tion should be placed at the top? It would sig­nal right at the top that this post is backed by proper au­thor­ity.

In ad­di­tion to the provo­ca­tion you men­tion, openly bash­ing main­stream philos­o­phy in the fourth para­graph doesn’t help. If you add a pos­si­ble rep­u­ta­tion of hold­ing un­sub­stan­ti­ated whacky be­liefs, well…

That said, I was quite sur­prised by the num­ber of com­ments about this is­sue. I for one didn’t see any prob­lem with this post.

When I read "divine punishment for the sin of sloth", I just smiled at the supernatural explanation, knowing that Eliezer of course knows the virtue theory of metabolism has a perfectly natural (and reasonable-sounding) explanation. Actually, it didn't even touch my model of his probability distribution over the veracity of the "virtue" theory—nor my own. After having read so much of his writings, I just can't believe he rules such a hypothesis out a priori. Remember reductionism. And my model of him definitely does not expect him to influence LessWrong readers with an unsubstantiated mockery.

Also, this:

And lo, merely by eye­bal­ling this data -

(which is to­tally made up, so don’t go ac­tu­ally be­liev­ing the con­clu­sion I’m about to draw)

made clear he wasn’t dis­cussing the ob­ject at all. It was then eas­ier for me to put my­self in a po­si­tion of to­tal un­cer­tainty re­gard­ing the causal model im­plied by this “data”. The same way my HPMOR an­ti­ci­pa­tions are no longer build on can­non —Bel­la­trix could re­ally be in­no­cent, for al I know.

But this is me as­sum­ing to­tal good faith from Eliezer. I to­tally for­got that many peo­ple in fact do not as­sume good faith.

• I mostly liked the post. In Pearl’s book, the ex­am­ple of whether smok­ing causes can­cer worked pretty well for me de­spite be­ing po­ten­tially con­tro­ver­sial, and was more en­gag­ing for be­ing on a con­tro­ver­sial topic. Part of that is he kept his ex­am­ple fairly cleanly hy­po­thet­i­cal. Eliezer’s “I didn’t re­ally start be­liev­ing that the virtue the­ory of metabolism was wrong” in a foot­note, and “as com­mon sense would have it” in the main text, both were sug­gest­ing it was about the real world. I think in Pearl’s ex­am­ple, he may have even made his hy­po­thet­i­cal data give the op­po­site re­sult to the real world.

This post I also thought was more en­gag­ing due to the con­tro­ver­sial topic, so if you can keep that while re­duc­ing the “mind-kil­ler poli­tics” po­ten­tial I’d en­courage that.

I was fine with the model he was falsify­ing be­ing sim­ple and eas­ily dis­proved—that’s great for an ex­am­ple.

I'm kind of confused and skeptical at the bit at the end: we've ruled out all the models except one. From Pearl's book I'd somehow picked up that we need to make some causal assumption; statistical data wasn't enough to get all the way from ignorance to knowing the causal model.

Is as­sum­ing “cau­sa­tion would im­ply cor­re­la­tion” and “the model will have only these three vari­ables” enough in this case?

• I think in Pearl’s ex­am­ple, he may have even made his hy­po­thet­i­cal data give the op­po­site re­sult to the real world.

He in­tro­duces a “hy­po­thet­i­cal data set,” works through the math, then fol­lows the con­clu­sion that tar de­posits pro­tect against can­cer with this para­graph:

The data in Table 3.1 are ob­vi­ously un­re­al­is­tic and were de­liber­ately crafted so as to sup­port the geno­type the­ory. How­ever, the pur­pose of this ex­er­cise was to demon­strate how rea­son­able qual­i­ta­tive as­sump­tions about the work­ings of mechanisms, cou­pled with non­ex­per­i­men­tal data, can pro­duce pre­cise quan­ti­ta­tive as­sess­ments of causal effects. In re­al­ity, we would ex­pect ob­ser­va­tional stud­ies in­volv­ing me­di­at­ing vari­ables to re­fute the geno­type the­ory by show­ing, for ex­am­ple, that the me­di­at­ing con­se­quences of smok­ing (such as tar de­posits) tend to in­crease, not de­crease, the risk of can­cer in smok­ers and non­smok­ers al­ike. The es­ti­mand of (3.29) could then be used for quan­tify­ing the causal effect of smok­ing on can­cer.

When I read it, I re­mem­ber be­ing mildly both­ered by the ex­am­ple (why not have a clearly fic­tional ex­am­ple to match clearly fic­tional data, or find an ac­tual study and use the real data as an ex­am­ple?) but mostly mol­lified by his ex­tended dis­claimer.

(I feel like point­ing out, as an­other ex­am­ple, the de­ci­sion anal­y­sis class that I took, which had a cen­tral ex­am­ple which was re­peated and ex­tended through­out the semester. The pro­fes­sor was an ac­tive con­sul­tant, and could have drawn on a wealth of ex­am­ples in, say, petroleum ex­plo­ra­tion. But the ex­am­ple was a girl choos­ing a lo­ca­tion for a party, sub­ject to un­cer­tain weather. Why that? Be­cause it was ob­vi­ously a toy ex­am­ple. If they tried to use a petroleum ex­am­ple for petroleum en­g­ineers, the petroleum en­g­ineers would be rightly sus­pi­cious of any sim­plified model put in front of them- “you mean this pro­ce­dure only takes into ac­count two things!?”- and any ac­cu­rate model would be far too com­pli­cated to teach the method­ol­ogy. An ob­vi­ously toy ex­am­ple taught the pro­cess, and then once they un­der­stood the pro­cess, they were will­ing to ap­ply it to more com­pli­cated situ­a­tions- which, of course, needed much more com­pli­cated mod­els.)

• There may also be the as­sump­tion that the graph is acyclic.

Some causal mod­els, while not flat out falsified by the data, are ren­dered less prob­a­ble by the fact the data hap­pens to fit more pre­cise (less con­nected) causal graphs. A fully con­nected graph is im­pos­si­ble to falsify, for in­stance (it can ex­plain any data).

Among all graphs that ex­plain the fic­tional data here, there is only one that has only two edges. That’s the most prob­a­ble one.

• I strongly agree.

I read that para­graph and no­ticed that I was con­fused. Be­cause I was go­ing through this post to ac­quire a more-than-cur­sory tech­ni­cal in­tu­ition, I was mak­ing a point to fol­lowup on and re­solve any points of con­fu­sion.

There's enough technical detail to carefully parse, without adding extra pieces that don't make sense on first reading. I'd prefer to be able to spend my careful thinking on the math.

• As was writ­ten in this sem­i­nal post:

In Ar­tifi­cial In­tel­li­gence, and par­tic­u­larly in the do­main of non­mono­tonic rea­son­ing, there’s a stan­dard prob­lem: “All Quak­ers are paci­fists. All Repub­li­cans are not paci­fists. Nixon is a Quaker and a Repub­li­can. Is Nixon a paci­fist?”
What on Earth was the point of choos­ing this as an ex­am­ple? To rouse the poli­ti­cal emo­tions of the read­ers and dis­tract them from the main ques­tion? To make Repub­li­cans feel un­wel­come in courses on Ar­tifi­cial In­tel­li­gence and dis­cour­age them from en­ter­ing the field? (And no, be­fore any­one asks, I am not a Repub­li­can. Or a Demo­crat.)
Why would any­one pick such a dis­tract­ing ex­am­ple to illus­trate non­mono­tonic rea­son­ing? Prob­a­bly be­cause the au­thor just couldn’t re­sist get­ting in a good, solid dig at those hated Greens. It feels so good to get in a hearty punch, y’know, it’s like try­ing to re­sist a choco­late cookie.
As with choco­late cook­ies, not ev­ery­thing that feels plea­surable is good for you. And it cer­tainly isn’t good for our hap­less read­ers who have to read through all the an­gry com­ments your blog post in­spired.

It’s not quite the same prob­lem, but it has some of the same con­se­quences.

• But not quite as dam­ag­ing as ra­tio­nal­ist case study: the ideal child-bear­ing age turns out to be 13 years old (ad­vances in mod­ern medicine, you know).

• Ideal for what, ex­actly? Churn­ing out the most ba­bies in the short­est amount of time? Hav­ing a happy and well-ad­justed pop­u­lace? Hav­ing a long life?

Ideal is a very loaded word and us­ing it im­plies that there’s an ob­vi­ous util­ity func­tion, when there of­ten isn’t.

• In any case, 13 is too young in many cases. I was be­ing face­tious.

• That’s a clear out­line of the the­ory. I just want to note that the the­ory it­self makes some as­sump­tions about pos­si­ble pat­terns of cau­sa­tion, even be­fore you be­gin to se­lect which causal graphs are plau­si­ble can­di­dates for test­ing. Pearl him­self stresses that with­out putting causal in­for­ma­tion in, you can’t get causal in­for­ma­tion out from purely ob­ser­va­tional data.

For ex­am­ple, if over­weight causes lack of ex­er­cise and lack of ex­er­cise causes over­weight, you don’t have an acyclic graph. Acyclic­ity of cau­sa­tion is one of the back­ground as­sump­tions here. Acyclic­ity of cau­sa­tion is rea­son­able when talk­ing about point events in a uni­verse with­out time-like loops. How­ever, “weight” and “ex­er­cise level” are tem­po­rally ex­tended pro­cesses, which makes acyclic­ity a strong as­sump­tion.

• Pearl him­self stresses that with­out putting causal in­for­ma­tion in, you can’t get causal in­for­ma­tion out from purely ob­ser­va­tional data.

Koan: How, then, does the pro­cess of at­tribut­ing cau­sa­tion get started?

• My an­swer:

First, notice a situation that occurs many times. Then pay attention to the ways in which things are different from one iteration to the next. At this point, and here is where causal information begins, if some of the variables represent your own behavior, you can systematically intervene in the situation by changing those behaviors. For cleanest results, contrive a controlled experiment that is analogous to the original situation.

In short, you in­sert causal in­for­ma­tion by in­ter­ven­ing.

This of course re­quires you to con­struct a refer­ence class of situ­a­tions that are sub­stan­tially similar to one an­other, but hu­mans seem to be pretty good at that within our do­mains of fa­mil­iar­ity.

By the way, thank you for ex­plain­ing the un­der­ly­ing as­sump­tion of acyclic­ity. I’ve been try­ing to in­ter­nal­ize the math of causal calcu­lus and it bugged me that cyclic causes weren’t al­lowed. Now I un­der­stand that it is a sim­plifi­ca­tion and that the calcu­lus isn’t quite as pow­er­ful as I thought.

• I don’t have an an­swer to my own koan, but this was one of the pos­si­bil­ities that I thought of:

In short, you in­sert causal in­for­ma­tion by in­ter­ven­ing.

But how does one in­ter­vene? By caus­ing some vari­able to take some value, while ob­struct­ing the other causal in­fluences on it. So causal knowl­edge is already re­quired be­fore one can in­ter­vene. This is not a triv­ial point—if the knowl­edge is mis­taken, the in­ter­ven­tion may not be suc­cess­ful, as I pointed out with the ex­am­ple of try­ing to warm a room ther­mo­stat by plac­ing a can­dle near it.

• Causal knowl­edge is re­quired to en­sure suc­cess, but not to stum­ble across it. Over time, notic­ing (or stum­bling across if you pre­fer) re­la­tion­ships be­tween the suc­cesses stum­bled upon can quickly co­a­lesce into a model of how to in­ter­vene. Isn’t this es­sen­tially how we be­lieve causal rea­son­ing origi­nated? In a sense, all DNA is in­for­ma­tion about how to in­ter­vene that, once stum­bled across, per­sisted due to its effi­cacy.

• I think that one bootstraps the process with contrived situations designed to appeal to one's intuitions. For example, one attempts to obtain causal information through a randomised controlled trial. You mark the obverse face of a coin "treatment" and the reverse face "control" and toss the coin to "randomly" assign your patients.

Let us briefly con­sider the ab­solute zero of no a pri­ori knowl­edge at all. Per­haps the coin knows the prog­no­sis of the pa­tient and comes down “treat­ment” for pa­tients with a good prog­no­sis, in­tend­ing to mis­lead you into be­liev­ing that the treat­ment is the cause of good out­comes. Maybe, maybe not. Let’s stop con­sid­er­ing this be­cause in­san­ity is stalk­ing us.

We are willing to take a stand. We know enough, a priori, to choose and operate a randomisation device and thus obtain a variable which is independent of all the others and causally connected to none of them. We don't prove this, we assume it. When we encounter a compulsive gambler, who believes in Lady Luck who is fickle and very likely is actually messing with us via the coin, we just dismiss his hypothesis. Life is short, one has to assume that certain obvious things are actually true in order to get started, and work up from there.

• My an­swer: At­tribut­ing cau­sa­tion is part of our hu­man in­stincts. We are born with some de­sire to do it. We may de­velop that skill by re­flect­ing on it dur­ing our life­time.

(How did we hu­mans de­velop that in­stinct? Evolu­tion, prob­a­bly. Hu­mans who had mu­tated to rea­son about causal­ity died less – for in­stance, they might have avoided drink­ing from a body of wa­ter af­ter see­ing some­thing poi­sonous put in, be­cause they rea­soned that the poi­son ad­di­tion would cause the wa­ter to be poi­sonous.)

• This is a non-ex­pla­na­tion, or rather, three non-ex­pla­na­tions.

“Hu­man na­ture does it” ex­plains no more than “God does it”.

“It’s part of hu­man na­ture be­cause it must have been adap­tive in the past” like­wise. Causal rea­son­ing works, but why does it work?

And “mu­tated to rea­son about causal­ity” is just say­ing “genes did it”, which is still not an ad­vance on “God did it”.

• There isn’t any bet­ter ex­pla­na­tion. If you don’t ac­cept the idea of causal­ity as given, you can never ex­plain any­thing. Ro­ryokane is us­ing causal­ity to ex­plain how causal­ity origi­nated, and that’s not a good way to go about prov­ing the way causal­ity works or any­thing but it is a good way of un­der­stand­ing why causal­ity ex­ists, or rather just ac­cept­ing that we can never prove causal­ity ex­ists.

Our instincts are just wired to interpret causality that way, and that makes it a brute fact. You might as well claim that calling a certain color yellow and then saying it looks yellow as a result of human nature is a non-explanation; you might be technically right to do so, but in that case you're asking for answers you're never actually going to get.

• You might as well claim that calling a certain color yellow and then saying it looks yellow as a result of human nature is a non-explanation; you might be technically right to do so, but in that case you're asking for answers you're never actually going to get.

That would be a non-ex­pla­na­tion, but a bet­ter ex­pla­na­tion is in fact pos­si­ble. You can look at the way that light is turned into neu­ral sig­nals by the eye, and dis­cover the ex­is­tence of red-green, blue-yel­low, and light-dark axes, and there you have phys­iolog­i­cal jus­tifi­ca­tion for six of our ba­sic colour words. (I don’t know just how set­tled that story is, but it’s set­tled enough to be liter­ally text­book stuff.)

So, that is what a real ex­pla­na­tion looks like. At­tribut­ing any­thing to “hu­man na­ture” is even more wrong than at­tribut­ing it to “God”. At least we have some idea of what “God” would be if he ex­isted, but “hu­man na­ture” is a blank, a la­bel pa­per­ing over a void. How do Se­bas­tian Thrun’s cars drive them­selves? Be­cause he has in­te­grated self-driv­ing into their na­ture. How does opium pro­duce sleep? By its dor­mi­tive na­ture. How do hu­mans dis­t­in­guish colours? By their hu­man na­ture.

• But causality is uniquely impervious to those kinds of explanations. You can explain why humans believe in causality in a physiological sense, but I didn't think that is what you were asking for. I thought you were asking for some overall metaphysical justification for causality, and there really isn't any. Causal reasoning works because it works; there's no other justification to be had for it.

• How­ever, “weight” and “ex­er­cise level” are tem­po­rally ex­tended pro­cesses, which makes acyclic­ity a strong as­sump­tion.

This is a hugely im­por­tant point in prac­ti­cal, ev­ery­day rea­son­ing about causal­ity. Feed­back loops abound.

• Pearl him­self stresses that with­out putting causal in­for­ma­tion in, you can’t get causal in­for­ma­tion out from purely ob­ser­va­tional data.

Where do you get this? My re­call of Causal­ity is that he speci­fi­cally re­jected the “no causes in, no causes out” view in fa­vor of the “Oc­cam’s Ra­zor in, some causes out” view.

• Yes, the Oc­camian view is in his book in sec­tion 2.3 (and still in the 2009 2nd edi­tion). But that defi­ni­tion of “in­ferred cau­sa­tion”—those ar­rows com­mon to all causal mod­els con­sis­tent with the statis­ti­cal data—de­pends on gen­eral causal as­sump­tions, the usual ones be­ing the DAG, Markov, and Faith­ful­ness prop­er­ties.

In other places, for ex­am­ple: “Causal in­fer­ence in statis­tics: An overview”, which is in effect the Cliff Notes ver­sion of his book, he writes:

“one can­not sub­stan­ti­ate causal claims from as­so­ci­a­tions alone, even at the pop­u­la­tion level—be­hind ev­ery causal con­clu­sion there must lie some causal as­sump­tion that is not testable in ob­ser­va­tional stud­ies.”

Here is a similar sur­vey ar­ti­cle from 2003, in which he writes that ex­act sen­tence, fol­lowed by:

“Nancy Cartwright (1989) ex­pressed this prin­ci­ple as “no causes in, no causes out”, mean­ing that we can­not con­vert statis­ti­cal knowl­edge into causal knowl­edge.”

Every­where, he defines cau­sa­tion in terms of coun­ter­fac­tu­als: claims about what would have hap­pened had some­thing been differ­ent, which, he says, can­not be ex­pressed in terms of statis­ti­cal dis­tri­bu­tions over ob­ser­va­tional data. Here is a long in­ter­view (both au­dio and text tran­script) in which he re­counts the whole course of his work.

• In other places, for ex­am­ple: “Causal in­fer­ence in statis­tics: An overview”, which is in effect the Cliff Notes ver­sion of his book, he writes:

“one can­not sub­stan­ti­ate causal claims from as­so­ci­a­tions alone, even at the pop­u­la­tion level—be­hind ev­ery causal con­clu­sion there must lie some causal as­sump­tion that is not testable in ob­ser­va­tional stud­ies.”

Here is a similar sur­vey ar­ti­cle from 2003, in which he writes that ex­act sen­tence, fol­lowed by:

“Nancy Cartwright (1989) ex­pressed this prin­ci­ple as “no causes in, no causes out”, mean­ing that we can­not con­vert statis­ti­cal knowl­edge into causal knowl­edge.”

In­ter­est­ing, but how do those files evade word searches for the parts you’ve quoted?

• In­ter­est­ing, but how do those files evade word searches for the parts you’ve quoted?

Dunno, not all PDFs are search­able and not all PDF view­ers fail to make a pig’s ear of search­ing. The quotes can be found on p.99 (the third page of the file) and pp.284-285 (6th-7th pages of the file) re­spec­tively.

• Btw, Scott Aaron­son just re­cently posted the ques­tion of whether you would care about causal­ity if you could only work with ob­ser­va­tional data (some­one already linked this ar­ti­cle in the com­ments) and I put up a com­ment with my sum­mary of the LW po­si­tion (plus some com­plex­ity-the­o­retic con­sid­er­a­tions).

• I don’t think that Bayesian net­works im­plic­itly con­tain the con­cept of causal­ity.

For­mally, a prob­a­bil­ity dis­tri­bu­tion is rep­re­sented by a Bayesian net­work if it can be fac­tored as a product of P(node | node’s par­ents). But this is not unique, given one net­work you can cre­ate lots of other net­works which also rep­re­sent the same dis­tri­bu­tion by e.g. chang­ing the di­rec­tion of ar­rows as long as the in­de­pen­dence prop­er­ties from the graph stay the same (e.g. the graphs A → B → C and A ← B ← C can rep­re­sent ex­actly the same class of prob­a­bil­ity dis­tri­bu­tions). Pearl dis­t­in­guishes Baysian net­works from causal net­works, which are Bayesian net­works in which the ar­rows point in the di­rec­tion of causal­ity.

And of course, there are other sparse rep­re­sen­ta­tions like Markov net­works, which also in­cor­po­rates in­de­pen­dence as­sump­tions but are undi­rected.

• The non-uniqueness doesn't make causality absent or irrelevant; it just means there are multiple minimal representations that use causality. The causality arises in how your node connections are asymmetric. If the relativity of simultaneity (observers seeing the same events in a different time order) doesn't obviate causality, neither should the existence of multiple causal networks.

There are in­deed equiv­a­lent mod­els that use purely sym­met­ric node con­nec­tions (or none at all in the case of the su­per­ex­po­nen­tial pair wise con­di­tional in­de­pen­dence table across all vari­ables), but (cor­rect me if I’m wrong) by throw­ing away the in­for­ma­tion graph­i­cally rep­re­sented by the ar­rows, you no longer have a max­i­mally effi­cient en­cod­ing of the joint prob­a­bil­ity dis­tri­bu­tion (even though it’s cer­tainly not as bad as the su­per­ex­po­nen­tial table).

• I guess there are two points here.

First, au­thors like Pearl do not use “causal­ity” to mean just that there is a di­rected edge in a Bayesian net­work (i.e. that cer­tain con­di­tional in­de­pen­dence prop­er­ties hold). Rather, he uses it to mean that the model de­scribes what hap­pens un­der in­ter­ven­tions. One can see the differ­ence by com­par­ing Rain → WetGrass with WetGrass → Rain (which are equiv­a­lent as Bayesian net­works). Of course, maybe he is con­fused and the differ­ence will dis­solve un­der more care­ful con­sid­er­a­tion, but I think this shows one should be care­ful in claiming that Bayes net­works en­code our best un­der­stand­ing of causal­ity.

Se­cond, do we need Bayesian net­works to eco­nom­i­cally rep­re­sent dis­tri­bu­tions? This is slightly sub­tle.

We do not need the di­rected ar­rows when rep­re­sent­ing a par­tic­u­lar dis­tri­bu­tion. For ex­am­ple, sup­pose a dis­tri­bu­tion P(A,B,C) is rep­re­sented by the Bayesian net­work A → B ← C. Ex­pand­ing the defi­ni­tion, this means that the joint dis­tri­bu­tion can be fac­tored as

P(A=a,B=b,C=c) = P1(A=a) P2(B=b|A=a,C=c) P3(C=c)

where P1 and P3 are the marginal distributions of A and C, and P2 is the conditional distribution of B. So the data we needed to specify P were two one-column tables specifying P1 and P3, and a three-column table specifying P2(b|a,c) for all values of a,b,c. But now note that we do not gain very much by knowing that these are probability distributions. To save space it is enough to note that P factors as

P(A=a,B=b,C=c) = F1(a) F2(b,a,c) F3(c)

for some real-valued functions F1, F2, and F3. In other words, that P is represented by a Markov network A - B - C. The directions on the edges were not essential.

And in­deed, typ­i­cal al­gorithms for in­fer­ence given a prob­a­bil­ity dis­tri­bu­tion, such as be­lief prop­a­ga­tion, do not make use of the Bayesian struc­ture. They work equally well for di­rected and undi­rected graphs.

Rather, the point of Bayesian ver­sus Markov net­works is that the class of prob­a­bil­ity dis­tri­bu­tions that can be rep­re­sented by them are differ­ent. So they are use­ful when we try to learn a prob­a­bil­ity dis­tri­bu­tion, and want to cut down the search space by con­strain­ing the dis­tri­bu­tion by some in­de­pen­dence re­la­tions that we know a pri­ori.

Bayesian networks are popular because they let us write down many independence assumptions that we know hold for practical problems. However, we then have to ask how we know those particular independence relations hold. And that's because they correspond to causal relations! The reason Bayesian networks are popular with human researchers is that they correspond well with the notion of causality that humans use. We don't know that the Armchairians would also find them useful.

• To save space it is enough to note that P fac­tors as

P(A=a,B=b,C=c) = F1(a) F2(b,a,c) F3(c)

for some real-valued functions F1, F2, and F3. In other words, that P is represented by a Markov network A — B — C. The directions on the edges were not essential.

Can’t the di­rec­tions be re­cov­ered au­to­mat­i­cally from that ex­pres­sion, though? That is, dis­card­ing the di­rec­tions from the no­ta­tion of con­di­tional prob­a­bil­ities doesn’t ac­tu­ally dis­card them.

The re­con­struc­tion al­gorithm would la­bel ev­ery func­tion ar­gu­ment as “pri­mary” or “sec­ondary”, be­gin with no ar­gu­ments la­bel­led, and re­peat­edly do this:

For ev­ery func­tion with no pri­mary vari­able and ex­actly one un­la­bel­led vari­able, la­bel that vari­able as pri­mary and all of its oc­cur­rences as ar­gu­ments to other func­tions as sec­ondary.

When all arguments are labelled, make a graph of the variables with an arrow from X to Y whenever X and Y occur as arguments to the same function, X as secondary and Y as primary. If the functions F1, F2, etc. originally came from a Bayesian network, won’t this recover that precise network?

If the origi­nal graph was A ← B → C, the ex­pres­sion would have been F1(a,b) F2(b) F3(c,b).
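Here is a minimal Python sketch of the labelling procedure proposed above (my own illustrative code, not anything from the article); each factor is represented simply by the tuple of variables it takes:

```python
def recover_arrows(factors):
    # factors: one tuple of variable names per factor,
    # e.g. [("a",), ("b", "a", "c"), ("c",)] for F1(a) F2(b,a,c) F3(c).
    labels = [{v: None for v in f} for f in factors]
    progress = True
    while progress:
        progress = False
        for i, f in enumerate(factors):
            if "primary" in labels[i].values():
                continue
            unlabelled = [v for v in f if labels[i][v] is None]
            if len(unlabelled) == 1:
                v = unlabelled[0]
                labels[i][v] = "primary"
                # Every occurrence of v as an argument elsewhere becomes secondary.
                for j in range(len(factors)):
                    if j != i and v in labels[j] and labels[j][v] is None:
                        labels[j][v] = "secondary"
                progress = True
    edges = set()
    for i, f in enumerate(factors):
        for y in (v for v in f if labels[i][v] == "primary"):
            for x in (v for v in f if labels[i][v] == "secondary"):
                edges.add((x, y))
    return edges

print(recover_arrows([("a",), ("b", "a", "c"), ("c",)]))  # {('a','b'), ('c','b')}: A -> B <- C
print(recover_arrows([("a", "b"), ("b",), ("c", "b")]))   # {('b','a'), ('b','c')}: A <- B -> C
```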

• If the functions F1, F2, etc. originally came from a Bayesian network, won’t this recover that precise network?

I think this is right: if you know that the factors were learned by fitting them to a Bayesian network, you can recover what that network must have been. And you can go even further: if you only have a joint distribution, you can use the techniques of the original article to see which Bayesian networks could be consistent with it.

But there is a sep­a­rate ques­tion about why we are in­ter­ested in Bayesian net­works in the first place. SilasBarta seemed to claim that you are nat­u­rally led to them if you are in­ter­ested in rep­re­sent­ing prob­a­bil­ity dis­tri­bu­tions effi­ciently. But for that pur­pose (I claim), you only need the idea of fac­tors, not the di­rected graph struc­ture. E.g. a prob­a­bil­ity dis­tri­bu­tion which fits the (equiv­a­lent) Bayesian net­works A → B → C or A ← B ← C or A ← B → C can be effi­ciently rep­re­sented as F1(a,b) F2(b,c). You would not think of rep­re­sent­ing it as F1(a) F2(a,b) F3(b,c) un­less you were already in­ter­ested in causal­ity.

• In other words, that P is represented by a Markov network A — B — C. The directions on the edges were not essential.

On the con­trary, they are im­por­tant and store in­for­ma­tion about the re­la­tion­ships that saves you time and space. Like I said in my linked com­ment, the di­rec­tion of the ar­rows be­tween A,C and B tell you whether con­di­tion­ing on B (per­haps by sep­a­rat­ing it out into buck­ets of var­i­ous val­ues) cre­ates or de­stroys mu­tual in­for­ma­tion be­tween A and C. That saves you from hav­ing to ex­plic­itly write out all the com­bi­na­tions of con­di­tional (in)de­pen­dence.

• In other words, that P is represented by a Markov network A — B — C. The directions on the edges were not essential.

Oops, on sec­ond thought the fac­tor­iza­tion is equiv­a­lent to the com­plete tri­an­gle, not a line. But this doesn’t change the point that the space re­quire­ments are de­ter­mined by the fac­tors, not the graph struc­ture, so the two rep­re­sen­ta­tions will use the same amount of space.

On the con­trary, they are im­por­tant and store in­for­ma­tion about the re­la­tion­ships that saves you time and space.

All in­de­pen­dence re­la­tions are im­plicit in the dis­tri­bu­tion it­self, so the graph can only save you time, not space.

It is true that knowing a minimal Bayesian network or a minimal Markov network for a distribution lets you read off certain independence assumptions quickly. But it doesn’t save you from having to write out all the combinations. There are exponentially many possible conditional independences, each of which may hold or not, so no sub-exponential representation can encode all of them. And indeed, there are some kinds of independence assumptions that can be expressed as Bayesian networks but not Markov networks, and vice versa. Even in everyday machine learning, it is not the case that a Bayesian network is always the best representation.

You also do not mo­ti­vate why some­one would be in­ter­ested in a big list of con­di­tional in­de­pen­den­cies for its own sake. Surely, what we ul­ti­mately want to know is e.g. the prob­a­bil­ity that it will rain to­mor­row, not whether or not rain is cor­re­lated with sprin­klers.

• But it doesn’t save you from hav­ing to write out all the com­bi­na­tions.

It saves you from hav­ing to write them un­til needed, in which case they can be ex­tracted by walk­ing through the graph rather than do­ing a lookup on a su­per­ex­po­nen­tial table.

You also do not mo­ti­vate why some­one would be in­ter­ested in a big list of con­di­tional in­de­pen­den­cies for its own sake. Surely, what we ul­ti­mately want to know is e.g. the prob­a­bil­ity that it will rain to­mor­row, not whether or not rain is cor­re­lated with sprin­klers.

Yes, the ques­tion was what they would care about if they were only in­ter­ested in pre­dic­tions. And so I think I’ve mo­ti­vated why they would care about con­di­tional (in)de­pen­den­cies: it de­ter­mines the (min­i­mal) set of vari­ables they need to look at! What­ever min­i­mal method of rep­re­sent­ing their knowl­edge will then have these ar­rows (from one of the net­works that fits the data).

If you require that causality definitions be restricted to (uncorrelated) counterfactual operations (like Pearl’s “do” operation), then sure, the Armchairians won’t do that specific computation. But if you use the definition of causality from this article, then I think it’s clear that efficiency considerations will lead them to use something isomorphic to it.

• It saves you from hav­ing to write them un­til needed

I was say­ing that not ev­ery in­de­pen­dence prop­erty is rep­re­sentable as a Bayesian net­work.

What­ever min­i­mal method of rep­re­sent­ing their knowl­edge will then have these ar­rows (from one of the net­works that fits the data).

No! Once you have learned a dis­tri­bu­tion us­ing Bayesian net­work-based meth­ods, the min­i­mal rep­re­sen­ta­tion of it is the set of fac­tors. You don’t need the di­rec­tion of the ar­rows any more.

• I was say­ing that not ev­ery in­de­pen­dence prop­erty is rep­re­sentable as a Bayesian net­work.

You mean when all vari­ables are in­de­pen­dent, or some other class of cases?

No! Once you have learned a dis­tri­bu­tion us­ing Bayesian net­work-based meth­ods, the min­i­mal rep­re­sen­ta­tion of it is the set of fac­tors. You don’t need the di­rec­tion of the ar­rows any more.

Read the rest: you need the ar­rows if you want to effi­ciently look up the con­di­tional (in)de­pen­den­cies.

• You mean when all vari­ables are in­de­pen­dent, or some other class of cases?

Well, there are dou­bly-ex­po­nen­tially many pos­si­bil­ities…

The usual ex­am­ple for Markov net­works is four vari­ables con­nected in a square. The cor­re­spond­ing in­de­pen­dence as­sump­tion is that any two op­po­site cor­ners are in­de­pen­dent given the other two cor­ners. There is no Bayesian net­work en­cod­ing ex­actly that.

you need the ar­rows if you want to effi­ciently look up the con­di­tional (in)de­pen­den­cies.

But again, why would you want that? As I said in the grand^(n)par­ent, you don’t need to when do­ing in­fer­ence.

• The usual ex­am­ple for Markov net­works is four vari­ables con­nected in a square. The cor­re­spond­ing in­de­pen­dence as­sump­tion is that any two op­po­site cor­ners are in­de­pen­dent given the other two cor­ners. There is no Bayesian net­work en­cod­ing ex­actly that.

Okay, I’m re­call­ing the “trou­ble­some” cases that Pearl brings up, which gives me a bet­ter idea of what you mean. But this is not a coun­terex­am­ple. It just means that you can’t do it on a Bayes net with bi­nary nodes. You can still rep­re­sent that situ­a­tion by merg­ing (ei­ther pair of) the screen­ing nodes into one node that cov­ers all com­bi­na­tions of pos­si­bil­ities be­tween them.

Do you have an­other ex­am­ple?

But again, why would you want that? As I said in the grand^(n)par­ent, you don’t need to when do­ing in­fer­ence.

Sure you do: you want to know which and how many vari­ables you have to look up to make your pre­dic­tion.

• merg­ing (ei­ther pair of) the screen­ing nodes into one node

Then the net­work does not en­code the con­di­tional in­de­pen­dence be­tween the two vari­ables that you merged.

The task you have to do when making predictions is marginalization: in order to compute P(Rain|WetGrass), you need to sum P(Rain, X, Y, Z | WetGrass) over all possible values of the variables X, Y, Z that you didn’t observe. Here it is very helpful to have the distribution factored into a tree, since that can make it feasible to do variable elimination (or related algorithms like belief propagation). But the directions on the edges in the tree don’t matter; you can start at any leaf node and work across.
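For concreteness, a toy sketch of the marginalization step (made-up numbers, and brute-force enumeration rather than actual variable elimination, which would exploit the factorization to avoid enumerating every assignment):

```python
from itertools import product

# A toy factored model over three binary variables (numbers made up).
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.3, False: 0.7}

def p_wet(wet, rain, sprinkler):
    p = 0.95 if (rain or sprinkler) else 0.05
    return p if wet else 1 - p

def joint(rain, sprinkler, wet):
    return P_rain[rain] * P_sprinkler[sprinkler] * p_wet(wet, rain, sprinkler)

# P(Rain | WetGrass): sum the joint over the unobserved variable (Sprinkler),
# then normalize by the total probability of the evidence.
numerator = sum(joint(True, s, True) for s in (True, False))
evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(numerator / evidence)  # ~0.43: posterior probability of rain given wet grass
```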

• It’s not clear to me that the virtue the­ory of metabolism is a good ex­am­ple for this post, since it’s likely to be highly con­tentious.

• It seems clear to me that it is a very bad ex­am­ple. I find that con­sis­tently the worst part of Eliezer’s non-fic­tion writ­ing is that he fails to sep­a­rate con­tentious claims from writ­ings on un­re­lated sub­jects. More­over, he usu­ally dis­cards the tra­di­tional view as ridicu­lous rather than ad­mit­ting that its in­cor­rect­ness is ex­tremely non-ob­vi­ous. He goes so far in this piece as to give the stan­dard view a straw-man name and to state only the most laugh­able of its pro­po­nents’ jus­tifi­ca­tions. This mars an oth­er­wise ex­cel­lent piece and I am un­will­ing to recom­mend this ar­ti­cle to those who are not already read­ing LW.

• Yeah, I didn’t even mind the topic, but I thought this par­tic­u­lar sen­tence was pretty sketchy:

in the virtue the­ory of metabolism, lack of ex­er­cise ac­tu­ally causes weight gain due to di­v­ine pun­ish­ment for the sin of sloth.

This sounds like a Fully Gen­eral Mock­ery of any claim that hu­mans can ever af­fect out­comes. For ex­am­ple:

in the virtue the­ory of traf­fic, drink­ing al­co­hol ac­tu­ally causes ac­ci­dents due to di­v­ine pun­ish­ment for the sin of intemperance

in the virtue the­ory of con­cep­tion, un­pro­tected sex ac­tu­ally causes preg­nancy due to di­v­ine pun­ish­ment for the sin of lust

And se­lec­tively ap­plied Fully Gen­eral Mock­eries seem pretty Dark Artsy.

• in the virtue the­ory of traf­fic, drink­ing al­co­hol ac­tu­ally causes ac­ci­dents due to di­v­ine pun­ish­ment for the sin of intemperance

Of course not! The real rea­son drinkers cause more ac­ci­dents is that low-con­scien­tious­ness peo­ple are both more likely to drink be­fore driv­ing and more likely to drive reck­lessly. (The im­pair­ment of re­flexes due to al­co­hol does it­self have an effect, but it’s not much larger than that due to e.g. sleep de­pri­va­tion.) If a high-con­scien­tious­ness per­son was ran­domly as­signed to the “drunk driv­ing” con­di­tion, they would drive ex­tra cau­tiously to com­pen­sate for their im­pair­ment. ;-)

(I’m ex­ag­ger­at­ing for com­i­cal effect, but I do be­lieve a weaker ver­sion of this.)

• “Ex­tremely non-ob­vi­ous”? Have you looked at how many calories one hour of ex­er­cise burns, and com­pared that to how many calories food­stuffs com­mon in the First World con­tain?

• I agree that fo­cus­ing on in­put has far higher re­turns than fo­cus­ing on out­put. Sim­ple calorie com­par­i­son pre­dicts it, and in my per­sonal ex­pe­rience I’ve noted small ap­pear­ance and weight changes af­ter changes in ex­er­cise level and large ap­pear­ance and weight changes af­ter changes in in­take level. That said, the tra­di­tional view- “eat less and ex­er­cise more”- has the di­rec­tion of cau­sa­tion mostly right for both in­ter­ven­tions and to rep­re­sent it as just “ex­er­cise more” seems mis­taken.

• I was also distracted by the footnotes, since even though I found them quite funny, [3] at least is obviously wrong: “there’s no known intervention which can cause weight loss.” Sure there is—the effectiveness of bariatric surgery is quite well evidenced at this point (http://en.wikipedia.org/wiki/Bariatric_surgery#Weight_loss).

I gen­er­ally share Eliezer’s as­sess­ment of the state of con­ven­tional wis­dom in dietary sci­ence (abysmal), but care­less for­mu­la­tions like this one are—well, dis­tract­ing.

• Also, even if he meant non-sur­gi­cal in­ter­ven­tions, it should be “which can re­li­ably cause long-term weight loss”—there are peo­ple who lose weight by diet­ing, and a few of them don’t even gain it back.

• I think that’s why it’s a good ex­am­ple. It in­duces gen­uine cu­ri­os­ity about the truth, and about the method.

• It in­duces gen­uine cu­ri­os­ity about the truth

I sus­pect this de­pends on the han­dling of the is­sue. Eliezer pre­sent­ing his model of the world as “com­mon sense,” straw man­ning the al­ter­na­tive, and then us­ing fake data that backs up his prefer­ences is, frankly, em­bar­rass­ing.

This is es­pe­cially trou­ble­some be­cause this is an in­tro­duc­tory ex­pla­na­tion to a tech­ni­cal topic- some­thing Eliezer has done well be­fore- and in­tro­duc­tory ex­pla­na­tions are great ways to in­tro­duce peo­ple to Less Wrong. But how can I send this to peo­ple I know who will no­tice the bias in the sec­ond para­graph and stop read­ing be­cause that’s nega­tive ev­i­dence about the qual­ity of the ar­ti­cle? How can I send this to peo­ple I know who will ask why he’s us­ing two time-var­i­ant vari­ables as sin­gle acyclic nodes, rather than a lad­der (where ex­er­cise and weight at t-1 both cause ex­er­cise at t and weight at t)?

What would it look like to steel man the alternative? One of my physics professors called ‘calories in − calories out = change in fat’ the “physics diet,” since it was rooted in conservation of energy; that seems like a far better name. Like many things in physics, it’s a good first-order approximation to the engineering reality, but there are meaningful second-order terms to consider. “Calories in” is properly “calories absorbed,” not “calories put into your mouth” (though we’ll note it’s difficult to absorb more calories than you put into your mouth). Similarly, calories out is non-trivial to measure: current weight and activity level can give you a broad guess, but it can be complicated by many things, like ambient temperature! Any attempt we make to control calories in and calories out will have to be passed through the psychology and physiology of the person in question, making them even more difficult to control in the field.

Com­pare the vol­ume of dis­cus­sion of the method and the over­weight-ex­er­cise link in the com­ments.

• How can I send this to peo­ple I know who will ask why he’s us­ing two time-var­i­ant vari­ables as sin­gle acyclic nodes, rather than a lad­der (where ex­er­cise and weight at t-1 both cause ex­er­cise at t and weight at t)?

Why do you need to send this ar­ti­cle to peo­ple who could ask that? If you’re say­ing “Oh, this should ac­tu­ally be mod­eled us­ing causal markov chains...” then this is prob­a­bly too ba­sic for you.

• Why do you need to send this ar­ti­cle to peo­ple who could ask that?

Be­cause I’m still a grad stu­dent, most of those peo­ple that I know are rou­tinely en­gaged in teach­ing these sorts of con­cepts, and so will find ar­ti­cles like this use­ful for ped­a­gog­i­cal rea­sons.

• How can I send this to peo­ple I know who will ask why he’s us­ing two time-var­i­ant vari­ables as sin­gle acyclic nodes

This is not in­tended for read­ers who already know that much about causal mod­els, btw, it’s a very very very ba­sic in­tro.

• I’m torn. On the one hand, us­ing the method to ex­plain some­thing the reader prob­a­bly was not pre­vi­ously aware of is an awe­some tech­nique that I truly ap­pre­ci­ate. Yet Vaniver’s point that con­tro­ver­sial opinions should not be un­nec­es­sar­ily put into in­tro­duc­tory se­quence posts makes sense. It might turn off read­ers who would oth­er­wise learn from the text, like nyan sand­wich.

In my opinion, the best fix would be to steel­man the ar­gu­ment as much as pos­si­ble. Call it the physics diet, not the virtue-the­ory of metabolism. Add in an ex­tra few sen­tences that re­ally buff up the ba­sics of the physics diet ar­gu­ment. And, at the end, in­clude a note ex­plain­ing why the physics diet doesn’t work (ap­petite in­creases as ex­er­cise in­creases).

• The point Eliezer is ad­dress­ing is the one that RichardKen­n­away brought up sep­a­rately. Causal mod­els can still func­tion with feed­back (in Causal­ity, Pearl works through an eco­nomic model where price and quan­tity both cause each other, and have their own in­de­pen­dent causes), but it’s a bit more both­er­some.

A model where the three are one-time events- like, say, whether a per­son has a par­tic­u­lar gene, whether or not they were breast­fed, and their height as an adult- won’t have the prob­lem of be­ing cyclic, but will have the ped­a­gog­i­cal prob­lem that the cau­sa­tion is ob­vi­ous from the timing of the events.

One could have, say, the weather witch’s pre­dic­tion of whether or not there will be rain, whether or not you brought an um­brella with you, and whether or not it rained. Aside from learn­ing, this will be an acyclic sys­tem that has a num­ber of plau­si­ble un­der­ly­ing causal di­a­grams (with the pres­ence of the witch mak­ing the ex­am­ple clearly fic­tional and mud­dy­ing our causal in­tu­itions, so we can only rely on the math).

In my opinion, the best fix would be to steel­man the ar­gu­ment as much as pos­si­ble.

The con­cept of in­fer­en­tial dis­tance sug­gests to me that posts should try and make their path­ways as short and straight as pos­si­ble. Why write a dou­ble-length post that ex­plains both causal mod­els and metabolism, when you could write a sin­gle-length post that ex­plains only causal mod­els? (And, if metabolism takes longer to dis­cuss than causal mod­els, the post will mostly be about the illus­tra­tive de­tour, not the con­cept it­self!)

• won’t have the prob­lem of be­ing acyclic

Should that be “cyclic”? I take it from Richard’s post that “acyclic” is what we want.

• Yes, it should. Thanks for catch­ing the typo!

• The con­cept of in­fer­en­tial dis­tance sug­gests to me that posts should try and make their path­ways as short and straight as pos­si­ble. Why write a dou­ble-length post that ex­plains both causal mod­els and metabolism, when you could write a sin­gle-length post that ex­plains only causal mod­els? (And, if metabolism takes longer to dis­cuss than causal mod­els, the post will mostly be about the illus­tra­tive de­tour, not the con­cept it­self!)

You’ve con­vinced me. I now agree that EY should go back and edit the post to use a differ­ent more con­ven­tional ex­am­ple.

• In my opinion, the best fix would be to steel­man the ar­gu­ment as much as pos­si­ble. Call it the physics diet, not the virtue-the­ory of metabolism.

“Physics diet” and “virtue-the­ory of metabolism” are not steel­man and straw­man of each other; they are quite differ­ent things. Pro­po­nents of the physics diet (e.g. John Walker) do not say that if you want to lose weight you should ex­er­cise more—they say you should eat less. EDIT: the straw­man of this would be the the­ory that “ex­ces­sive eat­ing ac­tu­ally causes weight gain due to di­v­ine pun­ish­ment for the sin of glut­tony” (in­spired by Yvain’s com­ment).

Seriously; that was intended to be an example. What’s it matter whether the nodes are labelled “exercise/overweight/internet” or “foo/bar/baz”? (But yeah, Footnote 1 doesn’t belong there, and Footnote 3 might mention eating.)

• Tak­ing a “con­tentious” point and re­solv­ing it in to a set­tled fact made the whole ar­ti­cle vastly more en­gag­ing to me. It also struck me as an el­e­gant demon­stra­tion of the value of the tool: It didn’t sim­ply in­tro­duce the con­cept, it used it to ac­com­plish some­thing worth­while.

• Tak­ing a “con­tentious” point and re­solv­ing it in to a set­tled fact

! From the ar­ti­cle:

(which is to­tally made up, so don’t go ac­tu­ally be­liev­ing the con­clu­sion I’m about to draw)

• Eliezer’s data is made up, but all the not-made-up re­search I’ve seen sup­ports his ac­tual con­clu­sion. The net emo­tional re­sult was the same for me as if he’d used the ac­tual re­search, since my brain could sub­sti­tute it in.

Per­haps I am weird in hav­ing this emo­tional link, or per­haps I am sim­ply more fa­mil­iar with the not-made-up re­search than you.

• The net emo­tional re­sult was the same for me as if he’d used the ac­tual re­search, since my brain could sub­sti­tute it in.

I understand. I think it’s important to watch out for these sorts of illusions of transparency, though, especially when dealing with pedagogical material. One of the heuristics I’ve been using is “who would I not recommend this to?”, because that will use social modules my brain is skilled at using to find holes and snags in the article. I don’t know how useful that heuristic will be to others, and welcome the suggestions of others.

per­haps I am sim­ply more fa­mil­iar with the not-made-up re­search than you.

I am not an ex­pert in nu­tri­tional sci­ence, but it ap­pears to me that there is con­tro­versy among good nu­tri­tion­ists. This post also aided my un­der­stand­ing of the is­sue. I also de­tail some more of my un­der­stand­ing in this com­ment down an­other branch.

EDIT: Also, do­ing some more pok­ing around now, this seems rele­vant.

• Ahh, that heuris­tic makes sense! I wasn’t think­ing in that con­text :)

• P.S. When quot­ing two peo­ple, it can be use­ful to at­tribute the quotes. I ini­tially thought the sec­ond quote was your way of do­ing a snarky ed­i­to­rial com­ment on what I’d said, not quot­ing the ar­ti­cle...

• Thanks for the sug­ges­tion, I’ve ed­ited my com­ment to make it clearer.

• Pretty good overall, but downvoted due to the inflammatory straw-manning of the physics diet. That kind of sloppy thinking just makes me think you have big blind-spots in your rationality. Maybe it’s wrong, but it really has nothing to do with virtue or should-universes. To suggest otherwise is dishonest and rude. I usually don’t care about rude, but this pissed me off.

I strongly agree with Vaniver’s take

• Pretty good over­all, but down­voted due to the in­flam­ma­tory straw-man­ning of the physics diet. That kind of sloppy think­ing just makes me think you have big blind-spots in your ra­tio­nal­ity. Maybe it’s wrong, but it re­ally has noth­ing to do with virtue or should-uni­verses. To sug­gest oth­er­wise is dishon­est and rude. I usu­ally don’t care about rude, but this pissed me off.

• I have to admit it seemed entirely the wrong place for Eliezer to be dragging up his health issues. I find it really hard to keep reading a post once it starts throwing out petulant straw men. It’s a shame, because causality and inference are something Eliezer probably knows something about.

• Be­ing able to read off pat­terns of con­di­tional de­pen­dence and in­de­pen­dence is an art known as “D-sep­a­ra­tion”, and if you’re good at it you can glance at a di­a­gram like this...

In or­der to get bet­ter at this, I recom­mend down­load­ing and play­ing around with UnBBayes. Here’s a brief video tu­to­rial of the ba­sic us­age. The pro­gram is pretty buggy—for ex­am­ple, some­times it ran­domly re­fuses to com­pile a file and then starts work­ing af­ter I close the file and re­open it—but that’s more of a triv­ial in­con­ve­nience than a ma­jor prob­lem.

What’s great about UnBBayes is that it al­lows you to con­struct a net­work and then show how the prob­a­bil­ity flows around it; you can also force some vari­able to be true or false and see how this af­fects the sur­round­ing prob­a­bil­ities. For ex­am­ple, here I’ve con­structed a copy of the “Sea­son” net­work from the post, filled it with con­di­tional prob­a­bil­ities I made up, and asked the pro­gram to calcu­late the over­all prob­a­bil­ities. (This was no tough feat—it took me maybe five min­utes, most of which I spent on mak­ing up the prob­a­bil­ities.)

http://kajsotala.fi/Random/UnBBayesExample1.png

Let’s now run through Eliezer’s ex­pla­na­tion:

...and see that, once we already know the Season

So, we know the sea­son: let’s say that we know it’s fall. I tell the pro­gram to as­sume that it’s fall, and ask it to prop­a­gate the effects of this through­out the net­work. We can see how this changes the prob­a­bil­ites of the differ­ent vari­ables:

http://kajsotala.fi/Random/UnBBayesExample2.png

whether the Sprin­kler is on and whether it is Rain­ing are con­di­tion­ally in­de­pen­dent of each other—if we’re told that it’s Rain­ing we con­clude noth­ing about whether or not the Sprin­kler is on.

The word­ing here is a lit­tle am­bigu­ous, but Eliezer’s say­ing that with our knowl­edge of the sea­son, the vari­ables of “Sprin­kler” and “Rain” have be­come in­de­pen­dent. Find­ing out that it rains shouldn’t change the prob­a­bil­ity of the sprin­kler be­ing on. Let’s test this by set­ting it to rain and again prop­a­gat­ing the effects:

http://kajsotala.fi/Random/UnBBayesExample3.png

And in­deed, the prob­a­bil­ity of it be­ing wet in­creased, but the prob­a­bil­ity of the sprin­kler be­ing on didn’t change.

But if we then fur­ther ob­serve that the side­walk is Slip­pery, then Sprin­kler and Rain be­come con­di­tion­ally de­pen­dent once more, be­cause if the Side­walk is Slip­pery then it is prob­a­bly Wet and this can be ex­plained by ei­ther the Sprin­kler or the Rain but prob­a­bly not both, i.e. if we’re told that it’s Rain­ing we con­clude that it’s less likely that the Sprin­kler was on.

So let’s put the side­walk to “slip­pery”, and un­set “rain” again. I’ve defined the net­work so that the side­walk is never slip­pery un­less it’s wet, so set­ting the side­walk to “slip­pery” forces the prob­a­bil­ity of “wet” to 100%.

http://kajsotala.fi/Random/UnBBayesExample4.png

Now let’s see the effect on prob­a­bil­ities if we set it to rain—as Eliezer pre­dicted, the prob­a­bil­ity of the sprin­kler then goes down:

http://kajsotala.fi/Random/UnBBayesExample5.png

And vice versa, if we force the sprin­kler to be on, the prob­a­bil­ity of it rain­ing goes down:

http://kajsotala.fi/Random/UnBBayesExample6.png

(It oc­curs to me that even if the side­walk wasn’t wet, it could be slip­pery if it was cov­ered by leaves, or with ice. So there should ac­tu­ally be an ar­row from “sea­son” to “slip­pery”. That would be triv­ial to add.)

Another great thing about UnBBayes is that it not only helps you un­der­stand the di­rec­tion of the prob­a­bil­ity flows, but also the mag­ni­tude of differ­ent kinds of changes. Depend­ing on how you’ve set up the con­di­tional prob­a­bil­ities, a piece of in­for­ma­tion can have a huge im­pact on an­other vari­able (the side­walk be­ing slip­pery always forces the prob­a­bil­ity of “wet” to 100%, re­gard­less of any­thing else), or a rather minor one (when we already knew that it was fall and slip­pery, find­ing out that it rained only budged the prob­a­bil­ity of the sprin­kler by about five per­centage points). Even­tu­ally, the logic starts to be­come in­tu­itive.

Build your own net­works and play around with them!
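If you would rather not install anything, the same explaining-away experiment can be run in a few lines of Python. This is my own rough equivalent of the network above, with made-up conditional probabilities rather than the ones in the screenshots:

```python
from itertools import product

seasons = ("spring", "summer", "fall", "winter")
P_sprinkler = {"spring": 0.4, "summer": 0.7, "fall": 0.2, "winter": 0.05}  # P(on | season)
P_rain = {"spring": 0.4, "summer": 0.1, "fall": 0.5, "winter": 0.3}        # P(rain | season)

def p_wet(wet, sprinkler, rain):
    p = 0.95 if (sprinkler or rain) else 0.05
    return p if wet else 1 - p

def p_slippery(slippery, wet):
    p = 0.8 if wet else 0.0            # never slippery unless wet
    return p if slippery else 1 - p

def joint(season, sprinkler, rain, wet, slippery):
    return (0.25                       # uniform prior over seasons
            * (P_sprinkler[season] if sprinkler else 1 - P_sprinkler[season])
            * (P_rain[season] if rain else 1 - P_rain[season])
            * p_wet(wet, sprinkler, rain)
            * p_slippery(slippery, wet))

def p_sprinkler_given(**evidence):
    # P(Sprinkler=on | Slippery=yes, evidence), by brute-force enumeration.
    num = den = 0.0
    for season in seasons:
        for sprinkler, rain, wet in product((True, False), repeat=3):
            world = {"season": season, "sprinkler": sprinkler, "rain": rain, "wet": wet}
            if any(world[k] != v for k, v in evidence.items()):
                continue
            p = joint(season, sprinkler, rain, wet, True)
            den += p
            if sprinkler:
                num += p
    return num / den

print(p_sprinkler_given())           # P(on | slippery): ~0.56
print(p_sprinkler_given(rain=True))  # P(on | slippery, rain): ~0.27, explained away
```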

• After read­ing this post I was stunned. Now I think the cen­tral con­clu­sion is wrong, though I still think it is a great post, and I will go back to be­ing stunned if you con­vince me the con­clu­sion is cor­rect.

You’ve shown how to iden­tify the cor­rect graph struc­ture from the data. But you’ve erred in as­sum­ing that the di­rected edges of the graph im­ply causal­ity.

Imag­ine you did the same anal­y­sis, ex­cept in­stead of us­ing O=”over­weight” you use W=”wears size 44 or higher pants”. The data would look al­most the same. So you would reach an analo­gous con­clu­sion: that wear­ing large pants causes one not to ex­er­cise. This seems ob­vi­ously false un­less your no­tion of causal­ity is very differ­ent from mine.

In gen­eral, I think the fol­low­ing prin­ci­ple holds: in­fer­ring causal­ity re­quires an in­ter­ven­tion; it can­not be dis­cov­ered from ob­ser­va­tional data alone. A re­searcher who hy­poth­e­sized that W causes not-E could round up a bunch of peo­ple, have half of them wear big pants, ob­serve the effect of this in­ter­ven­tion on ex­er­cise rates, and then con­clude that there is no causal effect.

• You are cor­rect—di­rected edges do not im­ply causal­ity by means of only con­di­tional in­de­pen­dence tests. You need some­thing called the faith­ful­ness as­sump­tion, and ad­di­tional (causal) as­sump­tions, that Eliezer glossed over. Without causal as­sump­tions and with only faith­ful­ness, all you are re­cov­er­ing is the struc­ture of a statis­ti­cal, rather than a causal model. Without faith­ful­ness, con­di­tional in­de­pen­dence tests do not im­ply any­thing. This is a sub­tle is­sue, ac­tu­ally.

There is no magic—you do not get causal­ity with­out causal as­sump­tions.

• Is this another variation on the theme that one needs to assume the possibility of inductive reasoning to make an argument for it (or also assume Occam’s Razor to argue for it)? Also, the specific example he gave seems to me like an instance of “given very skewed data, the best guesses are still wrong” (there was a variation of that here at some point, regarding bets and opponents who have superior information). Or are you thinking of something more subtle?

• Even if you as­sume that we can do in­duc­tion (and as­sume faith­ful­ness!), con­di­tional in­de­pen­dence tests sim­ply do not se­lect among causal mod­els. They se­lect among statis­ti­cal mod­els, be­cause con­di­tional in­de­pen­dences are prop­er­ties of joint dis­tri­bu­tions (statis­ti­cal, rather than causal ob­jects). Link­ing those joint dis­tri­bu­tions with some­thing causal re­lies on causal as­sump­tions.

I think the biggest les­son to learn from Pearl’s book is to keep statis­ti­cal and causal no­tions sep­a­rate.

• He ad­dressed that in the third foot­note.

Or there might be some hid­den third fac­tor, a gene which causes both fat and non-ex­er­cise. By Oc­cam’s Ra­zor this is more com­pli­cated and its prob­a­bil­ity is pe­nal­ized ac­cord­ingly, but we can’t ac­tu­ally rule it out. It is ob­vi­ously im­pos­si­ble to do the con­verse ex­per­i­ment where half the sub­jects are ran­domly as­signed lower weights, since there’s no known in­ter­ven­tion which can cause weight loss.

The model as­sumes that those are the only rele­vant vari­ables. Given that as­sump­tion, we can prove that weight causes ex­er­cise. And that it can’t be the other way around.

If there are un­ob­served vari­ables, it’s pos­si­ble that they can cause weight and cause ex­er­cise. How­ever that wasn’t one of the hy­pothe­ses any­one be­lieved be­fore­hand; they were ar­gu­ing whether weight causes ex­er­cise or if ex­er­cise causes weight.

Second, even if there is an unobserved variable, the data still suggest that exercising more will not improve your weight. Otherwise internet use would correlate with weight, because internet use affects exercise: if exercise affected weight at all, then internet use would indirectly cause weight gain, and therefore correlate with it.

The whole point of the article is this trick: taking a weird and unrelated variable like internet use lets us discover the direction of causation, which, according to common knowledge about statistics, shouldn’t be possible without randomized controlled experiments.

• In this case, the true struc­ture would be O->E, O->W, I->E. If O is un­ob­served, then you con­fuse a fork for an ar­row, but I’m not sure you can ac­tu­ally get an ar­row point­ing the wrong way just by omit­ting vari­ables.

• (sum­mary)

Cor­re­la­tion does not im­ply cau­sa­tion,

but

cau­sa­tion im­plies cor­re­la­tion,

and therefore

no cor­re­la­tion im­plies no causation

...which per­mits the falsifi­ca­tion of some causal the­o­ries based on the ab­sence of cer­tain cor­re­la­tions.

• Computation, Causation, & Discovery starts with an overview chapter provided by Gregory Cooper.

The hope that no cor­re­la­tion im­plies no cau­sa­tion is referred to as the “causal faith­ful­ness as­sump­tion”.

While the faith­ful­ness as­sump­tion is plau­si­ble in many cir­cum­stances, there are cir­cum­stances in which it is in­valid.

Cooper discusses deterministic relationships and goal-oriented systems as two examples where it is invalid.

I think the causal discovery literature is aware of Milton Friedman’s thermostat and knows it by the name “failure of causal faithfulness in goal-oriented systems.”

• That post is slow to reach its point and kind of abrasive. Here’s a summary with a different flavor.

Out­put is set by some Stuff and a Con­trol sig­nal. Agent with full power over Con­trol and ac­cu­rate mod­els of Out­put and Stuff can negate the in­fluence of Stuff, mak­ing Out­put what­ever it wants, within the range of pos­si­ble Out­puts given Stuff. In­tu­itively Agent is set­ting Out­put via Con­trol, even though there won’t be a cor­re­la­tion if Agent is keep­ing Out­put con­stant. I’m not so sure whether it still makes sense to say, even in­tu­itively, that Stuff is a causal par­ent of Out­put when the agent trumps it.

Then we break the situ­a­tion a lit­tle. Sup­pose a driver is keep­ing a car’s speed con­stant with a gas pedal. You can make the Agent’s be­liefs in­ac­cu­rate (di­rectly, by show­ing the driver a video of an up­com­ing hill when there is none in front of the car, or by in­ter­ven­ing on Stuff, like in­tro­duc­ing a gust of wind the driver can’t see, and then just not up­dat­ing Agent’s be­lief). Like­wise you can make Agent par­tially im­po­tent (push down the driver’s leg on the gas pedal, give them a seizure, re­place them with an oc­to­pus). Fi­nally you can change what ap­par­ent val­ues and causal re­la­tions the agent wants to en­force (“Please go faster”).

And those are maybe how you test for consequentialist confounding in real life? You can set environment variables if the agent doesn’t anticipate you, or you can find that agent and make them believe noise, break their grasp on your precious variables, or change their desires.

• “Mil­ton Fried­man’s ther­mo­stat” is an ex­cel­lent ar­ti­cle (al­though most of the com­ments are clue­less). But some things about it bear em­pha­sis­ing.

Out­put is set by some Stuff and a Con­trol sig­nal.

Yes.

Agent with full power over Con­trol and ac­cu­rate mod­els of Out­put and Stuff can negate the in­fluence of Stuff

No. Con­trol sys­tems do not work like that.

All the Agent needs to know is how to vary the Out­put to bring the thing to be con­trol­led to­wards its de­sired value. It need not even be aware of any of the Stuff. It might or might not be helpful, but it is not nec­es­sary. The room ther­mo­stat does not: it sim­ply turns the heat­ing on when the tem­per­a­ture is be­low the set point and off when it is above. It nei­ther knows nor cares what the am­bi­ent tem­per­a­ture is out­side, whether the sun is shin­ing on the build­ing, how many peo­ple are in the room, or any­thing at all ex­cept the sensed tem­per­a­ture and the refer­ence tem­per­a­ture.

You can make the Agent’s be­liefs in­ac­cu­rate (di­rectly, by show­ing the driver a video of an up­com­ing hill when there is none in front of the car, or by in­ter­ven­ing on Stuff, like in­tro­duc­ing a gust of wind the driver can’t see, and then just not up­dat­ing Agent’s be­lief).

If you try to keep the speed of your car con­stant by de­liber­ately com­pen­sat­ing for the dis­tur­bances you can see, you will do a poor job of it. The Agent does not need to an­ti­ci­pate hills, and wind is in­visi­ble from in­side a car. In­stead all you have to do—and all that an au­to­matic cruise con­trol does—is mea­sure the ac­tual speed, com­pare it with the speed you want, and vary the ac­cel­er­a­tor pedal ac­cord­ingly. The cruise con­trol does not sense the gra­di­ent, head winds, tail winds, a drag­ging brake, or the num­ber of pas­sen­gers in the car. It doesn’t need to. All it needs to do is sense the ac­tual and de­sired speeds, and know how to vary the flow of fuel to bring the former closer to the lat­ter. A sim­ple PID con­trol­ler is enough to do that.

This con­cept is ab­solutely fun­da­men­tal to con­trol sys­tems. The con­trol­ler can func­tion, and func­tion well, while know­ing al­most noth­ing. While you can de­sign con­trol sys­tems that do—or at­tempt to do—the things you men­tion, sens­ing dis­tur­bances and com­put­ing the out­puts re­quired to coun­ter­act them, none of that is a pre­req­ui­site for con­trol. Most con­trol sys­tems do with­out such re­fine­ments.
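Here is a toy simulation of that point (a proportional-integral loop with made-up constants): the controller senses only the error between actual and desired speed, yet it holds the speed when an unmodeled disturbance appears.

```python
# A proportional-integral cruise control: it senses only (target - speed),
# never the disturbances themselves, yet compensates for them.
target, speed, integral = 30.0, 25.0, 0.0
kp, ki, dt = 0.5, 0.1, 0.1
for step in range(3000):
    disturbance = -2.0 if step > 1500 else 0.0   # an unseen headwind kicks in
    error = target - speed
    integral += error * dt
    throttle = kp * error + ki * integral        # control law: uses the error only
    speed += (throttle + disturbance) * dt       # plant: throttle plus the "Stuff"
    if step in (1400, 2999):
        print(step, round(speed, 3))             # near 30.0 both before and after
```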

• I’m fa­mil­iar with feed­back con­trol and I’ve used PID con­trol­ers in the de­sign of servo-hy­draulic sys­tems. That wasn’t the situ­a­tion the blog post de­scribed. If you have de­lays, or hys­tere­sis, or any other rea­son for a non-zero im­pulse re­sponse, you lose the van­ish­ing cor­re­la­tions which made the prob­lem in­ter­est­ing.

• Good point. And here’s a made-up par­allel ex­am­ple to that about weight/​ex­er­cise:

Sup­pose level of ex­er­cise can in­fluence weight (E → W), and that be­ing un­der­fed re­duces weight (U->W) di­rectly but will also re­duce the amount of ex­er­cise peo­ple do (U->E) by an amount where the effect of the re­duced ex­er­cise on weight ex­actly can­cels out the di­rect weight re­duc­tion.

Sup­pose also there is no ran­dom vari­a­tion in amount of ex­er­cise, so it’s purely a func­tion of be­ing un­der­fed.

If we look at data gen­er­ated in that situ­a­tion, we would find no cor­re­la­tion be­tween ex­er­cise and weight. Ex­am­in­ing only those two vari­ables we might as­sume no causal re­la­tion.

Adding in the third variable, we would find a perfect correlation between (lack of) exercise and underfeeding. Implications of finding this perfect correlation: you can’t tell if the causal relation between them should be E->U or U->E. And even if you somehow know the graph is (E->W), (U->E) and (U->W), there is no data on what happens to W for an underfed person who exercises, or a well-fed person who doesn’t exercise, so you can’t predict the effect of modifying E.
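A quick simulation of this setup (coefficients made up, with the direct effect tuned so the two paths cancel exactly) shows the correlation vanishing:

```python
import random
random.seed(0)

beta = 2.0           # E -> W: each unit of exercise reduces weight by beta
eps = 1.5            # U -> E: being underfed reduces exercise by eps
delta = beta * eps   # U -> W: direct effect, tuned so the two paths cancel

data = []
for _ in range(100000):
    u = 1 if random.random() < 0.5 else 0                 # underfed or not
    e = 5.0 - eps * u                                     # exercise: a pure function of U
    w = 70.0 - beta * e - delta * u + random.gauss(0, 1)  # weight, plus noise
    data.append((e, w))

n = len(data)
mean_e = sum(e for e, _ in data) / n
mean_w = sum(w for _, w in data) / n
cov = sum((e - mean_e) * (w - mean_w) for e, w in data) / n
var_e = sum((e - mean_e) ** 2 for e, _ in data) / n
var_w = sum((w - mean_w) ** 2 for _, w in data) / n
print(cov / (var_e * var_w) ** 0.5)  # ~0: the real E -> W effect is invisible
```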

• but will also re­duce the amount of ex­er­cise peo­ple do (U->E) by an amount where the effect of the re­duced ex­er­cise on weight ex­actly can­cels out the di­rect weight re­duc­tion.

It’s un­likely that two effects will ran­domly can­cel out un­less the situ­a­tion is the re­sult of some op­ti­miz­ing pro­cess. This is the case in Mil­ton Fried­man’s ther­mo­stat but doesn’t ap­pear to be the case in your ex­am­ple.

• It wouldn’t be ran­dom. It would be an op­ti­mis­ing pro­cess, tuned by evolu­tion (an­other well known op­ti­mis­ing pro­cess). If you have less food than needed to main­tain your cur­rent weight, ex­pend less en­ergy (on ac­tivi­ties other than try­ing to find more food). For most of our evolu­tion, los­ing weight was a per­sonal ex­is­ten­tial risk.

• I had meant to sug­gest some sort of un­in­tel­li­gent feed­back sys­tem. Not co­in­ci­dence, but also not an in­tel­li­gent op­ti­mi­sa­tion, so still not an ex­act par­allel to his ther­mo­stat.

• The ther­mo­stat was cre­ated by an in­tel­li­gent hu­man.

I never said the op­ti­miz­ing pro­cess had to be that in­tel­li­gent, i.e., the blind-idiot-god counts.

• Stud­ies can always have con­found­ing fac­tors, of course. And I wrote “falsifi­ca­tion” but could have more ac­cu­rately said some­thing about re­duc­ing the pos­te­rior prob­a­bil­ity. Lack of cor­re­la­tion (e.g. with speed) would sharply re­duce the p.p. of a sim­ple model with one in­put (e.g. gas pedal), but only re­duce the p.p. of a model with mul­ti­ple in­puts (e.g. gas pedal + hilly ter­rain) to a weaker ex­tent.

• No cor­re­la­tion only im­plies no cau­sa­tion if a cer­tain as­sump­tion called “faith­ful­ness” is true, not in gen­eral.

• Might you be will­ing to ex­plain for the rest of us what the “faith­ful­ness as­sump­tion” is, and why it’s needed for “no cor­re­la­tion” to im­ply “no cau­sa­tion”? I’d ap­pre­ci­ate it.

• Ab­solutely! In a typ­i­cal Bayesian net­work, we rep­re­sent a set of prob­a­bil­ity dis­tri­bu­tions by a di­rected acyclic graph, such that any dis­tri­bu­tion in the set can be writ­ten as

$p[x_1,\ldots,x_n]=\prod_i p[x_i | pa[x_i]]$

in other words, for ev­ery ran­dom vari­able in the dis­tri­bu­tion we as­so­ci­ate a node in the graph, and the dis­tri­bu­tion can be fac­tor­ized into a product of con­di­tional den­si­ties, where each con­di­tional is a vari­able (node) con­di­tional on vari­ables cor­re­spond­ing to par­ents of the node.

This im­plies that if cer­tain types of paths in the graph from a node set X to a node set Y are “blocked” in a par­tic­u­lar way (e.g. d-sep­a­rated) by a third set Z, then in all den­si­ties that fac­tor­ize as above, X is in­de­pen­dent of Y con­di­tional on Z. Note that this im­pli­ca­tion is one way. In par­tic­u­lar, we can still have some con­di­tional in­de­pen­dence that just hap­pens to hold be­cause the num­bers in the dis­tri­bu­tion lined up just right, and the graph does not in fact ad­ver­tise this in­de­pen­dence via d-sep­a­ra­tion. When this hap­pens, we say the dis­tri­bu­tion is un­faith­ful to the graph.

If we pick the parameters of a factorizing distribution at random, then almost all parameter picks will be faithful to the graph. However, lots of distributions are “near-unfaithful,” that is, they are faithful, but it’s hard to tell with limited samples. Moreover, we can’t tell in advance how many samples we need to tell. Also, it’s easy to construct faithfulness violations, and they do occur in practice. For example, we may have an AIDS drug that suppresses HIV (so it really does help!), but the drug is very nasty, with lots of side effects and so on, so doctors usually wait until the patient is already very sick before giving the drug. If we then look at associations between instances of the use of this drug, we may well find that those who take the drug either die more (positive association of drug with death!) or don’t die less frequently than those without the drug (no association of drug with death!).

Does this then mean the drug has a bad effect or no effect? No! It just means there is an ob­vi­ous con­founder of health sta­tus that we aren’t record­ing. In the sec­ond case, this con­founder is caus­ing the dis­tri­bu­tion over drug and death to be “un­faith­ful”: there is an ar­row from drug to death, but there is no de­pen­dence of death on drug. And yet there is still a causal effect.

Note: I am gloss­ing over some dis­tinc­tions be­tween a Bayesian net­work and a causal model in or­der to not muddy the dis­cus­sion. What is im­por­tant to note is that: (a) A Bayesian net­work is not a graph­i­cal causal model, but (b) a graph­i­cal causal model in­duces a Bayesian net­work on the ob­serv­able data. Faith­ful­ness (or lack of it) ap­plies to the net­work ap­pear­ing due to (b), and thus af­fects causal rea­son­ing in the un­der­ly­ing causal model.
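To see the drug example numerically, here is a quick simulation with invented numbers: the drug lowers every patient’s death risk by 20 percentage points, yet the raw association makes it look harmful, because doctors give it mostly to the sick; stratifying on (adjusting for) health status recovers the benefit.

```python
import random
from collections import defaultdict
random.seed(1)

counts = defaultdict(lambda: [0, 0])   # (sick, drug) -> [deaths, patients]
for _ in range(200000):
    sick = random.random() < 0.5
    drug = random.random() < (0.9 if sick else 0.1)            # doctors treat the sick
    p_death = (0.6 if sick else 0.3) - (0.2 if drug else 0.0)  # the drug genuinely helps
    counts[(sick, drug)][0] += random.random() < p_death
    counts[(sick, drug)][1] += 1

def rate(sick, drug):
    deaths, total = counts[(sick, drug)]
    return deaths / total

# Crude comparison: the drug is positively associated with death,
# because the people who take it were mostly very sick to begin with.
for drug in (True, False):
    deaths = sum(counts[(s, drug)][0] for s in (True, False))
    total = sum(counts[(s, drug)][1] for s in (True, False))
    print("drug" if drug else "no drug", round(deaths / total, 3))  # ~0.37 vs ~0.33

# Adjusting for the confounder recovers the true effect: ~ -0.2 in each stratum.
adjusted = sum(0.5 * (rate(s, True) - rate(s, False)) for s in (True, False))
print(round(adjusted, 3))
```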

• Thanks!

This seems like a big prob­lem for in­fer­ring “no cau­sa­tion” from “no cor­re­la­tion.” Is there a stan­dard method­olog­i­cal solu­tion? And, do re­searchers of­ten just choose to in­fer “no cau­sa­tion” from “no cor­re­la­tion” and hope for the best, or do they avoid in­fer­ring “no cau­sa­tion” from “no cor­re­la­tion” due to the fact that they can’t tell whether the faith­ful­ness as­sump­tion holds?

• Well, in some sense this is why causal in­fer­ence is hard. Most of the time if you see in­de­pen­dence that re­ally does mean there is noth­ing there. The rea­son­able de­fault is the null hy­poth­e­sis: there is no causal effect. How­ever, if you are pok­ing around be­cause you sus­pect there is some­thing there, then not see­ing any cor­re­la­tions does not mean you should give up. What it does mean is you should think about causal struc­ture and speci­fi­cally about con­founders.

What peo­ple do about con­founders is:

(a) Try to mea­sure them some­how (epi­demiol­ogy, medicine). If you can mea­sure con­founders you can ad­just for them, and then the effect can­cel­la­tion will go away.

(b) Try to find an in­stru­men­tal vari­able (econo­met­rics). If you can find a good in­stru­ment, you can get a causal effect with some para­met­ric as­sump­tions, even if there are un­mea­sured con­founders.

(c) Try to ran­dom­ize (statis­tics). This ex­plic­itly cuts out all con­found­ing.

(d) You can some­times get around un­mea­sured con­founders by us­ing strong me­di­at­ing vari­ables by means of “front-door” type meth­ods. Th­ese meth­ods aren’t re­ally well known, and aren’t com­monly used.

There is no royal road: get­ting rid of con­founders is the en­tire point of causal in­fer­ence. Peo­ple have been think­ing of clever ways to do it for close to a hun­dred years now. If you have in­finite sam­ples, and know where un­ob­served con­found­ing is, there is an al­gorithm for get­ting the causal effect from ob­ser­va­tional data by be­ing sneaky. This al­gorithm only suc­ceeds some­times, and if it doesn’t, there is no other way in gen­eral to do it (e.g. it’s “com­plete”). More in my the­sis, if you are cu­ri­ous.

• Thanks again.

One more ques­tion, since this is your field. Do you hap­pen to know of an in­stance where some new causal effect was dis­cov­ered from ob­ser­va­tional data via causal mod­el­ing, and this cause was later con­firmed by an RCT?

• Well, I think smoking/cancer was first established in case control studies. In general people move up the “hierarchy of evidence” Kawoomba mentioned. At the end of the day, people only trust RCTs (and they are right, other methods rely on more assumptions). There is another good example, but let me double check before posting.

With case con­trol stud­ies you have the ad­di­tional prob­lem of se­lec­tion bias, on top of con­found­ing.

• I thought there were still no ac­tual RCTs of smok­ing in hu­mans.

• Right, you can’t always RCT in hu­mans. But a causal mechanism + RCTs in an­i­mals biolog­i­cally close to hu­mans is con­vinc­ing for some­thing like lung can­cer where minor differ­ences among mam­mals shouldn’t mat­ter much (al­though e.g. bears have evolved some crazy stuff to deal with all that fat they eat be­fore hi­ber­nat­ing).

• where minor differ­ences among mam­mals shouldn’t mat­ter much

I think you are entirely too optimistic. I recently pointed out that the research indicates that animal studies routinely (probably usually) do not transfer, and as it happens, animal smoking studies are an example of this, according to Hanson. So the differences are often far from minor, and even if there were cancer in the animal studies, we could infer very little from it.

• Out of cu­ri­os­ity, do you smoke?

• I find much to agree with in Han­son’s writ­ings, but in this case I just don’t find him con­vinc­ing. One is­sue is that can­cer is a scourge of a long-liv­ing an­i­mal. One hy­poth­e­sis is that smok­ing causes long term cu­mu­la­tive dam­age, and you might not see effects in mice or dogs be­cause they die too soon re­gard­less. There is also the is­sue that we have a fair idea of the car­cino­genic mechanism now, so if you think smok­ing does not cause harm, there also needs to be a story how that mechanism is foiled in hu­mans.

• I find much to agree with in Han­son’s writ­ings, but in this case I just don’t find him con­vinc­ing.

His interpretation, or his evidence? I point this out because it looks to me like your position has shifted from “the smoking/lung cancer link is established by RCTs in animals” to “even though RCTs don’t establish the smoking/lung cancer link for animals, we have other reasons to believe in the smoking/lung cancer link for humans.”

• I find much to agree with in Han­son’s writ­ings, but in this case I just don’t find him con­vinc­ing...One hy­poth­e­sis is that smok­ing causes long term cu­mu­la­tive dam­age, and you might not see effects in mice or dogs be­cause they die too soon re­gard­less.

So: heads I win, tails you lose? If the stud­ies had found smok­ing caused can­cer in an­i­mals, well, that proves it! And if they don’t, well, that just means they didn’t run long enough so we can ig­nore them and say we “just don’t find them con­vinc­ing”...

There is also the is­sue that we have a fair idea of the car­cino­genic mechanism now, so if you think smok­ing does not cause harm, there also needs to be a story how that mechanism is foiled in hu­mans.

You don’t think there were plenty of ‘fair ideas’ of mechanisms float­ing around in the thou­sands of an­i­mal stud­ies and in­ter­ven­tions cov­ered in my an­i­mal stud­ies link? Any re­searcher worth his de­gree can come up with a plau­si­ble ex post ex­pla­na­tion.

• This al­gorithm only suc­ceeds some­times, and if it doesn’t, there is no other way in gen­eral to do it (e.g. it’s “com­plete”). More in my the­sis, if you are cu­ri­ous.

Your the­sis deals only with acyclic causal graphs. What is the cur­rent state of the art for cyclic causal graphs? You’ll know already that I’ve been look­ing at that, and I have var­i­ous pa­pers of other peo­ple that at­tempt to take steps in that di­rec­tion, but my im­pres­sion is that none of them ac­tu­ally get very far and there is noth­ing like a set of sub­stan­tial re­sults that one can point to. Even my own, were they in print yet, are pri­mar­ily nega­tive.

• The re­cent stuff I have seen is nega­tive re­sults:

(a) Can’t as­sign Pear­lian se­man­tics to cyclic graphs.

(b) If you assign equilibrium semantics, you might as well use a dynamic causal Bayesian network; a cyclic graph does not buy you anything.

(c) Finding a graph that represents the Markov property of the equilibrium distribution of a Markov chain represented by a causal DBN is an interesting open question. (This graph wouldn’t have a causal interpretation, of course.)

• As far as I can tell, epi­demiol­ogy and medicine are mostly do­ing (c), in the form of RCTs (which are the gold stan­dard of med­i­cal ev­i­dence, other than meta-re­views). There are other study de­signs such as most var­i­ants of case-con­trol stud­ies and co­hort stud­ies which do take the (a) ap­proach, but they aren’t con­sid­ered to be the same level of ev­i­dence as ran­dom­ized con­trol­led tri­als.

• but they aren’t con­sid­ered to be the same level of ev­i­dence as ran­dom­ized con­trol­led tri­als.

Quite rightly—if we ran­dom­ize, we don’t care what the un­der­ly­ing causal struc­ture is, we just cut all con­found­ing out any­ways. Meth­ods (a), (b), (d) all rely on var­i­ous struc­tural as­sump­tions that may or may not hold. How­ever, even given those as­sump­tions figur­ing out how to do causal in­fer­ence from ob­ser­va­tional data is quite difficult. The prob­lem with RCTs is ex­pense, ethics, and statis­ti­cal power (hard to en­roll a ton of peo­ple in an RCT).

Epi­demiol­ogy and medicine does a lot of (a), look for the key­words “g-for­mula”, “g-es­ti­ma­tion”, “in­verse prob­a­bil­ity weight­ing,” “propen­sity score”, “marginal struc­tural mod­els,” “struc­tural nested mod­els”, “co­vari­ate ad­just­ment,” “back-door crite­rion”, etc. etc.

Peo­ple talk about “con­trol­ling for other fac­tors” when dis­cussing as­so­ci­a­tions all the time, even in non-tech­ni­cal press cov­er­age. They are talk­ing about (a).

• Peo­ple talk about “con­trol­ling for other fac­tors” when dis­cussing as­so­ci­a­tions all the time, even in non-tech­ni­cal press cov­er­age. They are talk­ing about (a).

True, true. “Gold stan­dard” or “preferred level of ev­i­dence” ver­sus “what’s mostly con­ducted given the fund­ing limi­ta­tions”. How­ever, to make it into a guideline, there are of­ten RCT fol­low-ups for hope­ful as­so­ci­a­tions un­cov­ered by the lesser study de­signs.

look for the key­words “g-for­mula”, “g-es­ti­ma­tion”, “in­verse prob­a­bil­ity weight­ing,” “propen­sity score”, “marginal struc­tural mod­els,” “struc­tural nested mod­els”, “co­vari­ate ad­just­ment,” “back-door crite­rion”, etc. etc.

I, of course, know all of those. The let­ters, I mean.

• “No sub­tle con­founders” and “in­creas­ing sam­ple size (de­creases rele­vance and like­li­hood of such spe­cial cases)” would have m-an­swered your pre­vi­ous z-com­ments. (SCNR)

• That only works if by correlation you mean any kind of statistical dependence—Pearson’s correlation coefficient does vanish for certain relationships if they’re non-monotonic.
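For example, if X is symmetric around zero and Y = X², then Y is completely determined by X, yet the Pearson coefficient is zero in expectation. A quick numeric check:

```python
import random
random.seed(0)

xs = [random.uniform(-1, 1) for _ in range(100000)]
ys = [x * x for x in xs]             # Y is a deterministic function of X
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
print(cov / (var_x * var_y) ** 0.5)  # ~0 despite complete dependence
```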

• I re­call read­ing some lovely quote on this (from some­body of the old camp, who be­lieved that talk of ‘causal­ity’ was naive), but I couldn’t track it down in Pearl or on Google—if any­one knows of a good quote from the old school, may it be pro­vided.

Maybe it’s this one?

The law of causal­ity, I be­lieve, like much that passes muster among philoso­phers, is a relic of a by­gone age, sur­viv­ing, like the monar­chy, only be­cause it is er­ro­neously sup­posed to do no harm. (Rus­sell, 1913, p. 1).

It should be noted that Rus­sell later re­versed his skep­ti­cism about causal­ity.

• The law of causal­ity, I be­lieve, like much that passes muster among philoso­phers, is a relic of a by­gone age, sur­viv­ing, like the monar­chy, only be­cause it is er­ro­neously sup­posed to do no harm.

Out­side view: Con­sider the sentence

[X], I be­lieve, like much that passes muster among philoso­phers, is a relic of a by­gone age, sur­viv­ing, like the monar­chy, only be­cause it is er­ro­neously sup­posed to do no harm.

There are a number of words that could replace X in that sentence to produce something that would be considered a standard LW position. Are we making a similar mistake, i.e., assuming that, just because we don’t yet have a satisfactory theory of X, no such theory can exist?

• Are we making a similar mistake, i.e., assuming that, just because we don’t yet have a satisfactory theory of X, no such theory can exist?

Our in­abil­ity to come up with a plau­si­ble-sound­ing the­ory of X is not es­pe­cially strong ev­i­dence for the ab­sence of X, agreed.

Still less, though, is it ev­i­dence for the pres­ence of X.

Especially if the work a theory of X is supposed to do can be done without a theory of X, or turns out not to be necessary in the first place.

• Still less, though, is it ev­i­dence for the pres­ence of X.

Agreed, the ev­i­dence for the pres­ence of X is that hu­mans have been talk­ing about it for a long time and seem to mean some­thing.

Espe­cially if the work a the­ory of X is sup­posed to do can be done with­out a the­ory of X, or turn out not to be nec­es­sary in the first place.

Care­ful, it’s very easy to con­vince one­self that one doesn’t need a the­ory of X when one is ac­tu­ally hid­ing X be­hind cached thoughts and sneaked in con­no­ta­tions. For ex­am­ple, Rus­sell no doubt be­lieved that he didn’t need a the­ory of causal­ity to do the work the the­ory of causal­ity was sup­posed to do.

• Ab­solutely. If I fail to no­tice how the work is ac­tu­ally be­ing done, I will likely have all kinds of false be­liefs about that work.

• There are a number of words that could replace X in that sentence to produce something that would be considered a standard LW position. Are we making a similar mistake, i.e., assuming that, just because we don’t yet have a satisfactory theory of X, no such theory can exist?

And many more words that could re­place X in the sen­tence that wouldn’t be a stan­dard po­si­tion but just aren’t men­tioned be­cause only the rel­a­tively few differ­ences are even worth com­ment­ing on.

• This is absolutely incredible. As a broadly numerate person who's never studied stats, I literally did not know that this could be done, and while I'd heard about ways to work out what caused what from statistical information, I thought they had to have far more assumptions in them. I'm slightly distressed that I studied A level Maths and don't know this, given that the concept can apparently be taught in twenty minutes (judging by the post) and is massively open to testing. (I thought this post didn't make sense at first and that there were other interpretations of the data, but when I expressed those in the terminology set out, it became clear I was wrong.)

Se­ri­ously, thanks.

• I saw no dis­cernible effect on my weight or my musculature

How did you as­cer­tain pos­si­ble effects on your mus­cu­la­ture? If you used the in­ter-oc­u­lar trauma test, keep in mind that grad­ual changes in some­thing you look at ev­ery day (e.g. your­self, a room­mate, or a class­mate/​coworker) are much harder to dis­cern that way. Did you try ask­ing some­one who hadn’t seen you in a while, or com­par­ing two pho­tos of you taken a cou­ple months apart?

• Yes, his N=1 seems atypical, but it's true that those who don't enjoy physical activity, don't sustain it (usually the virtuous, or obsessive, cycle closes only after making enough progress and/or caring desperately about appearance to overcome the initial resistance). Also, overweight people shouldn't run on hard surfaces, or at all.

His ob­ser­va­tion “only those who find ex­er­cise pos­si­ble or re­ward­ing re­port reg­u­larly ex­er­cis­ing” is in­deed pow­er­ful, ob­vi­ous, and ne­glected. There’s a pre­vailing im­pu­ta­tion of virtue to phys­i­cal fit­ness and sloth to those who lack it, so real think­ing is scarce.

• Yes, his N=1 seems atyp­i­cal,

He said he ex­er­cised for a year, so I don’t think not-ex­er­cis­ing-enough was the rea­son he (thinks he) didn’t gain mus­cle.

it’s true that those who don’t en­joy phys­i­cal ac­tivity, don’t sus­tain it (usu­ally the vir­tu­ous, or ob­ses­sive, cy­cle closes only af­ter mak­ing enough progress and/​or car­ing des­per­ately about ap­pear­ance to over­come the ini­tial re­sis­tance)

• Leav­ing only this causal graph:

Well, one hasn't ruled out the "virtuous" graph until one computes that exercise and internet have a correlation.

• "[1] Somewhat to my own shame, I must admit to ignoring my own observations in this department—even after I saw no discernible effect on my weight or my musculature from aerobic exercise and strength training 2 hours a day 3 times a week, I didn't really start believing that the virtue theory of metabolism was wrong [2] until after other people had started the skeptical dogpile."

I am extremely skeptical of this portion; it would imply that Eliezer's body functions differently than literally every other person (myself included) I have ever known to make a serious attempt at working out. 2 hours, 3 times a week? How long did you try this?

• Were you try­ing to diet at the same time? Have you ever tried ex­er­cis­ing more with­out also re­strict­ing your food in­take?

Also, have you ever en­joyed ex­er­cis­ing while do­ing it?

Edit: Just to be clear, this isn’t sup­posed to be ad­vice, im­plicit or oth­er­wise. I’m just cu­ri­ous.

• Thanks for re­ply­ing.

If you don’t mind the con­tinued prob­ing: did your abil­ity to lift grow over that time pe­riod? Or were you about con­stant the whole year?

• I am extremely skeptical of this portion; it would imply that Eliezer's body functions differently than literally every other person (myself included) I have ever known to make a serious attempt at working out.

Ar­gu­ing from anec­dote, re­ally? Ex­er­cise re­sis­tance is a thing.

• The fact that people respond non-uniformly to weight training and exercise depending on their genetics and other factors is no big surprise. But showing no gains at all is something else altogether.

I can think of several questions I would ask about the study you linked. For example: "In the combined strength-and-endurance-exercise program, the volunteers' physiological improvement ranged from a negative 8 percent (meaning they became 8 percent less fit)" implies to me that the researchers didn't control for a host of other factors.

Anecdotes ARE data. Especially a lifetime of several of them, all accumulating the same way.

• im­plies to me that the re­searchers didn’t con­trol for a host of other fac­tors.

Aren’t you just con­ced­ing the point right there, and ad­mit­ting that in fact, there are peo­ple who will em­piri­cally see a nega­tive or zero effect size to their ex­er­cis­ing? Life is thought by most to be full of ‘a host of other fac­tors’...

• So you think my point is that ex­er­cise is magic? If you built my po­si­tion out of iron in­stead of straw, you might find that yes, ex­er­cise is not the ONLY im­por­tant fac­tor for fit­ness.

• Since you seem to have for­got­ten what you were ar­gu­ing, let us re­view. Eliezer wrote:

I saw no dis­cernible effect on my weight or my mus­cu­la­ture from aer­o­bic ex­er­cise and strength train­ing 2 hours a day 3 times a week

You wrote:

it would imply that Eliezer's body functions differently than literally every other person (myself included) I have ever known

And im­plied it must be im­pos­si­ble, hence Eliezer must be do­ing some­thing wrong.

I linked a study show­ing that peo­ple ‘do­ing it right’ could see their fit­ness go down, em­piri­cally re­fut­ing your uni­ver­sal­iz­ing claim that “ev­ery other per­son (my­self in­cluded)” would see their fit­ness only go up.

You then tried to wave away the study by a fully gen­eral counter-ar­gu­ment ap­peal­ing to other fac­tors ex­plain­ing why some peo­ple could see their fit­ness de­crease… But nei­ther I nor Eliezer ever made an ar­gu­ment about what caused the ex­er­cise re­sis­tance, merely that some peo­ple would em­piri­cally see their fit­ness de­crease or re­main sta­ble.

When I pointed this out, you smarmily replied about how I’m be­ing un­fair to you and straw­man­ning you, and im­plied that I hold the­o­ries of ex­er­cise as “magic”.

Per­son­ally, I see no need to con­struct any ‘iron­man­ning’ of your po­si­tion, since you do not yet seem to have un­der­stood that what we were say­ing was limited to ques­tions of fact and not spec­u­la­tion about what might ex­plain said ob­served fact. (What, ex­actly, is the iron­man­ning of a fact—as op­posed to a the­ory or paradigm?)

• I deny that the study had peo­ple all “do­ing it right”. In Eliezer’s case, I gave him the benefit of the doubt that he was in­tel­li­gent enough to avoid ob­vi­ous con­founders.

If someone gets sick (for example) towards the end of the study and then shows a "negative 8 percent" fitness level, then their data is crap.

If the study did not con­trol for in­ten­sity then it is crap.

The difference between someone actually doing an effortful workout and someone just being present at the gym for a period of time is astronomical, and the latter is an extremely common occurrence.

The study had an age range from 40 to 67...

This study is garbage.

• If someone gets sick (for example) towards the end of the study and then shows a "negative 8 percent" fitness level, then their data is crap.

And they could have been sick at the start, as well, pro­duc­ing pseudo gains… You’re pos­tu­lat­ing things which you have no rea­son to think hap­pened to ex­plain things that did hap­pen; nowhere is any­thing in­di­cated about that and you are ar­gu­ing solely that be­cause you dis­like the re­sults, the re­searchers were in­com­pe­tent.

If the study did not con­trol for in­ten­sity then it is crap.

Why should there be any con­trol for in­ten­sity? They did an in­ter­ven­tion; there should be a non-zero effect. If any level of ex­er­cise does not show any benefits, then you are wrong. And I guess you did not read the link, be­cause sev­eral in­ter­ven­tions were tested and did not show any differ­ence in terms of ex­er­cise re­sis­tance.

The study had an age range from 40 to 67...

So? Why do you think that ex­er­cise should be en­tirely in­effec­tive in peo­ple age 67? Are 40yos from a differ­ent species where ex­er­cise does not work? By ex­am­in­ing older peo­ple, who are much less fit and much more seden­tary, shouldn’t the effects be even more dra­matic and visi­ble?

This study is garbage.

• So, in addition to "Individual responses to combined endurance and strength training in older adults", Karavirta 2011, let me also cite "Endurance training-induced changes in insulin sensitivity and gene expression", "Individual differences in response to regular physical activity", "Effects of Exercise Training on Glucose Homeostasis: The HERITAGE Family Study", "Adverse Metabolic Response to Regular Exercise: Is It a Rare or Common Occurrence?", "Genomic predictors of trainability", "Effects of gender, age, and fitness level on response of VO2max to training in 60–71 yr olds", "Resistance to exercise-induced weight loss: compensatory behavioral adaptations", and "Cardiovascular autonomic function correlates with the response to aerobic training in healthy sedentary subjects", to name a few. (One nice thing about HERITAGE and Bouchard's earlier studies is that they recorded exercise, so spare me the 'maybe they didn't actually exercise'.) In these, too, some people don't benefit from exercise, showing that individual differences in exercise trainability exist.

• Ep­stein 2014, The Sports Gene, ch6 “Su­perbaby, Bully Whip­pets, and the Train­abil­ity of Mus­cle”, pg68:

[...de­scrip­tion of Bouchard/​HERITAGE Fam­ily Study...]

A series of studies in 2007 and 2008 at the University of Alabama-Birmingham's Core Muscle Research Laboratory and the Veterans Affairs Medical Center in Birmingham showed that individual differences in gene and satellite cell activity are critical to differentiating how people respond to weight training. Sixty-six people of varying ages were put on a four-month strength training plan (squats, leg press, and leg lifts), all matched for effort level as a percentage of the maximum they could lift. (A typical set was eleven reps at 75 percent of the maximum that could be lifted for a single rep.) At the end of the training, the subjects fell rather neatly into three groups: those whose thigh muscle fibers grew 50 percent in size; those whose fibers grew 25 percent; and those who had no increase in muscle size at all. A range from 0 percent to 50 percent improvement, despite identical training. Sound familiar? Just like the HERITAGE Family Study, differences in trainability were immense, only this was strength as opposed to endurance training. Seventeen weight lifters were "extreme responders," who added muscle furiously; thirty-two were moderate responders, who had decent gains; and seventeen were nonresponders, whose muscle fibers did not grow.* …The Birmingham researchers took a HERITAGE-like approach in their search for genes that might predict the high satellite cell folk, or high responders, from the low responders to a program of strength training. Just as the HERITAGE and GEAR studies found for endurance, the extreme responders to strength training stood out by the expression levels of certain genes. Muscle biopsies were taken from all subjects before the training started, after the first session, and after the last session. Certain genes were turned up or down similarly in all of the subjects who lifted weights, but others were turned up only in the responders. One of the genes that displayed much more activity in the extreme responders when they trained was IGF-IEa, which is related to the gene that H. Lee Sweeney used to make his Schwarzenegger mice. The other standouts were the MGF and myogenin genes, both involved in muscle function and growth. The activity levels of the MGF and myogenin genes were turned up in the high responders by 126 percent and 65 percent, respectively; in the moderate responders by 73 percent and 41 percent; and not at all in the people who had no muscle growth.

Every similar strength-training study has reported a broad spectrum of responsiveness to iron pumping. In Miami's GEAR study, the strength gains of 442 subjects in leg press and chest press ranged from under 50 percent to over 200 percent. [GEAR study data was generously shared by members of the University of Miami research team.] A twelve-week study of 585 men and women, run by an international consortium of hospitals and universities, found that upper-arm strength gains ranged from zero to over 250 percent.

• I don’t even re­mem­ber this con­ver­sa­tion (4 years of necro­mancy?). I don’t re­mem­ber the con­text of our dis­cus­sion, and it seems like I did a bad job of com­mu­ni­cat­ing what­ever my origi­nal point was and over-ex­ag­ger­ated. I am pretty sure you have a bet­ter un­der­stand­ing of the data.

• The context was whether exercise resistance was a thing that existed (and hence, whether it was something Eliezer could have). I was revisiting my old comments on the topic to grab the citations I had dug up while working on a section of my longevity cost-benefit analysis. There I observe that, given the phenomenon of exercise resistance, behavioral backlash like lowering basal activity levels, and twin studies indicating various exercise correlations are partially genetically confounded, we should be genuinely doubtful about how much exercise will help with non-athletic or cosmetic things, and we should be demanding randomized trials.

• we should be gen­uinely doubt­ful about how much ex­er­cise will help with non-ath­letic or cos­metic things.

First, we prob­a­bly should be in­ter­ested in the amount of to­tal phys­i­cal ac­tivity—“ex­er­cise” im­plies ad­di­tional ac­tivity be­sides the baseline and the baseline varies a LOT. Some peo­ple work as lum­ber­jacks and some peo­ple only move be­tween the couch and the fridge.

Second, as long as we are expressing wishes about studies, I'd like those studies to focus on differences between groups of people (e.g. run some clustering) and not just smush everything together into overall averages.

Third, there is one more category besides longevity and (athletic and/or cosmetic): quality of life. Being fit noticeably improves it, and being out of shape makes it worse.

• Anecdotes are poisonous data, and it is best to exclude them from your reasoning when possible. They are subject to a massive selection bias. At best they are useful for inferring the existence of something, e.g. "I once saw a plesiosaur in Loch Ness." Even then the inference is tenuous, because all you know is that there is at least one individual who says they saw a plesiosaur. Inferring the existence of a plesiosaur requires that you have additional supporting evidence that assigns a high probability that they are telling the truth, that their memory has not changed significantly since the original event, and that the original experience was genuine.

• I'm wondering if there are studies controlling for exercise enjoyment, among other factors.

• While there are in­di­vi­d­ual differ­ences in how fast the neu­ro­mus­cu­lar sys­tem adapts to ex­er­cise, the abil­ity to adapt is ab­solutely re­quired in or­der to main­tain nor­mal func­tion. Sig­nifi­cant ab­nor­mal­ities of the neu­ro­mus­cu­lar sys­tem re­sult in dis­abling con­di­tions such as mus­cu­lar at­ro­phy or mus­cu­lar dys­tro­phy.

As far as I know, Yudkowsky is able-bodied; therefore his muscles must exhibit a response to exercise within the normal healthy human range.

The fact that he at­tempted to train and didn’t ob­serve any sig­nifi­cant strength in­crease is best ex­plained by the hy­poth­e­sis that he used an im­proper train­ing regime or just didn’t keep train­ing for long enough, not by the hy­poth­e­sis that he has some weird alien biol­ogy.

• As far as I know, Yudkowsky is able-bodied; therefore his muscles must exhibit a response to exercise within the normal healthy human range.

Not quite. It is only implied that he responds to exercise within the range of functional survivability, not normality or healthiness.

The fact that he at­tempted to train and didn’t ob­serve any sig­nifi­cant strength increase

A lack of strength increase would be particularly weird. I thought the subject was weight and body composition. The most dramatic early strength increase comes from the 'neuro' part of 'neuromuscular', so a lack of strength increase on a given strength-related task, when going from sedentary to performing said task regularly, would indicate a much more significant problem than merely failing to gain significant muscle mass.

is best ex­plained by the hy­poth­e­sis that he used an im­proper train­ing regime or just didn’t keep train­ing for long enough, not by the hy­poth­e­sis that he has some weird alien biol­ogy.

There has been enough in­for­ma­tion pro­vided that we can rea­son­ably hy­poth­e­size that Eliezer’s ex­er­cise re­sponse is at least a stan­dard de­vi­a­tion or two in the di­rec­tion of “ge­net­i­cally dis­ad­van­taged” on the rele­vant scale of ex­er­cise re­sponse.

• A lack of strength increase would be particularly weird. I thought the subject was weight and body composition. The most dramatic early strength increase comes from the 'neuro' part of 'neuromuscular', so a lack of strength increase on a given strength-related task, when going from sedentary to performing said task regularly, would indicate a much more significant problem than merely failing to gain significant muscle mass.

That would imply that he has a neurological disorder that impairs motor function only up to the extent that it prevents performance from improving past the requirements of a sedentary lifestyle, but not to the extent of causing actual disability. Is anything like this documented in the medical literature?

There has been enough in­for­ma­tion pro­vided that we can rea­son­ably hy­poth­e­size that Eliezer’s ex­er­cise re­sponse is at least a stan­dard de­vi­a­tion or two in the di­rec­tion of “ge­net­i­cally dis­ad­van­taged” on the rele­vant scale of ex­er­cise re­sponse.

It seems to me that Yudkowsky is quite prone to rationalization: he might have started to train, not particularly liked it, and, when he didn't get the results he hoped for, instead of revising his training program or keeping at it for a longer time, he came up with the weird genetic condition as an excuse to quit. At least, this explanation appears to be more likely than the hypothesis that he actually has a weird genetic condition unknown to science, AFAIK.

• Do you actually believe in the virtue theory of metabolism, or do you believe in the conservation of energy, with ATP synthesized through the breakdown of food nutrients being used to synthesize lipids?

There are additional confounding factors, including genetics, heredity separately from genetics (many organelles are not coded in DNA), and environmental factors which cause hormone fluctuations. Seth Roberts' studies as linked show variations in appetite which cause variations in body fat, and provide a clear theory on a specific mechanism by which appetite can be intentionally altered.

• This slammed into my “math is hard” block. I will re­turn and read it, but it’s go­ing to be work.

But on pon­der­ing that, I think I re­al­ized why math is hard, com­pared to prose text that just pre­sents it­self as a fait ac­com­pli to my at­ten­tion. (And why it is not hard, for some peo­ple who are sa­vants.)

We are not ex­e­cut­ing math­e­mat­i­cal com­pu­ta­tions. We are em­u­lat­ing a crude math­e­mat­i­cal com­puter which takes the kind of ex­plicit al­gorithms that are fed to stu­dents. No at­tempt is made to cul­ti­vate and tune a “feel” for the re­sult (which is what ex­e­cut­ing com­pu­ta­tions would be like, since it’s what the other hard com­pu­ta­tions we do—like read­ing—feel like).

Just putting that out there.

• I just fo­cus on un­der­stand­ing ideas when I’m not will­ing to do math work.

The rough and non­tech­ni­cal ex­pla­na­tion of this post that I’ve got­ten is: You can’t tell what causes what when you’ve just got two things that come to­gether. But when you’ve got three things, then you can make pairs out of them, and the re­la­tion­ship be­tween the pairs can tell you when some­thing isn’t caus­ing some­thing else. (Un­less there are com­pli­cat­ing fac­tors like Fried­man’s Ther­mo­stat, see the com­ments be­low.)

• But I want to do math work. My in­abil­ity to think in math is a se­ri­ous weak­ness.

• Coursera’s math­e­mat­i­cal think­ing class is more than half over. But I’m re­ally en­joy­ing it, so you might keep an eye out for re­peats.

• Agreed, I would like that too. Ad­vice and re­sources would be nice.

• This slammed into my “math is hard” block.

“Math class is tough!”

- Barbie

• Causal networks have another name in computer science, in the context of compilers. They're the dataflow diagrams of loop-free, array-free, non-recursive functions represented in SSA form. (Functions that contain loops, arrays and recursion can still be reasoned about with causal networks, but only locally—you have to unroll everything, and that blows up quickly if you try to do it globally.)

A dataflow di­a­gram is when you give each vari­able in a func­tion a node, and draw an edge from the vari­able to each of the other vari­ables that was used in its defi­ni­tion. SSA form is when you write a func­tion in such a way that ev­ery vari­able is as­signed to ex­actly once—ie, you can’t give a var a value, use it, then give it a new value. Com­pilers au­to­mat­i­cally trans­late func­tions to SSA form by giv­ing unique names to all the as­sign­ments, and defin­ing a new vari­able (con­ven­tion­ally marked with a com­bi­na­tion func­tion phi) at each point where one of sev­eral of those names would be cho­sen based on branch con­di­tions. Com­pilers use this form be­cause it makes it eas­ier to check which op­ti­miza­tions and trans­forms are valid be­fore ap­ply­ing them.
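Here's a minimal sketch of that idea (the function is made up for illustration): the first version assigns to x twice; the SSA-style version gives each assignment a fresh name and merges the branch with a phi-style select.

```python
# Original: 'x' is assigned more than once.
def f(c):
    x = 1
    if c:
        x = x + 2
    return x * 3

# SSA-style rewrite: every name is assigned exactly once. The merge
# point after the branch picks a value, playing the role of phi(x1, x0).
# (A real compiler would only compute x1 on the taken branch; both
# right-hand sides are pure here, so evaluating eagerly is harmless.)
def f_ssa(c):
    x0 = 1
    x1 = x0 + 2
    x2 = x1 if c else x0  # phi node
    return x2 * 3

assert all(f(c) == f_ssa(c) for c in (True, False))
```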

• Is this com­pa­rable to causal net­works in some sense, other than be­ing rep­re­sented as a di­rected graph? Di­graphs are ubiquitous in ap­pli­ca­tions of math, they’re used for all sorts of things.

• I’m not sure if this is what jim­ran­domh was get­ting at, but there does seem to be a con­nec­tion to me deeper than merely both be­ing DAGs: op­ti­miza­tions are of­ten limited by how much in­fer­ence can be made about the causal re­la­tions of vari­ables and func­tions.

One of Haskell’s pur­poses is to see what hap­pens when you make de­pen­den­cies ex­plicit, re­mov­ing most global state, and what op­ti­miza­tions are en­abled. For ex­am­ple, con­sider con­stant prop­a­ga­tion: a con­stant is a node un­con­nected to any other node, and there­fore can­not change, and be­ing un­chang­ing can be in­lined ev­ery­where and as­sumed to be what­ever its value is. If the con­stant were not con­stant but had a causal con­nec­tion to, say, a node which is an IO String that the user types in, then all these in­lin­ings are for­bid­den and must re­main in­di­rec­tions to the user in­put.

Or con­sider pro­file-guided op­ti­miza­tion: we re­lo­cate branches based on which one is usu­ally taken (prob­a­bil­ity!), but what about branches not speci­fi­cally in our pro­filing sam­ple? Seems to me like causal in­fer­ence on the pro­gram DAG might let you in­fer which branch is most likely.
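A tiny sketch of the constant-propagation picture above (the IR and node names are invented for illustration, in Python rather than Haskell): a node with no causal connection to any input can be folded to a value; a node downstream of an input cannot.

```python
# Constant propagation over a tiny dataflow graph (made-up IR).
graph = {
    "k":   ("const", 7),           # unconnected to any input: foldable
    "inp": ("input", None),        # e.g. an IO-derived value: not foldable
    "a":   ("add", ["k", "k"]),    # depends only on constants: foldable
    "b":   ("add", ["a", "inp"]),  # depends on input: must stay an indirection
}

def fold(node):
    kind, args = graph[node]
    if kind == "const":
        return args
    if kind == "input":
        return None  # unknown until runtime
    vals = [fold(a) for a in args]
    if all(v is not None for v in vals):
        return sum(vals)  # 'add' node with all-constant parents: fold it
    return None

for n in graph:
    print(n, "->", fold(n))  # k -> 7, inp -> None, a -> 14, b -> None
```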

• If we know that there’s a bur­glar, then we think that ei­ther an alarm or a re­ces­sion caused it; and if we’re told that there’s an alarm, we’d con­clude it was less likely that there was a re­ces­sion, since the re­ces­sion had been ex­plained away.

Is this to say that a given node/​ob­ser­va­tion/​fact can only have one cause?

More concretely, let's say we have nodes x, y, and z, with causation arrows from x to z and from y to z:

X           Y
  \       /
      Z

If z is just an “and” logic gate, that out­puts a “True” value only when x is True and y is True, then it seems like it must be caused by both x and y.

Am I mix­ing up my ab­strac­tions here? Is there some rea­son why logic gate-like rules are dis­al­lowed by causal mod­els?

• Logic gates are al­lowed just fine. For ex­am­ple, if bur­glars and earth­quakes both cause alarms, then A=OR(B,E). You could also have AND, or any other imag­in­able way of com­bin­ing the vari­ables.

The "explained away" thing isn't worded very well. For example, imagine that B and E are independent and each have probability 1/5. Then learning that there was an alarm (A) raises your probabilities of both B and E to 5/9, but then learning that there was an earthquake (E) lowers your probability of burglar (B) back to 1/5. That's the "explained away" effect. With other logic gates you'd see other effects.
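Here's a minimal sketch that grinds through those numbers (assuming exactly the setup above: independent B and E at 1/5 each, and a deterministic OR gate for A):

```python
from itertools import product

p_b = p_e = 1 / 5  # independent priors for Burglar and Earthquake

def prior(b, e):
    return (p_b if b else 1 - p_b) * (p_e if e else 1 - p_e)

# Enumerate the four (B, E) worlds; the alarm is a deterministic OR gate.
worlds = [(b, e, b or e, prior(b, e)) for b, e in product([True, False], repeat=2)]

p_a = sum(p for b, e, a, p in worlds if a)
p_b_given_a = sum(p for b, e, a, p in worlds if a and b) / p_a
p_ae = sum(p for b, e, a, p in worlds if a and e)
p_b_given_ae = sum(p for b, e, a, p in worlds if a and e and b) / p_ae

print(p_b_given_a)   # 5/9 ~ 0.556: hearing the alarm raises P(Burglar)
print(p_b_given_ae)  # 1/5 = 0.2: the earthquake explains the alarm away
```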

• I re­call read­ing some lovely quote on this (from some­body of the old camp, who be­lieved that talk of ‘causal­ity’ was naive), but I couldn’t track it down in Pearl or on Google—if any­one knows of a good quote from the old school, may it be pro­vided.

You think­ing of this maybe, quoted in the Epi­logue in Pearl’s Causal­ity?

Beyond such dis­carded fun­da­men­tals as ‘mat­ter’ and ‘force’ lies still an­other fetish among the in­scrutable ar­cana of mod­ern sci­ence, namely, the cat­e­gory of cause and effect.

That’s Karl Pear­son, as in “Pear­son’s r”, the cor­re­la­tion co­effi­cient.

• It is ob­vi­ously im­pos­si­ble to do the con­verse ex­per­i­ment where half the sub­jects are ran­domly as­signed lower weights, since there’s no known in­ter­ven­tion which can cause weight loss.

There is ex­actly one such in­ter­ven­tion that has been shown to cause per­sis­tent weight loss af­ter the in­ter­ven­tion pe­riod is over. (Star­va­tion also causes weight loss, even­tu­ally, but only dur­ing the in­ter­ven­tion pe­riod.)

• If my stom­ach doesn’t grow back to full size, sounds like an on­go­ing in­ter­ven­tion to me! :D (Also, since peo­ple don’t ex­clude weight loss meth­ods that are long-term plans, I’d bet there are some in­ter­est­ing things that have been shown to work as long-term in­ter­ven­tions.)

• (Also, since peo­ple don’t ex­clude weight loss meth­ods that are long-term plans, I’d bet there are some in­ter­est­ing things that have been shown to work as long-term in­ter­ven­tions.)

Nope. There haven’t been any that have been shown to work.

• I’m told that 5% of dieters keep off the weight long-term. (In­ter­est­ingly, this is also the suc­cess rate of quit­ting smok­ing.) Un­less 5% of peo­ple who don’t try to lose weight also lose weight and keep it off, sounds like diets work, just not very well.

• You would have to com­pare to how many non-dieters lose weight and keep it off long-term.

• 5% is a very small effect. Not only would you want to see the con­trol group, but you’d need a huge sam­ple size to get any­where.

• Un­less 5% of peo­ple who don’t try to lose weight also lose weight and keep it off, [...]

Is there a stan­dard ab­bre­vi­a­tion for “I would like to see this testable pre­dic­tion tested”?

• Chronic co­caine use. Let’s start with the fun stuff and go from there.

• Let me rephrase: There is no such in­ter­ven­tion that is con­sid­ered less dan­ger­ous than be­ing obese.

• I dunno, I feel like you're just patching. Universal statements are always so fragile. Did that drug that made you poop out the fat you ate lead to weight loss? It looks like it's been shown to be effective for at least 2 years. How about appetite suppressants (safer ones than cocaine, that is)? The studies seem to be over shorter time periods, but is that because of safety/effectiveness reasons, or just habit?

• What about a small amount of mild stim­u­lant use?

• I dunno. The FDA did ap­prove a cou­ple of drugs this year, but they might only be in­tended for short-term use.

• I know that the an­tide­pres­sant Wel­lbutrin, which is a stim­u­lant, has been as­so­ci­ated with a small amount of weight loss over a few months, though I’m not sure if this has been shown to stay for longer. That’s an off-la­bel use though.

I’d guess that any stim­u­lant would show weight loss in the short-term. Is there some rea­son this wouldn’t stay long-term?

• There are a lot of drugs that peo­ple de­velop tol­er­ances to when used over long pe­ri­ods of time (the body’s var­i­ous feed­back mechanisms re­cal­ibrate them­selves to com­pen­sate for the drug’s pres­ence), but I can’t say with any au­thor­ity that this ap­plies to mild stim­u­lant use and weight loss.

• I’m pretty sure tol­er­ance to caf­feine is a thing, judg­ing from what I see on other peo­ple. (I usu­ally ab­stain from drink­ing any­thing with caf­feine at least on week­ends and holi­days to pre­vent that from hap­pen­ing to me.)

• Yes, the liter­a­ture seems to pretty solidly sup­port caf­feine tol­er­ance (which is one of the rea­sons it’s not as use­ful as most peo­ple think).

• But the third model, in which re­ces­sions di­rectly cause alarms, which only then cause bur­glars, does not have this prop­erty.

This is not the third model in your pic­ture.

• Fixed.

• Right, it seems like “Bur­glar” and “Re­ces­sion” should switch places in the third di­a­gram.

• I think it would be valuable if some­one pointed out that a third party watch­ing, with­out con­trol­ling, a sci­en­tist’s con­trol­led study is in pretty much the same situ­a­tion as the three-column ex­er­cise/​weight/​in­ter­net use situ­a­tion—they have in­stead ex­er­cise/​weight/​con­trol group.

This "observe the results of a scientist's controlled study" thought experiment motivates and provides hope that one can sometimes derive causation from observation, where the current story arc makes a sort of magical leap.

• This “ob­serve the re­sults of a sci­en­tist’s con­trol­led study” thought ex­per­i­ment mo­ti­vates and pro­vides hope that one can some­times de­rive cau­sa­tion from ob­ser­va­tion,

In­deed; one way to think about this is to con­sider na­ture as a sci­en­tist whose shoulder we can look over.

where the current story arc makes a sort of magical leap.

The leap only seems mag­i­cal un­til you un­der­stand what the mov­ing parts in­side are. So let’s try go­ing in the re­verse di­rec­tion, and see if that helps make it clearer.

Sup­pose there are three bi­nary vari­ables, A, B, and C, and they are pair­wise de­pen­dent on each other: that is, P(A) isn’t P(A|B), but we haven’t looked at P(A|BC).

Alice says that A causes both B and C. Bob says that A causes B, which causes C. Charlie says that A and B both cause C. (Each of these is a minimal description of the model: any arcs not mentioned don't exist, which means there's no direct causal link between those two.)

Un­for­tu­nately, A, B, and C are easy to mea­sure but hard to in­fluence, so run­ning ex­per­i­ments is out of the ques­tion, but for­tu­nately we have lots of ob­ser­va­tional data to do statis­tics on.

We take a look at the mod­els and re­al­ize that they make falsifi­able pre­dic­tions:

If Alice is right, then B and C should be con­di­tion­ally in­de­pen­dent given A: that is, P(B|AC)=P(B|A) and P(C|AB)=P(C|A).

If Bob is right, then A and C should be con­di­tion­ally in­de­pen­dent given B: that is, P(A|BC)=P(A|B) and P(C|AB)=P(C|B).

If Char­lie is right, then A and B should be in­de­pen­dent, and only be­come de­pen­dent given C.

We know Charlie's wrong immediately, since the variables are unconditionally pairwise dependent. To test whether Alice or Bob is right, we look at the joint probability distribution and marginalize, as described in the post. Suppose we find that both Alice and Bob are wrong, and so we can conclude that their models are incorrect, just like we could with Charlie's.
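As a concrete sketch of that marginalize-and-compare step (the joint table below is made up purely for illustration), one can check Alice's prediction P(B|AC) = P(B|A) directly:

```python
# joint[(a, b, c)] = P(A=a, B=b, C=c); made-up numbers for illustration.
joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.10,
    (0, 1, 0): 0.05, (0, 1, 1): 0.05,
    (1, 0, 0): 0.10, (1, 0, 1): 0.10,
    (1, 1, 0): 0.15, (1, 1, 1): 0.25,
}

def p(a=None, b=None, c=None):
    """Marginal probability of the given assignments, summing out the rest."""
    return sum(pr for (wa, wb, wc), pr in joint.items()
               if (a is None or wa == a)
               and (b is None or wb == b)
               and (c is None or wc == c))

# Alice's model predicts P(B=1 | A, C) == P(B=1 | A) for every (A, C).
for a in (0, 1):
    p_b_given_a = p(a=a, b=1) / p(a=a)
    for c in (0, 1):
        p_b_given_ac = p(a=a, b=1, c=c) / p(a=a, c=c)
        print(a, c, round(p_b_given_a, 3), round(p_b_given_ac, 3))
# Any row where the last two numbers differ rules Alice's model out.
```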

In gen­eral, we don’t look at three pro­posed mod­els. What we do in­stead is a pro­ce­dure that will im­plic­itly con­sider each of the 25 acyclic causal mod­els that could de­scribe a set of three bi­nary vari­ables, rul­ing them out un­til only a small set are left.

Note that an observation that, say, A and C are uncorrelated given B ensures that there is no arc between A and C, ruling out around two thirds of the models at once; that's what we mean by implicitly considering all models. As well, we're left with a set of models that agree with the data: sometimes we'll be able to reduce it to a single model, but sometimes the data is insufficient to identify the model exactly, and so we'll have several models which are all possible, but many more models which we know can't be the case.
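Incidentally, the count of 25 acyclic models mentioned above is easy to verify by brute force: give each of the three node pairs one of {no edge, forward, backward} and discard the orientations that form a directed cycle. A sketch:

```python
from itertools import product

nodes = ["A", "B", "C"]
pairs = [("A", "B"), ("A", "C"), ("B", "C")]

def has_cycle(edges):
    # On three nodes the only possible directed cycle is a 3-cycle.
    return any((x, y) in edges and (y, z) in edges and (z, x) in edges
               for x in nodes for y in nodes for z in nodes)

count = 0
for choice in product((None, "fwd", "rev"), repeat=3):
    edges = {(x, y) if d == "fwd" else (y, x)
             for (x, y), d in zip(pairs, choice) if d}
    count += not has_cycle(edges)

print(count)  # 25 acyclic causal models over three variables
```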

That’s the big in­sight, I think: causal mod­els make testable pre­dic­tions, and most imag­in­able mod­els will be wrong. My sus­pi­cion as to why this took so long to de­velop is that it’s worth­less when look­ing at graphs with only two nodes (ap­par­ently not; see this com­ment be­low): there, we can only tell the differ­ence be­tween in­de­pen­dence and cor­re­la­tion, and there’s no way to tell which way the cau­sa­tion goes. It’s only when we have sys­tems with at least three nodes that we start be­ing able to rule out causal mod­els, and the third node may let us con­clude things about the first two nodes that we couldn’t con­clude with­out that node.

• My sus­pi­cion as to why this took so long to de­velop is that it’s worth­less when look­ing at graphs with only two nodes: there, we can only tell the differ­ence be­tween in­de­pen­dence and cor­re­la­tion, and there’s no way to tell which way the cau­sa­tion goes.

Well, ac­tu­ally...

• Fas­ci­nat­ing; thanks for the pa­pers! Those look like they de­scribe con­tin­u­ous and dis­crete dis­tri­bu­tions; does my state­ment hold for bi­nary vari­ables?

• Aren’t bi­nary vari­ables a dis­crete dis­tri­bu­tion?

• Yes, but they con­tain less in­for­ma­tion. Check out figure 2 of the Peters pa­per (which de­scribes dis­crete dis­tri­bu­tions). If you have an ad­di­tive noise model, so Y is X plus noise, then by look­ing at the joint pdf you can dis­t­in­guish be­tween X caus­ing Y and Y caus­ing X by the cor­ners. This doesn’t seem pos­si­ble if X and Y can only have 2 val­ues (since you get a square, not a trape­zoid).
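A toy version of that footprint (not the Peters example itself, just an illustration of the idea): with X on {0, 1, 2} and additive noise on {0, 1}, the support of (X, Y) is a staircase, a shape that is asymmetric between the two causal directions, whereas two binary variables can only ever fill in (part of) a square.

```python
from itertools import product

# Additive-noise sketch: X uniform on {0, 1, 2}, independent noise on
# {0, 1}, Y = X + noise. The support of (X, Y) is a staircase; with
# binary X and Y this footprint disappears.
support = sorted({(x, x + n) for x, n in product([0, 1, 2], [0, 1])})
print(support)  # [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3)]
```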

• [1] Some­what to my own shame, I must ad­mit to ig­nor­ing my own ob­ser­va­tions in this de­part­ment—even af­ter I saw no dis­cernible effect on my weight or my mus­cu­la­ture from aer­o­bic ex­er­cise and strength train­ing 2 hours a day 3 times a week, I didn’t re­ally start be­liev­ing that the virtue the­ory of metabolism was wrong [2] un­til af­ter other peo­ple had started the skep­ti­cal dog­pile.

Oh, speaking of which, I was amused the other day by http://well.blogs.nytimes.com/2012/10/10/are-you-likely-to-respond-to-exercise/ Apparently now there's even SNPs linked to non-response...

• Huh… I won­der how I would go about figur­ing out whether 23andMe cov­ers those SNPs. (I didn’t see such a thing in the anal­y­sis, but 23andMe reads and re­ports a lot of SNPs it doesn’t an­a­lyze.)

• Rule of thumb: by the time you hear about a pa­per, 23andMe has ex­panded their cur­rent chip to cover all the SNPs in the pa­per. The pa­per does not cause 23andMe to adopt the SNP, but is a sign that the SNP is pop­u­lar enough to be on some­one else’s chip. The size of these chips is ex­pand­ing so fast that new ones sub­sume all old ones. (This pa­per was pub­lished in 2010 and prob­a­bly col­lected data in 2009.)

How to figure out whether 23andMe cov­ers a SNP:

1. Iden­tify the SNP. In table 1 of this pa­per, the first SNP is listed as “SVIL (rs6481619)”. That means that it is a SNP in the SVIL gene, but 23andMe has dozens of SNPs on this gene. The code start­ing with RS is a stan­dard dbSNP iden­ti­fier.

2. Enter this number into SNPedia, e.g., rs6481619, and you will get a page that might say something interesting, such as mentioning the paper about the SNP, or, as in this case, might be pretty much empty. But it usually will have links to other services, including 23andMe. I'm not sure, but I think the existence of the link is a pretty good sign that 23andMe covers the SNP.

3. Fol­low the link. This only works if you have a 23andMe ac­count, but they’re free. It tells me that “Lilly Men­del” is AC and “Greg Men­del” is AA, so, yes, 23andMe cov­ers this SNP. If I had been geno­typed, it would tell me about me, too.

If you have a 23andMe ac­count, you could search it di­rectly, but I like SNPe­dia bet­ter. It is es­pe­cially good for con­vert­ing other nam­ing con­ven­tions into RS num­bers.

• Some­one in #less­wrong, IIRC, said that at least 1 of the SNPs was in­deed cov­ered by 23andMe.

• I don’t post here much (yet), and nor­mally I feel fairly con­fi­dent in my un­der­stand­ing of ba­sic prob­a­bil­ity...

But I’m slightly lost here. “if the Side­walk is Slip­pery then it is prob­a­bly Wet and this can be ex­plained by ei­ther the Sprin­kler or the Rain but prob­a­bly not both, i.e. if we’re told that it’s Rain­ing we con­clude that it’s less likely that the Sprin­kler was on.” This sen­tence seems… Wrong. If we’re told that it’s Rain­ing, we con­clude that the chances of Sprin­kler is… Ex­actly the same as it was be­fore we learned that the side­walk was wet.

This seems es­pe­cially clear when there was an alarm, and we learn there was a bur­glar—p(B|A) = .9, so shouldn’t our cur­rent p(E) go up to 0.1 * p(E|A) + p(E|~A)? Bur­glars bur­gling doesn’t re­duce the chance of earth­quakes… Ad­ding an alarm shouldn’t change that.

What am I miss­ing?

• But I’m slightly lost here. “if the Side­walk is Slip­pery then it is prob­a­bly Wet and this can be ex­plained by ei­ther the Sprin­kler or the Rain but prob­a­bly not both, i.e. if we’re told that it’s Rain­ing we con­clude that it’s less likely that the Sprin­kler was on.” This sen­tence seems… Wrong. If we’re told that it’s Rain­ing, we con­clude that the chances of Sprin­kler is… Ex­actly the same as it was be­fore we learned that the side­walk was wet.

The prob­a­bil­ity of Sprin­kler goes up when we learn the side­walk is Slip­pery, but then down—but not be­low its origi­nal level—when we learn that it is rain­ing. (Note that the ex­am­ple is a lit­tle coun­ter­in­tu­itive, in that it stipu­lates that Sprin­kler and Rain are in­de­pen­dent, given Sea­son. In re­al­ity, peo­ple don’t usu­ally turn their sprin­klers on when it is rain­ing, a fact which would be rep­re­sented by an ar­row from Rain to Sprin­kler. If that con­nec­tion was added, the prob­a­bil­ity of Sprin­kler would drop close to zero when Rain was ob­served.)

It’s the same with Alarm/​Bur­glar/​Earthquake. The prob­a­bil­ity of Bur­glar and Earthquake both go up when Alarm is ob­served. When fur­ther ob­ser­va­tion in­creases the prob­a­bil­ity of Bur­glar, the prob­a­bil­ity of Earthquake drops, but not be­low its origi­nal level.

In the limiting case where Alarm is certain to be triggered by Burglar or Earthquake but by nothing else, and Burglar and Earthquake have independent probabilities of b and e, then hearing the Alarm raises the probability of Earthquake to e/(b+e-be). The denominator is the probability of either Burglar or Earthquake. Discovering a burglar lowers it back to e.
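Spelling out that limiting case (a sketch under the same assumptions as above: the Alarm fires iff Burglar or Earthquake, which are independent with probabilities b and e): since Earthquake alone guarantees Alarm, P(E ∧ A) = P(E), and independence gives P(B ∨ E) = b + e - be, so

$$P(E \mid A) = \frac{P(E \wedge A)}{P(A)} = \frac{e}{1 - (1-b)(1-e)} = \frac{e}{b + e - be}.$$

Conditioning further on the burglar, P(E | A, B) = P(E | B) = e, which is the drop back to baseline.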

• Ah, okay. This makes sense to me, but I found the word­ing rather con­fus­ing. I’ll have to warn peo­ple I sug­gest this ar­ti­cle to, I sup­pose.

Thank you kindly!

• I think you’ve missed an im­por­tant piece of this pic­ture, or per­haps have not em­pha­sized it as much as I would. The real real rea­son we can elu­ci­date cau­sa­tion from cor­re­la­tion is that we have a prior that prefers sim­ple ex­pla­na­tions over com­plex ones, and so when some ob­served fre­quen­cies can be ex­plained by a com­pact (sim­ple) bayes net we take the ar­rows in that bayes net to be cau­sa­tion.

A fully con­nected bayes net (or equiv­a­lently, a causal graph with one hid­den node point­ing to all ob­served nodes) can rep­re­sent any prob­a­bil­ity dis­tri­bu­tion what­so­ever. Such a Bayes net can never be flat-out falsified. Rather it is our prefer­ence for sim­ple ex­pla­na­tions that some­times gives us rea­son to in­fer struc­ture in the world.

This con­tra­dicts noth­ing you’ve said, but I guess I read this ar­ti­cle as sug­gest­ing there is some fun­da­men­tal rule that gives us a crisp method for ex­tract­ing cau­sa­tion from ob­ser­va­tions, whereas I would look at it as a spe­cial case of in­fer­ence-with-prior-and-like­li­hood, just like in other forms of Bayesian rea­son­ing.

• Well, that sug­gests that the prob­a­bil­ity of us­ing Red­dit, given that your weight is nor­mal, is the same as the prob­a­bil­ity that you use Red­dit, given that you’re over­weight. 47,222 out of 334,366 nor­mal-weight peo­ple use Red­dit, and 12,240 out of 88,376 over­weight peo­ple use Red­dit. That’s about 14% ei­ther way

Nit­pick, but I had busted out Ex­cel at this point, and this is ac­tu­ally 16% ei­ther way.

• I suspect you're using Excel wrong. Try it with a standard calculator and you get: 47,222/334,366 = 0.141241146630934, or 14.1%; 12,240/88,376 = 0.138499140038019, or 13.8%.

• Oh whoops! I forgot that I didn't actually calculate the true percentage—I was just taking the ratio for comparison's sake. Then when he said 14% it stuck out to me as wrong. Thanks for correcting me.

• Some­what to my own shame, I must ad­mit to ig­nor­ing my own ob­ser­va­tions in this de­part­ment—even af­ter I saw no dis­cernible effect on my weight or my mus­cu­la­ture from aer­o­bic ex­er­cise and strength train­ing 2 hours a day 3 times a week, I didn’t re­ally start be­liev­ing that the virtue the­ory of metabolism was wrong un­til af­ter other peo­ple had started the skep­ti­cal dog­pile.

Wait, are you say­ing that aer­o­bic ex­er­cise and strength train­ing don’t have any sig­nifi­cant effect on weight?

• A per­son that I trust to be truth­ful, and who has done re­search on this topic, has pointed out to me that mus­cle has a higher den­sity than fat. So if you ex­pe­rience, si­mul­ta­neously, both an in­crease in mus­cle and a de­crease in fat, then your weight may very well not change (or even in­crease, de­pend­ing on the amount of mus­cle).

The same per­son tells me that ex­er­cise both in­creases mus­cle and de­creases fat.

• Yeah. After start­ing ex­er­cis­ing reg­u­larly, lots of peo­ple who hadn’t seen me in a while thought I had lost weight, even if I had ac­tu­ally gained some.

• Er, I don’t mean to be too harsh, but I tend to be a bit sus­pi­cious when some­body tells me to ex­pect weight loss, and then backpedals and says that maybe an un­ob­serv­able sub­sti­tu­tion of mus­cle for fat took place in­stead. I re­al­ize there are ways this could in prin­ci­ple be ver­ified, if some­one was will­ing to ex­pend enough effort. It is nonethe­less sus­pi­cious.

• I un­der­stand your sus­pi­cion, and I don’t think you’re be­ing too harsh at all. Scep­ti­cism on this point is more likely to im­prove un­der­stand­ing, af­ter all.

There are ways to measure fat independently of weight, however. The electrical conductance of fat and muscle differs—you can get scales that will measure both your weight and your conductance, and present you with a figure describing what percentage of your body weight is due to fat. There's also a machine at my local gym that purports to measure body fat percentage (I'm not entirely sure how it works or how accurate it is), and I have found that if I fail to exercise over a long period of time, then the figure that it measures shows a general upwards trend.

• Hav­ing done a bit of pok­ing around on this sub­ject, as far as I can tell the model is more or less as fol­lows.

The hu­man body is mod­el­led as a col­lec­tion of four el­e­ments; fat, mus­cle, wa­ter, and bone. The per­centages of these differ­ent el­e­ments can change with diet, with ex­er­cise, with differ­ent types of ex­er­cise. Bone is pretty much con­stant (though ap­par­ently lack of cal­cium can cause trou­ble there); wa­ter fluc­tu­ates a lot. Fat and mus­cle are more con­trol­lable; a given diet and ex­er­cise reg­i­men has a tar­get fat per­centage and mus­cle per­centage. Start­ing on the diet/​ex­er­cise causes the body to ap­proach the tar­get fat/​mus­cle per­centage in some man­ner (it may be asymp­totic). For this pur­pose, lack of ex­er­cise also counts as an ex­er­cise reg­i­men, and it is one that has a high fat per­centage and a low mus­cle per­centage (so if you have been ex­er­cis­ing and stop, you gain a fair amount of fat). There is some com­pli­cated in­ter­ac­tion be­tween the diet and the ex­er­cise reg­i­men here. There may be a ge­netic com­po­nent also af­fect­ing the model.

Each of these four el­e­ments—fat, mus­cle, wa­ter, bone—has a cer­tain den­sity, a cer­tain con­duc­tivity. There are cer­tain per­centages of these el­e­ments (I do not know what they are) that would lead to an op­ti­mal health (mea­sured as the great­est life ex­pec­tancy). Given a per­son’s height, and per­haps a few other mea­sure­ments, one can es­ti­mate the to­tal mass of bone (our skele­tons are pretty much stan­dard). From this, and given the op­ti­mal per­centages, one can es­ti­mate the op­ti­mal mass of fat, of mus­cle, for the great­est life ex­pec­tancy. (Water still fluc­tu­ates a lot, as I un­der­stand it).

Mea­sure­ments of these per­centages in­clude weight, girth, elec­tri­cal con­duc­tivity, and use of cal­ipers. The first three of these figures mea­sure quan­tities that are af­fected by all four per­centages; a change in one fac­tor can be masked by a change in the oth­ers.

All in all, it's a far more complex problem than it looks like at first glance. Some heuristics have leaked out into common knowledge; things like "don't eat too much fatty food" and "exercise at least a bit". I am not sure how accurate these heuristics are—presumably there is some reasoning backing them, possibly based on the model vaguely described above. I also suspect that the idea of the ideal weight (based on BMI) is based on the expectation of a certain common maximum muscle percentage.

• I tend to be a bit sus­pi­cious when some­body tells me to ex­pect weight loss, and then backpedals and says that maybe an un­ob­serv­able sub­sti­tu­tion of mus­cle for fat took place in­stead.

Who told you to ex­pect weight loss?

• Er, I don’t mean to be too harsh, but I tend to be a bit sus­pi­cious when some­body tells me to ex­pect weight loss, and then backpedals and says that maybe an un­ob­serv­able sub­sti­tu­tion of mus­cle for fat took place in­stead. I re­al­ize there are ways this could in prin­ci­ple be ver­ified, if some­one was will­ing to ex­pend enough effort. It is nonethe­less sus­pi­cious.

I’d be more sus­pi­cious of re­ports that ex­er­cise didn’t change body com­po­si­tion than that it did. That’s how ex­er­cise tends to work for most peo­ple. I’d be more skep­ti­cal of the ini­tial claim for net weight loss, at least if it wasn’t qual­ified—that is usu­ally not what I would ex­pect in the short term.

I’d be more sus­pi­cious if the ‘un­ob­serv­able’ was a lit­tle more difficult to ver­ify.

• Hav­ing mus­cle sub­sti­tuted for fat would re­sult in bet­ter health or at least greater strength, I would think. Weight is (usu­ally) just an easy way to mea­sure a change in fat. I am try­ing suc­cess­fully to lose more weight based on the as­sump­tion that the con­di­tions for fat to form or per­sist de­pend largely on the bal­ance of food in­take and amount of ex­er­cise. If you main­tain a con­sis­tent food in­take, and main­tain a con­sis­tent amount of ex­er­cise, and gain fat, then if it is phys­i­cally safe to, ei­ther re­duce food in­take, or in­crease ex­er­cise. If given your cur­rent diet, and you slightly in­crease your ex­er­cise, you have proven that you do not lose fat, then I would as­sume that you should try chang­ing the vari­ables more, in­stead of giv­ing up. We’re not ex­actly spend­ing all day hunt­ing and gath­er­ing any­more. I am go­ing to in­crease my ex­er­cise and de­crease my food (al­though I still in­vest in daily choco­late lifestyle en­hance­ment, as you sug­gested as a sure bet as op­posed to play­ing the lot­tery), and I am fairly sure be­fore two weeks pass I will have lost five pounds.

• Let us know how it works.

• Sorry for the de­lay, I got caught up in the Hal­loween spirit. As for the fol­low­ing table, it lists the date and recorded weight on that date.

• 10/​13--149.0 (lb.)

• 10/​14--149.9

• 10/​15--149.5

• 10/​16--151.2

• 10/​17--151.9

• 10/​18--149.7

• 10/​20--151.0

• 10/​21--151.2

• 10/​22--149.3

• 10/​23--148.4

• 10/​24--148.2

• 10/​25--146.8

• 10/​26--147.3

• 10/​27--146.4

As you can see, I did not reach the goal that I set. The excuse—er, explanation, is that I made that claim on the very day I started a week-long vacation. Hence, I was much more sedentary than while working, and I ate more frequent and larger meals than on workdays. October 22 was the day I returned to work, and was also the day that I actually began losing weight, so my 2-week prediction effectively had about a week cut off, and (anecdotally) shows both sides of the story in doing so. On the upside, I progressed pretty far in Paper Mario. If I had started the two weeks from the 22nd, I would have lost about 7 lb. since then. Hopefully this provides some data for anyone interested.

• Hmmm. Thank you for the data.

• (Well, we’re talk­ing about six hours a week, which ought to no­tice­ably make you lose weight if you keep your calorie in­take con­stant. But peo­ple who ex­er­cise six hours a week don’t usu­ally keep their calorie in­take con­stant.)

• This is great!

Tpyos:

• Shortly af­ter the sen­tence, “We could con­sider three hy­po­thet­i­cal causal di­a­grams over only these two vari­ables”, one of the “Earthquake --> Re­ces­sion” ta­bles gives p(¬e) as 0.70 when it should be 0.71 (so it and 0.29 sum to one).

• After the sen­tence, “So since all three vari­ables are cor­re­lated, can we dis­t­in­guish be­tween, say, these three causal mod­els?”, this di­a­gram I think is meant to have “Re­ces­sion” on top and “Bur­glar” on the bot­tom. (Vaniver also no­ticed this one.)

Edit:

• The para­graph that starts with “Sure! First, we marginal­ize over the ‘ex­er­cise’ vari­able to get the table for just weight and In­ter­net use” needs to have s/​nor­mal-weight/​over­weight/​ run on it. (And maybe have a sen­tence added say­ing that you’re get­ting the rest of the fol­low­ing table by do­ing the same math on the other three sets of peo­ple grouped by weight and In­ter­net us­age.)

• Fixed.

• Thanks.

The “marginal­ize over the ‘ex­er­cise’ vari­able” para­graph (men­tioned in an edit to the grand­par­ent) still seems to me to not match the ta­bles.

• Fixed! Thanks for be­ing per­sis­tent.

• Typo:

Tpyos:

• ;-)

• This makes me think "T Python Operating System".

• All these con­clu­sions seem to re­quire si­mul­tane­ity of cau­sa­tion. If earth­quakes al­most always caused re­ces­sions, but not un­til one year af­ter the earth­quake; and if re­ces­sions dras­ti­cally in­crease the num­ber of bur­glars, but not un­til one year af­ter the re­ces­sion; then draw­ing any of the con­clu­sions you made from a sur­vey taken at a sin­gle point in time would be en­tirely un­war­ranted. Doesn’t that mean you’re es­sen­tially mea­sur­ing en­tail­ment rather than cau­sa­tion via a se­ries of phys­i­cal events which take time to oc­cur?

Also, the virtue the­ory of metabolism is so ridicu­lous that it seems only to be act­ing as a car­i­ca­ture here. Wouldn’t the the­ory that “ex­er­cise nor­mally metabolises fat and pre­cur­sors of fat, re­duc­ing the amount of weight put on” re­sult in a much more use­ful ex­am­ple? Or is there a sub­text I’m miss­ing here, like the ex­ces­sive amount of fat-sham­ing done in many of the more de­vel­oped na­tions?

• Some of the in­ferred sub­text is be­ing ex­tracted from ear­lier posts that re­fer to diet while os­ten­si­bly dis­cussing other is­sues.

• Ahh, thanks.

• A few minor clar­ity/​read­abil­ity points:

1. The sec­ond para­graph open­ing “The statis­ti­ci­ans who dis­cov­ered the na­ture of re­al­ity” reads rather oddly when taken out of the con­text of “The Fabric of Real Things”.

2. When con­sid­er­ing the three causal mod­els of Bur­glars, Alarms and Re­ces­sions, tack­ling the mod­els in a “First, third, sec­ond” or­der threw me on first read­ing. It would prob­a­bly be eas­ier to fol­low if the text and the di­a­gram used the same or­der.

3. Per­haps giv­ing each node a differ­ent pas­tel colour would make it eas­ier to fol­low what is chang­ing be­tween differ­ent di­a­grams.

And this has prob­a­bly been said, but us­ing ex­er­cise and weight is prob­a­bly dis­tract­ing, since peo­ple already have opinions on the is­sue.

• I fol­lowed most of the math but the part right before

mul­ti­ply­ing to­gether the con­di­tional prob­a­bil­ity of each vari­able given the val­ues of its im­me­di­ate par­ents. (If a node has no par­ents, the prob­a­bil­ity table for it has just an un­con­di­tional prob­a­bil­ity, like “the chance of an earth­quake is .01”.)

has me puzzled. The variables used aren't explicitly mentioned elsewhere in the article, and while I think they have some conventional meaning, I can't quite remember what. The context let me make a decent guess, but I still feel a little fuzzy. Otherwise the post was pretty clear, much clearer than the other explanations I've seen.

• Here is a spread­sheet with all the num­bers for the Ex­er­cise ex­am­ple all crunched and the graph rea­son­ing ex­plained in a slightly differ­ent man­ner:

• So, what if the causal di­a­gram isn’t sim­ply de­pen­dent and/​or con­tains loops? What if re­ces­sions cause bur­glars, and bur­glars dis­able alarms, and alarms cause re­ces­sions?

You also forgot about the graph that has confounding factors C and D; C affects exercise and weight, while D affects exercise and internet usage. Both of them make exercise more (or less) likely, and their other factor less (or more) likely; weight and internet use remain uncorrelated with each other, but both remain negatively correlated with exercise.

• What if re­ces­sions cause bur­glars, and bur­glars dis­able alarms, and alarms cause re­ces­sions?

One op­tion is to make a much larger causal di­a­gram that has vari­ables re­ces­sions(t), bur­glars(t), and alarms(t) where t is a (say dis­crete) time vari­able, then have re­ces­sions(t) cause bur­glars(t+1), bur­glars(t) dis­abling alarms(t+1), and alarms(t) caus­ing re­ces­sions(t+1).
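A minimal sketch of that unrolling (the horizon and variable names are arbitrary): each node is a (variable, time) pair, and every edge points strictly forward in time, so the unrolled graph is acyclic even though the story is circular.

```python
T = 4  # arbitrary number of discrete time steps

edges = []
for t in range(T - 1):
    edges.append((("recession", t), ("burglars", t + 1)))  # recessions cause burglars
    edges.append((("burglars", t), ("alarms", t + 1)))     # burglars disable alarms
    edges.append((("alarms", t), ("recession", t + 1)))    # alarms cause recessions

for src, dst in edges:
    print(src, "->", dst)
```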

• Sorry, I didn't mean to say that any of those things precedes the other like fire and heat; I meant that they caused each other, like a generator which is providing its own field current.

• Yes, that’s what it means when you draw a causal ar­row from re­ces­sions(t) to bur­glars(t+1), un­less you think that re­ces­sions in­stan­ta­neously cause bur­glars, etc.

• It's entirely possible that we lack the ability to distinguish information about t+1 from information about t. Do recessions cause burglars in less time than the resolution of our economic and police statistics? Can a recession cause burglaries, which cause alarms, which further cause more recession, in such a manner that the original recession cannot be noticed?

• Cur­rently it looks like this page has lots of bro­ken images, which are ac­tu­ally for­mu­las. Can this be fixed? It’s kind of hard to un­der­stand now.

• It looks like a prob­lem at codecogs.com, the ser­vice that LW uses to trans­late LaTeX to for­mula images. Prob­a­bly tem­po­rary.

How much effort would it be to move to MathJax?

• If we know that there’s a bur­glar, then we think that ei­ther an alarm or a re­ces­sion caused it; and if we’re told that there’s an alarm, we’d con­clude it was less likely that there was a re­ces­sion, since the re­ces­sion had been ex­plained away.

Should that be “since the bur­glar had been ex­plained away”? Or am I con­fused?

Edit: I was con­fused. The bur­glar was ex­plained; the re­ces­sion was ex­plained away.

• (We have rem­nants of this type of rea­son­ing in old-school “Cor­re­la­tion does not im­ply cau­sa­tion”, with­out the now-stan­dard ap­pendix, “But it sure is a hint”.)

Given the rea­son­ing in this post and this post I think you can also in­fer that this old “Cor­re­la­tion does not im­ply cau­sa­tion” state­ment is not only flawed, but it’s also out­right wrong

And should in­stead just be “Cor­re­la­tion does im­ply cau­sa­tion, but doesn’t tell which kind

• but it’s also out­right wrong

“im­ply” in the tra­di­tional phrase is used in the strong sense. You can have a cor­re­la­tion be­tween 2 fac­tors with­out there nec­es­sar­ily be­ing a causal re­la­tion­ship be­tween them.

• If you can ex­clude co­in­ci­dence, which is a ques­tion of con­fi­dence and what kind of data the cor­re­la­tion is based on, then you can say that the cor­re­la­tion does nec­es­sar­ily in­volve a causal re­la­tion­ship.

Well that’s just what I think. If you can show me how that’s wrong, then please do. Ex­cept I don’t think you can.

• If you can ex­clude coincidence

That’s beg­ging the ques­tion, if by “co­in­ci­dence” you just mean those cases where there is a cor­re­la­tion which does not in­volve a causal re­la­tion­ship.

• I think that tra­dio­tional wis­dom is fairly ac­cu­rate, bear­ing in mind that cor­re­la­tion be­tween A and B doens’t im­ply cau­saiont be­tween A and B.

• I agree. But it’s still in­ac­cu­rate to say it does not im­ply cau­sa­tion.

cor­re­la­tion be­tween A and B is ex­plained by ei­ther 1. A->B 2. B->A 3. X->A & X->B or 4. By chance , or any com­bi­na­tion of the afore­men­tioned and which of 4. is usu­ally con­fi­dently elimi­nated by any­thing that is statis­ti­cally sig­nifi­cant.

Point be­ing there’s usu­ally a causal re­la­tion­ship be­hind the cor­re­la­tion, even if it in­volves more fac­tors than the ones that are be­ing stud­ied. There­fore that old phrase is mis­lead­ing and—in my opinion—wrong.

• As Peter noted, the mean­ing of “cor­re­la­tion does not im­ply cau­sa­tion” is “it is false that, for ev­ery X and Y, if pos­i­tively-cor­re­lated(X,Y) then ei­ther causes(X,Y) or causes(Y,X).” In­ter­preted in this way, the prin­ci­ple is com­pletely unim­peach­able. If you ob­ject to it, you must be tak­ing the prin­ci­ple to im­ply some­thing much more gen­eral, like “it is false that for ev­ery X and Y, if pos­i­tively-cor­re­lated(X,Y), then there is some Z that is in some way causally rele­vant to partly ex­plain­ing this fact.” The lat­ter ver­sion of the prin­ci­ple is much eas­ier to deny.

But your own ar­gu­ment doesn’t quite get us to be­ing able to deny ei­ther prin­ci­ple yet. For in­stance: What is meant by “X->A & X->B”? If this means di­rect cau­sa­tion, then it is surely false. But if it al­lows for tran­si­tive causal chains lead­ing back to some X, then the prin­ci­ple risks triv­ial­ity, since it is plau­si­ble that all events share at least some cause in com­mon, if you go back far enough. A sec­ond prob­lem: How can we rigor­ously un­pack the mean­ing of “by-chance” cor­re­la­tions? And third: How do you know that statis­ti­cally sig­nifi­cant cor­re­la­tions are usu­ally not “by chance” in your sense?

• But if it al­lows for tran­si­tive causal chains lead­ing back to some X, then the prin­ci­ple risks triv­ial­ity, since it is plau­si­ble that all events share at least some cause in com­mon, if you go back far enough.

How­ever, wouldn’t that be ex­tremely un­likely? And wouldn’t the like­li­hood be re­lated to the amount of cor­re­la­tion?

A sec­ond prob­lem: How can we rigor­ously un­pack the mean­ing of “by-chance” cor­re­la­tions?

I’m not sure be­cause I lack the skill in math­e­mat­ics to an­swer this ques­tion the proper way.

And third: How do you know that statis­ti­cally sig­nifi­cant cor­re­la­tions are usu­ally not “by chance” in your sense?

I’m not sure if there is a math­e­mat­i­cal for­mal­ism for this, pretty much for the same rea­sons as for prob­lem two: I don’t have the math­e­mat­i­cal abil­ities re­quired. How­ever, I do know what they’re about, and I’m rather con­fi­dent that you and I both can tell apart re­sults that could be ex­plained by mere chance and those that could not—it would be rather sur­pris­ing if it was not achiev­able by means of math if you can achieve that by mere fal­lible in­tu­ition?

Well I apol­o­gize if I’m mis­taken here, but I’m still try­ing to be rea­son­able.. Hmm.

Let’s cre­ate an ex­am­ples to illus­trate a point:

Stu­dents at some school take a spe­cial test each schoolyear and their tests re­sults are com­pared with some­thing fairly triv­ial. Let’s say the num­ber of pen­cils the stu­dents bring to the tests.

Then by means of cor­re­la­tion it is found that the num­ber of pen­cils brought by the stu­dents to the school has been in­creas­ing in a way that is cor­re­lated to prowess in the tests by the stu­dents.

In this case it’s not suffi­cient to say that the cor­re­la­tion im­plies that the num­ber of pen­cils is caus­ing the in­creas­ing prowess in the tests, nor that the prowess in the tests is caus­ing the in­creased num­ber of pen­cils. Which is what the phrase tra­di­tion­ally stands for.

But there still can be a causal re­la­tion­ship, for an ex­am­ple the school’s fund­ing has been in­creas­ing and they’ve been giv­ing more free ma­te­rial to stu­dents, and if in­creased ma­te­rial is cor­re­lated with in­creased prowess and in­creased num­ber of pen­cils, or in­creas­ing econ­omy.. and so forth, that’s causal­ity, but not of the same kind.

How­ever we can also say that this is just a co­in­ci­dence, par­tic­u­larly if there has been only a cou­ple of events. Or by some triv­ial causal chain like then you men­tioned, but....

… you can also see how these re­sults could be of a na­ture where ca­su­al­ity is ac­tu­ally re­quired. If we look at a sin­gle test­ing event and no­tice that for the 500 stu­dents of the school there’s a strong cor­re­la­tion be­tween num­ber of pen­cils and test prowess, we’re start­ing to talk about ex­tremely small prob­a­bil­ities that the re­sults are by co­in­ci­dence, are we not? Even if the pen­cils are not the cause , we can still de­duce that there is a cause at high like­li­hood?

Well any­way maybe I’m just mak­ing ex­cuses, at least it’s im­por­tant to con­sider that at this point, and I see your point any­way, and I think I was wrong. Oops, sorry.

But not ex­actly. Be­cause I think there’s some­thing to this. And I think you should know what I mean. Maybe it’s im­por­tant to start ask­ing what this co­in­ci­dence ac­tu­ally means? Isn’t this ac­tu­ally some­thing about Markov Blan­ket ? (or some­thing similar, sorry if I mi­sused the term)

• Oh well I think I can an­swer the ques­tion:

You can mea­sure the like­li­hood that the pro­file the datasets are similar by chance. For an ex­am­ple sim­ple in­creas­ing ten­dency—cor­re­la­tion that is—that can be ex­plained by co­in­ci­den­tially similar in­creas­ing ten­dency, but if there’s an com­plex pro­file to cor­re­la­tion, you can mea­sure what the like­li­hood for a co­in­ci­dence is? Even fur­ther, the more com­plex the pro­files are, the less likely a co­in­ci­dence be­comes? ( if they match )

• So you don’t no­tice a lot of cor­re­la­tion-cau­sa­tion er­rors? I see them ev­ery­were. Prac­ti­cally ev­ery sci­ence story in the press.

• How’d you get that from what I just said? Some­one else mak­ing er­rors is not an ex­cuse for you to do that too.

• it works like this: If peo­ple in gen­eral are erring on the side of over-as­so­ci­at­ing cor­re­la­tion with cau­sa­tion rather than un­der-as­so­ci­at­ing, then “cor­re­la­tion is not cau­sa­tion” is the bet­ter rule-of-thumb.

• Agreed. I’m sorry for for com­ment­ing about this be­fore think­ing things re­ally through, that was very lazy and thoughtless.

How­ever in the course of you peo­ple be­ing nice and point­ing out how fool­ish I was not only the ob­vi­ous er­ror was cor­rected, but it ap­pears that I also gained an in­sight(find­ing out some­thing I didn’t know per­son­ally that is) into the mat­ter. That be­ing: In some cases you can es­ti­mate the prob­a­bil­ity of a cor­re­la­tion be­ing merely co­in­ci­den­tial ver­sus it be­ing the re­sult of an ac­tual causal re­la­tion­ship. Although since I’m not a math­e­mat­i­cian I don’t ac­tu­ally know how do that, ex­cept by look­ing at graphs and let­ting the brain do all the work. It does though sound a lit­tle silly.

Does some­one know how to do that math­e­mat­i­cally? Es­ti­mate the prob­a­bil­ity of a cor­re­la­tion be­ing co­in­ci­den­tial ver­sus due to a causal re­la­tion­ship of an un­known type?

• I’ve been en­joy­ing this se­ries so far, and I found this ar­ti­cle to be par­tic­u­larly helpful. I did have a minor sug­ges­tion. The turn­stile and the log­i­cal nega­tion sym­bols were called out, and I thought it might be use­ful to ex­plic­itly break­down the prob­a­bil­ity dis­tri­bu­tion equa­tion. The cur­rent Less Wrong au­di­ence had lit­tle prob­lem with it, cer­tainly, but if you were show­ing it to some­one new to this for the first time, they might not be ac­quainted with it. I was think­ing some­thing along the lines of this from stat­trek:

“Gen­er­ally, statis­ti­ci­ans use a cap­i­tal let­ter to rep­re­sent a ran­dom vari­able and a lower-case let­ter, to rep­re­sent one of its val­ues. For ex­am­ple,

X rep­re­sents the ran­dom vari­able X.
P(X) rep­re­sents the prob­a­bil­ity of X.
P(X = x) refers to the prob­a­bil­ity that the ran­dom vari­able X is equal to a par­tic­u­lar value, de­noted by x. As an ex­am­ple, P(X = 1) refers to the prob­a­bil­ity that the ran­dom vari­able X is equal to 1.”


Also, page 2 of A Stu­dent’s Guide to Maxwell’s Equa­tions does a great job of di­a­gram­ming Gauss’ law for elec­tri­cal fields, and I think it would be helpful if this were available to break­down the right half of the equa­tion, with the be­gin­ning reader see­ing a break­down of the equa­tion.

If this all was set aside in the foot­note, the over­all con­ti­nu­ity of the ar­ti­cle wouldn’t be af­fected, and some­one who might be in­timi­dated at first by equa­tions might see that these aren’t so bad. With just a bit of ex­po­si­tion, more read­ers might be able to fol­low along with the en­tire ar­gu­ment, which I think could be in­tro­duced to some­one with very lit­tle back­ground.

• To get that last graph, you have to show that in­ter­net and ex­er­cise are cor­re­lated...

• This helped me un­der­stand what In­stru­men­tal Vari­ables are, but An­drew Gel­man’s cri­tique of in­stru­men­tal vari­ables has me con­fused again:

Sup­pose z is your in­stru­ment, T is your treat­ment, and y is your out­come. So the causal model is z → T → y. The trick is to think of (T,y) as a joint out­come and to think of the effect of z on each. For ex­am­ple, an in­crease of 1 in z is as­so­ci­ated with an in­crease of 0.8 in T and an in­crease of 10 in y. The usual “in­stru­men­tal vari­ables” sum­mary is to just say the es­ti­mated effect of T on y is 100.8=12.5, but I’d rather just keep it sep­a­rate and re­port the effects on T and y sep­a­rately.

In Piero’s ex­am­ple, this trans­lates into two state­ments: (a) States with higher penalties for mur­der had higher penalties for de­fa­ma­tion, and (b) States with higher penalties for mur­der had less re­port­ing of cor­rup­tion.

Fine. But I don’t see how this adds any­thing at all to my un­der­stand­ing of the de­fa­ma­tion/​cor­rup­tion re­la­tion­ship, be­yond what I learned from his sim­pler find­ing: States with higher penalties for de­fa­ma­tion had less re­port­ing of cor­rup­tion.

If your model is z → T → y, and you show that z in­ter­acts with each of T and y, isn’t the next step just to look at the re­la­tion be­tween z and y, con­trol­ling for T? In other words, if it turns out that z still mat­ters in pre­dict­ing y once you have T in your model, then you don’t have an in­stru­men­tal vari­able. But if T screens off the effect of z in pre­dict­ing y, then z is an in­stru­men­tal vari­able, and only af­fects y through T.

• Sorry, what you are miss­ing is T and Y could be con­founded by un­ob­served vari­ables. That is, the real graph is:

z → T → Y, with T ← U → Y, with U un­ob­served. Then if you con­trol for T, you will get an open path z → T ← U → Y which is not causal. In gen­eral if your graph is

T → Y ← U → T, the causal effect is not a func­tional of the ob­served data. How­ever with some para­met­ric as­sump­tions you can ob­tain the causal effect as a func­tional of the ob­served data if there is an in­stru­ment z.

• Oh… so the idea in your sec­ond para­graph is that when you hold T con­stant, a change in z sug­gests an equal and op­po­site change in U (mea­sur­ing by their mean effect on T). Then that change af­fects Y.

• That’s ex­actly right. The fact that for treat­ment T, and out­come Y, there is gen­er­ally an un­ob­served com­mon cause U of T and Y is in some sense the fun­da­men­tal prob­lem of causal in­fer­ence. The way out is ei­ther:

(a) Make para­met­ric as­sump­tions and find in­stru­men­tal vari­ables (econo­met­rics, mendelian ran­dom­iza­tion)

(b) Try to ob­serve U (epi­demiol­ogy, etc.)

(c) Ran­dom­ize T (statis­tics, em­piri­cal sci­ence)

There are some other lesser known ways as well:

(d) Find an un­con­founded me­di­a­tor W that in­ter­cepts all causal in­fluence from T to Y:

T → W → Y

Then use the “front-door crite­rion.”

• Let’s con­sider a prac­ti­cal ex­am­ple. Since the ques­tion of ex­er­cise and weight has turned up, let’s re­visit it. First, let’s col­lect some raw data (I can’t use in­ter­net us­age, since this poll is ex­tremely bi­ased on that axis).

For the pur­poses of this poll, “over­weight” means a body mass in­dex over 25. “Ex­er­cise” means at least 30 min­utes a week, work­ing hard at it, on a reg­u­lar ba­sis. “Diet” means that you ac­tu­ally think about the nu­tri­tional value of the food you eat, and con­sciously base your choice of food on that in­for­ma­tion in some sig­nifi­cant way.

Select one of the fol­low­ing:

[pol­lid:183]

Once we have some data, we can then prac­tice this skill on the re­sults of the poll, and see whether (and if so, how) these vari­ables are causally linked among poll re­spon­dants.

• This ques­tion is not very well for­mu­lated. I diet, and have lost 30 pounds or so since last de­cem­ber but am still overweight

• Then one of the two “I diet, I am over­weight” op­tions seems ap­pro­pri­ate, de­pend­ing on whether you ex­er­cise or not. Whether you have lost or gained weight re­cently doesn’t seem part of the poll.

• I’m say­ing that los­ing 30 pounds ap­pears to be ex­actly the sort of thing we’re ac­tu­ally try­ing to find out about but the poll doesn’t check for it.

• A ques­tion not be­ing very “well for­mu­lated” im­plies to me that it in­cor­po­rates con­fu­sions, am­bi­gui­ties, false dilem­mas, etc. That a differ­ent ques­tion might be more rele­vant to the pur­pose of the post, seems a differ­ent is­sue.

• I was very care­ful to for­mu­late the ques­tion to avoid con­fu­sions, par­tic­u­larly in the defi­ni­tion of ‘over­weight’ (my think­ing was, Obe­lix would claim he was not over­weight, by defin­ing ‘over­weight’ I at least en­sure that differ­ent defi­ni­tions of ‘over­weight’ do not blur the line). In the pro­cess, I did not con­sider the case of a per­son who had rel­a­tively re­cently started a diet (or an ex­er­cise reg­i­men) and whose weight had changed as a re­sult, but not suffi­ciently to move past the ar­bi­trary 25 BMI line.

This was there­fore prob­a­bly not the best way to phrase the ques­tion, and for that I apol­o­gise (if I were to go back in time and rewrite the ques­tion, I would take that case into ac­count). Nonethe­less, the ques­tion stands as is; I think that it is more im­por­tant at this point to be con­sis­tent, and thus one of the “I diet, I am over­weight” op­tions are ap­pro­pri­ate.

• Okay, six­teen peo­ple are not enough to say much from. There will be large er­ror bars in the fol­low­ing state­ments, due to small sam­ple size. Nonethe­less.

Tak­ing E for ex­er­cise, D for diet, O for over­weight:

• p(E)=0.625

• p(D)=0.1875

• p(O)=0.1875

• p(ED)=0.1875

• p(EO)=0.125

• p(DO)=0.0625

Ex­er­cise and diet­ing seem to be pretty well cor­re­lated; ei­ther diet­ing causes ex­er­cise (with 100% cer­tainty over this small data set) or ex­er­cise causes diet (about one-third of the time), or, more likely, a third fac­tor (a de­sire to lose weight, per­haps) causes both diet­ing and ex­er­cise. Strangely, be­ing over­weight doesn’t seem to be cor­re­lated with ei­ther ex­er­cise or diet… my first in­stinct here is to be sus­pi­cious of the sur­vey’s small sam­ple size. (At the very least, I’d ex­pect be­ing over­weight to cause diet­ing).

It also seems, from this sur­vey, that the best way to not be over­weight is to ex­er­cise but not diet—though a mere one vote can very eas­ily change that con­clu­sion, so this sur­vey should be con­sid­ered to have very lit­tle weight at six­teen re­sponses.

• In­ter­est­ing ar­ti­cle, thanks.

I agree with the gen­eral con­cept. I would be a bit more care­ful in the con­clu­sions, how­ever:
No visi­ble cor­re­la­tion does not mean no cau­sa­tion—it is just a strong hint. In the spe­cific ex­am­ple, the hint comes from a sin­gle pa­ram­e­ter—the lack of sig­nifi­cant cor­re­la­tion be­tween in­ter­net & over­weight when both ex­er­cise cat­e­gories are added; to­gether with the sig­nifi­cant cor­re­la­tion of in­ter­net us­age with the other two pa­ram­e­ters.

With the pro­posed di­a­gram, I get:
p(In­ter­net)=.141
p(not In­ter­net)=.859
p(Over­weight)=.209
p(not Over­weight)=.791

p(Ex|Int & Ov)=.10
p(Ex|Int & no OV)=.62
p(Ex|no Int & Ov)=.27
p(Ex|no Int & no Ov)=.85

This model has 6 free pa­ram­e­ters—the in­signifi­cant cor­re­la­tion be­tween over­weight and in­ter­net is the only con­straint. It is true that other mod­els have to be more com­plex to ex­plain data, but we know that our world is not a small toy simu­la­tion—there are causal con­nec­tions ev­ery­where, the ques­tion is just “are they neg­ligible or not?”.

• I haven’t read enough of Causal­ity, but I think I get how to find a causal model from the ex­am­ples above.

Ba­si­cally, a model se­lec­tion prob­lem? P(Model|Data) = P(Data|Model)P(Model)/​P(Data) ~ P(Data|Model)P(Model)?

Is P(Model) done in some ob­jec­tive sense, or is that left to the prior of the mod­eler? Or some com­bi­na­tion of con­tex­tu­ally ob­jec­tive and stan­dard causal mod­el­ing pri­ors (di­rec­tion of time, lo­cal­ity, etc.)?

Any good pow­er­point sum­mary of Pearl’s meth­ods out there?

• Hi,

P(Model) is usu­ally re­lated to the di­men­sion of the model (num­ber of pa­ram­e­ters). The more pa­ram­e­ters, the less likely the model (a form of the ra­zor we all know and love).

See these:

There are other ways of learn­ing causal struc­ture, based on rul­ing out graphs not con­sis­tent with con­straints found in the data. Th­ese do not rely on pri­ors, but have their own prob­lems.

• An earth­quake is 0.8 likely to set off your bur­glar alarm; a bur­glar is 0.9 likely to set off your bur­glar alarm. And—we can’t com­pute this model fully with­out this info—the com­bi­na­tion of a bur­glar and an earth­quake is 0.95 likely to set off the alarm;

I don’t think that true. The earth­quake can cause the bur­gler to have less con­trol over his own move­ment and there­fore in­crease the chance that he trig­gers the alarm.

• I don’t think this mat­ters too much to the main point, but if you like, you can imag­ine that with a 0.05 prob­a­bil­ity the alarm is in­cor­rectly wired and will not go off no mat­ter what hap­pens.

• This was a re­ally good ar­ti­cle over­all; I just finished go­ing through all the num­bers in Ex­cel and it makes a lot of sense.

The thing that is most coun­ter­in­tu­itive to me is that it ap­pears that the causal link be­tween ex­er­cise and weight can ONLY be com­puted if you bring in a 3rd, seem­ingly ir­rele­vant vari­able like in­ter­net us­age. It looks like that vari­able has to be some­how cor­re­lated with at least one of the causal nodes—maybe it has to be cor­re­lated with one spe­cific node… I am a lit­tle hazy on that.

I en­courage read­ers to open an Ex­cel file or some­thing and, us­ing Eliezer’s made-up ‘data’ about ex­er­cise/​weight/​in­ter­net, ex­haus­tively list all the pos­si­ble causal graphs for those 3 vari­ables, then falsify all of them un­til only the one re­mains. It re­ally shows how nicely the tech­nique works.

Now I am keen to find some con­tro­ver­sial real-world causal hy­poth­e­sis and test it us­ing this method.

• Early in­ves­ti­ga­tors in Ar­tifi­cial In­tel­li­gence, who were try­ing to rep­re­sent all high-level events us­ing prim­i­tive to­kens in a first-or­der logic (for rea­sons of his­tor­i­cal stu­pidity we won’t go into) were stymied by the fol­low­ing ap­par­ent para­dox:

[De­scrip­tion of a sys­tem with the fol­low­ing three the­o­rems: ⊢ ALARM → BURGLAR, ⊢ EARTHQUAKE → ALARM, and ⊢ (EARTHQUAKE & ALARM) → NOT BURGLAR]

Which rep­re­sents a log­i­cal con­tra­dic­tion.

This isn’t a log­i­cal con­tra­dic­tion: per­haps what you mean is that we can de­duce from this sys­tem that EARTHQUAKE is false. This would give us a con­tra­dic­tion in a modal sys­tem, if we also had the the­o­rem ⊢ pos­si­bly(EARTHQUAKE), but as it stands it isn’t yet con­tra­dic­tory.

• You clearly un­der­stand this, but I’ll make it ex­plicit for ob­servers:

1. A → B means that it can­not be the case that A is true and B is false.

2. E → A means that it can­not be the case that E is true and A is false.

3. (E & A) → !B means that it can­not be the case that E is true, A is true, and B is true.

Sup­pose we learn that E is false. We can’t in­fer any­thing about A and B, ex­cept that it can­not be that A is true and B is false.

Sup­pose we learn that E is true. By 2, we know that A can­not be false, and so must be true. By 1, we know that B can­not be false. By 3, we know that B can­not be true. B has no pos­si­ble val­ues, which is a con­tra­dic­tion.

This isn’t a log­i­cal con­tra­dic­tion: per­haps what you mean is that we can de­duce from this sys­tem that EARTHQUAKE is false. This would give us a con­tra­dic­tion in a modal sys­tem, if we also had the the­o­rem ⊢ pos­si­bly(EARTHQUAKE), but as it stands it isn’t yet con­tra­dic­tory.

E is a sen­sor read­ing about re­al­ity, and so ⊢ pos­si­bly(E) is meant to be im­plied. (Writ­ing down those three state­ments on a piece of pa­per can’t force the earth to stop shak­ing!)

One of the im­prove­ments made to solve this prob­lem was to in­tro­duce prob­a­bil­ity- the idea that in­stead of treat­ing the links be­tween A, E, and B as de­ter­minis­tic, let’s treat them as stochas­tic. That’s the Bayesian net­work idea, and with those it’s harder to get con­tra­dic­tions (you can by mis­form­ing your dis­tri­bu­tions).

The causal model is an im­prove­ment even be­yond that, be­cause it al­lows you to deal with in­ter­ven­tions in the sys­tem. Sup­pose we know that alarms and bur­glars are perfectly cor­re­lated. This could be ei­ther be­cause bur­glars always set off alarms, or be­cause alarms always at­tract bur­glars. If you’re a bur­glar who would like to steal from a house when there isn’t an earth­quake, the differ­ence is im­por­tant! If you knew which causal sys­tem were the case, you could pre­dict what would hap­pen when you steal from the house.

• Very nice and in­tu­itive, thanks! This ex­pla­na­tion is great.

(Though I’ve already spent a lit­tle while play­ing around with Bayes nets, and I don’t know how large of a role that had in mak­ing this feel more in­tu­itive to me.)

• The con­cern of the philoso­phers is the idea of ‘true cau­sa­tion’ as in­de­pen­dent from merely ap­par­ent cau­sa­tion. In par­tic­u­lar, they have in mind the idea that even if the laws of the uni­verse were de­ter­minis­tic there would be a sense in which cer­tain events could be said to be causes of oth­ers even though math­e­mat­i­cally, the con­figu­ra­tion of the uni­verse at any time com­pletely en­tails it at all oth­ers. Frankly, this ques­tion arises out of half-baked ar­gu­ments about whether events cause lat­ter events or if god has pre­de­ter­mined and causes all events in­di­vi­d­u­ally and I don’t take it se­ri­ously.

My take is that there is no such thing as cau­sa­tion. Cor­re­la­tion is all there is and the fact that many cor­re­la­tions are use­fully and com­pactlly de­scribed by Bayesian causal mod­els is ac­tu­ally sup­port for the idea that the as­crip­tion of cau­sa­tion re­flects noth­ing more than how the ar­rows hap­pen to point in those causal mod­els we find most com­pel­ling. In other words I don’t think it makes sense to look un­der your model to ask about what is truly cau­sa­tion but we should be clear that is what the philoso­phers mean.

De­spite my great re­spect for Bayesian causal mod­els it doesn’t let us de­duce causal­ity from cor­re­la­tion and I can prove it.

Given re­sults about k events (as­sume for sim­plic­ity they are bi­nary True/​False events) E_1...E_k (so E_1 might be bur­glary, E_2 earth­quake, E_3 re­ces­sion and a trial is each year) and any or­der­ing < on 1..k there is a causal model such that E_i is a causal an­te­cedant of E_j iff i < j that perfectly agrees with the given prob­a­bil­ities. In other words at the ex­pense of po­ten­tially hav­ing ev­ery E_i with i <* j af­fect the prob­a­bil­ity of E_i I can have any causal or­der I want on the events and get the same re­sults.

To see this is true start with what­ever event we want to oc­cur first, say E_{i1}. Now we com­pute the prob­a­bil­ities that the next event E{i2} oc­curs con­di­tional on E{i1} and it’s nega­tion. For E{i3} we com­pute the prob­a­bil­ities that this event oc­curs con­di­tional on all 4 out­comes for the pair E{i1}, E{i_2} and so on. This gives the cor­rect prob­a­bil­ity to each set of out­comes and thus matches all ob­ser­va­tions. Alter­na­tively, we can always make the E_i all de­pen­dent on some in­visi­ble com­mon causes that match the ap­pro­pri­ate pri­ors.

True, these di­a­grams might be less sim­ple in some sense than other di­a­grams we might draw but that doesn’t mean they are false. In­deed, we might have very good gen­eral rea­sons for prefer­ring some more com­pli­cated the­ory, e.g., even if a sim­pler causal model could ex­plain the data but re­quires causal de­pen­dence on effects later in time re­ject it in fa­vor of some more com­pli­cated model. This is a use­ful gen­er­al­iza­tion we have about the world and fol­low­ing it helps us reach bet­ter pre­dic­tions when we have limited data. Thus the mere num­ber of ar­rows can’t sim­ply be min­i­mized.

In other words all you’ve got is the same old crap about prefer­ring the sim­pler the­ory where that has no prin­ci­pled math­e­mat­i­cal defi­ni­tion and more or less means ‘pre­fer what­ever your pri­ors say the causal model re­ally looks like.’ In other words we haven’t got­ten any closer to in­fer­ing cau­sa­tion.

Just the op­po­site. The use of Bayesian causal mod­els ex­plains ex­tremely well why, even if events are truly all effects caused by the choices of some un­seen mover the no­tion of cau­sa­tion would be likely to evolve.

• . In par­tic­u­lar, they have in mind the idea that even if the laws of the uni­verse were de­ter­minis­tic there would be a sense in which cer­tain events could be said to be causes of oth­ers even though math­e­mat­i­cally, the con­figu­ra­tion of the uni­verse at any time com­pletely en­tails it at all oth­ers.

What;s the prob­lem with that? If the uni­verse is causally de­ter­minis­tic, it is causal. True,it is nece­sary to dis­t­in­guish causal de­tern­inism form acausal de­ter­minism (eg fatal­ism) and philos­o­phy can do that. Or is your con­cern with fu­ture events en­tailing past ones? Then adopt two-way causal­ity.

Cor­re­la­tion is all there is and the fact that many cor­re­la­tions are use­fully and com­pactlly de­scribed by Bayesian causal mod­els is ac­tu­ally sup­port for the idea that the as­crip­tion of cau­sa­tion re­flects noth­ing more than how the ar­rows hap­pen to point in those causal mod­els we find most compelling

I don’t fol­low that. The ex­is­tence of a map doesn’t usu­ally prove the non-ex­is­tence of a ter­ri­tory.

In other words all you’ve got is the same old crap about prefer­ring the sim­pler the­ory where that has no prin­ci­pled math­e­mat­i­cal defi­ni­tion

The con­se­quence­sof aban­don­ing the ra­zor are much worse than those of hav­ing a sub­jec­tive ra­zor.

• My take is that there is no such thing as cau­sa­tion. Cor­re­la­tion is all there is …

I keep hav­ing to link this:

http://​​www.smbc-comics.com/​​in­dex.php?db=comics&id=1994

In other words at the ex­pense of po­ten­tially hav­ing ev­ery Ei with i <* j af­fect the prob­a­bil­ity of E_i I can have any causal or­der I want on the events and get the same re­sults.

Causal mod­els have to do with in­ter­ven­tions not with node or­ders in a Bayesian net­work. A causal model is not the same thing as a Bayesian net­work (which Eliezer got wrong in his post, and has yet to fix, by the way). Causal mod­els are not about mak­ing bet­ter pre­dic­tions, they are about cause effect re­la­tion­ships (causal effects, me­di­a­tion anal­y­sis, con­founders, things like that). I think read­ing stan­dard stuff on in­ter­ven­tion­ist causal­ity might be a good idea: Pearl’s Causal­ity book or the CMU book (Cau­sa­tion, Pre­dic­tion and Search).

• I’m afraid I haven’t fol­lowed the maths at all, but when you say that there is no cau­sa­tion, only cor­ra­tion, do you mean that you can­not prove cau­sa­tion, or that it ac­tu­ally never ex­ists? Be­cause that last op­tion surely isn’t true? Back in ‘The Use­ful Idea of Truth’ we dis­cussed how pho­tons from shoelaces cause you to be­come en­tan­gled with their un­tan­geled­ness. If there is no cau­sa­tion, you couldn’t ob­serve or know any­thing. If you mean you just can’t prove cau­sa­tion, could you please say it more sim­ply (for me please)?