I think it would be valuable if someone pointed out that a third party watching, without controlling, a scientist’s controlled study is in pretty much the same situation as the three-column exercise/weight/internet use situation—they have instead exercise/weight/control group.
This “observe the results of a scientist’s controlled study” thought experiment motivates and provides hope that one can sometimes derive causation from observation, where the current story arc makes a sort of magical leap.
This “observe the results of a scientist’s controlled study” thought experiment motivates and provides hope that one can sometimes derive causation from observation,
Indeed; one way to think about this is to consider nature as a scientist whose shoulder we can look over.
where the current story arc makes a sort of magical leap.
The leap only seems magical until you understand what the moving parts inside are. So let’s try going in the reverse direction, and see if that helps make it clearer.
Suppose there are three binary variables, A, B, and C, and every pair of them is dependent: for example, P(A) isn’t P(A|B), and likewise for the other two pairs. We haven’t yet looked at conditionals on two variables, such as P(A|BC).
Alice says that A causes both B and C. Bob says that A causes B, which causes C. Charlie says that A and B both cause C. (Each of these is a minimal description of the model: any arc not mentioned doesn’t exist, which means there’s no direct causal link between those two variables.)
Unfortunately, A, B, and C are easy to measure but hard to influence, so running experiments is out of the question, but fortunately we have lots of observational data to do statistics on.
We take a look at the models and realize that they make falsifiable predictions:
If Alice is right, then B and C should be conditionally independent given A: that is, P(B|AC)=P(B|A) and P(C|AB)=P(C|A).
If Bob is right, then A and C should be conditionally independent given B: that is, P(A|BC)=P(A|B) and P(C|AB)=P(C|B).
If Charlie is right, then A and B should be independent, and only become dependent given C.
We know Charlie’s wrong immediately, since the variables are unconditionally pairwise dependent. To test whether Alice or Bob is right, we look at the joint probability distribution and marginalize, as described in the post. Suppose we find that neither predicted conditional independence holds; then we can conclude that Alice’s and Bob’s models are incorrect, just as we could with Charlie’s.
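To make the three tests concrete, here is a minimal sketch in Python. It assumes the data really were generated by Bob's chain (A causes B, which causes C) and then checks each person's prediction against the resulting joint distribution. The helper names (`chain_joint`, `prob`, `cond_indep`, `marg_indep`) and the probabilities are made up for illustration, and with real data you would need a statistical test rather than an exact comparison.

```python
from itertools import product

NAMES = ("a", "b", "c")

def chain_joint(p_a, p_b_given_a, p_c_given_b):
    """Joint P(A,B,C) for the chain A -> B -> C over binary variables."""
    joint = {}
    for a, b, c in product((0, 1), repeat=3):
        pa = p_a if a else 1 - p_a
        pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
        pc = p_c_given_b[b] if c else 1 - p_c_given_b[b]
        joint[(a, b, c)] = pa * pb * pc
    return joint

def prob(joint, **fixed):
    """Marginalize: total probability of all outcomes matching the fixed values."""
    return sum(p for key, p in joint.items()
               if all(key[NAMES.index(n)] == v for n, v in fixed.items()))

def cond_indep(joint, x, y, z, tol=1e-9):
    """True if x and y are independent given z, i.e.
    P(x,y,z) * P(z) == P(x,z) * P(y,z) for every assignment."""
    for vx, vy, vz in product((0, 1), repeat=3):
        lhs = prob(joint, **{x: vx, y: vy, z: vz}) * prob(joint, **{z: vz})
        rhs = prob(joint, **{x: vx, z: vz}) * prob(joint, **{y: vy, z: vz})
        if abs(lhs - rhs) > tol:
            return False
    return True

def marg_indep(joint, x, y, tol=1e-9):
    """True if x and y are unconditionally independent."""
    return all(
        abs(prob(joint, **{x: vx, y: vy})
            - prob(joint, **{x: vx}) * prob(joint, **{y: vy})) <= tol
        for vx, vy in product((0, 1), repeat=2)
    )

# Illustrative numbers only: data generated by Bob's model.
joint = chain_joint(0.3, {0: 0.2, 1: 0.8}, {0: 0.1, 1: 0.7})

alice_ok = cond_indep(joint, "b", "c", "a")    # Alice: B and C independent given A
bob_ok = cond_indep(joint, "a", "c", "b")      # Bob: A and C independent given B
charlie_ok = marg_indep(joint, "a", "b")       # Charlie: A and B independent
print(alice_ok, bob_ok, charlie_ok)            # False True False
```

Only Bob's prediction survives here, because the distribution was built from his model; swapping in a different generating model changes which test passes.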
In general, we don’t look at three proposed models. What we do instead is a procedure that will implicitly consider each of the 25 acyclic causal models that could describe a set of three binary variables, ruling them out until only a small set is left.
Note that an observation that, say, A and C are conditionally independent given B ensures that there is no arc between A and C, ruling out around two thirds of the models at once; that’s what we mean by implicitly considering all models. As well, we’re left with a set of models that agree with the data. Sometimes we’ll be able to reduce it to a single model, but sometimes the data is insufficient to identify the model exactly, and so we’ll have several models which are all possible, alongside many more which we know can’t be the case.
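The counts above can be checked by brute force. The sketch below enumerates every way of leaving each of the three pairs unconnected or connecting it in one of two directions, keeps the acyclic graphs, and then counts how many have no arc between A and C. The function names are my own, and the code only counts graphs; it says nothing about which graph fits any particular data.

```python
from itertools import product

NODES = ("A", "B", "C")
PAIRS = (("A", "B"), ("A", "C"), ("B", "C"))

def is_acyclic(edges):
    """Repeatedly strip nodes with no incoming edge; a DAG strips down to nothing."""
    remaining = set(NODES)
    edges = set(edges)
    while remaining:
        sources = [n for n in remaining if not any(v == n for (_, v) in edges)]
        if not sources:
            return False  # every remaining node has a parent, so there is a cycle
        remaining -= set(sources)
        edges = {(u, v) for (u, v) in edges if u in remaining and v in remaining}
    return True

def all_dags():
    """All directed acyclic graphs on three labelled nodes."""
    dags = []
    for choice in product((None, "fwd", "rev"), repeat=len(PAIRS)):
        edges = set()
        for (u, v), c in zip(PAIRS, choice):
            if c == "fwd":
                edges.add((u, v))
            elif c == "rev":
                edges.add((v, u))
        if is_acyclic(edges):
            dags.append(frozenset(edges))
    return dags

dags = all_dags()
no_ac_arc = [g for g in dags if ("A", "C") not in g and ("C", "A") not in g]
print(len(dags), len(no_ac_arc))  # 25 9
```

Of the 27 possible orientations only the two directed 3-cycles are rejected, leaving 25. Nine of those have no A-C arc, so forbidding that arc eliminates 16 of 25 models, which is the "around two thirds" figure.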
That’s the big insight, I think: causal models make testable predictions, and most imaginable models will be wrong. My suspicion as to why this took so long to develop is that it’s worthless when looking at graphs with only two nodes (apparently not; see this comment below): there, we can only tell the difference between independence and correlation, and there’s no way to tell which way the causation goes. It’s only when we have systems with at least three nodes that we start being able to rule out causal models, and the third node may let us conclude things about the first two nodes that we couldn’t conclude without that node.
Well, actually...
http://jmlr.csail.mit.edu/papers/volume7/shimizu06a/shimizu06a.pdf http://jmlr.csail.mit.edu/proceedings/papers/v9/peters10a/peters10a.pdf
Fascinating; thanks for the papers! Those look like they describe continuous and discrete distributions; does my statement hold for binary variables?
Aren’t binary variables a discrete distribution?
Yes, but they contain less information. Check out figure 2 of the Peters paper (which describes discrete distributions). If you have an additive noise model, so Y is X plus noise, then by looking at the joint pdf you can distinguish between X causing Y and Y causing X by the corners. This doesn’t seem possible if X and Y can only have 2 values (since you get a square, not a trapezoid).
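Here is a rough sketch of the "corners" argument for integer-valued variables. It is my own toy example, not code or numbers from either paper. It builds the joint distribution for Y = X + N with N independent of X, then asks, for each direction, whether the conditional distributions of the supposed effect are all shifted copies of one another, which is what an additive noise model in that direction would require. The names `anm_joint` and `fits_additive_noise` are hypothetical, and the check covers only the non-cyclic integer case.

```python
from collections import defaultdict
from itertools import product

def anm_joint(p_x, p_n):
    """Joint P(X, Y) for Y = X + N, with integer noise N independent of X."""
    joint = defaultdict(float)
    for (x, px), (n, pn) in product(p_x.items(), p_n.items()):
        joint[(x, x + n)] += px * pn
    return dict(joint)

def fits_additive_noise(joint, cause_axis, tol=1e-9):
    """True if effect = g(cause) + noise is possible with noise independent of
    the cause, i.e. every conditional distribution of the effect is the same
    shape up to a shift."""
    conditionals = defaultdict(dict)
    for pair, p in joint.items():
        cause, effect = pair[cause_axis], pair[1 - cause_axis]
        conditionals[cause][effect] = conditionals[cause].get(effect, 0.0) + p
    shapes = []
    for dist in conditionals.values():
        total = sum(dist.values())
        low = min(dist)  # align each conditional at its smallest supported value
        shapes.append({v - low: p / total for v, p in dist.items()})
    first = shapes[0]
    return all(
        s.keys() == first.keys()
        and all(abs(s[k] - first[k]) <= tol for k in first)
        for s in shapes
    )

# Illustrative numbers: X uniform on 0..3, noise uniform on {0, 1}.
joint = anm_joint({0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}, {0: 0.5, 1: 0.5})
forward = fits_additive_noise(joint, cause_axis=0)   # X -> Y
backward = fits_additive_noise(joint, cause_axis=1)  # Y -> X
print(forward, backward)  # True False
```

The backward direction fails at the corners of the support: when Y = 0, X must be 0, but when Y = 1, X could be 0 or 1, so the "noise" would have to depend on Y.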