Edit, to express better what I meant: “Do you have any examples where lack of statistical dependence coexists with causality, and this happens without path cancellations?”
I just noticed your edit:
The capacitor example is one: there is one causal arrow, so no multiple paths that could cancel, and no loops. The arrow could run in either direction, depending on whether the power supply is set up to generate a voltage or a current.
Of course, I is by definition proportional to dV/dt, and this is discoverable by looking at the short-term transient behaviour. But sampled on a long timescale, you just get a sequence of i.i.d. pairs in which I and V are independent.
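A minimal sketch of the capacitor case, under assumptions that are not in the discussion above: the voltage source is modelled as Gaussian-smoothed white noise (so V is stationary and differentiable), the capacitance is 1, and the function name and all parameter values are purely illustrative. It samples (V, I) at intervals much longer than the correlation time of V and reports the sample correlation, which should come out near zero.

```python
import numpy as np

def capacitor_samples(n_steps=500_000, dt=1e-3, smooth_steps=50,
                      stride=250, capacitance=1.0, seed=0):
    """Return long-timescale samples of (V, I) for an ideal capacitor."""
    rng = np.random.default_rng(seed)
    # Assumed voltage source: white noise smoothed by a Gaussian kernel,
    # which gives a stationary, differentiable V(t).
    offsets = np.arange(-3 * smooth_steps, 3 * smooth_steps + 1)
    kernel = np.exp(-0.5 * (offsets / smooth_steps) ** 2)
    v = np.convolve(rng.standard_normal(n_steps), kernel, mode="valid")
    # I = C dV/dt, approximated by finite differences.
    i = capacitance * np.gradient(v, dt)
    # Keep only samples spaced well beyond the correlation time of V.
    return v[::stride], i[::stride]

v_samples, i_samples = capacitor_samples()
corr_vi = np.corrcoef(v_samples, i_samples)[0, 1]
print(f"samples: {v_samples.size}, corr(V, I) = {corr_vi:+.3f}")
```

Reducing `stride` to 1 keeps the short-term structure in the data, and that is where the relation I = C dV/dt can be recovered from successive samples of V.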
For cyclic graphs, I’m not sure how “path cancellation” is defined, if it is at all. The generic causal graph of the archetypal control system has arrows D --> P --> O and R --> O --> P, there being a cycle between P and O. The four variables are the Disturbance, the Perception, the Output, and the Reference.
If P = O + D, O is proportional to the integral of R − P, R = 0, and D is a signal that generally varies on a timescale slower than the settling time of the loop, then O has a correlation with D close to −1, and both O and D have correlations with P close to zero.
There are only two parameters, the settling time of the loop and the timescale of variations in D. So long as the former is substantially less than the latter, these correlations are unchanged.
Would you consider this an example of path cancellation? If so, what are the paths, and what excludes this system from the scope of theorems about faithfulness violations having measure zero? Not being a DAG is one reason, of course, but have any such theorems been extended to at least some class of cyclic graphs?
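For concreteness, here is a rough simulation sketch of the loop described above (P = O + D, O the integral of gain × (R − P), R = 0). The Euler time step, the gain, the Gaussian-smoothed noise used for D, and all names are assumptions chosen for illustration, with the settling time 1/gain set well below the timescale of D.

```python
import numpy as np

def simulate_loop(n_steps=200_000, dt=0.1, gain=1.0, d_timescale=20.0,
                  reference=0.0, seed=0):
    """Euler simulation of P = O + D, dO/dt = gain * (R - P)."""
    rng = np.random.default_rng(seed)
    # Assumed disturbance: white noise smoothed over d_timescale seconds.
    half = int(3 * d_timescale / dt)
    lags = np.arange(-half, half + 1) * dt
    kernel = np.exp(-0.5 * (lags / d_timescale) ** 2)
    d = np.convolve(rng.standard_normal(n_steps + 2 * half), kernel, mode="valid")
    d /= d.std()
    o = np.zeros(n_steps)
    p = np.zeros(n_steps)
    for t in range(n_steps):
        p[t] = o[t] + d[t]                       # perception
        if t + 1 < n_steps:
            o[t + 1] = o[t] + dt * gain * (reference - p[t])  # integrator
    burn_in = int(10 / (gain * dt))              # drop the start-up transient
    return d[burn_in:], p[burn_in:], o[burn_in:]

d, p, o = simulate_loop()
corr_od = np.corrcoef(o, d)[0, 1]
corr_pd = np.corrcoef(p, d)[0, 1]
corr_po = np.corrcoef(p, o)[0, 1]
print(f"corr(O, D) = {corr_od:+.3f}")
print(f"corr(P, D) = {corr_pd:+.3f}")
print(f"corr(P, O) = {corr_po:+.3f}")
```

Changing `gain` or `d_timescale` while keeping `1 / gain` much smaller than `d_timescale` should leave the qualitative pattern unchanged.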
Addendum:
When D is a source with a long-term Gaussian distribution, the statistics of the system are multivariate Gaussian, so correlation coefficients capture the entire statistical dependence. Following your suggestion about non-parametric dependence tests, I’ve run simulations in which D instead makes random transitions between ±1, and calculated statistics such as Kendall’s tau, but the general pattern is much the same. The controller takes time to respond to the sudden transitions, which allows the zero correlations to turn into weak ones, but that only happens because the controller is failing to control at those moments. The better the controller works, the smaller the correlation of P with O or D.
I’ve also realised that “non-parametric statistics” is a subject like the biology of non-elephants, or the physics of non-linear systems. Shannon mutual information sounds in theory like the best possible measure, but for continuous quantities I can get anything from zero to perfect prediction of one variable from the other just by choosing a suitable bin size for the data. No statistical conclusions without statistical assumptions.
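The bin-size point can be illustrated with a small sketch. It assumes a plain plug-in (histogram) estimator of mutual information, which is only one of several possible estimators, applied to two variables drawn independently, so the true mutual information is zero. The function name, sample size, and bin counts are illustrative.

```python
import numpy as np

def binned_mutual_information(x, y, bins):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    mask = pxy > 0                        # empty cells contribute nothing
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
n_samples = 500
x = rng.uniform(size=n_samples)
y = rng.uniform(size=n_samples)           # independent of x by construction

mi_by_bins = {b: binned_mutual_information(x, y, b) for b in (2, 10, 100, 1000)}
for b, mi in mi_by_bins.items():
    print(f"bins = {b:5d}  estimated MI = {mi:.3f} nats "
          f"(upper bound log n = {np.log(n_samples):.3f})")
```

With very fine bins almost every sample sits alone in its cell, so the bin of x identifies the bin of y and the estimate approaches log n even though the variables are independent.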
Dear Richard,
I have not forgotten about your paper, I am just extremely busy until early March. Three quick comments though:
(a) People have viewed cyclic models as defining a stable distribution in an appropriate Markov chain. There are some complications: with cyclic models (unlike the DAG case), it seems that the graph which predicts what happens after an intervention and the graph which represents the independence structure of the equilibrium distribution are not the same graph. (This is another reason to treat the statistical and causal graphical models separately.) See Richardson and Lauritzen’s chain graph paper for a simple four-node example of this.
So when we say there is a faithfulness violation, we have to make sure we are talking about the right graph representing the right distribution.
(b) In general I view a derivative not as a node, but as an effect. So e.g. in a linear model:
y = f(x) = ax + e
dy/dx = a = E[y|do(x=1)] - E[y|do(x=0)], which is just the causal effect of x on y on the mean difference scale.
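As a quick numerical check of that identity, here is a sketch that simulates the linear model under the two interventions. The slope value, the standard normal noise, and the helper name are hypothetical choices made only for this illustration.

```python
import numpy as np

def mean_y_under_do(x_value, slope, n_samples, rng):
    """Monte Carlo mean of y = slope * x + e when x is set by intervention."""
    e = rng.standard_normal(n_samples)    # assumed noise: standard normal
    y = slope * x_value + e               # do(x) ignores whatever p(x) was
    return y.mean()

rng = np.random.default_rng(0)
slope_a = 2.5                             # hypothetical value of a
effect = (mean_y_under_do(1.0, slope_a, 100_000, rng)
          - mean_y_under_do(0.0, slope_a, 100_000, rng))
print(f"E[y|do(x=1)] - E[y|do(x=0)] is approximately {effect:.3f} (a = {slope_a})")
```

Note that p(x) never appears in the computation: the intervention replaces it, and so the estimated effect cannot depend on it.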
In general, the partial derivative of the outcome with respect to some treatment, holding the other treatments constant, is a kind of direct causal effect. So viewed through that lens, it is perhaps not so surprising that x and dy/dx are independent. After all, the direct effect/derivative is a function of p(y|do(x),do(other parents of y)), and we know do(.) cuts incoming arcs to y, so the distribution p(y|do(x),do(other parents of y)) is independent of p(x) by construction.
But this is more an explanation of why derivatives sensibly represent interventional effects, not whether there is something more to this observation (I think there might be). I do feel that Newton’s intuition for doing derivatives was trying to formalize a limit of “wiggle the independent variable and see what happens to the dependent variable”, which is precisely the causal effect. He was worried about physical systems, also, where causality is fairly clear.
In general, p(y) and any function of p(y | do(x)) are not independent of course.
(c) I think you define a causal model in terms of the Markov factorization, which I disagree with. The Markov factorization
$$p(x_1, \ldots, x_n) = \prod_i p(x_i \mid \mathrm{pa}(x_i))$$
defines a statistical model. To define a causal model, you essentially need to state formally that the parents of every node are that node’s direct causes. Usually people use the truncated factorization (g-formula) to do this. See, e.g., Chapter 1 in Pearl’s book.
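For reference, the truncated factorization for a single intervention is usually written as follows. The notation (x_j for the intervened variable, x_j^* for the value it is set to) is chosen here for illustration, so check it against the book's own statement.

```latex
p(x_1, \ldots, x_n \mid \mathrm{do}(x_j = x_j^*)) =
\begin{cases}
  \prod_{i \neq j} p(x_i \mid \mathrm{pa}(x_i)) & \text{if } x_j = x_j^*, \\
  0 & \text{otherwise.}
\end{cases}
```

The only change from the Markov factorization is that the factor for the intervened node is dropped, and that is what encodes the claim that the parents are direct causes.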