I guess this is what I get for my statistics knowledge being an odd mongrel mix of Bayesian vibes gathered from LessWrong and frequentist stats from biology grad school. You seem much more knowledgeable on the formalizations, so I trust you’re right that they can’t formally mix, but informally it seems to me there must be some way to mix them: fundamentally, this is just “extraordinary claims require extraordinary evidence”. To use an example from my day job: if I’m testing whether knocking out a candidate gene that I think will increase tryptophan in a plant tissue does indeed increase the tryptophan, p < 0.05 feels like a fine cutoff. If I’m testing whether that same KO causes my plants to communicate with me telepathically, I’d be crazy to tell anyone about my results unless I was seeing p < 0.001 in at least two independent experiments.
The difference in required p-value threshold to me seems to come down to the prior. Perhaps there’s no formal framework that combines them, but empirically I think that’s what I’m doing.
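One hedged way to make the "prior sets the threshold" intuition concrete is to apply Bayes' rule to the yes/no event "the test came out significant". This is only a sketch: the function name, the priors (0.5 and 1e-6), and the power of 0.8 are illustrative assumptions, not numbers from the thread, and collapsing a p-value to a threshold crossing throws information away.

```python
def prob_real_given_significant(prior, alpha, power=0.8, experiments=1):
    """P(effect is real | every experiment was significant), via Bayes' rule.

    Treats "significant" as a yes/no event and the experiments as independent.
    """
    true_pos = prior * power ** experiments
    false_pos = (1.0 - prior) * alpha ** experiments
    return true_pos / (true_pos + false_pos)


# Hypothetical priors, picked only to show the shape of the argument.
tryptophan = prob_real_given_significant(prior=0.5, alpha=0.05)
telepathy_lenient = prob_real_given_significant(prior=1e-6, alpha=0.05)
telepathy_strict = prob_real_given_significant(
    prior=1e-6, alpha=0.001, experiments=2
)

print(f"plausible KO effect, one test at p < 0.05:  {tryptophan:.3f}")
print(f"telepathy, one test at p < 0.05:            {telepathy_lenient:.6f}")
print(f"telepathy, two tests at p < 0.001:          {telepathy_strict:.3f}")
```

Under these assumed numbers, a single p < 0.05 result leaves the plausible hypothesis at about 0.94 and the telepathy hypothesis still vanishingly unlikely, while two independent p < 0.001 results bring telepathy only to about 0.39.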
The mathematics of causality is still not part of a standard stats curriculum.
You seem to be correct here, but it strikes me as strange because quantifying the evidence for causality was one of the central themes of most of my stats classes (with names like “intro statistics” and “statistical design of experiments”).
Possibly the synthesis here is that most of what non-math people learn doesn’t qualify as real math—I only took the super basic stuff that doesn’t get into the actual math of causality (despite minoring in math in undergrad and taking more math than most in a bio PhD).
I would like there to be a formal way to render both DNA and dominoes as a short causal chain. But I don’t have one.
Thinking about it a bit more, I think I’m happy to bite the bullet here and say the causal chain is long (though well-approximated by both of our very short descriptions), and that this is one of the exceptions to the point in footnote 4 that real-world correlations are generally well below 1. The causal chain is quite long, but DNA replication is extremely good: something like 10^-7 errors per base of DNA per generation, so the per-base fidelity is pretty dang close to 1 (my guess is domino setups by hobbyists also have failure rates under 10^-3). I’m sure we could find a handful more examples of things with long chains of very high correlation, but not that many; pretty few correlations in biology are over 0.95.
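To see why long chains usually wash out correlation, and why near-perfect links are the exception, here is a small sketch under an idealized assumption: a linear Gaussian Markov chain in which every link has the same correlation, so the end-to-end correlation is the product of the per-link correlations. The function name and the link counts are hypothetical, and mapping a per-base error rate onto a per-link correlation is only an analogy.

```python
def end_to_end_correlation(per_link: float, links: int) -> float:
    """Correlation between the first and last variable of an idealized
    linear Gaussian Markov chain with identical per-link correlation."""
    return per_link ** links


# Hypothetical chains: ordinary "good" links versus near-perfect links.
for per_link, links in [(0.95, 10), (0.95, 100), (1 - 1e-7, 1_000_000)]:
    r = end_to_end_correlation(per_link, links)
    print(f"per-link {per_link!r}, {links} links -> end-to-end {r:.4f}")
```

In this toy model, links at 0.95 lose most of their correlation within a few dozen steps, while links that fail about once in ten million still keep the end-to-end correlation high after a million steps.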
A classic example from Pearl’s 2009 book
Thanks, that was fun to work out, and appropriately simple for a humble biologist. In the 2x2 matrix of “corr or not” and “causal linkage or not”, this is the opposite square to the one I’m looking for—causal but not correlated. I agree that such things happen (they happen a lot in biology due to regulatory feedback loops), and I now see that this is indeed a non-faithful network.
Is there a similar toy example for “corr but no causal linkage”, e.g. “two variables on opposite sides of the program whose value is correlated for no good reason”? I spent a while asking Claude Opus 4.6 for one and it couldn’t come up with any (it came up with a toy where two variables would always be the same constant, but correlation there is undefined so I don’t count it).
To use an example from my day job: if I’m testing whether knocking out a candidate gene that I think will increase tryptophan in a plant tissue does indeed increase the tryptophan, p < 0.05 feels like a fine cutoff. If I’m testing whether that same KO causes my plants to communicate with me telepathically, I’d be crazy to tell anyone about my results unless I was seeing p < 0.001 in at least two independent experiments.
The difference in required p-value threshold to me seems to come down to the prior. Perhaps there’s no formal framework that combines them, but empirically I think that’s what I’m doing.
As a Pearl-style half-Bayesian (see here) who chose the path of logic over stats more than a decade ago, I am not at all a good representative of the frequentist school. My impression though is that they would say these kinds of biases and assumptions live outside the domain of stats.
A Bayesian would just tell you to replace p-values with Bayesian techniques such as credible intervals and maximum a posteriori estimators, techniques I have mostly not studied. I believe Bayesian techniques as a whole are conceptually simpler but computationally more difficult than frequentist ones.
I think I’m happy to bite the bullet here and say the causal chain is long … and that this is one of the exceptions
Well spoken!
Is there a similar toy example for “corr but no causal linkage”, e.g. “two variables on opposite sides of the program whose value is correlated for no good reason”
So first, if you’re talking about literal correlation coefficients, you can come up with examples where Y is a deterministic function of X, but their correlation coefficient is 0. Correlation coefficients measure linear relationships. Pretty sure you can do it using a piecewise-linear function with just two pieces, and probably also with a simple parabola.
(Just searched: Yes you can. See page 4 of here.)
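A minimal sketch of both cases, assuming X is uniform on a small grid symmetric around 0 (the grid and the helper name `pearson` are illustrative choices, not taken from the linked page): Y = |X| is a two-piece piecewise-linear function and Y = X^2 is a parabola, and each is fully determined by X while its Pearson correlation with X is zero.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]            # symmetric around 0 (assumed for the demo)
two_piece = [abs(x) for x in xs]  # piecewise-linear, two pieces
parabola = [x ** 2 for x in xs]

# Both are zero up to floating-point rounding, despite Y being a function of X.
print(pearson(xs, two_piece))
print(pearson(xs, parabola))
```

The symmetry of X around 0 does the work here: every positive product in the covariance sum is cancelled by its mirror image.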
When we speak of correlation, I think it’s more helpful to consider mutual information and conditional mutual information, which capture all dependencies, not merely linear ones. (At least theoretically—convergence properties of conditional mutual information estimation are terrible, and a reason I gave up on my first causality project in grad school.)
Mutual information between two constants is defined. But it’s always 0. So that’s still a non-example.
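As a rough illustration of both points, here is a plug-in estimate of mutual information on tiny discrete samples. The helper name `mutual_information` and the sample values are assumptions for the sketch. With X uniform on {-1, 0, 1} and Y = X^2, the correlation coefficient is 0 but the mutual information is clearly positive; for two constants it is exactly 0.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Plug-in mutual information, in bits, of an empirical joint distribution."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


dependent = [(x, x ** 2) for x in (-1, 0, 1)]  # Y is a function of X
constants = [(7, 7)] * 3                       # two variables that never vary

print(f"Y = X^2:       {mutual_information(dependent):.4f} bits")
print(f"two constants: {mutual_information(constants):.4f} bits")
```

This plug-in estimator is fine for a toy with an exactly known distribution; it says nothing about the poor convergence of conditional mutual information estimates on real data mentioned above.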
I confess that I actually had to look up how to get spurious correlations in a causal network. Here’s an example showing you can do it even without determinism.
...and then I realized this is still the opposite of what you’re looking for. Oops. I should have just gone to sleep.
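For reference, the "causal but not correlated" square can be sketched with a noisy linear model in which two causal paths cancel. This is not a reproduction of the linked figure: the graph (X -> Z -> Y plus a direct X -> Y edge), the coefficients, and the sample size are all assumptions chosen so that the cancellation is exact.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
xs, zs, ys = [], [], []
for _ in range(20000):
    x = rng.gauss(0, 1)
    z = x + rng.gauss(0, 1)       # X -> Z with coefficient +1
    y = z - x + rng.gauss(0, 1)   # Z -> Y (+1) and direct X -> Y (-1)
    xs.append(x)
    zs.append(z)
    ys.append(y)

r_xy = pearson(xs, ys)  # near 0: the two paths cancel
# Subtracting Z's contribution from Y exposes the direct X -> Y edge.
r_direct = pearson(xs, [y - z for y, z in zip(ys, zs)])

print(f"corr(X, Y)     = {r_xy:.3f}")
print(f"corr(X, Y - Z) = {r_direct:.3f}")
```

Every variable here has its own noise term, so the missing correlation does not rely on determinism, only on the coefficients being tuned to cancel.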
Okay, you might be right about the spurious correlations not being possible in Pearl-style causal graphs.
What I recall from my investigations into this in 2015: (1) I was mostly concerned with trying to apply the PC algorithm for causal structure discovery to programs. PC essentially operates by looking for colliders, i.e., independent variables that become dependent when conditioned on a mutual child. So spurious non-correlations were a bigger problem for me. (2) I found an old comment of mine where I mentioned having to condition on events of probability 0 to apply the do-calculus.
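The collider pattern described in (1) can be sketched with a small simulation. This is not an implementation of PC, only a demonstration of the signal it looks for; the noise scale, the band width used as a crude stand-in for conditioning, and the sample size are assumptions.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = []
for _ in range(20000):
    x = rng.gauss(0, 1)
    y = rng.gauss(0, 1)            # drawn independently of x
    z = x + y + rng.gauss(0, 0.5)  # collider: common child of x and y
    samples.append((x, y, z))

r_all = pearson([s[0] for s in samples], [s[1] for s in samples])

# Crude conditioning: keep only samples whose child Z is close to 0.
band = [s for s in samples if abs(s[2]) < 0.2]
r_band = pearson([s[0] for s in band], [s[1] for s in band])

print(f"corr(X, Y) over all samples:   {r_all:.3f}")
print(f"corr(X, Y) given Z close to 0: {r_band:.3f}")
```

Over all samples the parents are uncorrelated; within the narrow band of Z they become strongly negatively correlated, because knowing the sum and one parent pins down the other.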
Thanks for a detailed reply!
(Source for the unfaithful-network example above: https://www.researchgate.net/figure/An-example-of-unfaithful-network-applying-d-separation-rules-to-the-DAG-would-indicate_fig15_392864356 )