Ilya, I don’t think it is very fair for you to bludgeon people with terminology / appeals to authority (as you do later in a couple of the sub-threads to this comment) especially given that causality is a somewhat niche subfield of machine learning. I.e. I think many people in machine learning would disagree with the implicit assumptions in the claim “probabilistic models cannot capture causal information”. I realize that this is true by definition under the definitions preferred by causality researchers, but the assumption here seems to be that it’s more natural to make causality an ontologically fundamental aspect of the model, whereas it’s far from clear to me that this is the most natural thing to do (i.e. you can imagine learning about causality as a feature of the environment). In essence, you are asserting that “do” is an ontologically fundamental notion, but I personally think of it as a notion that just happens to be important enough to many of the prediction tasks we care about that we hard-code it as a feature of the model, and supply the causal information by hand. I suspect the people you argue with below have similar intuitions but lack the terminology to express them to your satisfaction.
I’ll freely admit that I’m not an expert on causality in particular, so perhaps some of what I say above is off-base. But if I’m also below the bar for respectful discourse then your target audience is small indeed.
[ Upvoted. ]
If anyone felt I was uncivil to them in any subthread, I hereby apologize here.
I am not sure causality is a subfield of ML in the sense that I don’t think many ML people care about causality. I think causal inference is a subfield of stats (lots of talks with the word “causal” at this year’s JSM). I think it’s weird that stats and ML are different fields, but that’s a separate discussion.
I think it is possible to formalize causality without talking about interventions as Pearl et al. think of them; for example, people in reinforcement learning do this. But if you start to worry about e.g. time-varying confounders, and you are not using interventions, you will either get stuff wrong, or have to reinvent interventions again. Which would be silly—so just learn about the Neyman/Rubin model and graphs. It’s the formalism that handles all the “gotchas” correctly. (In fact, until interventionists came along, people didn’t even have the math to realize that time-varying confounders are a “gotcha” that needs special handling!)
By the way, the only reason I am harping on time-varying confounders is that it is a historically important case that I can explain with a 4-node example. There are lots of other, more complicated “gotchas,” of course.
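To make the 4-node case concrete, here is a minimal simulation (all numbers made up for illustration) where L is a time-varying confounder: it is affected by the first treatment A1, and it confounds the second treatment A2. Ignoring L gets the interventional mean wrong, and so does adjusting for L as an ordinary pre-treatment covariate; Robins’s g-formula, which is intervention-based, recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Observational world.  L is a time-varying confounder:
# A1 -> L -> A2 -> Y, and also L -> Y.
a1 = rng.binomial(1, 0.5, n)
L  = rng.binomial(1, 0.3 + 0.4 * a1)        # affected by the earlier treatment
a2 = rng.binomial(1, 0.2 + 0.6 * L)         # later treatment assigned based on L
y  = a2 + 2.0 * L + rng.normal(0, 1, n)     # outcome

def truth(x1, x2, m=500_000):
    """Ground-truth E[Y | do(A1=x1, A2=x2)], by simulating the intervened world."""
    Li = rng.binomial(1, 0.3 + 0.4 * x1, m)
    return np.mean(x2 + 2.0 * Li + rng.normal(0, 1, m))

def mean_y(mask):
    return y[mask].mean()

for x1 in (0, 1):
    for x2 in (0, 1):
        # naive 1: ignore L entirely
        naive1 = mean_y((a1 == x1) & (a2 == x2))
        # naive 2: adjust for L like a pre-treatment covariate, sum_l p(l) E[Y|x1,l,x2]
        naive2 = sum(np.mean(L == l) * mean_y((a1 == x1) & (L == l) & (a2 == x2))
                     for l in (0, 1))
        # g-formula: sum_l p(l | x1) E[Y | x1, l, x2]
        g = sum(np.mean(L[a1 == x1] == l) * mean_y((a1 == x1) & (L == l) & (a2 == x2))
                for l in (0, 1))
        print(f"do(A1={x1}, A2={x2}): truth={truth(x1, x2):.2f} "
              f"naive1={naive1:.2f} naive2={naive2:.2f} g-formula={g:.2f}")
```

The moral is that L has to be adjusted for as a confounder of A2 but must not be conditioned on as a mediator of A1; no single conditioning operation does both, which is why the interventionist machinery is needed.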
Interventions seem to pop up/get reinvented in seemingly weird places, much as the constant π keeps showing up in unexpected corners of math:
http://infostructuralist.wordpress.com/2010/09/23/directed-stochastic-kernels-and-causal-interventions/
In channels with feedback (thus causality arises!)
http://www.adaptiveagents.org/bayesian_control_rule
http://en.wikipedia.org/wiki/Thompson_sampling
In multi-armed bandit problems (which are related to longitudinal studies in causal inference). A minimal Thompson sampling sketch appears below, after this list.
http://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator
http://missingdata.lshtm.ac.uk/index.php?option=com_content&view=article&id=76:missing-at-random-mar&catid=40:missingness-mechanisms&Itemid=96
In handling missing data (one can view “missingness” as a causal property). Note the phrasing in the second link: “given the observed data, the missingness mechanism does not depend on the unobserved data.” This is precisely the “no unobserved confounders” assumption in causal inference. Not surprisingly, the correction is the same as in causal inference; a sketch of this also appears below.
Also in figuring out what the dimension of a statistical hidden variable DAG model is. For example, if A, B, C, D are binary, and U, W are unrestricted, then the dimension of the model
{ p(a,b,c,d) = \sum_{u,w} p(a,b,c,d,u,w) | p(a,b,c,d,u,w) factorizes wrt A → B → C → D, A ← U → C, B ← W → D }
is 13, not 15 (the dimension of the saturated model on four binary variables), which is weird, but there is an intervention-inspired explanation for why.
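One can check the 13 numerically: parameterize the hidden-variable model and compute the rank of the Jacobian of the map from parameters to the 16 cell probabilities at a random point; the generic rank is the model’s dimension. This sketch is my own, and it makes two assumptions: latent cardinality 8 for U and W (which I believe is enough to saturate the dimension, though “unrestricted” above means arbitrary), and a crude singular-value cutoff of 1e-6:

```python
import numpy as np

def sig(t):
    return 1.0 / (1.0 + np.exp(-t))

K = 8  # latent cardinality for U and W (an assumption, see above)

# Factorization: p(u) p(w) p(a|u) p(b|a,w) p(c|b,u) p(d|c,w),
# i.e. the DAG A -> B -> C -> D with A <- U -> C and B <- W -> D.
sizes = [K, K, K, 2 * K, 2 * K, 2 * K]
ntheta = sum(sizes)

def forward(theta):
    """Unconstrained parameters -> the 16 probabilities p(a,b,c,d)."""
    pu_t, pw_t, pa_t, pb_t, pc_t, pd_t = np.split(theta, np.cumsum(sizes)[:-1])
    pu = np.exp(pu_t); pu /= pu.sum()
    pw = np.exp(pw_t); pw /= pw.sum()
    def bern(t, shape):          # stack P(X=0|...), P(X=1|...) on axis 0
        p1 = sig(t).reshape(shape)
        return np.stack([1 - p1, p1])
    pA = bern(pa_t, (K,))        # pA[a, u]
    pB = bern(pb_t, (2, K))      # pB[b, a, w]
    pC = bern(pc_t, (2, K))      # pC[c, b, u]
    pD = bern(pd_t, (2, K))      # pD[d, c, w]
    joint = np.einsum('u,w,au,baw,cbu,dcw->abcd', pu, pw, pA, pB, pC, pD)
    return joint.ravel()

rng = np.random.default_rng(0)
theta0 = rng.normal(0, 0.5, ntheta)

# Numerical Jacobian of the cell probabilities w.r.t. the parameters.
eps = 1e-5
J = np.empty((16, ntheta))
for i in range(ntheta):
    d = np.zeros(ntheta); d[i] = eps
    J[:, i] = (forward(theta0 + d) - forward(theta0 - d)) / (2 * eps)

sv = np.linalg.svd(J, compute_uv=False)
print("singular values:", np.round(sv, 6))
print("rank (model dimension):", int((sv > 1e-6).sum()))  # expected: 13, per the claim above
```

If I recall correctly, the two missing dimensions correspond to a Verma-type constraint—an equality constraint on an interventional quantity—which is the intervention-inspired explanation alluded to above.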
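Backing up to the bandit links: here is a minimal Thompson sampling loop for Bernoulli arms with Beta(1,1) priors (the arm means are made up). The causal point is that each pull is an intervention, a do() on the arm, so the data the algorithm learns from is experimental rather than observational:

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.3, 0.5, 0.6]                  # unknown Bernoulli arm means (made up)
wins = np.ones(3); losses = np.ones(3)    # Beta(1,1) prior per arm

total = 0
for t in range(20_000):
    samples = rng.beta(wins, losses)      # one posterior draw per arm
    arm = int(np.argmax(samples))         # act greedily w.r.t. the draw
    reward = rng.binomial(1, true_p[arm])
    wins[arm] += reward; losses[arm] += 1 - reward
    total += reward

print("pulls per arm:", wins + losses - 2)
print("average reward:", total / 20_000)  # close to max(true_p) = 0.6
```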
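And for the missing-data links, a minimal sketch (made-up numbers; in practice the observation probabilities would be estimated from the data, here I hand them to the estimator) where Y is missing at random given an always-observed X. The complete-case mean is biased, and inverse-probability weighting—exactly the propensity-score correction from causal inference—fixes it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# X is always observed; Y is missing at random (MAR): whether Y is
# observed depends only on the observed X, not on Y itself.
x = rng.binomial(1, 0.5, n)
y = 2.0 * x + rng.normal(0, 1, n)          # true E[Y] = 1.0
p_obs = np.where(x == 1, 0.9, 0.3)          # missingness mechanism depends on X
r = rng.binomial(1, p_obs)                  # r = 1 means Y is observed

complete_case = y[r == 1].mean()            # biased: over-represents x = 1
# Same correction as in causal inference: weight each observed unit by
# 1 / P(observed | X), the analogue of inverse propensity weighting.
ipw = np.sum((r / p_obs) * y) / np.sum(r / p_obs)
print(f"true mean 1.0, complete-case {complete_case:.3f}, IPW {ipw:.3f}")
```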
“you can imagine learning about causality as a feature of the environment”

I don’t think you can get something for nothing. You will need causal assumptions somewhere.
Thanks Ilya, that was a lot of useful context and I wasn’t aware that causality was more in stats than ML. For the record, I think that causality is super-interesting and cool, I hope that I didn’t sound too negative by calling it “niche” (I would have described e.g. Bayesian nonparametrics, which I used to do research in, the same way, although perhaps it’s unfair to lump in causality with nonparametric Bayes, since the former has a much more distinguished history).
I agree with pretty much everything you say above, although I’m still confused about “you will need causal assumptions somewhere”. If I could somehow actually do inference under the Solomonoff prior, do you think that some notion of causality would not pop out? I’d understand if you didn’t want to take the time to explain it to me; I’ve had this conversation with two other causality people already and am still not quite sure I understand what is meant by “you need causal assumptions to get causal inferences”. (Note I already agree that this is true in the context of graphical models, i.e. you can’t distinguish between X → Y and X ← Y without do(X) or some similar information.)
Graphical models are only a “thing” because our brains dedicate lots of processing to vision, so, for instance, we immediately understand complicated conditional independence statements if they are expressed in the visual form of d-separation. In some sense, the graphs in graphical models add no extra mathematical information beyond what was already encoded without them.
Given this, I am not sure there really is a context for graphical models separate from the context of “variables and their relationships”. What you are saying above is that we seem to need “something extra” to be able to tell the direction of causality in a two variable system. (For example, in an additive noise model you can do this:
http://machinelearning.wustl.edu/mlpapers/paper_files/ShimizuHHK06.pdf)
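The linked paper is LiNGAM (Shimizu et al.). The idea can be seen in toy form: with linear relations and non-Gaussian noise, regressing in the causal direction leaves residuals independent of the regressor, while the anti-causal direction does not. This sketch is mine, and it uses a crude dependence score (correlation of squares) rather than the paper’s actual estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Ground truth: X -> Y, linear with non-Gaussian (uniform) noise.
x = rng.uniform(-1, 1, n)
y = 2.0 * x + rng.uniform(-1, 1, n)

def resid(target, pred):
    """Residual of OLS regression of target on pred (with intercept)."""
    pc = pred - pred.mean()
    b = (pc * (target - target.mean())).mean() / np.var(pred)
    return target - target.mean() - b * pc

def dep(u, v):
    """Crude dependence score: correlation between squared values.
    Zero is necessary (not sufficient) for independence."""
    return abs(np.corrcoef(u**2, v**2)[0, 1])

print("X->Y direction, dep(resid, X):", round(dep(resid(y, x), x), 4))  # near 0
print("Y->X direction, dep(resid, Y):", round(dep(resid(x, y), y), 4))  # clearly > 0
```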
I think the “no causes in, no causes out” principle is more general than that though. For example, consider a three-variable case with variables A, B, C where:
A is marginally independent of B, but no other independences hold; then the only faithful graphical explanation for this model is:
A → C ← B
It seems that, unlike the previous case, here there is no causal ambiguity—A points to C, and B points to C. However, since the only information you inserted into the procedure which gave you this graph is the information about conditional independences, all you are getting out is a graphical description of a conditional independence model (that is, a Bayesian network, or a statistical DAG model). In particular, the absent arrows aren’t telling you about absent causal relationships (that is, whether A would change if I intervened on C), but about absent statistical relationships (that is, whether A is independent of B). The statistical interpretation of the above graph is that it corresponds to a set of densities:
{ p(A,B,C) | A is independent of B }
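For concreteness, before moving to the causal reading, a quick simulation of this statistical story (mechanism and numbers made up): A and B are independent coins and C is a noisy OR of them, so A and B are marginally independent but dependent given C (“explaining away”):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# A -> C <- B: A and B are independent coins, C is a noisy OR of them.
a = rng.binomial(1, 0.5, n)
b = rng.binomial(1, 0.5, n)
c = rng.binomial(1, 0.1 + 0.8 * np.maximum(a, b))

def dep(x, y):
    return abs(np.corrcoef(x, y)[0, 1])

print("dep(A, B) marginally:", round(dep(a, b), 3))                   # near 0
print("dep(A, B) given C=1: ", round(dep(a[c == 1], b[c == 1]), 3))   # clearly > 0
```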
The same graph can also correspond to a causal model, where we are explicitly talking about interventions, that is:
{ p(A,B,C,C(a,b),B(a)) | C(a,b) is independent of B(a) is independent of A, p(B(a)) = p(B) }
where C(a,b) is just stats notation for do(.), that is p(C(a,b)) = p(C | do(a,b)).
This is a different object from before, and the interpretation of arrows is different. That is, the absence of an arrow from A to B means that intervening on A does not affect B, and so on. This causal model also induces an independence model on the same graph, where the interpretation of arrows changes back to the statistical interpretation. However, we could imagine a very different causal model on three variables that would also induce an independence model in which A is marginally independent of B: for example, the set of all densities where the real direction of causality is A → C → B, but the probabilities involved happen to line up in such a way that A is marginally independent of B. In other words, the mapping from causal to statistical models is many to one.
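Here is one such construction, worked out exactly (my numbers). The chain A → C → B needs a ternary C here: with a binary C, forcing A to be independent of B would also force B to be independent of C. The conditionals below are tuned so that \sum_c p(b|c) p(c|a) does not depend on a:

```python
import numpy as np

# A causal chain A -> C -> B whose probabilities are tuned so that A and B
# come out marginally independent (an unfaithful distribution).
p_a = np.array([0.5, 0.5])                       # p(A)
p_c_given_a = np.array([[0.5, 0.25, 0.25],       # p(C | A=0)
                        [0.0, 0.5,  0.5 ]])      # p(C | A=1): depends on A
p_b_given_c = np.array([[0.5, 0.5],              # p(B | C=0)
                        [0.7, 0.3],              # p(B | C=1)
                        [0.3, 0.7]])             # p(B | C=2): depends on C

# joint[a, c, b] = p(a) p(c|a) p(b|c)
joint = np.einsum('a,ac,cb->acb', p_a, p_c_given_a, p_b_given_c)

p_ab = joint.sum(axis=1)                          # marginalize out C
p_a_marg = p_ab.sum(axis=1, keepdims=True)
p_b_marg = p_ab.sum(axis=0, keepdims=True)
print("max |p(a,b) - p(a)p(b)| =", np.abs(p_ab - p_a_marg * p_b_marg).max())  # 0
print("p(B=1 | C):", p_b_given_c[:, 1])           # B clearly depends on C
```

This distribution sits inside the statistical model { p(A,B,C) | A is independent of B } above even though its causal structure is completely different, which is exactly the many-to-one point.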
Given this view, it seems pretty clear that going from independences to causal models (even via a very complicated procedure) involves making some sort of assumption that makes the mapping one to one. Maybe the prior in Solomonoff induction gives this to you, but my intuitions about what non-computable procedures will do are fairly poor.
It sort of seems like Solomonoff induction operates at a (very low) level of abstraction where interventionist causality isn’t really necessary (because we just figure out what the observable environment as a whole, including action-capable agents, will do), and thus isn’t explicitly represented. This is similar to how Blockhead (http://en.wikipedia.org/wiki/Blockhead_(computer_system)) does not need an explicit internal model of the other participant in the conversation.
I think Solomonoff induction is sort of a boring subject, if one is interested in induction, in the same sense that Blockhead is boring if one is interested in passing the Turing test, and particle physics is boring if one is interested in biology.