I think this is a solved problem. Are you familiar with the formalization of causality in terms of Bayesian networks? (You have enough history on this website that you’ve probably heard of it!)
Make observations using sensors. Abstract your sensory data into variables: maybe you have a weather variable with possible values RAINY and SUNNY, a sprinkler variable with possible values ON and OFF, and a sidewalk variable with possible values WET and DRY. As you make more observations, you can begin to learn statistical relationships between your variables: maybe weather and sprinkler are independent, but conditionally dependent given the value of sidewalk. It turns out that you can summarize this kind of knowledge in the form of a directed graph: weather → sidewalk ← sprinkler. (I'm glossing over a lot of details: a graph represents conditional-independence relationships in the joint distribution over your variables, but the distribution doesn't uniquely specify a graph.)
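That collider pattern can be checked by brute simulation. Here's a minimal hand-rolled sketch (the probabilities are made up for illustration; no particular library is assumed):

```python
import random

random.seed(0)

# Simulate the weather -> sidewalk <- sprinkler collider and check
# (in)dependence by counting, with invented parameters.
def sample():
    weather = random.random() < 0.3            # RAINY with prob 0.3
    sprinkler = random.random() < 0.4          # ON with prob 0.4, independent of weather
    # Sidewalk is WET if it rained or the sprinkler ran (with a little noise).
    sidewalk = (weather or sprinkler) and random.random() < 0.95
    return weather, sprinkler, sidewalk

samples = [sample() for _ in range(100_000)]

def p_cond(pred, given):
    sel = [s for s in samples if given(s)]
    return sum(pred(s) for s in sel) / len(sel)

# Marginally, weather tells you nothing about the sprinkler:
print(p_cond(lambda s: s[1], lambda s: s[0]))      # ≈ 0.4, same as the prior
print(p_cond(lambda s: s[1], lambda s: not s[0]))  # ≈ 0.4 again

# But conditional on a wet sidewalk, they become dependent ("explaining away"):
print(p_cond(lambda s: s[1], lambda s: s[2] and s[0]))      # ≈ 0.4: rain already explains the wetness
print(p_cond(lambda s: s[1], lambda s: s[2] and not s[0]))  # ≈ 1.0: only the sprinkler can explain it
```

Given a wet sidewalk, learning that it rained makes the sprinkler hypothesis less necessary, so weather and sprinkler are dependent conditional on sidewalk even though they are marginally independent.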
But once you've learned this graphical model representing the probabilistic relationships between variables which represent abstractions over your sensory observations, then you can construct a similar model in which a particular variable is fixed to a particular value (severing the edges coming into it from its parents), but everything else is kept the same.
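Concretely (same toy graph, with numbers I'm making up): conditioning on sidewalk=WET changes your belief about the weather, but the surgically altered model, the one where sidewalk is fixed, does not. That's the difference between observing and doing:

```python
from itertools import product

# The three-variable joint for weather -> sidewalk <- sprinkler,
# with assumed parameters.
P_RAIN, P_ON, P_WET_IF_CAUSE = 0.3, 0.4, 0.95

def p_sidewalk(wet, rain, on):
    p_wet = P_WET_IF_CAUSE if (rain or on) else 0.0
    return p_wet if wet else 1.0 - p_wet

def p_joint(rain, on, wet):
    return ((P_RAIN if rain else 1 - P_RAIN)
            * (P_ON if on else 1 - P_ON)
            * p_sidewalk(wet, rain, on))

# Observational: seeing a wet sidewalk is evidence about the weather.
p_wet = sum(p_joint(r, o, True) for r, o in product([True, False], repeat=2))
p_rain_given_wet = sum(p_joint(True, o, True) for o in [True, False]) / p_wet
print(round(p_rain_given_wet, 3))   # 0.517: well above the 0.3 prior

# Interventional: in the altered graph we *set* sidewalk=WET, deleting the
# edges weather -> sidewalk and sprinkler -> sidewalk.  Weather's marginal
# is untouched: hosing down the sidewalk doesn't make it rain.
p_rain_do_wet = P_RAIN
print(p_rain_do_wet)                # 0.3
```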
Why would you do that? Because such an altered model is useful for decision-making if the system-that-you-are is one of the variables in the graph. The way you compute which decision to output is based on a model of how the things in your environment depend on your decision, and it's possible to learn such a model from previous observations, even though you can't observe the effects of your current decision in advance of making it.
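A minimal decision loop in that spirit (the utilities and probabilities here are invented for illustration): evaluate each candidate action by intervening on the learned model, then pick the action with the best expected outcome.

```python
# Assumed parameters from the toy sprinkler model.
P_RAIN, P_WET_IF_CAUSE = 0.3, 0.95

def p_wet_given_do_sprinkler(on):
    # Under do(sprinkler=on) the sprinkler's value is set, not observed,
    # so the weather keeps its prior.
    if on:
        return P_WET_IF_CAUSE                 # a cause is guaranteed present
    return P_RAIN * P_WET_IF_CAUSE            # only rain can wet the sidewalk

# Made-up preferences: a wet sidewalk costs 1 util, watering the lawn gains 0.5.
def expected_utility(on):
    return (0.5 if on else 0.0) - p_wet_given_do_sprinkler(on)

best = max([True, False], key=expected_utility)
print("turn sprinkler", "ON" if best else "OFF")   # OFF wins with these numbers
```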
And that’s what counterfactuals are! I don’t think this is meaningfully circular: we’ve described how the system works in terms of lower-level components. (I’ve omitted a lot of details, but we can totally write computer programs that do this stuff.)
I don't really agree. The idea of using conditional independencies to measure causality is cute in theory, but in my experience it doesn't work in practice, for several reasons: things are rarely truly independent, you don't get enough data to test for independencies in practice, and conditional-independence relations are not enough to uniquely identify the causal structure. There's much more to causality than just conditional independence relations.
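To illustrate the last point with a toy example (made-up numbers): the joint distribution of two binary variables factors equally well as X → Y or as Y → X, so no amount of conditional-independence testing on those two variables alone can tell the arrows apart.

```python
from itertools import product

# Generate the joint from the "true" model X -> Y, with invented parameters.
p_x = 0.6
p_y_given_x = {True: 0.8, False: 0.3}
joint = {(x, y): (p_x if x else 1 - p_x)
                 * (p_y_given_x[x] if y else 1 - p_y_given_x[x])
         for x, y in product([True, False], repeat=2)}

# Refactor the same joint as the "backwards" model Y -> X.
p_y = sum(joint[(x, True)] for x in [True, False])
p_x_given_y = {y: joint[(True, y)] / sum(joint[(x, y)] for x in [True, False])
               for y in [True, False]}
joint_backwards = {(x, y): (p_y if y else 1 - p_y)
                           * (p_x_given_y[y] if x else 1 - p_x_given_y[y])
                   for x, y in product([True, False], repeat=2)}

# Both factorizations reproduce the identical distribution.
assert all(abs(joint[k] - joint_backwards[k]) < 1e-12 for k in joint)
print("same joint, two causal stories")
```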
Maybe I’m explaining it badly? I’m trying to point to the Judea Pearl thing in my own words. The claim is not that causality “just is” conditional independence relationships. (Pearl repeatedly explicitly disclaims that causal concepts are different from statistical concepts and require stronger assumptions.) Do you have an issue with the graph formalism itself (as an explanation of the underlying reality of how causality and counterfactuals work), separate from practical concerns about how one would learn a particular graph?
Maybe I’m explaining it badly? I’m trying to point to the Judea Pearl thing in my own words. The claim is not that causality “just is” conditional independence relationships. (Pearl repeatedly explicitly disclaims that causal concepts are different from statistical concepts and require stronger assumptions.)
Partly it's the explanation. In addition to the points listed above, there are also issues like focusing entirely on rung-2 causality (interventions) while disregarding rung-3 causality (counterfactuals), which is arguably the truer kind of causality.
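For concreteness, a rung-3 question is something like "the sprinkler was ON and the sidewalk got WET; would it have been wet had the sprinkler been OFF?" Here is a toy version of Pearl's abduction/action/prediction recipe on a hand-rolled structural causal model (the mechanism and noise term are my own invention):

```python
def sidewalk(rain, sprinkler, noise):
    # Structural equation: wet iff some cause is present and the exogenous
    # noise term (e.g. "it wasn't hot enough to evaporate") cooperates.
    return (rain or sprinkler) and noise

# Abduction: we observed rain=False, sprinkler=True, sidewalk=True.
# Find the noise values consistent with that observation.
observed = dict(rain=False, sprinkler=True)
consistent_noise = [n for n in (True, False)
                    if sidewalk(observed["rain"], observed["sprinkler"], n)]
assert consistent_noise == [True]

# Action + prediction: rerun the mechanism with sprinkler forced OFF,
# keeping the abduced noise fixed.
would_have_been_wet = sidewalk(observed["rain"], False, consistent_noise[0])
print(would_have_been_wet)   # False: without the sprinkler it'd have stayed dry
```

This is the part that goes beyond interventions: you hold the world's unobserved background conditions fixed at the values inferred from what actually happened, then rerun the mechanism with one variable changed.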
Do you have an issue with the graph formalism itself (as an explanation of the underlying reality of how causality and counterfactuals work), separate from practical concerns about how one would learn a particular graph?
I assume that here we are understanding the graph formalism sufficiently broadly as to include e.g. differential equations, as otherwise there’s definitely a problem already there. And in the same vein, for most problems both DAGs and differential equations are too rigid/vector-spacey to work, and we probably need new formalisms that can better handle systems with varying structure of variables.
Regardless, I don't think the question of how one would learn a particular graph is merely a practical concern; it's the core part. Not just learning the edges between the vertices, but also selecting the variables that are supposed to feature in the graphs. In fact, I suspect that once we have a good understanding of representation learning, we will see that causal structure learning follows mostly from the representations we choose, because the things that make certain functions interesting as features tend to be the causal effects they have.
As far as I know, most of the focus of the causal-inference literature is on effect-size estimation. That's probably important too, but it's not really the hard part that OP is asking about. As far as I know, the literature puts only slight focus on causal structure learning, and the typical advice seems to be to have human experts specify the causal structure. And as far as I know, it has no answer at all to representation learning. (Instead, John Wentworth seems to be the hero who is working on a solid theory for this.)
Yeah, I'm aware of Bayesian Networks.

Two points:

Bayesian Networks don't solve Newcomb's problem, but I assume you're aware of that. So I'm guessing your point is that if standard counterfactuals can be constructed without presupposing counterfactuals, then more general counterfactuals can most likely be constructed the same way?
Does the concept of a variable even make sense without counterfactuals? It’s not immediately obvious that it does, although I haven’t thought through this enough to assert that it doesn’t.
Update: Having spent a few minutes thinking this through, I've concluded that the concept of a variable over time (or over space, etc.) makes sense without counterfactuals. However, this is a more limited notion of variable than the one we normally deal with: if, for example, the variable L representing the state of a light switch is "ON" at t=0, then we wouldn't have the notion that it could have been "OFF" instead.
Update 2: Upon further thought, this seems more limited than I first thought. For example, we can't say "let a be how many apples there would be at time t if we counted them", because "if we counted them" invokes counterfactual reasoning, unless we really did count the apples at each time period. In any case, the question of whether Bayesian Networks are circular seems complex enough to deserve further investigation.