A small comment… Pearl’s treatment works fine for “forward-tracking” counterfactuals, where the only allowed changes in the counterfactual world are in the future of the change (i.e. after the point of surgery). However, regular counterfactuals require a bit of “back-tracking” to make the counterfactual scenario plausible in the first place.

Consider these statements:

1) “If Gore had been president during 9/11 he’d have reacted differently, and the US wouldn’t have invaded Iraq”

2) “If Gore had been president during 9/11 then he’d have been sworn in as president back in January 2001”

3) “If Gore had been president during 9/11 he’d have been blinking in shock at having suddenly teleported to Air Force One, and wondering why everyone was calling him Mr President”

4) “If Gore had been president during 9/11 he’d have been recently sworn in as president, owing to changing party, Dick Cheney dying of a heart attack, Bush appointing him as vice-president, and then himself dying of shock”.

Statement 1 is a forward-tracking counterfactual and arguably true. But the usual way of understanding it to be true assumes that 2 is also true, which is a back-tracking counterfactual (it involves re-writing approximately a year of history before 9/11).

Statement 3 is the only statement consistent with no back-tracking at all, and corresponds to Pearl’s approach of performing surgery on the causal graph at 9/11. (This is generally physically impossible, or at least physically absurd, since it involves tearing the graph apart and inserting a new state with no causal relation to the previous states.)

Statement 4 is an odd sort of compromise; it’s at least not physically impossible or absurd, and involves the bare minimum of back-tracking (to a political crisis a few days before 9/11). But it is clearly not the best way to understand the counterfactual.

My feeling is this is only a problem with expressing counterfactuals in English. If one did have a causal model of American history, 2000-2001, and one wanted to implement counterfactual 1, performing surgery at 9/11 would be unsound, for the reasons you state. The joint probability of the ancestors of 9/11 after such a transformation would all be very small indeed, relative to whatever vastly improbable events were necessary to transition Al Gore circa 9/10 to president during 9/11.

Is this an actual limitation of the calculus, though? Are there counterfactuals that are well-posed, but require an indefinite amount of “back-tracking”?

I’m not sure the problem is with English…

The issue arises whenever we have a causal model with a large number of micro-states, and the antecedent of a counterfactual can only be realised in worlds which change lots of different micro-states. The most “natural” way of thinking about the counterfactual in that case is still to make a minimal change (to one single micro state e.g. a particle decaying somewhere, or an atom shifting an angstrom somewhere) and to make it sufficiently far back in time to make a difference. (In the Gore case, in the brain of whoever thought up the butterfly ballot, or perhaps in the brain of a justice of the Supreme Court.) The problem with Pearl’s calculus though is that it doesn’t do that.

Here’s a toy model to demonstrate (no English). Consider the following set of structural equations (among Boolean micro state variables):

X = 0

Y_1 = X, Y_2 = X, …, Y_10^30 = X

The model is deterministic so P[X = 0] = 1.

Next we define a “macro-state” variable Z := (Y_1 + Y_2 + … + Y_10^30) / 10^30.
Plainly in the actual outcome Z = 0 and indeed P[Z = 0] = 1.

But what if Z were equal to 1?

My understanding of Pearl’s semantics is that to evaluate this we have to intervene i.e. do(Z = 1) and this is equivalent to the multi-point intervention do(Y_1 = 1 & Y_2 = 1 & … & Y_10^30 = 1). This is achieved by replacing every structural equation between X and Y_i by the static equation Y_i = 1.

Importantly, it is NOT achieved by the single-point intervention X = 1, even though that is probably the most “natural” way to realise the counterfactual. So in Pearl’s notation, we must have ~X_(Z = 1) or in probabilistic terms P[X = 0 | do(Z = 1)] = 1. Which, to be frank, seems wrong.

And we can’t “fix” this in Pearl’s semantics by choosing the alternative surgery (X = 1) because if P[X = 1 | do(Z = 1)] = 1 that would imply in Pearl’s semantics that X is caused by the Yi, rather than the other way round, which is clearly wrong since it contradicts the original causal graph. Worse, even if we introduce some ambiguity, saying that X might change under the intervention do(Z = 1), then we will still have P[X = 1 | do(Z = 1)] > 0 = P[X = 1] and this is enough to imply a probabilistic causal link from the Y_i to X which is still contrary to the causal graph.

So I think this is a case where Pearl’s analysis gets it wrong.
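For anyone who wants to poke at the toy model, here is a minimal sketch of it in Python. It assumes a small stand-in `N` for 10^30 so the model can actually be evaluated, and the names `run_model` and `macro_z` are made up for this illustration. It only mimics the surgery described above (an intervened variable ignores its structural equation); it is not an implementation of Pearl's full calculus.

```python
from typing import Dict, Optional

N = 8  # hypothetical stand-in for 10^30, small enough to evaluate


def run_model(do: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Evaluate X = 0, Y_i = X, applying surgery to any variable named in `do`."""
    do = do or {}
    values: Dict[str, int] = {}
    # An intervened variable takes the stipulated value and ignores its equation.
    values["X"] = do.get("X", 0)
    for i in range(1, N + 1):
        name = f"Y_{i}"
        values[name] = do.get(name, values["X"])
    return values


def macro_z(values: Dict[str, int]) -> float:
    """Z is the average of the Y_i: a defined quantity, not a node in the graph."""
    return sum(values[f"Y_{i}"] for i in range(1, N + 1)) / N


actual = run_model()
surgery_on_ys = run_model({f"Y_{i}": 1 for i in range(1, N + 1)})
surgery_on_x = run_model({"X": 1})

print(macro_z(actual), actual["X"])                # 0.0 0
print(macro_z(surgery_on_ys), surgery_on_ys["X"])  # 1.0 0
print(macro_z(surgery_on_x), surgery_on_x["X"])    # 1.0 1
```

Both surgeries make Z = 1, but only the multi-point surgery on the Y_i leaves X at 0, which is the result being objected to.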

Before I analyze this apparent paradox in any depth, I want to be sure I understand your criticism. There are three things about this comment on which I am unclear.

1.) The number of states cannot be relevant to the paradox from a theoretical standpoint, because nothing in Pearl’s calculus depends on the number of states. If this does pose a problem, it only poses a problem in so far as it creates an apparent paradox, that is, whatever algorithm humans use to parse the counterfactual “What if Z were 1?” is different from Pearl’s calculus. A priori, this is not a dealbreaker, unless it can also be shown the human algorithm does better.

2.) If Yi = X, then there is a causal link between Yi and X. Indeed, there is a causal link between every X and every Yi. Conditioning on any of the Yi immediately fixes the value of every other variable.

3.) You say the problem isn’t with English, but then talk about “the most natural way to realize a counterfactual.” I don’t know what that means, other than as an artifact of the human causal learning algorithm.

Or am I misunderstanding you completely?

Thanks for taking the time to think/comment. It may help us to fix a reference which describes Pearl’s thinking and his calculus. There are several of his papers available online, but this one is pretty comprehensive: ftp://ftp.cs.ucla.edu/pub/stat_ser/r284-reprint.pdf “Bayesianism and Causality, Or, Why I am only a Half-Bayesian”.

Now onto your points:

1) You are correct that nothing in Pearl’s calculus varies depending on the number of variables Yi which causally depend on X. For any number of Yi, the intervention do(Z = 1) breaks all the links between X and the Yi and doesn’t change the value of X at all. Also, there is no “paradox” within Pearl’s calculus here: it is internally consistent.

The real difficulty is that the calculus just doesn’t work as a full conceptual analysis of counterfactuals, and this becomes increasingly clear the more Yi variables we add. It is a bit unfortunate, because while the calculus is elegant in its own terms, it does appear that conceptual analysis is what Pearl was attempting. He really did intend his “do” calculus to reflect how we usually understand counterfactuals, only stated more precisely. Pearl was not consciously proposing a “revisionist” account to the effect: “This is how I’m going to define counterfactuals for the sake of getting some math to work. If your existing definition or intuition about counterfactuals doesn’t match that definition, then sorry, but it still won’t affect my definition.” Accordingly, it doesn’t help to say “Regular intuitions say one thing, Pearl’s calculus says another, but the calculus is better, therefore the calculus is right and intuitions are wrong”. You can get away with that in revisionist accounts/definitions but not in regular conceptual analysis.

2) The structural equations do indeed imply there is a causal link from the X to the Yi. But there is NO causal link in the opposite direction from the Yi to the X, or from any Yi to any Yj. The causal graph is directed, and the structural equations are asymmetric. Note that in Pearl’s models, the structural equation Yi = X is different from the reverse structural equation X = Yi, even though in regular logic and probability theory these are equivalent. This point is really quite essential to Pearl’s treatment, and is made clear by the referenced document.

3) See point 1. Pearl’s calculus is trying to analyse counterfactuals (and causal relations) as we usually understand them, not to propose a revisionist account. So evidence about how we (naturally) interpret counterfactuals (in both the Gore case and the X, Y case) is entirely relevant here.

Incidentally, if you want my one sentence view, I’d say that Pearl is correctly analysing a certain sort of counterfactual but not the general sort he thinks he is analysing. Consider these two counterfactuals:

If A were to happen, then B would happen.

If A were to be made to happen (by outside intervention) then B would happen.

I believe that these are different counterfactuals, with different antecedents, and so they can have different truth values. It looks to me like Pearl’s “do” calculus correctly analyses the second sort of counterfactual, but not the first.

(Edited this comment to fix typos and a broken reference.)

Okay. So according to Causality (first edition, cause I’m poor), Theorem 7.1.7, the algorithm for calculating the counterfactual P((Y = y)_(X = x) | e), which represents the statement “If X were x, then Y would be y, given evidence e”, has three stages:

1) Abduction: use the probability distribution P(x, y | E = e).

2) Action: perform do(X = x).

3) Prediction: calculate P(Y = y) relative to the new graph model and its updated joint probability distribution.
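The three stages above can be sketched in code. This is a rough sketch under assumptions that are not in the thread: a single exogenous variable `U` with a made-up prior `EPS`, the equations X = U and Y = X, and hypothetical helper names `solve` and `counterfactual`. It is meant to show the shape of the procedure, not to reproduce the book's statement of the theorem.

```python
from typing import Callable, Dict

EPS = 0.01  # hypothetical prior probability that the exogenous U equals 1

Assignment = Dict[str, int]


def solve(u: int, do: Assignment) -> Assignment:
    """Structural equations X = U, Y = X, with surgery for variables in `do`."""
    x = do.get("X", u)
    y = do.get("Y", x)
    return {"X": x, "Y": y}


def counterfactual(query: Callable[[Assignment], bool],
                   do: Assignment,
                   evidence: Callable[[Assignment], bool]) -> float:
    prior = {0: 1.0 - EPS, 1: EPS}
    # 1) Abduction: update the distribution over U using the evidence,
    #    evaluated in the original (un-intervened) model.
    weights = {u: p for u, p in prior.items() if evidence(solve(u, {}))}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("evidence has probability zero in the original model")
    posterior = {u: p / total for u, p in weights.items()}
    # 2) Action: solve(u, do) replaces the equations of the intervened variables.
    # 3) Prediction: evaluate the query in the modified model.
    return sum(p for u, p in posterior.items() if query(solve(u, do)))


# "Given that we saw Y = 1: if Y had been 0, would X still have been 1?"
p = counterfactual(query=lambda v: v["X"] == 1,
                   do={"Y": 0},
                   evidence=lambda v: v["Y"] == 1)
print(p)  # 1.0
```

The evidence pins down U = 1, and surgery on Y leaves the equation for X untouched, so X keeps its value: the procedure never backtracks from an intervened variable to its causes.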

In our specific case, we want to calculate P (X = 0_(Z = 1)). There’s no evidence to condition on, so abduction does nothing.

To perform do(Z = 1), we delete every arrow pointing from the Yi’s to Z. The new probability distribution, p(x, yi | do(Z = 1)) is given by p(x, yi, 1) when z = 1 and zero otherwise. Since the original probability distribution assigned probability one only to the state (x = 0, yi = 0, z = 0), the new probability distribution is uniformly zero.

I now no longer follow your calculation of P(X=0_(Z=1)). In particular:

My understanding of Pearl’s semantics is that to evaluate this we have to intervene i.e. do(Z = 1) and this is equivalent to the multi-point intervention do(Y1 = 1 & Y2 = 1 & … & Y10^30 = 1). This is achieved by replacing every structural equation between X and Yi by the static equation Y_i = 1.

The intervention do(Z = 1) does not manipulate the Yi. The formula I used to calculate p(X = 0 | do(Z = 1)) is the truncated factorization formula given in section 3.2.3.

I suddenly wish I had sat down and calculated this out first, rather than argue from principles. I hear my mother’s voice in the background telling me to “do the math,” as is her habit.

You missed the point here that Z is a “macro-state” variable, which is defined to be the average of the Yi variables.

It is not actually a separate variable on the causal graph, and it is not caused by the Yi variables. This means that the intervention do(Z = 1) can only be realised on the causal graph by do(Y1 = 1, Y2 = 1, …, Y_10^30 = 1) which was what I stated a few posts ago. You are correct that the abduction step is not needed as this is a deterministic example.

Then why is P( X = 1 | do(Yi = 1) ) = 1? If I delete from the graph every arrow entering each Yi, I’m left with a graph empty of edges; the new joint pdf is still uniformly zero.

If you look back at my above posts, I deduce that in Pearl’s calculus we will get P[X = 0 | do (Z = 1)] = P[X = 0 | do(Yi = 1 for all i)] = 1. We agree here with what Pearl’s calculus says.

The problem is that the counterfactual interpretation of this is “If the average value of the Yi were 1, then X would have been 0”. And that seems plain implausible as a counterfactual. The much more plausible counterfactual backtracks to change X, allowing all the Yi to change to 1 through a single change in the causal graph, namely “If the average value of the Yi were 1, then X would have been 1”.

Notice the analogy to the Gore counterfactual. If Gore were president on 9/11, he wouldn’t suddenly have become president (the equivalent of a mass deletion of all the causal links to the Yi). No, he would have been president since January, because of a micro-change the previous Fall (equivalent to a backtracked change to the X). I believe you agreed that the Gore counterfactual needs to backtrack to make sense, so you agree with backtracking in principle? In that case, you should disagree with the Pearl treatment of counterfactuals, since they never backtrack (they can’t).

If you look back at my above posts, I deduce that in Pearl’s calculus we will get P[X = 0 | do (Z = 1)] = P[X = 0 | do(Yi = 1 for all i)] = 1. We agree here with what Pearl’s calculus says.

No, we disagree. My calculations suggest that P[X = 0 | do(Yi = 1 for all i)] = P[X = 1 | do(Yi = 1 for all i)] = 0. The intervention falls outside the region where the original joint pdf has positive mass. The intervention do(X = 1) also annihilates the original joint pdf, because there is no region of positive mass in which X = 1.

I still don’t understand why you don’t think the problem is a language problem. Pearl’s counterfactuals have a specific meaning, so of course they don’t mean something else from what they mean, even if the other meaning is a more plausible interpretation of the counterfactual (again, whatever that means—I’m still not sure what “more plausible” is supposed to mean theoretically).

The problem is that the counterfactual interpretation of this is “If the average value of the Yi were 1, then X would have been 0”. And that seems plain implausible as a counterfactual. The much more plausible counterfactual backtracks to change X, allowing all the Yi to change to 1 through a single change in the causal graph, namely “If the average value of the Yi were 1, then X would have been 1”.

I think the problem is that when you intervene to make something impossible happen, the resulting system no longer makes sense.

I believe you agreed that the Gore counterfactual needs to backtrack to make sense, so you agree with backtracking in principle?

Yes. (I assume you mean “If Gore was president during 9/11, he wouldn’t have invaded Iraq.”)

In that case, you should disagree with the Pearl treatment of counterfactuals, since they never backtrack (they can’t).

Why should I disagree with Pearl’s treatment of counterfactuals that don’t backtrack?

Isn’t the decision of whether or not a given counterfactual backtracks in its most “natural” interpretation largely a linguistic problem?

No, we disagree. My calculations suggest that P[X = 0 | do(Yi = 1 for all i)] = P[X = 1 | do(Yi = 1 for all i)] = 0. The intervention falls outside the region where the original joint pdf has positive mass. The intervention do(X = 1) also annihilates the original joint pdf, because there is no region of positive mass in which X = 1.

I don’t think that’s correct. My understanding of the intervention do(Yi = 1 for all i) is that it creates a disconnected graph, in which the Yi all have the value 1 (as stipulated by the intervention) but X retains its original mass function P[X = 0] = 1. The causal links from X to the Yi are severed by the intervention, so it doesn’t matter that the intervention has zero probability in the original graph, since the intervention creates a new graph. (Interventions into deterministic systems often will have zero probability in the original system, though not in the intervened one.) On the other hand, you claim to be following Pearl_2012 whereas I’ve been reading Pearl_2001 and there might have been some differences in his treatment of impossible interventions… I’ll check this out.

For now, just suppose the original distribution over X was P[X = 0] = 1 - epsilon and P[X = 1] = epsilon for a very small epsilon. Would you agree that the intervention do(Yi = 1 for all i) now is in the area of positive mass function, but still doesn’t change the distribution over X so we still have P[X = 0 | do(Yi = 1 for all i)] = 1 - epsilon and P[X = 1 | do(Yi = 1 for all i)] = epsilon?
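To make the epsilon version concrete, here is a small sketch that applies the truncated factorization by brute-force enumeration. The value of `EPS`, the small `N`, and the function names are all assumptions for illustration. It also computes plain conditioning on "all Yi = 1" for contrast.

```python
from itertools import product
from typing import Tuple

EPS = 1e-6  # the "very small epsilon"; this particular value is arbitrary
N = 3       # hypothetical small number of Y_i

P_X = {0: 1.0 - EPS, 1: EPS}


def p_y_given_x(y: int, x: int) -> float:
    """Deterministic mechanism Y_i = X."""
    return 1.0 if y == x else 0.0


def joint(x: int, ys: Tuple[int, ...]) -> float:
    """Observational joint: P(x) * prod_i P(y_i | x)."""
    p = P_X[x]
    for y in ys:
        p *= p_y_given_x(y, x)
    return p


def joint_do_all_ys(x: int, ys: Tuple[int, ...], value: int = 1) -> float:
    """Truncated factorization for do(Y_i = value for all i).

    Every factor P(y_i | x) is dropped, and states whose y_i disagree with
    the stipulated value get probability zero.
    """
    if any(y != value for y in ys):
        return 0.0
    return P_X[x]


states = [(x, ys) for x in (0, 1) for ys in product((0, 1), repeat=N)]
all_ones = (1,) * N

# Intervening: P[X = 1 | do(Yi = 1 for all i)]
p_x1_do = sum(joint_do_all_ys(x, ys) for x, ys in states if x == 1)

# Conditioning: P[X = 1 | Yi = 1 for all i]
p_all_ones = sum(joint(x, ys) for x, ys in states if ys == all_ones)
p_x1_given = joint(1, all_ones) / p_all_ones

print(p_x1_do)     # equals EPS
print(p_x1_given)  # 1.0
```

Under this sketch the intervention leaves the distribution over X exactly where it started, while conditioning on the same event moves all the mass to X = 1.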

Isn’t the decision of whether or not a given counterfactual backtracks in its most “natural” interpretation largely a linguistic problem?

I still think it’s a conceptual analysis problem rather than a linguistic problem. However perhaps we should play the taboo game on “linguistic” and “conceptual” since it seems we mean different things by them (and possibly what you mean by “linguistic” is close to what I mean by “conceptual” at least where we are talking about concepts expressed in English).

You seem to be done, so I won’t belabor things further; I just want to point out that I didn’t claim to have a more updated copy of Pearl (in fact, I said the opposite). I doubt there’s been any change to his algorithm.

All this ASCII math is confusing the heck out of me, anyway.

EDIT: Oh, dear. I see how horribly wrong I was now. The version of the formula I was looking at said “(formula) for (un-intervened variables) consistent with (intervention), and zero otherwise” and because it was a deterministic system my mind conflated the two kinds of consistency. I’m really sorry to have blown a lot of your free time on my own incompetence.

Thanks for that… You just saved me a few hours of additional research on Pearl to find out whether I’d got it wrong (and misapplied the calculus for interventions that are impossible in the original system)!

Incidentally, I’m quite a fan of Pearl’s work, and think there should be ways to adjust the calculus to allow reasonable backtracking counterfactuals as well as forward-tracking ones (i.e. ways to find a minimal intervention further back in the graph, one which then makes the antecedent come out true). But that’s probably worth a separate post, and I’m not ready for it yet.
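As a purely illustrative guess at what "find a minimal intervention further back in the graph" could look like on the toy model, here is a brute-force sketch. It is not part of Pearl's calculus and not a worked-out proposal: it assumes a small `N`, it only tries setting variables to 1, it takes "minimal" to mean "fewest intervened variables", and the function names are invented.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

N = 5  # hypothetical small number of Y_i
VARIABLES: List[str] = ["X"] + [f"Y_{i}" for i in range(1, N + 1)]


def run_model(do: Dict[str, int]) -> Dict[str, int]:
    """Evaluate X = 0, Y_i = X, with surgery for any variable named in `do`."""
    values = {"X": do.get("X", 0)}
    for i in range(1, N + 1):
        values[f"Y_{i}"] = do.get(f"Y_{i}", values["X"])
    return values


def antecedent_holds(values: Dict[str, int]) -> bool:
    """The antecedent Z = 1 holds exactly when every Y_i equals 1."""
    return all(values[f"Y_{i}"] == 1 for i in range(1, N + 1))


def smallest_intervention() -> Optional[Tuple[str, ...]]:
    """Search subsets by increasing size; return the first that yields Z = 1."""
    for size in range(1, len(VARIABLES) + 1):
        for subset in combinations(VARIABLES, size):
            if antecedent_holds(run_model({name: 1 for name in subset})):
                return subset
    return None


print(smallest_intervention())  # ('X',)
```

On this model the search picks the single upstream change to X over the N separate changes to the Y_i. Exhaustive search obviously does not scale, and a serious version would need a principled notion of "minimal" and of "how far back".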


“Bayesianism and Causality, Or, Why I am only a Half-Bayesian”.

As a (mostly irrelevant) side note, this is Pearl_2001, who is a very different person from Pearl_2012.

Also, there is no “paradox” within Pearl’s calculus here: it is internally consistent.

I’m using the word paradox in the sense of “puzzling conclusion”, not “logical inconsistency.” Hence “apparent paradox”, which can’t make sense in the context of the latter definition.

It is a bit unfortunate, because while the calculus is elegant in its own terms, it does appear that conceptual analysis is what Pearl was attempting. He really did intend his “do” calculus to reflect how we usually understand counterfactuals, only stated more precisely. Pearl was not consciously proposing a “revisionist” account to the effect: “This is how I’m going to define counterfactuals for the sake of getting some math to work. If your existing definition or intuition about counterfactuals doesn’t match that definition, then sorry, but it still won’t affect my definition.”

The human causal algorithm is frequently, horrifically, wrong. A theory that attempts to model it is, I heavily suspect, less accurate than Pearl’s theory as it stands, at least because it will frequently prefer to use the post hoc inference when it is more appropriate to infer a mutual cause.

Accordingly, it doesn’t help to say “Regular intuitions say one thing, Pearl’s calculus says another, but the calculus is better, therefore the calculus is right and intuitions are wrong”. You can get away with that in revisionist accounts/definitions but not in regular conceptual analysis.

No, I didn’t say that. In my earlier comments I wondered under what conditions the “natural” interpretation of counterfactuals was preferable. If regular intuition disagrees with Pearl, there are at least two possibilities: intuition is wrong (i.e., a bias exists) or Pearl’s calculus does worse than intuition, which means the calculus needs to be updated. In a sense, the calculus is already a “revisionist” account of the human causal learning algorithm, though I disapprove of the connotations of “revisionist” and believe they don’t apply here.

But there is NO causal link in the opposite direction from the Yi to the X, or from any Yi to any Yj. The causal graph is directed, and the structural equations are asymmetric.

Yes, but my question here was whether or not the graph model was accurate. Purely deterministic graph models are weird in that they are observationally equivalent not just with other graphs with the same v-structure, but with any graph with the same skeleton, and even worse, one can always add an arrow connecting the ends of any path. I understand better now that the only purpose behind a deterministic graph model is to fix one out of this vast set of observationally equivalent models. I was confused by the plethora of observationally equivalent deterministic graph models.

Incidentally, if you want my one sentence view, I’d say that Pearl is correctly analysing a certain sort of counterfactual but not the general sort he thinks he is analysing. Consider these two counterfactuals:

If A were to happen, then B would happen.

If A were to be made to happen (by outside intervention) then B would happen.

As far as I can tell, the first is given by P(B | A), and the second is P(B_A). Am I missing something really fundamental here?

I’ve done the calculations for your model, but I’m going to put them in a different comment to separate out mathematical issues from philosophical ones. This comment is already too long.

Couple of points. You say that “the human causal algorithm is frequently, horrifically, wrong”.

But remember here that we are discussing the human counterfactual algorithm, and my understanding of the experimental evidence re counterfactual reasoning (e.g. on cases like Kennedy or Gore) is that it is pretty consistent across human subjects, and between “naive” subjects (taken straight off the street) vs “expert” subjects (who have been thinking seriously about the matter). There is also quite a lot of consistency on what constitutes a “plausible” versus a “far out” counterfactual, and a much stronger sense about what happens in the cases with plausible antecedents than in cases with weird antecedents (such as what Caesar would have done if fighting in Korea). It’s also interesting that there are rather a lot of formal analyses which almost match the human algorithm, but not quite, and that there is quite a lot of consensus on the counter examples (that they genuinely are counter examples, and that the formal analysis gets it wrong).

What pretty much everyone agrees is that counterfactuals involving macro-variable antecedents assume some back-tracking before the time of the antecedent, and that a small micro-state change to set up the antecedent is more plausible than a sudden macro-change which involves breaks across multiple micro-states.

And on your other point, simple conditioning P(B | A) gives results more like the indicative conditional (“If Oswald did not shoot Kennedy, then someone else did”) rather than the counterfactual conditional (“If Oswald had not shot Kennedy, then no-one else would have”).

A small comment… Pearl’s treatment works fine for “forward-tracking” counterfactuals, where the only allowed changes in the counterfactual world are in the future of the change (i.e. after the point of surgery). However, regular counterfactuals require a bit of “back-tracking” to make the counterfactual scenario plausible in the first place.

Consider these statements:

“If Gore had been president during 9/11 he’d have reacted differently, and the US wouldn’t have invaded Iraq”

“If Gore had been president during 9/11 then he’d have been sworn in as president back in January 2001”

“If Gore had been president during 9/11 he’d have been blinking in shock at having suddenly teleported to Air Force One, and wondering why everyone was calling him Mr President”

“If Gore had been president during 9/11 he’d have been recently sworn in as president, owing to changing party, Dick Cheney dying of a heart attack, Bush appointing him as vice-president, and then himself dying of shock”.

Statement 1 is a forward-tracking counterfactual and arguably true. But the usual way of understanding it to be true assumes that 2 is also true, which is a back-tracking counterfactual (it involves re-writing approximately a year of history before 9/11).

Statement 3 is the only statement consistent with no back-tracking at all, and corresponds to Pearl’s approach of performing surgery on the causal graph at 9/11. (This is generally physically impossible, or at least physically absurd, since it involves tearing the graph apart and inserting a new state with no causal relation to the previous states.)

Statement 4 is an odd sort of compromise; it’s at least not physically impossible or absurd, and involves the bare minimum of back-tracking (to a political crisis a few days before 9/11). But it is clearly not the best way to understand the counterfactual.

My feeling is this is only a problem with expressing counterfactuals in English. If one did have a causal model of American history, 2000-2001, and one wanted to implement counterfactual 1, performing surgery at 9/11 would be unsound, for the reasons you state. The joint probability of the ancestors of 9/11 after such a transformation would all be very small indeed, relative to whatever vastly improbable events were necessary to transition Al Gore circa

^{9}⁄_{10}to president during 9/11.Is this an actual limitation of the calculus, though? Are there counterfactuals that are well-posed, but require an indefinite amount of “back-tracking”?

I’m not sure the problem is with English…

The issue arises whenever we have a causal model with a large number of micro-states, and the antecedent of a counterfactual can only be realised in worlds which change lots of different micro-states. The most “natural” way of thinking about the counterfactual in that case is still to make a minimal change (to one single micro state e.g. a particle decaying somewhere, or an atom shifting an angstrom somewhere) and to make it sufficiently far back in time to make a difference. (In the Gore case, in the brain of whoever thought up the butterfly ballot, or perhaps in the brain of a justice of the Supreme Court.) The problem with Pearl’s calculus though is that it doesn’t do that.

Here’s a toy model to demonstrate (no English). Consider the following set of structural equations (among Boolean micro state variables):

X = 0

Y_1 = X, Y_2 = X, …, Y_10^30 = X

The model is deterministic so P[X = 0] = 1.

Next we define a “macro-state” variable Z := (Y_1 + Y

2 + … + Y10^30) / 10^30. Plainly in the actual outcome Z = 0 and indeed P[Z = 0] = 1.But what if Z

wereequal to 1?My understanding of Pearl’s semantics is that to evaluate this we have to intervene i.e. do(Z = 1) and this is equivalent to the multi-point intervention do(Y_1 = 1 & Y_2 = 1 & … & Y_10^30 = 1). This is achieved by replacing every structural equation between X and Y_i by the static equation Y_i = 1.

Importantly, it is

NOTachieved by the single-point intervention X = 1, even though that is probably the most “natural” way to realise the counterfactual. So in Pearl’s notation, we must have ~X _ (Z = 1) or in probabilistic terms P[X = 0 | do(Z = 1)] = 1. Which, to be frank, seems wrong.And we can’t “fix” this in Pearl’s semantics by choosing the alternative surgery (X = 1) because if P[X = 1 | do(Z = 1)] = 1 that would imply in Pearl’s semantics that X is caused by the Yi, rather than the other way round, which is clearly wrong since it contradicts the original causal graph. Worse, even if we introduce some ambiguity, saying that X

mightchange under the intervention do(Z = 1), then we will still have P[X = 1 | do(Z = 1)] > 0 = P[X = 1] and this is enough to imply a probabilistic causal link from the Y_i to X which is still contrary to the causal graph.So I think this is a case where Pearl’s analysis gets it wrong.

Before I analyze this apparent paradox in any depth, I want to be sure I understand your criticism. There are three things about this comment on which I am unclear.

1.) The number of states cannot be relevant to the paradox from a theoretical standpoint, because nothing in Pearl’s calculus depends on the number of states. If this does pose a problem, it only poses a problem in so far as it creates an apparent paradox, that is, whatever algorithm humans use to parse the counterfactual “What if Z were 1?” is different from the Pearl’s calculus. A priori, this is not a dealbreaker, unless it can also be shown the human algorithm does better.

2.) If Yi = X, then there is a causal link between Yi and X. Indeed, there is a causal link between every X and every Yi. Conditioning on any of the Yi immediately fixes the value of every other variable.

3.) You say the problem isn’t with English, but then talk about “the most natural way to realize a counterfactual.” I don’t know what that means, other than as an artifact of the human causal learning algorithm.

Or am I misunderstanding you completely?

Thanks for taking the time to think/comment. It may help us to fix a reference which describes Pearl’s thinking and his calculus. There are several of his papers available online, but this one is pretty comprehensive: ftp://ftp.cs.ucla.edu/pub/stat_ser/r284-reprint.pdf “Bayesianism and Causality, Or, Why I am only a Half-Bayesian”.

Now onto your points:

1) You are correct that nothing in Pearl’s calculus varies depending on the number of variables Yi which causally depend on X. For any number of Yi, the intervention do(Z = 1) breaks

allthe links between X and the Yi and doesn’t change the vale of X at all. Also, there is no “paradox” within Pearl’s calculus here: it is internally consistent.The real difficulty is that the calculus just doesn’t work as a full conceptual analysis of counterfactuals, and this becomes increasingly clear the more Yi variables we add. It is a bit unfortunate, because while the calculus is elegant in its own terms, it does appears that conceptual analysis is what Pearl was attempting. He really did intend his “do” calculus to reflect how we usually understand counterfactuals, only stated more precisely. Pearl was not consciously proposing a “revisionist” account to the effect: “This is how I’m going to define counterfactuals for the sake of getting some math to work. If your existing definition or intuition about counterfactuals doesn’t match that definition, then sorry, but it still won’t affect my definition.” Accordingly, it doesn’t help to say “Regular intuitions say one thing, Pearl’s calculus says another, but the calculus is better, therefore the calculus is right and intuitions are wrong”. You can get away with that in revisionist accounts/definitions but not in regular conceptual analysis.

2) The structural equations do indeed imply there is a causal link from the X to the Yi. But there is NO causal link in the opposite direction from the Yi to the X, or from any Yi to any Yj. The causal graph is directed, and the structural equations are asymmetric. Note that in Pearl’s models, the structural equation Yi = X is

differentfrom the reverse structural equation X = Yi, even though in regular logic and probability theory these are equivalent. This point is really quite essential to Pearl’s treatment, and is made clear by the referenced document.3) See point 1. Pearl’s calculus is trying to analyse counterfactuals (and causal relations) as we usually understand them, not to propose a revisionist account. So evidence about how we (naturally) interpret counterfactuals (in both the Gore case and the X, Y case) is entirely relevant here.

Incidentally, if you want my one sentence view, I’d say that Pearl is correctly analysing a certain sort of counterfactual but not the general sort he thinks he is analysing. Consider these two counterfactuals:

If A were to happen, then B would happen.

If A were to be made to happen (by outside intervention) then B would happen.

I believe that these are different counterfactuals, with different antecedents, and so they can have different truth values. It looks to me like Pearl’s “do” calculus correctly analyses the second sort of counterfactual, but not the first.

(Edited this comment to fix typos and a broken reference.)

Okay. So according to Causality (first edition, ’cause I’m poor), Theorem 7.1.7, the algorithm for calculating the counterfactual P( (Y = y)_(X = x) | e), which represents the statement “If X were x, then Y would be y, given evidence e”, has three stages:

Abduction; use the probability distribution P(x, y| E = e).

Action; perform do(X = x).

Calculate p(Y = y) relative to the new graph model and its updated joint probability distribution.
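The three stages above can be sketched in code. This is only an illustrative sketch on a hypothetical two-variable model (X := U_X, Y := X xor U_Y, with made-up priors on the exogenous variables); it is not the model discussed in this thread, and the function name `counterfactual` is an assumption. Abduction is implemented in the usual way, as updating the distribution over the exogenous variables given the evidence.

```python
from itertools import product

# Hypothetical toy structural model (an assumption, not the thread's model):
#   X := U_X
#   Y := X xor U_Y
P_UX = {0: 0.9, 1: 0.1}   # assumed prior over exogenous U_X
P_UY = {0: 0.8, 1: 0.2}   # assumed prior over exogenous U_Y


def f_x(u_x):
    return u_x


def f_y(x, u_y):
    return x ^ u_y


def counterfactual(x_cf, y_query, evidence):
    """Return P(Y_{X = x_cf} = y_query | evidence) for the toy model."""
    # Stage 1 (abduction): weight each exogenous setting by whether the
    # factual world it produces agrees with the evidence.
    posterior = {}
    for u_x, u_y in product(P_UX, P_UY):
        x = f_x(u_x)
        factual = {"X": x, "Y": f_y(x, u_y)}
        if all(factual[name] == value for name, value in evidence.items()):
            posterior[(u_x, u_y)] = P_UX[u_x] * P_UY[u_y]
    total = sum(posterior.values())
    if total == 0:
        raise ValueError("evidence has zero probability in this model")

    # Stage 2 (action): replace the equation for X by the constant x_cf.
    # Stage 3 (prediction): recompute Y under the modified equations.
    prob = 0.0
    for (_u_x, u_y), weight in posterior.items():
        if f_y(x_cf, u_y) == y_query:
            prob += weight / total
    return prob


# Having observed X = 0 and Y = 0, would Y have been 1 had X been 1?
print(counterfactual(1, 1, {"X": 0, "Y": 0}))
# The same query with no evidence to condition on.
print(counterfactual(1, 1, {}))
```

With evidence, abduction pins down U_Y, so the answer is fully determined; with no evidence, abduction leaves the prior untouched, which matches the remark that abduction "does nothing" when there is nothing to condition on.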

In our specific case, we want to calculate P (X = 0_(Z = 1)). There’s no evidence to condition on, so abduction does nothing.

To perform do(Z = 1), we delete every arrow pointing from the Yi’s to Z. The new probability distribution, p(x, yi | do(Z = 1)) is given by p(x, yi, 1) when z = 1 and zero otherwise. Since the original probability distribution assigned probability one only to the state (x = 0, yi = 0, z = 0), the new probability distribution is uniformly zero.

I now no longer follow your calculation of P(X=0_(Z=1)). In particular:

The intervention do(Z = 1) does not manipulate the Yi. The formula I used to calculate p(X = 0 | do(Z = 1)) is the truncated factorization formula given in section 3.2.3.

I suddenly wish I had sat down and calculated this out first, rather than argue from principles. I hear my mother’s voice in the background telling me to “do the math,” as is her habit.

You missed the point here that Z is a “macro-state” variable, which is defined to be the average of the Yi variables. It is not actually a separate variable on the causal graph, and it is not caused by the Yi variables. This means that the intervention do(Z = 1) can only be realised on the causal graph by do(Y1 = 1, Y2 = 1, …, Y_10^30 = 1) which was what I stated a few posts ago. You are correct that the abduction step is not needed as this is a deterministic example.

Then why is P( X = 1 | do(Yi = 1) ) = 1? If I delete from the graph every arrow entering each Yi, I’m left with a graph empty of edges; the new joint pdf is still uniformly zero.

In Pearl’s calculus, it isn’t!

If you look back at my above posts, I deduce that in Pearl’s calculus we will get P[X = 0 | do (Z = 1)] = P[X = 0 | do(Yi = 1 for all i)] = 1. We agree here with what Pearl’s calculus says.

The problem is that the counterfactual interpretation of this is “If the average value of the Yi were 1, then X would have been 0”. And that seems plain implausible as a counterfactual. The much more plausible counterfactual backtracks to change X, allowing all the Yi to change to 1 through a single change in the causal graph, namely “If the average value of the Yi were 1, then X would have been 1”.

Notice the analogy to the Gore counterfactual. If Gore were president on 9/11, he wouldn’t suddenly have become president (the equivalent of a mass deletion of all the causal links to the Yi). No, he would have been president since January, because of a micro-change the previous Fall (equivalent to a backtracked change to the X). I believe you agreed that the Gore counterfactual needs to backtrack to make sense, so you agree with backtracking in principle? In that case, you should disagree with the Pearl treatment of counterfactuals, since they never backtrack (they can’t).

No, we disagree. My calculations suggest that P[X = 0 | do(Yi = 1 for all i)] = P[X = 1 | do(Yi = 1 for all i)] = 0. The intervention falls outside the region where the original joint pdf has positive mass. The intervention do(X = 1) also annihilates the original joint pdf, because there is no region of positive mass in which X = 1.

I still don’t understand why you don’t think the problem is a language problem. Pearl’s counterfactuals have a specific meaning, so of course they don’t mean something else from what they mean, even if the other meaning is a more plausible interpretation of the counterfactual (again, whatever that means—I’m still not sure what “more plausible” is supposed to mean theoretically).

I think the problem is that when you intervene to make something impossible happen, the resulting system no longer makes sense.

Yes. (I assume you mean “If Gore was president during 9/11, he wouldn’t have invaded Iraq.”)

Why should I disagree with Pearl’s treatment of counterfactuals that don’t backtrack?

Isn’t the decision of whether or not a given counterfactual backtracks in its most “natural” interpretation largely a linguistic problem?

I don’t think that’s correct. My understanding of the intervention do(Yi = 1 for all i) is that it creates a disconnected graph, in which the Yi all have the value 1 (as stipulated by the intervention) but X retains its original mass function P[X = 0] = 1. The causal links from X to the Yi are severed by the intervention, so it doesn’t matter that the intervention has zero probability in the original graph, since the intervention creates a new graph. (Interventions into deterministic systems often will have zero probability in the original system, though not in the intervened one.) On the other hand, you claim to be following Pearl_2012 whereas I’ve been reading Pearl_2001 and there might have been some differences in his treatment of impossible interventions… I’ll check this out.
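For reference, here is how the truncated factorization formula works out for this model, as a sketch. The symbol N for the number of Yi and the notation pa_i for the parents of V_i are labels introduced here; the graph assumed is the one described in the thread, with X as the only parent of each Yi.

```latex
% Truncated factorization for an intervention on a set of variables S:
P(v_1,\dots,v_n \mid do(S = s)) =
\begin{cases}
  \prod_{i \,:\, V_i \notin S} P(v_i \mid pa_i) & \text{if } v \text{ is consistent with } S = s,\\
  0 & \text{otherwise.}
\end{cases}

% In this model the original joint is
% P(x, y_1, \dots, y_N) = P(x) \prod_{i=1}^{N} P(y_i \mid x).
% Intervening on every Y_i drops every factor P(y_i \mid x), leaving only P(x):
P(x, y_1,\dots,y_N \mid do(Y_i = 1 \ \forall i)) =
\begin{cases}
  P(x) & \text{if } y_i = 1 \text{ for all } i,\\
  0 & \text{otherwise.}
\end{cases}

% Summing over the y_i gives the marginal on X:
P(X = 0 \mid do(Y_i = 1 \ \forall i)) = P(X = 0) = 1 .
```

The "consistent with" condition constrains only the values of the intervened variables, so it is satisfied here even though the state (X = 0, all Yi = 1) has zero probability in the original system.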

For now, just suppose the original distribution over X was P[X = 0] = 1 - epsilon and P[X = 1] = epsilon for a very small epsilon. Would you agree that the intervention do(Yi = 1 for all i) now is in the area of positive mass function, but still doesn’t change the distribution over X so we still have P[X = 0 | do(Yi = 1 for all i)] = 1 - epsilon and P[X = 1 | do(Yi = 1 for all i)] = epsilon?

I still think it’s a conceptual analysis problem rather than a linguistic problem. However perhaps we should play the taboo game on “linguistic” and “conceptual” since it seems we mean different things by them (and possibly what you mean by “linguistic” is close to what I mean by “conceptual” at least where we are talking about concepts expressed in English).

Thanks anyway.

You seem to be done, so I won’t belabor things further; I just want to point out that I didn’t claim to have a more updated copy of Pearl (in fact, I said the opposite). I doubt there’s been any change to his algorithm.

All this ASCII math is confusing the heck out of me, anyway.

EDIT: Oh, dear. I see how horribly wrong I was now. The version of the formula I was looking at said “(formula) for (un-intervened variables) consistent with (intervention), and zero otherwise” and because it was a deterministic system my mind conflated the two kinds of consistency. I’m really sorry to have blown a lot of your free time on my own incompetence.

Thanks for that… You just saved me a few hours additional research on Pearl to find out whether I’d got it wrong (and misapplied the calculus for interventions that are impossible in the original system)!

Incidentally, I’m quite a fan of Pearl’s work, and think there should be ways to adjust the calculus to allow reasonable backtracking counterfactuals as well as forward-tracking ones (i.e. ways to find a minimal intervention further back in the graph, one which then makes the antecedent come out true). But that’s probably worth a separate post, and I’m not ready for it yet.

As a (mostly irrelevant) side note, this is Pearl_2001, who is a very different person from Pearl_2012.

I’m using the word paradox in the sense of “puzzling conclusion”, not “logical inconsistency.” Hence “apparent paradox”, which can’t make sense in the context of the latter definition.

The human causal algorithm is frequently, horrifically, wrong. A theory that attempts to model it is, I heavily suspect, less accurate than Pearl’s theory as it stands, at least because it will frequently prefer to use the post hoc inference when it is more appropriate to infer a mutual cause.

No, I didn’t say that. In my earlier comments I wondered under what conditions the “natural” interpretation of counterfactuals was preferable. If regular intuition disagrees with Pearl, there are at least two possibilities: intuition is wrong (i.e., a bias exists) or Pearl’s calculus does worse than intuition, which means the calculus needs to be updated. In a sense, the calculus is already a “revisionist” account of the human causal learning algorithm, though I disapprove of the connotations of “revisionist” and believe they don’t apply here.

Yes, but my question here was whether or not the graph model was accurate. Purely deterministic graph models are weird in that they are observationally equivalent not just with other graphs with the same v-structure, but with any graph with the same skeleton, and even worse, one can always add an arrow connecting the ends of any path. I understand better now that the only purpose behind a deterministic graph model is to fix one out of this vast set of observationally equivalent models. I was confused by the plethora of observationally equivalent deterministic graph models.

As far as I can tell, the first is given by P(B | A), and the second is P(B_A). Am I missing something really fundamental here?

I’ve done the calculations for your model, but I’m going to put them in a different comment to separate out mathematical issues from philosophical ones. This comment is already too long.

Couple of points. You say that “the human causal algorithm is frequently, horrifically, wrong”.

But remember here that we are discussing the human counterfactual algorithm, and my understanding of the experimental evidence re counterfactual reasoning (e.g. on cases like Kennedy or Gore) is that it is pretty consistent across human subjects, and between “naive” subjects (taken straight off the street) vs “expert” subjects (who have been thinking seriously about the matter). There is also quite a lot of consistency on what constitutes a “plausible” versus a “far out” counterfactual, and a much stronger sense about what happens in the cases with plausible antecedents than in cases with weird antecedents (such as what Caesar would have done if fighting in Korea). It’s also interesting that there are rather a lot of formal analyses which almost match the human algorithm, but not quite, and that there is quite a lot of consensus on the counterexamples (that they genuinely are counterexamples, and that the formal analysis gets it wrong).

What pretty much everyone agrees is that counterfactuals involving macro-variable antecedents assume some back-tracking before the time of the antecedent, and that a small micro-state change to set up the antecedent is more plausible than a sudden macro-change which involves breaks across multiple micro-states.

And on your other point, simple conditioning P(B | A) gives results more like the indicative conditional (“If Oswald did not shoot Kennedy, then someone else did”) rather than the counterfactual conditional (“If Oswald had not shot Kennedy, then no-one else would have”).
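To make the gap between conditioning and intervening concrete, here is a small sketch on the X and Yi model from earlier in the thread. The values N = 3 and EPS = 0.01 are assumptions chosen so the states can be enumerated (the thread uses 10^30 variables), and the function names are hypothetical. It only contrasts P(· | A) with P(· | do(A)); it does not attempt to model the back-tracking reading.

```python
from itertools import product

N = 3        # assumed number of Y_i, kept tiny so we can enumerate
EPS = 0.01   # assumed small prior probability that X = 1

P_X = {0: 1 - EPS, 1: EPS}


def p_y_given_x(y_i, x):
    # Structural equation Y_i = X, written as a degenerate conditional.
    return 1.0 if y_i == x else 0.0


def joint(x, ys):
    # Original factorization: P(x) * prod_i P(y_i | x).
    p = P_X[x]
    for y_i in ys:
        p *= p_y_given_x(y_i, x)
    return p


def p_x_given_all_y(x, value):
    """Ordinary conditioning: P(X = x | Y_i = value for all i)."""
    ys = (value,) * N
    denom = sum(joint(x_other, ys) for x_other in P_X)
    return joint(x, ys) / denom


def p_x_do_all_y(x, value):
    """Intervention: P(X = x | do(Y_i = value for all i)).

    Truncated factorization drops every factor P(y_i | x), so only P(x)
    remains, and only states consistent with the intervention have mass.
    """
    total = 0.0
    for ys in product((0, 1), repeat=N):
        if all(y_i == value for y_i in ys):
            total += P_X[x]
    return total


print(p_x_given_all_y(1, 1))  # conditioning infers backwards from the Y_i to X
print(p_x_do_all_y(1, 1))     # intervening leaves X at its prior
```

Conditioning treats the Yi as evidence about X and so concludes X = 1, while the intervention cuts the arrows into the Yi and leaves the distribution over X untouched.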

Granted. I’m a mathematician, not a cognitive scientist.