That’s a clear outline of the theory. I just want to note that the theory itself makes some assumptions about possible patterns of causation, even before you begin to select which causal graphs are plausible candidates for testing. Pearl himself stresses that without putting causal information in, you can’t get causal information out from purely observational data.
For example, if overweight causes lack of exercise and lack of exercise causes overweight, you don’t have an acyclic graph. Acyclicity of causation is one of the background assumptions here. Acyclicity of causation is reasonable when talking about point events in a universe without time-like loops. However, “weight” and “exercise level” are temporally extended processes, which makes acyclicity a strong assumption.
First, notice a situation that occurs many times. Then pay attention to the ways in which things are different from one iteration to the next. At this point, and here is where causal information begins, if some of the variables represent your own behavior, you can systematically intervene in the situation by changing those behaviors. For cleanest results, contrive a controlled experiment that is analogous to the original situation.
In short, you insert causal information by intervening.
This of course requires you to construct a reference class of situations that are substantially similar to one another, but humans seem to be pretty good at that within our domains of familiarity.
By the way, thank you for explaining the underlying assumption of acyclicity. I’ve been trying to internalize the math of causal calculus and it bugged me that cyclic causes weren’t allowed. Now I understand that it is a simplification and that the calculus isn’t quite as powerful as I thought.
I don’t have an answer to my own koan, but this was one of the possibilities that I thought of:
In short, you insert causal information by intervening.
But how does one intervene? By causing some variable to take some value, while obstructing the other causal influences on it. So causal knowledge is already required before one can intervene. This is not a trivial point—if the knowledge is mistaken, the intervention may not be successful, as I pointed out with the example of trying to warm a room thermostat by placing a candle near it.
Causal knowledge is required to ensure success, but not to stumble across it. Over time, noticing (or stumbling across if you prefer) relationships between the successes stumbled upon can quickly coalesce into a model of how to intervene. Isn’t this essentially how we believe causal reasoning originated? In a sense, all DNA is information about how to intervene that, once stumbled across, persisted due to its efficacy.
I think that one bootstraps the process with contrived situations designed to appeal to one’s intuitions. For example, one attempts to obtain causal information through a randomised controlled trial. You mark the obverse face of a coin “treatment” and reverse face “control” and toss the coin to “randomly” assign your patients.
Let us briefly consider the absolute zero of no a priori knowledge at all. Perhaps the coin knows the prognosis of the patient and comes down “treatment” for patients with a good prognosis, intending to mislead you into believing that the treatment is the cause of good outcomes. Maybe, maybe not. Let’s stop considering this because insanity is stalking us.
We are willing to take a stand. We know enough, a priori, to choose and operate a randomisation device and thus obtain a variable which is independent of all the others and causally connected to none of them. We don’t prove this, we assume it. When we encounter a compulsive gambler, who believes that Lady Luck is fickle and is very likely messing with us via the coin, we just dismiss his hypothesis. Life is short; one has to assume that certain obvious things are actually true in order to get started, and work up from there.
My answer: Attributing causation is part of our human instincts. We are born with some desire to do it. We may develop that skill by reflecting on it during our lifetime.
(How did we humans develop that instinct? Evolution, probably. Humans who had mutated to reason about causality died less – for instance, they might have avoided drinking from a body of water after seeing something poisonous put in, because they reasoned that the poison addition would cause the water to be poisonous.)
There isn’t any better explanation. If you don’t accept the idea of causality as given, you can never explain anything. Roryokane is using causality to explain how causality originated. That’s not a good way to prove how causality works, but it is a good way of understanding why causality exists, or rather of accepting that we can never prove it exists.
Our instincts are just wired to interpret causality that way, and that makes it a brute fact. You might as well claim that calling a certain color yellow and then saying it looks yellow as a result of human nature is a non-explanation, you might be technically right to do so but in that case then you’re asking for answers you’re never actually going to get.
That would be a non-explanation, but a better explanation is in fact possible. You can look at the way that light is turned into neural signals by the eye, and discover the existence of red-green, blue-yellow, and light-dark axes, and there you have physiological justification for six of our basic colour words. (I don’t know just how settled that story is, but it’s settled enough to be literally textbook stuff.)
So, that is what a real explanation looks like. Attributing anything to “human nature” is even more wrong than attributing it to “God”. At least we have some idea of what “God” would be if he existed, but “human nature” is a blank, a label papering over a void. How do Sebastian Thrun’s cars drive themselves? Because he has integrated self-driving into their nature. How does opium produce sleep? By its dormitive nature. How do humans distinguish colours? By their human nature.
But causality is uniquely impervious to that kind of explanation. You can explain why humans believe in causality in a physiological sense, but I didn’t think that is what you were asking for. I thought you were asking for some overall metaphysical justification for causality, and there really isn’t any. Causal reasoning works because it works; there’s no other justification to be had for it.
Pearl himself stresses that without putting causal information in, you can’t get causal information out from purely observational data.
Where do you get this? My recall of Causality is that he specifically rejected the “no causes in, no causes out” view in favor of the “Occam’s Razor in, some causes out” view.
Yes, the Occamian view is in his book in section 2.3 (and still in the 2009 2nd edition). But that definition of “inferred causation”—those arrows common to all causal models consistent with the statistical data—depends on general causal assumptions, the usual ones being the DAG, Markov, and Faithfulness properties.
Everywhere, he defines causation in terms of counterfactuals: claims about what would have happened had something been different, which, he says, cannot be expressed in terms of statistical distributions over observational data. Here is a long interview (both audio and text transcript) in which he recounts the whole course of his work.
In other places, for example: “Causal inference in statistics: An overview”, which is in effect the Cliff Notes version of his book, he writes:
“one cannot substantiate causal claims from associations alone, even at the population level—behind every causal conclusion there must lie some causal assumption that is not testable in observational studies.”
Here is a similar survey article from 2003, in which he writes that exact sentence, followed by:
“Nancy Cartwright (1989) expressed this principle as “no causes in, no causes out”, meaning that we cannot convert statistical knowledge into causal knowledge.”
Interesting, but how do those files evade word searches for the parts you’ve quoted?
Dunno, not all PDFs are searchable and not all PDF viewers fail to make a pig’s ear of searching. The quotes can be found on p.99 (the third page of the file) and pp.284-285 (6th-7th pages of the file) respectively.
Btw, Scott Aaronson just recently posted the question of whether you would care about causality if you could only work with observational data (someone already linked this article in the comments) and I put up a comment with my summary of the LW position (plus some complexity-theoretic considerations).
I don’t think that Bayesian networks implicitly contain the concept of causality.
Formally, a probability distribution is represented by a Bayesian network if it can be factored as a product of P(node | node’s parents). But this is not unique: given one network, you can create lots of other networks which also represent the same distribution, e.g. by changing the direction of arrows as long as the independence properties from the graph stay the same (the graphs A → B → C and A ← B ← C can represent exactly the same class of probability distributions). Pearl distinguishes Bayesian networks from causal networks, which are Bayesian networks in which the arrows point in the direction of causality.
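To make the non-uniqueness concrete, here is a small sketch. The probability tables are made-up numbers chosen only for illustration; the point is that a joint built from the A → B → C factorization is reproduced exactly by the A ← B ← C factorization once the reversed tables are read off the same joint.

```python
from itertools import product

# Hypothetical conditional probability tables for binary A, B, C (chain A -> B -> C).
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # p_c_given_b[b][c]

# Joint distribution from the forward factorization P(a) P(b|a) P(c|b).
joint = {
    (a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
    for a, b, c in product((0, 1), repeat=3)
}


def marginal(joint, keep):
    """Sum the joint over every variable whose position is not in `keep`."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


# Marginals needed to build the tables for the reversed chain A <- B <- C.
p_c = marginal(joint, (2,))
p_b = marginal(joint, (1,))
p_bc = marginal(joint, (1, 2))
p_ab = marginal(joint, (0, 1))


def reversed_factorization(a, b, c):
    """P(c) P(b|c) P(a|b), the factorization that A <- B <- C prescribes."""
    p_b_given_c = p_bc[(b, c)] / p_c[(c,)]
    p_a_given_b = p_ab[(a, b)] / p_b[(b,)]
    return p_c[(c,)] * p_b_given_c * p_a_given_b


max_gap = max(abs(joint[k] - reversed_factorization(*k)) for k in joint)
print(f"largest difference between the two factorizations: {max_gap:.2e}")
```

Both graphs describe the same numbers, so nothing in the distribution itself picks out one arrow direction over the other.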
And of course, there are other sparse representations like Markov networks, which also incorporate independence assumptions but are undirected.
The non-uniqueness doesn’t make causality absent or irrelevant; it just means there are multiple minimal representations that use causality. The causality arises in how your node connections are asymmetric. If the relativity of simultaneity (observers seeing the same events in a different time order) doesn’t obviate causality, neither should the existence of multiple causal networks.
There are indeed equivalent models that use purely symmetric node connections (or none at all in the case of the superexponential pairwise conditional independence table across all variables), but (correct me if I’m wrong) by throwing away the information graphically represented by the arrows, you no longer have a maximally efficient encoding of the joint probability distribution (even though it’s certainly not as bad as the superexponential table).
First, authors like Pearl do not use “causality” to mean just that there is a directed edge in a Bayesian network (i.e. that certain conditional independence properties hold). Rather, he uses it to mean that the model describes what happens under interventions. One can see the difference by comparing Rain → WetGrass with WetGrass → Rain (which are equivalent as Bayesian networks). Of course, maybe he is confused and the difference will dissolve under more careful consideration, but I think this shows one should be careful in claiming that Bayes networks encode our best understanding of causality.
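A rough sketch of the Rain/WetGrass comparison, with an invented joint distribution. Read as Bayesian networks, the two graphs fit the same joint, but they give different answers to "what happens to Rain if I force the grass to be wet?". The function name `p_rain_do_wet` and the graph labels are illustrative only.

```python
# Hypothetical joint distribution over (rain, wet_grass), both binary.
joint = {(1, 1): 0.18, (1, 0): 0.02, (0, 1): 0.08, (0, 0): 0.72}

p_rain = joint[(1, 1)] + joint[(1, 0)]   # observational P(rain=1)
p_wet = joint[(1, 1)] + joint[(0, 1)]    # observational P(wet=1)
p_rain_given_wet = joint[(1, 1)] / p_wet  # observational P(rain=1 | wet=1)


def p_rain_do_wet(graph):
    """P(rain=1 | do(wet=1)) under a two-node causal graph."""
    if graph == "rain->wet":
        # Forcing the effect cuts the arrow into it, so rain keeps its marginal.
        return p_rain
    if graph == "wet->rain":
        # Wet grass has no parents here, so intervening equals conditioning.
        return p_rain_given_wet
    raise ValueError(f"unknown graph: {graph}")


for graph in ("rain->wet", "wet->rain"):
    print(graph, round(p_rain_do_wet(graph), 3))
```

The observational quantities are identical in both cases; only the interventional reading separates the two graphs.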
Second, do we need Bayesian networks to economically represent distributions? This is slightly subtle.
We do not need the directed arrows when representing a particular distribution. For example, suppose a distribution P(A,B,C) is represented by the Bayesian network A → B ← C. Expanding the definition, this means that the joint distribution can be factored as
P(A=a,B=b,C=c) = P1(A=a) P2(B=b|A=a,C=c) P3(C=c)
where P1 and P3 are the marginal distributions of A and C, and P2 is the conditional distribution of B. So the data we needed to specify P were two one-column tables specifying P1 and P3, and a three-column table specifying P2(b|a,c) for all values of a,b,c. But now note that we do not gain very much by knowing that these are probability distributions. To save space it is enough to note that P factors as
P(A=a,B=b,C=c) = F1(a) F2(b,a,c) F3(c)
for some real-valued functions F1, F2, and F3. In other words, that P is represented by a Markov network A - B - C. The directions on the edges were not essential.
And indeed, typical algorithms for inference given a probability distribution, such as belief propagation, do not make use of the Bayesian structure. They work equally well for directed and undirected graphs.
Rather, the point of Bayesian versus Markov networks is that the classes of probability distributions that they can represent are different. So they are useful when we try to learn a probability distribution, and want to cut down the search space by constraining the distribution by some independence relations that we know a priori.
Bayesian networks are popular because they let us write down many independence assumptions that we know hold for practical problems. However, we then have to ask how we know those particular independence relations hold. And that’s because they correspond to causal relations! The reason Bayesian networks are popular with human researchers is that they correspond well with the notion of causality that humans use. We don’t know that the Armchairians would also find them useful.
To save space it is enough to note that P factors as
P(A=a,B=b,C=c) = F1(a) F2(b,a,c) F3(c)
for some real-valued functions F1, F2, and F3. In other words, that P is represented by a Markov network A - B - C. The directions on the edges were not essential.
Can’t the directions be recovered automatically from that expression, though? That is, discarding the directions from the notation of conditional probabilities doesn’t actually discard them.
The reconstruction algorithm would label every function argument as “primary” or “secondary”, begin with no arguments labelled, and repeatedly do this:
For every function with no primary variable and exactly one unlabelled variable, label that variable as primary and all of its occurrences as arguments to other functions as secondary.
When all arguments are labelled, make a graph of the variables with an arrow from X to Y whenever X and Y occur as arguments to the same function, X as secondary and Y as primary. If the functions F1 F2 etc. originally came from a Bayesian network, won’t this recover that precise network?
If the original graph was A ← B → C, the expression would have been F1(a,b) F2(b) F3(c,b).
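The labelling procedure described above can be sketched in a few lines. This is one reading of it, not a definitive implementation: each factor is given only as a tuple of its argument names, `recover_edges` is a made-up name, and it assumes the factors really did come from a Bayesian network (otherwise it raises).

```python
def recover_edges(factors):
    """Recover directed edges (parent, child) from factor argument lists.

    `factors` is a list of tuples of variable names, e.g. F2(b, a, c) -> ("b", "a", "c").
    """
    primary = {}      # factor index -> the variable labelled primary in that factor
    labelled = set()  # variables already primary somewhere (hence secondary elsewhere)

    progress = True
    while progress:
        progress = False
        for i, args in enumerate(factors):
            if i in primary:
                continue
            unlabelled = [v for v in args if v not in labelled]
            if len(unlabelled) == 1:
                # Exactly one unlabelled argument: it becomes this factor's primary.
                primary[i] = unlabelled[0]
                labelled.add(unlabelled[0])
                progress = True

    if len(primary) != len(factors):
        raise ValueError("could not label every factor; not a Bayesian-network factorization?")

    # Arrow from each secondary argument to the primary argument of the same factor.
    return {(x, y) for i, y in primary.items() for x in factors[i] if x != y}


# F1(a) F2(b,a,c) F3(c), the collider example from earlier in the thread.
print(sorted(recover_edges([("a",), ("b", "a", "c"), ("c",)])))
# F1(a,b) F2(b) F3(c,b), the fork example.
print(sorted(recover_edges([("a", "b"), ("b",), ("c", "b")])))
```

Note that this only works because the factors are kept separate, one per node. Once adjacent factors are multiplied together (say F1(a) absorbed into F2), the argument lists no longer carry the direction.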
If the functions F1 F2 etc. originally came from a Bayesian network, won’t this recover that precise network?
I think this is right, if you know that the factors were learned by fitting them to a Bayesian network, you can recover what that network must have been. And you can go even further, if you only have a joint distribution you can use the techniques of the original article to see which Bayesian networks could be consistent with it.
But there is a separate question about why we are interested in Bayesian networks in the first place. SilasBarta seemed to claim that you are naturally led to them if you are interested in representing probability distributions efficiently. But for that purpose (I claim), you only need the idea of factors, not the directed graph structure. E.g. a probability distribution which fits the (equivalent) Bayesian networks A → B → C or A ← B ← C or A ← B → C can be efficiently represented as F1(a,b) F2(b,c). You would not think of representing it as F1(a) F2(a,b) F3(b,c) unless you were already interested in causality.
In other words, that P is represented by a Markov network A - B - C. The directions on the edges were not essential.
On the contrary, they are important and store information about the relationships that saves you time and space. Like I said in my linked comment, the direction of the arrows between A,C and B tell you whether conditioning on B (perhaps by separating it out into buckets of various values) creates or destroys mutual information between A and C. That saves you from having to explicitly write out all the combinations of conditional (in)dependence.
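As an illustration of "creates or destroys mutual information", here is a sketch with two toy distributions (the numbers are assumptions picked for clarity). In the collider A → B ← C, B is the XOR of two independent fair coins; in the chain A → B → C, each step is a noisy copy of the previous one.

```python
from math import log2


def marginal(joint, keep):
    """Sum a joint over (a, b, c) down to the positions listed in `keep`."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def mutual_info_ac(joint):
    """I(A;C) in bits."""
    p_ac, p_a, p_c = marginal(joint, (0, 2)), marginal(joint, (0,)), marginal(joint, (2,))
    return sum(p * log2(p / (p_a[(a,)] * p_c[(c,)]))
               for (a, c), p in p_ac.items() if p > 0)


def mutual_info_ac_given_b(joint):
    """I(A;C | B) in bits."""
    p_ab, p_bc, p_b = marginal(joint, (0, 1)), marginal(joint, (1, 2)), marginal(joint, (1,))
    return sum(p * log2(p * p_b[(b,)] / (p_ab[(a, b)] * p_bc[(b, c)]))
               for (a, b, c), p in joint.items() if p > 0)


def flip(x, y, eps):
    """Probability that a noisy copy of x equals y, with flip probability eps."""
    return eps if x != y else 1 - eps


# Collider A -> B <- C: A and C are independent fair coins, B = A xor C.
collider = {(a, a ^ c, c): 0.25 for a in (0, 1) for c in (0, 1)}

# Chain A -> B -> C: B is a noisy copy of A, C is a noisy copy of B.
chain = {(a, b, c): 0.5 * flip(a, b, 0.1) * flip(b, c, 0.2)
         for a in (0, 1) for b in (0, 1) for c in (0, 1)}

for name, joint in (("collider", collider), ("chain", chain)):
    print(name, "I(A;C) =", round(mutual_info_ac(joint), 4),
          " I(A;C|B) =", round(mutual_info_ac_given_b(joint), 4))
```

Conditioning on B creates dependence between A and C in the collider and removes it in the chain, which is the distinction the arrow directions record.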
In other words, that P is represented by a Markov network A - B - C. The directions on the edges were not essential.
Oops, on second thought the factorization is equivalent to the complete triangle, not a line. But this doesn’t change the point that the space requirements are determined by the factors, not the graph structure, so the two representations will use the same amount of space.
On the contrary, they are important and store information about the relationships that saves you time and space.
All independence relations are implicit in the distribution itself, so the graph can only save you time, not space.
It is true that knowing a minimal Bayes network or a minimal Markov network for a distribution lets you read off certain independence assumptions quickly. But it doesn’t save you from having to write out all the combinations. There are exponentially many possible conditional independences, each of which may hold or not, so no sub-exponential representation can encode all of them. And indeed, there are some kinds of independence assumptions that can be expressed as Bayesian networks but not Markov networks, and vice versa. Even in everyday machine learning, it is not the case that a Bayesian network is always the best representation.
You also do not motivate why someone would be interested in a big list of conditional independencies for its own sake. Surely, what we ultimately want to know is e.g. the probability that it will rain tomorrow, not whether or not rain is correlated with sprinklers.
But it doesn’t save you from having to write out all the combinations.
It saves you from having to write them until needed, in which case they can be extracted by walking through the graph rather than doing a lookup on a superexponential table.
You also do not motivate why someone would be interested in a big list of conditional independencies for its own sake. Surely, what we ultimately want to know is e.g. the probability that it will rain tomorrow, not whether or not rain is correlated with sprinklers.
Yes, the question was what they would care about if they were only interested in predictions. And so I think I’ve motivated why they would care about conditional (in)dependencies: it determines the (minimal) set of variables they need to look at! Whatever minimal method of representing their knowledge will then have these arrows (from one of the networks that fits the data).
If you require that causality definitions be restricted to (uncorrelated) counterfactual operations (like Pearl’s “do” operation), then sure, the Armchairians won’t do that specific computation. But if you use the definition of causality from this article, then I think it’s clear that efficiency considerations will lead them to use something isomorphic to it.
It saves you from having to write them until needed
I was saying that not every independence property is representable as a Bayesian network.
Whatever minimal method of representing their knowledge will then have these arrows (from one of the networks that fits the data).
No! Once you have learned a distribution using Bayesian network-based methods, the minimal representation of it is the set of factors. You don’t need the direction of the arrows any more.
I was saying that not every independence property is representable as a Bayesian network.
You mean when all variables are independent, or some other class of cases?
No! Once you have learned a distribution using Bayesian network-based methods, the minimal representation of it is the set of factors. You don’t need the direction of the arrows any more.
Read the rest: you need the arrows if you want to efficiently look up the conditional (in)dependencies.
You mean when all variables are independent, or some other class of cases?
Well, there are doubly-exponentially many possibilities…
The usual example for Markov networks is four variables connected in a square. The corresponding independence assumption is that any two opposite corners are independent given the other two corners. There is no Bayesian network encoding exactly that.
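For anyone who wants to poke at the square example, here is a sketch that builds a distribution from pairwise factors on the cycle W - X - Y - Z - W and checks the two independence statements numerically. The potential strengths are arbitrary assumptions, and the code only verifies the independences for this one distribution; it does not by itself show that no Bayesian network encodes exactly that set.

```python
from itertools import product


def phi(u, v, strength):
    """Hypothetical pairwise potential: `strength` when the neighbours agree, else 1."""
    return strength if u == v else 1.0


# Unnormalised product of edge potentials around the square W - X - Y - Z - W.
unnorm = {
    (w, x, y, z): phi(w, x, 2.0) * phi(x, y, 3.0) * phi(y, z, 0.5) * phi(z, w, 4.0)
    for w, x, y, z in product((0, 1), repeat=4)
}
total = sum(unnorm.values())
joint = {k: v / total for k, v in unnorm.items()}


def marginal(joint, keep):
    """Sum the joint down to the variable positions listed in `keep`."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def cond_indep(joint, i, j, given, tol=1e-12):
    """True if variable i is independent of variable j given the positions in `given`."""
    p_ijg = marginal(joint, (i, j) + given)
    p_ig = marginal(joint, (i,) + given)
    p_jg = marginal(joint, (j,) + given)
    p_g = marginal(joint, given)
    for key, p in p_ijg.items():
        vi, vj, g = key[0], key[1], key[2:]
        # Independence means P(i,j,g) P(g) == P(i,g) P(j,g) for every assignment.
        if abs(p * p_g[g] - p_ig[(vi,) + g] * p_jg[(vj,) + g]) > tol:
            return False
    return True


# Positions: W=0, X=1, Y=2, Z=3.
print("W indep Y given X,Z:", cond_indep(joint, 0, 2, (1, 3)))
print("X indep Z given W,Y:", cond_indep(joint, 1, 3, (0, 2)))
print("W indep Y marginally:", cond_indep(joint, 0, 2, ()))
```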
you need the arrows if you want to efficiently look up the conditional (in)dependencies.
But again, why would you want that? As I said in the grand^(n)parent, you don’t need to when doing inference.
The usual example for Markov networks is four variables connected in a square. The corresponding independence assumption is that any two opposite corners are independent given the other two corners. There is no Bayesian network encoding exactly that.
Okay, I’m recalling the “troublesome” cases that Pearl brings up, which gives me a better idea of what you mean. But this is not a counterexample. It just means that you can’t do it on a Bayes net with binary nodes. You can still represent that situation by merging (either pair of) the screening nodes into one node that covers all combinations of possibilities between them.
Do you have another example?
But again, why would you want that? As I said in the grand^(n)parent, you don’t need to when doing inference.
Sure you do: you want to know which and how many variables you have to look up to make your prediction.
merging (either pair of) the screening nodes into one node
Then the network does not encode the conditional independence between the two variables that you merged.
The task you have to do when making predictions is marginalization: in order to compute P(Rain|WetGrass), you need to sum P(Rain, X, Y, Z | WetGrass) over all possible values of the variables X, Y, Z that you didn’t observe. Here it is very helpful to have the distribution factored into a tree, since that can make it feasible to do variable elimination (or related algorithms like belief propagation). But the directions on the edges in the tree don’t matter; you can start at any leaf node and work across.
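A small sketch of that last claim, using an undirected chain A - B - C - D with invented pairwise factors (think of A as Rain and D as the observed WetGrass, with B and C unobserved). Eliminating variables starting from either end gives the same conditional as brute-force summation, and no edge direction is used anywhere.

```python
from itertools import product

# Hypothetical undirected pairwise factors on the chain A - B - C - D (all binary).
f_ab = [[3.0, 1.0], [1.0, 2.0]]  # f_ab[a][b]
f_bc = [[1.0, 4.0], [2.0, 1.0]]  # f_bc[b][c]
f_cd = [[5.0, 1.0], [1.0, 3.0]]  # f_cd[c][d]


def normalise(scores):
    z = sum(scores)
    return [s / z for s in scores]


def brute_force(d_obs):
    """P(A | D=d_obs) by summing the full product over every (b, c) assignment."""
    return normalise([
        sum(f_ab[a][b] * f_bc[b][c] * f_cd[c][d_obs]
            for b, c in product((0, 1), repeat=2))
        for a in (0, 1)
    ])


def eliminate_from_d(d_obs):
    """Work inward from the D leaf: sum out C first, then B."""
    msg_c = [f_cd[c][d_obs] for c in (0, 1)]
    msg_b = [sum(f_bc[b][c] * msg_c[c] for c in (0, 1)) for b in (0, 1)]
    return normalise([sum(f_ab[a][b] * msg_b[b] for b in (0, 1)) for a in (0, 1)])


def eliminate_from_a(d_obs):
    """Work from the A end instead: sum out B first (keeping A as an index), then C."""
    t_ac = [[sum(f_ab[a][b] * f_bc[b][c] for b in (0, 1)) for c in (0, 1)]
            for a in (0, 1)]
    return normalise([sum(t_ac[a][c] * f_cd[c][d_obs] for c in (0, 1)) for a in (0, 1)])


for d in (0, 1):
    print(d, brute_force(d), eliminate_from_d(d), eliminate_from_a(d))
```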
That’s a clear outline of the theory. I just want to note that the theory itself makes some assumptions about possible patterns of causation, even before you begin to select which causal graphs are plausible candidates for testing. Pearl himself stresses that without putting causal information in, you can’t get causal information out from purely observational data.
For example, if overweight causes lack of exercise and lack of exercise causes overweight, you don’t have an acyclic graph. Acyclicity of causation is one of the background assumptions here. Acyclicity of causation is reasonable when talking about point events in a universe without time-like loops. However, “weight” and “exercise level” are temporally extended processes, which makes acyclicity a strong assumption.
This is a hugely important point in practical, everyday reasoning about causality. Feedback loops abound.
Koan: How, then, does the process of attributing causation get started?
My answer:
First, notice a situation that occurs many times. Then pay attention to the ways in which things are different from one iteration to the next. At this point, and here is where causal information begins, if some of the variables represent your own behavior, you can systematically intervene in the situation by changing those behaviors. For cleanest results, contrive a controlled experiment that is analogous to the original situation.
In short, you insert causal information by intervening.
This of course requires you to construct a reference class of situations that are substantially similar to one another, but humans seem to be pretty good at that within our domains of familiarity.
By the way, thank you for explaining the underlying assumption of acyclicity. I’ve been trying to internalize the math of causal calculus and it bugged me that cyclic causes weren’t allowed. Now I understand that it is a simplification and that the calculus isn’t quite as powerful as I thought.
I don’t have an answer to my own koan, but this was one of the possibilities that I thought of:
But how does one intervene? By causing some variable to take some value, while obstructing the other causal influences on it. So causal knowledge is already required before one can intervene. This is not a trivial point—if the knowledge is mistaken, the intervention may not be successful, as I pointed out with the example of trying to warm a room thermostat by placing a candle near it.
Causal knowledge is required to ensure success, but not to stumble across it. Over time, noticing (or stumbling across if you prefer) relationships between the successes stumbled upon can quickly coalesce into a model of how to intervene. Isn’t this essentially how we believe causal reasoning originated? In a sense, all DNA is information about how to intervene that, once stumbled across, persisted due to its efficacy.
I think that one bootstraps the process with contrived situations designed to appeal to one’s intuitions. For example, one attempts to obtain causal information through a randomised controlled trial. You mark the obverse face of a coin “treatment” and reverse face “control” and toss the coin to “randomly” assign your patients.
Let us briefly consider the absolute zero of no a priori knowledge at all. Perhaps the coin knows the prognosis of the patient and comes down “treatment” for patients with a good prognosis, intending to mislead you into believing that the treatment is the cause of good outcomes. Maybe, maybe not. Let’s stop considering this because insanity is stalking us.
We are willing to take a stand. We know enough, a priori, to choose and operate a randomisation device and thus obtain a variable which is independent of all the others and causally connected to none of them. We don’t prove this, we assume it. When we encounter a compulsive gambler, who believes that Lady Luck is fickle and is very likely messing with us via the coin, we just dismiss his hypothesis. Life is short; one has to assume that certain obvious things are actually true in order to get started, and work up from there.
My answer: Attributing causation is part of our human instincts. We are born with some desire to do it. We may develop that skill by reflecting on it during our lifetime.
(How did we humans develop that instinct? Evolution, probably. Humans who had mutated to reason about causality died less – for instance, they might have avoided drinking from a body of water after seeing something poisonous put in, because they reasoned that the poison addition would cause the water to be poisonous.)
This is a non-explanation, or rather, three non-explanations.
“Human nature does it” explains no more than “God does it”.
“It’s part of human nature because it must have been adaptive in the past” likewise. Causal reasoning works, but why does it work?
And “mutated to reason about causality” is just saying “genes did it”, which is still not an advance on “God did it”.
There isn’t any better explanation. If you don’t accept the idea of causality as given, you can never explain anything. Roryokane is using causality to explain how causality originated. That’s not a good way to prove how causality works, but it is a good way of understanding why causality exists, or rather of accepting that we can never prove it exists.
Our instincts are just wired to interpret causality that way, and that makes it a brute fact. You might as well claim that calling a certain color yellow and then saying it looks yellow as a result of human nature is a non-explanation, you might be technically right to do so but in that case then you’re asking for answers you’re never actually going to get.
That would be a non-explanation, but a better explanation is in fact possible. You can look at the way that light is turned into neural signals by the eye, and discover the existence of red-green, blue-yellow, and light-dark axes, and there you have physiological justification for six of our basic colour words. (I don’t know just how settled that story is, but it’s settled enough to be literally textbook stuff.)
So, that is what a real explanation looks like. Attributing anything to “human nature” is even more wrong than attributing it to “God”. At least we have some idea of what “God” would be if he existed, but “human nature” is a blank, a label papering over a void. How do Sebastian Thrun’s cars drive themselves? Because he has integrated self-driving into their nature. How does opium produce sleep? By its dormitive nature. How do humans distinguish colours? By their human nature.
But causality is uniquely impervious to that kind of explanation. You can explain why humans believe in causality in a physiological sense, but I didn’t think that is what you were asking for. I thought you were asking for some overall metaphysical justification for causality, and there really isn’t any. Causal reasoning works because it works; there’s no other justification to be had for it.
Where do you get this? My recall of Causality is that he specifically rejected the “no causes in, no causes out” view in favor of the “Occam’s Razor in, some causes out” view.
Yes, the Occamian view is in his book in section 2.3 (and still in the 2009 2nd edition). But that definition of “inferred causation”—those arrows common to all causal models consistent with the statistical data—depends on general causal assumptions, the usual ones being the DAG, Markov, and Faithfulness properties.
In other places, for example: “Causal inference in statistics: An overview”, which is in effect the Cliff Notes version of his book, he writes:
“one cannot substantiate causal claims from associations alone, even at the population level—behind every causal conclusion there must lie some causal assumption that is not testable in observational studies.”
Here is a similar survey article from 2003, in which he writes that exact sentence, followed by:
“Nancy Cartwright (1989) expressed this principle as “no causes in, no causes out”, meaning that we cannot convert statistical knowledge into causal knowledge.”
Everywhere, he defines causation in terms of counterfactuals: claims about what would have happened had something been different, which, he says, cannot be expressed in terms of statistical distributions over observational data. Here is a long interview (both audio and text transcript) in which he recounts the whole course of his work.
Interesting, but how do those files evade word searches for the parts you’ve quoted?
Dunno, not all PDFs are searchable and not all PDF viewers fail to make a pig’s ear of searching. The quotes can be found on p.99 (the third page of the file) and pp.284-285 (6th-7th pages of the file) respectively.
OTOH, try Google.
Btw, Scott Aaronson just recently posted the question of whether you would care about causality if you could only work with observational data (someone already linked this article in the comments) and I put up a comment with my summary of the LW position (plus some complexity-theoretic considerations).
I don’t think that Bayesian networks implicitly contain the concept of causality.
Formally, a probability distribution is represented by a Bayesian network if it can be factored as a product of P(node | node’s parents). But this is not unique: given one network, you can create lots of other networks which also represent the same distribution, e.g. by changing the direction of arrows as long as the independence properties from the graph stay the same (the graphs A → B → C and A ← B ← C can represent exactly the same class of probability distributions). Pearl distinguishes Bayesian networks from causal networks, which are Bayesian networks in which the arrows point in the direction of causality.
And of course, there are other sparse representations like Markov networks, which also incorporate independence assumptions but are undirected.
The non-uniqueness doesn’t make causality absent or irrelevant; it just means there are multiple minimal representations that use causality. The causality arises in how your node connections are asymmetric. If the relativity of simultaneity (observers seeing the same events in a different time order) doesn’t obviate causality, neither should the existence of multiple causal networks.
There are indeed equivalent models that use purely symmetric node connections (or none at all in the case of the superexponential pairwise conditional independence table across all variables), but (correct me if I’m wrong) by throwing away the information graphically represented by the arrows, you no longer have a maximally efficient encoding of the joint probability distribution (even though it’s certainly not as bad as the superexponential table).
I guess there are two points here.
First, authors like Pearl do not use “causality” to mean just that there is a directed edge in a Bayesian network (i.e. that certain conditional independence properties hold). Rather, he uses it to mean that the model describes what happens under interventions. One can see the difference by comparing Rain → WetGrass with WetGrass → Rain (which are equivalent as Bayesian networks). Of course, maybe he is confused and the difference will dissolve under more careful consideration, but I think this shows one should be careful in claiming that Bayes networks encode our best understanding of causality.
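A small sketch of that difference, in Python. The joint distribution below is invented for illustration, and the variable names are mine. Both graphs fit the same observational data, but they give different answers to "what is P(Rain) if we force the grass to be wet?", using the usual reading of an intervention as cutting the arrows into the variable being set.

```python
# Observational joint over (rain, wet); the numbers are made up.
joint = {(1, 1): 0.27, (1, 0): 0.03, (0, 1): 0.07, (0, 0): 0.63}


def prob(pred):
    """Probability of the event described by pred(rain, wet)."""
    return sum(p for (rain, wet), p in joint.items() if pred(rain, wet))


p_rain = prob(lambda rain, wet: rain == 1)
p_wet = prob(lambda rain, wet: wet == 1)
p_rain_given_wet = prob(lambda rain, wet: rain == 1 and wet == 1) / p_wet

# Model 1: Rain -> WetGrass. Forcing WetGrass cuts its only incoming arrow,
# so Rain keeps its marginal distribution.
rain_after_do_wet_if_rain_causes_wet = p_rain

# Model 2: WetGrass -> Rain. WetGrass has no parents, so intervening on it
# is the same as conditioning on it.
rain_after_do_wet_if_wet_causes_rain = p_rain_given_wet
```

The two results differ whenever Rain and WetGrass are correlated, even though no observational test can tell the two networks apart.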
Second, do we need Bayesian networks to economically represent distributions? This is slightly subtle.
We do not need the directed arrows when representing a particular distribution. For example, suppose a distribution P(A,B,C) is represented by the Bayesian network A → B ← C. Expanding the definition, this means that the joint distribution can be factored as
P(A=a,B=b,C=c) = P1(A=a) P2(B=b|A=a,C=c) P3(C=c)
where P1 and P3 are the marginal distributions of A and C, and P2 is the conditional distribution of B given A and C. So the data we needed to specify P were two one-column tables specifying P1 and P3, and a three-column table specifying P2(b|a,c) for all values of a,b,c. But now note that we do not gain very much by knowing that these are probability distributions. To save space it is enough to note that P factors as
P(A=a,B=b,C=c) = F1(a) F2(b,a,c) F3(c)
for some real-valued functions F1, F2, and F3. In other words, P is represented by a Markov network A - B - C. The directions on the edges were not essential.
And indeed, typical algorithms for inference given a probability distribution, such as belief propagation, do not make use of the Bayesian structure. They work equally well for directed and undirected graphs.
Rather, the point of Bayesian versus Markov networks is that the classes of probability distributions that can be represented by them are different. So they are useful when we try to learn a probability distribution, and want to cut down the search space by constraining the distribution by some independence relations that we know a priori.
Bayesian networks are popular because they let us write down many independence assumptions that we know hold for practical problems. However, we then have to ask how we know those particular independence relations hold. And that’s because they correspond to causal relations! The reason Bayesian networks are popular with human researchers is that they correspond well with the notion of causality that humans use. We don’t know that the Armchairians would also find them useful.
Can’t the directions be recovered automatically from that expression, though? That is, discarding the directions from the notation of conditional probabilities doesn’t actually discard them.
The reconstruction algorithm would label every function argument as “primary” or “secondary”, begin with no arguments labelled, and repeatedly do this:
For every function with no primary variable and exactly one unlabelled variable, label that variable as primary and all of its occurrences as arguments to other functions as secondary.
When all arguments are labelled, make a graph of the variables with an arrow from X to Y whenever X and Y occur as arguments to the same function, X as secondary and Y as primary. If the functions F1 F2 etc. originally came from a Bayesian network, won’t this recover that precise network?
If the original graph was A ← B → C, the expression would have been F1(a,b) F2(b) F3(c,b).
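Here is a rough Python sketch of that labelling procedure, under the assumption that the factors really did come from a Bayesian network (one factor per node). The function name `recover_arrows` and the dict-of-tuples input format are my own choices. Since a variable's occurrences only ever get labelled at the moment it becomes primary somewhere, it is enough to track a single set of labelled variables.

```python
def recover_arrows(factors):
    """Recover directed edges from factor argument lists.

    `factors` maps a factor name to the tuple of variables it takes,
    e.g. {"F1": ("a",), "F2": ("b", "a", "c"), "F3": ("c",)}.
    Returns a set of (parent, child) pairs.
    """
    primary = {}      # factor name -> the variable labelled primary in it
    labelled = set()  # variables that are primary somewhere (secondary elsewhere)

    changed = True
    while changed:
        changed = False
        for name, args in factors.items():
            if name in primary:
                continue
            unlabelled = [v for v in args if v not in labelled]
            if len(unlabelled) == 1:
                # Primary here; every other occurrence is now secondary.
                primary[name] = unlabelled[0]
                labelled.add(unlabelled[0])
                changed = True

    if len(primary) != len(factors):
        raise ValueError("labelling got stuck; factors may not come from a Bayesian network")

    # Arrow from each secondary argument to the primary argument of the same factor.
    return {
        (v, primary[name])
        for name, args in factors.items()
        for v in args
        if v != primary[name]
    }


collider = recover_arrows({"F1": ("a",), "F2": ("b", "a", "c"), "F3": ("c",)})
fork = recover_arrows({"F1": ("a", "b"), "F2": ("b",), "F3": ("c", "b")})
```

On the two expressions discussed here, `collider` comes out as the edges a → b and c → b, and `fork` as b → a and b → c.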
I think this is right: if you know that the factors were learned by fitting them to a Bayesian network, you can recover what that network must have been. And you can go even further: if you only have a joint distribution, you can use the techniques of the original article to see which Bayesian networks could be consistent with it.
But there is a separate question about why we are interested in Bayesian networks in the first place. SilasBarta seemed to claim that you are naturally led to them if you are interested in representing probability distributions efficiently. But for that purpose (I claim), you only need the idea of factors, not the directed graph structure. E.g. a probability distribution which fits the (equivalent) Bayesian networks A → B → C or A ← B ← C or A ← B → C can be efficiently represented as F1(a,b) F2(b,c). You would not think of representing it as F1(a) F2(a,b) F3(b,c) unless you were already interested in causality.
On the contrary, they are important and store information about the relationships that saves you time and space. Like I said in my linked comment, the direction of the arrows between A,C and B tell you whether conditioning on B (perhaps by separating it out into buckets of various values) creates or destroys mutual information between A and C. That saves you from having to explicitly write out all the combinations of conditional (in)dependence.
Oops, on second thought the factorization is equivalent to the complete triangle, not a line. But this doesn’t change the point that the space requirements are determined by the factors, not the graph structure, so the two representations will use the same amount of space.
All independence relations are implicit in the distribution itself, so the graph can only save you time, not space.
It is true that knowing a minimal Bayes network or a minimal Markov network for a distribution lets you read off certain independence assumptions quickly. But it doesn’t save you from having to write out all the combinations. There are exponentially many possible conditional independences, each of which may hold or not, so no sub-exponential representation can encode all of them. And indeed, there are some kinds of independence assumptions that can be expressed as Bayesian networks but not Markov networks, and vice versa. Even in everyday machine learning, it is not the case that a Bayesian network is always the best representation.
You also do not motivate why someone would be interested in a big list of conditional independencies for its own sake. Surely, what we ultimately want to know is e.g. the probability that it will rain tomorrow, not whether or not rain is correlated with sprinklers.
It saves you from having to write them until needed, in which case they can be extracted by walking through the graph rather than doing a lookup on a superexponential table.
Yes, the question was what they would care about if they were only interested in predictions. And so I think I’ve motivated why they would care about conditional (in)dependencies: it determines the (minimal) set of variables they need to look at! Whatever minimal method they use to represent their knowledge will then have these arrows (from one of the networks that fits the data).
If you require that causality definitions be restricted to (uncorrelated) counterfactual operations (like Pearl’s “do” operation), then sure, the Armchairians won’t do that specific computation. But if you use the definition of causality from this article, then I think it’s clear that efficiency considerations will lead them to use something isomorphic to it.
I was saying that not every independence property is representable as a Bayesian network.
No! Once you have learned a distribution using Bayesian network-based methods, the minimal representation of it is the set of factors. You don’t need the direction of the arrows any more.
You mean when all variables are independent, or some other class of cases?
Read the rest: you need the arrows if you want to efficiently look up the conditional (in)dependencies.
Well, there are doubly-exponentially many possibilities…
The usual example for Markov networks is four variables connected in a square. The corresponding independence assumption is that any two opposite corners are independent given the other two corners. There is no Bayesian network encoding exactly that.
But again, why would you want that? As I said in the grand^(n)parent, you don’t need to when doing inference.
Okay, I’m recalling the “troublesome” cases that Pearl brings up, which gives me a better idea of what you mean. But this is not a counterexample. It just means that you can’t do it on a Bayes net with binary nodes. You can still represent that situation by merging (either pair of) the screening nodes into one node that covers all combinations of possibilities between them.
Do you have another example?
Sure you do: you want to know which and how many variables you have to look up to make your prediction.
Then the network does not encode the conditional independence between the two variables that you merged.
The task you have to do when making predictions is marginalization: in order to compute P(Rain|WetGrass), you need to compute the sum of P(Rain, X, Y, Z | WetGrass) over all possible values of the variables X, Y, Z that you didn’t observe. Here it is very helpful to have the distribution factored into a tree, since that can make it feasible to do variable elimination (or related algorithms like belief propagation). But the directions on the edges in the tree don’t matter: you can start at any leaf node and work across.
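As a small illustration, here is a Python sketch of variable elimination on the undirected chain F1(a,b) F2(b,c) over binary variables. The factor values are arbitrary non-negative numbers I made up; nothing about them says which way any arrow points. Summing out from the A end or from the C end gives the same normalizing constant, and both agree with brute-force summation.

```python
import itertools

# Undirected factors on the chain A - B - C; values are made up.
f1 = [[1.0, 2.0], [3.0, 0.5]]  # f1[a][b]
f2 = [[0.2, 0.8], [1.5, 0.5]]  # f2[b][c]
values = range(2)

# Start at leaf A: eliminate a, then b, then c.
msg_a_to_b = [sum(f1[a][b] for a in values) for b in values]
msg_b_to_c = [sum(msg_a_to_b[b] * f2[b][c] for b in values) for c in values]
z_from_a = sum(msg_b_to_c)

# Start at leaf C instead: eliminate c, then b, then a.
msg_c_to_b = [sum(f2[b][c] for c in values) for b in values]
msg_b_to_a = [sum(f1[a][b] * msg_c_to_b[b] for b in values) for a in values]
z_from_c = sum(msg_b_to_a)

# Brute force over all eight assignments, for comparison.
z_brute = sum(f1[a][b] * f2[b][c] for a, b, c in itertools.product(values, repeat=3))

# Marginal of the middle variable, combining the messages from both leaves.
p_b = [msg_a_to_b[b] * msg_c_to_b[b] / z_brute for b in values]
```

The messages are just partial sums, so the elimination order affects only the cost of the computation and never the answer.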