A Bayesian Explanation of Causal Models

Judea Pearl’s theory of causal models is a technical explanation of causality in terms of probability theory. Causality is a very fundamental part of our reasoning—if we want to understand something, the first question we ask is “why?”. So it is no surprise that this theory is very useful for rationality, and that it is invoked at several points in the Sequences. This suggests that a post explaining how causal models work and what they imply would be useful.

Well, there is such a post: Causal Diagrams and Causal Models (henceforth CDCM), by Eliezer Yudkowsky, is meant as a standard introduction to the topic. But I don’t think it is a very good explanation.

The bulk of the post is a description of how to derive a causal model from statistical data, through the following procedure:

  1. Extract from the data the relative frequency of each possible combination of values for your variables.

  2. Assume that when you have enough data your probability for each combination should be equal to its frequency, and that you do have enough data.

  3. Try to fit the resulting probability distribution into a sort of DAG data structure called a “causal model”, which requires some of the variables to be conditionally independent.

According to the post, this was first done by AI researchers because the causal models’ independence constraints mean they can only represent a tiny fraction of all possible distributions, so they take much less space to store and time to learn than a full distribution.

However, the post also claims that a causal model of some variables represents the causal relationships between them, which presumably are a feature of the real world.

So, what is a causal model? Is it a different form of probability distribution? A frequency distribution? A data structure? A representation of Reality? CDCM switches between interpretations frequently, as if they were equivalent, without clarifying where they come from or how they relate to each other.

When I first learned about causal models, through CDCM and an AI textbook, I thought of causal models as just a data structure that stored a subjective distribution and was often useful. I first became confused about this while reading Pearl’s book Causality. That book states clearly, as one of its main points, that a causal model is not a probability distribution: the two don’t represent the same information, and they should be kept distinct. As the inventor of causal models, Pearl would know. But if a causal model isn’t a distribution, what is it?

I thought about this for several weeks, looking at the problem from various distinct angles, and searching different sources for anything that might be an explanation. After some failed attempts, I was able to put together a satisfactory framework, an interpretation that matched my intuitions and justified the math. Then, after I had the whole solution in my head and could look at the entire thing all at once, it was only then that I realized—it was literally just Bayes all along.

What is a causal model? It’s just a regular hypothesis: it’s a logical statement that may correspond to a part of the real world, providing some incomplete information about it; this information can be represented as a distribution P(Data|Model); and using Bayes’ Theorem to score the model by how well it fits the observed data, we can find its probability P(Model|Data).

Now I believe I could give a better explanation of causal models, one that makes explicit what their assumptions are and what their mathematical objects mean. This post is an attempt at writing up that explanation.

What is a Causal Model?

We learn by finding various patterns we can use to predict future observations. When we manage to make these patterns precise, they become mathematical laws, equations relating variables that represent physical quantities. The majority of those patterns we find fit the following structure: there is an initial state, which we characterize by some quantities, then it changes until reaching a final state, characterized by other quantities. The rule is an equation that relates the initial to the final quantities.

For example, in physics we might start with initial positions and momenta of two billiard balls, and calculate their positions and momenta after they collide. In chemistry we might start with the concentration of two solutes in a solution, and calculate the equilibrium concentration of a product after the two react. In economics we might start with some agents’ baskets of goods and utility functions, and calculate the equilibrium distribution of goods after they trade.

We’ll call those kinds of rules causal laws, and represent one of them by a mathematical function $f$. The function takes the initial state’s quantities (the causes) and returns one of the final state’s quantities (the effect). So if we write $y = f(x_1, \dots, x_n)$, we mean there are some parts of the world described by the quantities $x_1, \dots, x_n$, they interact and change, and then some section of the resulting state can be described by the quantity $y$. When stating a particular causal law, we are making two claims: first that the $x_i$ are all the causes of $y$, that they fully characterize which systems evolve into which values of $y$; and then that $f$ describes the way in which the $x_i$ evolve into $y$.

To be learnable as a generalization, a causal law has to hold over some range of “instances” of the system—it needs some kind of persistent existence, under which it takes many different values of initial states each to the corresponding final state. Even if we could learn from one example, if there wasn’t a second we couldn’t make any predictions. For example, if we state the informal causal rule that pavement being wet causes it to be slippery (or just “wet pavement is slippery”), we are applying the rule to a range of pieces of pavement and at a range of points in time—if we pour water on our sidewalk, then dry it with a mop, then pour water on it again, it will be slippery, then not slippery, then slippery again.

If we have a causal law for $y$ as a function of some $x_i$, we can try to find causes for the $x_i$ themselves, then causes for their causes, and so on. The result is a causal model—a conjunction of causal laws for each of a set of variables. Each of the model’s variables is written as a function of other model variables or of outside, “exogenous” variables.

A model can be thought of as like a computer program—you start by giving it the values of the exogenous variables, then it calculates the values of the variables they cause, then the variables those cause, and so on in an expanding web of causation that spreads out from the initial assignments. But a model is not just any program, or even any program that calculates the correct values—it is isomorphic to the real-world situation, in the sense that each of the causal laws, each line of the program, logically corresponds to a real-world mechanism, describing how it evolves from an initial state to an end state. A causal model is like a small, high-level simulation of a section of reality.
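
To make that concrete, here is a minimal sketch of such a program in Python. The three variables and their mechanisms are invented for illustration; the point is only the shape of the computation:

```python
def run_model(u_x, u_y):
    """A three-variable causal model, run as a program.

    u_x and u_y are the exogenous variables. Each line below is one
    causal law, computing a variable from its causes, in causal order.
    """
    x = u_x          # x is set directly by the exogenous u_x
    y = 2 * x + u_y  # mechanism: x (plus the noise u_y) evolves into y
    z = y > 0        # mechanism: y evolves into z
    return x, y, z

print(run_model(u_x=3, u_y=-1))  # (3, 5, True)
```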

This means that for a causal model to correspond to the real world, it’s not enough that all its equations are true. For example, if $x$ changes into $y$ in a way such that $y = f(x)$, that equation is a causal model. But if we instead write this as $x = f^{-1}(y)$, though this will be true numerically, it won’t be a causal model of the situation, since $y$ isn’t evolving into $x$. The equals sign in a causal equation is an assignment statement in a program, not a mathematical equation, and those are not symmetric.

We usually separate a causal model into a causal graph $G$, and some parameters $\theta$. The graph specifies which variables cause which others, with arrows pointing into a variable from each of its causes. The parameters specify the form of the causal functions. This is done because knowing only the causal graph is already pretty useful information, and it’s easier for people to translate their internal knowledge into a graph than into exact functions.

The causal graph needs to have no cycles—it’s a directed acyclic graph, a DAG. Otherwise, the model couldn’t be a program where each line corresponds to a real mechanism, since it would have lines like $y = f(x)$ and $x = g(y)$ and would try to use a value before computing it. There are other ways to compute values for such a cyclical model, but they aren’t the same as actually running the program, and so their computations won’t correspond to Reality. The direction of the arrows corresponds to time—the constraint prohibiting cycles is why the future can’t affect the past.

The standard example of a causal graph is Pearl’s slippery pavement model:

[Figure: the slippery pavement causal graph. This image is from Pearl’s book Causality.]

The Season affects the Sprinkler’s activation and the Rain, the Sprinkler and the Rain affect whether the pavement is Wet, and whether it’s Wet determines whether it’s Slippery. A graph like this doesn’t exclude the possibility that there are other unrepresented causes that also affect the variables (for example, something else might make the pavement Wet), but it excludes a single hidden cause that affects two of the variables—we represent those in graphs by a bidirected arrow (A<->B).

As you can see, each of the graph’s connections is intuitively appealing on its own, and doesn’t depend on the others. For example, the Season influences whether it Rains, and Wet things are Slippery, and these are separate facts that we know independently. In a causal model, physically independent parts of a system are represented by logically independent rules.

This modularity is one of the most useful properties of causal models, as it allows us to do things like building up a model out of smaller parts, verifying each part independently, and adapting to partial changes in the environment by changing parts of the model. In short, causal models have gears.

Building up complex models out of parts, though a thing humans need to do basically every time they think about causality, isn’t something I know how to represent as math. It seems like a difficult problem—for one thing, you have to decide what variables you’ll use in the first place, and how to draw their boundaries, which runs into abstraction problems. However, we do know how to mathematically adapt a model to an environmental change, through the machinery of intervention.

Intervention is very simple: to represent a mechanism that changed into something else, you just take the equation that represents the mechanism and replace it with another law representing the new mechanism, keeping the other laws unchanged. For example, if we decided to turn on the Sprinkler regardless of its previous activation schedule, we would wipe out the equation for it as a function of the Season, replacing it with “Sprinkler = On”:

[Figure: the model after the intervention, with the arrow from Season into Sprinkler removed. This image is also from Causality.]
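
As a sketch of this in code (with made-up boolean mechanisms standing in for the real parameterization), intervening is just swapping out one line of the program while leaving the rest alone:

```python
def pavement_model(season, u_sprinkler, u_rain, do_sprinkler=None):
    """The slippery pavement model as a program; do_sprinkler, if given,
    replaces the Sprinkler's causal law with a constant assignment."""
    if do_sprinkler is None:
        sprinkler = (season == "dry") and u_sprinkler  # Season -> Sprinkler
    else:
        sprinkler = do_sprinkler      # intervention: the new, constant mechanism
    rain = (season == "rainy") and u_rain  # Season -> Rain (unchanged)
    wet = sprinkler or rain                # Sprinkler, Rain -> Wet (unchanged)
    slippery = wet                         # Wet -> Slippery (unchanged)
    return slippery

print(pavement_model("dry", u_sprinkler=False, u_rain=False))                     # False
print(pavement_model("dry", u_sprinkler=False, u_rain=False, do_sprinkler=True))  # True
```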

Modeling decisions is one of the main uses of intervention. Typically, you observe a system in some situations, building a causal model for it, and then you get to control one of the variables. This control changes that variable’s causal mechanism, since we can pick its value regardless of what it would have been set to, which means it’s now decoupled from its causes. Intervention allows us to match this change in physical mechanism with a corresponding alteration in our model, which works because of the modularity of causal models.

Thinking of decisions as interventions in a causal model is what is called Causal Decision Theory, a term you’ve probably heard of around LessWrong. By contrast, Evidential Decision Theory would model turning the Sprinkler on the same way it would think about learning its value, and so an EDT reasoner would think they could influence the Season by turning the Sprinkler on. CDT might not be the best possible decision theory, but it sure is an improvement over EDT.

Interventions are also very useful for inferring causal models. For example, in a drug trial, we want to find out the causal effect of the treatment on the chances of the patient’s recovery. However, the potential of a patient to recover depends on other things, such as socioeconomic status, that may also influence their choice of treatment. This confounding factor obscures the variables’ causal relation, since if people who get the treatment recover more often, this might be because the treatment helps them or because they are wealthier. (We’ll make this more precise later when we learn to do causal model inference.)

The solution is a randomized controlled trial (RCT): we assign each patient to an experimental group or a control group randomly, based on a coin flip. This breaks the causal relationship between wealth and choice of treatment, replacing the treatment’s causal law with the coin’s, which isn’t causally related to the other variables. Since the mechanism that determines the patient’s recovery is independent of the one that determines the treatment, we can then figure out the causal law for Recovery in the lab setting and generalize it to the real world, where socioeconomic status does affect choice of treatment.
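
Here is a small simulation of that point, with numbers invented for illustration. Wealth raises both treatment uptake and recovery, so the observed recovery gap between treated and untreated patients overstates the treatment’s causal effect; randomizing the treatment recovers it:

```python
import random

random.seed(0)

def one_patient(randomize):
    wealthy = random.random() < 0.5
    if randomize:
        treated = random.random() < 0.5  # the coin's causal law replaces wealth's
    else:
        treated = random.random() < (0.8 if wealthy else 0.2)  # wealth -> treatment
    # Recovery's causal law is the same in both settings: treatment helps a
    # little, wealth helps a lot.
    recovered = random.random() < 0.2 + 0.1 * treated + 0.4 * wealthy
    return treated, recovered

def recovery_gap(randomize, n=200_000):
    recoveries = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for _ in range(n):
        treated, recovered = one_patient(randomize)
        recoveries[treated] += recovered
        totals[treated] += 1
    return recoveries[True] / totals[True] - recoveries[False] / totals[False]

print(recovery_gap(randomize=False))  # ~0.34: inflated by confounding
print(recovery_gap(randomize=True))   # ~0.10: the actual causal effect
```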

In the two situations, the lab and the real world, the Recovery variable has different correlations, because correlation is a feature of our state of knowledge and may change when we learn new things, but it has the same causal law on both, because the causal law represents a physical mechanism that doesn’t change unless it is physically modified.

[Figure: the lab setting’s causal graph next to the real world’s. Please forgive my terrible PowerPoint graphics.]

Inferring Causal Mechanisms

You might notice that so far we haven’t mentioned any probabilities. That is because we were talking about what causal models mean, what they correspond to in Reality, and probability is in the mind. Now, we’ll start trying to infer causality from limited information, and so we will apply probability theory.

We will first discuss the simplest form of causal inference: a repeated experiment with a single binary variable, which might each time be LEFT or RIGHT. We will hold fixed all known causes of the variable, so that all variation in the results comes from variation in unobserved causes. We observe the results of some past instances of the experiment, and have to predict future ones.

Our model of this situation is a single causal law $y = f(x_1, \dots, x_n)$, where we have no knowledge about the law $f$ or the causes $x_i$. That’s only trivially a causal model—the causal graph has only a single node! However, since causal models are made of causal laws, knowing how to infer those is a prerequisite for inferring more complex models, so we’ll cover them first.

If the experiment is repeated $N$ times, it has $2^N$ possible sequences of results. If we observe some of the results, what we can infer about the others depends on our probability distribution over these sequences, which is determined by our prior information.

With no information other than the fact that there are $2^N$ possible outcomes (call this state of knowledge $H_0$), we would assign the maximum entropy distribution, the one that makes no other assumptions: a uniform probability of $2^{-N}$ to each possible sequence. This hypothesis’ probabilities over the individual results are independent—learning one tells us nothing about the others.

Even after seeing a hundred RIGHT results in a row, the hypothesis of maximum entropy still assigns a probability of $1/2$ to LEFT on the 101st. This is unintuitive—if the hypothesis assumes no information, why does it refuse to learn anything and stubbornly keep to its previous estimate?

The problem is that to see RIGHT on the 101st trial as corresponding to the first 100 being RIGHT, as being “the same result as before”, would itself be prior knowledge relating different trials—knowledge that our brain assumes automatically upon seeing that RIGHT in one trial has the same label as RIGHT in the next.

To truly adopt the mindset of the hypothesis $H_0$, you should imagine instead that the first trial’s possible results are 685 and 466, the second’s are 657 and 589, the third’s are 909 and 596, and so on, all random numbers with no visible pattern. Then it really is obvious that, no matter what results you get, you will still have no information about the next trial, and keep assigning probabilities of 0.5 to each possibility. A more informed reasoner might know some results are RIGHT, RIGHT, RIGHT… but you only see 466, 589, 596, and so on, which is completely useless for prediction.

Of course, in this case we do know it’s the same experiment each time and a RIGHT result is the same kind of result each time. The question, then, is what kind of probability distribution represents this knowledge.

To find out, let’s imagine a generic binary experiment. By hypothesis, the experiment has only two possible endpoints, or at least two distinct clusters of endpoints in state-space. If it’s a physical system that evolves until reaching equilibrium, this means the system has two stable equilibria. The prototypical example of a system with two stable equilibria is the marble in a double bowl:

[Figure: a marble resting in one of two adjacent bowls. This image is from John Wentworth’s post on bistability, where I got this example from.]

We’ll take that as our example binary experiment. The initial state is the marble’s starting horizontal position, and the final state is which bowl it settled down in—the LEFT bowl or the RIGHT bowl. We’ll suppose there is some range of initial positions the marble could be dropped at, and we know nothing else about its initial position, so we assign uniform probability in that range. We don’t know the exact shape of the bowl (the function from initial to final states) or where in the range the marble is dropped each time (the unseen initial state).

The final state is completely determined by whether the marble starts off to the left or to the right of the bowl’s peak. If we knew where the peak was, our probability of the marble going RIGHT would be the fraction of the range of possible starting positions that is to the right of the peak. We’ll call that fraction $f$.

We will assume the peak is in fact somewhere in the starting range—otherwise the experiment wouldn’t really have two possible outcomes. Since we have no other knowledge about the shape of the bowl, it seems reasonable to assign the same probability to the peak being at each possible position, meaning $f$ has uniform prior probability density between 0 and 1.[1]

So, if $H_L$ is the hypothesis outlined above, and $R_k$ means “RIGHT on the $k$th trial”, we have $P(f|H_L) = 1$ for $f$ in $[0, 1]$ and $P(R_k|f, H_L) = f$. That’s a complete model![2] If we know a sequence of past results, we can update our distribution for $f$ using Bayes’ Theorem, then integrate over it to find the expected value of $f$, which is also the probability of the next result being RIGHT.

If we see $N$ results, of which $R$ are RIGHTs, the probability of RIGHT in the next trial is $\frac{R+1}{N+2}$. This result is known as Laplace’s Rule of Succession. Notice that as $N$ grows this value approaches $R/N$, the past frequency of RIGHT.
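
Here’s a quick numerical check of that result, approximating the integral over $f$ with a grid (a sketch, not an efficient implementation):

```python
import numpy as np

def prob_next_right(n, r, grid=100_001):
    """P(RIGHT on the next trial | r RIGHTs in n trials), flat prior on f."""
    f = np.linspace(0, 1, grid)
    posterior = f**r * (1 - f)**(n - r)  # likelihood times the flat prior
    posterior /= posterior.sum()         # normalize
    return float((posterior * f).sum())  # posterior expected value of f

print(prob_next_right(100, 100))  # ~0.990, i.e. 101/102
print(prob_next_right(4, 1))      # ~0.333, i.e. 2/6
```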

What about a more general binary experiment? Well, in that case we might have a higher-dimensional starting space, a complicated prior probability distribution for initial states, and an intricate pattern for which starting states go to which end states. Fortunately, we can ignore most of that if two properties are true: our distributions for the starting state of each trial are independent, and we are ignorant of the fixed start state → end state function. We need to factor our representation of the system into a starting state that varies with no consistent patterns and an outcome function that is fixed but initially unknown.

If we can do this, we can define our $f$ as the fraction of starting probability mass assigned to states that go to the RIGHT end state, and use the same distribution as in the double bowl example. We can conclude that the Rule of Succession applies to many possible experiments, or at least is a reasonable distribution to assign if your prior information consists of “it’s a binary experiment” and nothing else.

Assigning probability to a result equal to its frequency in past trials of the experiment is very intuitive—so much so that there is an entire school of probability based on it, we talk all the time about biased coins as if they had some fixed non-1/2 probability of landing heads (they don’t), and CDCM uses the assumption without bothering to justify it.

However, I decided to take time to show a derivation because the Rule of Succession being the consequence of causal assumptions is a fundamental point about causality, knowing how “frequency → probability” is derived will make it easier to deny it later when it’s no longer true, and I just think this derivation is neat.

All this was model fitting—we made a model with one parameter and figured out how to find the value of that parameter from the data. The next step is model comparison—we have several possible models, and want to find out which is true from the data. If we can do model fitting, Bayes’ Theorem tells us exactly how to compare models—just calculate the likelihood ratio.

For example, suppose we have a sequence of LEFTs and RIGHTs and we’re comparing the two hypotheses we talked about, the uniform distribution over sequences of outcomes $H_0$ and Laplace’s Rule $H_L$, to see which one governs the data.

$H_0$ assigns probability $2^{-N}$ to each possible sequence, of course. If $H_L$ holds and we know $f$, the probability of a sequence with $R$ RIGHTs is $f^R(1-f)^{N-R}$. We can integrate that over $f$ (multiplying by the constant prior probability) to get the probability $H_L$ assigns to the data, which turns out to be $\frac{R!\,(N-R)!}{(N+1)!}$.

This probability is larger if the sequence has a lot of RIGHTs or a lot of LEFTs, and smaller if the amounts are close to even, which is exactly what we’d expect—$H_0$ always assigns probability $1/2$ to RIGHT, so it wins when around half the results are RIGHTs.

For another way to look at it, consider that the amount of possible sequences of $N$ trials containing $R$ RIGHTs is $\binom{N}{R}$, so the probability $H_L$ assigns to getting any sequence of length $N$ with $R$ RIGHTs is… $\frac{1}{N+1}$. The same for each of the $N+1$ possible frequencies of RIGHT, from $0$ RIGHTs to $N$.

Therefore, Laplace’s Rule is equivalent to assigning an equal probability to each possible value of $R$ and spreading out that probability equally among the possible orderings of that amount of RIGHTs and LEFTs. This assigns less probability to the sequences with similar amounts of RIGHTs and LEFTs because there are more of them—there are 252 ways to arrange 5 RIGHTs and 5 LEFTs, but only 10 ways to arrange 9 RIGHTs and 1 LEFT.
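
In code, the whole comparison is a couple of lines (using the closed forms above):

```python
from math import comb

def p_data_H0(n, r):
    return 0.5 ** n                    # uniform over all 2^n sequences

def p_data_HL(n, r):
    return 1 / ((n + 1) * comb(n, r))  # the integral of f^r (1-f)^(n-r) df

for n, r in [(10, 5), (10, 9), (100, 90)]:
    print(f"n={n}, r={r}: H_L:H_0 likelihood ratio =",
          p_data_HL(n, r) / p_data_H0(n, r))
# Roughly even data favors H_0 (ratio ~0.37 at n=10, r=5); skewed data favors
# H_L, overwhelmingly so as n grows (ratio ~7e14 at n=100, r=90).
```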

Another good example of comparing causal models of a repeated experiment with one variable is the case of Wolf’s dice, which is analyzed in two posts by John Wentworth (and the Jaynes paper he got the example from). The dataset in question consists of a large number of results of throwing a pair of dice.

We usually assign $1/6$ probability to each of a die’s faces coming up, independently between throws. You might think that’s just the maximum entropy, zero knowledge hypothesis, but in this case we do know it’s a repeated experiment and we can’t just unlearn this fact. The probabilities being $1/6$ come from additional knowledge about the die’s symmetry, on top of what we knew about a generic experiment.

Specifically, we know that a very small change in the initial state (height, angular momentum, etc.), smaller than what we can control, will change the result. This means the initial state-space is divided into thin “stripes” that go to each of the possible results. We also know the die is symmetric with respect to its faces, so the adjacent “stripes” that go to each result have nearly the same volume. This means that, almost no matter what initial region we choose, for any way we might throw the die, the fraction of possible states that end up at each number will be approximately the same.

That is true if the die is a nearly perfect cube. In the results of Wolf’s dice, the frequencies of each number are very different from $1/6$, falsifying the hypothesis that they are fair dice—that is, the hypothesis we just described assigns a much smaller likelihood to the data than the hypothesis that doesn’t make those assumptions, the six-result analogue of Laplace’s Rule that assigns to each possibility a probability approximately equal to its frequency.

But we can do better. Jaynes hypothesized the dice’s asymmetry came from two specific sources: their length in one dimension being different from the other two due to imperfect cutting, and the pips cut into each face changing their masses. The assumption that the die is symmetric except for those things places constraints on the possible values of the faces’ “objective probabilities” (the $f$-analogues), not enough to completely determine them like in a perfect die, but enough to cut down on the space of possibilities, gaining a large likelihood advantage.

For example, if we take the different length of one dimension to be the only source of asymmetry, we know the two faces that are perpendicular to that dimension have the same frequency $f_1$ and the other four have a different frequency $f_2$. That’s only one parameter (since $2f_1 + 4f_2 = 1$), in contrast to five in the general case where the only constraint is that the frequencies have to add up to 1.

Adding the pips’ asymmetry is a similar process. It means there are more degrees of freedom than we’d have with just the length asymmetry, but that’s necessary—not considering the pip asymmetry, we can’t beat the general, unconstrained hypothesis, but if we do, we can, and by a large likelihood ratio! (For details on the math, see John Wentworth’s posts.)
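
Here is a toy version of that comparison, with made-up counts rather than Wolf’s actual data, flat priors, and only the length asymmetry (so it’s a sketch of the method, not Jaynes’s computation):

```python
import numpy as np
from math import lgamma, log

# Invented counts for faces 1..6: suppose faces 3 and 4, the two perpendicular
# to the badly-cut dimension, come up less often than the other four.
counts = [1720, 1710, 1280, 1290, 1700, 1700]
N = sum(counts)

# Fair die: every face has probability exactly 1/6.
log_fair = N * log(1 / 6)

# Unconstrained hypothesis: flat Dirichlet prior over all six frequencies.
# Marginal likelihood of a specific sequence = 5! * prod(n_i!) / (N+5)!.
log_general = lgamma(6) + sum(lgamma(n + 1) for n in counts) - lgamma(N + 6)

# One-parameter hypothesis: faces 3 and 4 share a frequency f1, the other
# four share f2 = (1 - 2*f1)/4, with a flat prior for f1 on (0, 1/2).
f1 = np.linspace(1e-6, 0.5 - 1e-6, 20_000)
f2 = (1 - 2 * f1) / 4
log_lik = (counts[2] + counts[3]) * np.log(f1) \
        + (N - counts[2] - counts[3]) * np.log(f2)
m = log_lik.max()
log_asym = m + np.log(np.mean(np.exp(log_lik - m)))  # average over the prior

print(log_fair, log_general, log_asym)  # with these counts, asym wins
```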

We’re now basically ready to move on to inference of causal models with multiple variables (the interesting part), but before we do that I need to make a couple of points that will become relevant later.

Let’s go back to the example where we have a binary experiment, and we’re comparing the uniform distribution over sequences $H_0$ to Laplace’s Rule $H_L$. The Technical Explanation of Technical Explanation says this about hypotheses being falsifiable:

Why did the vague theory lose when both theories fit the evidence? The vague theory is timid; it makes a broad prediction, hedges its bets, allows many possibilities that would falsify the precise theory. This is not the virtue of a scientific theory. Philosophers of science tell us that theories should be bold, and subject themselves willingly to falsification if their prediction fails. Now we see why. The precise theory concentrates its probability mass into a sharper point and thereby leaves itself vulnerable to falsification if the real outcome hits elsewhere; but if the predicted outcome is correct, precision has a tremendous likelihood advantage over vagueness.

I ask: which of $H_0$ and $H_L$ is more falsifiable? From what I’ve said so far, you might think it’s $H_L$, because it concentrates more probability in some outcome sequences than in others while $H_0$ gives them all the same probability. But consider this: from the perspective of a believer in $H_L$, $H_0$ predicts as if it were certain that $f = 1/2$, and this narrow prediction gives them a big likelihood advantage if they’re right and dooms them if they’re wrong. Then isn’t $H_0$ more narrow and falsifiable? Or is it subjective, something that depends on how you look at the problem?

It is not. (Or at least, I don’t think so!) The space over which you assign probabilities is defined in the statement of the problem, and in this case it is clearly the sequences of outcomes—the data, what you observe. The latent variable $f$ is a part of the hypothesis $H_L$, not of the problem statement—from $H_0$’s perspective it is completely meaningless. The space over which the distributions can be more or less spread out is the space over which they predict things, the space of outcome sequences.

Technically, this “spreading out” is measured by a distribution’s entropy. Entropy is the amount of bits a hypothesis expects itself to lose by the Technical Explanation’s scoring rule, negated to turn into a positive number. The bigger the entropy, the more uncertain the hypothesis. In this case, $H_0$ is the maximum entropy hypothesis, meaning it’s the “most humble” possible set of assumptions, the one that assumes the least amount of information and so makes the least precise predictions even if it is correct.

This means $H_0$ is more uncertain and $H_L$ is more falsifiable. And that is what we would expect, since $H_L$ is the stronger hypothesis, incorporating more prior knowledge: $H_0$ only knows what results are possible, while $H_L$ knows they’re results of a repeated experiment, and what outcomes are the same kind of result (RIGHT or LEFT).

Of course, if we did know $f$ was $1/2$, like if we were throwing a coin (which is just like the case of the fair die described above), we would have even more information than $H_L$, but we’d go back to making the same predictions as $H_0$. In this case, getting more information increased our uncertainty; that is allowed to happen, you just can’t expect that it will on the average case.

Another point which I was confused about, and so might be helpful to clarify: if $H_0$ is true, $H_L$ quickly determines that $f \approx 1/2$ and then makes approximately the same predictions as it, only losing more probability as its distribution for $f$ narrows down more and more. If $H_L$ is true, $f$ is probably significantly different from $1/2$, meaning $H_0$ keeps losing more and more probability at each trial.

This means if $H_L$ is true, it will gather a large likelihood ratio over $H_0$ (thus falsifying it) very quickly. If $H_0$ is true instead, it will take much longer to falsify $H_L$. This is, however, not evidence that $H_0$ is more falsifiable (in the sense “concentrates more probability”) than $H_L$.

To see that, suppose there is an unknown number between 0 and 999,999, and two hypotheses: a broad hypothesis that assigns an equal probability of 1/1,000,000 to each possibility, and a narrow hypothesis that assigns 1/10 probability to the number being 0 and spreads the remaining 9/10 among the rest, assigning approximately 0.9/1,000,000 probability to each.

In this case, if the broad hypothesis is true, it will get a 10:9 ratio the vast majority of the time, slowly gaining probability. If the narrow hypothesis is true, it will lose 10:9 90% of the time, but gain a ratio of 100,000:1 the 10% of the time where the number ends up being 0, blasting the broad hypothesis down into oblivion.
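
To check those numbers, here are the expected bits gained per observation by each hypothesis when it is true (the KL divergences mentioned in footnote 3):

```python
from math import log2

p_broad = 1 / 1_000_000                   # broad: uniform over the million numbers
p_zero, p_rest = 1 / 10, 0.9 / 1_000_000  # narrow: 1/10 on 0, the rest spread out

# Expected bits per observation if the broad hypothesis is true:
kl_broad = p_broad * log2(p_broad / p_zero) \
         + 999_999 * p_broad * log2(p_broad / p_rest)
# Expected bits per observation if the narrow hypothesis is true:
kl_narrow = p_zero * log2(p_zero / p_broad) + 0.9 * log2(p_rest / p_broad)

print(f"{kl_broad:.3f}")   # ~0.152 bits: the broad hypothesis wins slowly
print(f"{kl_narrow:.3f}")  # ~1.524 bits: the narrow hypothesis wins fast
```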

The reason this happens is that, though we transferred only a small fraction of the probability mass from the other possibilities to 0, there is a huge amount of them, so the probability of 0 ended up much larger than before. A similar thing happens for $H_0$ versus $H_L$: $H_L$ transfers some probability from the huge amount of around-half-and-half possibilities to the tiny amount of skewed possibilities, so the advantage it gets when true is much bigger than the losses it takes when false.

So it is not the case that if a hypothesis is narrow (loses probability in a large part of the possibility space) then it is easily falsified (loses much probability upon seeing a small amount of dissonant observations).[3] When I first read the Technical Explanation, I was fooled into thinking this was true because of the use of the word “falsifiable”, though upon rereading the post I can see it is always used to mean “narrow” and never “easily falsified”.

Inferring Causal Models

In this section we’ll talk about larger, multivariable causal models. As we’ll see, they’re basically several models for one variable glued together. Like before, we’ll observe several repetitions of one experiment, and first do model fitting then model comparison based on them. We’ll again start by discussing the simplest possible case: an experiment with two binary variables, $X$ and $Y$, of which $X$ may or may not causally affect $Y$. We’ll say $X$ can be UP or DOWN and $Y$ can be LEFT or RIGHT.

First suppose $X$ doesn’t affect $Y$, and they’re completely unrelated. (Call that hypothesis $M_1$.) In this case, $X$ and $Y$ are both results of repeated experiments, with some unknown causes and causal rules. We know what to do in this case! We’ll use Laplace’s Rule for both of them independently, with completely uncorrelated parameters $f_X$ and $f_Y$, the frequencies of $X$ being UP and $Y$ being RIGHT. It’s just as if $X$ and $Y$ were the results of two completely separate experiments.

Now suppose $X$ affects $Y$. (Call that hypothesis $M_2$.) That means $X$ and the other causes of $Y$ form the initial state that evolves into $Y$’s possible values. This process can be factored into two functions taking the rest of the initial state to values of $Y$—one for when $X$ is UP, and one for when it’s DOWN. The two values of $X$ push the rest of the causes into different regions of initial-state-space, which evolve to values of $Y$ in different ways. So we’ll use two parameters for $Y$, $f_{\text{DOWN}}$ and $f_{\text{UP}}$, representing the fraction of values of the other causes that make $Y$ RIGHT when $X$ is DOWN and when it’s UP respectively. These parameters, when known, will be the conditional probabilities we assign to $Y$ given each value of $X$.

We could in principle know about some relationship between $f_{\text{DOWN}}$ and $f_{\text{UP}}$, for example if $X$ had only a small effect on the initial state, but to simplify we’ll assume no such knowledge and have independent distributions for them. Then we have two Laplace’s Rules for $Y$ running at once, each of which updates on the trials with one of the values of $X$. Also, since we don’t know anything about $X$’s own causes, we’ll use yet another Laplace’s Rule for it, with an independent parameter $f_X$ for the frequency of $X$ being UP.

To sum it up, the first hypothesis converges to independent probabilities for $X$ and $Y$ equal to their frequencies, and the second converges to probabilities for $X$, and for $Y$ given each value of $X$, equal to their frequencies.

Now, model comparison! Suppose we don’t know whether $X$ is causally unrelated to $Y$ ($M_1$) or it affects $Y$ ($M_2$), and want to find out. Suppose first that $M_2$ is true. Then, after both models narrow down values for their parameters, $M_1$ will make the same prediction for $Y$ every time, while $M_2$ will use its knowledge of the causal relationship to assign different probabilities to $Y$ depending on the value of $X$, making more accurate predictions and accumulating a greater likelihood.

Now suppose $M_1$ is true. Then after the models have narrowed down the values of their parameters, they will make basically the same predictions, because $M_2$ will have narrowed down its parameters $f_{\text{DOWN}}$ and $f_{\text{UP}}$, for the conditional probabilities of $Y$, to the same value—it’s predicting $X$ does have an effect on $Y$, this effect just happens to be the same for either value of $X$! However, $M_2$ didn’t start with this assumption, so it will lose a lot more of its probability mass in finding the (equal) values of its two parameters than $M_1$ does in finding the value of its one parameter for $Y$. This means we do get some update in favor of $M_1$ before these values have been narrowed down.
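
Here is the comparison in code, reusing the closed-form marginal likelihood from the single-variable case (one factor per Laplace’s Rule; the counts below are invented for illustration):

```python
from math import comb

def laplace(n, r):
    """P(a specific sequence of n trials with r successes | one Laplace's Rule)."""
    return 1 / ((n + 1) * comb(n, r))

def p_data_M1(n, n_up, n_right):
    # Independent Laplace's Rules for X and for Y.
    return laplace(n, n_up) * laplace(n, n_right)

def p_data_M2(n, n_up, n_right_given_up, n_right_given_down):
    # One Laplace's Rule for X, plus one for Y under each value of X.
    return laplace(n, n_up) * laplace(n_up, n_right_given_up) \
                            * laplace(n - n_up, n_right_given_down)

# Strong dependence: Y is usually RIGHT when X is UP and LEFT when X is DOWN.
print(p_data_M2(100, 50, 45, 5) / p_data_M1(100, 50, 50))   # ~9e14, M_2 wins big
# No dependence: Y is RIGHT at the same rate under both values of X.
print(p_data_M2(100, 50, 25, 25) / p_data_M1(100, 50, 50))  # ~0.25, M_1 wins a bit
```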

Here is how CDCM describes an equivalent situation:

By the pigeonhole principle (you can’t fit 3 pigeons into 2 pigeonholes) there must be some joint probability distributions which cannot be represented in the first causal structure [in our naming scheme, $M_1$]. This means the first causal structure is falsifiable; there’s survey data we can get which would lead us to reject it as a hypothesis. In particular, the first causal model requires: [in our naming scheme, $P(Y|X) = P(Y)$]

While $M_1$ certainly can be falsified, this seems to be implying that it is in some sense more falsifiable than $M_2$, which is incorrect. If we consider the space of possible joint distributions over $X$ and $Y$, it is true that $M_2$ can converge to any distribution in it while $M_1$ can only converge to the ones where $X$ and $Y$ are independent. However, as we’ve seen, we have to look at the space of possible outcome sequences, not at the space of possible parameter values, to find the hypotheses’ entropy.

Since $M_1$ knows $X$ and $Y$ are results of repeated experiments, it uses Laplace’s Rule to converge to a probability for each of them equal to their frequency. Since it sees them as completely unrelated, those probabilities are independent. Why do we assign independent distributions to things we know of no relation between? Because the distribution that assigns independent probabilities to them has maximum entropy.

If we fix the probabilities of $X$ being UP and $Y$ being RIGHT at some values $p$ and $q$, the maximum entropy joint distribution for $X$ and $Y$ is the one where they’re independent. By the entropy concentration theorem, this implies that, considering only those sequences of results where $X$ being UP has frequency $p$ and $Y$ being RIGHT has frequency $q$, in the vast majority of possibilities, their frequency distributions are approximately independent.
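
For completeness, here is the one-line argument behind that maximum entropy claim (a standard identity, spelled out by me rather than taken from CDCM):

$$H(X, Y) = H(X) + H(Y) - I(X; Y) \le H(X) + H(Y),$$

with equality exactly when the mutual information $I(X; Y)$ is zero, i.e. when $X$ and $Y$ are independent. Fixing the marginal probabilities fixes $H(X)$ and $H(Y)$, so the independent joint distribution is the one with maximum entropy.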

This means saying two variables are independent is not an assumption—the assumption is saying they’re correlated. Since $M_1$ and $M_2$ are really distributions over sequences of values of $X$ and $Y$, $M_1$ has higher entropy than $M_2$! If you think carefully, this makes sense, as the assumptions that $M_1$ makes are “$X$ is a repeated experiment with a common causal law” and “$Y$ is a repeated experiment with a common causal law”, while $M_2$ makes these assumptions plus the assumption that “$X$ is one of the causes of $Y$”, so $M_2$ is a strictly stronger hypothesis. Note also that even if $M_2$ is true, $M_1$ remains calibrated, assigning the correct marginal probabilities for $X$ and $Y$; $M_2$ is just more accurate.

The reason we spent so long comparing the uniform distribution over sequences $H_0$ to Laplace’s Rule $H_L$, in the first place, is that the current case is very similar: $M_2$ is more spread out in parameter space, but really is narrower over the space of observations (as a consequence of its stronger assumptions), meaning it’s the stronger hypothesis. Also like $H_0$ and $H_L$, $M_1$ is falsified more quickly if $M_2$ is true than $M_2$ is falsified if $M_1$ is true, and just like in that case it doesn’t mean $M_1$ is more precise.

For one unexpected application of the comparison between $M_1$ and $M_2$, suppose we have two separate experiments of one binary variable, and we want to know whether they are two instances of the same phenomenon. It turns out the correct probabilities in this situation are the same as the ones in the case we just discussed, with $Y$ as both the experiments’ results and $X$ as the variable “which experiment produced $Y$”![4]

This seemed surprising to me when I found out about it, because “which of two different experiments is it” doesn’t seem like the kind of thing we usually treat as a variable. However, it certainly is a feature of the world that either doesn’t make any difference to $Y$ at all (the assumption for $X$ in $M_1$) or throws $Y$ into two completely unrelated causal laws depending on its value (which is the assumption for $X$ in $M_2$)!

Larger models (of binary variables) work basically the same way as $M_1$ and $M_2$—for each variable, you can use a separate instance of Laplace’s Rule for every possible combination of its causes. You get the patterns of correlation described in CDCM, governed by the rule of d-separation. For example, if A causes B and B causes C, C is correlated with A unless you condition on B; if P and Q both cause R, P and Q are correlated only when you condition on R.
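
Both patterns are easy to see in a simulation (the linear-Gaussian mechanisms below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Chain A -> B -> C: A and C are correlated, but not once we condition on B.
A = rng.normal(size=n)
B = A + rng.normal(size=n)
C = B + rng.normal(size=n)
print(np.corrcoef(A, C)[0, 1])                  # ~0.58
near_b = np.abs(B - 1.0) < 0.05                 # condition on B being close to 1
print(np.corrcoef(A[near_b], C[near_b])[0, 1])  # ~0

# Collider P -> R <- Q: P and Q are uncorrelated, until we condition on R.
P = rng.normal(size=n)
Q = rng.normal(size=n)
R = P + Q + rng.normal(size=n)
print(np.corrcoef(P, Q)[0, 1])                  # ~0
near_r = np.abs(R - 1.0) < 0.05                 # condition on R being close to 1
print(np.corrcoef(P[near_r], Q[near_r])[0, 1])  # ~ -0.5
```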

Comparing these models is also similar—if the true model can have correlations that a false model can’t explain, it will keep gaining bits and falsify the false model; if the true model instead draws strictly fewer correlations than a false model, its parameters’ true values will lie in a lower-dimensional space than the false model’s, so it will keep a larger fraction of its probability mass and gain likelihood over the false model.

This allows us to distinguish some models, but not all. Some models can converge to the exact same set of possible distributions, with the exact same correlations. We were able to distinguish “$X$ causes $Y$” from “$X$ and $Y$ are unrelated”, but alternative hypotheses, such as “$Y$ causes $X$” or “$X$ and $Y$ have a common cause”, are indistinguishable from “$X$ causes $Y$” no matter how many instances of them we observe.

This might make you think these models are equivalent: if they make all the same predictions, there is no reason to say they point to different underlying states of reality. And this would be right if a causal model were only applicable in the situation we’ve been considering, making predictions about a series of cases in each of which the entire model holds.

But as we’ve seen before, models are not so monolithic—they are conjunctions of causal laws, each of which represents its own physical mechanism that may be present in situations the others aren’t. This means even if two models are equivalent, there may be modifications of them that are different. For example, in the case of RCTs, intervention can be used to learn about the causal law of Recovery in a context where we control the causation of Treatment, and then this can be generalized to a wider context where Treatment has its own causal law. I believe this is also how humans learn most causal relations in real life.

For example, take the slippery pavement model above. If we had a model that was almost the same, except that the Rain caused the Season instead of vice versa, then barring intervention this model would be statistically equivalent to the true model. But it’s not like we learned that Rain was affected by the Season specifically by observing thousands of (Season, Sprinkler, Rain, Wet, Slippery) value tuples, drawing correlations and fitting them to models—that would hardly be practical for a human being. We learned it by observing Rain and the Season in a variety of contexts, where maybe different laws affected other parts of the system but the Rain kept the same mechanism, and generalizing.

Before Pearl came along, some people believed that all causality came from interventions—”no causation without manipulation”, they concluded. We now know we can in fact distinguish some possible causal relations without intervention, which contradicts this assertion. Indeed, if this weren’t the case it would be impossible to learn any causality at all, since manipulation just changes one model into another, so if you can’t distinguish any pair of causal structures it wouldn’t help.

It remains the case that a large part of why models are useful in human thought is that we can put them together out of separate parts, change them to match changes in situation, and generalize them from one context to the other. The inability of causal models to determine on their own when those kinds of adaptations are appropriate is, I think, a large part of why you can’t get an AGI just by putting together a large enough causal model. But, after doing the mental work of figuring out which variables and relationships may be at play, we can now use Pearl’s machinery of causal models to find their exact relationships and distinguish between some possibilities we may consider plausible, which means it does have its use.

  1. ^

    There are other reasonable choices of prior, but as long as your prior is relatively flat you will get approximately the same results.

  2. ^

    Well, those equations plus the assumption that the probabilities of different trials are independent given $f$.

  3. ^

    Technically, narrowness is measured by the hypothesis’ entropy, as described. How easily a hypothesis is falsified by another is measured by their KL-divergence, the amount of bits it expects to lose to the other.

  4. ^

    Except that in this case, we’re not uncertain about $X$, but $M_1$ and $M_2$ assign the same probabilities to it anyways, so it doesn’t affect the likelihood ratio.
