Causality, Transformative AI and alignment—part I

TL;DR: transformative AI (TAI) plausibly requires causal models of the world. Thus, a component of AI safety is ensuring secure paths to generating these causal models. We think the lens of causal models might be undervalued within the current alignment research landscape and suggest possible research directions.

This post was written by Marius Hobbhahn and David Seiler. MH would like to thank Richard Ngo for encouragement and feedback.

If you think these are interesting questions and want to work on them or join the project, write to us. We will probably start to play around with GPT-3 soonish. There is certainly stuff we missed, so feel free to send us references if you think they are relevant.

There are already a small number of people working on causality within the EA community. They include Victor Veitch, Zhijing Jin and PabloAMC. Check them out for further insights. There are also other alignment researchers working on causal influence diagrams (authors: Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg) whose work is very much related.

Causality—a working definition:

Just to get this out of the way: we follow a broad definition of causality, i.e. we assume it can be learned from (some) data and doesn’t have to be put into the model by humans. Furthermore, we don’t think the representation has to be explicit, e.g. in a probabilistic model, but could be represented in other ways, e.g. in the weights of neural networks.

But what is it? In a loose sense, you already know: things make other things happen. When you touch a light switch and a light comes on, that’s causality. There is a more technical sense in which no one understands causality, not even Judea Pearl (where does causal information ultimately come from if you have to make causal assumptions to get it? For that matter, how do we get variables out of undifferentiated sense data?). But it’s possible to get useful results without understanding causality precisely, and for our purposes, it’s enough to approach the question at the level of causal models.

Concretely: you can draw circles around phenomena in the world (like “a switch” and “a lightbulb”) to make them into nodes in a graph, and draw arrows between those nodes to represent their causal relationships (from the switch to the lightbulb if you think the switch causes the lightbulb to turn on, or from the lightbulb to the switch if you think it’s the other way around).

There’s an old Sequences post that covers the background in more detail. The key points for practical purposes are that causal models:

  1. Are sparse, and thus easy to reason about and make predictions with (or at least, easier to reason about than the joint distribution over all your life experiences).

  2. Can be segmented by observations. Suppose you know that the light switch controls the flow of current to the bulb and that the current determines whether the bulb is on or off. Then, if you observe that there’s no current in the wire (maybe there’s a blackout), you don’t need to know anything about the state of the switch to know the state of the bulb.

  3. Can evaluate counterfactuals. If the light switch is presently off, but you want to imagine what would happen if it were on, your causal model can tell you (insofar as it’s correct).
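To make points 2 and 3 concrete, here is a minimal sketch of the light switch example as a toy causal model in Python. The `power` variable (whether there is a blackout) and all function names are illustrative assumptions, and the mechanisms are deterministic only to keep the example short.

```python
from itertools import product

def current(switch_on, power_on):
    """Current flows only if the switch is on and the grid has power (assumed mechanism)."""
    return switch_on and power_on

def bulb(current_on):
    """The bulb depends on the current alone."""
    return current_on

# Enumerate every possible world of this toy model.
worlds = [
    {"switch": s, "power": p, "current": current(s, p), "bulb": bulb(current(s, p))}
    for s, p in product([False, True], repeat=2)
]

def bulb_states(current_on, switch_on):
    """Bulb states across all worlds consistent with the two observations."""
    return {
        w["bulb"]
        for w in worlds
        if w["current"] == current_on and w["switch"] == switch_on
    }

# Point 2: once we observe "no current", the switch tells us nothing more.
no_current_switch_off = bulb_states(current_on=False, switch_on=False)
no_current_switch_on = bulb_states(current_on=False, switch_on=True)

def counterfactual_bulb(observed, switch_on):
    """Point 3: keep the observed background (power) and set the switch by hand."""
    return bulb(current(switch_on, observed["power"]))

observed = {"switch": False, "power": True}
print(no_current_switch_off, no_current_switch_on)
print(counterfactual_bulb(observed, switch_on=True))
```

Holding `power` fixed while changing `switch` is what makes the last call a counterfactual rather than a fresh observation.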

Why does causality matter?

Causal information has two main advantages over correlational information. For the following section, I got help from a fellow Ph.D. student.

1. Data efficiency

Markov factorization: Mathematically speaking, Markov factorization ensures conditional independence between some nodes given other nodes. In practice, this means that if we assume a causal structure, we can write the joint probability distribution as a sparse graph in which only some nodes are connected.

“Namely, if we have a joint with n binary random variables, it would have 2^n − 1 independent parameters (the last one is determined to make the sum equal to 1). If we have k factors with n/k variables each, then we would have k(2^(n/k) − 1) independent parameters. For n=20 and k=4, the numbers are 1048575 vs. 124.” (Patrik Reizinger)
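The arithmetic in the quote can be checked with a few lines of Python. This is only a sketch of the two formulas as stated; the function names are ours, and the equal split into k factors is the simplifying assumption made in the quote. For n=20 and k=4 the formulas evaluate to 2^20 − 1 = 1048575 and 4 · (2^5 − 1) = 124 (2^20 itself is 1048576).

```python
def joint_params(n: int) -> int:
    """Independent parameters of a full joint over n binary variables: 2^n - 1."""
    return 2 ** n - 1

def factorized_params(n: int, k: int) -> int:
    """Independent parameters for k independent factors of n/k binary variables each."""
    if n % k != 0:
        raise ValueError("n must be divisible by k for equal-sized factors")
    return k * (2 ** (n // k) - 1)

n, k = 20, 4
print(joint_params(n), factorized_params(n, k))
```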

Independent Mechanisms: the independent mechanisms principle ensures that factors do not influence each other. Therefore, if we observe shifts in our data distribution, we only need to retrain a few parts of the model. If we observe global warming, for example, the vast majority of physics stays the same. We only need to recalibrate some parts of our model that relate to temperature and climate. Another example is the lightbulb blackout scenario from above. If you know there is a blackout, you don’t need to flip the switch to know that the light won’t turn on.

The conclusion of these two statements is that correlational models assume a lot more relations between variables than causal models and the entire model needs to be retrained every time the data changes. In causal models, however, we usually only need to retrain a small number of mechanisms. Therefore, causal models are much more sample efficient than correlational ones.
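A minimal sketch of the retraining argument, reusing the switch and bulb setting. The sample lists and probabilities are made-up illustrative values, and treating the switch and the power supply as independent causes is an assumption of this toy example.

```python
from itertools import product

def fit_probability(samples):
    """Estimate P(x = True) from a list of booleans."""
    return sum(samples) / len(samples)

def bulb_mechanism(switch_on, power_on):
    """Stable mechanism: the bulb is lit only if the switch is on and there is power."""
    return switch_on and power_on

def p_bulb_on(p_switch, p_power):
    """P(bulb on), summing over all worlds and weighting each by its two factors."""
    total = 0.0
    for switch_on, power_on in product([False, True], repeat=2):
        weight = (p_switch if switch_on else 1 - p_switch) * (
            p_power if power_on else 1 - p_power
        )
        if bulb_mechanism(switch_on, power_on):
            total += weight
    return total

# Made-up observations before the shift.
p_switch = fit_probability([True, False, True, False])
p_power = fit_probability([True] * 9 + [False])
before = p_bulb_on(p_switch, p_power)

# Distribution shift: blackouts become common. Only the power factor is re-fit;
# p_switch and bulb_mechanism are reused unchanged.
p_power_shifted = fit_probability([True] * 5 + [False] * 5)
after = p_bulb_on(p_switch, p_power_shifted)

print(f"P(bulb on) before shift: {before:.2f}, after shift: {after:.2f}")
```

A model that stored only the joint table over switch, power and bulb would have to re-estimate every entry after the shift, whereas here a single number changes.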

2. Action guiding

Causal models make a very strong assumption: variables are not just related, they are related in a directed way. Thus, causal models imply a testable hypothesis. If our causal model is that taking a specific drug reduces the severity of a disease, then we can test this with an RCT. So our model, drug → disease, is a falsifiable hypothesis.

The same is not possible for correlational models. If we say that drug intake correlates with the severity of the disease, we leave open whether the drug helps with the disease, whether people with less severe disease take more of the drug, or whether both depend on a third variable. As soon as we intervene by fixing one variable and observing the other, we have already made a causal assumption.

Correlational knowledge can still be used for actions—you can still take the drug and hope the causal arrow goes in the right direction. But it could also have a different effect than desired since you don’t know which variable is the cause and which one is the effect.
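The third-variable case can be illustrated with a small simulation. This is a sketch under invented assumptions: a hidden `healthy` variable drives both drug intake and severity, the drug itself has no effect at all, and every probability is made up. Observational data then shows a strong link between drug and severity, while a simulated RCT (drug assigned by coin flip) shows roughly none.

```python
import random

def sample_patient(rng, assign_drug=None):
    """Return (drug, severe) for one simulated patient.

    If assign_drug is None the patient chooses (observational data);
    otherwise the drug is set from outside (an intervention).
    """
    healthy = rng.random() < 0.5  # hidden third variable
    if assign_drug is None:
        drug = rng.random() < (0.8 if healthy else 0.2)
    else:
        drug = assign_drug
    # Severity depends on health only; the drug does nothing in this toy world.
    severe = rng.random() < (0.2 if healthy else 0.8)
    return drug, severe

def severity_gap(patients):
    """P(severe | drug) minus P(severe | no drug), estimated from samples."""
    treated = [severe for drug, severe in patients if drug]
    untreated = [severe for drug, severe in patients if not drug]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

rng = random.Random(0)
n = 20_000
observational = [sample_patient(rng) for _ in range(n)]
rct = [sample_patient(rng, assign_drug=rng.random() < 0.5) for _ in range(n)]

observational_gap = severity_gap(observational)
rct_gap = severity_gap(rct)
print(f"observational gap: {observational_gap:+.2f}, RCT gap: {rct_gap:+.2f}")
```

Someone acting on the observational gap alone would take the drug and see no benefit, which is the failure mode described above.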

Causal models greatly improve the ability of models to make decisions and interact with their environment. Therefore we think it is highly plausible that transformative AI will have some causal model of the world. Due to the rise of data-driven learning, we expect this model to be learned from data, but we could also imagine some human interference or inductive biases.

Overall, we think that the thesis that causality matters for TAI is not very controversial but we think there are a lot of implications for AI safety that are not yet fully explored.

Questions & Implications for AI safety:

If the causal models in ML algorithms have a large effect on their actions/predictions, we should really understand how they work. Some considerations include:

  1. Which causal models do current ML architectures have? Does GPT-3 have a causal model of the world and how can we find out? Can we find sets of prompts that give us relevant information about this question? Can interpretability tell us something about the internal causal model?
    If our ML model has learned a slightly wrong causal model of the world, it will make incorrect predictions on data points outside of the training distribution. Therefore it seems relevant to understand which kind of model the algorithm is acting on. This is a subcategory of alignment and interpretability.

  2. What are the inductive biases of causal models? Do classification networks learn causality and do they even need to? We know from interpretability that they learn associations, but is it more “If structure X is in the image then Y” or “Structure X and label Y seem related”? Which inductive biases do LLMs have wrt causality? Do RL architectures automatically learn causality because they intervene?
    If we could say, for example, with higher certainty whether LLMs create internal causal (vs. correlational) models of the world, they might be easier to control and we could get higher certainty about their predictions.

  3. Do we need interventions to learn causal models efficiently? It seems intuitively plausible that interventions speed up learning but they are not strictly necessary. Economists, for example, use natural experiments to derive causal conclusions from observational data. While this is certainly nice, we don’t know whether a lot of observational data is sufficient to build large causal world models.
    We are scared of ML algorithms increasingly interacting with the real world because if the interventions go wrong they can do a lot of harm. GPT-3 recently got hooked up to Google and we expect someone to be mad enough to give it even more access to interventions on the internet. If there were a non-interventional way to get similar results, we would certainly prefer that.

  4. What is the difference in resource efficiency between humans and current ML algorithms? It is plausible that humans need less data to learn a new task than training current ML models from scratch. However, it is unclear how large that difference is when models are pre-trained to a level comparable to the pretraining humans get from evolution. If we compare, for example, the time it takes humans to beat OpenAI Five with the time it takes to train OpenAI Five to beat these strategies again, we might get closer to the difference in resource efficiency. Some people have asked whether GPT-3 is already sample-efficient (for fitting new data after pretraining). This could also be explored further.
    Having a better understanding of this difference in training efficiency might give us more insight into the quality of the world model of current algorithms.

  5. A worry: Our intuition is that humans have a bias to overidentify causality, i.e. see causality when it is not necessarily given. This might have been a good survival strategy for our ancestors since not identifying a causal mechanism is likely more deadly than incorrectly identifying one. However, in today’s complex world, this bias might be inappropriate. Just think about how many different stories of causal mechanisms are told after any election, most of which are simplistic and monocausal, such as “Hillary lost because of X”.
    Our worry is that ML researchers, once they figure out how, will introduce a similar “overidentifying causality” inductive bias into models. This would mean that very powerful models with potentially big impacts have the causal model of a political pundit rather than a scientist.
    Furthermore, since language models are trained on text that is generated by humans, they might just learn this bias on their own. Then, GPT-n would be as useless as the average political analysis.

What now?

We ask a lot of questions but don’t have many answers. Thus, we think the highest priority is to get a clearer picture, e.g. refine the questions, translate them into testable hypotheses and read more work from other scientists working on causality.

We think that reasonable first steps could be:

  1. Investigate GPT-3 wrt causality. BigBench is an effort to benchmark LLMs and it includes some questions about causality. But there are certainly more questions one could ask.

  2. Summarize the literature on causality from an AI safety perspective. The field of causality is large and scattered across ML, economics, and physics. Just collecting and summarizing the different findings from an AI safety perspective seems like a promising start.

  3. Think about inductive biases and causality. Which models even allow for causal models? Which ones necessarily lead to them? Even high-level considerations without mathematical proofs might already be helpful.

  4. Summarize the literature on animals learning causal models. Surely some scientists have explored this question already; we just have to find them. Maybe it tells us something about AI.
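For the first step, a probing setup could start as small as the sketch below. Everything in it is hypothetical: the three prompts are invented examples (not taken from BigBench), `model` stands for any function that maps a prompt string to an answer string, and `always_yes` is a stand-in baseline rather than a real language model. No real API is called.

```python
from typing import Callable, List, Tuple

# Hypothetical probes: each asks about an intervention where a purely
# correlational reading of the scenario suggests the wrong answer.
PROBES: List[Tuple[str, str]] = [
    ("A rooster crows every morning just before sunrise. "
     "If we silence the rooster, will the sun still rise? Answer yes or no.", "yes"),
    ("A thermometer shows a high reading on hot days. "
     "If we warm the thermometer with a hair dryer, will the day get hotter? "
     "Answer yes or no.", "no"),
    ("Ice cream sales and drownings both rise in summer because of hot weather. "
     "If we ban ice cream, will drownings fall as a result? Answer yes or no.", "no"),
]

def score(model: Callable[[str], str], probes: List[Tuple[str, str]]) -> float:
    """Fraction of probes where the model's answer starts with the expected word."""
    correct = sum(
        model(prompt).strip().lower().startswith(expected)
        for prompt, expected in probes
    )
    return correct / len(probes)

def always_yes(prompt: str) -> str:
    """Stand-in baseline that ignores the prompt entirely."""
    return "Yes."

baseline_score = score(always_yes, PROBES)
print(f"always-yes baseline: {baseline_score:.2f}")
```

A trivial baseline like this is useful mainly as a floor: a model that cannot beat it on a larger probe set gives little evidence of an internal causal model.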

If you think these are interesting questions and want to work on them, reach out. We will probably start to play around with GPT-3 soon. There is certainly research we missed. Feel free to send us references if you think they are relevant.

Causality is not everything

We don’t want this to be another piece along the lines of “AI truly needs X to be intelligent” where X might be something vague like understanding/creativity/etc. We have the hunch that causality might play a role in transformative AI and feel like it is currently underrepresented in the AI safety landscape. No more, no less.

Furthermore, we don’t need a causal model of everything. Correlations are often sufficient. For example, if you hear an alarm, you don’t need to know exactly what caused the alarm to be cautious. But knowing whether the alarm was caused by fire or by an earthquake will determine what the optimal course of action is.

So we don’t think humans need to have a causal model of everything, and neither do AIs, but at least for safety-relevant applications we should look into it more deeply.


Causality might be one interesting angle for AI safety but certainly not the only one. However, there are a ton of people in classic ML who think that causality is the missing piece to AGI. They could be completely wrong but we think it’s at least worth exploring from an AI safety lens.

In this post, we outlined why causality might be relevant for TAI, which kind of questions might be relevant and how we could start answering them.


Is there a clear distinction between causality and correlation?

Some people will see our definition as naive and simplistic. Maybe there is no such thing as causality and it’s all just different shades of correlation. Maybe all causal models are wrong and humans see something that isn’t there. Maybe, maybe, maybe.

There is no hard evidence for consciousness, and philosophical zombies that act just as if they were conscious but truly aren’t could exist. Similarly, all causal claims could be explained with a lot of correlations and luck. But as argued, e.g. by Eliezer, Occam’s razor makes the existence of some sort of consciousness much more likely than its absence, and by the same logic it makes causality more likely than its absence.