# Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

*Authors sorted alphabetically.*

Summary: This post introduces causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via *behavior-preserving resampling ablations*. We apply this method to develop a refined understanding of how a small language model implements induction and how an algorithmic model correctly classifies if a sequence of parentheses is balanced.

# 1 Introduction

A question that all mechanistic interpretability work must answer is, “How well does this interpretation explain the phenomenon being studied?” In many recent papers in mechanistic interpretability, researchers have generally relied on ad hoc methods to evaluate the quality of interpretations.^{[1]}

This *ad hoc* nature of existing evaluation methods poses a serious challenge for scaling up mechanistic interpretability. Currently, to evaluate the quality of a particular research result, we need to deeply understand both the interpretation and the phenomenon being explained, and then apply researcher judgment. Ideally, we’d like to find the interpretability equivalent of property-based testing: automatically checking the correctness of interpretations, instead of relying on grit and researcher judgment. More systematic procedures would also help us scale up interpretability efforts to larger models, to behaviors with subtler effects, and to larger teams of researchers. To help with these efforts, we want a procedure that is both powerful enough to finely distinguish better interpretations from worse ones, and general enough to be applied to complex interpretations.

In this work, we propose **causal scrubbing**, a systematic ablation method for testing precisely stated hypotheses about how a particular neural network^{[2]} implements a behavior on a dataset. Specifically, given an informal hypothesis about which parts of a model implement the intermediate calculations required for a behavior, we convert this to a formal correspondence between a computational graph for the model and a human-interpretable computational graph. Then, causal scrubbing starts from the output and recursively finds all of the invariances of parts of the neural network that are implied by the hypothesis, and then replaces the activations of the neural network with the *maximum entropy*^{[3]} distribution subject to certain natural constraints implied by the hypothesis and the data distribution. We then measure how well the scrubbed model implements the specific behavior.^{[4]} Insofar as the hypothesis explains the behavior on the dataset, the model’s performance should be unchanged.

Unlike previous approaches that were specific to particular applications, causal scrubbing aims to work on a large class of interpretability hypotheses, including almost all hypotheses interpretability researchers propose in practice (that we’re aware of). Because the tests proposed by causal scrubbing are mechanically derived from the proposed hypothesis, causal scrubbing can be incorporated “in the inner loop” of interpretability research. For example, starting from a hypothesis that makes very broad claims about how the model works and thus is consistent with the model’s behavior on the data, we can iteratively make hypotheses that make more specific claims while monitoring how well the new hypotheses explain model behavior. We demonstrate two applications of this approach in later posts: first on a parenthesis balance checker, then on the induction heads in a two-layer attention-only language model.

We see our contributions as the following:

We formalize a notion of interpretability hypotheses that can represent a large, natural class of mechanistic interpretations;

We propose an algorithm, *causal scrubbing*, that tests hypotheses by systematically replacing activations in all ways that the hypothesis implies should not affect performance; and

We demonstrate the practical value of this approach by using it to investigate two interpretability hypotheses for small transformers trained in different domains.

This is the main post in a four-post sequence, and covers the most important content:

What is causal scrubbing? Why do we think it’s more principled than other methods? (sections 2-4)

A summary of our results from applying causal scrubbing (section 5)

Discussion: Applications, Limitations, Future work (sections 6 and 7).

In addition, there are three posts with information of less general interest. The first is a series of appendices to the content of this post. Then, a pair of posts covers the details of what we discovered applying causal scrubbing to a paren-balance checker and induction in a small language model.^{[5]} They are collected in a sequence here.

## 1.1 Related work

**Ablations for Model Interpretability:** One commonly used technique in mechanistic interpretability is the “ablate, then measure” approach. Specifically, for interpretations that aim to explain why the model achieves low loss, it’s standard to remove parts that the interpretation identifies as important and check that model performance suffers, or to remove unimportant parts and check that model performance is unaffected. For example, in Nanda and Lieberum’s Grokking work, to verify the claim that the model uses certain key frequencies to compute the correct answer to modular addition questions, the authors confirm that zero ablating the key frequencies greatly increases loss, while zero ablating random other frequencies has no effect on loss. In Anthropic’s Induction Head paper, they remove the induction heads and observe that this reduces the ability of models to perform in-context learning. In the IOI mechanistic interpretability project, the authors define the behavior of a transformer subcircuit by mean-ablating everything except the nodes from the circuit. This is used to formulate criteria for validating that the proposed circuit preserves the behavior they investigate and includes all the redundant nodes performing a similar role.

Causal scrubbing can be thought of as a generalized form of the “ablate, then measure” methodology.^{[6]} However, unlike the standard zero and mean ablations, we ablate modules by resampling activations from *other* inputs (which we’ll justify in the next post). In this work, we also apply causal scrubbing to more precisely measure different mechanisms of induction head behavior than in the Anthropic paper.

**Causal Tracing:** Like causal tracing, causal scrubbing identifies computations by patching activations. However, causal tracing aims to *identify* a specific path (“trace”) that contributes causally to a particular behavior by corrupting all nodes in the neural network with noise and then iteratively denoising nodes. In contrast, causal scrubbing tries to solve a different problem: systematically *testing* hypotheses about the behavior of a whole network by removing (“scrubbing away”) every causal relationship that should not matter according to the hypothesis being evaluated. In addition, causal tracing patches with (homoscedastic) Gaussian noise and not with the activations of other samples. Not only does this take your model off distribution, it might have no effect in cases where the scale of the activation is much larger than the scale of the noise.

**Heuristic explanations:** This work takes a perspective on interpretability that is strongly influenced by ARC’s work on “heuristic explanations” of model behavior. In particular, causal scrubbing can be thought of as a form of defeasible reasoning: unlike mathematical proofs (where if you have a proof for a proposition P, you’ll never see a better proof for the negation of P that causes you to overall believe P is false), we expect that in the context of interpretability, we need to accept arguments that might be overturned by future arguments.

# 2 Setup

We assume a dataset $D$ over a domain $\mathcal{X}$ and a function $f: \mathcal{X} \to \mathbb{R}$ which captures a behavior of interest. We will then explain the expectation of this function on our dataset, $\mathbb{E}_{x \sim D}[f(x)]$.

This allows us to explain behaviors of the form “a particular model $M$ gets low loss on a distribution $D$.” To represent this we include the labels in $\mathcal{X}$ and both the model and a loss function in $f$:

$$f(x, y) = \text{loss}(M(x), y)$$
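As a minimal sketch of this setup, the behavior being explained is just the mean of a per-datum function over the dataset. The toy `model`, `loss`, and dataset below are illustrative assumptions, not anything taken from our experiments:

```python
from statistics import mean

# Hypothetical toy dataset: each datum is (input, label).
D = [((5, 6, 7), True), ((8, 0, 4), True), ((1, 2, 9), False), ((0, 3, 3), False)]

def model(x):
    """Stand-in model: predicted probability that the label is True."""
    return 0.9 if (x[0] > 3 or x[1] > 3) else 0.1

def loss(p_true, y):
    """Squared error between the predicted probability and the 0/1 label."""
    return (p_true - float(y)) ** 2

def f(datum):
    """The behavior function: the model's loss on one labeled datum."""
    x, y = datum
    return loss(model(x), y)

# The quantity a hypothesis is asked to explain: the expectation of f over D.
expected_behavior = mean(f(d) for d in D)
```

Including the label in each datum is what lets a single function of the datum express "the model gets low loss".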

We also want to explain behaviors such as “if the prompt contains some bigram `AB` and ends with the token `A`, then the model is likely to predict `B` follows next.” We can do this by choosing a dataset $D$ where each datum has the prompt `...AB...A` and expected completion `B`.
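One way to picture such a dataset is as a filter over (prompt, completion) pairs. The helper below is a hypothetical sketch (the name `is_induction_example` and the word-level tokens are assumptions for illustration) that checks whether a datum has the `...AB...A` → `B` shape:

```python
def is_induction_example(prompt_tokens, completion):
    """True if the prompt ends with some token A and the bigram (A, completion)
    already occurs earlier in the prompt."""
    last = prompt_tokens[-1]
    bigrams_in_prompt = set(zip(prompt_tokens[:-1], prompt_tokens[1:]))
    return (last, completion) in bigrams_in_prompt

# Hypothetical tokenized prompt: "the cat" appears, and the prompt ends with "the".
prompt = ["the", "cat", "sat", "on", "the"]
```

Keeping only the data for which this check passes gives a dataset on which the induction behavior is well defined.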

We then propose a hypothesis about how this behavior is implemented. Formally, a *hypothesis* $h = (G, I, c)$ for $f$ is a tuple of three things:

A computational graph $G$^{[7]}, which implements the function $f$. We require $G$ to be *extensionally equal* to $f$ (equal on *all* of $\mathcal{X}$).

A computational graph $I$, intuitively an ‘interpretation’ of the model.

A correspondence function $c$ from the nodes of $I$ to the nodes of $G$. We require $c$ to be an injective graph homomorphism: that is, if there is an edge $(u, v)$ in $I$ then the edge $(c(u), c(v))$ must exist in $G$.

We additionally require $I$ and $G$ to each have a single input and output node, where $c$ maps input to input and output to output. All input nodes are of type $\mathcal{X}$, which allows us to evaluate both $G$ and $I$ on all of $\mathcal{X}$.
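The two structural requirements on the correspondence (injectivity and edge preservation) can be checked mechanically. The following is a rough sketch under assumed representations: graphs are sets of directed edges between node names, the correspondence is a plain dict, and the node names are hypothetical. It does not check the single input/output requirement.

```python
def is_valid_correspondence(edges_I, edges_G, c):
    """Check that c (a dict from nodes of I to nodes of G) is injective and
    maps every edge of I onto an edge of G."""
    injective = len(set(c.values())) == len(c)
    edges_preserved = all((c[u], c[v]) in edges_G for (u, v) in edges_I)
    return injective and edges_preserved

# Hypothetical graphs: the model has a node C that the interpretation omits.
edges_G = {("x", "A"), ("x", "B"), ("x", "C"),
           ("A", "D"), ("B", "D"), ("C", "D"), ("D", "out")}
edges_I = {("x'", "A'"), ("x'", "B'"), ("A'", "D'"), ("B'", "D'"), ("D'", "out'")}
c = {"x'": "x", "A'": "A", "B'": "B", "D'": "D", "out'": "out"}
```

Nodes of the model that are not in the image of the correspondence (here `C`) are exactly the ones the hypothesis treats as unimportant.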

Here is an example hypothesis:

In this figure, we hypothesize that $G$ works by having A compute whether the first component of the input is greater than 3, B compute whether the second component is greater than 3, and then ORing those values. Then we’re asserting that the behavior is explained by the relationship between D and the true label $y$.

A couple of important things to notice:

We will often rewrite the computational graph of the original model implementation into a more convenient form (for instance splitting up a sum into terms, or grouping together several computations into one).

You can think of $I$ as a heuristic^{[8]} that the hypothesis claims that the model uses to achieve the behavior. It’s possible that the heuristic is imperfect and will sometimes disagree with the label $y$. In that case our hypothesis would claim that the model should be incorrect on these inputs.

Note that the mapping $c$ doesn’t tell you how to translate a value of $I$ into an activation, only which nodes correspond.

We will call $c(I)$ the “important nodes” of $G$.^{[9]}

Let $n_I$, $n_G$ be nodes in $I$ and $G$ respectively such that $c(n_I) = n_G$.

Intuitively this is a claim that when we evaluate both $G$ and $I$ on the same input, then the value of $n_G$ (usually an activation of the model) ‘represents’ the value of $n_I$ (usually a simple feature of the input).

The causal scrubbing algorithm will test a weaker claim: that the equivalence classes on inputs to $n_I$ are the same as the equivalence classes on inputs to $n_G$. We think this is sufficient to meaningfully test the mechanistic interpretability hypotheses we are interested in, although it is not strong enough to eliminate all incorrect hypotheses.

Among other things, the hypothesis claims that nodes of $G$ that are not mapped to by $c$ are unimportant for the behavior under investigation.^{[10]}

Hypotheses are covered in more detail in the appendix.
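To make the "equivalence classes" in the weaker claim concrete: an interpretation node partitions the dataset by the value it takes, and the hypothesis claims that inputs in the same class are interchangeable for the corresponding model node. A small sketch, with a hypothetical dataset and a feature matching the running example:

```python
from collections import defaultdict

def equivalence_classes(D, node_value):
    """Partition D by the value an interpretation node takes on each input."""
    classes = defaultdict(list)
    for x in D:
        classes[node_value(x)].append(x)
    return dict(classes)

# Hypothetical data and interpretation node: "is the first component > 3?"
D = [(5, 6, 7), (8, 0, 4), (1, 2, 9), (0, 3, 3)]
a_prime = lambda x: x[0] > 3
classes = equivalence_classes(D, a_prime)
```

Here `(5, 6, 7)` and `(8, 0, 4)` fall in the same class, so the hypothesis permits swapping the corresponding model activation between them.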

# 3 Causal Scrubbing

In this section we provide two different explanations of causal scrubbing:

An informal description of the activation-replacements that a hypothesis implies are valid. We try to provide a helpful introduction to the core idea of causal scrubbing via many diagrams; and

A description of the causal scrubbing algorithm itself, together with pseudocode.

Different readers of this document have found different explanations to be helpful, so we encourage you to skip around or skim some sections.

Our goal will be to define a metric $E_{\text{scrubbed}}(h, D)$ by recursively sampling activations that should be equivalent according to each node of the interpretation $I$. We then compare this value to $\mathbb{E}_{x \sim D}[f(x)]$. If a hypothesis is (reasonably) accurate, then the activation replacements we perform should not alter the loss and so we’d have $E_{\text{scrubbed}}(h, D) \approx \mathbb{E}_{x \sim D}[f(x)]$. Overall, we think that this difference will be a reasonable proxy for the *faithfulness* of the hypothesis, that is, how accurately the hypothesis corresponds to the “real reasons” behind the model behavior.^{[11]}

## 3.1 An informal description: What activation replacements does a hypothesis imply are valid?

Consider a hypothesis $h = (G, I, c)$ on the graphs below, where $c$ maps to the corresponding nodes of $G$ highlighted in green:

This hypothesis claims that the activations A and B respectively represent checking whether the first and second component of the input is greater than 3. Then the activation D represents checking whether either of these conditions were true. Both the third component of the input and the activation of C are unimportant (at least for the behavior we are explaining, the log loss with respect to the label $y$).

If this hypothesis is true, we should be able to perform two types of ‘resampling ablations’:

replacing the activations of A, B, and D with the activations on other inputs that are “equivalent” under $I$; and

replacing the activations that are claimed to be unimportant for a particular path (such as C, or the first component of the input into B) with their activation on any other input.

To illustrate these interventions, we will depict a “treeified” version of $G$ where every path from the input to output of $G$ is represented by a different copy of the input. Replacing an activation with one from a different input is equivalent to replacing all inputs in the subtree upstream of that activation.

### Intervention 1: semantically equivalent subtrees

Consider running the model on two inputs $x_{\text{ref}}$ = (5, 6, 7, True) and $x_{\text{new}}$ = (8, 0, 4, True). The value of A’ is the same on both $x_{\text{ref}}$ and $x_{\text{new}}$. Thus, if the hypothesis depicted above is correct, the output of A on both these is equivalent. This means when evaluating $G$ on $x_{\text{ref}}$ we can replace the activation of A with its value on $x_{\text{new}}$, as depicted here:

To perform the replacement, we replaced all of the inputs upstream of A in our treeified model. (We could have performed this replacement with any other input that agrees on A’.)
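The replacement described above can be sketched on a stand-in model. The functions below are assumptions chosen so that the hypothesis happens to be true (A and B are real-valued activations whose sign carries the feature, and the output ignores C); the labels are dropped from the inputs for brevity:

```python
def A(x): return x[0] - 3   # hypothesis: represents "first component > 3"
def B(x): return x[1] - 3   # hypothesis: represents "second component > 3"
def C(x): return x[2] - 3   # hypothesis: unimportant

def model_output(a, b, c_act):
    """Node D of the toy model: fires if either A or B is positive."""
    return a > 0 or b > 0

def run_with_A_patched(x_ref, x_other):
    """Evaluate the model on x_ref, but take A's activation from x_other."""
    return model_output(A(x_other), B(x_ref), C(x_ref))

x_ref, x_new = (5, 6, 7), (8, 0, 4)          # agree on A' (both > 3)
unpatched = model_output(A(x_ref), B(x_ref), C(x_ref))
patched = run_with_A_patched(x_ref, x_new)

# A swap the hypothesis does NOT permit: the inputs disagree on A'.
x_ref2, x_disagree = (5, 0, 7), (1, 0, 4)
forbidden = run_with_A_patched(x_ref2, x_disagree)
```

The permitted swap leaves the output unchanged, while the forbidden one flips it, which is why only inputs that agree on A’ are eligible.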

Our hypothesis permits many other activation replacements. For example, we can perform this replacement for D instead:

### Intervention 2: unimportant inputs

The other class of intervention permitted by $h$ is replacement of any inputs to nodes in $G$ that $h$ suggests aren’t semantically important. For example, $h$ says that the only important input for A is the first component of the input. So the model’s behavior should be preserved if we replace the activations for the second and third components (or, equivalently, change the input that feeds into these activations). The same applies for the first and third components into B. Additionally, $h$ says that D isn’t influenced by C, so arbitrarily resampling all the inputs to C shouldn’t impact the model’s behavior.

Pictorially, this looks like this:

Notice that we are making 3 different replacements with 3 different inputs simultaneously. Still, if $h$ is accurate, we will have preserved the important information and the output of $G$ should be similar.

The causal scrubbing algorithm involves performing both of these types of intervention many times. In fact, we want to maximize the number of such interventions we perform on every run of $G$ – to the extent permitted by $h$.
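The three simultaneous replacements can be sketched as follows, again on an assumed stand-in model for which the hypothesis is true. Each of A, B, and C receives its unimportant inputs from a different randomly chosen datum:

```python
import random

def A(x): return x[0] - 3   # only reads the first component
def B(x): return x[1] - 3   # only reads the second component
def C(x): return x[2] - 3   # claimed to be unimportant

def model_output(a, b, c_act):
    """Node D of the toy model: ignores C entirely."""
    return a > 0 or b > 0

def scrub_unimportant_inputs(x_ref, D, rng):
    """Run the model on x_ref with every claimed-unimportant input resampled."""
    other_a, other_b, other_c = rng.choice(D), rng.choice(D), rng.choice(D)
    a_input = (x_ref[0], other_a[1], other_a[2])   # keep only component 1
    b_input = (other_b[0], x_ref[1], other_b[2])   # keep only component 2
    return model_output(A(a_input), B(b_input), C(other_c))

# Hypothetical dataset of unlabeled inputs.
D = [(5, 6, 7), (8, 0, 4), (1, 2, 9), (0, 3, 3), (2, 7, 1)]
```

Because this stand-in model really does ignore the resampled inputs, the scrubbed output matches the original output on every datum; for a real model the match would only be approximate, and the size of the gap is what we measure.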

## 3.2 The causal scrubbing algorithm

We define an algorithm for evaluating hypotheses. This algorithm uses the intuition, illustrated in the previous section, of what activation replacements are permitted by a hypothesis.

The core idea is that hypotheses can be interpreted as an “intervention blacklist”. We like to think of this as the hypothesis sticking its neck out and challenging us to swap around activations in any way that it hasn’t specifically ruled out.

In a single sentence, the algorithm is: Whenever we need to compute an activation, we ask “What are all the other activations that, according to $h$, we could replace this activation with and still preserve the model’s behavior?”, and then make the replacement by choosing uniformly at random from that subset of the dataset, and do this recursively.

In this algorithm we don’t explicitly treeify G; but we traverse it one path at a time in a tree-like fashion.

We define the *scrubbed expectation*, $E_{\text{scrubbed}}(h, D)$, as the expectation of the behavior over samples from this algorithm.

### Intuitive Algorithm

*(This is mostly redundant with the pseudocode below. Read in your preferred order.)*

The algorithm is defined in pseudocode below. Intuitively we:

Sample a random reference input $x$ from $D$.

Traverse all paths through $I$ from output towards the input by calling `run_scrub` on nodes of $I$ recursively. For every node $n_I$ we consider the subgraph of $I$ that contains everything ‘upstream’ of $n_I$ (used to calculate its value from the input). Each of these corresponds to a subgraph of the image $c(I)$ in $G$.

The return value of `run_scrub(c, D, n_I, x)` is an activation from $G$. Specifically it is an activation for the corresponding node in $G$ that the **hypothesis claims represents the value of** $n_I$ when $I$ is run on input `x`.

Let $n_G = c(n_I)$.

If $n_G$ is an input node we will return `x`.

Otherwise we will determine the activations of each input from the parents of $n_G$. For each parent $p_G$ of $n_G$:

If there exists a parent $p_I$ of $n_I$ that corresponds to $p_G$ then the hypothesis claims that the value of $p_G$ is important for $n_G$. In particular it is important as it represents the value defined by $p_I$. Thus we sample a datum `new_x` that agrees with `x` on the value of $p_I$. We’ll **recursively call** `run_scrub` on $p_I$ in order to get an activation for $p_G$.

For any “unimportant parent” not mapped by the correspondence, we select an input `other_x`. This is a random input from the dataset, however we enforce that the *same* random input is used by all unimportant parents of a particular node.^{[12]} We record the value of $p_G$ on `other_x`.

We now have the activations of all the parents of $n_G$ – these are exactly the inputs to running the function defined for the node $n_G$. We return the output of this function.

### Pseudocode

```
import random
from statistics import mean

# Assumed to be provided elsewhere: NUM_SAMPLES (number of scrubbed runs to
# average), output_node_of(I) (the single output node of I), and a node
# interface with is_input(), parents(), value_on(x) and value_from_inputs(d).


def estim(h, D):
    """Estimate E_scrubbed(h, D)"""
    _G, I, c = h
    outs = []
    for _ in range(NUM_SAMPLES):
        x = random.choice(D)
        outs.append(run_scrub(c, D, output_node_of(I), x))
    return mean(outs)


def run_scrub(
    c,      # correspondence I -> G
    D,      # dataset, as a sequence of Datum
    n_I,    # node of I
    ref_x,  # reference Datum
):
    """Returns an activation of n_G which h claims represents n_I(ref_x)."""
    n_G = c(n_I)
    if n_G.is_input():
        return ref_x
    inputs_G = {}
    # pick a random datum to use for all “unimportant parents” of this node
    random_x = random.choice(D)
    # the parents of n_G that correspond to some parent of n_I
    important = {c(parent_I): parent_I for parent_I in n_I.parents()}
    # get the scrubbed activations of the inputs to n_G
    for parent_G in n_G.parents():
        if parent_G in important:
            # “important” parent
            parent_I = important[parent_G]
            # sample a new datum that agrees on the interpretation node
            new_x = sample_agreeing_x(D, parent_I, ref_x)
            # and get its scrubbed activations recursively
            inputs_G[parent_G] = run_scrub(c, D, parent_I, new_x)
        else:
            # “unimportant” parent: use its activation on the random datum chosen above
            inputs_G[parent_G] = parent_G.value_on(random_x)
    # now run n_G given the computed input activations
    return n_G.value_from_inputs(inputs_G)


def sample_agreeing_x(D, n_I, ref_x):
    """Returns a random element of D that agrees with ref_x on the value of n_I"""
    D_agree = [x for x in D if n_I.value_on(ref_x) == n_I.value_on(x)]
    return random.choice(D_agree)
```
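For readers who want to run the algorithm end to end, here is a self-contained toy version. Everything specific in it is an assumption made for illustration: the `Node` class, the four-node model, the dataset, and the behavior (whether the model's output matches the label). It compares a correct hypothesis against one that wrongly omits B.

```python
import random
from statistics import mean

class Node:
    """Graph node; `fn` maps the parents' values to this node's value."""
    def __init__(self, name, parents=(), fn=None):
        self.name, self.parents, self.fn = name, list(parents), fn

    def value_on(self, x):
        """Unscrubbed value of this node when the graph is run on datum x."""
        if not self.parents:  # input node: its value is the datum itself
            return x
        return self.fn(*(p.value_on(x) for p in self.parents))

def sample_agreeing_x(D, n_I, ref_x, rng):
    """Random datum that agrees with ref_x on the value of n_I."""
    target = n_I.value_on(ref_x)
    return rng.choice([x for x in D if n_I.value_on(x) == target])

def run_scrub(c, D, n_I, ref_x, rng):
    """Activation of c[n_I] that the hypothesis claims represents n_I(ref_x)."""
    n_G = c[n_I]
    if not n_G.parents:
        return ref_x
    important = {c[p]: p for p in n_I.parents}
    random_x = rng.choice(D)  # shared by all unimportant parents of n_G
    inputs = []
    for parent_G in n_G.parents:
        if parent_G in important:
            parent_I = important[parent_G]
            new_x = sample_agreeing_x(D, parent_I, ref_x, rng)
            inputs.append(run_scrub(c, D, parent_I, new_x, rng))
        else:
            inputs.append(parent_G.value_on(random_x))
    return n_G.fn(*inputs)

def estim(c, D, out_I, num_samples, rng):
    """Monte Carlo estimate of the scrubbed expectation."""
    return mean(run_scrub(c, D, out_I, rng.choice(D), rng) for _ in range(num_samples))

# Toy model G on data (x1, x2, x3, y): A, B, C read the input; D combines them.
g_in = Node("x")
g_A = Node("A", [g_in], lambda x: x[0] - 3)
g_B = Node("B", [g_in], lambda x: x[1] - 3)
g_C = Node("C", [g_in], lambda x: x[2] - 3)
g_D = Node("D", [g_A, g_B, g_C], lambda a, b, c_act: a > 0 or b > 0)
g_out = Node("correct", [g_D, g_in], lambda d, x: float(d == x[3]))

def make_hypothesis(use_B):
    """Interpretation I and correspondence c; use_B=False wrongly drops B."""
    i_in = Node("x'")
    i_A = Node("A'", [i_in], lambda x: x[0] > 3)
    i_B = Node("B'", [i_in], lambda x: x[1] > 3)
    if use_B:
        i_D = Node("D'", [i_A, i_B], lambda a, b: a or b)
    else:
        i_D = Node("D'", [i_A], lambda a: a)
    i_out = Node("correct'", [i_D, i_in], lambda d, x: float(d == x[3]))
    c = {i_in: g_in, i_A: g_A, i_D: g_D, i_out: g_out}
    if use_B:
        c[i_B] = g_B
    return c, i_out

vals = (1, 2, 5, 6)
D = [(a, b, k, a > 3 or b > 3) for a in vals for b in vals for k in vals]
rng = random.Random(0)
baseline = mean(g_out.value_on(x) for x in D)       # unscrubbed behavior
c_good, out_good = make_hypothesis(use_B=True)
c_bad, out_bad = make_hypothesis(use_B=False)
scrubbed_good = estim(c_good, D, out_good, 200, rng)
scrubbed_bad = estim(c_bad, D, out_bad, 200, rng)
```

In this sketch the correct hypothesis preserves the behavior exactly, while the hypothesis that treats B as unimportant loses performance, because B's activation is then taken from an unrelated input.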

# 4 Why ablate by resampling?

## 4.1 What does it mean to say “this thing doesn’t matter”?

Suppose a hypothesis claims that some module in the model isn’t important for a given behavior. There are a variety of different interventions that people do to test this. For example:

Zero ablation: setting the activations of that module to 0

Mean ablation: replacing the activations of that module with their empirical mean on D

Resampling ablation: patching in the activation of that module on a random different input

In order to decide between these, we should think about the precise claim we’re trying to test by ablating the module.

If the claim is “this module’s activations are literally unused”, then we could try replacing them with huge numbers or even NaN. But in actual cases, this would destroy the model behavior, and so this isn’t the claim we’re trying to test.

We think a better type of claim is: “The behavior might depend on various properties of the activations of this module, but those activations aren’t encoding any information that’s relevant to this subtask.” Phrased differently: The distribution of activations of this module is (maybe) important for the behavior. But we don’t depend on any properties of this distribution that are conditional on *which* particular input the model receives.

This is why, in our opinion, the most direct way to translate this hypothesis into an intervention experiment is to patch in the module’s activation on a randomly sampled different input–this distribution will have all the properties that the module’s activations usually have, but any connection between those properties and the correct prediction will have been scrubbed away.
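The difference between the three interventions can be seen on a made-up module whose activation is always either -1 or +1. The function names and the activation values below are illustrative assumptions:

```python
import random
from statistics import mean

def zero_ablation(activations, i, rng):
    """Replace the activation on input i with 0."""
    return 0.0

def mean_ablation(activations, i, rng):
    """Replace the activation on input i with the empirical mean over the dataset."""
    return mean(activations)

def resampling_ablation(activations, i, rng):
    """Replace the activation on input i with the activation on a random different input."""
    j = rng.choice([k for k in range(len(activations)) if k != i])
    return activations[j]

# Hypothetical module whose activation is always -1 or +1, depending on the input.
activations = [-1.0, 1.0, 1.0, -1.0, 1.0, -1.0]
rng = random.Random(0)
ablated = {
    ablation.__name__: [ablation(activations, i, rng) for i in range(len(activations))]
    for ablation in (zero_ablation, mean_ablation, resampling_ablation)
}
```

Zero and mean ablation both feed downstream components the value 0, which this module never produces, whereas every resampled activation is one the module actually outputs on some input.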

## 4.2 Problems with zero and mean ablation

Despite their prevalence in prior work, zero and mean ablations do not translate the claims we’d like to make faithfully.

As noted above, the claim we’re trying to evaluate is that the information in the output of this component doesn’t matter for our current model, not the claim that deleting the component would have no effect on behavior. We care about evaluating the claim as faithfully as possible on our current model and not replacing it with a slightly different model, which zero or mean ablation of a component does. This core problem can manifest in three ways:

*Zero and mean ablations take your model off distribution in an unprincipled manner.*

*Zero and mean ablations can have unpredictable effects on measured performance.*

*Zero and mean ablations remove variation and thus present an inaccurate view of what’s happening.*

For more detail on these specific issues, we refer readers to the appendix post.

# 5 Results

To show the value of this approach, we apply the causal scrubbing algorithm to two tasks: 1) verifying hypotheses about an algorithmic model we found previously through ad hoc interpretability, and 2) testing and incrementally improving hypotheses about how induction heads work in a 2-layer attention-only model. We summarize the results of those applications here to illustrate how causal scrubbing can be used; detailed results can be found in the respective auxiliary posts.

## 5.1 On a paren balance checker

We apply the causal scrubbing algorithm to a small transformer which classifies sequences of parentheses as balanced or unbalanced; see the results post for more information. In particular, we test three claims about the mechanisms this model uses.

**Claim 1:** There are three heads that directly pass important information to output:^{[13]}

Heads 1.0 and 2.0 test the conjunction of two checks: that there are an equal number of open and close parentheses in the entire sequence, and that the sequence starts open.

Head 2.1 checks that the nesting depth is never negative at any point in the sequence.

Claim 1 is represented by the following hypothesis:^{[14]}

**Claim 2:** Heads 1.0 and 2.0 depend only on their input at position 1, and this input indirectly depends on:

The output of 0.0 at position 1, which computes the overall proportion of parentheses which are open. This is written into a particular direction of the residual stream in a linear fashion.

The embedding at position 1, which indicates if the sequence starts with `(`.

**Claim 3:** Head 2.1 depends on the input at all positions, and on whether the nesting depth (when reading right to left!) is negative at that position.^{[15]}

Here is a visual representation of the combination of all three claims:

Testing these claims with causal scrubbing, we find that they are reasonably, but not completely, accurate:

| Claim(s) tested | Performance recovered^{[16]} |
| --- | --- |
| 1 | 93% |
| 1 + 2 | 88% |
| 1 + 3 | 84% |
| 1 + 2 + 3 | 72% |

As expected, performance drops as we are more specific about how exactly the high level features are computed. This is because as the hypotheses get more specific, they induce more activation replacements, often stacked several layers deep.^{[17]}

This indicates our hypothesis is subtly incorrect in several ways, either by missing pathways along which information travels or imperfectly identifying the features that the model uses in practice.

We explain these results in more detail in this appendix post.

## 5.2 On induction

We investigated ‘induction’ heads in a 2-layer attention-only model. We were able to easily test out and incrementally improve hypotheses about which computations in the model were important for the behavior of the heads.

We first tested a naive induction hypothesis, which separates out the input to an induction head in layer 1 into three separate paths (the value, the key, and the query) and specifies where the important information in each path comes from. We hypothesized that both the values and queries are formed based on only the input directly from the token embeddings via the residual stream and have no dependence on attention layer 0. The keys, however, are produced only by the input from attention layer 0; in particular, they depend on the part of the output of attention layer 0 that corresponds to attention on the previous token position.^{[18]}

We test these hypotheses on a subset of openwebtext where induction is likely (but not guaranteed) to be helpful.^{[19]} Evaluated on this dataset, this naive hypothesis only recovers 35% of the performance. In order to improve this we made various edits which allow the information to flow through additional pathways:

First, we allow the attention pattern of the induction head to compare a set of three consecutive tokens (instead of just a single token) to determine when to induct.

Next, we allow the query and value to also depend on the part of the output of layer 0 that corresponds to the current position.

We also special case three layer 0 heads which attend to repeated occurrences of the current token. In particular, we assume that the important part of the output of these heads is what their output would be *if* their attention was just an identity matrix.^{[20]}

With these adjustments, our hypothesis recovers 86% of the performance.

We believe it would have been significantly harder to develop and have confidence in a hypothesis this precise only using ad-hoc methods to verify the correctness of a hypothesis.

We explain these results in more detail in this appendix post.

# 6 Relevance to alignment

The most obvious application of causal scrubbing to alignment is using it to evaluate mechanistic interpretations. In particular, we can imagine several specific use cases that are relevant to alignment:

*Checking interpretations of model behaviors produced by human researchers.* Having a standardized, reliable, and convenient set of tests would make it much easier to scale up mechanistic interpretability efforts; this might be particularly important if there are big interpretability projects right before the deployment of transformative AI.

*Automated algorithmic searches for explanations.* In some cases, researchers might be able to specify a space of hypotheses and then use optimization algorithms to find the most predictive ones. We’ve done some work like this and we hope to do much more in the future.

*AI-assisted explanations.* We might be able to train models to produce highly rated and human-understandable explanations.

In all three applications, we require that researchers understand the explanations that are verified by causal scrubbing. Unfortunately, it might be the case that the behaviors we want to interpret in large neural networks won’t have *any* understandable interpretations at all if most of the cognition performed inside powerful AI systems is in some sense irreducibly complex. It also seems plausible that even if these human-understandable interpretations exist, it might be intractable or impractical to find them.

A lot of our interest in causal scrubbing (and mechanistic interpretability more generally) comes from applications which require interpretability-like techniques which rely on formally manipulating explanation-like objects but *don’t* require that these objects be understood by anyone (human or AI):

*Automated strategies for solving ELK.* ARC is optimistic about some strategies for solving ELK that involve searching for objects similar to causal scrubbing explanations and then using properties of these explanations as part of the training procedure of the model, in ways that don’t require humans to understand the explanations.

*Detecting deceptive alignment.* Suppose you have a weak trusted model and a strong untrusted model. You might be able to search for explanations of why these models take similar actions which allow you to distinguish whether the untrusted model is deceptively aligned just based on the structure of the explanation, rather than via having to understand its content.

*Relaxed adversarial training* requires some way of adjudicating arguments about whether the internals of models imply they’ll behave badly in ways that are hard to find with random sampling (because the failures only occur off the training distribution, or they’re very rare). This doesn’t require that any human is able to understand these arguments; it just requires we have a mechanical argument evaluation procedure. Improved versions of the causal scrubbing algorithm might be able to fill this gap.

# 7 Limitations

Unfortunately, causal scrubbing may not be able to express all the tests of interpretability hypotheses we might want to express:

Causal scrubbing only allows activation replacements that are *perfectly permissible* by the hypothesis: that is, the respective inputs have an exactly equal value in the correspondence.

Despite being maximally strict in what replacements to allow, we are in practice willing to accept hypotheses that fail to perfectly preserve performance. We think this is an inconsistency in our current approach.

As a concrete example, if you think a component of your model encodes a continuous feature, you might want to test this by replacing the activation of this component with the activation on an input that is *approximately* equal on this feature; causal scrubbing will refuse to do this swap.

You can solve this problem by considering a generalized form of causal scrubbing, where hypotheses specify a non-uniform distribution over swaps. We’ve worked with this “generalized causal scrubbing” algorithm a bit. The space of hypotheses is continuous, which is nice for a lot of reasons (e.g. you can search over the hypothesis space with SGD). However, there are a variety of conceptual problems that still need to be resolved (e.g. there are a few different options for defining the union of two hypotheses, and it’s not obvious which is most principled).
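To give a feel for what "a non-uniform distribution over swaps" could look like, here is one simple possibility: weight each candidate input by how close its feature value is to the reference input's. This is only an illustrative assumption (the Gaussian kernel, the `bandwidth` parameter, and both function names are hypothetical), not the generalized causal scrubbing algorithm itself:

```python
import math
import random

def swap_weights(D, feature, ref_x, bandwidth):
    """Normalized weights over D, larger for data whose feature is close to ref_x's."""
    target = feature(ref_x)
    raw = [math.exp(-((feature(x) - target) ** 2) / (2 * bandwidth ** 2)) for x in D]
    total = sum(raw)
    return [w / total for w in raw]

def sample_soft_agreeing_x(D, feature, ref_x, bandwidth, rng):
    """Sample a swap partner, preferring inputs that approximately agree with ref_x."""
    weights = swap_weights(D, feature, ref_x, bandwidth)
    return rng.choices(D, weights=weights, k=1)[0]

# Hypothetical continuous feature values; the feature is the datum itself.
D = [0.0, 0.1, 0.5, 2.0]
weights = swap_weights(D, lambda x: x, 0.0, bandwidth=0.2)
```

As the bandwidth shrinks, the weights concentrate on inputs that agree exactly, which recovers the strict swaps used by ordinary causal scrubbing.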

Causal scrubbing can only propose tests that can be constructed using the data provided to it. If your hypothesis predicts that model performance will be preserved if you swap the input to any other input which has a particular property, but no other inputs in the dataset have that property, causal scrubbing can’t test your hypothesis. This happens in practice–there is probably only one sequence in webtext with a particular first name at token positions 12, 45, and 317, and a particular last name at 13, 46, 234.

This problem is addressed if you are able to produce samples that match properties by some mechanism other than rejection sampling.

Causal scrubbing doesn’t allow us to distinguish between two features that are perfectly correlated on our dataset, since they would induce the same equivalence classes. In fact, to the extent that two features A and B are highly correlated, causal scrubbing will not complain if you misidentify an A-detector as a B-detector.^{[21]}
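A tiny sketch of why perfectly correlated features are indistinguishable: on the hypothetical dataset below, two different features split the data into exactly the same classes, so they license exactly the same swaps.

```python
def partition(D, feature):
    """The set of equivalence classes that a feature induces on D."""
    classes = {}
    for x in D:
        classes.setdefault(feature(x), set()).add(x)
    return {frozenset(members) for members in classes.values()}

# Hypothetical dataset in which the two components always move together.
D = [(1, 10), (2, 20), (5, 50), (6, 60)]
feature_a = lambda x: x[0] > 3    # "first component is large"
feature_b = lambda x: x[1] > 30   # "second component is large"
same_swaps = partition(D, feature_a) == partition(D, feature_b)
```

Only a dataset containing inputs where the two features disagree could tell the two hypotheses apart.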

Another limitation is that causal scrubbing does not guarantee that it will reject a hypothesis that is importantly false or incomplete. Here are two concrete cases where this happens:

- When a model uses some heuristic that isn’t *always* applicable, it might use other circuits to inhibit the heuristic (for example, the negative name mover heads in the Indirect Object Identification paper). However, these inhibitory circuits are purely harmful for inputs where the heuristic *is* applicable. In these cases, if you ignore the inhibitory circuits, you might overestimate the contribution of the heuristic to performance, leading you to falsely believe that your incomplete interpretation fully explains the behavior (and therefore fail to notice other components of the network that contribute to performance).
- If two terms are correlated, sampling them independently (by two different random activation swaps) reduces the variance of the sum. Sometimes, this variance can be harmful for model performance, for instance if it represents interference from polysemanticity. This can cause a hypothesis that scrubs out correlations present in the model’s activations to appear ‘more accurate’ under causal scrubbing.^{[22]}

These examples are both due to the hypotheses not being specific *enough* and neglecting to include some correlation in the model (either between input-feature and activation or between two activations) that would hurt the performance of the scrubbed model.
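The variance claim in the second case can be checked with a short calculation. As a sketch, write the two terms as random variables $X$ and $Y$ (notation introduced here, not in the post), and let $X'$ and $Y'$ be independent draws with the same marginal distributions, as produced by two separate activation swaps:

```latex
\begin{aligned}
\operatorname{Var}(X + Y)   &= \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y), \\
\operatorname{Var}(X' + Y') &= \operatorname{Var}(X) + \operatorname{Var}(Y).
\end{aligned}
```

The difference is $2\operatorname{Cov}(X, Y)$, so independent resampling lowers the variance of the sum exactly when the two terms are positively correlated in the original model.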

We don’t think that this is a problem with causal scrubbing in particular. Rather, it arises because interpretability explanations should be regarded as an example of defeasible reasoning, where it is possible for an argument to be overturned by further arguments.

We think these problems are fairly likely to be solvable using an adversarial process where hypotheses are tested by allowing an adversary to modify the hypothesis to make it more specific in whatever ways affect the scrubbed behavior the most. Intuitively, this adversarial process requires that proposed hypotheses “point out all the mechanisms that are going on that matter for the behavior”, because if the proposed hypothesis doesn’t point something important out, the adversary can point it out. More details on this approach are included in the appendix post.

Despite these limitations, we are still excited about causal scrubbing. We’ve been able to directly apply it to understanding the behaviors of simple models and are optimistic about it being scalable to larger models and more complex behaviors (insofar as mechanistic interpretability can be applied to such problems at all). We currently expect causal scrubbing to be a big part of the methodology we use when doing mechanistic interpretability work in the future.

# Acknowledgements

*This work was done by the Redwood Research interpretability team. We’re especially thankful for Tao Lin for writing the software that we used for this research and for Kshitij Sachan for contributing to early versions of causal scrubbing. Causal scrubbing was strongly inspired by Kevin Wang, Arthur Conmy, and Alexandre Variengien’s **work on how GPT-2 Implements Indirect Object Identification**. We’d also like to thank Paul Christiano and Mark Xu for their insights on heuristic arguments on neural networks. Finally, thanks to Ben Toner, Oliver Habryka, Ajeya Cotra, Vladimir Mikulik, Tristan Hume, Jacob Steinhardt, Neel Nanda, Stephen Casper, and many others for their feedback on this work and prior drafts of this sequence.*

## Citation

Please cite as:

`Chan, et al., "Causal Scrubbing: a method for rigorously testing interpretability hypotheses", AI Alignment Forum, 2022. `

BibTeX Citation:

```
@article{chan2022causal,
title={Causal scrubbing, a method for rigorously testing interpretability hypotheses},
author={Chan, Lawrence and Garriga-Alonso, Adrià and Goldowsky-Dill, Nicholas and Greenblatt, Ryan and Nitishinskaya, Jenny and Radhakrishnan, Ansh and Shlegeris, Buck and Thomas, Nate},
year={2022},
journal={AI Alignment Forum},
note={\url{https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing}}
}
```

- ^
For example, in the causal tracing paper (Meng et al 2022), to evaluate whether their hypothesis correctly identified the location of facts in GPT-2, the authors replaced the activation of the involved neurons and observed that the model behaved as though it believed the edited fact, and not the original fact. In the Induction Heads paper (Olsson et al 2022) the authors provide six different lines of evidence, from macroscopic co-occurrence to mechanistic plausibility.

- ^
Causal scrubbing is technically formulated in terms of general computational graphs, but we’re primarily interested in using causal scrubbing on computational graphs that implement neural networks.

- ^
See the discussion in the “An alternative formalism: constructing a distribution on treeified inputs” section of the appendix post.

- ^
Most commonly, the behavior we attempt to explain is why a model achieves low loss on a particular set of examples, and thus we measure the loss directly. However, the method can explain any expected quality of the model’s output.

- ^
We expect the results posts will be especially useful for people who wish to apply causal scrubbing in their own research.

- ^
Note that we can use causal scrubbing to ablate a particular module, by using a hypothesis where that specific module’s outputs do not matter for the model’s performance.

- ^
A computational graph is a graph where the nodes represent computations and the edges specify the inputs to the computations.

- ^
In the normal sense of the word, not ARC’s Heuristic Arguments approach.

- ^
Since c is required to be an injective graph homomorphism, it immediately follows that the image of c is a subgraph of G which is isomorphic to I. This subgraph will be a union of paths from the input to the output.

- ^
In the appendix we’ll discuss that it is possible to modify the correspondence to include these unimportant nodes, and that doing so removes some ambiguity on when to sample unimportant nodes together or separately.

- ^
We have no guarantee, however, that any hypothesis that passes the causal scrubbing test is desirable. See more discussion of counterexamples in the limitations section.

- ^
This is because otherwise our algorithm would crucially depend on the exact representation of the causal graph: e.g. if the output of a particular attention layer was represented as a single input or if there was one input per attention head instead. There are several other approaches that can be taken to addressing this ambiguity; see the appendix.

- ^
That is, we consider the contribution of these heads through the residual stream into the final layer norm, excluding influence they may have through intermediate layers.

- ^
Note that as part of this hypothesis we have aggressively simplified the original model into a computational graph with only 5 separate computations. In particular, we relied on the fact that the residual stream just before the classifier head can be written as a sum of terms, including a term for each attention head (see “Attention Heads are Independent and Additive” section of Anthropic’s “Mathematical Framework for Transformer Circuits” paper). Since we claim only three of these terms are important, we clump all other terms together into one node. Additionally note this means that the ‘Head 2.0’ node in G includes *all* of the computations from layers 0 and 1, as these are required to compute the output of head 2.0 from the input.

- ^
The claim we test is somewhat more subtle, involving a weighted average between the proportion of the open-parentheses in the prefix and suffix of the string when split at every position. This is equivalent for the final computation of balancedness, but more closely matches the model’s internal computation.

- ^
As measured by normalizing the loss so 100% is loss of the normal model (0.0003) and 0% is the loss when randomly permuting the labels. For the reasoning behind this metric see the appendix.

- ^
Our final hypothesis combines up to 51 different inputs: 4 inputs feeding into each of 1.0 and 2.0, 42 feeding into 2.1 (one for each sequence position), and 1 for the ‘other terms’.

- ^
The output of an attention layer can be written as a sum of terms, one for each previous sequence position. We can thus claim that only one of these terms is important for forming the queries.

- ^
In particular we create a whitelist of tokens on which exact 2-token induction is often a helpful heuristic (over and above bigram-heuristics). We then filter openwebtext (prompt, next-token) pairs for prompts that end in tokens on our whitelist. We evaluate loss on the actual next token from the dataset, however, which may not be what induction expects. More details here.

We do this as we want to understand not just how our model implements induction but also how it decides *when* to use induction.

- ^
And thus the residual of (actual output - estimated output) is unimportant and can be interchanged with the residual on any other input.

- ^
This is a common way for interpretability hypotheses to be ‘partially correct.’ Depending on the type of reliability needed, this can be more or less problematic.

- ^
Another real-world example of this is this experiment on the paren balance checker.


(I’m just going to speak for myself here, rather than the other authors, because I don’t want to put words in anyone else’s mouth. But many of the ideas I describe in this review are due to other people.)

I think this work was a solid intellectual contribution. I think that the metric proposed for how much you’ve explained a behavior is the most reasonable metric by a pretty large margin.

The core contribution of this paper was to produce negative results about interpretability. This led to us abandoning work on interpretability a few months later, which I’m glad we did. But these negative results haven’t had that much influence on other people’s work AFAICT, so overall it seems somewhat low impact.

The empirical results in this paper demonstrated that induction heads are not the simple circuit which many people claimed (see this post for a clearer statement of that), and we then used these techniques to get mediocre results for IOI (described in this comment).

There hasn’t been much followup on this work. I suspect that the main reasons people haven’t built on this are:

- it’s moderately annoying to implement it
- it makes your explanations look bad (IMO because they actually are unimpressive), so you aren’t that incentivized to get it working
- the interp research community isn’t very focused on validating whether its explanations are faithful, and in any case we didn’t successfully persuade many people that explanations performing poorly according to this metric means they’re importantly unfaithful

I think that interpretability research isn’t going to be able to produce explanations that are very faithful explanations of what’s going on in non-toy models (e.g. I think that no such explanation has ever been produced). Since I think faithful explanations are infeasible, measures of faithfulness of explanations don’t seem very important to me now.

(I think that people who want to do research that uses model internals should evaluate their techniques by measuring performance on downstream tasks (e.g. weak-to-strong generalization and measurement tampering detection) instead of trying to use faithfulness metrics.)

I wish we’d never bothered with trying to produce faithful explanations (or researching interpretability at all). But causal scrubbing was important in convincing us to stop working on this, so I’m glad for that.

See the dialogue between Ryan Greenblatt, Neel Nanda, and me for more discussion of all this.

—

Another reflection question: did we really have to invent this whole recursive algorithm? Could we have just done something simpler?

My guess is that no, we couldn’t have done something simpler: the core contribution of CaSc is to give you a single number for the whole explanation, and I don’t see how to get that number without doing something like our approach where you apply every intervention at the same time.

I agree with the overall point (that this was a solid intellectual contribution and is a reasonable-ish metric), but there’s been a non-zero amount of followups or at least use cases of this work, imo. Off the top of my head:

- In general, CaSc has been used on lots of toy/tiny models to a decent level of success. I agree that part of the reason for CaSc’s lack of adoption is that the metric consistently returns “this explanation is not very faithful/complete/etc”. For example:
  - I checked the hypotheses for the toy modular arithmetic/group composition work with my own hand-crafted CaSc implementation and found that the modular arithmetic results held up quite well.
  - CaSc-style tests were used by Marius and Stefan to confirm their solutions to Stephen Casper’s Mech Interp challenges (challenge 1, challenge 2).
  - etc.
- Erik Jenner’s agenda is pretty closely related to causal scrubbing and is still actively being worked on.

Thanks for the links! I agree that the usecases are non-zero.

By “explanations” you mean labeled high-level causal graphs right? Do you also think it’s infeasible to identify sparse, unlabeled circuits as “the part of the model that’s doing the task”, like in ACDC, in a way that gets good performance on some downstream task?

By explanations, I think Buck means fully human understandable explanations.

Personally, I don’t have a strong opinion and this will probably depend on the exact architecture and the extent of sparsity we demand. This seems related to other views I have on difficulties in interp (ETA: so I’m probably more pessimistic here than people who are more optimistic about interp), but at least partially orthogonal.