This is the fourth entry in the “Which Circuit is it?” series. We will explore the notions of counterfactual faithfulness. This project is done in collaboration with Groundless.

Last time, we opened the black box and saw how interventions can help us distinguish which subcircuit is a better explanation. This time, we will look at counterfactuals and see if we get more definite answers.

When we first considered interventions, we used them to measure the alignment between the target and proxy explanations. We leveraged the fact that the proxy (subcircuit) is a subgraph of the target (full model) to align interventions. This time we will exploit that fact again, but in a different way.

We will treat the entire subcircuit as a single component and the remainder of the full model as the environment. Then, we will do a sort of causal identification, akin to activation patching. We will try to determine if the subcircuit is the sole cause of the behavior of interest. We will do so by asking:

If everything in the environment were different except for a single component, would the behavior of interest remain? (recovery)
If the environment were the same but the single component were different, would the behavior of interest disappear? (disruption)

We will see that this type of analysis lets us differentiate the top subcircuits more clearly and reflect more deeply on explanations.

Counterfactuals

Counterfactuals ask: what would have happened if things had gone differently?
They require us to reason about worlds we did not observe. Sometimes, this requires imagination^[1]. Sometimes, even unruly vision.

Let’s start with the real world, which we call our clean sample:

Then, we consider a possible world, which we call our corrupted sample:

In each world, we isolate the single component from the environment:

We will surgically transplant components from one world into the other.

WORLD = COMPONENT + ENVIRONMENT

Our analysis has two versions, depending on the focus of our causal question:

In-Circuit: We will ask about the causal effect of the component

Out-of-Circuit: We will ask about the causal effect of the environment

We also have two different directions:

Denoising: We will patch the clean component into the corrupted environment.

Noising: We will patch the corrupted component into the clean environment.
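To make the transplant concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the three-unit toy layer, its weights, the choice of units 0 and 1 as the component, and the helper names (`hidden_acts`, `readout`, `transplant`) are assumptions for illustration, not the model from this series. It builds a clean world and a corrupted world, then swaps components between them in both directions.

```python
def relu(v):
    return max(0.0, v)

# Hypothetical toy layer: three hidden units read two binary inputs.
# Each entry is (input weights, bias).
HIDDEN = [((1.0, 1.0), 0.0), ((1.0, 1.0), -1.0), ((0.5, -0.5), 0.0)]
READOUT = (1.0, -2.0, 0.1)
CIRCUIT = {0, 1}  # assumed component; unit 2 plays the environment

def hidden_acts(x):
    """Hidden activations of one world (one input)."""
    return [relu(sum(w * xi for w, xi in zip(ws, x)) + b) for ws, b in HIDDEN]

def readout(acts):
    """Linear readout of the hidden activations."""
    return sum(w * a for w, a in zip(READOUT, acts))

def transplant(component_from, environment_from, circuit=CIRCUIT):
    """WORLD = COMPONENT + ENVIRONMENT: take each unit from one of two worlds."""
    return [
        component_from[i] if i in circuit else environment_from[i]
        for i in range(len(component_from))
    ]

clean = hidden_acts((1, 0))      # the real world
corrupted = hidden_acts((1, 1))  # a possible world

# Denoising: clean component inside the corrupted environment.
denoised = transplant(component_from=clean, environment_from=corrupted)
# Noising: corrupted component inside the clean environment.
noised = transplant(component_from=corrupted, environment_from=clean)

print("clean:", readout(clean), "corrupted:", readout(corrupted))
print("denoised:", readout(denoised), "noised:", readout(noised))
```

In this toy, denoising lifts the corrupted output from 0.0 to 1.0, close to the clean output of 1.05, while noising drops the clean output to 0.05.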

In-Circuit

Let’s ask questions about the component’s causal effect.

Sufficiency

If everything in the environment were different except for a single component, would the behavior of interest remain?

We can think of this surgical modification as denoising the corrupted sample with the clean component and checking whether we recover the clean sample’s signal.

Necessity

If the environment were the same but the single component were different, would the behavior of interest disappear?

Out-Of-Circuit

Let’s ask questions about the causal effect of the environment.

Completeness

If everything in the environment were different except for a single component, would the behavior of interest disappear?

Independence

If the environment were the same but the single component were different, would the behavior of interest remain?

Four Perspectives

We measure two things:
Recovery: Does the patched circuit produce the clean output?
Disruption: Does the patched circuit fail to produce the clean output?
The four scores are combinations of these, depending on whether we patch in-circuit or out-of-circuit, and whether we denoise or noise.
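Here is one way the four scores could be computed, sketched in Python on a hypothetical two-layer toy model. The weights, the node mask `CIRCUIT`, the threshold in `label`, and the binary scoring over clean/corrupted pairs are all assumptions made for illustration; they are not the setup or the metric used in our experiments. We also assume one particular reading of the out-of-circuit scores: completeness is denoising the environment and measuring disruption, and independence is noising the environment and measuring recovery.

```python
from itertools import permutations

def relu(v):
    return max(0.0, v)

# Hypothetical toy model; each unit is (incoming weights, bias).
LAYERS = [
    [((1.0, 1.0), 0.0), ((1.0, 1.0), -1.0), ((1.0, -1.0), 0.0)],
    [((1.0, -2.0, 1.0), 0.0), ((0.0, 0.0, 1.0), 0.0)],
]
READOUT = (1.0, 0.4)
NODES = {(l, i) for l, layer in enumerate(LAYERS) for i in range(len(layer))}
CIRCUIT = {(0, 0), (0, 1), (1, 0)}  # assumed node mask (the component)
ENVIRONMENT = NODES - CIRCUIT

def forward(x, patch=None):
    """Run the model; nodes listed in `patch` are overwritten with stored values."""
    patch = patch or {}
    acts, prev = {}, list(x)
    for l, layer in enumerate(LAYERS):
        cur = []
        for i, (ws, b) in enumerate(layer):
            v = relu(sum(w * p for w, p in zip(ws, prev)) + b)
            v = patch.get((l, i), v)
            acts[(l, i)] = v
            cur.append(v)
        prev = cur
    return sum(w * p for w, p in zip(READOUT, prev)), acts

def label(y):
    return y > 0.5  # assumed decision threshold for the behavior of interest

def scores(inputs):
    tallies = {k: [] for k in ("sufficiency", "necessity", "completeness", "independence")}
    for clean, corrupted in permutations(inputs, 2):
        y_clean, a_clean = forward(clean)
        y_corr, a_corr = forward(corrupted)
        if label(y_clean) == label(y_corr):
            continue  # only pairs where the behavior actually differs

        def recovered(base, donor, nodes):
            y, _ = forward(base, {n: donor[n] for n in nodes})
            return label(y) == label(y_clean)

        tallies["sufficiency"].append(recovered(corrupted, a_clean, CIRCUIT))
        tallies["necessity"].append(not recovered(clean, a_corr, CIRCUIT))
        tallies["completeness"].append(not recovered(corrupted, a_clean, ENVIRONMENT))
        tallies["independence"].append(recovered(clean, a_corr, ENVIRONMENT))
    return {k: sum(v) / len(v) for k, v in tallies.items()}

FOUR_INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
result = scores(FOUR_INPUTS)
print(result)
```

Under these assumptions, the toy component scores 1.0 on sufficiency and necessity but only 0.75 on completeness and independence, because the environment unit in the first layer feeds into the component's second-layer unit.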

We get four different scores that characterize the causal effect of the component:

Let’s calculate these scores for our toy experiment.

Experiments

As a reminder, these are the subcircuits we were analyzing last time.

We calculate the scores on the four ideal inputs.

Sufficiency

Necessity

Sufficiency and necessity give us an isolated view of the component.

Completeness

Independence

Clear winner?

Subcircuit #34 scores the highest.

Counterfactual faithfulness has helped us sort out the top subcircuits!

It is important to note that the main differentiator was the causal effect of the environment.

But we are not done yet.

In our second entry, we deferred a question:

“To simplify our initial analysis, we will identify subcircuits by their node masks and, among the many possible edge variants for each mask, consider only the most complete one. Once we identify a node mask that clearly outperforms the others, we will then examine its edge variants in detail.”

Let’s look at the top edge variants for subcircuit #34 (node mask #34):

We were able to single out a best node mask for the full model, but we cannot differentiate among the edge variants of that same node mask!
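To get a feel for how many edge variants hide behind a single node mask, here is a small Python sketch. The graph, the node names, and the definition used (an edge variant is any non-empty subset of the full model's edges whose endpoints both survive the mask) are assumptions for illustration; the definition used in this series may impose extra constraints, such as connectivity.

```python
from itertools import combinations

# Hypothetical full-model edges between named nodes.
FULL_EDGES = [
    ("x1", "a0"), ("x2", "a0"), ("x1", "a1"), ("x2", "a1"),
    ("x1", "a2"), ("x2", "a2"),
    ("a0", "b0"), ("a1", "b0"), ("a2", "b0"), ("a2", "b1"),
    ("b0", "y"), ("b1", "y"),
]
NODE_MASK = {"x1", "x2", "a0", "a1", "b0", "y"}  # assumed node mask

def edges_within(mask, edges):
    """Edges of the full model whose endpoints are both kept by the mask."""
    return [(src, dst) for src, dst in edges if src in mask and dst in mask]

def edge_variants(mask, edges):
    """Yield every non-empty edge subset for the mask, most complete first."""
    kept = edges_within(mask, edges)
    for size in range(len(kept), 0, -1):
        for subset in combinations(kept, size):
            yield subset

variants = list(edge_variants(NODE_MASK, FULL_EDGES))
most_complete = variants[0]
print(len(most_complete), "edges in the most complete variant")
print(len(variants), "edge variants for this node mask")
```

Even this tiny mask keeps 7 edges and therefore admits 127 variants, which is why we started with only the most complete one.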

It seems there is more work for us.

Counterfactuals got us further than observation or intervention alone. But they also revealed a new layer of non-identifiability: the edges. And the tools we’ve been using so far all operate in activation space. To go further, we may need a different paradigm entirely: parameter-space interpretability.

Paradigms are often deeply ingrained in us and very hard to change.
Their failures often go silent. We forget they are not indefeasible.
Our account of causality uses particular frameworks, which we shall examine more closely.

Paradigm as Substrate

There are many frameworks for causality. Structural Causal Models (SCMs) built on Directed Acyclic Graphs (DAGs)^[2] are the most common in circuit analysis, but they have blind spots^[3] that matter for neural networks:

Determinism breaks d-separation. Every neuron is a deterministic function of the layer below, so no faithful DAG exists over network activations. Factored Space Models^[4] handle this by defining structural independence on product spaces.
SCMs require you to pick variables first, but in LLMs, the meaningful causal units are unknown and distributed. Causal spaces^[5] let you define causal effects on events and subspaces without committing to a variable decomposition.
Interchange interventions compare two computation traces, not one. Pearl’s do-operator operates on a single world. Counterfactual spaces^[6] formalize multi-world comparisons directly.
Cycles, continuous-time dynamics, and latent accumulation in residual streams don’t fit the acyclic, discrete structure assumed by SCMs. Causal spaces^[7] handle these natively.
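The first blind spot can be checked on a three-variable example. The Python sketch below is a hypothetical illustration, not part of our experiments: X feeds a "neuron" Y and also directly feeds a noisy variable Z, so the DAG is X → Y, X → Z. Because X and Z are adjacent, d-separation does not entail that Z is independent of X given Y. The variable names, noise levels, and the helpers `joint` and `cond_independent` are assumptions.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def joint(y_noise):
    """Joint distribution over (X, Y, Z) with Y = X xor M and Z = X xor N."""
    p = defaultdict(Fraction)
    for x, n, m in product((0, 1), repeat=3):
        p_n = Fraction(1, 4) if n else Fraction(3, 4)
        p_m = y_noise if m else 1 - y_noise
        y = x ^ m  # with y_noise == 0, Y is a deterministic function of X
        z = x ^ n  # Z depends directly on X, plus independent noise
        p[(x, y, z)] += Fraction(1, 2) * p_n * p_m
    return p

def cond_independent(p, a, b, given):
    """Exact check that variable a is independent of variable b given `given`."""
    def marginal(idxs):
        m = defaultdict(Fraction)
        for outcome, weight in p.items():
            m[tuple(outcome[i] for i in idxs)] += weight
        return m

    p_abc, p_c = marginal((a, b, given)), marginal((given,))
    p_ac, p_bc = marginal((a, given)), marginal((b, given))
    return all(
        p_abc[(va, vb, vc)] * p_c[(vc,)] == p_ac[(va, vc)] * p_bc[(vb, vc)]
        for va, vb, vc in product((0, 1), repeat=3)
    )

X, Y, Z = 0, 1, 2
deterministic = cond_independent(joint(Fraction(0)), X, Z, Y)
noisy = cond_independent(joint(Fraction(1, 4)), X, Z, Y)
print("Z independent of X given Y, deterministic Y:", deterministic)
print("Z independent of X given Y, noisy Y:", noisy)
```

With a noisy Y the independence fails, as the graph suggests. With a deterministic Y it holds even though the graph does not imply it, which is exactly the kind of mismatch that makes the distribution unfaithful to the DAG.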

Each of these frameworks makes different assumptions about what remains stable during our analysis. Each is a different substrate for causal reasoning. Just as the evaluation domain was a substrate in our observational analysis, and the circuit boundary was a substrate in our interventional analysis, the causal framework itself is a substrate.

Our counterfactual conclusions are sensitive to the assumptions of the causal framework we select.

Let’s recap what we’ve established:

Counterfactual faithfulness provides stronger discriminative power than observation or intervention alone.
The main differentiator was the causal effect of the environment (completeness and independence), not the component in isolation (sufficiency and necessity).
Subcircuit #34 is the clear winner at the node-mask level.
But edge variants within the same node mask remain indistinguishable. Activation-space methods could have a ceiling here.
The causal framework itself is a substrate.

Next time, we move from activation space to parameter space.

^[1] For instance, in The Intimacies of Four Continents, Lowe asks the reader to use counterfactuals to refuse the narrative that the way things went was the only way they could have gone, and to use imagination to reckon with the possibilities lost in history.
^[2] An Introduction to Causal Inference. Pearl (2010).
^[3] Also, counterfactuals are tricky to reason with because several inference patterns that seem perfectly logical can break down. See: Counterfactual Fallacies in Causality. Stanford Encyclopedia of Philosophy (2026).
^[4] Factored Space Models. Garrabrant et al. (2024).
^[5] A fine-grained look at causal effects in causal spaces. Park et al. (2025b).
^[6] Counterfactual spaces. Park et al. (2026).
^[7] A Measure-Theoretic Axiomatisation of Causality. Park et al. (2025a).
