Comparing Across Possible Worlds
This is the fourth entry in the “Which Circuit is it?” series. We will explore the notion of counterfactual faithfulness. This project is done in collaboration with Groundless.
Last time, we opened the black box and saw how interventions can help us distinguish which subcircuit is a better explanation. This time, we will look at counterfactuals and see if we get more definite answers.
When we first considered interventions, we used them to measure the alignment between the target and proxy explanations. We leveraged the fact that the proxy (subcircuit) is a subgraph of the target (full model) to align interventions. This time we will exploit that fact again, but in a different way.
We will treat the entire subcircuit as a single component and the remainder of the full model as the environment. Then, we will do a sort of causal identification, akin to activation patching. We will try to determine if the subcircuit is the sole cause of the behavior of interest. We will do so by asking:
If everything in the environment were different except for a single component, would the behavior of interest remain? (recovery)
If the environment were the same but the single component were different, would the behavior of interest disappear? (disruption)
We will see that this type of analysis lets us differentiate the top subcircuits more clearly and reflect more deeply on explanations.
Counterfactuals
Counterfactuals ask: what would have happened if things had gone differently? They require us to reason about worlds we did not observe. Sometimes, this requires imagination[1]. Sometimes, even unruly vision.
Let’s start with the real world, what we call our clean sample:
The clean sample
Then, we consider a possible world, which we call our corrupted sample:
In each world, we isolate the single component from the environment:
We will surgically transplant components from one world into the other.
WORLD = COMPONENT + ENVIRONMENT
The subcircuit is the component and the rest is the environment.
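The decomposition above can be sketched in a few lines of Python. This is a minimal illustration, assuming each world is a flat dictionary of cached activations keyed by node name. The node names, the values, and the `split` and `transplant` helpers are hypothetical and are not taken from the experiment's code.

```python
# Hypothetical cached activations for two worlds (names and values are illustrative).
clean_world = {"h1": 0.9, "h2": 0.1, "h3": 0.4, "h4": 0.7}
corrupted_world = {"h1": 0.2, "h2": 0.8, "h3": 0.5, "h4": 0.3}

# The subcircuit's nodes form the component; every other node is the environment.
component_nodes = {"h1", "h2"}


def split(world, component_nodes):
    """Split a world into (component, environment) dictionaries."""
    component = {n: v for n, v in world.items() if n in component_nodes}
    environment = {n: v for n, v in world.items() if n not in component_nodes}
    return component, environment


def transplant(donor, recipient, component_nodes):
    """Build a hybrid world: component from the donor, environment from the recipient."""
    component, _ = split(donor, component_nodes)
    _, environment = split(recipient, component_nodes)
    return {**environment, **component}


# WORLD = COMPONENT + ENVIRONMENT: recombining a world's own parts gives it back.
assert transplant(clean_world, clean_world, component_nodes) == clean_world

# Clean component placed into the corrupted environment.
hybrid = transplant(clean_world, corrupted_world, component_nodes)
print(hybrid)
```

In an actual network the hybrid activations would be fed forward, so downstream nodes are recomputed rather than copied. The flat dictionary ignores that and only shows the bookkeeping.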
Our analysis has two versions, depending on the focus of our causal question:
In-Circuit: We will ask about the causal effect of the component
In-Circuit: The component is the focus
Out-of-Circuit: We will ask about the causal effect of the environment
Out-of-Circuit: The environment is the focus
We also have two different directions:
Denoising: We will patch the clean component into the corrupted environment.
Denoising: patch clean into corrupted
Noising: We will patch the corrupted component into the clean environment.
Noising: patch corrupted into clean
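To make the two directions concrete, here is a minimal sketch on a hypothetical one-hidden-layer network that computes XOR. The weights, the choice of hidden units 0 and 1 as the component, and the function names are all assumptions made for illustration. This is not the model or the patching code used in this series.

```python
def relu(v):
    return max(0.0, v)


# Hypothetical toy network: each hidden unit is (weight_x0, weight_x1, bias).
W_HIDDEN = [(1.0, 1.0, 0.0), (1.0, 1.0, -1.0), (1.0, -1.0, 0.0)]
W_OUT = [1.0, -2.0, 0.0]  # output = h0 - 2 * h1, which equals XOR on binary inputs


def hidden(x):
    """Hidden activations for a two-bit input."""
    return [relu(w0 * x[0] + w1 * x[1] + b) for w0, w1, b in W_HIDDEN]


def output(h):
    return sum(w * a for w, a in zip(W_OUT, h))


def patched_output(component_x, environment_x, component):
    """Run with component units driven by one input and environment units by another."""
    h_comp, h_env = hidden(component_x), hidden(environment_x)
    mixed = [h_comp[i] if i in component else h_env[i] for i in range(len(h_env))]
    return output(mixed)


def denoise(clean_x, corrupted_x, component):
    """Patch the clean component into the corrupted environment."""
    return patched_output(clean_x, corrupted_x, component)


def noise(clean_x, corrupted_x, component):
    """Patch the corrupted component into the clean environment."""
    return patched_output(corrupted_x, clean_x, component)


clean_x, corrupted_x = (0, 1), (1, 1)   # XOR is 1 on the clean input, 0 on the corrupted one
full_component = {0, 1}

print(denoise(clean_x, corrupted_x, full_component))  # 1.0: the clean output is recovered
print(noise(clean_x, corrupted_x, full_component))    # 0.0: the clean output is disrupted
print(denoise(clean_x, corrupted_x, {0}))             # -1.0: unit 0 alone does not recover it
```

With a single hidden layer there is nothing downstream of the patched units to recompute, which keeps the sketch short. Deeper networks need a forward pass that overwrites activations layer by layer.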
In-Circuit
Let’s ask questions about the component’s causal effect.
Sufficiency
If everything in the environment were different except for a single component, would the behavior of interest remain?
Is the component sufficient to recover the behavior?
We can think of this surgical modification as denoising the corrupted sample with the clean component and verifying if we can recover the clean sample signal.
Necessity
If the environment were the same but the single component were different, would the behavior of interest disappear?
Is the component necessary to avoid disrupting the behavior?
Out-of-Circuit
Let’s ask questions about the causal effect of the environment.
Completeness
If everything in the environment were different except for a single component, would the behavior of interest disappear?
If the environment is sufficient to recover the behavior, the component is not complete.
Independence
If the environment were the same but the single component were different, would the behavior of interest remain?
If the environment is necessary to avoid disrupting the behavior, the component is not independent.
Four Perspectives
We measure two things:
Recovery: Does the patched circuit produce the clean output?
Disruption: Does the patched circuit fail to produce the clean output?
The four scores are combinations of these, depending on whether we patch in-circuit or out-of-circuit, and whether we denoise or noise.
We get four different scores that characterize the causal effect of the component:
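One way to read the four scores is as recovery or disruption rates over (clean, corrupted) input pairs. The sketch below is an assumption about how they could be combined, inferred from the definitions in the sections above; it is not the scoring code used in the experiments. The four `PatchedRun` callables are hypothetical stand-ins for whatever patching procedure produces an output for a pair, and the exact-match tolerance `tol` is also an assumption.

```python
from typing import Callable, Dict, List, Tuple

Input = Tuple[float, ...]
Pair = Tuple[Input, Input]                    # (clean input, corrupted input)
PatchedRun = Callable[[Input, Input], float]  # patched output for one pair


def recovery(run: PatchedRun, clean_output: Callable[[Input], float],
             pairs: List[Pair], tol: float = 1e-6) -> float:
    """Fraction of pairs where the patched run reproduces the clean output."""
    hits = sum(abs(run(clean, corrupted) - clean_output(clean)) <= tol
               for clean, corrupted in pairs)
    return hits / len(pairs)


def disruption(run: PatchedRun, clean_output: Callable[[Input], float],
               pairs: List[Pair], tol: float = 1e-6) -> float:
    """Fraction of pairs where the patched run fails to reproduce the clean output."""
    return 1.0 - recovery(run, clean_output, pairs, tol)


def four_scores(in_denoise: PatchedRun, in_noise: PatchedRun,
                out_denoise: PatchedRun, out_noise: PatchedRun,
                clean_output: Callable[[Input], float],
                pairs: List[Pair]) -> Dict[str, float]:
    return {
        # In-circuit: the component is patched.
        "sufficiency": recovery(in_denoise, clean_output, pairs),
        "necessity": disruption(in_noise, clean_output, pairs),
        # Out-of-circuit: the environment is patched.
        "completeness": disruption(out_denoise, clean_output, pairs),
        "independence": recovery(out_noise, clean_output, pairs),
    }


# Stub runs for illustration: the output simply follows one of the two inputs.
def follows_clean(clean, corrupted):
    return clean[0]


def follows_corrupted(clean, corrupted):
    return corrupted[0]


pairs = [((1.0,), (0.0,)), ((0.0,), (1.0,))]
ideal = four_scores(follows_clean, follows_corrupted, follows_corrupted, follows_clean,
                    clean_output=lambda x: x[0], pairs=pairs)
leaky = four_scores(follows_clean, follows_corrupted, follows_clean, follows_clean,
                    clean_output=lambda x: x[0], pairs=pairs)
print(ideal)
print(leaky)
```

In the `leaky` case the clean environment alone recovers the behavior, so completeness drops to zero while the in-circuit scores stay perfect. Pairs whose clean and corrupted outputs coincide cannot register disruption, so they are best excluded from `pairs`.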
Let’s calculate these scores for our toy experiment.
Experiments
As a reminder, these are the subcircuits we were analyzing last time.
Subcircuit #44
Subcircuit #34
We calculate the scores on the four ideal inputs.
Sufficiency
Multiple subcircuits score perfectly, so sufficiency is not a differentiator in our toy example.
Necessity
Multiple subcircuits score perfectly, so necessity is not a differentiator in our toy example.
Sufficiency and necessity give us an isolated view of the component.
Completeness
There is a single subcircuit that scores highest in completeness!
Independence
The same single subcircuit scores highest in independence!
Clear winner?
Subcircuit #34 scores the highest.
Counterfactual faithfulness has helped us sort out the top subcircuits!
It is important to note that the main differentiator was the causal effect of the environment.
But we are not done yet. In our second entry, we deferred a question:
“To simplify our initial analysis, we will identify subcircuits by their node masks and, among the many possible edge variants for each mask, consider only the most complete one. Once we identify a node mask that clearly outperforms the others, we will then examine its edge variants in detail.”
Let’s look at the top edge variants for subcircuit #34 (node mask #34):
We see that some of the edge-variant subcircuits are incomparable under inclusion (neither is a subcircuit of the other).
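Incomparability under inclusion is easy to check once each edge variant is written as a set of edges. The sketch below uses three hypothetical variants over the same node mask; the node names and edges are made up for illustration and do not correspond to subcircuit #34.

```python
def is_subcircuit(edges_a, edges_b):
    """True if every edge of variant A also appears in variant B."""
    return edges_a <= edges_b


def comparable(edges_a, edges_b):
    """True if one variant is a subcircuit of the other."""
    return is_subcircuit(edges_a, edges_b) or is_subcircuit(edges_b, edges_a)


# Hypothetical edge variants sharing the node mask {x0, x1, h1, h2, y}.
variant_a = frozenset({("x0", "h1"), ("x1", "h2"), ("h1", "y"), ("h2", "y")})
variant_b = variant_a | {("x0", "h2")}
variant_c = variant_a | {("x1", "h1")}

print(comparable(variant_a, variant_b))  # True: A is a subcircuit of B
print(comparable(variant_b, variant_c))  # False: each has an edge the other lacks
```

Inclusion is only a partial order, so when two variants are incomparable there is no "smaller" one to prefer, and the choice has to come from the scores.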
All edge variants score the same for node mask #34!
We may have identified a best node mask for the full model, but we cannot differentiate among the edge variants of that same node mask!
It seems there is more work for us.
Counterfactuals got us further than observation or intervention alone. But they also revealed a new layer of non-identifiability: the edges. And the tools we’ve been using so far all operate in activation space. To go further, we may need a different paradigm entirely: parameter space interpretability.
Paradigms are often deeply ingrained in us and very hard to change. Their failures often go silent. We forget they are not indefeasible. Our account of causality uses particular frameworks, which we shall examine more closely.
Paradigm as Substrate
There are many frameworks for causality. Structural Causal Models (SCMs) built on Directed Acyclic Graphs (DAGs)[2] are the most common in circuit analysis, but they have blind spots[3] that matter for neural networks:
Determinism breaks d-separation. Every neuron is a deterministic function of the layer below, so no faithful DAG exists over network activations. Factored Space Models[4] handle this by defining structural independence on product spaces.
SCMs require you to pick variables first, but in LLMs, the meaningful causal units are unknown and distributed. Causal spaces[5] let you define causal effects on events and subspaces without committing to a variable decomposition.
Interchange interventions compare two computation traces, not one. Pearl’s do-operator operates on a single world. Counterfactual spaces[6] formalize multi-world comparisons directly.
Cycles, continuous-time dynamics, and latent accumulation in residual streams don’t fit the acyclic, discrete structure assumed by SCMs. Causal spaces[7] handle these natively.
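The first point in the list above can be illustrated with a three-variable chain. This is a minimal sketch with made-up variables, not a model of any network in this series: X is a fair coin, Y is a deterministic copy of X, and Z is Y flipped with probability 1/4. In the DAG X → Y → Z, d-separation implies that X is independent of Z given Y. Because Y is a deterministic function of X, the distribution also satisfies "Y is independent of Z given X", even though Y and Z share an edge, so this DAG is not faithful to the distribution.

```python
from fractions import Fraction
from itertools import product

X, Y, Z = 0, 1, 2  # positions of the variables in each event tuple


def joint():
    """P(x, y, z) for X ~ Bernoulli(1/2), Y = X, Z = Y xor noise, noise ~ Bernoulli(1/4)."""
    p = {}
    for x, n in product((0, 1), repeat=2):
        y = x                      # deterministic dependence
        z = y ^ n
        p_noise = Fraction(1, 4) if n else Fraction(3, 4)
        p[(x, y, z)] = p.get((x, y, z), 0) + Fraction(1, 2) * p_noise
    return p


def marginal(p, keep):
    """Marginal distribution over the variables at positions `keep`."""
    out = {}
    for event, prob in p.items():
        key = tuple(event[i] for i in keep)
        out[key] = out.get(key, 0) + prob
    return out


def cond_independent(p, a, b, given=()):
    """Check P(a, b, c) * P(c) == P(a, c) * P(b, c) for all binary values."""
    idx = tuple(given)
    p_abc = marginal(p, (a, b) + idx)
    p_ac = marginal(p, (a,) + idx)
    p_bc = marginal(p, (b,) + idx)
    p_c = marginal(p, idx)
    for va, vb in product((0, 1), repeat=2):
        for vc in product((0, 1), repeat=len(idx)):
            lhs = p_abc.get((va, vb) + vc, 0) * p_c.get(vc, 0)
            rhs = p_ac.get((va,) + vc, 0) * p_bc.get((vb,) + vc, 0)
            if lhs != rhs:
                return False
    return True


p = joint()
print(cond_independent(p, X, Z, given=(Y,)))  # True: implied by d-separation
print(cond_independent(p, Y, Z, given=(X,)))  # True: holds only because Y = X
print(cond_independent(p, Y, Z))              # False: Z really does depend on Y
```

The second independence is the kind of "extra" statement that determinism introduces and that d-separation on the DAG alone does not predict.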
Each of these frameworks makes different assumptions about what remains stable during our analysis. Each is a different substrate for causal reasoning. Just as the evaluation domain was a substrate in our observational analysis, and the circuit boundary was a substrate in our interventional analysis, the causal framework itself is a substrate.
Our counterfactual conclusions are sensitive to the assumptions of the causal framework we select.
Let’s recap what we’ve established:
Counterfactual faithfulness provides stronger discriminative power than observation or intervention alone.
The main differentiator was the causal effect of the environment (completeness and independence), not the component in isolation (sufficiency and necessity).
Subcircuit #34 is the clear winner at the node-mask level.
But edge variants within the same node mask remain indistinguishable. Activation-space methods could have a ceiling here.
The causal framework itself is a substrate.
Next time, we move from activation space to parameter space.
[1] For instance, in The Intimacies of Four Continents, Lowe asks the reader to use counterfactuals to refuse the narrative that the way things went was the only way they could have gone, and to use imagination to reckon with the possibilities lost in history.
[2] An Introduction to Causal Inference. Pearl (2010).
[3] Also, counterfactuals are tricky to reason with because several inference patterns that seem perfectly logical can break down. See: Counterfactual Fallacies in Causality. Stanford Encyclopedia of Philosophy (2026).
[4] Factored Space Models. Garrabrant et al. (2024).
[5] A fine-grained look at causal effects in causal spaces. Park et al. (2025b).
[6] Counterfactual spaces. Park et al. (2026).
[7] A Measure-Theoretic Axiomatisation of Causality. Park et al. (2025a).