Poking and Editing the Circuits

This is the third entry in the “Which Circuit is it?” series. We will explore possible notions of faithfulness as we consider interventions. This project is done in collaboration with Groundless.
Last time, we delved into black-box methods to see how far observational faithfulness could take us in our toy experiment. Between our top 2 subcircuits (#34 and #44), we were unable to determine which was a better explanation. Also, we saw how some proxy explanations (subcircuit hypotheses) that are robust in one domain diverge from the target explanation (the full model) in another domain.
This time, we will open up the black box. We will see whether interventions can help us sort out which subcircuits are better explanations than others. The core idea: if an interpretation is interventionally faithful, then changes in the proxy world should correctly predict changes in the target world.
Interventions
Interventions are edits we can make to a neural network’s internals to make it behave in specific ways.
As Tools for Control
Understanding how neural networks work is not just an intellectual exercise. We want guarantees about their behavior. We want to be able to reach inside and control them: steer their behavior, ensure safety, and align them with human values.
Let’s say we wanted a large model to act as if it were the Golden Gate Bridge.[1] There could be an intervention that triggers that behavior in our large model (target explanation):
An intervention could be as simple as adding a value to a specific node.
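To make this concrete, here is a minimal sketch of an additive intervention on a toy network. The weights, the input, and the steering value are all invented for illustration:

```python
import math

# A tiny stand-in "model": one hidden tanh layer. All weights and the
# steering value below are made up for illustration.
W_IN = [0.8, -0.5, 0.3]    # input -> hidden weights
W_OUT = [1.0, 0.7, -0.4]   # hidden -> output weights

def forward(x, steer_node=None, steer_value=0.0):
    """Run the model; optionally add steer_value to one hidden node."""
    hidden = [math.tanh(w * x) for w in W_IN]
    if steer_node is not None:
        hidden[steer_node] += steer_value  # the intervention: a simple addition
    return sum(w * h for w, h in zip(W_OUT, hidden))

baseline = forward(0.5)
steered = forward(0.5, steer_node=1, steer_value=2.0)
print(baseline, steered)  # steering node 1 shifts the output
```

In a real model, the same idea is typically implemented as a forward hook that adds a steering vector to a chosen activation during inference.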
Large Language Models (LLMs) have billions of parameters. How could we figure out how to make one act the way we want?
Imagine we already have an ideal replacement model: a proxy explanation for how the large model works, but with components we do understand.
The proxy explanation is understandable to us, so it is operable.
In the replacement model, we have clarity on how to make the desired change. We edit one specific piece of it and can turn on the “act like the Golden Gate Bridge” behavior.
At the same time, we use the replacement model to interpret the large model. If an interpretation is interventionally faithful, we could correctly link changes in the target world to those in the proxy world.
We have two worlds: the target world on the left and the proxy world on the right.
The interpretation serves as a bridge: by understanding how interventions work in the proxy world, we can figure out how to intervene in the target world.
This diagram should ideally commute: “interpret then intervene” = “intervene then interpret”.
For this to work, we need to have a way to translate our interventions from the target world to the proxy world.
In our imagined example, the existence of an interventionally faithful replacement model assumes we already know this mapping. In real scenarios, how do we figure out the map that translates our interventions?
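In code, the hoped-for commutation can be phrased as a check: an intervention applied in the target world, translated through the map, should produce the same output change as the corresponding intervention in the proxy world. The two stand-in models and the node map below are entirely hypothetical:

```python
# Hypothetical target and proxy models, plus a map translating
# interventions between them. All names and weights are invented.
def target(x, edits=None):
    """A stand-in for the large model; edits add values to named nodes."""
    edits = edits or {}
    a = x + edits.get("t_a", 0.0)
    b = 2 * a + edits.get("t_b", 0.0)
    return 3 * b

def proxy(x, edits=None):
    """A stand-in replacement model with understandable components."""
    edits = edits or {}
    a = x + edits.get("p_a", 0.0)
    b = 2 * a + edits.get("p_b", 0.0)
    return 3 * b

node_map = {"t_a": "p_a", "t_b": "p_b"}  # the intervention translation map

def commutes(x, node, value, tol=1e-9):
    """'Intervene then interpret' should equal 'interpret then intervene':
    the output change must be the same in both worlds."""
    delta_target = target(x, {node: value}) - target(x)
    delta_proxy = proxy(x, {node_map[node]: value}) - proxy(x)
    return abs(delta_target - delta_proxy) < tol
```

When `commutes` holds across the interventions we care about, the interpretation is interventionally faithful in the sense described above.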
As Tools for Causal Identification
Perhaps we developed our proxy model in a structurally informed way, such that we have intuitions for how to construct that map.
Otherwise, to derive that map from data, we need to study each explanation in isolation and identify which of its components causes the same behavior of interest.
We need to causally identify the components to form an intervention map.
Interventions in this setting are used to determine cause and effect.
There is a lot to say about causal discovery and causal identification of components, but it is outside the scope of this article. However, we will use the concept of causal effect to assess whether a proxy explanation is adequate for the target explanation.
The Golden Gate Bridge example assumes we already know how to translate interventions between the proxy and target worlds. But that’s the hard part.
Where could we get such a map for our toy example?
As Tools for Verification
In our toy setting, the proxy and target explanations have a special relationship: one is a subcircuit of the other.
We have a canonical way to map interventions: one circuit is a subgraph of the other, so we intervene on the “same” node in both. But is this really the right map? The shared graph structure makes it natural, but natural is not the same as correct. We’ll run with this assumption for now and revisit it in a later post.
Then, we can use interventions to verify if our explanations align.
Subcircuit Boundaries
An in-circuit node is one that is shared by both the full model and the subcircuit. If we intervene in an in-circuit node, we expect the behavior of interest to change:
Before any intervention, both circuits produce the same output for the same input. So if we intervene on the same in-circuit node, we’d expect both to change their output in the same way.
An out-of-circuit node is one that is only present in the full model. If we intervene in an out-of-circuit node, we expect the behavior of interest to remain unaffected.
The logic is straightforward: if the subcircuit fully explains the behavior, then nodes it excludes shouldn’t matter. They serve other functions or contribute noise. Intervening on them should be like flipping a switch in an unrelated room.
Before any intervention, both circuits produce the same output for the same input. So if we intervene out-of-circuit, outputs should remain unchanged.
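The two expectations can be sketched together on a toy model. The weights and the circuit choice below are invented; node 2 is given a small output weight so that it plausibly sits out-of-circuit:

```python
import math

# Invented toy model: 3 hidden tanh nodes; the subcircuit keeps {0, 1}.
W_IN = [0.9, -0.6, 0.2]
W_OUT = [1.1, 0.8, 0.05]   # node 2 is nearly (but not exactly) inert
CIRCUIT = {0, 1}           # the subcircuit hypothesis

def run(x, keep=None, edits=None):
    """Run the full model (keep=None) or a subcircuit (keep=set of nodes)."""
    edits = edits or {}
    out = 0.0
    for i, (wi, wo) in enumerate(zip(W_IN, W_OUT)):
        if keep is not None and i not in keep:
            continue                        # ablate out-of-circuit nodes
        h = math.tanh(wi * x) + edits.get(i, 0.0)
        out += wo * h
    return out

x = 0.4
# In-circuit intervention: both full model and subcircuit should shift alike.
d_full = run(x, edits={0: 0.5}) - run(x)
d_sub = run(x, keep=CIRCUIT, edits={0: 0.5}) - run(x, keep=CIRCUIT)
# Out-of-circuit intervention: the subcircuit cannot respond at all,
# while the full model shifts a little through the excluded node.
d_full_out = run(x, edits={2: 0.5}) - run(x)
d_sub_out = run(x, keep=CIRCUIT, edits={2: 0.5}) - run(x, keep=CIRCUIT)
```

Note that even in this toy sketch the out-of-circuit node leaks a small effect into the full model's output, which foreshadows the boundary problems discussed later.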
In the last post in the series, we saw that observational faithfulness depended on our input domain specification. This time, we will consider the intervention domain specification. Our intervention is in-distribution if the intervened node’s value stays close to the values it takes under ideal input. Our intervention is out-of-distribution if the value deviates substantially from that normal range:
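One way to operationalize this distinction (a sketch, not necessarily the criterion used in the experiments): observe the node's values over typical inputs, and classify an intervention by whether it lands inside that observed range. The toy node, its weight, and the slack parameter are assumptions:

```python
import math

def hidden_value(x):
    return math.tanh(0.7 * x)  # one toy node; the weight is invented

# Observe the node over "ideal" inputs to estimate its normal range.
samples = [hidden_value(x / 10) for x in range(-10, 11)]
lo, hi = min(samples), max(samples)

def classify_intervention(value, slack=0.1):
    """In-distribution if within a slightly padded observed range."""
    if lo - slack <= value <= hi + slack:
        return "in-distribution"
    return "out-of-distribution"

print(classify_intervention(0.3))   # within the node's observed range
print(classify_intervention(5.0))   # far outside: an OOD intervention
```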
Let’s see if interventions can help us decide which subcircuit is the best explanation.
Experiments
These are the subcircuits we were analyzing last time.
Subcircuit #44
Subcircuit #34
They are both equally observationally faithful: they behave the same way as the full model does under noise and with out-of-distribution input. We hope interventions will cleanly break this tie.
In-Circuit Interventions
In-Distribution
We intervene on nodes shared by both the full model and the subcircuit, using activation values drawn from the normal operating range.
For each intervention, we compare the change in the full model’s output against the change in the subcircuit’s output. If a subcircuit is interventionally faithful, these should track closely.
The plots show this comparison for the top subcircuits. Subcircuit #34 tracks the full model more tightly across interventions than #44, but not by much.
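A simple score in this spirit (not necessarily the exact metric behind the plots) compares the output changes pairwise; the delta values below are invented placeholders:

```python
def interventional_faithfulness(deltas_full, deltas_sub):
    """Mean absolute gap between output changes under matched interventions.
    Lower = the subcircuit tracks the full model more tightly."""
    assert len(deltas_full) == len(deltas_sub)
    gaps = [abs(f - s) for f, s in zip(deltas_full, deltas_sub)]
    return sum(gaps) / len(gaps)

# Placeholder numbers in the spirit of the plots: #34 tracks slightly
# tighter than #44, but not by much.
score_34 = interventional_faithfulness([0.5, -0.2, 0.8], [0.48, -0.21, 0.79])
score_44 = interventional_faithfulness([0.5, -0.2, 0.8], [0.42, -0.15, 0.70])
print(score_34, score_44)
```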
Let’s see whether more extreme interventions yield a stronger signal.
Out-Of-Distribution
Now we push the values of the intervened node well outside their normal range.
This is a stress test. We are asking: Does the subcircuit still predict the full model’s behavior when things get weird?
Most subcircuits pass this test, which makes it uninformative. The likely reason: at extreme activation values, tanh saturates, and all subcircuits end up in the same flat region. The intervention is too blunt to reveal structural differences[2].
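The saturation effect is easy to verify directly: beyond a few units of input, tanh is essentially flat, so very different intervention magnitudes become indistinguishable.

```python
import math

# At extreme inputs tanh is essentially flat, so very different
# pre-activations collapse to nearly identical outputs.
for v in [2.0, 5.0, 10.0, 50.0]:
    print(v, math.tanh(v))

gap = math.tanh(50.0) - math.tanh(10.0)
print(gap)  # vanishingly small: the intervention signal is gone
```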
We need to look elsewhere.
Out-of-Circuit Interventions
Here, we intervene on nodes that exist only in the full model, not in the subcircuit. If the subcircuit is a good explanation, these out-of-circuit nodes should not matter much for the behavior of interest. Intervening on them should leave the behavior roughly unchanged.
In-Distribution
Intervening outside the circuit with in-distribution values has little effect in our situation. Both of our top subcircuits pass this test with flying colors.
Will both subcircuits survive stronger interventions? Let’s push harder.
Out-Of-Distribution
This is where things get interesting!
Intervening outside the circuit with out-of-distribution values breaks all subcircuits!
When we apply such interventions, both our top subcircuits break down!
Putting It All Together
We can naively average the scores across the four cases.
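As a sketch, the naive aggregate is just an unweighted mean over the four conditions; the score names and numbers below are placeholders, not the actual experimental values:

```python
# A naive aggregate: unweighted mean of the four per-condition scores.
def naive_average(scores):
    return sum(scores.values()) / len(scores)

# Placeholder scores for one subcircuit (higher = more faithful here).
scores_34 = {
    "in_circuit_in_dist": 0.95,
    "in_circuit_ood": 0.90,
    "out_circuit_in_dist": 0.97,
    "out_circuit_ood": 0.40,   # the OOD out-of-circuit failure drags it down
}
print(naive_average(scores_34))
```

An unweighted mean is a deliberate simplification: it treats a failure in one condition as exchangeable with success in another, which may not be what we want.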
Across all four conditions, #34 is more interventionally faithful than #44.
In-circuit, #34 tracks the full model more tightly than #44. Out-of-circuit, they both break down similarly. If we had to pick one subcircuit as the better explanation, #34 wins.
Interventions broke the tie that observation couldn’t. But they also surfaced a new problem.
The results also reveal something uncomfortable: even for the better subcircuit, the boundary between “in-circuit” and “out-of-circuit” is not as clean as we would like.
We are not yet fully convinced that #34 is undeniably the best subcircuit. The gap is not big enough. Next time, we will see if counterfactuals can help us widen the gap.
Boundary as Substrate
Intervening outside the circuit can have a strong effect on the behavior of interest. This is the uncomfortable result of our experiments, and it deserves its own discussion.
When we define a subcircuit, we draw a boundary. Nodes inside the boundary are “the explanation.” Nodes outside are “everything else.” Our interventional tests assume this boundary is meaningful: that in-circuit nodes are the ones that matter, and out-of-circuit nodes can be safely ignored.
But the out-of-circuit experiments show this is not quite true. The boundary leaks. Nodes we excluded still influence the behavior, especially under out-of-distribution conditions.
This connects to a broader point about what Farr et al. call substrate. In MoSSAIC, substrate is defined as “that layer of abstraction which you don’t have to think about.” Each layer of a system sits on top of a lower layer that is assumed to remain stable. The mech-interp researcher operates at the level of weights and activations, assuming a fixed architecture, a fixed operating system, and stable hardware. These assumptions are nested and usually invisible.
Our subcircuit boundary is a substrate assumption. We assumed the out-of-circuit nodes form a stable, inert background. The experiments say otherwise.
We cannot separate the object from the environment
Chvykov’s work [3] from dynamical systems makes a related point from a completely different angle. In his framework, the distinction between “system” and “environment” is not given by physics. It is a choice. The same dynamical system produces different emergent patterns depending on where you draw the system-environment boundary. The patterns that emerge are not properties of the system alone, nor of the environment alone. They are properties of the interaction, of the cut.
The parallel to circuit analysis is direct: the subcircuit is the “system,” the rest of the model is the “environment,” and the faithfulness of our explanation depends on where we made the cut. A different boundary might yield a different story.
What is outside can make us brittle
Take this out of toy-model land for a moment. Suppose we have identified a subcircuit in a large language model responsible for refusal behavior. We trust our boundary: these nodes handle refusal, the rest handle other things. Now, suppose a GPU computation error occurs at a node outside the refusal subcircuit. Nothing to worry about, right? The refusal circuit should be unaffected. But our out-of-circuit experiments tell a different story. Intervening on nodes outside the circuit did change the behavior of interest. If small perturbations outside the boundary can leak in, then the isolation we assumed is not the isolation we have.
Recent work makes this concrete[4]. LLM inference is not deterministic. Changing batch size, GPU count, or GPU type produces numerically different outputs, even at temperature zero. The root cause is floating-point non-associativity: the order in which numbers get added affects the result due to rounding, and that order changes with the system configuration.
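The root cause is easy to reproduce in a few lines; no special hardware is needed:

```python
# Floating-point addition is not associative: regrouping the same
# numbers changes the rounding, hence the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False
print(left, right)

# Summation order depends on how work is split across devices and
# batches, so the same logical sum can differ between configurations.
print(sum([1e16, 1.0, -1e16]))   # left-to-right: the 1.0 is absorbed
print(sum([1e16, -1e16, 1.0]))   # reordered: the 1.0 survives
```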
For reasoning models, these rounding differences in early tokens can cascade into completely divergent chains of thought. Under bfloat16 precision with greedy decoding, a reasoning model can show up to a 9% variation in accuracy and a difference of thousands of tokens in response length[5], simply from changing the hardware setup.
Little leaks everywhere
There is another boundary that leaks: the one between training and inference. Modern RL frameworks use separate engines for rollout generation and gradient computation. Same weights. Should give the same probabilities. They do not[6]: the inference engine and the training engine can give contradictory predictions for the same token under the same parameters. What we think is on-policy RL is secretly off-policy, because the numerical substrate shifted between the two engines. This is the out-of-circuit problem again. We drew a boundary (same weights = same model) and assumed everything below that boundary was stable. It was not. The substrate participated, and the behavior changed.
From the circuit perspective, this is an uncontrolled out-of-circuit intervention. A node activation shifts by a rounding error. The shift is tiny. But if the boundary between “relevant” and “irrelevant” computation is porous, tiny shifts in the wrong place can propagate.
Even if the probability of those errors is low, it only takes one instance in the wrong place to matter. It could even be catastrophic.
Let’s recap what we’ve established:
Interventions can break ties that observation cannot. In-circuit interventions under normal conditions identified #34 as the better explanation.
Out-of-circuit interventions revealed that the boundary between “explanation” and “everything else” is not clean. Excluded nodes still influence the behavior.
The subcircuit boundary is itself a substrate assumption. Its stability is not guaranteed.
In the next post, we will continue investigating interventions in relation to counterfactuals. Until next time.
[1] This actually happened. Anthropic found a feature in Claude that, when amplified, caused the model to identify as the Golden Gate Bridge.
[2] When we look at other small MLPs later in the series with different activation functions, this might change.
[3] Chvykov et al., 2021, https://rattling.org/
[4] He et al., 2025
[5] Yuan et al., 2025
[6] Yao et al., 2025