Implicit vs. Explicit Gender Use Different Circuits for Pronoun Resolution in GPT-2 Small

I’m a Master’s in AI student working through Neel Nanda’s open problems list. Feedback very welcome.

Summary

I investigated how GPT-2 Small resolves pronouns after “because” clauses (e.g. “Alice gave Bob the book because ___ was kind” → “she”). Using activation patching, ablation, and direct logit attribution (DLA), I identified a sparse 5-head circuit that causally drives this behavior when gender is encoded implicitly through name-gender priors.

The key results:

Ablating 5 heads (about 3.5% of the model’s 144 attention heads) inverts the model’s pronoun preference on implicit-gender prompts (mean logit diff: +0.107 → −0.038)
A confound-controlled replication with explicit gender markers (“Alex, who is female, gave Jordan the book because”) shows the same circuit does not drive behavior when gender is stated explicitly
Activation patching on the neutral dataset identifies a partially overlapping but distinct set of heads, with L9H5 and L7H3 appearing in both
The explicit-gender circuit candidates fail to confirm causally under ablation, suggesting the explicit-gender mechanism is more distributed or MLP-mediated

This is consistent with GPT-2 using different computational pathways depending on how gender information is encoded in the input.

Motivation

Wang et al. (2022) identified the Indirect Object Identification (IOI) circuit in GPT-2 Small, which handles sentences like “When Mary and John went to the store, John gave a drink to ___” → “Mary”. The IOI task is pure entity co-reference: which name fills a slot based on syntactic position.

Causal pronoun resolution is a related but distinct task: the pronoun is determined by who performed an action, not just who appears in the right syntactic slot. For example:

“Alice gave Bob the book because ___ was kind” → she (Alice, the giver)
“Bob helped Alice move because ___ was strong” → he (Bob, the helper)

This requires tracking grammatical role (agent vs. recipient) in addition to entity identity. I investigated whether GPT-2 uses a dedicated circuit for this, whether that circuit is confounded by name-gender priors, and what happens when those priors are removed.

Methods

Model: GPT-2 Small (12 layers, 12 heads, d_model=768) via TransformerLens.

Dataset 1 — Implicit gender (original): 20 prompts across 5 categories using gendered names (Alice/Sarah/Emma = female, Bob/John/James = male). Corrupted versions created by swapping subject/object names.

Dataset 2 — Explicit gender (neutral): 20 prompts using gender-neutral names (Alex, Jordan, Riley, Casey) with explicit role markers: “Alex, who is female, gave Jordan the book because.” Corrupted by swapping names and flipping gender markers. This removes the name-gender prior as a confound.

Metric: Logit difference, i.e. logit(correct pronoun) − logit(incorrect pronoun), rather than argmax accuracy. This gives a continuous signal of directional confidence.
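As a rough illustration of the metric, the sketch below computes the mean logit difference alongside a pairwise accuracy (the fraction of prompts where the correct pronoun outscores the incorrect one). The function name and the toy logits are hypothetical and not taken from the notebook, and pairwise accuracy is used only as a simple stand-in for argmax accuracy.

```python
def summarize_logit_diffs(pairs):
    """Summarize (logit_correct, logit_incorrect) pairs, one per prompt.

    Returns (pairwise_accuracy, mean_logit_diff).
    """
    diffs = [correct - incorrect for correct, incorrect in pairs]
    accuracy = sum(1 for d in diffs if d > 0) / len(diffs)
    mean_diff = sum(diffs) / len(diffs)
    return accuracy, mean_diff


# Hypothetical logits: set A is right more often but with thin margins,
# set B is right less often but with wide margins.
set_a = [(2.0, 1.5), (1.5, 1.0), (3.0, 2.5), (1.0, 1.5)]
set_b = [(4.0, 2.0), (5.0, 3.0), (1.0, 1.5), (2.0, 2.5)]

acc_a, mean_a = summarize_logit_diffs(set_a)
acc_b, mean_b = summarize_logit_diffs(set_b)
print(f"A: accuracy={acc_a:.2f}, mean logit diff={mean_a:+.3f}")
print(f"B: accuracy={acc_b:.2f}, mean logit diff={mean_b:+.3f}")
```

In this toy case set B has lower accuracy but a higher mean logit diff, so the two summaries can move in opposite directions and are worth reporting together.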

Activation patching: For each prompt, patched one attention head at a time from the clean cache into the corrupted run. Heads with the most negative restoration scores are most causally important.
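The patching loop can be sketched on a toy model. Everything below is hypothetical: the three "heads" A, B, C, the prompt features, and the scoring are stand-ins for the real TransformerLens run, chosen so the logic is easy to follow. In this sketch a score of 1 means patching a head fully restores the clean logit diff and 0 means it restores nothing; the sign and normalization are assumptions and may differ from the convention used for the scores reported in this post.

```python
def forward(prompt, patch=None):
    """Toy two-layer model. Returns (logit_diff, cache of head outputs).

    `patch` maps a head name to an activation that overrides that head's
    own output; downstream heads then see the patched value.
    """
    patch = patch or {}
    cache = {}
    # Layer-1 heads read the prompt features directly.
    cache["A"] = patch.get("A", prompt["subject_signal"])
    cache["B"] = patch.get("B", prompt["distractor"])
    # The layer-2 head reads head A's output, so patching A also changes C.
    cache["C"] = patch.get("C", 0.5 * cache["A"])
    logit_diff = cache["A"] + cache["B"] + cache["C"]
    return logit_diff, cache


def patching_scores(clean_prompt, corrupted_prompt):
    """Fraction of the clean-vs-corrupted gap restored by patching each head."""
    clean_ld, clean_cache = forward(clean_prompt)
    corrupted_ld, _ = forward(corrupted_prompt)
    gap = clean_ld - corrupted_ld
    scores = {}
    for head, clean_act in clean_cache.items():
        # Re-run the corrupted prompt with one head's clean activation.
        patched_ld, _ = forward(corrupted_prompt, patch={head: clean_act})
        scores[head] = (patched_ld - corrupted_ld) / gap
    return scores


clean = {"subject_signal": 1.0, "distractor": 0.25}
corrupted = {"subject_signal": -1.0, "distractor": 0.25}
scores = patching_scores(clean, corrupted)
for head, score in scores.items():
    print(head, round(score, 3))
```

Head A restores the full gap because its clean output also flows into C, head C restores only its own share, and head B restores nothing because its input is identical in both prompts.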

Ablation: Zero-ablated subsets of heads via hook_z hooks and measured logit diff change on both datasets.
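Zero-ablation at hook_z amounts to overwriting the selected heads' per-head outputs with zeros before the output projection mixes them. The sketch below shows that operation on a nested list shaped [pos][head][d_head], with the batch dimension dropped for brevity. The function name is hypothetical; in the actual TransformerLens setup the same slice would be zeroed on the tensor passed to a hook on `blocks.{layer}.attn.hook_z`.

```python
import copy


def zero_ablate_heads(z, heads):
    """Return a copy of z with the given head indices zeroed at every position.

    z: nested list indexed as z[pos][head][d_head].
    heads: iterable of head indices to ablate.
    """
    out = copy.deepcopy(z)  # leave the caller's activations untouched
    for per_position in out:
        for h in heads:
            per_position[h] = [0.0] * len(per_position[h])
    return out


# Two positions, two heads, d_head = 2 (toy values).
z = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
ablated = zero_ablate_heads(z, heads=[1])
print(ablated)
```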

Direct logit attribution (DLA): Projected each head’s output vector through the unembedding matrix W_U to measure its direct contribution to logit(correct) − logit(incorrect) at the final token position.
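The projection reduces to a dot product between a head's output vector and the difference of two unembedding columns. The sketch below uses a tiny hand-written W_U and head output (both hypothetical) and, as a simplifying assumption, ignores the final LayerNorm that a full DLA pipeline would apply before unembedding.

```python
def dla_logit_diff(head_out, W_U, correct_id, incorrect_id):
    """Direct contribution of one head's output to logit(correct) - logit(incorrect).

    head_out: list of length d_model (the head's write to the residual stream).
    W_U: nested list indexed as W_U[d_model][d_vocab].
    """
    # Direction in residual space that separates the two pronoun logits.
    direction = [row[correct_id] - row[incorrect_id] for row in W_U]
    return sum(h * d for h, d in zip(head_out, direction))


# Toy unembedding with d_model = 3 and d_vocab = 4.
W_U = [
    [0.0, 1.0, 2.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.0, 3.0, 1.0, 0.0],
]
head_out = [1.0, 0.0, -1.0]
CORRECT, INCORRECT = 2, 1

dla = dla_logit_diff(head_out, W_U, CORRECT, INCORRECT)

# Cross-check: unembed fully, then subtract the two logits.
full_logits = [sum(h * row[v] for h, row in zip(head_out, W_U)) for v in range(4)]
print(dla, full_logits[CORRECT] - full_logits[INCORRECT])
```

Both routes give the same number because the projection is linear, which is what lets the logit difference be split into per-head contributions.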

Results

1. Baseline Performance

Dataset	Accuracy	Mean logit diff
Implicit gender (original)	75.0%	+0.107
Explicit gender (neutral)	70.0%	+0.574

The neutral dataset produces a substantially higher mean logit margin despite lower accuracy, which may indicate that explicit gender phrases stabilize entity representations. A higher margin with lower accuracy suggests the model is more confident on the prompts it gets right while still failing outright on a larger share of prompts.

2. Activation Patching — Implicit vs. Explicit Gender

Top 5 causal heads, implicit-gender dataset:

Rank	Layer	Head	Patching score
1	7	1	-0.0999
2	4	11	-0.0984
3	9	5	-0.0734
4	7	3	-0.0612
5	8	3	-0.0262

Top 5 causal heads, explicit-gender dataset:

Rank	Layer	Head	Patching score
1	9	5	-0.5626
2	7	3	-0.5576
3	7	6	-0.5527
4	5	11	-0.5261
5	8	9	-0.5252

Figure 1: Average activation patching scores across 20 implicit-gender prompts. Darker red indicates heads whose output, when patched from clean to corrupted runs, most strongly restores correct pronoun preference. Top causal heads: L7H1, L4H11, L9H5, L7H3, L8H3.

Overlap: L9H5 and L7H3 appear in both lists. L7H1, L4H11, L8H3 are implicit-only. L7H6, L5H11, L8H9 are explicit-only.

Note: patching scores on the neutral dataset are tightly clustered (−0.563 to −0.525), whereas the implicit dataset shows a wider spread (−0.100 to −0.026). This may indicate a more distributed signal or noisier patching due to longer prompts.

Figure 2: Average activation patching scores across 20 explicit-gender (neutral) prompts. Compared to Figure 1, the signal is more distributed across layers and heads, with no single dominant cluster. Top causal heads: L9H5, L7H3, L7H6, L5H11, L8H9. L9H5 and L7H3 are the only heads shared with the implicit-gender circuit, suggesting a partially overlapping but distinct computational pathway.

3. Full Circuit Ablation — Both Datasets

Heads ablated	Original LD	Δ Original	Neutral LD	Δ Neutral
Shared core (L9H5, L7H3)	+0.023	-0.085	+0.343	+0.052
Implicit-only (L7H1, L4H11)	+0.082	-0.025	+0.265	-0.027
Explicit-only (L7H6, L5H11, L8H9)	+0.143	+0.036	+0.323	+0.032
Full original circuit	-0.038	-0.145	+0.359	+0.067
Full neutral circuit	+0.057	-0.051	+0.408	+0.117
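The Δ columns are read here as the ablated logit diff minus the unablated baseline. That reading is an assumption, and the short check below tests it against the original (implicit-gender) columns using the +0.107 baseline from the baseline table. Mismatches of 0.001 are what one would expect from rounding.

```python
BASELINE_ORIG = 0.107  # mean logit diff on implicit-gender prompts, no ablation

# (ablated logit diff, reported delta) on the original dataset, copied from the table
rows = {
    "Shared core (L9H5, L7H3)": (0.023, -0.085),
    "Implicit-only (L7H1, L4H11)": (0.082, -0.025),
    "Explicit-only (L7H6, L5H11, L8H9)": (0.143, 0.036),
    "Full original circuit": (-0.038, -0.145),
    "Full neutral circuit": (0.057, -0.051),
}

# Recompute each delta as ablated minus baseline.
recomputed = {name: ld - BASELINE_ORIG for name, (ld, _) in rows.items()}
for name, (ld, reported) in rows.items():
    print(f"{name}: recomputed {recomputed[name]:+.3f}, reported {reported:+.3f}")
```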

Key finding: Ablating the full original circuit inverts behavior on implicit-gender prompts (logit diff goes negative) but has no negative effect on neutral prompts — it slightly improves them. The explicit-only heads similarly fail to causally affect the neutral dataset when ablated.

This suggests the explicit-gender mechanism is not explained by any of the candidate attention heads. It may be more distributed across heads, implemented primarily in MLP layers, or reliant on a residual stream pathway not captured by single-head patching.

4. Direct Logit Attribution

DLA measures each head’s direct contribution to the final pronoun logit, independent of attention patterns.

Head	Mean DLA diff	Direction
L4H11	+0.656	→ pushes correct
L7H1	+0.086	→ weakly pushes correct
L8H3	-0.010	→ approximately neutral
L9H5	-0.539	→ pushes wrong
L7H3	-0.790	→ pushes wrong

This is a striking dissociation. L7H3 and L9H5, the two heads that appear in both patching results and were hypothesized as a “shared core,” have strongly negative DLA. They are causally important (patching them restores correct behavior on corrupted runs, and ablating them lowers the implicit-gender logit diff) but directly harmful (their output pushes toward the wrong pronoun on clean runs).

L4H11 is the only circuit head with strongly positive DLA, making it the most plausible candidate for a genuine subject-tracking head that directly promotes the correct pronoun.

Interpretation: L7H3 and L9H5 may function as suppression heads. They carry information that is useful when patched into a corrupted context, but on clean runs they appear to be suppressing a competing signal. This is functionally similar to the S-inhibition heads in the IOI circuit, which suppress the subject name to allow the indirect object to fill the slot. Here, L7H3/L9H5 may be suppressing a default gender prediction to allow the subject-role signal to dominate, with the suppression itself showing up as negative DLA on clean prompts.

5. Attention Patterns — L7H3 on Implicit vs. Explicit Prompts

On implicit-gender prompts, L7H3 consistently attends from the final token back to the grammatical subject and main verb:

“Alice gave Bob the book because” → attends to Alice and gave
“Bob helped Alice move because” → attends to Bob and helped
“When Alice told Bob the news,” → attends to Alice and told

Figure 3: L7H3 attention patterns on implicit-gender prompts. From the final token position, the head consistently attends back to the grammatical subject (Alice, Bob) and main verb (gave, helped, told) across all three prompt structures. Crucially, in the second prompt ‘Bob helped Alice move because,’ the head attends to Bob — the grammatical subject — not Alice, who appears later in the sentence. This rules out a simple ‘attend to first name’ or ‘attend to last name’ heuristic and is consistent with genuine subject-role tracking.

On explicit-gender prompts (“Alex, who is female, gave Jordan the book because”), L7H3 still attends to the subject name Alex, but the longer relative clause between subject and verb disrupts the clean subject+verb pattern seen on shorter prompts.

Figure 4: Shared-core heads L9H5 and L7H3 on explicit-gender (neutral) prompts. Both heads attend to the grammatical subject (Alex, Jordan) consistent with Figure 3, but the relative clause (‘who is female/male’) sitting between subject and verb disrupts the clean subject+verb co-attention pattern. The more diffuse attention pattern here compared to Figure 3 may partly explain why the implicit circuit does not transfer causally to explicit-gender prompts.

6. Fallback Behavior After Ablation

When the full implicit circuit is ablated, the model does not fail randomly. It frequently copies a nearby name token:

“Bob helped Alice move because” → predicts Alice (copies the most recent name)
“When Alice told Bob the news,” → predicts Alice (copies subject name)

This mirrors the IOI “copy name” fallback and suggests the pronoun resolution circuit actively suppresses a lower-level name-copying heuristic. Without it, the default behavior re-emerges.

Discussion

The implicit/explicit dissociation

The central finding is that GPT-2 Small does not appear to have a single unified pronoun resolution mechanism. The circuit identified via activation patching on implicit-gender prompts causally drives behavior on those prompts, but ablating it does not hurt explicit-gender prompts and even slightly improves them. This suggests the model has learned two separate strategies:

Name-gender heuristic (implicit): When gender-marked names are present (Alice, Bob), a sparse attention circuit reads off the name-gender association and applies it to pronoun prediction.
Explicit semantic parsing (explicit): When gender is stated directly (“who is female”), a different — and as yet unidentified — mechanism handles resolution. This mechanism is likely more distributed, potentially involving MLP layers or a broader set of attention heads below the detection threshold of single-head patching.

This is a meaningful distinction for mechanistic interpretability. It means that findings from experiments using gendered names may not generalize to gender-neutral contexts, and that the same surface behavior (correct pronoun prediction) can be implemented by different underlying computations depending on the input format.

Relation to the IOI circuit

The IOI circuit (Wang et al. 2022) handles entity co-reference — which name fills a slot. The circuit found here handles role-based pronoun assignment — which entity performed the action. They are related tasks using partially overlapping machinery:

IOI name mover heads cluster in layers 9-10; this circuit peaks at layer 7
Both circuits exhibit a fallback to name-copying when ablated
The S-inhibition mechanism in IOI has a potential analogue in L7H3/L9H5’s suppression role here

The shared fallback behavior suggests common lower-level machinery, with distinct higher-level circuits handling the two tasks.

Why does ablation improve neutral-prompt performance?

The positive Δ on neutral prompts after ablating the implicit circuit is unexpected. One interpretation: on explicit-gender prompts, the implicit-gender circuit is actually introducing noise. It tries to apply name-gender heuristics to gender-neutral names (Alex, Jordan) where those heuristics are unreliable, partially corrupting the signal from the explicit-gender mechanism, and ablating it removes that noise. If so, swapping clearly gendered names (Alice, Bob) into the explicit-marker prompts should reduce this effect.

Limitations

Dataset size: 20 prompts per condition. Effect sizes are noisy. The qualitative patterns are consistent but quantitative claims should be treated as preliminary.

Unidentified explicit-gender circuit: The mechanism handling explicit gender markers was not identified. Attention head patching found candidates but ablation did not confirm them. MLP attribution experiments are the clear next step.

DLA vs. patching dissociation unexplained: L7H3 has negative DLA yet ranks among the top causal heads under activation patching. The suppression hypothesis is plausible but not directly tested. Decomposing the head’s output into components (via path patching or OV circuit analysis) would be needed to confirm it.

Attention ≠ causation: Attention visualizations for L7H3 are suggestive but the patching and ablation results provide the actual causal evidence.

Next Steps

MLP attribution on explicit-gender prompts — apply neuron-level activation patching to identify which MLP layers carry the explicit-gender signal
Path patching — trace the full computational path from subject token through L4H11 (positive DLA) to the final logit
OV circuit analysis for L7H3 — decompose what information the head is actually writing into the residual stream to test the suppression hypothesis
Other causal connectives — test whether the circuit activates on “so”, “therefore”, “thus” or is specific to “because”
Scale to GPT-2 Medium/Large — does the circuit structure persist or reorganize?
Expand dataset to 100+ prompts — needed before quantitative claims can be made with confidence

Code

Full notebook: https://github.com/NIKSHITH-G/gpt2-pronoun-circuit

Runs on Google Colab free tier (T4 GPU). Libraries: TransformerLens, circuitsvis, plotly, einops.

References

Wang, K. et al. (2022). Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. arXiv:2211.00593
Nanda, N. et al. (2023). Progress measures for grokking via mechanistic interpretability. ICLR 2023
Conmy, A. et al. (2023). Towards Automated Circuit Discovery for Mechanistic Interpretability. NeurIPS 2023
Elhage, N. et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic