Exploration of Counterfactual Importance and Attention Heads

Introduction

Some reasoning steps in a large language model's chain of thought, such as those involving generating plans or managing uncertainty, have disproportionately more influence on final answer correctness and downstream reasoning than others. Recent work (Macar & Bogdan) has discovered that models contain attention heads which narrow their attention toward these important sentences.

Counterfactual importance is a method that uses resampling to determine how important a sentence is for generating the correct answer. This post extends Thought Anchors and tests whether ablating the receiver heads changes the counterfactual importance of plan generation or uncertainty management sentences. [1]

Methodology

Using the original full-text chain of thought for problem 6481 from the Thought Anchors math-rollouts dataset, I created new rollouts with DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill Qwen 14B.

I generated 12 rollouts per condition and measured counterfactual importance under no ablation, receiver-head ablation, and random ablation. The code used to generate and analyze the rollouts is linked.
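
As a rough illustration of how the ablation conditions can be implemented: the sketch below zeroes out each chosen attention head's output slice with a forward pre-hook on the attention output projection, assuming a Llama-style architecture. The model name and head indices here are placeholders, not the exact heads ablated in the experiment.

```python
import torch
from transformers import AutoModelForCausalLM

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
HEADS_TO_ABLATE = [(12, 3), (20, 7)]  # hypothetical (layer, head) pairs

model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
head_dim = model.config.hidden_size // model.config.num_attention_heads

def make_ablation_hook(heads):
    # Zero each ablated head's slice of the attention output right before
    # o_proj mixes the heads back together.
    def hook(module, args):
        hidden = args[0].clone()
        for h in heads:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,) + args[1:]
    return hook

for layer, head in HEADS_TO_ABLATE:
    attn = model.model.layers[layer].self_attn
    attn.o_proj.register_forward_pre_hook(make_ablation_hook([head]))
```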

The code used to find the receiver heads is also linked; it takes the math-rollouts full_cots, collects each attention head's activations across the questions, and calculates the kurtosis of each head's attention across those questions.
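
A minimal sketch of that kurtosis calculation, assuming each full_cot has already been split into sentences with known token spans (sentence_spans here is a hypothetical structure, not a name from the original code):

```python
import numpy as np
import torch
from scipy.stats import kurtosis

def receiver_head_scores(model, tokenizer, full_cots, sentence_spans):
    """Average, over questions, the kurtosis of each head's attention across
    sentences. High-kurtosis heads concentrate attention on a few sentences,
    which is the receiver-head signature."""
    n_layers = model.config.num_hidden_layers
    n_heads = model.config.num_attention_heads
    scores = np.zeros((n_layers, n_heads))
    for cot, spans in zip(full_cots, sentence_spans):
        inputs = tokenizer(cot, return_tensors="pt")
        with torch.no_grad():
            # output_attentions=True may require attn_implementation="eager"
            out = model(**inputs, output_attentions=True)
        for layer, attn in enumerate(out.attentions):  # (1, heads, seq, seq)
            received = attn[0].mean(dim=1)  # attention each token receives
            per_sentence = torch.stack(
                [received[:, s:e].mean(dim=1) for s, e in spans], dim=1
            )  # (heads, n_sentences)
            scores[layer] += kurtosis(per_sentence.float().numpy(), axis=1)
    return scores / len(full_cots)  # take the top-20 (layer, head) pairs
```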

The counterfactual importance calculations in Thought Anchors use the cosine similarity of sentence-model embeddings. I also explored counterfactual importance using cosine-similarity values computed from the tokenizer embeddings of DeepSeek R1 Distill Llama 8B and DeepSeek R1 Distill Qwen 14B respectively.
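
A sketch of the two embedding variants, where I read "embeddings from the tokenizer" as the model's input-embedding rows for each token id, mean-pooled per sentence; the sentence-model name below is an assumption:

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

NAME = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
sent_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence model
tok = AutoTokenizer.from_pretrained(NAME)
embed = AutoModelForCausalLM.from_pretrained(NAME).get_input_embeddings().weight

def cos_sim_sent(a: str, b: str) -> float:
    # "Sent" variant: sentence-model embeddings
    ea, eb = sent_model.encode([a, b], convert_to_tensor=True)
    return F.cosine_similarity(ea, eb, dim=0).item()

def cos_sim_llm(a: str, b: str) -> float:
    # "LLM" variant: mean-pooled input-embedding rows for the token ids
    ea = embed[tok(a, return_tensors="pt").input_ids[0]].mean(dim=0)
    eb = embed[tok(b, return_tensors="pt").input_ids[0]].mean(dim=0)
    return F.cosine_similarity(ea, eb, dim=0).item()
```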

Counterfactual importance is calculated by re-running rollouts multiple times without a specific sentence, sorting the resampled sentences into similar and not-similar groups (by cosine similarity to the original sentence), and calculating the KL divergence between the accuracies of the similar and not-similar rollouts. This measures how much having an embedding similar to the original sentence affects the accuracy of the answer.
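
A minimal sketch of this calculation, treating answer correctness as a Bernoulli variable and taking the KL divergence between the accuracies of the two groups:

```python
import numpy as np

SIM_THRESHOLD = 0.8  # similarity threshold used for the "Sent" variant

def bernoulli_kl(p, q, eps=1e-6):
    # KL divergence between Bernoulli(p) and Bernoulli(q) accuracy distributions
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def counterfactual_importance(rollouts):
    """rollouts: list of (similarity_to_original_sentence, answer_correct)
    pairs from re-running the CoT with the target sentence resampled."""
    similar = [c for s, c in rollouts if s >= SIM_THRESHOLD]
    dissimilar = [c for s, c in rollouts if s < SIM_THRESHOLD]
    if not similar or not dissimilar:
        return 0.0  # undefined without both groups; report 0 in that case
    return bernoulli_kl(np.mean(similar), np.mean(dissimilar))
```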

Key Findings

Dropdown Sentences: a behavior that appears after receiver-head or random ablation, in which a contiguous group of sentences has counterfactual importance close to 0 and the range of that group widens after ablation.
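
Dropdown Sentences can be detected mechanically as contiguous runs of near-zero but nonzero counterfactual importance; the tolerance in the sketch below is an assumption rather than a value from the analysis:

```python
def near_zero_runs(importances, tol=1e-3):
    """Return (start, end) index ranges of contiguous sentences whose
    counterfactual importance is within tol of 0 (but not exactly 0,
    which would indicate the value was never computed)."""
    runs, start = [], None
    for i, v in enumerate(importances):
        if 0 < abs(v) < tol:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(importances) - 1))
    return runs

# Dropdown Sentences are the runs that widen after ablation:
# compare near_zero_runs(no_ablation_ci) with near_zero_runs(ablated_ci)
```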

Ablating Receiver Heads

  • Found that the Dropdown Sentences were mainly composed of, in order of frequency, fact_retrieval, uncertainty_management, and active_computation sentences. When ablating receiver heads, there were no plan_generation sentences among the Dropdown Sentences.

  • There appears to be a group of Dropdown Sentences between sentences 25-50, and this group occurs regardless of whether the ablation is random or targets the receiver heads.

  • These sentences have a counterfactual importance close to 0 but not exactly 0, which shows the effect is not simply due to the counterfactual importance not being calculated.

  • Note: when the same range of Dropdown Sentences appeared in multiple conditions, I did not repeat the observations on the sentence categories, as both conditions share the same original full_cot for running rollouts.

  • Note: a negative value in the Difference graph means an increase in counterfactual importance after ablation.

  • The Dropdown Sentences mainly contain fact_retrieval, uncertainty_management, and active_computation sentences under both receiver-head and random ablation.

  • The Dropdown Sentences never included plan generation sentences, suggesting a potential connection between the two.

  • Dropdown Sentences suggest there might be something special about plan generation sentences that keeps them from appearing among sentences with counterfactual importance close to 0 after ablation. They also indicate that ablation can change the distribution of counterfactual importance across sentences.

[Figure: per-sentence counterfactual importance (Sent / LLM tabs): No Ablation, Ablating Receiver Heads, and Difference (Non-Ablated minus Ablated).]

Ablating Random Heads

[Figure: per-sentence counterfactual importance (Sent): No Ablation, Random Ablation, and Difference (Non-Ablated minus Random-Ablated).]

Exploring the Effects of Ablation on Counterfactual Importance

Sent: refers to the sentence-model embeddings, which are used to compute cos_similarity and, with a similarity threshold of 0.8, to calculate counterfactual importance.

LLM: cos_similarity computed using embeddings taken directly from the model's tokenizer instead of from the sentence model used in the Thought Anchors paper.[2]

Graphs: Counterfactual Importance by Sentence Category (Box Plot)

  • The following box plots show the distribution of counterfactual importance across sentence categories.

  • After ablating the 20 receiver heads, plan generation became more counterfactually important, while the counterfactual importance of active computation decreased.

[Figure: box plots by sentence category (Sent / LLM tabs): No Ablation vs. Ablating Receiver Heads.]

Random Ablation on Counterfactual Importance (Box Plot)

  • I created a random list of 20 heads and measured the counterfactual importance after ablating them (see the sketch after this list).

  • In sharp contrast to the receiver-head ablation graphs above, random ablation lowered the counterfactual importance of every sentence category except final answer emission.

  • The counterfactual importance of final answer emission greatly increased after random ablation. This suggests that, under random ablation, the rest of the full CoT is not as impactful in generating the final answer.

  • These results suggest the receiver heads could be special: ablating them does not drive the counterfactual importance of all sentence categories toward 0, and instead shifts the distribution from active computation to plan generation.
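
Selecting the random baseline amounts to sampling 20 (layer, head) pairs; a sketch, assuming the 32-layer by 32-head layout of DeepSeek R1 Distill Llama 8B:

```python
import random

N_LAYERS, N_HEADS = 32, 32  # DeepSeek R1 Distill Llama 8B layout
all_heads = [(l, h) for l in range(N_LAYERS) for h in range(N_HEADS)]

random.seed(0)  # fixed seed so the ablated set is reproducible
random_heads = random.sample(all_heads, 20)  # 20 of 1024 heads, about 2%
```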

[Figure: box plots by sentence category (Sent / LLM tabs) under Random Ablation.]

Differences in Counterfactual Importance: Greatest Increases and Decreases

  • An investigation of the sentences with the largest changes in counterfactual importance (a sketch of the selection follows this list).

  • Greatest-increase sentences are the 20 sentences with the largest increases in counterfactual importance after ablation.

  • Greatest-decrease sentences are the 20 sentences with the largest decreases in counterfactual importance after ablation.

  • Active computation sentences are about 5% more common among the greatest-decrease sentences than among the greatest-increase sentences.

  • The sentences with the greatest increases and decreases line up with the category-level results, where counterfactual importance increased for plan generation and decreased for active computation: active computation was more common among the greatest-decrease sentences, while plan generation was entirely absent from them.[3]
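
Selecting the greatest-increase and greatest-decrease sentences amounts to sorting the per-sentence differences; a minimal sketch:

```python
import numpy as np

def top_changes(ci_before, ci_after, k=20):
    """Return the indices of the k sentences with the largest increases and
    the k with the largest decreases in counterfactual importance."""
    diff = np.asarray(ci_after) - np.asarray(ci_before)
    order = np.argsort(diff)
    return order[-k:][::-1], order[:k]  # (greatest increases, greatest decreases)
```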

[Figures: sentence categories of the greatest increases and greatest decreases in counterfactual importance (Sent / LLM tabs): No Ablation vs. Receiver Head Ablation, and Random Ablation.]

Plan Generation Sentences with the Highest Increases in Counterfactual Importance after Receiver Head Ablation

  • The plan generation sentences with the highest increases in counterfactual importance resembled uncertainty management sentences, containing words like "wait" or "let me check".

  • Because these highest-increase plan generation sentences read like uncertainty management sentences, the results suggest a possible error caused by using an LLM to label the full_cot chunks in Thought Anchors.

    • However, since plan generation was not found among the largest decreases in counterfactual importance, the results could instead suggest a unique intersection between plan generation and uncertainty management sentences.

    • This hypothetical category of uncertainty-management/plan-generation hybrid sentences currently seems only to increase in counterfactual importance.

[Figure: highest-increase plan generation sentences (Sent / LLM tabs).]
  • I made attribution graphs of the (Sent) plan generation sentences with the greatest increase in counterfactual importance after ablation.

Implications and Future Work

These findings suggest several interesting directions for future research:

  1. Is the increase in counterfactual importance for plan generation related to the decrease for active computation? Could counterfactual importance be shifting from active computation to plan generation?

  2. What is special about sentences 40-50 of problem 6481, which consistently show low counterfactual importance after either random or receiver-head ablation?

  3. Why does plan generation appear among the sentences with the highest increases in counterfactual importance but not among those with the greatest decreases?

  4. Could the number of heads ablated affect the results by itself, given that only about 2% of the heads were ablated?

Attribution graphs for the models used in Thought Anchors are not possible, to my knowledge, as I could not find transcoders for the specific DeepSeek-distilled models.

I am aware that distillation could be considered a type of fine-tuning, so it might be possible to use transcoders for Llama 8B or Qwen 14B. However, the available transcoder for Qwen 14B currently targets Qwen 3, which is different from DeepSeek R1 Distill Qwen 14B (based on Qwen 2.5). Finding the compute to either distill DeepSeek R1 into Qwen 3 14B or to train new transcoders for Llama 8B was out of scope for this project.

Ablating the receiver heads of Qwen 3 14B had no significant effect. The following is my attribution-graph code.

Acknowledgments

Thanks to Uzay Macar and Abdur Raheem Ali for helpful feedback on an earlier version of this draft.

Glossary

Glossary (Words from Thought Anchors):

Kurtosis - the degree of "tailedness" of a distribution.

Receiver Heads - attention heads, found in Thought Anchors, that narrow attention toward specific sentences.

Counterfactual Importance - a black-box resampling method used in Thought Anchors to determine sentence importance.

Glossary (Original Terminology Introduced in this experiment):

Sent: refers to the sentence-model embeddings, which are used to compute cos_similarity and, with a similarity threshold of 0.8, to calculate counterfactual importance.

LLM: cos_similarity computed using embeddings taken directly from the model's tokenizer instead of from the sentence model used in the Thought Anchors paper.

Dropdown Sentences: a behavior that appears after receiver-head or random ablation, in which a contiguous group of sentences has counterfactual importance close to 0 and the range of that group widens after ablation.

ex) The top graphs show full counterfactual importance; the bottom graphs are a zoomed-in view of sentences 100-125, where, after ablation, there is a contiguous run of near-0 counterfactual-importance values that grows to span sentences 105-115.


  1. ^

    Ablating the receiver heads led to a doubling of the counterfactual importance of plan generation and a decrease in the counterfactual importance of active computation. This appears to be behavior unique to ablating receiver heads, as random ablation drives the counterfactual importance of all sentence categories toward 0 except final answer emission, which shows an increase.

  2. ^

    Exploring changes in counterfactual importance after ablation:

    The counterfactual importance across all sentences decreases after ablation. (github)

    Ablation in general caused the cosine similarity of the problem setup category to go from consistently close to 1 to varying significantly more.

    When using embedding values taken directly from the tokenizer, the cosine similarity of the final answer emission category shows more variance. (github)

  3. ^

    Absolute Difference of Counterfactual Importance

    All sentence categories except final answer emission were present among both the highest and lowest absolute differences in counterfactual importance.

    Fact retrieval is the most common category in both groups, for both the greatest and smallest changes in absolute difference.