Investigating echo tasks in Qwen2.5-1.5B Instruct, part 1
Code: https://github.com/mild-rgb/qwen-2.5-1.5b-echo_repeat-investigation
Some examples of how I used AI while writing this:
https://claude.ai/share/b4b8ef8e-04ce-4d82-ad53-ab60d018198a
https://claude.ai/share/9f2ec1fd-024c-422e-b145-d4f672c0baa7
TLDR/Introduction
I recently completed ARENA 1.2 and wanted to apply what I’d learned. I noticed that on different phrasings of the same echo task, models produce the same response. For example, ‘repeat the following word: cat’ and ‘echo the following word: cat’ both cause the model to print the word ‘cat’. This suggests the two phrasings share circuitry. I decided to investigate this in Qwen2.5-1.5B Instruct, using a dataset of 44 words.
My most interesting discovery is that ablation results and direct logit attribution (DLA) results disagree strongly on this task. This implies that much of the processing for echo tasks happens in the MLP layers. I don’t believe it happens in the later attention blocks because:
1) the model rarely changes its mind about which token is correct after it has emerged; passing through further attention blocks doesn’t cause significant change
2) when it does change, it’s removing a leading space (e.g. ‘ cat’ vs ‘cat’) or decapitalising (‘Dog’ vs ‘dog’)
Ablation experiments show that head 2 at layer 22 is very significant. See below for graphics that illustrate this.
Tokens emerge quite sharply, normally at layer 22, and often with a Chinese intermediate form. For example, with the word ‘cat’ and both prompt forms, a logit lens resolves the Chinese for ‘please input here’ at layer 21. This suggests that the model knows it’s meant to output a word but isn’t yet sure which.
On all emergence layers, head 2 attends very strongly to the BOS token. I investigated this and found that across different prompts, the BOS token’s activations had cosine similarity of 1 at the same layer. I initially theorised that the BOS token was being used as a carrier signal, but causal masking rules this out: because the BOS token is at the start of the sequence, it cannot read information from any token after it, so its activations at a given layer are always identical across prompts.
I therefore think the BOS token is being used as an attention sink, as described by Xiao et al. (2023). The ablation losses could then be explained by the model having learned to rely on a constant term downstream; removing that constant forces the softmax to redistribute probability mass in a way the model wasn’t trained for. It might also be worth investigating where the other ~10% of head 2’s attention goes, and using mean ablation instead of zero ablation.
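As a starting point for that follow-up, here is a minimal sketch of mean-ablating head 2, assuming TransformerLens loads this checkpoint; the layer/head indices mirror the ones above, and the single-prompt mean is a stand-in for a proper dataset mean:

```python
from transformer_lens import HookedTransformer, utils

# Assumes a recent TransformerLens with Qwen2.5 support.
model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
LAYER, HEAD = 22, 2

tokens = model.to_tokens("repeat the following word: cat")
_, cache = model.run_with_cache(tokens)

# Mean of head 2's output over sequence positions, used as the ablation value.
# (A better mean would come from many prompts; this is a single-prompt stand-in.)
z_name = utils.get_act_name("z", LAYER)
head_mean = cache[z_name][0, :, HEAD].mean(dim=0)

def mean_ablate_head(z, hook):
    z[:, :, HEAD] = head_mean  # replace head 2's output with its mean, not zero
    return z

ablated_logits = model.run_with_hooks(tokens, fwd_hooks=[(z_name, mean_ablate_head)])
clean_logits = model(tokens)
cat_id = model.to_single_token(" cat")
print("Delta logit on ' cat':", (ablated_logits - clean_logits)[0, -1, cat_id].item())
```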
Per-notebook overview
Notebook 1
I started by using just ‘cat’ with both prompt forms and confirmed that both produced identical outputs. I then computed the cosine similarity of the last token’s residual stream between the two prompts across all layers. I expected it to start and end at 1, and stay constant across the middle of the model. Instead it started at 1 and fell steadily to 0.87 by the end of the model. This is worth investigating further, as it suggests the two task framings are gradually written into the residual stream as the token passes through the model.
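A minimal sketch of that per-layer comparison, assuming TransformerLens:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
prompts = ["repeat the following word: cat", "echo the following word: cat"]

caches = []
for p in prompts:
    _, cache = model.run_with_cache(model.to_tokens(p))
    caches.append(cache)

for layer in range(model.cfg.n_layers):
    # residual stream at the final token position, after each block
    a = caches[0]["resid_post", layer][0, -1]
    b = caches[1]["resid_post", layer][0, -1]
    sim = torch.cosine_similarity(a, b, dim=0).item()
    print(f"layer {layer:2d}: cos sim = {sim:.3f}")
```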
I then used a logit lens to find the layer at which the correct token emerges. The tokens emerged at layer 22, and in one case after a Chinese token.
I then expanded the logit lens to the top 10 tokens at each layer, and found that the 10th token always has essentially zero probability after softmax. In light of the falling cosine similarity over layers, this data could be a source for further investigation.
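For reference, a sketch of the logit-lens pass described in the last two paragraphs; the layer indexing and top-k choice are mine and may not match the notebook exactly:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokens = model.to_tokens("repeat the following word: cat")
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    # unembed the residual stream at the final position after this block
    resid = cache["resid_post", layer][0, -1]
    logits = model.unembed(model.ln_final(resid[None, None]))[0, 0]
    top = logits.softmax(dim=-1).topk(10)
    toks = [model.to_string(i.item()) for i in top.indices]
    print(layer, list(zip(toks, [f"{p:.3f}" for p in top.values.tolist()])))
```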
I then did DLA against the ‘cat’ token at layer 22. The per-head logit contributions were quite low, with the highest being +0.065 and the lowest −0.059. However, heads 4 and 3 were both in the top 3 heads for both prompts.
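A rough sketch of per-head DLA at layer 22, assuming TransformerLens; note that projecting directly onto the unembedding column ignores the final layer-norm scaling, so the numbers are approximate:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model.set_use_attn_result(True)   # make per-head outputs available in the cache

tokens = model.to_tokens("repeat the following word: cat")
_, cache = model.run_with_cache(tokens)

LAYER = 22
head_out = cache["result", LAYER][0, -1]           # [n_heads, d_model], final position
cat_dir = model.W_U[:, model.to_single_token(" cat")]
dla = head_out @ cat_dir                           # each head's direct contribution
for h, v in enumerate(dla.tolist()):
    print(f"L{LAYER}H{h}: {v:+.4f}")
```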
I then zero-ablated each individual head on layer 22 and measured the impact on the final logits. One weakness in my analysis was that I only measured the logit change on the ‘cat’ token rather than across the whole distribution; it’s possible that an ablation raised another token’s score above that of ‘cat’ and I didn’t gather data on that.
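A sketch of the zero-ablation sweep, assuming TransformerLens; variable names are mine:

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokens = model.to_tokens("repeat the following word: cat")
cat_id = model.to_single_token(" cat")
clean = model(tokens)[0, -1, cat_id].item()

LAYER = 22
z_name = utils.get_act_name("z", LAYER)
for head in range(model.cfg.n_heads):
    def zero_head(z, hook, head=head):
        z[:, :, head] = 0.0        # zero this head's output at every position
        return z
    ablated = model.run_with_hooks(tokens, fwd_hooks=[(z_name, zero_head)])
    delta = ablated[0, -1, cat_id].item() - clean
    print(f"head {head}: delta ' cat' logit = {delta:+.4f}")
```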
From this ablation, I found that heads 0, 2, and 7 were the top 3 for both prompts, with head 2 the strongest. Outside the top 3, the ordering appeared random across prompts. I believe this shows that heads 0, 2, and 7 explain the shared behaviour.
I then extracted the attention patterns of heads 0, 2, and 7 from the pattern cache for both prompts, and found that in both cases head 2 was directing ~90% of its attention to the BOS token.
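A sketch of reading that attention share from the pattern cache, assuming TransformerLens and that position 0 holds the BOS token:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokens = model.to_tokens("repeat the following word: cat")
_, cache = model.run_with_cache(tokens)

LAYER, HEAD = 22, 2
pattern = cache["pattern", LAYER][0, HEAD]   # [query_pos, key_pos]
bos_share = pattern[-1, 0].item()            # final query token -> position 0
print(f"L{LAYER}H{HEAD} attention to BOS from the last token: {bos_share:.1%}")
```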
I then repeated this whole pipeline with the word ‘dog’ instead and got broadly identical results.
Notebook 2
I added 42 further common English words (44 total, counting ‘cat’ and ‘dog’) and ran the same analysis as in Notebook 1 on them. I found:
1) tokens don’t always emerge at layer 22; they can be earlier or later. With the echo prompt, emergence at layer 23 is slightly more common than layer 22 (15 words at layer 22 vs 17 at layer 23); see the sketch after this list for one way to tally this
2) no matter which layer tokens emerge at, head 2 is always in the top 3 heads
3) the mismatch between DLA and ablation results persists at the larger sample size
4) on all emergence layers, head 2 always focuses on the BOS token (note to reader: I made a note to myself to confirm this, forgot to do it before pushing to GitHub, and only noticed I’d forgotten while doing this writeup. My intuition was correct and it does focus on the BOS token)
5) all BOS tokens have cosine similarity of 1 while passing through layer 22 (this is trivially true; see the introduction for details)
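The emergence-layer tally in finding 1 could be computed along these lines (a sketch with stand-in words; I define ‘emergence’ as the first layer where the target is the top logit-lens token, which may differ from the notebook’s definition):

```python
from collections import Counter
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
words = ["cat", "dog", "house", "tree"]   # stand-in for the 44-word dataset

def emergence_layer(prompt, target_id):
    _, cache = model.run_with_cache(model.to_tokens(prompt))
    for layer in range(model.cfg.n_layers):
        resid = cache["resid_post", layer][0, -1]
        logits = model.unembed(model.ln_final(resid[None, None]))[0, 0]
        if logits.argmax().item() == target_id:
            return layer
    return None  # never became the top token under the logit lens

tally = Counter()
for w in words:
    tid = model.to_single_token(" " + w)  # assumes ' word' is a single token
    tally[emergence_layer(f"echo the following word: {w}", tid)] += 1
print(tally)
```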
I hypothesised that the BOS token was being used to carry some kind of critical information. This would either be an instruction to print or the identity of the word to print.
I wrote this notebook by giving Notebook 1 to Claude and asking it to extract the pipeline and apply it to new words. I then commented it by hand. I found this approach was too detached from the code and will stick with my normal approach of using Claude like Stack Exchange on steroids.
Notebook 3
In this notebook, I investigated my hypothesis from Notebook 2. I passed the BOS token’s residual through the relevant MLP layer and then performed DLA on it. It produced what I call ‘activation slop’: very high cosine similarities with random fragments of foreign and English words. These correlations are coincidental, as the BOS token can only attend to itself. The cosine similarity was 1 when words emerged on the same layer and 0 when they emerged on different layers, which is explained by the BOS token being transformed as it passes through successive MLPs.
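A sketch of this probe, assuming TransformerLens; the exact comparison the notebook used may differ, so the cosine-against-unembedding step is my reconstruction:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokens = model.to_tokens("repeat the following word: cat")
_, cache = model.run_with_cache(tokens)

LAYER = 22
bos_resid = cache["resid_mid", LAYER][0, 0]    # BOS position, just before the MLP
block = model.blocks[LAYER]
mlp_out = block.mlp(block.ln2(bos_resid[None, None]))[0, 0]

# cosine similarity against every unembedding direction
sims = torch.cosine_similarity(mlp_out[None, :], model.W_U.T, dim=-1)
top = sims.topk(10)
print([model.to_string(i.item()) for i in top.indices])  # the 'activation slop'
```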
Conclusion
DLA and ablation results are very different in Qwen2.5-1.5B Instruct when investigating echo tasks, which suggests that much of the processing happens through indirect paths. The logit lens shows the correct token usually emerging at layer 22; when the task is framed as ‘echo this word:’ instead of ‘repeat this word:’, it emerges slightly later. Head attribution shows that head 2 is important regardless of the layer the token emerges at, and attention pattern analysis shows head 2 focusing strongly on the BOS token. This is likely an attention sink and should be investigated further.
Stuff I’m going to do next
Continue with ARENA. I haven’t studied sparse autoencoders at all and they’re a very common mechanistic interpretability tool that could help.
Investigate the diverging cosine similarity results from Notebook 1 and the BOS attention sink hypothesis.
When I was doing my ablation experiments, I only recorded how each head changed the logits assigned to the target token and not any other token. Investigating this properly might show which heads are responsible for what. For a really simplistic example, if ablating head 7 causes the model to produce ‘:’ instead of ‘cat’, I have an argument that head 7 carries the cat information but no ‘repeat/print’ information.
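A sketch of that check, recording the top token under each ablation rather than just the target logit:

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokens = model.to_tokens("repeat the following word: cat")

LAYER = 22
z_name = utils.get_act_name("z", LAYER)
for head in range(model.cfg.n_heads):
    def zero_head(z, hook, head=head):
        z[:, :, head] = 0.0
        return z
    logits = model.run_with_hooks(tokens, fwd_hooks=[(z_name, zero_head)])
    # which token actually wins after this ablation?
    top_tok = model.to_string(logits[0, -1].argmax().item())
    print(f"head {head}: top token after ablation = {top_tok!r}")
```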
Read more of the existing mechanistic interpretability literature. In-Context Learning Creates Task Vectors by Hendel, Geva, and Globerson looks interesting.