silentbob comments on Understanding LLMs: Insights from Mechanistic Interpretability

silentbob 3 Sep 2025 9:32 UTC
4 points
Great write-up, thanks Stephen! I really appreciate the effort this must have taken. I’m currently working through the ARENA curriculum, so reading your post was both a great refresher on earlier concepts and a good preview of some of the things still ahead of me.
Some minor issues I came across:
- Maybe it’s just me, but I can’t quite parse this sentence: “Duplicate token heads: are active at the second position where “John” (S2) and attend to the first position where “John” is (S1).” Is there just an “is” missing before (S2)? Even then, I’m not sure what S1 and S2 mean in this context (“Subject”, I suppose?). I don’t think the meaning of S1 and S2 (or of the S in S-inhibition) is introduced anywhere.
- I think Figure 9 has a typo: in “key v1 causing value v1”, the first v1 should probably be k1.
- “Step 8. Back to words:” has a double space (edit: and “1. Keys as pattern detectors” as well)
- Probably a nit-pick, but in Interpretability with SAEs you write “A typical vector of neuron activations such as the residual stream”—I believe the residual stream is technically not a vector of neuron activations, but rather a vector influenced by neuron activations (among other things, like the embedding matrix), right?
Thanks again!
- Stephen McAleese 5 Sep 2025 20:54 UTC
  4 points
  Thank you for taking the time to comment and for pointing out some errors in the post! Your attention to detail is impressive. I updated the post to reflect your feedback:
  - I removed the references to S1 and S2 in the IOI description and fixed the typos you mentioned.
  - I changed “A typical vector of neuron activations such as the residual stream...” to “A typical activation vector such as the residual stream...”
  Good luck with the rest of the ARENA curriculum! Let me know if you come across anything else.