+1 that I’m still fairly confused about in-context learning; induction heads seem like a big part of the story, but we’re still confused about those too!
AtP*: An efficient and scalable method for localizing LLM behaviour to components
Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
This is not a LessWrong dynamic I’ve particularly noticed, and it seems inaccurate to me to describe it as invisible helicopter blades
We’ve found slightly worse results for MLPs, but nowhere near 40%; I expect you’re training your SAEs badly. What exact metric equals 40% here?
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To
Thanks for the post, I found it moving. You might want to add a timestamp at the top saying “written in Nov 2023” or something; otherwise the OpenAI board stuff is jarring
Thanks! This inspired me to buy multiple things that I’ve been vaguely annoyed to lack
Thanks for writing this up, I found it useful to have some of the maths spelled out! In particular, I think that the equation constraining l, the number of simultaneously active features, is likely crucial for constraining the number of features in superposition
The art is great! How was it made?
In my opinion the pun is worth it
This seems like a useful resource, thanks for making it! I think it would be more useful if you enumerated the different ARENA notebooks; my guess is many readers won’t click through to the link, and are more likely to if they see the different names. And IMO the ARENA tutorials are much higher production quality than the other notebooks on that list
We dig into this in post 3. The layers compose importantly with each other and don’t seem to be doing the same thing in parallel; path patching the internal connections will break things, so I don’t think it’s like what you’re describing
Very cool work! I’m happy to see that the “my vs their colour” result generalises
Thanks for doing this, I’m excited about Neuronpedia focusing on SAE features! I expect this to go much better than neuron interpretability
Attention SAEs Scale to GPT-2 Small
The illusion is most concerning when learning arbitrary directions in space, not when iterating over individual neurons or SAE features. I don’t have strong takes on whether the illusion is more likely with neurons than with SAEs if you’re e.g. iterating over sparse subsets; in some sense it’s more likely that you get a dormant feature and a disconnected feature in your SAE than among neurons, since SAE features are more meaningful?
Interesting post, thanks for writing it!
I think that the QK section somewhat under-emphasises the importance of the softmax. My intuition is that models rarely care about as precise a task as counting the number of pairs of matching query-key features at each pair of token positions, and that instead softmax is more of an “argmax-like” function that finds a handful of important token positions (though I have not empirically tested this, and would love to be proven wrong!). This enables much cheaper and more efficient solutions, since you just need the correct answer to be the argmax-ish.
For example, ignoring floating point precision, you can implement a duplicate token head with $d_\text{head} = 2$ and arbitrarily high $n_\text{vocab}$. If there are $n_\text{vocab}$ vocab elements, map the $i$-th query and key to the point $\frac{i}{n_\text{vocab}}$ of the way round the unit circle. The dot product is maximised when they are equal.
If you further want the head to look at a resting position unless the duplicate token is there, you can increase $d_\text{head}$ to 3, and have a dedicated BOS dimension with a score of $1 - \epsilon$, so you only get a higher score for a perfect match. And then make the softmax temperature super low so it’s an argmax.
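A minimal numpy sketch of the construction (my own illustrative names and values, e.g. `n_vocab = 1000` and `temperature = 1e-3`, not from the original comment): embed each token as a point on the unit circle, so the query–key dot product is exactly 1 only for a matching token, and a very low-temperature softmax then behaves like an argmax over the matches.

```python
import numpy as np

n_vocab = 1000        # illustrative vocab size
temperature = 1e-3    # very low temperature makes the softmax argmax-like

def embed(token_ids):
    """Map the i-th token to the point i/n_vocab of the way round the unit circle (d_head = 2)."""
    angles = 2 * np.pi * np.asarray(token_ids) / n_vocab
    return np.stack([np.cos(angles), np.sin(angles)], axis=-1)

# Context tokens; token 503 appears twice, so it is a "duplicate token".
context = [17, 503, 42, 999, 503]
query_token = 503

q = embed([query_token])[0]   # shape (2,)
K = embed(context)            # shape (seq, 2)

# Dot product is 1 exactly when query and key tokens match, strictly less otherwise.
scores = K @ q

# Numerically stable low-temperature softmax ~ argmax over the matching positions.
attn = np.exp((scores - scores.max()) / temperature)
attn /= attn.sum()
print(np.round(attn, 3))   # attention splits over the two occurrences of token 503
```

Adding the resting-position variant would just mean appending a third dimension whose score is fixed at $1 - \epsilon$, so it wins whenever there is no exact match.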
These models were not trained with dropout. Nice idea though!
The bolded part seems false? This maps 0.2 original act → 0.2 new act, while adding 0.1 to the encoder bias maps 0.2 original act → 0.1 new act. I.e., changing the encoder bias changes the value of all activations, while thresholding only affects small ones
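To make the arithmetic concrete, here’s a toy numpy sketch (my own numbers and sign convention, not from the post: I’m assuming “adding 0.1 to the encoder bias” has the net effect of subtracting 0.1 from each pre-activation before the ReLU, which matches the 0.2 → 0.1 arithmetic above).

```python
import numpy as np

acts = np.array([0.05, 0.2, 1.0])   # toy SAE activations (illustrative)

def threshold(acts, theta=0.1):
    """Thresholding: zero out small activations, leave larger ones untouched."""
    return np.where(acts > theta, acts, 0.0)

def shift_bias(acts, delta=0.1):
    """Bias shift: every surviving activation is reduced by delta."""
    return np.maximum(acts - delta, 0.0)

print(threshold(acts))   # [0.   0.2  1. ]  -> 0.2 stays 0.2
print(shift_bias(acts))  # [0.   0.1  0.9]  -> 0.2 becomes 0.1, 1.0 becomes 0.9
```

The difference in the second entry (and in every large activation) is the whole point: thresholding only touches the small activations, the bias shift touches all of them.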