Stefan Heimersheim. Research Scientist at Apollo Research, Mechanistic Interpretability.
StefanHex
Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs
I would like the following subscription: All posts with certain tags, e.g. all [AI] posts or all [Interpretability (ML & AI)] posts.
I just noticed (and enabled) a “subscribe” feature in the page for the tag, it says “Get notifications when posts are added to this tag.” — I’m unsure if those are emails, but assuming they are, my problem is solved. I never noticed this option before.
And here’s the code to do it with replacing the LayerNorms with identities completely:
import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer

model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")

# Undo my hacky LayerNorm removal
for block in model.transformer.h:
    block.ln_1.weight.data = block.ln_1.weight.data / 1e6
    block.ln_1.eps = 1e-5
    block.ln_2.weight.data = block.ln_2.weight.data / 1e6
    block.ln_2.eps = 1e-5
model.transformer.ln_f.weight.data = model.transformer.ln_f.weight.data / 1e6
model.transformer.ln_f.eps = 1e-5

# Properly replace LayerNorms by Identities
class HookedTransformerNoLN(HookedTransformer):
    def removeLN(self):
        for i in range(len(self.blocks)):
            self.blocks[i].ln1 = torch.nn.Identity()
            self.blocks[i].ln2 = torch.nn.Identity()
        self.ln_final = torch.nn.Identity()

hooked_model = HookedTransformerNoLN.from_pretrained("gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu")
hooked_model.removeLN()

prompt = torch.tensor([1,2,3,4], device="cpu")
logits = hooked_model(prompt)
print(logits.shape)
print(logits[0, 0, :10])
Here’s a quick snippet to load the model into TransformerLens!
import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer

model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")
hooked_model = HookedTransformer.from_pretrained("gpt2", hf_model=model, fold_ln=False, center_unembed=False).to("cpu")

# Kill the LayerNorms because TransformerLens overwrites eps
for block in hooked_model.blocks:
    block.ln1.eps = 1e12
    block.ln2.eps = 1e12
hooked_model.ln_final.eps = 1e12

# Make sure the outputs are the same
prompt = torch.tensor([1,2,3,4], device="cpu")
logits = hooked_model(prompt)
logits2 = model(prompt).logits
print(logits.shape, logits2.shape)
print(logits[0, 0, :10])
print(logits2[0, :10])
You can remove GPT2’s LayerNorm by fine-tuning for an hour
I really like the investigation into properties of SAE features, especially the angle of testing whether SAE features have particular properties that other (random) directions don’t have!
Random directions as a baseline: Based on my experience here I expect random directions to be a weak baseline. For example the covariance matrix of model activations (or SAE features) is very non-uniform. I’d second @Hoagy’s suggestion of linear combination of SAE features, or direction towards other model activations as I used here.
Ablation vs functional FT-LLC: I found the comparison between your LLC measure (weights before the feature), and the ablation effect (effect of this feature on the output) interesting, and I liked that you give some theories, both very interesting! Do you think @jake_mendel’s error correction theory is related to these in any way?
I like this idea! I’d love to see checks of this on the SOTA models which tend to have lots of layers (thanks @Joseph Miller for running the GPT2 experiment already!).
I notice this line of argument would also imply that the embedding information can only be accessed up to a certain layer, after which it will be washed out by the high-norm outputs of layers. (And the same for early MLP layers which are rumoured to act as extended embeddings in some models.) -- this seems unexpected.
Additionally, they would be further evidence (but not conclusive[2]) towards hypotheses Residual Networks Behave Like Ensembles of Relatively Shallow Networks.
I have the opposite expectation: Effective layer horizons enforce a lower bound on the number of modules involved in a path. Consider the shallow path
Input (layer 0) → MLP 10 → MLP 50 → Output (layer 100)
If the effective layer horizon is 25, then this path cannot work because the output of MLP10 gets lost. In fact, no path with less than 3 modules is possible because there would always be a gap > 25.
Only less-shallow paths would manage to influence the output of the model:
Input (layer 0) → MLP 10 → MLP 30 → MLP 50 → MLP 70 → MLP 90 → Output (layer 100)
This too seems counterintuitive, not sure what to make of this.
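The path argument above can be checked with a few lines of Python. This is only a sketch under the assumptions of the example (a 100-layer model, an effective layer horizon of 25, and a path described by the layer indices it visits); the helper names `path_is_viable` and `min_modules` are made up for illustration.

```python
import math

def path_is_viable(layers, horizon):
    """True if no hop between consecutive nodes on the path spans more than `horizon` layers."""
    return all(b - a <= horizon for a, b in zip(layers, layers[1:]))

def min_modules(n_layers, horizon):
    """Fewest intermediate modules needed to connect layer 0 to layer `n_layers`."""
    return math.ceil(n_layers / horizon) - 1

shallow = [0, 10, 50, 100]                # Input -> MLP 10 -> MLP 50 -> Output
deeper = [0, 10, 30, 50, 70, 90, 100]     # the less-shallow path from the comment

print(path_is_viable(shallow, 25))  # False: the hops 10 -> 50 and 50 -> 100 exceed 25
print(path_is_viable(deeper, 25))   # True: every hop is at most 20 layers
print(min_modules(100, 25))         # 3
```

Under these assumptions the lower bound on path length follows directly from dividing the depth by the horizon.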
I know he’s legitimately affiliated with that YT channel
Can I ask how you know that? The amount of “w Stephen Fry” video titles made me suspicious, and I wondered whether it’s AI generated and not Stephen-Fry-endorsed, but I haven’t done any further research.
Edit: A colleague just pointed out that other videos are up to 7 years old (and AI voice wasn’t this good then), so in that video the voice must be real
A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
Has anyone tested whether feature splitting can be explained by composite (non-atomic) features?
Feature splitting is the observation that SAEs with larger dictionary size find features that are geometrically (cosine similarity) and semantically (activating dataset examples) similar. In particular, a larger SAE might find multiple features that are all similar to each other, and to a single feature found in a smaller SAE.
Anthropic gives the example of the feature ” ‘the’ in mathematical prose” which splits into features ” ‘the’ in mathematics, especially topology and abstract algebra” and ” ‘the’ in mathematics, especially complex analysis” (and others).
There are at least two hypotheses for what is going on.
1. The “true features” are the maximally split features; the model packs multiple true features into superposition close to each other. Smaller SAEs approximate multiple true features as one due to limited dictionary size.
2. The “true features” are atomic features, and split features are composite features made up of multiple atomic features. Feature splitting is an artefact of training the model for sparsity, and composite features could be replaced by linear combinations of a small number of other (atomic) features.
Anthropic conjectures hypothesis 1 in Towards Monosemanticity. Demian Till argues for hypothesis 2 in this post. I find Demian’s arguments compelling. The key idea is that an SAE can achieve lower loss by creating composite features for frequently co-occurring concepts: The composite feature fires instead of two (or more) atomic features, providing a higher sparsity (lower sparsity penalty) at the cost of taking up another dictionary entry (worse reconstruction).
I think the composite feature hypothesis is plausible, especially in light of Anthropic’s Feature Completeness results in Scaling Monosemanticity. They find that not all model concepts are represented in SAEs, and that rarer concepts are less likely to be represented (they find an intriguing relation between number of alive features and feature frequency required to be represented in the SAE, likely related to the frequency-rank via Zipf’s law). I find it probable that the optimiser may dedicate extra dictionary entries to composite features of high-frequency concepts at the cost of representing low-frequency concepts.
This is bad for interpretability not (only) because low-frequency concepts are omitted, but because the creation of composite features requires the original atomic features to not fire anymore in the composite case.
Imagine there is a “deception” feature, and an “exam” feature. Now deception in exams is quite common, so the model learns a composite “deception in the context of exams” feature, and the atomic “deception” and “exam” features no longer fire in that case.
Then we can no longer use the atomic “deception” SAE direction as a reliable detector of deception, because it doesn’t fire in cases where the composite feature is active!
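To make the sparsity argument concrete, here is a toy calculation. It is a sketch under strong assumptions that are not from the original comment: two orthogonal unit-norm atomic directions, unit-norm decoder directions, and an L1 penalty on the feature coefficients. All variable names are illustrative.

```python
import numpy as np

# Two hypothetical atomic directions (orthogonal, unit norm).
deception = np.array([1.0, 0.0])
exam = np.array([0.0, 1.0])
x = deception + exam  # activation when both concepts co-occur

# Dictionary A: atomic features only, so both fire with coefficient 1.
codes_atomic = np.array([1.0, 1.0])
recon_atomic = codes_atomic[0] * deception + codes_atomic[1] * exam

# Dictionary B: an extra unit-norm composite feature fires instead,
# and the atomic coefficients drop to zero.
composite = x / np.linalg.norm(x)
codes_composite = np.array([0.0, 0.0, np.linalg.norm(x)])
recon_composite = codes_composite[2] * composite

l1_atomic = np.abs(codes_atomic).sum()
l1_composite = np.abs(codes_composite).sum()

print("L1 penalty, atomic features:   ", l1_atomic)
print("L1 penalty, composite feature: ", l1_composite)
print("atomic 'deception' coefficient in composite case:", codes_composite[0])
```

Both dictionaries reconstruct `x` exactly in this toy setting, but the composite solution pays a smaller L1 penalty, and the atomic “deception” coefficient is zero, which is the detection failure described above.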
Do we have good evidence for one case or the other?
We observe that split features often have high cosine similarity, but this is explained by both hypotheses. (Anthropic says features are clustered together because they’re similar. Demian Till’s hypothesis would claim that multiple composite features contain the same atomic features, again explaining the similarity.)
A naive test may be to test whether features can be explained by a sparse linear combination of other features, though I’m not sure how easy this would be to test.
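One way the naive test could look is a greedy sparse regression of one feature direction onto the others. The sketch below uses a hand-written orthogonal matching pursuit on synthetic unit vectors; the data, the dimensions, and the function name `sparse_fit` are assumptions for illustration, not results on real SAE dictionaries.

```python
import numpy as np

def sparse_fit(target, dictionary, k):
    """Greedy orthogonal matching pursuit: approximate `target` with at most k rows of `dictionary`."""
    residual = target.copy()
    chosen = []
    coefs = np.zeros(0)
    for _ in range(k):
        scores = np.abs(dictionary @ residual)
        if chosen:
            scores[chosen] = -np.inf  # never pick the same row twice
        chosen.append(int(np.argmax(scores)))
        sub = dictionary[chosen].T  # shape (d, number of chosen rows)
        coefs, *_ = np.linalg.lstsq(sub, target, rcond=None)
        residual = target - sub @ coefs
    return chosen, coefs, residual

rng = np.random.default_rng(0)
d, n = 64, 200
others = rng.normal(size=(n, d))
others /= np.linalg.norm(others, axis=1, keepdims=True)

# Synthetic "composite" feature built from two of the other directions.
composite = others[3] + others[17]
composite /= np.linalg.norm(composite)

# Control: a random unit direction with no sparse structure.
random_dir = rng.normal(size=d)
random_dir /= np.linalg.norm(random_dir)

_, _, res_composite = sparse_fit(composite, others, k=2)
_, _, res_random = sparse_fit(random_dir, others, k=2)
print("residual norm, composite feature:", np.linalg.norm(res_composite))
print("residual norm, random direction: ", np.linalg.norm(res_random))
```

On real dictionaries the interesting quantity would be how the residual falls with `k` compared to a suitable baseline; with thousands of correlated features, some sparse fit will always exist, which is part of why the test is not straightforward.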
For reference, cosine similarity of SAE decoder directions in Joseph Bloom’s GPT2-small SAEs, blocks.1.hook_resid_pre and blocks.10.hook_resid_pre, compared to random directions and random directions with the same covariance as typical activations.
But there is still a mystery I don’t fully understand: how is it possible to find so many “noise” vectors that don’t influence the output of the network much.
In unrelated experiments I found that steering into a (uniform) random direction is much less effective than steering into a random direction sampled with the same covariance as the real activations. This suggests that there might be a lot of directions[1] that don’t influence the output of the network much. This was on GPT2 but I’d expect it to generalize to other Transformers.
[1] Though I don’t know how much space / what the dimensionality of that space is; I’m judging this by the “sensitivity curve” (how much steering is needed for a noticeable change in KL divergence).
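The two kinds of random direction compared in this comment can be sampled as sketched below. The activations here are synthetic anisotropic Gaussian data standing in for real model activations (an assumption), and the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Stand-in for real activations: synthetic data with very unequal variance per axis.
scales = np.linspace(0.1, 5.0, d)
acts = rng.normal(size=(5000, d)) * scales
cov = np.cov(acts, rowvar=False)

def isotropic_direction(rng, d):
    """Uniform random direction on the unit sphere."""
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)

def covariance_matched_direction(rng, cov):
    """Random direction drawn from a Gaussian with the activations' covariance."""
    v = rng.multivariate_normal(np.zeros(cov.shape[0]), cov)
    return v / np.linalg.norm(v)

iso = np.array([isotropic_direction(rng, d) for _ in range(500)])
matched = np.array([covariance_matched_direction(rng, cov) for _ in range(500)])

# Share of each direction's squared norm that lies on the highest-variance axis.
iso_share = np.mean(iso[:, -1] ** 2)
matched_share = np.mean(matched[:, -1] ** 2)
print("isotropic:         ", iso_share)
print("covariance-matched:", matched_share)
```

Covariance-matched directions concentrate on the high-variance axes of the activations, whereas uniform directions spread their norm evenly, including over axes the data barely uses.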
Hmm, with that we’d need to get 800 orthogonal vectors.[1] This seems pretty workable. If we take the MELBO vector magnitude change (7 → 20) as an indication of how much the cosine similarity changes, then this is consistent with for the original vector. This seems plausible for a steering vector?
[1] Thanks to @Lucius Bushnaq for correcting my earlier wrong number
That model has an Attention and MLP block (GPT2-style model with 1 layer but a bit wider, 21M params).
I changed my mind over the course of this morning. The TinyStories models’ language isn’t that bad, and I think it’d be a decent research project to try to fully understand one of these.
I’ve been playing around with the models this morning, quotes from the 1-layer model:
Once upon a time, there was a lovely girl called Chloe. She loved to go for a walk every morning and one day she came across a road.
One day, she decided she wanted to go for a ride. She jumped up and down, and as she jumped into the horn, shouting whatever makes you feel like.
When Chloe was flying in the sky, she saw some big white smoke above her. She was so amazed and decided to fly down and take a closer look.
When Chloe got to the edge of a park, there was a firework show. The girl smiled and said “Oh, hello, there. I will make sure to finish in my flying body before it gets too cold,” it said.So Chloe flew to the park again, with a very persistent look at the white horn. She was very proud of her creation and was thankful for being so brave.
Summary: Chloe, a persistent girl, explores the park with the help of a firework sparkle and is shown how brave the firework can be persistent.
and
Once upon a time, there lived a young boy. His name was Caleb. He loved to learn new things and gain healthy by playing outside.
One day, Caleb was in the garden and he started eating an onion. He was struggling to find enough food to eat, but he couldn’t find anything.
Just then, Caleb appeared with a magical lake. The young boy told Caleb he could help him find his way home if he ate the onion. Caleb was so excited to find the garden had become narrow enough for Caleb to get his wish.
Caleb thought about what the pepper was thinking. He then decided to try and find a safer way to play with them next time. From then on, Caleb became healthier and could eat sweets and sweets in the house.
With the peppers, Caleb ate delicious pepper and could be heard by again. He was really proud of himself and soon enough he was playing in the garden again.
This feels like the kind of inconsistency I expect from a model that has only one layer. It can recall that the story was about flying and stuff, and the names, but it feels a bit like the model doesn’t remember what it said a paragraph before.
2-layer model:
Once upon a time, there was a lazy bear. He lived in a tall village surrounded by thick trees and lonely rivers.
The bear wanted to explore the far side of the mountain, so he asked a kind bird if he wanted to come. The bird said, “Yes, but first let me seat in my big tree. Follow me!”
The bear was excited and followed the bird. They soon arrived at a beautiful mountain. The mountain was rich with juicy, delicious fruit. The bear was so happy and thanked the bird for his help. They both shared the fruit and had a great time.
The bear said goodbye to the bird and returned to his big tree, feeling very happy and content. From then on, the bear went for food every day and could often seat in his tall tree by the river.
Summary: A lazy bear ventures on a mountain and finds a kind bird who helps him find food on his travels. The bear is happy and content with the food and a delicious dessert.
and
Once upon a time, there were two best friends, a gingerbread fox and a gingerbread wolf. Everyone loved the treats and had a great time together, playing games and eating the treats.
The gingerbread fox spoke up and said: “Let’s be like buying a house for something else!” But the ginger suggested that they go to the market instead. The friends agreed and they both went to the market.
Back home, the gingerbread fox was happy to have shared the treats with the friends. They all ate the treats with the chocolates, ran around and giggled together. The gingerbread fox thought this was the perfect idea, and every day the friends ate their treats and laughed together.
The friends were very happy and enjoyed every single morsel of it. No one else was enjoying the fun and laughter that followed. And every day, the friends continued to discuss different things and discover new new things to imagine.
Summary: Two best friends, gingerbread and chocolate, go to the market to buy treats but end up only buying a small house for a treat each, which they enjoy doing together.
I think if we can fully understand (in the Python code sense, probably with a bunch of lookup tables) how these models work this will give us some insight into where we’re at with interpretability. Do the explanations feel sufficiently compressed? Does it feel like there’s a simpler explanation than the code & tables we’ve written?
Edit: Specifically I’m thinking of
Train SAEs on all layers
Use this for Attention QK circuits (and transform OV circuit into SAE basis, or Transcoder basis)
Use Transcoders for MLPs
(Transcoders vs SAEs are somewhat redundant / different approaches, figure out how to connect everything together)
The tiny story status seems quite simple, in the sense that I can see how you could provide TinyStories levels of loss by following simple rules plus a bunch of memorization.
Empirically, one of the best models in the tiny stories paper is a super wide 1L transformer, which basically is bigrams, trigrams, and slightly more complicated variants [see Bucks post] but nothing that requires a step of reasoning.
I am actually quite uncertain where the significant gap between TinyStories, GPT-2 and GPT-4 is. Maybe I could fully understand TinyStories-1L if I tried, would this tell us about GPT-4? I feel like the result for TinyStories will be a bunch of heuristics.
Thanks for the comment Lawrence, I appreciate it!
I agree this doesn’t distinguish superposition vs no superposition at all; I was more thinking about the “error correction” aspect of MCIS (and just assuming superposition to be true). But I’m excited too for the SAE application, we got some experiments in the pipeline!
Your Correct behaviour point sounds reasonable but I feel like it’s not an explanation? I would have the same intuitive expectation, but that doesn’t explain how the model manages to not be sensitive. Explanations I can think of in increasing order of probability:
Story 0: Perturbations change activations and logprobs, but the answer doesn’t change because the logprob difference was large. I don’t think the KL divergence would behave like that.
Story 1: Perturbations do change the activations but the difference in the logprobs is small due to layer norm, unembed, or softmax shenanigans.
We did a test-experiment of perturbing the 12th layer rather than the 2nd layer, and the difference between real-other and random disappeared. So I don’t think it’s a weird effect when activations get converted to outputs.
Story 2: Perturbations in a lower layer cause less perturbation in later layers if the model is on-distribution (+ similar story for sensitivity).
This is what the L2-metric plots (right panel) suggest, and also what I understand your story to be.
But this doesn’t explain how the model does this, right? Are there simple stories how this happens?
I guess there’s lots of stories not limited to MCIS, anything along the lines of “ReLUs require thresholds to be passed”?
Based on that, I think the results still require some “error-correction” explanation, though you’re right that this doesn’t have to be MCIS (it’s just that there’s no other theory that doesn’t also conflict with superposition?).
My core request is that I want (SAE-)features to be a property of the model, rather than the dataset.
This can be misunderstood in the sense of taking issue with “If a concept is missing from the SAE training set, the SAE won’t find the corresponding feature.”—no, this is fine, the model-feature exists but simply isn’t found by the SAE.
What I mean to say is I take issue if “SAEs find a feature only because this concept is common in the dataset rather than because the model uses this concept.”[1] -- in my books this is SAEs making up features and that won’t help us understand models
[1] Of course a concept being common in the model-training-data makes it likely (?) to be a concept the model uses, but I don’t think this is a 1:1 correspondence. (So just making the SAE training set equal to the model training set wouldn’t solve the issue.)
There is a view that SAE features are just a useful tool for describing activations (interpretable features) and manipulating activations (useful for steering and probing). That SAEs are just a particularly good method in a larger class of methods, but not uniquely principled. In that case I wouldn’t expect this connection to model behaviour.
But the claim that we often make is that the model sees and understands the world as a set of model-features, and that we can see the same features by looking at SAE-features of the activations. And then I want to see the extra evidence.
[Interim research report] Activation plateaus & sensitive directions in GPT2
Are the features learned by the model the same as the features learned by SAEs?
TL;DR: I want model-features to be a property of the model weights, and to be recognizable without access to the full dataset. Toy models have that property. My “poor man’s model-features” have it. I want to know whether SAE-features have this property too, or if SAE-features do not match the model-features.
Introduction: Neural networks likely encode features in superposition. That is, features are represented as directions in activation space, and the model likely tracks many more features than dimensions in activation space. Because features are sparse, it should still be possible for the model to recover and use individual feature values.[1]
Problem statement: The prevailing method for finding these features is Sparse Autoencoders (SAEs). SAEs are well-motivated because they do recover superposed features in toy models. However, I am not certain whether SAEs recover the features of LLMs. I am worried (though not confident) that SAEs recover features of the dataset rather than features of the model, and that we are thus overconfident in how much SAEs tell us.
SAE failure mode: SAEs are trained to achieve a certain compression[2] task: Compress activations into a sparse overcomplete basis, and reconstruct the original activations based on this compressed representation. The solution to this problem can be identical to what the neural network does (wanting to store & use information), but it is not necessarily so. In TMS, the network’s only objective is to compress features, so it is natural that the SAE-features match the model-features. But LLMs solve a different task (well, we don’t have a good idea what LLMs do), and training an SAE on a model’s activations might yield a basis different from the model-features (see hypothetical Example 1 below).
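For readers who want the compression task spelled out, here is a minimal sketch of an SAE forward pass and loss. It assumes a generic ReLU encoder with an L1 sparsity penalty and unit-norm decoder rows; the weights are random and untrained, the sizes are arbitrary, and none of this is claimed to be the exact variant used in the work discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 8, 32  # overcomplete: more dictionary entries than activation dimensions

# Randomly initialised (untrained) parameters, for illustration only.
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(size=(d_dict, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder directions
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative codes, then reconstruct them."""
    codes = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    recon = codes @ W_dec + b_dec
    return codes, recon

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse codes."""
    codes, recon = sae_forward(x)
    mse = np.mean(np.sum((x - recon) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(codes), axis=-1))
    return mse + l1_coeff * sparsity

x = rng.normal(size=(4, d_model))  # placeholder activations
codes, recon = sae_forward(x)
print(codes.shape, recon.shape, sae_loss(x))
```

Nothing in this objective refers to how the model itself uses the activations; only the activation distribution enters, which is the source of the worry in this paragraph.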
Operationalisation of model-features (I’m tabooing “true features”): In the Toy Model of Superposition (TMS) the model’s weights are clearly adjusted to the feature directions. We can tell a feature from looking at the model weights. I want this to be a property of SAE-features as well. Then I would be confident that the features are a property of the model, and not (only) of the dataset distribution.
Concrete operationalisation: I give you 5 real SAE-features, and 5 made-up features (with similar properties). Can you tell which features are the real ones, without relying on the dataset (though you may use an individual prompt)? Lindsey (2024) is some evidence, but would it distinguish the SAE-features from an arbitrary decomposition of the activations into 5 fake-features?
Why do I care? I expect that the model-features are, in some sense, the computational units of the model. I expect our understanding to be more accurate (and to generalize) if we understand what the model actually does internally (see hypothetical Example 2 below).
Is this possible? Toy models of computation in superposition seem to suggest that models give special treatment to feature directions (compared to arbitrary activation directions), for example the error correction described here. This may privilege the basis of model-features over other decompositions of activations. I discuss experiment proposals at the bottom.
Example 1: Imagine an LLM was trained on The Pile excluding Wikipedia. Now we train an SAE on the model’s activations on a different dataset including Wikipedia. I expect that the SAE will find Wikipedia-related features: For example, a Wikipedia-citation-syntax feature on a low level, or a Wikipedia-style-objectivity feature on a high level. I would claim that this is not a feature of the model: During training the model never encountered these concepts, so it has not reserved a direction in its superposition arrangement (think geometric shapes in Toy Model of Superposition) for this feature.
It feels like there is a fundamental distinction between a model (SGD) “deciding” whether to learn a feature (as it does in TMS) and an SAE finding a feature that was useful for compressing activations.
Example 2: Maybe an SAE trained on an LLM playing Civilization and Risk finds a feature that corresponds to “strategic deception” on this dataset. But actually the model does not use a “strategic deception” feature (instead strategic deception originates from, say, a “power dynamics” feature), and it just happens that the instances of strategic deception in those games clustered into a specific direction. If we now take this direction to monitor for strategic deception we will fail to notice other strategic deception originating from the same “power dynamics” feature.
If we had known that the model-features that were active during the strategic deception instances were the “power dynamics” (+ other) features, we would have been able to choose the right, better generalizing, deception detection feature.
Experiment proposals: I have explored the abnormal effect that “poor man’s model-features” (sampled as the difference between two independent model activations) have on model outputs, and their relation to theoretically predicted noise suppression in feature activations. Experiments in Gurnee (2024) and Lindsey (2024) suggest that SAE decoder errors and SAE-features also have an abnormal effect on the model. With the LASR Labs team I mentor I want to explore whether SAE-features match the theoretical predictions, and whether the SAE-feature effects match those expected from model-features.
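As a rough illustration of how “poor man’s model-features” can be sampled, the sketch below takes the normalised difference between two independent activations and perturbs a base activation along it. The activations are random placeholders rather than real model activations, and the function names and the step size `alpha` are assumptions for illustration.

```python
import numpy as np

def activation_difference_direction(acts, rng):
    """Pick two distinct activations and return the index of the first
    plus the unit vector pointing from it to the second."""
    i, j = rng.choice(len(acts), size=2, replace=False)
    diff = acts[j] - acts[i]
    return int(i), diff / np.linalg.norm(diff)

def perturb(base, direction, alpha):
    """Move `base` by distance `alpha` along a unit `direction`."""
    return base + alpha * direction

rng = np.random.default_rng(0)
acts = rng.normal(size=(100, 16))  # placeholder for residual-stream activations

i, direction = activation_difference_direction(acts, rng)
perturbed = perturb(acts[i], direction, alpha=2.0)
print(np.linalg.norm(direction), np.linalg.norm(perturbed - acts[i]))
```

In an actual experiment the perturbed activation would be patched back into the model and the change in output (for example KL divergence) recorded as a function of `alpha`.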
- A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team (18 Jul 2024 14:15 UTC; 117 points)
- [Interim research report] Activation plateaus & sensitive directions in GPT2 (5 Jul 2024 17:05 UTC; 61 points)
- Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs (6 Sep 2024 2:28 UTC; 27 points)
Even after reading this (2 weeks ago), I today couldn’t manage to find the comment link and manually scrolled down. I later noticed it (at the bottom left) but it’s so far away from everything else. I think putting it somewhere at the top near the rest of the UI would be much easier for me