The features a model thinks in do not need to form a basis or dictionary for its activations.
Three assumptions people in interpretability often make about the features that comprise a model’s ontology:
Features are one-dimensional variables.
Meaning, the value of feature i on data point x can be represented by some scalar number ci(x).
Features are ‘linearly represented’.
Meaning, each feature ci(x) can be approximately recovered from the activation vector →a(x)[1] with a linear projection onto an associated feature vector →fi.[2] So, we can write ci(x)≈→fi⋅→a(x).
Features form a ‘basis’ for activation space.[3]
Meaning, the model’s activations →a(x) at a given layer can be decomposed into a sum over all the features of the model represented in that layer[4]: →a(x)=∑ici(x)→fi.
It seems to me that a lot of people are not tracking that 3) is an extra assumption they are making. I think they think that assumption 3) is a natural consequence of assumptions 1) and 2), or even just of assumption 2) alone. It’s not.
Counterexample
Model setup
Suppose we have a language model that has a thousand sparsely activating scalar, linearly represented features for different animals. So, “elephant”, “giraffe”, “parrot”, and so on all with their own associated feature directions →f1,…,→f1000. The model embeds those one thousand animal features in a fifty-dimensional sub-space of the activations. This subspace has a meaningful geometry: It is spanned by a set of fifty directions →f′1,…,→f′50 corresponding to different attributes animals have. Things like “furriness”, “size”, “length of tail” and such. So, each animal feature can equivalently be seen as either one of a thousand sparsely activating scalar features, or just as a particular setting of those fifty not-so-sparse scalar attributes.
Some circuits in the model act on the animal directions →fi. E.g. they have query-key lookups for various facts about elephants and parrots. Other circuits in the model act on the attribute directions →f′i. They’re involved in implementing logic like ‘if there’s a furry animal in the room, people with allergies might have problems’. Sometimes they’re involved in circuits that have nothing to do with animals whatsoever. The model’s “size” attribute is the same one used for houses and economies for example, so that direction might be read in by a circuit storing some fact about economic growth.
So, both the one thousand animal features and the fifty attribute features are elements of the model’s ontology, variables along which small parts of its cognition are structured. But we can’t make a basis for the model activations out of those one thousand and fifty features of the model. We can write either $\vec{a}(x)=\sum_{i=1}^{1000}c_i(x)\vec{f}_i$ or $\vec{a}(x)=\sum_{i=1}^{50}c'_i(x)\vec{f}'_i$. But $\sum_{i=1}^{1000}c_i(x)\vec{f}_i+\sum_{i=1}^{50}c'_i(x)\vec{f}'_i$ does not equal the model activation vector $\vec{a}(x)$; it’s too large.
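As a concrete sketch of this double-counting, here is a toy numpy construction. All choices in it are arbitrary illustrations (orthonormal attribute directions, random unit-norm attribute patterns for each animal, a single active animal per data point), not anything measured from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_attr = 1000, 50

# Attribute directions f'_j: an orthonormal basis of the 50-dim subspace
# (the standard basis, purely for simplicity of the sketch).
f_attr = np.eye(n_attr)

# Each animal direction f_i is a particular unit-norm setting of the 50 attributes.
f_animal = rng.normal(size=(n_animals, n_attr))
f_animal /= np.linalg.norm(f_animal, axis=1, keepdims=True)

# A data point where only "elephant" (animal 0) is active, with value 1.3.
c_animal = np.zeros(n_animals)
c_animal[0] = 1.3
a = c_animal @ f_animal            # the activation vector in this subspace

# Attribute feature values are just the linear read-offs (assumption 2).
c_attr = f_attr @ a

print(np.allclose(c_animal @ f_animal, a))   # True: animal features alone reconstruct a(x) (by construction)
print(np.allclose(c_attr @ f_attr, a))       # True: attribute features alone also reconstruct a(x)
print(np.allclose(c_animal @ f_animal + c_attr @ f_attr, 2 * a))  # True: summing both gives 2*a(x)
```

In this toy, the linear read-offs also behave as assumption 2 expects: f_animal[0] @ a comes out to 1.3, while the other animal read-offs only pick up interference of order 1.3/√50.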
Doing interp on this model
Say we choose →a(x)=∑ici(x)→fi as our basis for this subspace of the example model’s activations, and then go on to make a causal graph of the model’s computation, with each basis element being a node in the graph, and lines between nodes representing connections. Then the circuits dealing with query-key lookups for animal facts will look neat and understandable at a glance, with few connections and clear logic. But the circuits involving the attributes will look like a mess. A circuit reading in the size direction will have a thousand small but collectively significant connections to all of the animals.
If we choose →a(x)=∑ic′i(x)→f′i as our basis for the graph instead, circuits that act on some of the fifty attributes will look simple and sensible, but now the circuits storing animal facts will look like a mess. A circuit implementing “space” AND “cat” ⇒ [increase association with rainbows] is going to have fifty connections to features like “size” and “furriness”.
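To make the edge-count asymmetry concrete, here is a rough tally in the toy construction above, counting which graph nodes a circuit’s read-off direction overlaps with (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_attr = 1000, 50
f_attr = np.eye(n_attr)                                       # attribute directions
f_animal = rng.normal(size=(n_animals, n_attr))
f_animal /= np.linalg.norm(f_animal, axis=1, keepdims=True)   # animal directions

w_size = f_attr[0]       # read-off direction of a circuit using the "size" attribute
w_eleph = f_animal[0]    # read-off direction of a circuit storing an elephant fact

print((np.abs(f_animal @ w_size) > 1e-6).sum())  # ~1000: in the animal basis, the size circuit
                                                 # has small but nonzero edges to every animal node
print((np.abs(f_attr @ w_size) > 1e-6).sum())    # 1: in the attribute basis it has a single edge
print((np.abs(f_attr @ w_eleph) > 1e-6).sum())   # 50: in the attribute basis, the elephant circuit
                                                 # has edges to all fifty attribute nodes
```

(The elephant-fact circuit also overlaps every animal direction slightly, but because only one animal is active on a given data point, its per-datapoint attributions in the animal basis stay concentrated on that animal, which is why it still looks clean there.)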
The model’s ontology does not correspond to either the →fi basis or the →f′i basis. It just does not correspond to any basis of activation space at all, not even in a loose sense. Different circuits in the model can just process the activations in different bases, and they are under no obligation to agree with each other. Not even if they are situated right next to each other, in the same model layer.
Note that for all of this, we have not broken assumption 1) or assumption 2). The features this model makes use of are all linearly represented and scalar. We also haven’t broken the secret assumption 0) I left out at the start, that the model can be meaningfully said to have an ontology comprised of elementary features at all.
Takeaways
I’ve seen people call out assumptions 1) and 2), and at least think about how we can test whether they hold, and how we might need to adjust our interpretability techniques if and when they don’t hold. I have not seen people do this for assumption 3). Though I might just have missed it, of course.
My current dumb guess is that assumption 2) is mostly correct, but assumptions 1) and 3) are both incorrect.
The reason I think assumption 3) is incorrect is that the counterexample I sketched here seems to me like it’d be very common. LLMs seem to be made of lots of circuits. Why would these circuits all share a basis? They don’t seem to me to have much reason to.
I think a way we might find the model’s features without assumption 3) is to focus on the circuits and computations first. Try to directly decompose the model weights or layer transitions into separate, simple circuits, then infer the model’s features from looking at the directions those circuits read and write to. In the counterexample above, this would have shown us both the animal features and the attribute features.
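One toy illustration of this ‘circuits first’ direction (not any particular proposed method; it just assumes the weights have already been split into simple, low-rank circuit components):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50

# Hypothetical: a layer's weights have already been decomposed into simple
# components, here a single rank-1 circuit W_c = write_c @ read_c^T.
read_true = rng.normal(size=d);  read_true /= np.linalg.norm(read_true)
write_true = rng.normal(size=d); write_true /= np.linalg.norm(write_true)
W_c = np.outer(write_true, read_true)

# The feature directions this circuit cares about fall out of the component itself,
# with no need for different circuits' features to share one basis.
U, S, Vt = np.linalg.svd(W_c)
read_dir, write_dir = Vt[0], U[:, 0]

print(abs(read_dir @ read_true))    # ~1.0 (up to sign): recovered read direction
print(abs(write_dir @ write_true))  # ~1.0 (up to sign): recovered write direction
```

In the counterexample, some components would hand us animal directions and others attribute directions, without ever asking those directions to assemble into a single dictionary.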
Potentially up to some small ϵ noise. For a nice operationalisation, see definition 2 on page 3 of this paper.
It’s a vector because we’ve already assumed that features are all scalar. If a feature was two-dimensional instead, this would be a projection into an associated two-dimensional subspace.
I’m using the term basis loosely here, this also includes sparse overcomplete ‘bases’ like those in SAEs. The more accurate term would probably be ‘dictionary’, or ‘frame’.
Or if the computation isn’t layer aligned, the activations along some other causal cut through the network can be written as a sum of all the features represented on that cut.
Sometimes it’s simpler (less edges) to use the attributes (Cute) or animals (Bunny) or both (eg a particularly cute bunny). Assumption 3 doesn’t allow mixing different bases together.
I think you’re saying:

So here we have 2 attributes (for $\vec{f}_{att}$) & 4 animals (for $\vec{f}_{animal}$).

If the downstream circuit (let’s assume a linear + ReLU) reads from the “Cute” direction then:
1. If we are only using $\vec{f}_{animal}$: Bunny + Dolphin (interpretable, but add 100 more animals & it’ll take a lot more work to interpret)
2. If we are only using $\vec{f}_{att}$: Cute (simple)

If a downstream circuit reads from the “bunny” direction, then the reverse:
1. Only $\vec{f}_{animal}$: Bunny (simple)
2. Only $\vec{f}_{att}$: Cute + Furry ( + 48 attributes makes it more complex)

However, what if there’s a particularly cute rabbit?
1. Only $\vec{f}_{animal}$: Bunny + 0.2*Dolphin(?) (+ many more animals)
2. Only $\vec{f}_{att}$: 2*Cute + Furry (+ many more attributes)

Neither of the above work! BUT what if we mixed them:
3. Bunny + 0.2*Cute (simple)
I believe you’re claiming that something like APD would, when given the very cute rabbit input, activate the Bunny & Cute components (or whatever directions the model is actually using), which can be in different bases, so can’t form a dictionary/basis. [1]
Technically you didn’t specify that c(x) can’t be an arbitrary function, so you’d be able to reconstruct activations combining different bases, but it’d be horribly convoluted in practice.
I wouldn’t even be too fussed about ‘horribly convoluted’ here. I’m saying it’s worse than that. We would still have a problem even if we allowed ourselves arbitrary encoder functions to define the activations in the dictionary and magically knew which ones to pick.
The problem here isn’t that we can’t make a dictionary that includes all the 1050 feature directions $\vec{f}$ as dictionary elements. We can do that. For example, while we can’t write
$$\vec{a}(x)=\sum_{i=1}^{1000}c_i(x)\vec{f}_i+\sum_{i=1}^{50}c'_i(x)\vec{f}'_i$$
because those sums each already equal $\vec{a}(x)$ on their own, we can write
$$\vec{a}(x)=\sum_{i=1}^{1000}\frac{c_i(x)}{2}\vec{f}_i+\sum_{i=1}^{50}\frac{c'_i(x)}{2}\vec{f}'_i.$$
The problem is instead that we can’t make a dictionary that has the 1050 feature activations $c_i(x), c'_i(x)$ as the coefficients in the dictionary. This is bad because it means our dictionary activations cannot equal the scalar variables the model’s own circuits actually care about. They cannot equal the ‘features of the model’ in the sense defined at the start, the scalar features comprising its ontology. As a result, if we were to look at a causal graph of the model, using the 1050 half-size dictionary feature activations we picked as the graph nodes, a circuit taking in the feature $c_{elephant}(x)$ through a linear read-off along the direction $\vec{f}_{elephant}$ would have edges in our graph connecting it to both the elephant direction, making up about 50% of the total contribution, and the fifty attribute directions, making up the remaining 50%. Same the other way around: any circuit reading in even a single attribute feature will have 1000 edges connecting to all of the animal features[1], making up 50% of the total contribution. It’s the worst of both worlds. Every circuit looks like a mess now.
Since the animals are sparse, in practice this usually means edges to a small set of different animals for every data point. Whichever ones happen to be active at the time.
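Reusing the toy numpy construction from the counterexample above, this 50/50 split can be checked directly (illustrative numbers only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_attr = 1000, 50
f_attr = np.eye(n_attr)
f_animal = rng.normal(size=(n_animals, n_attr))
f_animal /= np.linalg.norm(f_animal, axis=1, keepdims=True)

c_animal = np.zeros(n_animals); c_animal[0] = 1.0   # elephant active with value 1
a = c_animal @ f_animal
c_attr = f_attr @ a

# The 1050-element dictionary with half-size coefficients does reconstruct a(x)...
recon = (c_animal / 2) @ f_animal + (c_attr / 2) @ f_attr
print(np.allclose(recon, a))                        # True

# ...but a circuit reading the elephant direction now gets half its input from
# each half of the dictionary, so its causal-graph edges split 50/50.
w = f_animal[0]
print(((c_animal / 2) @ f_animal) @ w)              # 0.5: contribution from the animal half
print(((c_attr / 2) @ f_attr) @ w)                  # 0.5: contribution from the attribute half
```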
It seems like in this setting, the animals are just the sum of attributes that commonly co-occur together, rather than having a unique identifying direction. E.g. the concept of a “furry elephant” or a “tiny elephant” would be unrepresentable in this scheme, since elephant is defined as just the collection of attributes that elephants usually have, which includes being large and not furry.
I feel like in this scheme, it’s not really the case that there’s 1000 animal directions, since the base unit is the attributes, and there’s no way to express an animal separately from its attributes. For there to be a true “elephant” direction, then it should be possible to have any set of arbitrary attributes attached to an elephant (small, furry, pink, etc...), and this would require that there is a “label” direction that indicates “elephant” that’s mostly orthogonal to every other feature so it can be queried uniquely via projection.
That being said, I could imagine a situation where the co-occurrence between labels and attributes is so strong (nearly perfect hierarchy) that the model’s circuits can select the attributes along with the label without it ever being a problem during training. For instance, maybe a circuit that’s trying to select the “elephant” label actually selects “elephant + gray”, and since “pink elephant” never came up during training, the circuit never received a gradient to force it to just select “elephant”, which is what it’s really aiming for.
E.g. the concept of a “furry elephant” or a “tiny elephant” would be unrepresentable in this scheme
It’s representable. E.g. the model can learn a circuit reading in a direction that is equal to the sum of the furry attribute direction and the elephant direction, or the tiny direction and the elephant direction respectively. This circuit can then store facts about furry elephants or tiny elephants.
I feel like in this scheme, it’s not really the case that there’s 1000 animal directions, since the base unit is the attributes
In what sense? If you represent the network computations in terms of the attribute features, you will get a very complicated computational graph with lots of interaction lines going all over the place. So clearly, the attributes on their own are not a very good basis for understanding the network.
Similarly, you can always represent any neural network in the standard basis of the network architecture. Trivially, all features can be seen as mere combinations of these architectural ‘base units’. But if you try to understand what the network is doing in terms of interactions in the standard basis, you won’t get very far.
For there to be a true “elephant” direction, then it should be possible to have any set of arbitrary attributes attached to an elephant (small, furry, pink, etc...), and this would require that there is a “label” direction that indicates “elephant” that’s mostly orthogonal to every other feature so it can be queried uniquely via projection.
The ‘elephant’ feature in this setting is mostly-orthogonal to every other feature in the ontology, including the features that are attributes. So it can be read out with a linear projection. ‘elephant’ and ‘pink’ shouldn’t have substantially higher cosine similarity than ‘elephant’ and ‘parrot’.
Just to clarify, do you mean something like “elephant = grey + big + trunk + ears + African + mammal + wise”, so to encode a tiny elephant you would have “grey + tiny + trunk + ears + African + mammal + wise”, which the model could still read off as 0.86 × elephant when relevant, but also as tiny when relevant?
‘elephant’ would be a sum of fifty attribute feature vectors, all with scalar coefficients that match elephants in particular. The coefficients would tend to have sizes on the order of $1/\sqrt{50}$, because the subspace is fifty-dimensional. So, if you wanted to have a pure tiny feature and an elephant feature active at the same time to encode a tiny elephant, ‘elephant’ and ‘tiny’ would be expected to have read-off interference on the order of $1/\sqrt{50}$. Alternatively, you could instead encode a new animal ‘tiny elephant’ as its own point in the fifty-dimensional space. Those are actually distinct things here. If this is confusing, maybe it helps to imagine that the name for ‘tiny elephant’ is ‘exampledon’, and exampledons just happen to look like tiny elephants.
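A quick numeric illustration of that interference estimate, with a random unit-norm attribute pattern standing in for ‘elephant’ and a single attribute direction standing in for ‘tiny’ (all choices arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_attr = 50

f_eleph = rng.normal(size=n_attr)
f_eleph /= np.linalg.norm(f_eleph)           # 'elephant' as a unit-norm pattern of 50 attribute coefficients
f_tiny = np.zeros(n_attr); f_tiny[0] = 1.0   # 'tiny' as one attribute direction

print(abs(f_eleph @ f_tiny))   # read-off interference, typically around 0.1-0.2
print(1 / np.sqrt(n_attr))     # ~0.14, the 1/sqrt(50) scale referred to above
```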
Is the distinction between “elephant + tiny” and “exampledon” primarily about the things the model does downstream? E.g. if none of the fifty dimensions of our subspace represent “has a bright purple spleen” but exampledons do, then the model might need to instead produce a “purple” vector as an output from an MLP whenever “exampledon” and “spleen” are present together.
This implies that there is no elephant direction separate from the attributes that happen to commonly co-occur with elephants. E.g. it’s not possible to represent an elephant with any arbitrary combination of attributes, as the attributes themselves are what defines the elephant direction. This is what I mean that the attributes are the ‘base units’ in this scheme, and ‘animals’ are just commonly co-occurring sets of attributes. This is the same as the “red triangle” problem in SAEs: https://www.lesswrong.com/posts/QoR8noAB3Mp2KBA4B/do-sparse-autoencoders-find-true-features. The animals in this framing are just invented combinations of the underlying attribute features. We would want the dictionary to learn the attributes, not arbitrary combinations of attributes, since these are the true “base units” that can vary freely. e.g. in the “red triangle” problem, we want a dictionary to learn “red” and “triangle”, not “red triangle” as its own direction.
Put another way, there’s no way to represent an “elephant” in this scheme without also attaching attributes to it. Likewise, it’s not possible to differentiate between an elephant with the set of attributes x, y, and z and a rabbit with identical attributes x, y, and z, since the sum of attributes is what you’re calling an elephant or rabbit. There’s no separate “this is a rabbit, regardless of what attributes it has” direction.
To properly represent animals and attributes, there needs to be a direction for each animal that’s separate from any attributes that animal may have, so that it’s possible to represent a “tiny furry pink elephant with no trunk” vs a “tiny furry pink rabbit with no trunk”.
E.g. it’s not possible to represent an elephant with any arbitrary combination of attributes, as the attributes themselves are what defines the elephant direction.
You can’t represent elephants along with arbitrary combinations of attributes. You can’t do that in a scheme where feature directions are fully random with no geometry either though. There, only a small number of features can have non-zero values at the same time, so you still only get O(√50) non-zero attribute features at once maximum.[1]
We would want the dictionary to learn the attributes, not arbitrary combinations of attributes, since these are the true “base units” that can vary freely.
You can call them the “base units” if you like. But that won’t change the fact that some directions in the space spanned by those “base units” are special, with associated circuits that care about those directions in particular, and understanding or even recognising those circuits in a causal graph made of the “base units” will be pretty darned hard. For the same reason trying to understand the network in the neuron basis is hard.
Put another way, there’s no way to represent an “elephant” in this scheme without also attaching attributes to it.
Yes.
Likewise, it’s not possible to differentiate between an elephant with the set of attributes x, y, and z and a rabbit with identical attributes x, y, and z, since the sum of attributes is what you’re calling an elephant or rabbit.
Not quite. You cannot specify a rabbit and simultaneously specify the rabbit having arbitrary numerical attribute values for attributes x, y, z differing from normal rabbits. You can have a rabbit, and some attributes x, y, z treated as sparse boolean-ish features at the same time. E.g. $\vec{a}=\vec{f}_{rabbit}+\vec{f}_x+\vec{f}_y+\vec{f}_z$ works. Circuits downstream that store facts about rabbits will still be triggered by this $\vec{a}$. Circuits downstream that do something with attribute x will be reading in an x-attribute value of 1 plus the x-coefficient of rabbits.
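A small numeric version of this, with a random unit-norm attribute pattern as the rabbit direction and one attribute direction standing in for the sparse boolean-ish attribute (again, just an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n_attr = 50
f_attr = np.eye(n_attr)

f_rabbit = rng.normal(size=n_attr)
f_rabbit /= np.linalg.norm(f_rabbit)   # rabbit = some particular attribute pattern
f_x = f_attr[0]                        # a sparse boolean-ish attribute, e.g. "cute"

a = f_rabbit + f_x                     # a rabbit, plus the attribute switched on

print(f_rabbit @ a)   # ~1 (+ small interference): rabbit-fact circuits still trigger
print(f_x @ a)        # 1 + the rabbit's own x-coefficient, as described above
```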
A consequence of this is that ‘cute rabbit’ is a bit cuter than either ‘cute’ or ‘rabbit’ on their own. But that doesn’t seem particularly strange to me. Associations in my own mind sure seem to work like that.
Less, if you want to be able to perform computation in superposition.
I’m with @chanind: If elephant is fully represented by a sum of its attributes, then it’s quite reasonable to say that the model has no fundamental notion of an elephant in that representation.
Yes, the combination “grey + big + mammal + …” is special in some sense. If the model needed to recall that elephants are afraid of mice, the circuit would appear to check “grey and big and mammal”, and that’s an annoying mouthful that would be repeated all over the model. But it’s a faithful representation of what’s going on.
Let me be precise about what I mean by “has no fundamental notion of an elephant”. Suppose I tried to fine tune the model to represent some new fact about animals, say, whether they are worth a lot of points in Scrabble. One way the model could do this is by squeezing another feature into the activation space. The other features might rotate a little during this training, but all the existing circuitry would basically continue functioning unchanged.
But they’d be too unchanged: the “afraid of mice” circuit would still be checking for “grey and big and mammal and …” as the finetune dataset included no facts about animal fears. While some newer circuits formed during fine tuning would be checking for “grey and big and mammal and … and high-scrabble-scoring”. Any interpretability tool that told you that “grey and big and mammal and …” was “elephant” in the first model is now going to have difficulty representing the situation.
Meanwhile, consider a “normal” model that has a residual notion of an elephant after you take away all facts about elephants. Then both old and new circuits would contain references to that residual (plus other junk), and one could meaningfully say both circuits have something in common.
Your example, which represents animals purely by their properties, reminds me of this classic article, which argues that a key feature in thought is forming concepts of things that are independent of the properties we learnt about them.
I too agreed w/ Chanind initially, but I think I see where Lucius is coming from.
If we forget about a basis & focus on minimal description length (MDL), it’d be nice to have a technique that found the MDL [features/components] for each datapoint. e.g. in my comment, I have 4 animals (bunny, etc) & two properties (cute, furry). For MDL reasons, it’d be great to sometimes use cute/furry & sometimes use Bunny if that reflects model computation more simply.
If you have both attributes & animals as fundamental units (and somehow have a method that tells you which minimal set of units form each datapoint) then a bunny will just use the bunny feature (since that’s simpler than cute + furry + etc), & a very cute bunny will use bunny + cute (instead of bunny + 0.1*dolphin + etc (or similar w/ properties)).
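As a crude sketch of the kind of per-datapoint selection being gestured at here (plain greedy matching pursuit over a mixed animal-plus-attribute dictionary; only an illustration, not APD):

```python
import numpy as np

rng = np.random.default_rng(0)
n_attr = 50
f_attr = np.eye(n_attr)
f_bunny = rng.normal(size=n_attr)
f_bunny /= np.linalg.norm(f_bunny)     # bunny = some particular attribute pattern
f_cute = f_attr[0]

# Mixed, overcomplete dictionary: the bunny direction plus all 50 attribute directions.
D = np.vstack([f_bunny, f_attr])
names = ["bunny", "cute"] + [f"attr_{j}" for j in range(1, n_attr)]

def greedy_decompose(a, n_steps=3, tol=0.1):
    """Repeatedly pick whichever dictionary element explains most of the residual."""
    residual, picks = a.copy(), []
    for _ in range(n_steps):
        scores = D @ residual
        i = int(np.argmax(np.abs(scores)))
        if abs(scores[i]) < tol:
            break
        picks.append((names[i], round(float(scores[i]), 2)))
        residual = residual - scores[i] * D[i]
    return picks

print(greedy_decompose(f_bunny))                 # a plain bunny: typically just the bunny unit
print(greedy_decompose(f_bunny + 0.3 * f_cute))  # a very cute bunny: typically bunny, then cute
```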
So if we look at Lucius’ initial statement:
The features a model thinks in do not need to form a basis or dictionary for its activations. [emphasis mine]
They don’t need to, but they can form a basis. It very well could be simpler to not constrain our understanding/model of the NN’s features as forming a basis.
Ideally Lucius can just show us this magical method that gets you simple components that don’t form a basis, then we’d all have a simpler time understanding his point. I believe this “magical method” is Attribution-based parameter decomposition (APD), which they (lucius, dan, lee?) have been working on; I would be excited if more people tried creative methods to scale it up. I’m unsure if this method will work, but it is a different bet than e.g. SAEs & currently underrated imo.
But they’d be too unchanged: the “afraid of mice” circuit would still be checking for “grey and big and mammal and …” as the finetune dataset included no facts about animal fears. While some newer circuits formed during fine tuning would be checking for “grey and big and mammal and … and high-scrabble-scoring”. Any interpretability tool that told you that “grey and big and mammal and …” was “elephant” in the first model is now going to have difficulty representing the situation.
Thank you, this is a good example of a type-of-thing to watch out for in circuit interpretation. I had not thought of this before. I agree that an interpretability tool that rounded those two circuits off to taking in the ‘same’ feature would be a bad interpretability tool. It should just show you that those two circuits exist, and have some one dimensional features they care about, and those features are related but non-trivially distinct.
But this is not at all unique to the sort of model used in the counterexample. A ‘normal’ model can still have one embedding direction for elephant, $\vec{f}_{elephant}$, at one point, used by a circuit C1, then in fine tuning switch to a slightly different embedding direction $\vec{f}_{elephant}'$. Maybe it learned more features in fine tuning, some of those features are correlated with elephants and ended up a bit too close in cosine similarity to $\vec{f}_{elephant}$, and so interference can be lowered by moving the embedding around a bit. A circuit C2 learned in fine tuning would then be reading from this $\vec{f}_{elephant}'$ and not match C1, which is still reading in $\vec{f}_{elephant}$. You might argue that C1 will surely want to adjust to start using $\vec{f}_{elephant}'$ as well to lower the loss, but that would seem to apply equally well to your example. So I don’t see how this is showing that the model used in the original counterexample has no notion of an elephant in a sense that does not also apply to the sort of models people might tend to imagine when they think in the conventional SDL paradigm.
EDIT: On a second read, I think I misunderstood you here. You seem to think the crucial difference is that the delta between $\vec{f}_{elephant}$ and $\vec{f}_{elephant}'$ is mostly ‘unstructured’, whereas the difference between “grey and big and mammal and …” and “grey and big and mammal and … and high-scrabble-scoring” is structured. I don’t see why that should matter though. So long as our hypothetical interpretability tool is precise enough to notice the size of the discrepancy between those features and not throw them into the same pot, we should be fine. For that, it wouldn’t seem to me to really matter much whether the discrepancy is ‘meaningful’ to the model or not.
I’m with @chanind: If elephant is fully represented by a sum of its attributes, then it’s quite reasonable to say that the model has no fundamental notion of an elephant in that representation. ...
This is not a load-bearing detail of the example. If you like, you can instead imagine a model that embeds 1000 animals in an e.g. 100-dimensional subspace, with a 50 dimensional sub-sub-space where the embedding directions correspond to 50 attributes, and a 50 dimensional sub-sub-space where embeddings are just random.
This should still get you basically the same issues the original example did I think? For any dictionary decomposition of the activations you pick, some of the circuits will end up looking like a horrible mess, even though they’re secretly taking in a very low-rank subspace of the activations that’d make sense to us if we looked at it. I should probably double check that when I’m more awake though.[1]
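For what it’s worth, the variant is easy to write down in the same toy style as before (all shapes and scales are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_attr, n_rand = 1000, 50, 50

# Animal directions now live in a 100-dim subspace: 50 dims spanned by
# attribute directions, 50 dims where the embedding components are just random.
structured = rng.normal(size=(n_animals, n_attr))
unstructured = rng.normal(size=(n_animals, n_rand))
f_animal = np.concatenate([structured, unstructured], axis=1)
f_animal /= np.linalg.norm(f_animal, axis=1, keepdims=True)

# A circuit reading an attribute direction still overlaps essentially every
# animal direction, so the same basis mismatch shows up in its causal graph.
f_size = np.zeros(n_attr + n_rand); f_size[0] = 1.0
print((np.abs(f_animal @ f_size) > 1e-6).sum())   # ~1000
```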
I think the central issue here is mostly just having some kind of non-random, ‘meaningful’ feature embedding geometry that the circuits care about, instead of random feature embeddings.
EDIT: I am now more awake. I still think this is right.
If I understand correctly, it sounds like you’re saying there is a “label” direction for each animal that’s separate from each of the attributes. So, you could have activation a1 = elephant + small + furry + pink, and a2 = rabbit + small + furry + pink. a1 and a2 have the same attributes, but different animal labels. Their corresponding activations are thus different despite having the same attributes due to the different animal label components.
I’m confused why a dictionary that consists of a feature direction for each attribute and each animal label can’t explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal. In this sense the activations are just the sum of the various elements of the dictionary multiplied by a magnitude, so it seems like you should be able to explain these activations using dictionary learning.
Is the idea that the 1000 animals and 50 attributes form an overcomplete basis, therefore you can come up with infinite ways to span the space using these basis components? The idea behind compressed sensing in dictionary learning is that if each activation is composed of a sparse sum of features, then L1 regularization can still recover the true features despite the basis being overcomplete.
If I understand correctly, it sounds like you’re saying there is a “label” direction for each animal that’s separate from each of the attributes.
No, the animal vectors are all fully spanned by the fifty attribute features.
I’m confused why a dictionary that consists of a feature direction for each attribute and each animal label can’t explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal.
The animal features are sparse. The attribute features are not sparse.[1]
In this sense the activations are just the sum of the various elements of the dictionary multiplied by a magnitude, so it seems like you should be able to explain these activations using dictionary learning.
The magnitudes in a dictionary seeking to decompose the activation vector into these 1050 features will not be able to match the actual magnitudes of the features $c_i(x),\ i=1\ldots1000$ and $c'_i(x),\ i=1\ldots50$ as seen by linear probes and the network’s own circuits.
Is the idea that the 1000 animals and 50 attributes form an overcomplete basis, therefore you can come up with infinite ways to span the space using these basis components?
No, that is not the idea.
Relative to the animal features at least. They could still be sparse relative to the rest of the network if this 50-dimensional animal subspace is rarely used.
No, the animal vectors are all fully spanned by the fifty attribute features.
Is this just saying that there’s superposition noise, so everything is spanning everything else? If so that doesn’t seem like it should conflict with being able to use a dictionary, dictionary learning should work with superposition noise as long as the interference doesn’t get too massive.
The animal features are sparse. The attribute features are not sparse.
If you mean that the attributes are a basis in the sense that the neurons are a basis, then I don’t see how you can say there’s a unique “label” direction for each animal that’s separate from the underlying attributes, such that you can set any arbitrary combination of attributes, including all attributes turned on at once or all turned off since they’re not sparse, and still read off the animal label without interference. It seems like that would be like saying that the elephant direction = [1, 0, −1], but you can change all 3 of those numbers arbitrarily to any other numbers and still be the elephant direction.
If the animal-specific features form an overcomplete basis, isn’t the set of animals + attributes just an even more overcomplete basis?
Nope. Try it out. If you attempt to split the activation vector into 1050 vectors for animals + attributes, you can’t get the dictionary activations to equal the feature activations $c_i(x)$, $c'_i(x)$.
Is the central point here that a given input will activate its representation in both the size-1000 and size-50 sub-dictionaries, meaning the reconstruction will be 2x too big?
Yes.