This implies that there is no elephant direction separate from the attributes that happen to commonly co-occur with elephants. E.g. it’s not possible to represent an elephant with any arbitrary combination of attributes, as the attributes themselves are what defines the elephant direction. This is what I mean when I say that the attributes are the ‘base units’ in this scheme, and ‘animals’ are just commonly co-occurring sets of attributes. This is the same as the “red triangle” problem in SAEs: https://www.lesswrong.com/posts/QoR8noAB3Mp2KBA4B/do-sparse-autoencoders-find-true-features. The animals in this framing are just invented combinations of the underlying attribute features. We would want the dictionary to learn the attributes, not arbitrary combinations of attributes, since these are the true “base units” that can vary freely. E.g. in the “red triangle” problem, we want a dictionary to learn “red” and “triangle”, not “red triangle” as its own direction.
Put another way, there’s no way to represent an “elephant” in this scheme without also attaching attributes to it. Likewise, it’s not possible to differentiate between an elephant with the set of attributes x, y, and z and a rabbit with identical attributes x, y, and z, since the sum of attributes is what you’re calling an elephant or rabbit. There’s no separate “this is a rabbit, regardless of what attributes it has” direction.
To properly represent animals and attributes, there needs to be a direction for each animal that’s separate from any attributes that animal may have, so that it’s possible to represent a “tiny furry pink elephant with no trunk” vs a “tiny furry pink rabbit with no trunk”.
E.g. it’s not possible to represent an elephant with any arbitrary combination of attributes, as the attributes themselves are what defines the elephant direction.
You can’t represent elephants along with arbitrary combinations of attributes. You can’t do that in a scheme where feature directions are fully random with no geometry either, though. There, only a small number of features can have non-zero values at the same time, so you still get at most O(√50) non-zero attribute features at once.[1]

[1] Less, if you want to be able to perform computation in superposition.
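To make the interference point concrete, here is a minimal numpy sketch (my own illustration, not part of the original comments) of how readout noise grows as more randomly oriented features are active at once. Exactly where the usable cutoff lies depends on how much interference downstream circuits can tolerate.

```python
# Minimal illustration (not from the original comments): readout interference when
# k features with random directions in a d-dimensional space are active at once.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 50, 1000

# Random unit directions for every feature, no special geometry.
F = rng.standard_normal((n_features, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

for k in [1, 3, 7, 15, 30, 50]:
    errs = []
    for _ in range(200):
        active = rng.choice(n_features, size=k, replace=False)
        a = F[active].sum(axis=0)          # activation = sum of k active features
        readout = F[active[0]] @ a         # linear readout of one active feature
        errs.append(abs(readout - 1.0))    # true value is 1; the rest is interference
    print(f"k={k:2d}  mean |readout error| = {np.mean(errs):.2f}")
# The error grows roughly like sqrt(k/d): with d=50, more than a handful of
# simultaneously active features already produces substantial interference.
```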
We would want the dictionary to learn the attributes, not arbitrary combinations of attributes, since these are the true “base units” that can vary freely.
You can call them the “base units” if you like. But that won’t change the fact that some directions in the space spanned by those “base units” are special, with associated circuits that care about those directions in particular, and understanding or even recognising those circuits in a causal graph made of the “base units” will be pretty darned hard, for the same reason that trying to understand the network in the neuron basis is hard.
Put another way, there’s no way to represent an “elephant” in this scheme without also attaching attributes to it.
Yes.
Likewise, it’s not possible to differentiate between an elephant with the set of attributes x, y, and z and a rabbit with identical attributes x, y, and z, since the sum of attributes is what you’re calling an elephant or rabbit.
Not quite. You cannot specify a rabbit and simultaneously specify that rabbit as having arbitrary numerical values for attributes x, y, z that differ from those of normal rabbits. You can have a rabbit, and some attributes x, y, z treated as sparse boolean-ish features, at the same time. E.g. $\vec{a} = \vec{f}_{\text{rabbit}} + \vec{f}_x + \vec{f}_y + \vec{f}_z$ works. Circuits downstream that store facts about rabbits will still be triggered by this $\vec{a}$. Circuits downstream that do something with attribute x will be reading in an x-attribute value of 1 plus the x-coefficient of rabbits.
A consequence of this is that ‘cute rabbit’ is a bit cuter than either ‘cute’ or ‘rabbit’ on their own. But that doesn’t seem particularly strange to me. Associations in my own mind sure seem to work like that.
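A small numeric sketch of this readout arithmetic (the attribute directions, coefficients, and names below are invented for illustration): if the rabbit direction already carries a non-zero ‘cute’ coefficient, then an activation of rabbit plus an explicit ‘cute’ feature reads out as cuter than either alone.

```python
# Toy numbers (my own, for illustration): attribute directions are basis vectors,
# and the rabbit direction is a weighted bundle of attributes.
import numpy as np

attributes = ["cute", "furry", "small", "grey"]
f_attr = {name: vec for name, vec in zip(attributes, np.eye(4))}

# Rabbit = its usual attribute profile (coefficients are made up).
f_rabbit = 0.8 * f_attr["cute"] + 0.9 * f_attr["furry"] + 0.7 * f_attr["small"]

def read(attr, activation):
    """Linear readout a downstream circuit might use for one attribute."""
    return f_attr[attr] @ activation

a_rabbit      = f_rabbit                      # just a rabbit
a_cute_rabbit = f_rabbit + f_attr["cute"]     # rabbit with 'cute' explicitly active

print(read("cute", f_attr["cute"]))   # 1.0  -> 'cute' on its own
print(read("cute", a_rabbit))         # 0.8  -> the rabbit's baseline cuteness
print(read("cute", a_cute_rabbit))    # 1.8  -> 1 plus the rabbit's cute-coefficient
```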
I’m with @chanind: If elephant is fully represented by a sum of its attributes, then it’s quite reasonable to say that the model has no fundamental notion of an elephant in that representation.
Yes, the combination “grey + big + mammal + …” is special in some sense. If the model needed to recall that elephants are afraid of mice, the circuit would appear to check “grey and big and mammal”, and that’s an annoying mouthful that would be repeated all over the model. But it’s a faithful representation of what’s going on.
Let me be precise about what I mean by “has no fundamental notion of an elephant”. Suppose I tried to fine tune the model to represent some new fact about animals, say, whether they are worth a lot of points in Scrabble. One way the model could do this is by squeezing another feature into the activation space. The other features might rotate a little during this training, but all the existing circuitry would basically continue functioning unchanged.
But they’d be too unchanged: the “afraid of mice” circuit would still be checking for “grey and big and mammal and …” as the finetune dataset included no facts about animal fears. While some newer circuits formed during fine tuning would be checking for “grey and big and mammal and … and high-scrabble-scoring”. Any interpretability tool that told you that “grey and big and mammal and …” was “elephant” in the first model is now going to have difficulty representing the situation.
Meanwhile, consider a “normal” model that has a residual notion of an elephant after you take away all facts about elephants. Then both old and new circuits would contain references to that residual (plus other junk) and one could meaningfully say both circuits have something in common.
Your example, which represents animals purely by their properties, reminds me of this classic article, which argues that a key feature in thought is forming concepts of things that are independent of the properties we learnt about them.
I too agreed w/ Chanind initially, but I think I see where Lucius is coming from.
If we forget about a basis & focus on minimal description length (MDL), it’d be nice to have a technique that found the MDL [features/components] for each datapoint. E.g. in my comment, I have 4 animals (bunny, etc) & two properties (cute, furry). For MDL reasons, it’d be great to sometimes use cute/furry & sometimes use bunny if that reflects model computation more simply.
If you have both attributes & animals as fundamental units (and somehow have a method that tells you which minimal set of units form each datapoint) then a bunny will just use the bunny feature (since that’s simpler than cute + furry + etc), & a very cute bunny will use bunny + cute (instead of bunny + 0.1*dolphin + etc (or similar w/ properties)).
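A toy sketch of the selection rule being described (the dictionary and the brute-force search are my own illustrative assumptions, not a proposed method): given a dictionary containing both the attributes and the animal bundles, pick the fewest dictionary elements that reconstruct each datapoint.

```python
# Illustrative sketch (names and numbers invented): choose the fewest dictionary
# elements that reconstruct a datapoint, found by brute force over small subsets.
import itertools
import numpy as np

cute, furry, long_ears = np.eye(3)
bunny = cute + furry + long_ears            # the 'bunny' bundle is also in the dictionary
dictionary = {"cute": cute, "furry": furry, "long_ears": long_ears, "bunny": bunny}

def sparsest_decomposition(x, dictionary, tol=1e-6):
    names = list(dictionary)
    for size in range(1, len(names) + 1):                 # fewest components first
        for subset in itertools.combinations(names, size):
            D = np.stack([dictionary[n] for n in subset], axis=1)
            coef, *_ = np.linalg.lstsq(D, x, rcond=None)
            if np.linalg.norm(D @ coef - x) < tol:
                return {n: round(float(c), 3) for n, c in zip(subset, coef)}
    return None

print(sparsest_decomposition(bunny, dictionary))          # {'bunny': 1.0}
print(sparsest_decomposition(bunny + cute, dictionary))   # {'cute': 1.0, 'bunny': 1.0}
```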
So if we look at Lucius’ initial statement:
The features a model thinks in do not need to form a basis or dictionary for its activations. [emphasis mine]
They don’t need to, but they can form a basis. It very well could be simpler to not constrain our understanding/model of the NN’s features as forming a basis.
Ideally Lucius can just show us this magical method that gets you simple components that don’t form a basis; then we’d all have a simpler time understanding his point. I believe this “magical method” is Attribution-based Parameter Decomposition (APD), which they (lucius, dan, lee?) have been working on, and I would be excited if more people tried creative methods to scale it up. I’m unsure if this method will work, but it is a different bet than e.g. SAEs & currently underrated imo.
But they’d be too unchanged: the “afraid of mice” circuit would still be checking for “grey and big and mammal and …” as the finetune dataset included no facts about animal fears. While some newer circuits formed during fine tuning would be checking for “grey and big and mammal and … and high-scrabble-scoring”. Any interpretability tool that told you that “grey and big and mammal and …” was “elephant” in the first model is now going to have difficulty representing the situation.
Thank you, this is a good example of a type-of-thing to watch out for in circuit interpretation. I had not thought of this before. I agree that an interpretability tool that rounded those two circuits off to taking in the ‘same’ feature would be a bad interpretability tool. It should just show you that those two circuits exist, and have some one-dimensional features they care about, and those features are related but non-trivially distinct.
But this is not at all unique to the sort of model used in the counterexample. A ‘normal’ model can still have one embedding direction for elephant, $\vec{f}_{\text{elephant}}$, at one point, used by a circuit $C_1$, then in fine tuning switch to a slightly different embedding direction $\vec{f}_{\text{elephant}}'$. Maybe it learned more features in fine tuning, some of those features are correlated with elephants and ended up a bit too close in cosine similarity to $\vec{f}_{\text{elephant}}$, and so interference can be lowered by moving the embedding around a bit. A circuit $C_2$ learned in fine tuning would then be reading from this $\vec{f}_{\text{elephant}}'$ and not match $C_1$, which is still reading in $\vec{f}_{\text{elephant}}$. You might argue that $C_1$ will surely want to adjust to start using $\vec{f}_{\text{elephant}}'$ as well to lower the loss, but that would seem to apply equally well to your example. So I don’t see how this shows that the model used in the original counterexample has no notion of an elephant in a sense that does not also apply to the sort of models people might tend to imagine when they think in the conventional SDL paradigm.
EDIT: On a second read, I think I misunderstood you here. You seem to think the crucial difference is that the delta between $\vec{f}_{\text{elephant}}$ and $\vec{f}_{\text{elephant}}'$ is mostly ‘unstructured’, whereas the difference between “grey and big and mammal and …” and “grey and big and mammal and … and high-scrabble-scoring” is structured. I don’t see why that should matter though. So long as our hypothetical interpretability tool is precise enough to notice the size of the discrepancy between those features and not throw them into the same pot, we should be fine. For that, it wouldn’t seem to me to really matter much whether the discrepancy is ‘meaningful’ to the model or not.
I’m with @chanind: If elephant is fully represented by a sum of its attributes, then it’s quite reasonable to say that the model has no fundamental notion of an elephant in that representation. ...
This is not a load-bearing detail of the example. If you like, you can instead imagine a model that embeds 1000 animals in an e.g. 100-dimensional subspace, with a 50 dimensional sub-sub-space where the embedding directions correspond to 50 attributes, and a 50 dimensional sub-sub-space where embeddings are just random.
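A minimal sketch of this modified construction, with the unstated details (attribute sparsity, scaling) filled in by assumption:

```python
# Sketch of the modified construction (details such as the attribute sparsity and
# scaling are assumptions of mine, not specified in the comment).
import numpy as np

rng = np.random.default_rng(0)
n_animals, d_attr, d_rand = 1000, 50, 50

# Attribute half: each animal gets a sparse profile over 50 attributes, embedded
# along the attribute directions (here taken to be the first 50 coordinates).
attr_profiles = rng.random((n_animals, d_attr)) * (rng.random((n_animals, d_attr)) < 0.1)

# Random half: an extra 50-dimensional embedding component with no attribute structure.
rand_part = rng.standard_normal((n_animals, d_rand)) / np.sqrt(d_rand)

# Each animal's embedding lives in the 100-dimensional subspace [attributes | random].
animal_embeddings = np.concatenate([attr_profiles, rand_part], axis=1)
print(animal_embeddings.shape)   # (1000, 100)

# A downstream circuit that cares about an attribute still reads a single
# low-rank projection, e.g. the first attribute coordinate:
grey_readout = animal_embeddings[:, 0]
```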
This should still get you basically the same issues the original example did I think? For any dictionary decomposition of the activations you pick, some of the circuits will end up looking like a horrible mess, even though they’re secretly taking in a very low-rank subspace of the activations that’d make sense to us if we looked at it. I should probably double check that when I’m more awake though.[1]
I think the central issue here is mostly just having some kind of non-random, ‘meaningful’ feature embedding geometry that the circuits care about, instead of random feature embeddings.

[1] EDIT: I am now more awake. I still think this is right.