I’m with @chanind: If elephant is fully represented by a sum of its attributes, then it’s quite reasonable to say that the model has no fundamental notion of an elephant in that representation.
Yes, the combination “grey + big + mammal + …” is special in some sense. If the model needed to recall that elephans are afraid of mice, the circuit would appear to check “grey and big and mammal” and that’s an annoying mouthful that would be repeated all over the model. But it’s a faithful representation of what’s going on.
Let me be precise by what I mean “has no fundamental notion of an elephant”. Suppose I tried to fine tune the model to represent some new fact about animals, say, if they are worth a lot of points in Scrabble. One way the model could do this by squeezing another feature into the activation space. The other features might rotate a little during this training, but all the existing circuitry would basically continue functioning unchanged.
But they’d be too unchanged: the “afraid of mice” circuit would still be checking for “grey and big and mammal and …” as the finetune dataset included no facts about animal fears. While some newer circuits formed during fine tuning would be checking for “grey and big and mammal and … and high-scrabble-scoring”. Any interpretability tool that told you that “grey and big and mammal and …” was “elephant” in the first model is now going to have difficulty representing the situation.
Meanwhile, consider a “normal” model that has a residual notion of an elephant after you take away all all facts about elephants. Then both old and new circuits would contain references to that residual (plus other junk) and one could meaningfully say both circuits have something in common.
Your example, which represents animals purely by their properties, reminds me of this classic article, which argues that a key feature in thought is forming concepts of things that are independent of the properties we learnt about them.
I too agreed w/ Chanind initially, but I think I see where Lucius is coming from.
If we forget about a basis & focused on minimal description length (MDL), it’d be nice to have a technique that found the MDL [features/components] for each datapoint. e.g. in my comment, I have 4 animals (bunny, etc) & two properties (cute, furry). For MDL reasons, it’d be great to sometimes use cute/furry & sometimes use Bunny if that reflects model computation more simply.
If you have both attributes & animals as fundamental units (and somehow have a method that tells you which minimal set of units form each datapoint) then a bunny will just use the bunny feature (since that’s simpler than cute + furry + etc), & a very cute bunny will use bunny + cute (instead of bunny + 0.1*dolphin + etc (or similar w/ properties)).
So if we look at Lucius initial statement:
The features a model thinks in do not need to form a basis or dictionary for its activations. [emphasis mine]
They don’t need to, but they can form a basis. It very well could be simpler to not constrain our understanding/model of the NN’s features as forming a basis.
Ideally Lucius can just show us this magical method that gets you simple components that don’t form a basis, then we’d all have a simpler time understanding his point. I believe this “magical method” is Attribution based parameter decomposition (APD) that they (lucius, dan, lee?) have been working on, which I would be excited if more people tried creative methods to scale this up. I’m unsure if this method will work, but it is a different bet than e.g. SAEs & currently underrated imo.
But they’d be too unchanged: the “afraid of mice” circuit would still be checking for “grey and big and mammal and …” as the finetune dataset included no facts about animal fears. While some newer circuits formed during fine tuning would be checking for “grey and big and mammal and … and high-scrabble-scoring”. Any interpretability tool that told you that “grey and big and mammal and …” was “elephant” in the first model is now going to have difficulty representing the situation.
Thank you, this is a good example of a type-of-thing to watch out for in circuit interpretation. I had not thought of this before. I agree that an interpretability tool that rounded those two circuits off to taking in the ‘same’ feature would be a bad interpretability tool. It should just show you that those two circuits exist, and have some one dimensional features they care about, and those features are related but non-trivially distinct.
But this is not at all unique to the sort of model used in the counterexample. A ‘normal’ model can still have one embedding direction for elephant →felephant at one point, used by a circuit C1, then in fine tuning switch to a slightly different embedding direction →felephant′. Maybe it learned more features in fine tuning, some of those features are correlated with elephants and ended up a bit too close in cosine similarity to →felephant, and so interference can be lowered my moving the embedding around a bit. A circuit C2 learned in fine tuning would then be reading from this →felephant′ and not match C1 which is still reading in →felephant. You might argue that C1 will surely want to adjust to start using →felephant′ as well to lower the loss, but that would seem to apply equally well to your example. So I don’t see how this is showing that the model used in the original counterexample has no notion of an elephant in a sense that does not also apply to the sort of models people might tend to imagine when they think in the conventional SDL paradigm.
EDIT: On a second read, I think I misunderstood you here. You seem to think the crucial difference is that the delta between →felephantand →felephant′ is mostly ‘unstructured’, whereas the difference between “grey and big and mammal and …” and “grey and big and mammal and … and high-scrabble-scoring” is structured. I don’t see why that should matter though. So long as our hypothetical interpretability tool is precise enough to notice the size of the discrepancy between those features and not throw them into the same pot, we should be fine. For that, it wouldn’t seem to me to really matter much whether the discrepancy is ‘meaningful’ to the model or not.
I’m with @chanind: If elephant is fully represented by a sum of its attributes, then it’s quite reasonable to say that the model has no fundamental notion of an elephant in that representation. ...
This is not a load-bearing detail of the example. If you like, you can instead imagine a model that embeds 1000 animals in an e.g. 100-dimensional subspace, with a 50 dimensional sub-sub-space where the embedding directions correspond to 50 attributes, and a 50 dimensional sub-sub-space where embeddings are just random.
This should still get you basically the same issues the original example did I think? For any dictionary decomposition of the activations you pick, some of the circuits will end up looking like a horrible mess, even though they’re secretly taking in a very low-rank subspace of the activations that’d make sense to us if we looked at it. I should probably double check that when I’m more awake though.[1]
I think the central issue here is mostly just having some kind of non-random, ‘meaningful’ feature embedding geometry that the circuits care about, instead of random feature embeddings.
I’m with @chanind: If elephant is fully represented by a sum of its attributes, then it’s quite reasonable to say that the model has no fundamental notion of an elephant in that representation.
Yes, the combination “grey + big + mammal + …” is special in some sense. If the model needed to recall that elephans are afraid of mice, the circuit would appear to check “grey and big and mammal” and that’s an annoying mouthful that would be repeated all over the model. But it’s a faithful representation of what’s going on.
Let me be precise by what I mean “has no fundamental notion of an elephant”. Suppose I tried to fine tune the model to represent some new fact about animals, say, if they are worth a lot of points in Scrabble. One way the model could do this by squeezing another feature into the activation space. The other features might rotate a little during this training, but all the existing circuitry would basically continue functioning unchanged.
But they’d be too unchanged: the “afraid of mice” circuit would still be checking for “grey and big and mammal and …” as the finetune dataset included no facts about animal fears. While some newer circuits formed during fine tuning would be checking for “grey and big and mammal and … and high-scrabble-scoring”. Any interpretability tool that told you that “grey and big and mammal and …” was “elephant” in the first model is now going to have difficulty representing the situation.
Meanwhile, consider a “normal” model that has a residual notion of an elephant after you take away all all facts about elephants. Then both old and new circuits would contain references to that residual (plus other junk) and one could meaningfully say both circuits have something in common.
Your example, which represents animals purely by their properties, reminds me of this classic article, which argues that a key feature in thought is forming concepts of things that are independent of the properties we learnt about them.
I too agreed w/ Chanind initially, but I think I see where Lucius is coming from.
If we forget about a basis & focused on minimal description length (MDL), it’d be nice to have a technique that found the MDL [features/components] for each datapoint. e.g. in my comment, I have 4 animals (bunny, etc) & two properties (cute, furry). For MDL reasons, it’d be great to sometimes use cute/furry & sometimes use Bunny if that reflects model computation more simply.
If you have both attributes & animals as fundamental units (and somehow have a method that tells you which minimal set of units form each datapoint) then a bunny will just use the bunny feature (since that’s simpler than cute + furry + etc), & a very cute bunny will use bunny + cute (instead of bunny + 0.1*dolphin + etc (or similar w/ properties)).
So if we look at Lucius initial statement:
They don’t need to, but they can form a basis. It very well could be simpler to not constrain our understanding/model of the NN’s features as forming a basis.
Ideally Lucius can just show us this magical method that gets you simple components that don’t form a basis, then we’d all have a simpler time understanding his point. I believe this “magical method” is Attribution based parameter decomposition (APD) that they (lucius, dan, lee?) have been working on, which I would be excited if more people tried creative methods to scale this up. I’m unsure if this method will work, but it is a different bet than e.g. SAEs & currently underrated imo.
Thank you, this is a good example of a type-of-thing to watch out for in circuit interpretation. I had not thought of this before. I agree that an interpretability tool that rounded those two circuits off to taking in the ‘same’ feature would be a bad interpretability tool. It should just show you that those two circuits exist, and have some one dimensional features they care about, and those features are related but non-trivially distinct.
But this is not at all unique to the sort of model used in the counterexample. A ‘normal’ model can still have one embedding direction for elephant →felephant at one point, used by a circuit C1, then in fine tuning switch to a slightly different embedding direction →felephant′. Maybe it learned more features in fine tuning, some of those features are correlated with elephants and ended up a bit too close in cosine similarity to →felephant, and so interference can be lowered my moving the embedding around a bit. A circuit C2 learned in fine tuning would then be reading from this →felephant′ and not match C1 which is still reading in →felephant. You might argue that C1 will surely want to adjust to start using →felephant′ as well to lower the loss, but that would seem to apply equally well to your example. So I don’t see how this is showing that the model used in the original counterexample has no notion of an elephant in a sense that does not also apply to the sort of models people might tend to imagine when they think in the conventional SDL paradigm.
EDIT: On a second read, I think I misunderstood you here. You seem to think the crucial difference is that the delta between →felephantand →felephant′ is mostly ‘unstructured’, whereas the difference between “grey and big and mammal and …” and “grey and big and mammal and … and high-scrabble-scoring” is structured. I don’t see why that should matter though. So long as our hypothetical interpretability tool is precise enough to notice the size of the discrepancy between those features and not throw them into the same pot, we should be fine. For that, it wouldn’t seem to me to really matter much whether the discrepancy is ‘meaningful’ to the model or not.
This is not a load-bearing detail of the example. If you like, you can instead imagine a model that embeds 1000 animals in an e.g. 100-dimensional subspace, with a 50 dimensional sub-sub-space where the embedding directions correspond to 50 attributes, and a 50 dimensional sub-sub-space where embeddings are just random.
This should still get you basically the same issues the original example did I think? For any dictionary decomposition of the activations you pick, some of the circuits will end up looking like a horrible mess, even though they’re secretly taking in a very low-rank subspace of the activations that’d make sense to us if we looked at it. I should probably double check that when I’m more awake though.[1]
I think the central issue here is mostly just having some kind of non-random, ‘meaningful’ feature embedding geometry that the circuits care about, instead of random feature embeddings.
EDIT: I am now more awake. I still think this is right.