This was a really thought-provoking post; thanks for writing it! I thought this was an unusually good attempt to articulate problems with the current interpretability paradigm and do some high-level thinking about what we could do differently. However, I think a few of the specific points are weaker than you make them seem in a way that somewhat contradicts the title of the post. I also may be misunderstanding parts, so please let me know if that’s the case.
Problems 2 and 3 (the learned feature dictionary may not match the model’s feature dictionary, and activation space interpretability can fail to find compositional structure) both seem to be specific instances of ‘you are finding the right underlying features, but broken down differently from how the model is actually thinking about them’. This seems like a double-edged sword to me. On one hand, it would be nice to know what level of abstraction the model is using. On the other hand, it’s useful to be able to analyze the computation at different levels of abstraction. And, if the model is breaking things down differently from the exact features you find, you may be able to piece this together from its downstream computation. I think these problems could just as easily be seen as a good thing, and they definitely don’t doom activation-space interpretability.
Problem 1, activations can contain structure of the data distribution that the models themselves don’t ‘know’ about, seems correct. However, this largely seems solvable by taking into account the effect of features on the output. E.g., as you later mention, attribution dictionary learning and E2E SAEs both seem like great attempts to tackle this problem.
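To make the “take the effect on the output into account” idea concrete, here is a minimal sketch of the kind of objective an E2E-style SAE optimizes, in PyTorch. The `sae` and `model_tail` objects are hypothetical stand-ins (an autoencoder with `encode`/`decode` methods, and the remainder of the model after the hooked layer); actual implementations differ in the exact loss terms.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: `acts` are mid-layer activations, `model_tail` maps them
# to logits, and `sae` is a sparse autoencoder over the activation space.
# Instead of only minimizing activation reconstruction error, an E2E-style
# objective asks the reconstruction to preserve the model's *output*, so
# latents encoding structure the model never uses downstream get no credit.

def e2e_sae_loss(sae, model_tail, acts, sparsity_coeff=1e-3):
    latents = sae.encode(acts)            # sparse codes
    recon = sae.decode(latents)           # reconstructed activations
    logits_orig = model_tail(acts)
    logits_recon = model_tail(recon)
    # KL between output distributions: penalizes only functionally relevant error.
    kl = F.kl_div(
        F.log_softmax(logits_recon, dim=-1),
        F.softmax(logits_orig, dim=-1),
        reduction="batchmean",
    )
    sparsity = latents.abs().sum(dim=-1).mean()
    return kl + sparsity_coeff * sparsity
```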
Re problem 4, function approximation creates artefacts: in general, it’s never possible to distinguish orthogonal directions without figuring out what class of input examples the feature fires on. For instance, in the x² example, you might end up with 5-10 different features activating to reconstruct something as simple as a representation of x². But these features would faithfully activate on inputs where the model needs to compute x², at least as well as any other normal feature activates on related inputs. Additionally, insofar as these 5-10 features are confusing, you can still find the x² feature in the network’s downstream computation, e.g. by using transcoders.
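As a toy version of the x² example (my own construction, not the post’s): below, a one-hidden-layer ReLU approximation of x² is built by hand, and several “kink” units have to fire together to represent x², even though each unit individually activates cleanly on a well-defined range of inputs.

```python
import numpy as np

# Toy x^2 example: approximate x^2 on [0, 1] with "kink" units
# h_i(x) = ReLU(x - k_i). Each unit is a clean direction in activation space
# that fires on a well-defined range of inputs, but several of them must
# activate together to represent x^2 -- the kind of approximation artefact an
# activation-space decomposition can surface as many separate "features".

knots = np.linspace(0.0, 1.0, 8)                     # kink locations k_i
x = np.linspace(0.0, 1.0, 201)

h = np.maximum(0.0, x[:, None] - knots[None, :])     # hidden activations, (201, 8)

# Readout weights: the interpolant's slope on segment [k_i, k_{i+1}] is
# k_i + k_{i+1}, and each ReLU unit adds its weight to the slope at its kink.
segment_slopes = knots[:-1] + knots[1:]
w = np.diff(segment_slopes, prepend=0.0)             # 7 readout weights
approx = h[:, :-1] @ w                               # last unit kinks at x=1, unused

print("max |approx - x^2|:", np.abs(approx - x**2).max())   # ~0.005
print("units active at x=0.9:", int((h[180] > 0).sum()))    # 7 of the 8 units
```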
One problem here is that your SAEs could get overwhelmed with too many function approximation features, making it much more difficult to analyze. I don’t have strong priors on whether or not this is true, but from the empirical results, I tentatively don’t think it is?
Again, thanks for writing the post, and please let me know if I’m missing anything / if you have general thoughts on this comment (:
My general issue with most of your counterpoints is that they apply just as much to the standard basis of the network. That is, the neurons in the MLPs, the residual stream activations as they are in PyTorch, etc.
The standard basis represents the activations of the network completely faithfully. It does this even better than techniques like SAEs, which always have some amount of reconstruction error. All the model’s features will be linear combinations of activations in the standard basis, so it does have ‘the right underlying features, but broken down differently from how the model is actually thinking about them’.
Same for all your other points. Theoretically, can you solve problems 1, 2 and 3 with the standard basis by taking information about how the model is computing downstream into account in the right way? Sure. You’d ‘take it into account’ by finding some completely new basis. Can you solve problem 4 with transcoders? I think vanilla versions would struggle, because the transcoder needs to combine many latents to form an x², but probably, yes.
But our point is that ‘piecing together the model’s features from its downstream computations’ is the whole job of a decomposition. If you have to use information about the model’s computations to find the features of the model, you’re pretty much conceding that what we call activation space interpretability here doesn’t work:
What do we mean by activation space interpretability? Interpretability work that attempts to understand neural networks by explaining the inputs and outputs of their layers in isolation. In this post, we focus in particular on the problem of decomposing activations, via techniques such as sparse autoencoders (SAEs), PCA, or just by looking at individual neurons. This is in contrast to interpretability work that leverages the wider functional structure of the model and incorporates more information about how the model performs computation. Examples of existing techniques using such information include Transcoders, end2end-SAEs and joint activation/gradient PCAs.
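To make the quoted distinction concrete, here is a schematic contrast between a pure activation-space loss (an SAE) and a loss that uses information about the layer’s computation (a transcoder). The `encoder`/`decoder` modules are hypothetical stand-ins, and real implementations differ in many details.

```python
import torch
import torch.nn.functional as F

# Schematic only. An activation-space decomposition like an SAE explains a
# layer's activations purely in terms of themselves; a transcoder instead
# predicts the layer's output from its input through a sparse bottleneck,
# i.e. it uses the layer's computation.

def sae_loss(encoder, decoder, acts, l1_coeff=1e-3):
    z = F.relu(encoder(acts))
    return F.mse_loss(decoder(z), acts) + l1_coeff * z.abs().sum(-1).mean()

def transcoder_loss(encoder, decoder, mlp_in, mlp_out, l1_coeff=1e-3):
    z = F.relu(encoder(mlp_in))
    return F.mse_loss(decoder(z), mlp_out) + l1_coeff * z.abs().sum(-1).mean()
```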
I am also skeptical that the techniques you name (e2e SAEs, transcoders, sparse dictionary learning on attributions) suffice to solve all problems in this class in their current form. That would have been a separate discussion beyond the scope of this post though. All we’re trying to say here is that you do very likely need to leverage the wider functional structure of the model and incorporate more information about how the model performs computation to decompose the model well.
Thanks for the response! I still think that most of the value of SAEs comes from finding a human-interpretable basis, and most of these problems don’t directly interfere with this property. I’m also somewhat skeptical that SAEs actually do find a human-interpretable basis, but that’s a separate question.
All the model’s features will be linear combinations of activations in the standard basis, so it does have ‘the right underlying features, but broken down differently from how the model is actually thinking about them’.
I think this is a fair point. I also think there’s a real sense in which it’s useful to know that the model is representing the concepts “red” and “square,” even if the model thinks of them as separate Red and Square features while your SAE only found a combined “red square” feature. It’s much harder to figure out what concepts the model is representing in human-interpretable terms by staring at activations in the standard basis. There’s a big difference between “we know what human-interpretable concepts the model is representing but not exactly what structure it uses to think of them” and “we just don’t know what concepts the model is representing to begin with.” I think if we could do the former well, that would already be amazing.
Put slightly more strongly: The question of whether the model thinks in terms of “red square” or “red” and “square” is moot, because the model does not actually think in terms of these concepts to begin with. The model thinks in its own language, and our job is to translate that language to our own. In this {red, blue} × {square, circle} space, looking at the attribution of “red square” and “red circle” to downstream features should give us the same result as looking at the attribution of “red” to downstream features, since red square and red circle encompass the full range of possibilities of things that can be red. It might be more convenient for us if we find the features that make the model’s attribution graph as simple as possible across a wide range of input examples, but there’s no real sense in which we’re misrepresenting its thought process.
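A toy numerical check of this claim (my own construction: a hypothetical 4-dimensional activation space with orthonormal atomic directions, and “attribution” taken to be latent coefficient times the dot product of the latent’s direction with a downstream readout):

```python
import numpy as np

# Atomic directions for red, blue, square, circle; a red square is represented
# compositionally as red + square, and so on.
r, b, s, c = np.eye(4)
w = np.array([0.7, -0.2, 0.4, 0.1])          # some downstream readout direction

def attr(direction, coeff=1.0):
    # attribution of a latent to the readout w
    return coeff * float(w @ direction)

# On a red-square input, an "atomic" dictionary fires red and square; a
# "composed" dictionary fires the single red-square latent (direction r + s).
# The total attribution to the downstream feature is identical either way.
atomic_total   = attr(r) + attr(s)
composed_total = attr(r + s)
assert np.isclose(atomic_total, composed_total)

# Averaging the composed attributions over everything red (red square and red
# circle) still isolates the red contribution once you compare against blue.
red_avg  = (attr(r + s) + attr(r + c)) / 2
blue_avg = (attr(b + s) + attr(b + c)) / 2
assert np.isclose(red_avg - blue_avg, attr(r) - attr(b))
```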
Same for all your other points. Theoretically, can you solve problems 1, 2 and 3 with the standard basis by taking information about how the model is computing downstream into account in the right way? Sure. You’d ‘take it into account’ by finding some completely new basis.
Solving problem 1 could just entail adjusting the SAE basis while retaining most of its value! Solving problems 2 and 3 would require finding a basis which represents the same human-interpretable features but in a somewhat different way. Insofar as SAE features are actually human-interpretable (which I’m often skeptical of), I think this basis adds a ton of value.
I am also often skeptical of SAEs, but I feel that the biggest problem is that they don’t actually capture the sum of the concepts the model is representing in a human-interpretable way. If they actually did this correctly, I would happily forgo knowing the exact structure the model uses to think of them, and I would be alright if there were extra artifacts that made them harder to analyze.
(Also, oops, I didn’t realize that by activation-space you meant one layer’s activations only).