Nice post! Re issue 1, there are a few things that you can do to work out if a representation you have found is a ‘model feature’ or a ‘dataset feature’. You can:
Check if intervening on the forward pass to modify this feature produces the expected effect on outputs. Caveats:
the best vector for probing is not the best vector for steering (in general the inverse of a matrix is not the transpose, and finding a basis of steering vectors from a basis of probe vectors involves inverting the basis matrix)
It’s possible that the feature you found is causally upstream of some features the model has learned, so even if the model hasn’t learned this feature itself, changing it affects things the model is aware of. OTOH, I’m not sure whether I want to say that this feature has not been learned by the model in this case.
Some techniques, eg crosscoders, don’t come equipped with a well-defined notion of intervening on the feature during a forward pass.
Nonetheless, we can still sometimes get evidence this way, in particular about whether our probe has found subtle structure in the data that is really causally irrelevant to the model. This is already a common technique in interpretability (see eg the inimitable Golden Gate Claude, and many more systematic steering tests like this one); a minimal sketch of this kind of intervention test is included below.
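For concreteness, here is a minimal sketch of this kind of intervention test in PyTorch. It assumes you already have a probe direction probe_direction at a known layer; the module path and the generate call are illustrative placeholders rather than any specific library’s API:

```python
import torch

def add_steering_hook(module, v, alpha=5.0):
    """Register a forward hook that adds alpha * v to the module's output.

    `module` is assumed to be the block whose activations the probe was trained
    on (eg model.transformer.h[10] in a GPT-2-style model -- an illustrative
    name), and `v` the feature direction found by the probe.
    """
    v = v / v.norm()  # steer with the unit-norm direction

    def hook(mod, inputs, output):
        # Many transformer blocks return a tuple; the hidden states come first.
        if isinstance(output, tuple):
            return (output[0] + alpha * v,) + output[1:]
        return output + alpha * v

    return module.register_forward_hook(hook)

# Hypothetical usage: generate with and without the hook and check whether the
# change in outputs matches what the feature is supposed to encode.
# handle = add_steering_hook(model.transformer.h[10], probe_direction)
# steered = generate(model, prompt)   # placeholder generation function
# handle.remove()
```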
Run various shuffle/permutation controls:
Measure the selectivity of your feature finding technique: replace the structure in the data with some new structure (or just remove the structure) and then see if your probe finds that new structure. To the extent that the probe can learn the new structure, it is not telling you about what the model has learned.
Most straightforwardly: if you have trained a supervised probe, you can train a second supervised probe on a dataset with randomised labels, and look at how much more accurate the probe is when trained on data with true labels (sketched below). This can help distinguish between the hypothesis that you have found a real variable in the model and the null hypothesis that the probing technique is powerful enough to find a direction that can classify any dataset with that accuracy. Selectivity tests should do things like match the bias of the training data (eg if you are probing for a sparsely activating feature, the value of the feature is almost always zero, and that should be preserved in the control).
You can also test unsupervised techniques like SAEs this way by training them on random sequences of tokens. There are probably more sophisticated controls that can be introduced here: eg you can try to destroy all the structure in the data and replace it with random structure that is still sparse in the same sense, and so on.
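To make the supervised-probe version of this control concrete, here is a minimal sketch using scikit-learn. The acts and labels arrays are stand-ins for cached activations and dataset labels, and the whole setup is illustrative rather than a specific recommended protocol:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 512))       # stand-in for cached activations
labels = rng.integers(0, 2, size=2000)    # stand-in for true dataset labels

def probe_accuracy(X, y, seed=0):
    """Train a linear probe and return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Control: a permutation of the labels preserves their base rate, which matters
# for the sparse-feature case mentioned above.
shuffled = rng.permutation(labels)

true_acc = probe_accuracy(acts, labels)
control_acc = probe_accuracy(acts, shuffled)
print(f"true-label probe accuracy:     {true_acc:.3f}")
print(f"shuffled-label probe accuracy: {control_acc:.3f}")
print(f"selectivity (gap):             {true_acc - control_acc:.3f}")
```

The quantity of interest is the gap: a probe that does nearly as well on shuffled labels is mostly telling you about the expressiveness of the probing technique, not about a variable the model has learned.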
In addition to experiments that destroy the probe training data, you can also run experiments that destroy the structure in the model weights. To the extent that the probe works here, it is not telling you about what the model has learned.
For example, reinitialise the weights of the model and then train the probe/SAE/look at the PCA directions. This is a weak control: a stronger control could reinitialise the model’s weights while matching the eigenspectrum of each weight matrix to that of the corresponding matrix in the trained model (to rule out explanations like the SAE failing on the randomised model simply because the activation vectors are too small, etc.), although that control is still quite weak; one concrete version is sketched below.
This control was used nicely in Towards Monosemanticity here, although I think much more research of this form could be done with SAEs and their cousins.
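As one concrete reading of the spectrum-matched control above, here is a sketch in PyTorch. It interprets ‘matching the eigenspectrum’ of a (generally non-square) weight matrix as keeping its singular values while randomising its singular vectors; both that interpretation and the usage snippet are assumptions of mine, not the procedure used in Towards Monosemanticity:

```python
import torch

def reinit_matched_spectrum(W: torch.Tensor) -> torch.Tensor:
    """Return a random matrix with the same shape and singular values as W.

    The learned structure (the singular vectors) is destroyed, but the overall
    scale of the map is preserved, so a failure of the probe/SAE on the
    reinitialised model can't be blamed on activations simply being too small.
    """
    m, n = W.shape
    s = torch.linalg.svdvals(W)                   # spectrum of the trained matrix
    Q1, _ = torch.linalg.qr(torch.randn(m, m))    # random orthonormal basis (output side)
    Q2, _ = torch.linalg.qr(torch.randn(n, n))    # random orthonormal basis (input side)
    k = min(m, n)
    return Q1[:, :k] @ torch.diag(s) @ Q2[:, :k].T

# Hypothetical usage: apply to every 2-d weight matrix in the model, then rerun
# the probe/SAE training on the resulting activations.
# with torch.no_grad():
#     for p in model.parameters():
#         if p.ndim == 2:
#             p.copy_(reinit_matched_spectrum(p))
```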
I am told by Adam Shai that in experimental neuroscience, it is something of a sport to come up with better and better controls for testing the hypothesis that you have identified structure. Maybe some of that energy should be imported to interp?
Probably some other things not on my mind right now??
I am aware that being able to identify whether your features are model features or dataset features is less useful than having a technique that zero-shot identifies model features only. However, a reliable set of tools for distinguishing which type of feature we have found would give us feedback loops that could help us search for good feature-finding techniques. Eg, good controls would give us the freedom to do things like searching over (potentially nonlinear) probe architectures for those with high accuracy relative to the control (in the absence of the control, searching over architectures would lead us to more and more expressive nonlinear probes that tell us nothing about the model’s computation). I’m curious whether this sort of thing would lead us away from treating activation vectors in isolation, as the post argues.
the best vector for probing is not the best vector for steering

I don’t understand this. If a feature is represented by a direction $v$ in the activations, surely the best probe for that feature will also be $v$, because then $\langle v, v \rangle$ is maximized.
AKA the predict/control discrepancy, from Section 3.3.1 of Wattenberg and Viegas, 2024
Also related to the idea that the best linear SAE encoder is not the transpose of the decoder.
[edit: I’m now thinking that actually the optimal probe vector is also orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$, so maybe the point doesn’t stand. In general, I think it is probably a mistake to talk about activation vectors as linear combinations of feature vectors, rather than as vectors that can be projected onto a set of interpretable read-off directions. See here for more.]
Yes, I’m calling the representation vector the same as the probing vector. Suppose my activation vector can be written as $\vec{a} = \sum_i f_i \vec{v}_i$, where the $f_i$ are feature values and the $\vec{v}_i$ are feature representation vectors. Then the probe vector which minimises MSE (explains most of the variance) is just $\vec{v}_i$. To avoid off-target effects, the vector $\vec{s}_i$ you want to steer with for feature $i$ might be the vector that is most ‘surgical’: it changes the value of this feature and no others. In that case it should be the vector that lies orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$, which is only the same as $\vec{v}_i$ if the set $\{\vec{v}_i\}$ is orthogonal.
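A small numerical sketch of that point, in numpy only for illustration: for a non-orthogonal basis, the vectors orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$ (normalised so that $\langle \vec{w}_i, \vec{v}_i \rangle = 1$) form the dual basis, which you obtain by inverting the Gram matrix of the basis, and which coincides with the $\vec{v}_i$ themselves only when those are orthonormal:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3                          # activation dim, number of features
V = rng.normal(size=(d, k))          # columns are feature vectors v_i (not orthogonal)
f = rng.normal(size=k)               # feature values
a = V @ f                            # activation a = sum_i f_i v_i

# Dual basis: columns w_i of V (V^T V)^{-1} satisfy <w_i, v_j> = delta_ij,
# so each w_i is orthogonal to span{v_j : j != i}. Getting it requires
# inverting the Gram matrix of the basis.
W = V @ np.linalg.inv(V.T @ V)

print(np.allclose(W.T @ V, np.eye(k)))   # True: <w_i, v_j> = delta_ij
print(np.allclose(W.T @ a, f))           # True: each w_i reads off f_i exactly
print(np.allclose(W, V))                 # False: w_i != v_i for a non-orthogonal basis

# For an orthonormal basis the dual basis collapses back to the v_i themselves.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
print(np.allclose(Q @ np.linalg.inv(Q.T @ Q), Q))   # True
```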
Obviously I’m working with a non-overcomplete basis of feature representation vectors here. If we’re dealing with the overcomplete case, then it’s messier. People normally talk about ‘approximately orthogonal vectors’, in which case the most surgical steering vector $\vec{s}_i \approx \vec{v}_i$, but (handwaving) you can also talk about something like ‘approximately linearly independent vectors’, in which case I think my point stands (note that SAE decoder directions are definitely not approximately orthogonal). For something less handwavey, see this appendix.
Makes sense—agreed!