It feels to me like evaluating any of the sentences in your comment rigorously requires a more specific definition of “feature”.
In the context of the SFH, ‘feature’ means what I called an ‘atom’ in the text, i.e. a linear direction in activation space with a specific function in the model. This implies that any mechanism can be usefully decomposed in terms of these. Finding a mechanism which is difficult to express in this model is counterevidence. I think you could rescue the ‘feature hypothesis’ by using a vaguer definition of ‘feature’ (which is a common move).
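To make the definition concrete, here is a minimal numpy sketch of the picture this assumes: activations as (approximately) sparse linear combinations of fixed unit directions, with a feature’s value read off by projection. The dictionary, sizes, and names are purely illustrative, not anything from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_atoms = 64, 512                      # illustrative sizes

# A dictionary of 'atoms': unit-norm directions in activation space.
atoms = rng.normal(size=(n_atoms, d_model))
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)

# Under this picture, an activation is (approximately) a sparse linear
# combination of a few active atoms...
active = rng.choice(n_atoms, size=5, replace=False)
coeffs = rng.uniform(1.0, 3.0, size=5)
activation = coeffs @ atoms[active]

# ...and 'the value of feature i' is read off by projecting onto its direction.
feature_values = atoms @ activation

# Up to interference between non-orthogonal atoms, the active atoms should
# score highest.
print(sorted(active), np.argsort(-feature_values)[:5])
```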
I see. If I understand you correctly, a mechanism, whether human-interpretable or not, which somehow seems to be functionally separate but is not explainable in terms of its operations on linear subspaces of activation space, would count as evidence against the strong feature hypothesis, right?
Aren’t the MLPs in a transformer straightforward examples of this?
(BTW, I agree with the main thrust of the post. I think that the linear feature hypothesis, in its most usefully strong forms, should be treated as false by default unless proven otherwise; I appreciate the thing you said two comments up about how “disproving a vague hypothesis is a bit difficult”.)
That’s certainly the most straightforward interpretation! I think a lot of the ideas I’m talking about here are downstream of the toy models paper, which introduces the idea that MLPs might fundamentally be explainable in terms of (approximate) manipulations of these kinds of linear subspaces; i.e. that everything that wasn’t explicable in this way would sort of function like noise, rather than playing a functional role.
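For concreteness, a toy sketch (not from the post or the TMS paper) of the two descriptions being contrasted here: the literal MLP computation versus an SFH-style redescription in which the layer reads some linearly represented atoms, applies a simple per-atom operation, and writes other atoms. All names, sizes, and the choice of g are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_mlp, n_atoms = 32, 128, 16            # illustrative sizes

W_in = rng.normal(scale=d_model ** -0.5, size=(d_model, d_mlp))
W_out = rng.normal(scale=d_mlp ** -0.5, size=(d_mlp, d_model))

def mlp(x):
    # The literal computation: a dense nonlinear map on the residual stream
    # (ReLU MLP, biases omitted).
    return np.maximum(x @ W_in, 0.0) @ W_out

# The hoped-for redescription: read a few linearly represented input atoms,
# apply a simple per-atom function g, and write a combination of output atoms,
# with whatever is left over treated as noise.
in_atoms = rng.normal(size=(n_atoms, d_model))   # hypothetical dictionaries,
out_atoms = rng.normal(size=(n_atoms, d_model))  # purely illustrative

def feature_view(x, g=lambda a: np.maximum(a, 0.0)):
    reads = in_atoms @ x          # read each atom off its linear direction
    return g(reads) @ out_atoms   # write a linear combination of output atoms

x = rng.normal(size=d_model)
residual = np.linalg.norm(mlp(x) - feature_view(x))
# The strong claim is that *some* choice of atoms and g makes this residual
# behave like unstructured noise for everything the layer does; the worry in
# the post is that for real MLPs no such choice exists.
```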
I think I agree that this should have been treated with a lot more suspicion than it was in interpretability circles, but lots of people were excited about this paper, and then SAEs seemed to be sort of ‘magically’ finding lots of cool interpretable features based on this linear direction hypothesis. I think this seemed a bit like a validation of the linear feature idea in the TMS paper, which explains a certain amount of the excitement around it.
I think the main point I wanted to make with the post was that the TMS model was a hypothesis, that it was a rather vague hypothesis that we should probably pin down more clearly so we could think about whether it was true or not, and that the strongest versions of it were probably not true.
Oh yeah, I’m certainly agreeing with the central intent of the post, now just clarifying the above discussion.
One clarification—here, as stated, “mechanisms operating in terms of linearly represented atoms” doesn’t constrain the mechanisms themselves to be linear, does it? SAE latents themselves are some nonlinear function of the actual model activations. But if the mechanisms are substantially nonlinear we’re not really claiming much.
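To spell out the point about SAE latents: in the standard SAE setup the encoder is affine followed by a ReLU, so each latent is already a nonlinear function of the model’s activation, even though the decoder writes back a purely linear combination of directions. A minimal sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_latents = 64, 1024                    # illustrative sizes

W_enc = rng.normal(scale=d_model ** -0.5, size=(d_model, n_latents))
W_dec = rng.normal(scale=n_latents ** -0.5, size=(n_latents, d_model))
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)

def sae(x):
    # Encoder: each latent is a *nonlinear* (ReLU of affine) function of the
    # model activation x.
    latents = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    # Decoder: the reconstruction is a *linear* combination of latent
    # directions (the rows of W_dec).
    recon = latents @ W_dec + b_dec
    return latents, recon

latents, recon = sae(rng.normal(size=d_model))
```

The linearity claim thus lives in the decoder, and in how downstream mechanisms are supposed to use those directions, not in the encoder.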
My own impression is that things are nonlinear unless proven otherwise, and a priori I would really strongly expect the strong linear representation hypothesis to be just false. In general it seems extremely wishful to hope that exactly those things that are nonlinear (in whatever sense we mean) are not important, especially since we employ neural networks specifically to learn really weird functions we couldn’t have thought of ourselves.