Aren’t the MLPs in a transformer straightforward examples of this?
That's certainly the most straightforward interpretation! I think a lot of the ideas I'm talking about here are downstream of the toy models paper, which introduces the idea that MLPs might fundamentally be explained in terms of (approximate) manipulations of these kinds of linear subspaces; i.e. that everything that wasn't explicable in this way would sort of function like noise, rather than playing a functional role.
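For concreteness, here is a minimal sketch (my own toy example, not from the toy models paper; all names and numbers are made up) of what "linearly represented atoms" would mean: activations are approximately sparse linear combinations of fixed feature directions, and each feature can be (approximately) read off with a dot product, up to interference from the other non-orthogonal directions.

```python
# Toy illustration of the (strong) linear representation picture.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 512  # more features than dimensions: superposition

# Hypothetical dictionary of unit-norm feature directions.
feature_dirs = rng.normal(size=(n_features, d_model))
feature_dirs /= np.linalg.norm(feature_dirs, axis=1, keepdims=True)

# A sparse set of "active" features with positive coefficients.
coeffs = np.zeros(n_features)
active = rng.choice(n_features, size=5, replace=False)
coeffs[active] = rng.uniform(0.5, 2.0, size=5)

# The activation is just the superposition of the active directions...
activation = coeffs @ feature_dirs

# ...and each feature's value is recovered by a linear readout, with some
# interference noise because the directions are not orthogonal.
readout = feature_dirs @ activation
print(readout[active])          # close to the true coefficients
print(np.abs(readout).mean())   # small interference on inactive features
```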
I think I agree this should have been treated with a lot more suspicion than it was in interpretability circles, but lots of people were excited about this paper, and then SAEs seemed to be sort of 'magically' finding lots of cool interpretable features based on this linear direction hypothesis. That seemed a bit like a validation of the linear feature idea in the TMS paper, which explains a certain amount of the excitement around it.
I think the main point I wanted to make with the post was that the TMS model was a hypothesis, that it was a rather vague hypothesis we should probably pin down more clearly so we could think about whether it was true or not, and that the strongest versions of it were probably not true.
Oh yeah, I’m certainly agreeing with the central intent of the post, now just clarifying the above discussion.
One clarification—here, as stated, “mechanisms operating in terms of linearly represented atoms” doesn’t constrain the mechanisms themselves to be linear, does it? SAE latents themselves are some nonlinear function of the actual model activations. But if the mechanisms are substantially nonlinear we’re not really claiming much.
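To make that point concrete, here is a minimal sketch (mine, not from the post; the weights are random stand-ins) of a vanilla ReLU SAE encoder, showing that each latent is already a nonlinear function of the activation vector, so a mechanism phrased in terms of latents need not be linear in the activations.

```python
# Sketch: SAE latents are ReLU(W_enc @ x + b), i.e. nonlinear in x.
import numpy as np

rng = np.random.default_rng(1)
d_model, n_latents = 64, 256

W_enc = rng.normal(size=(n_latents, d_model)) / np.sqrt(d_model)
b_enc = rng.normal(size=n_latents) - 1.0   # negative bias encourages sparsity

def sae_latents(x: np.ndarray) -> np.ndarray:
    """Encoder of a plain ReLU sparse autoencoder."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

x1, x2 = rng.normal(size=d_model), rng.normal(size=d_model)
# Nonlinearity: the latents of a sum are not the sum of the latents.
lhs = sae_latents(x1 + x2)
rhs = sae_latents(x1) + sae_latents(x2)
print(np.allclose(lhs, rhs))  # False in general, because of the ReLU
```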
My own impression is that things are nonlinear unless proven otherwise, and a priori I would really strongly expect the strong linear representation hypothesis to be just false. In general it seems extremely wishful to hope that exactly those things that are nonlinear (in whatever sense we mean) are not important, especially since we employ neural networks specifically to learn really weird functions we couldn’t have thought of ourselves.