This is a great article! I find the notion of a ‘tacit representation’ very interesting, and it makes me wonder whether we can construct a toy model where something is only tacitly (but not explicitly) represented. For example, having read the post, I’m updated towards believing that the goals of agents are represented tacitly rather than explicitly, which would make MI for agentic models much more difficult.
One minor point: There is a conceptual difference, but perhaps not an empirical difference, between ‘strong LRH is false’ and ‘strong LRH is true but the underlying features aren’t human-interpretable’. I think our existing techniques can’t yet distinguish between these two cases.
Relatedly, I (with collaborators) recently released a paper on evaluating steering vectors at scale: https://arxiv.org/abs/2407.12404. We found that many concepts (as defined in model-written evals) did not steer well, which has updated me towards believing that these concepts are not linearly represented. This in turn weakly updates me towards believing strong LRH is false, although this is definitely not a rigorous conclusion.
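For readers who haven't seen the setup, here is a rough sketch of the kind of steering being evaluated, in the spirit of contrastive activation addition; the helper names and shapes are hypothetical placeholders rather than code from the paper.

```python
import numpy as np

def difference_of_means_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Contrastive steering vector: mean activation over concept-positive prompts
    minus mean activation over concept-negative prompts. Returns shape [d_model]."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(resid: np.ndarray, steer: np.ndarray, alpha: float) -> np.ndarray:
    """Add the scaled steering vector to the residual stream at every token
    position. resid has shape [seq_len, d_model]."""
    return resid + alpha * steer

# Toy stand-ins for activations collected from a model at some layer.
rng = np.random.default_rng(0)
d_model = 16
pos_acts = rng.standard_normal((32, d_model))
neg_acts = rng.standard_normal((32, d_model))
steer = difference_of_means_vector(pos_acts, neg_acts)
steered = apply_steering(rng.standard_normal((8, d_model)), steer, alpha=4.0)
```

The relevant empirical question is whether a single direction like this exists for a given concept at all, which is what connects steering results to the LRH.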
While these are logically distinct, can you think of an experiment that would distinguish the two even in principle? In other words, you say "our existing techniques can't yet", but what would a technique that could distinguish them even look like?
IMO this is downstream of coming up with a good a priori definition of “features” that does not rely on human interpretation. Then you’ll likely have features which are non-human-interpretable.
I think that they are distinguishable. For instance, if you can find an example of a structure which doesn't fit the 'feature' model but clearly serves some algorithmic function, that would seem to be strong counter-evidence?

For example, this paper https://arxiv.org/abs/2405.14860 demonstrates that at least the one-dimensional feature model is not complete. There might be some way to express that in 'strong feature hypothesis' form by adding a lot of epicycles, but I think that sort of thing would be evidence against the idea of independent one-dimensional linear features.

The strong feature hypothesis does have the virtue of being strong, and is therefore quite vulnerable to counter-evidence! The main thing that makes this a bit more confusing is that exactly what the 'feature' hypothesis claimed was often left fairly vague, and disproving a vague hypothesis is quite difficult.
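To make 'not one-dimensional' concrete, here is a toy sketch of my own (not the paper's code) of the kind of structure they find: a cyclic concept like day-of-week living on a circle inside a two-dimensional subspace, where reading the concept out needs both directions plus a nonlinear step, rather than a projection onto any single direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_days = 32, 7

# A 2-D subspace of activation space spanned by two orthonormal directions.
basis, _ = np.linalg.qr(rng.standard_normal((d_model, 2)))

# Day k is represented as a point on a circle inside that subspace.
angles = 2 * np.pi * np.arange(n_days) / n_days
day_reps = np.stack([np.cos(angles), np.sin(angles)], axis=1) @ basis.T  # [7, d_model]

# Reading the concept requires both coordinates and an arctan, i.e. a
# two-dimensional (and nonlinear) read-out, not a single linear projection.
coords = day_reps @ basis                      # [7, 2]
decoded = np.arctan2(coords[:, 1], coords[:, 0])
print(np.round((decoded % (2 * np.pi)) / (2 * np.pi) * n_days))  # ~ 0..6
```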
It feels to me like evaluating any of the sentences in your comment rigorously requires a more specific definition of “feature”.
In the context of the SFH, 'feature' means what I called an 'atom' in the text, i.e. a linear direction in activation space with a specific function in the model. This implies that any mechanism can be usefully decomposed in terms of these. Finding a mechanism which is difficult to express in this model is counter-evidence. I think you could rescue the 'feature hypothesis' by using a vaguer definition of 'feature' (which is a common move).
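To spell out what that commits you to, a purely illustrative toy sketch of my own: activations are approximately sparse combinations of a fixed dictionary of unit directions, and a mechanism is something whose behaviour you can state as "read off these atoms, write those atoms".

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_atoms = 64, 512

# A hypothetical dictionary of 'atoms': unit-norm directions in activation space.
atoms = rng.standard_normal((n_atoms, d_model))
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)

# The decomposability claim: an activation is ~ a sparse combination of atoms.
active_idx = rng.choice(n_atoms, size=5, replace=False)
coeffs = rng.uniform(1.0, 3.0, size=5)
activation = coeffs @ atoms[active_idx]          # shape [d_model]

# A 'mechanism' in these terms: read the strength of one atom (a projection)
# and write another atom scaled by it, i.e. "if X is present, add Y".
read_atom = atoms[active_idx[0]]
write_atom = atoms[(active_idx[0] + 1) % n_atoms]
strength = activation @ read_atom
activation_out = activation + strength * write_atom
```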
I see. If I understand you correctly, a mechanism, whether human-interpretable or not, which somehow seems to be functionally separate but is not explainable in terms of its operations on linear subspaces of activation space, would count as evidence against the strong feature hypothesis, right?
Aren’t the MLPs in a transformer straightforward examples of this?
(BTW, I agree with the main thrust of the post. I think that the linear feature hypothesis, in its most usefully strong forms, should be default-false unless proven otherwise; I appreciate what you said two comments up about how "disproving a vague hypothesis is quite difficult".)
That's certainly the most straightforward interpretation! I think a lot of the ideas I'm talking about here are downstream of the toy models paper, which introduces the idea that MLPs might fundamentally be explained in terms of (approximate) manipulations of these kinds of linear subspaces; i.e. that everything that wasn't explicable in this way would function more like noise, rather than playing a functional role.
I think I agree this should have been treated with a lot more suspicion than it was in interpretability circles, but lots of people were excited about this paper, and then SAEs seemed to be sort of 'magically' finding lots of cool interpretable features based on this linear direction hypothesis. I think this seemed a bit like a validation of the linear feature idea in the TMS paper, which explains a certain amount of the excitement around it.
I think the main point I wanted to make with the post was that the TMS model was a hypothesis, that it was a fairly vague hypothesis we should get a clearer statement of so we can think about whether it is true, and that the strongest versions of it were probably not true.
Oh yeah, I’m certainly agreeing with the central intent of the post, now just clarifying the above discussion.
One clarification—here, as stated, “mechanisms operating in terms of linearly represented atoms” doesn’t constrain the mechanisms themselves to be linear, does it? SAE latents themselves are some nonlinear function of the actual model activations. But if the mechanisms are substantially nonlinear we’re not really claiming much.
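Concretely, the standard SAE form I have in mind looks something like the sketch below (one common variant; real implementations differ in details like decoder-norm constraints): the latents are a ReLU of an affine function of the activation, so they are nonlinear in the activation, even though the reconstruction is linear in the latents.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Vanilla sparse autoencoder as usually applied to residual-stream
    activations (one common variant; details vary between papers)."""
    z = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)   # latents: nonlinear in x
    x_hat = z @ W_dec + b_dec                          # reconstruction: linear in z
    return z, x_hat

rng = np.random.default_rng(0)
d_model, n_latents = 16, 64
W_enc = rng.standard_normal((d_model, n_latents)) * 0.1
W_dec = rng.standard_normal((n_latents, d_model)) * 0.1
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)
z, x_hat = sae_forward(rng.standard_normal(d_model), W_enc, b_enc, W_dec, b_dec)
```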
My own impression is that things are nonlinear unless proven otherwise, and a priori I would really strongly expect the strong linear representation hypothesis to be just false. In general it seems extremely wishful to hope that exactly those things that are nonlinear (in whatever sense we mean) are not important, especially since we employ neural networks specifically to learn really weird functions we couldn’t have thought of ourselves.
I'm glad you liked it.
I definitely agree that the LRH and the interpretability of the linear features are separate hypotheses; that was what I was trying to get at by having monosemanticity as a separate assumption to the LRH. I think these are logically independent: there could be some explicit representation in which everything corresponds to an interpretable feature, but the format is more complicated than linear (i.e. monosemanticity is true but the LRH is false); or, as you say, the network could in some sense be mostly manipulating features, but those features could be very hard to understand (LRH true, monosemanticity false); or they could both just be the wrong frame. I definitely think it would be good if we spent a bit more effort clarifying these distinctions; I hope this essay made some progress in that direction, but I don't think it's the last word on the subject.
I agree that coming up with experiments which would test the LRH in isolation is difficult. But maybe this should be more of a research priority; we ought to be able to formulate a version of the strong LRH which makes strong empirical predictions. I think something along the lines of https://arxiv.org/abs/2403.19647 is maybe going in the right direction here. In a shameless self-plug, I hope that LMI's recent work on open-sourcing a massive SAE suite (Gemma Scope) will let people test out this sort of thing.
Having said that, one reason I'm a bit pessimistic is that stronger versions of the LRH do seem to predict that there is some set of 'ground truth' features that a wide-enough or sufficiently well-tuned SAE ought to converge to (perhaps there should be some 'phase change' in the scaling graphs as you sweep the hyperparameters), but AFAIK we have been unable to find any evidence for this, even in toy models.
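To be explicit about the kind of test I have in mind (a hypothetical sketch rather than a particular experiment we ran): generate data from a known sparse dictionary, train SAEs of increasing width on it, and ask whether the learned decoder directions converge on the ground-truth directions, e.g. measured by mean max cosine similarity. A strong reading of the LRH would predict this curve saturates near 1 once the SAE is wide enough.

```python
import numpy as np

def mean_max_cosine_similarity(learned: np.ndarray, true: np.ndarray) -> float:
    """For each ground-truth direction, find the best-matching learned decoder
    direction by cosine similarity, then average. Both inputs: [n_dirs, d_model]."""
    learned = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    true = true / np.linalg.norm(true, axis=1, keepdims=True)
    sims = true @ learned.T                  # [n_true, n_learned]
    return float(sims.max(axis=1).mean())

rng = np.random.default_rng(0)
d_model, n_true = 32, 100
ground_truth = rng.standard_normal((n_true, d_model))

# Stand-ins for decoders of SAEs trained at increasing widths; in the real
# test these would come from training on data generated from `ground_truth`.
for width in (128, 256, 512):
    learned_decoder = rng.standard_normal((width, d_model))
    print(width, mean_max_cosine_similarity(learned_decoder, ground_truth))
```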
I don't want to overstate this point, though; I think part of the reason for the excitement around SAEs is that this was genuinely great science: the Toy Models paper proposed theoretical reasons to expect linear representations in superposition, which implied that something like SAEs should recover interesting representations, and then that prediction was quite successful! (This is why I say in the post that I think there's a reasonable amount of evidence for at least the weak LRH).