Aren’t the MLPs in a transformer straightforward examples of this?
That's certainly the most straightforward interpretation! I think a lot of the ideas I'm talking about here are downstream of the toy models paper, which introduces the idea that MLPs might fundamentally be explained in terms of (approximate) manipulations of these kinds of linear subspaces; i.e. that everything that wasn't explicable in this way would sort of function like noise, rather than playing a functional role.
I think I agree this should have been treated with a lot more suspicion than it was in interpretability circles, but lots of people were excited about the paper, and then SAEs seemed to be 'magically' finding lots of cool interpretable features based on this linear direction hypothesis. This looked a bit like a validation of the linear feature idea in the TMS paper, which explains a certain amount of the excitement around it.
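For concreteness, the picture in play here is roughly that an activation vector decomposes into a sparse combination of interpretable linear 'feature' directions, and an SAE tries to recover that decomposition. Here is a minimal sketch of the standard SAE setup, not anything from the original post; the dimensions, penalty weight, and PyTorch framing are purely illustrative:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: decompose an activation x into a sparse,
    non-negative combination of learned 'feature' directions."""
    def __init__(self, d_model=768, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # x -> feature activations
        self.decoder = nn.Linear(d_features, d_model)   # feature activations -> reconstruction

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # x ~ sum_i f_i(x) * d_i, where d_i are decoder columns
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparsity,
    # so each input is explained by only a few active feature directions.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```

The strong versions of the hypothesis treat the recovered directions as the model's 'real' internal variables; the weaker reading, which I discuss below, treats them as a useful approximation.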
I think the main point I wanted to make with the post was that the TMS model was a hypothesis; that it was a rather vague hypothesis which we should articulate more clearly so we could think about whether it was true; and that the strongest versions of it were probably not true.
I think the general argument I made in this post was correct, and anticipated a shift away from the strong-feature-hypothesis mode of thinking about SAEs. It's hard to say to what degree this shift was downstream of me publishing the post (probably mostly not, though it may have had some influence), but it probably deserves some Bayes points nevertheless. Many of the arguments in the post continue to influence my thinking, and there is a lot in it that remains valuable.
In fact, if anything I think I should have been a bit more confident; 'The strong feature hypothesis is wrong' would have been a better title. In particular, I actually think the criticism of infinite recursion was likely always fatal to versions of the SFH that included monosemanticity. That is: if a model understands a concept by activating a concept feature, then how does the concept feature understand the concept? Thinking about semantics is always prone to this sort of 'homunculus' fallacy.
Perhaps I should have drawn the distinction between the various forms of the SFH more explicitly. I think the 'atomic feature' model, where atoms are the main 'internal format' of the model but we remain more agnostic about their interpretation, is a more defensible and interesting one, though I am still skeptical of it.
I think there is a lot of interesting material in the post, but it would probably have benefitted from a more careful organisation and structuring of the argument. It has a fairly conversational, essayistic style which perhaps makes it easier and more engaging to read, but it rambles in places, and occasionally the structure of the argument is unclear or jumps from one point to another related one without warning. This may have been an attempt to imitate the writing of two key influences on the post, Dennett and Wittgenstein, whose style is often conversational (and, in the case of Wittgenstein, often extremely terse and opaque), but it was probably misguided, and a more pedantic structure with enumerated points might have been better, if a bit less fun to write.
I think one thing that may make the post more likely to stand the test of time is its importance as a historical document of a view that was, I think, extremely widespread and influential in interpretability circles (the 'strong feature hypothesis', or 'SAE realism') and is much less so now. As the post argues, this view was very rarely explicitly outlined or articulated, despite its influence, and so the post serves as an interesting reminder of this moment in our intellectual history. Given this, it's a shame that I didn't spend a bit more time explicitly describing the strong feature hypothesis, as I might have done with a more rigorous organisation of the post. Two aspects of the writing are interesting in this light: my assumption that the reader will recognise the view I was describing without much difficulty, and my politeness towards the viewpoint. These reflect that I expected my audience to be inclined to agree with the strong feature hypothesis, at least initially, though they could also reflect the fact that I didn't want my readers to think I was knocking down a strawman.
To the extent that the field has moved away from the strong feature hypothesis, arguing against it is a bit less likely to be relevant in the future. In terms of future value, I think some of the more interesting things in the post are towards the end; the tacit representation argument, and the point about the possible opacity of implementations of complex behaviour, are particularly important (though far from entirely novel, since they're essentially straight out of Dennett). This is a possibility that thinking about interpretability and safety neglects at our peril, and it's probably one of the more evergreen parts of the post overall. I also stand by the argument that it's important to focus on function as a means to understand representations, rather than on representations as a means to understand functions, as ultimately it's functions that give representations their meaning rather than the other way around.
People continue to find SAEs valuable as an exploratory tool for debugging, though whether they are worth the cost of development remains an interesting question that I don't have time to think about now. But I think their usefulness is much more as an approximation and exploratory tool than as a search for the 'ground truth' variables of the model. This is a good thing, and it's basically in line with what I was arguing for here.
I would like to have spent more time on this review, but I had to squeeze it in between looking after my very young daughter.