This post makes the excellent point that the paradigm that motivated SAEs—the superposition hypothesis—is incompatible with widely known and easily demonstrated properties of SAE features (and feature vectors in general). The superposition hypothesis assumes that feature vectors have nonzero cosine similarity only because there isn’t enough space for them all to be orthogonal, in which case the cosine similarities themselves shouldn’t be meaningful. But in fact, cosine similarities between feature vectors have rich semantic content, as shown by circular embeddings (in several contexts) and by feature splitting / dimensionality-reduction visualizations. Features aren’t just crammed together arbitrarily; they’re grouped with similar features.
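To illustrate the kind of structure at stake, here is a minimal, idealized sketch (my own construction, not data from the post): twelve "month" features placed on a circle inside a larger residual-stream space. Their cosine similarities track calendar proximity rather than hovering near zero, which is what a naive random-packing picture of superposition would predict.

```python
import numpy as np

# Idealized illustration (my construction, not the post's data):
# 12 "month" features arranged on a circle inside a d-dimensional space.
rng = np.random.default_rng(0)
d, n_months = 64, 12
angles = 2 * np.pi * np.arange(n_months) / n_months

# A random orthonormal 2D plane in d dimensions to host the circle.
plane, _ = np.linalg.qr(rng.normal(size=(d, 2)))
features = np.cos(angles)[:, None] * plane[:, 0] + np.sin(angles)[:, None] * plane[:, 1]

# Pairwise cosine similarities (rows are already unit-norm).
sims = features @ features.T
print(np.round(sims[0], 2))
# Jan vs. Jan..Dec: 1.0, 0.87, 0.5, 0.0, -0.5, -0.87, -1.0, -0.87, ...
# Similarity falls off smoothly with calendar distance, i.e. the
# cosine similarities carry semantic structure.
```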
I didn’t properly appreciate this point before reading this post (actually: before someone summarized the post to me verbally), at which point it became blindingly obvious.
There are some earlier blog posts that point out that superposition is probably only part of the story, e.g. https://transformer-circuits.pub/2023/superposition-composition/index.html on compositionality, but this one presents the relevant empirical evidence and its implications very clearly.
This post holds up pretty well: SAEs are still popular (although they’ve lost some followers in the last ~year), and the point isn’t specific to SAEs anyway (circular feature embeddings are ubiquitous). Superposition is also still an important idea, although I’ve been thinking about it less, so I’m not sure what the state of the art is.
My only complaint is that “maybe if I’m being more sophisticated, I can specify the correlations between features” is giving the entire game away—the full set of correlations is nearly equivalent to the embeddings themselves, and has all of the interesting parts.
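To spell out why specifying all the correlations gives the game away, here is a small sketch (my own, with a hypothetical setup, not from the post): the full matrix of pairwise cosine similarities is a Gram matrix, and factoring it recovers the feature vectors up to an orthogonal transformation, so up to a rotation it just is the embedding geometry.

```python
import numpy as np

# Sketch (hypothetical setup): n unit-norm feature directions in a
# d-dimensional space, stored as the rows of W.
rng = np.random.default_rng(0)
n, d = 50, 16
W = rng.normal(size=(n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# "Specifying the correlations between features" means specifying the
# Gram matrix of all pairwise cosine similarities.
G = W @ W.T

# Factoring the Gram matrix recovers the feature geometry: W_rec has
# the same pairwise similarities as W (and equals W up to rotation).
eigvals, eigvecs = np.linalg.eigh(G)
eigvals = np.clip(eigvals, 0.0, None)            # guard tiny negative eigenvalues
W_rec = eigvecs[:, -d:] * np.sqrt(eigvals[-d:])  # rank-d factor of G
assert np.allclose(W_rec @ W_rec.T, G, atol=1e-6)
```

So a theory that takes the full correlation structure as given has, up to a global rotation, already assumed the embeddings.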
But I think the rest of the post demonstrates an important tension between theory and experiment, which an improved theory has to be able to account for, and I don’t think I’ve heard of an improved theory yet.