Jason Gross comments on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Jason Gross 22 Jul 2024 6:33 UTC
LW: 1 AF: 1
0
AF
[Nix] Toy model of feature splitting
- There are at least two explanations for feature splitting I find plausible:
  Activations exist in higher dimensional manifolds in feature space, feature splitting is a symptom of one higher dimensional mostly-continuous feature being chunked into discrete features at different resolutions.
  There is a finite number of highly-related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained variance, but is better summarized as a collection of “split” features.
These do not sound like different explanations to me. In particular, the distinction between “mostly-continuous but approximated as discrete” and “discrete but very similar” seems ill-formed. All features are in fact discrete (because floating point numbers are discrete) and approximately continuous (because we posit that replacing floats with reals won’t change the behavior of the network meaningfully).
As far as toy models go, I’m pretty confident that the max-of-K setup from Compact Proofs of Model Performance via Mechanistic Interpretability will be a decent toy model. If you train SAEs post-unembed (probably also pre-unembed) with width d_vocab, you should find one feature for each sequence maximum (roughly). If you train with SAE width ${d_vocab}^{3} n_ctx$ , I expect each feature to split into roughly ${d_vocab}^{2} n_ctx$ features corresponding to the choice of query token, largest non-max token, and the number of copies of the maximum token. (How the SAE training data is distributed will change what exact features (principal directions of variation) are important to learn.). I’m quite interested in chatting with anyone working on / interested in this, and I expect my MATS scholar will get to testing this within the next month or two.
Edit: I expect this toy model will also permit exploring:
[Lee] Is there structure in feature splitting?
- Suppose we have a trained SAE with N features. If we apply e.g. NMF or SAEs to these directions are there directions that explain the structure of the splitting? As in, suppose we have a feature for math and a feature for physics. And suppose these split into (among other things)
  ‘topology in a math context’
  ‘topology in a physics context’
  ‘high dimensions in a math context’
  ‘high dimensions in a physics context’
- Is the topology-ifying direction the same for both features? Is the high-dimensionifying direction the same for both features? And if so, why did/didn’t the original SAEs find these directions?
I predict that whether or not the SAE finds the splitting directions depends on details about how much non-sparsity is penalized and how wide the SAE is. Given enough capacity, the SAE benefits (sparsity-wise) from replacing the (topology, math, physics) features with (topology-in-math, topology-in-physics), because split features activate more sparsely. Conversely, if the sparsity penalty is strong enough and there is not enough capacity to split, the loss recovered from having a topology feature at all (on top of the math/physics feature) may not outweigh the cost in sparsity.