george robinson comments on SAE feature geometry is outside the superposition hypothesis

george robinson 8 Jul 2024 16:40 UTC
This is probably just the way I’ve seen features/interpretability explained: the features on one layer are thought of as relevant combinations of simpler features from the previous layer (this explanation in particular seems to be the standard one for features of image classifiers). This is certainly simplistic, since the higher-level features are probably much more advanced functions of the previous layer rather than just ‘these n features are all present’. However, for understanding some of the geometry I think it could be interesting.
For example, you can certainly build a simplicial complex in the following way: let the features of the first layer be the 0-simplices. For a feature F on a later layer, compute the n features from the previous layer that are most likely to fire on a sample highly related to F, and produce an (n-1)-simplex on these (by most likely, I mean either estimated by sampling, or there may be a purely mathematical way of doing this from the feature vectors). This simplicial complex is a pretty basic object recording the relationships between features (on the same layer, or between layers). I can’t really say whether it would be particularly easy to actually compute with, but it might have some interesting topological properties (e.g. how easy is it to disconnect the complex by removing simplices, or equivalently by clamping features to zero).
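To make the construction a bit more concrete, here is a rough sketch of the sampling version for a single pair of adjacent layers. Everything in it is my own assumption rather than an established method: the function names are made up, "a sample highly related to F" is taken to mean the samples where F activates most strongly, and "most likely to fire" is taken to mean highest mean activation on those samples. Activations are plain lists of rows (one row per sample), and the disconnection check only removes vertices, which corresponds to clamping previous-layer features to zero.

```python
from itertools import combinations


def top_n_simplices(prev_acts, next_acts, n=3, num_top_samples=5):
    """For each next-layer feature F, return an (n-1)-simplex on previous-layer features.

    prev_acts[i][j]: activation of previous-layer feature j on sample i.
    next_acts[i][f]: activation of next-layer feature f on sample i.
    """
    num_prev = len(prev_acts[0])
    simplices = []
    for f in range(len(next_acts[0])):
        # Assumption: "samples highly related to F" = samples where F fires most.
        ranked = sorted(range(len(next_acts)), key=lambda i: next_acts[i][f], reverse=True)
        top = ranked[:num_top_samples]
        # Assumption: "most likely to fire" = highest mean activation on those samples.
        mean_prev = [sum(prev_acts[i][j] for i in top) / len(top) for j in range(num_prev)]
        best = sorted(range(num_prev), key=lambda j: mean_prev[j], reverse=True)[:n]
        simplices.append(frozenset(best))
    return simplices


def close_under_faces(simplices):
    """Add every nonempty face, so the result is a genuine simplicial complex."""
    complex_ = set()
    for s in simplices:
        for size in range(1, len(s) + 1):
            complex_.update(frozenset(c) for c in combinations(sorted(s), size))
    return complex_


def num_components(complex_, removed=frozenset()):
    """Count connected components after deleting the vertices in `removed`."""
    removed = frozenset(removed)
    vertices = {v for s in complex_ if len(s) == 1 for v in s} - removed
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Connectivity only depends on the edges (1-simplices) that survive.
    for s in complex_:
        if len(s) == 2 and not (s & removed):
            a, b = s
            parent[find(a)] = find(b)
    return len({find(v) for v in vertices})
```

On a toy input where two higher-layer features each pick three previous-layer features and share exactly one of them, the complex is two triangles glued at a vertex, and calling `num_components` with that shared vertex in `removed` splits it into two pieces. Whether real SAE features give anything this clean is exactly the part I can't say.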