Suppose we had a hypothetical ‘ideal’ SAE which exhaustively discovered all of the features represented by a model at a certain layer, in their most ‘atomic’ form. Each latent’s decoder direction is perfectly aligned with its respective feature direction; there is zero reconstruction error, and every latent has a clear, interpretable meaning. If we had such an SAE for each component of the model at each layer, this would obviously be extremely valuable, since we could use them for circuit analysis and basically understand how the model works. Sure, it might still be painstaking, and maybe we’d wish some of the features weren’t quite so atomic, but basically we’d be in a good position to understand what’s going on.
I’m not sure that even an ideal SAE like that would fare well in this evaluation. Here are some reasons why:
The evaluation uses the same model layer for all tasks. While this layer was best on average for the baselines, it’s likely that for some (perhaps many) of the tasks the model doesn’t linearly represent the most relevant features at this layer, and therefore neither would a perfect SAE, so k-sparse probing on the SAE latents would perform poorly. Baseline methods can still potentially do decently on such tasks, since they can combine many features that are somewhat correlated with the task and/or ‘craft’ more relevant features using non-linearities.
For some tasks, the model might not linearly represent super-relevant features at any layer, again limiting the performance we can expect from even a perfect SAE with k-sparse probing. For example, it feels unlikely that a model such as Gemma-2-9B would linearly represent whether the second half of a prompt is entailed by the first half, unless perhaps it were prompted to look out for this (though this might be a bad example). Again, baseline methods can still attain decent performance by combining many weakly relevant features and using non-linearities.
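To illustrate the kind of non-linear ‘crafting’ I mean, here’s a toy numpy sketch (the setup is entirely made up for illustration): a concept that is a non-linear function of two linearly represented features is nearly uncorrelated with either feature individually, so a linear (or k-sparse linear) probe on those features can’t capture it, while a simple non-linear combination of them recovers it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
a = rng.normal(size=n)   # two linearly represented atomic features
b = rng.normal(size=n)
y = np.sign(a * b)       # a concept that is a non-linear function of them

# Each individual linear feature is ~uncorrelated with the concept ...
print(round(np.corrcoef(a, y)[0, 1], 2), round(np.corrcoef(b, y)[0, 1], 2))  # both ≈ 0

# ... but a hand-crafted non-linear feature (the product a*b) predicts it perfectly
print((np.sign(a * b) == y).mean())
```

The point is just that a probe allowed to build non-linear combinations of weakly (or un-) correlated features has an escape hatch that a k-sparse linear readout of fixed latents does not.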
Some tasks might be sufficiently complex that they naturally decompose into a combination of many (rather than few) atomic features. In such cases the concept may be linearly represented at the layer in question, but since it’s composed of many atomic features, k-sparse probing with a perfect SAE will still struggle due to the limited k, while baseline methods can learn to combine arbitrarily many features.
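Here’s a quick numpy sketch of that failure mode (the setup is entirely hypothetical: i.i.d. Gaussian ‘latents’ standing in for an ideal SAE’s activations). When the concept is a sum of 20 atomic features, a probe restricted to k=5 latents explains only a fraction of the variance, while letting k match the true support, as an unrestricted baseline effectively can, recovers it almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 50                      # samples, number of hypothetical atomic latents
Z = rng.normal(size=(n, d))          # 'ideal SAE' latent activations

# Concept linearly represented, but as a combination of MANY atomic features
w = np.zeros(d)
w[:20] = 1.0                         # 20 features contribute equally
y = Z @ w                            # target concept (continuous, for simplicity)

def ksparse_r2(k):
    # k-sparse probing: pick the k latents most correlated with the target,
    # then fit an ordinary least-squares readout on just those latents
    corrs = np.abs([np.corrcoef(Z[:, j], y)[0, 1] for j in range(d)])
    idx = np.argsort(corrs)[-k:]
    coef, *_ = np.linalg.lstsq(Z[:, idx], y, rcond=None)
    resid = y - Z[:, idx] @ coef
    return 1 - resid.var() / y.var()

print(round(ksparse_r2(5), 2))    # small k: explains only part of the variance
print(round(ksparse_r2(20), 2))   # k matching the true support: near-perfect
```

Nothing is wrong with the SAE here; every latent is a clean atomic feature. The limited k is doing all the damage.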
If even an ideal SAE could realistically underperform the baselines in this evaluation setup, then I’m not sure we should update too heavily about the utility of SAEs for arguably their primary use cases (things like circuit discovery, where we don’t already know what we’re looking for). Of course, anyone who was planning to use SAEs for probing under data-scarcity conditions etc. should update more substantially based on these results.
Even just for evaluating the utility of SAEs for supervised probing, though, I think it’s unfair to use the same layer for all tasks. As far as I know, there could easily be tasks where the model represents the target concept using a small number of linear features at some layer, but not at the chosen layer. This will harm k-sparse SAE probe performance far more than baseline performance, because the baselines can make the best of a bad situation at the chosen layer by e.g. combining many features which are weakly correlated with the target concept and using non-linearities. I think it would be a fair test if the ‘quiver of arrows’ were expanded to include each method applied at each of a range of layers.
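Concretely, the expanded quiver I have in mind would treat every (method, layer) pair as its own arrow and select per task on validation data. A minimal sketch, with made-up method names and random placeholder scores standing in for the validation accuracies you’d actually get by fitting each probe at each layer:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-(method, layer) validation scores for one task.
# In a real eval these would come from fitting each probe at each layer.
methods = ["ksparse_sae_probe", "logistic_baseline", "mlp_baseline"]
layers = [4, 8, 12, 16, 20]
val_scores = {(m, l): rng.uniform(0.6, 0.9) for m in methods for l in layers}

# Expanded quiver: every (method, layer) pair is an arrow. Pick the best
# one per task on validation data, then report its held-out test score.
best_arrow = max(val_scores, key=val_scores.get)
print(best_arrow, round(val_scores[best_arrow], 3))
```

This way a method isn’t penalized merely because the concept happens not to be represented sparsely (or at all) at one pre-committed layer.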