I think that if you only ever wanted to use SAEs for unsupervised discovery of features, these results are not very important. I was hopeful that SAEs would be more broadly useful, and this was a negative update. It's consistent with hypotheses like "SAEs faithfully capture the model's ontology and not the things we want", but it makes SAEs substantially less useful for practical tasks. I would love to see work that tries to find downstream tasks enabled by good circuit discovery and uses this to gather evidence.
Even just for evaluating the utility of SAEs for supervised probing, though, I think it's unfair to use the same layer for all tasks. As far as I know, there could easily be tasks where the model represents the target concept using a small number of linear features at some layer, but not at the chosen layer. This will harm k-sparse SAE probe performance far more than baseline performance, because the baselines can make the best of a bad situation at the chosen layer, e.g. by combining many features that are weakly correlated with the target concept and using non-linearities. I think it would be a fair test if the 'quiver of arrows' were expanded to include each method applied at each of a range of layers.
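To make the concern concrete, here is a toy numpy sketch (all data and names hypothetical, and ordinary feature dimensions stand in for SAE latents): a k=1 sparse probe only recovers a concept at the layer where it is represented by a single linear feature, so sweeping each method over a range of layers, rather than fixing one layer for all tasks, is the fairer comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_sparse_probe(acts, labels, k):
    """k-sparse probe: pick the k features most correlated with the label,
    then fit a least-squares linear probe on only those k features.
    (A stand-in for a k-sparse probe over SAE latents.)"""
    centered = acts - acts.mean(axis=0)
    y = labels - labels.mean()
    corr = np.abs(centered.T @ y)        # unnormalised correlation per feature
    idx = np.argsort(corr)[-k:]          # indices of the k strongest features
    X = np.c_[acts[:, idx], np.ones(len(acts))]  # add a bias column
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    preds = (X @ w) > 0.5
    return (preds == labels.astype(bool)).mean(), idx

# Hypothetical per-layer activations: the concept is linearly present
# only at "layer 1", where feature 3 encodes the label cleanly.
n, d = 200, 16
labels = rng.integers(0, 2, n).astype(float)
layer_acts = {
    0: rng.normal(size=(n, d)),          # concept absent at this layer
    1: rng.normal(size=(n, d)),
}
layer_acts[1][:, 3] += 3.0 * labels      # concept present at layer 1

# Sweeping the probe over layers reveals where the concept is sparsely
# represented; a probe fixed at layer 0 would look much worse.
for layer, acts in layer_acts.items():
    acc, idx = topk_sparse_probe(acts, labels, k=1)
    print(f"layer {layer}: k=1 probe accuracy {acc:.2f}, selected feature {idx}")
```

At the layer where the concept is absent, the k=1 probe is near chance; at the layer where a single feature carries it, accuracy is high. A dense baseline probe, by contrast, could partially compensate at the bad layer by pooling many weak features, which is the asymmetry the comment points at.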
I'd be surprised if it made a big difference, but I agree in principle that it could, and that the fixed-layer setup favours probes somewhat over SAEs, so fair point.