Even just for evaluating the utility of SAEs for supervised probing though, I think it’s unfair to use the same layer for all tasks. Afaik there could easily be tasks where the model represents the target concept using a small number of linear features at some layer, but not at the chosen layer. This will harm k-sparse SAE probe performance far more than the baseline performance because the baselines can make the best of the bad situation at the chosen layer by e.g. combining many features which are weakly correlated with the target concept and using non-linearities. I think it would be a fair test if the ‘quiver of arrows’ were expanded to include each method applied at each of a range of layers.
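To make the concern concrete, here is a minimal synthetic sketch of the proposed layer sweep. Everything here is my own simplification, not the paper's setup: a mean-difference linear classifier stands in for the logistic-regression baseline, correlation-based top-k feature selection stands in for a k-sparse SAE probe, and the "layers" are synthetic activation matrices where the concept is sparsely and strongly represented at one layer but only weakly and diffusely elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_accuracy(X, y):
    # Mean-difference linear probe: a crude stand-in for a trained
    # logistic-regression baseline probe.
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu1 - mu0
    b = -0.5 * (mu1 + mu0) @ w
    return float(((X @ w + b > 0) == (y == 1)).mean())

def k_sparse_accuracy(X, y, k=8):
    # Keep only the k features most correlated with the label, then probe
    # on those alone -- mimicking a k-sparse probe that may only use a
    # handful of (SAE) features.
    corr = np.abs(np.corrcoef(X.T, y)[-1, :-1])
    top_k = np.argsort(corr)[-k:]
    return probe_accuracy(X[:, top_k], y)

n, d, n_layers = 500, 64, 6
y = rng.integers(0, 2, n)

results = {}
for layer in range(n_layers):
    X = rng.normal(size=(n, d))
    if layer == 3:
        # Concept is strongly represented by a few features at this layer.
        X[:, :4] += 2.0 * y[:, None]
    else:
        # Elsewhere: weak, diffuse correlation spread over many features.
        X += 0.2 * y[:, None] * rng.normal(size=d)
    results[layer] = (k_sparse_accuracy(X, y), probe_accuracy(X, y))

best_sparse = max(results, key=lambda l: results[l][0])
```

On data like this, the k-sparse method only looks good at the layer where the concept happens to be sparsely encoded, while the dense baseline degrades more gracefully at the other layers by pooling many weak features; sweeping layers and taking each method's best layer removes that confound.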
I’d be surprised if it made a big difference, but I agree in principle that it could make a difference, and favours probes somewhat over SAEs, so fair point.