Various baselines have long been underrated in the interp literature, and now that we're rediscovering their importance, I'll bring up some results we found in MATS′23 that in hindsight probably should have received more attention: https://www.lesswrong.com/posts/JCgs7jGEvritqFLfR/evaluating-hidden-directions-on-the-utility-dataset
We found that linear probes are great for classification, but they mostly fit spurious correlations. That can still be fine if prediction is the end goal, such as when trying to identify deception. However, the directions found by a linear probe are not useful for steering or ablation.
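A minimal sketch of the probe setup, to make the comparison concrete. Everything here is a placeholder rather than our original pipeline: `acts` stands in for cached residual-stream activations and `labels` for binary concept labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: in practice, acts would be residual-stream activations
# cached from a model at some layer, and labels the concept annotations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))      # [n_samples, d_model]
labels = rng.integers(0, 2, size=1000)   # binary concept labels

X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))

# The probe weights define a direction in activation space, but (per the point
# above) it tends to pick up spurious correlations, so adding or ablating it
# rarely has the intended causal effect.
probe_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```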
What works really well (and is still under-explored), both for causal steering and for classification, is the vector obtained by subtracting the class means. LEACE showed that this is in fact theoretically the optimal direction for linear erasure.
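For concreteness, here is the difference-of-means recipe on the same kind of placeholder data. The steering coefficient `alpha` and the classification threshold are illustrative, and the erasure shown is a plain rank-1 orthogonal projection rather than LEACE's whitened oblique projection.

```python
import numpy as np

# Placeholder data, as in the probe sketch above.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))
labels = rng.integers(0, 2, size=1000)

# Unit vector pointing from the class-0 mean to the class-1 mean.
direction = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
direction /= np.linalg.norm(direction)

# Classification: project onto the direction, threshold at the midpoint.
mu0 = (acts[labels == 0] @ direction).mean()
mu1 = (acts[labels == 1] @ direction).mean()
preds = (acts @ direction) > (mu0 + mu1) / 2

# Steering: add a scaled copy of the direction to the activations
# (in a real run, inside a forward hook on the residual stream).
alpha = 4.0  # illustrative coefficient
steered = acts + alpha * direction

# Erasure: a simple orthogonal projection for illustration; LEACE's provably
# optimal linear erasure is an oblique projection computed in whitened space.
erased = acts - np.outer(acts @ direction, direction)
```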
Unsupervised methods (like PCA) are weaker at prediction but still quite good for causal interventions.
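And a sketch of the unsupervised variant, again on placeholder data; in practice PCA is often run on activations from contrast pairs so that the top component tracks the concept of interest.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder activations standing in for cached (or contrast-pair) data.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))

pca = PCA(n_components=1).fit(acts)
pc1 = pca.components_[0]  # unit-norm top principal component

# Without labels the sign of a principal component is arbitrary, which hurts
# prediction; the direction itself can still steer behavior when scaled in.
steered = acts + 4.0 * pc1
```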
These results were only published as Figure 12 of the Representation Engineering paper, though in hindsight it might have helped to highlight them more prominently as a paper of their own. SAEs were just being 'discovered' around that time (I remember @Hoagy was working on them at MATS′23), so unfortunately we didn't benchmark them.