In general, I’m still excited about an unsupervised method that finds all of a model’s features/functions. SAEs are one possible option, but others are being worked on! (e.g. APD & L3D for weight-based methods)
Relatedly, I’m also excited about interpretable-from-scratch architectures that lend themselves more readily to mech interp (bottom-up in Dan’s language).
“focus should no longer be put into SAEs...?”
I think we should still invest research into them, BUT it depends on the research.
Less interesting research:
1. Applying SAEs to [X-model/field] (or to Y-problem w/o any baselines)
More interesting research:
1. Problems w/ SAEs & possible solutions:
    - Feature suppression (solved by post-training, gated SAEs, & top-k)
    - Feature absorption (possibly solved by Matryoshka SAEs)
    - SAEs don’t find the same features across seeds (maybe solved by constraining latents to the convex hull of the data)
    - Dark matter of SAEs (no solution AFAIK)
    - Many more I’m likely forgetting/haven’t read
2. Comparing SAEs against strong baselines for solving specific problems
3. Using SAEs to test how true the linear representation hypothesis is
4. Changing the SAE architecture to match the data
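To make the top-k fix for feature suppression concrete: the idea is to replace the L1 sparsity penalty (which shrinks active latents below their true magnitude) with a hard constraint that keeps only the k largest pre-activations per sample. Here's a minimal NumPy sketch of that forward pass; the function name and toy dimensions are illustrative, not from any particular SAE library:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a top-k sparse autoencoder (sketch).

    Sparsity is enforced by keeping only the k largest pre-activations
    per sample, rather than via an L1 penalty, so active latents are
    not shrunk toward zero (the cause of feature suppression).
    """
    pre = x @ W_enc + b_enc  # (batch, n_latents)
    # indices of the k largest pre-activations in each row
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    # zero everything except those top-k entries
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    z = np.maximum(z, 0.0)  # ReLU on the surviving latents
    recon = z @ W_dec + b_dec  # (batch, d_model)
    return z, recon

# toy dimensions: d_model=8, n_latents=32, k=4 active latents per sample
rng = np.random.default_rng(0)
d, m, k = 8, 32, 4
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
W_dec = rng.normal(size=(m, d)) / np.sqrt(m)
b_enc, b_dec = np.zeros(m), np.zeros(d)

x = rng.normal(size=(5, d))
z, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
assert (z != 0).sum(axis=-1).max() <= k  # at most k active latents per sample
```

Training then only needs a reconstruction loss; the sparsity level k is set directly rather than tuned indirectly through an L1 coefficient.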