In general, I’m still excited about an unsupervised method that finds all of a model’s features/functions. SAEs are one possible option, but others are being worked on! (e.g. APD & L3D for weight-based methods)
Relatedly, I’m also excited about interpretable-from-scratch architectures that lend themselves more readily to mech interp (bottom-up in Dan’s language).
“focus should no longer be put into SAEs...?”
I think we should still invest research into them, BUT it depends on the research.
Less interesting research:
1. Applying SAEs to [X-model/field] (or to Y-problem w/o any baselines)
More interesting research:
1. Problems w/ SAEs & possible solutions:
    - Feature suppression (solved by post-training, gated SAEs, & top-k)
    - Feature absorption (possibly solved by Matryoshka SAEs)
    - SAEs don’t find the same features across seeds (maybe solved by constraining latents to the convex hull of the data)
    - Dark matter of SAEs (no solution AFAIK)
    - Many more I’m likely forgetting/haven’t read
2. Comparing SAEs against strong baselines for solving specific problems
3. Using SAEs to test how true the linear representation hypothesis is
4. Changing the SAE architecture to match the data
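To make the top-k fix for feature suppression concrete: the idea is to replace the L1 sparsity penalty (which shrinks active latents below their true magnitude) with a hard constraint that keeps only the k largest pre-activations per sample. Here's a minimal NumPy sketch of that forward pass; the function name and toy dimensions are illustrative, not from any particular SAE library:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a top-k sparse autoencoder (sketch).

    Sparsity is enforced by keeping only the k largest pre-activations
    per sample, rather than via an L1 penalty, so active latents are
    not shrunk toward zero (the cause of feature suppression).
    """
    pre = x @ W_enc + b_enc  # (batch, n_latents)
    # indices of the k largest pre-activations in each row
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    # zero everything except those top-k entries
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    z = np.maximum(z, 0.0)  # ReLU on the surviving latents
    recon = z @ W_dec + b_dec  # (batch, d_model)
    return z, recon

# toy dimensions: d_model=8, n_latents=32, k=4 active latents per sample
rng = np.random.default_rng(0)
d, m, k = 8, 32, 4
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
W_dec = rng.normal(size=(m, d)) / np.sqrt(m)
b_enc, b_dec = np.zeros(m), np.zeros(d)

x = rng.normal(size=(5, d))
z, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
assert (z != 0).sum(axis=-1).max() <= k  # at most k active latents per sample
```

Training then only needs a reconstruction loss; the sparsity level k is set directly rather than tuned indirectly through an L1 coefficient.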