I’ll look for a smaller model whose SAEs have smaller hidden dimensions and more thoroughly labeled latents, even though they won’t be end-to-end. If I don’t find anything that fits my purposes, I might try using your code to train my own end-to-end SAEs of a more convenient dimension. I may want to do this anyway, since I expect the technique I described would work best for turning a helpful-only model into a helpful-harmless model, and I don’t see such a helpful-only model on Neuronpedia.
If the FFNN has a hidden dimension of 16, it would have around 1.5 million parameters, which doesn’t sound too bad, and 16 units might still be enough to find something interesting.
Low-rank factorization might help with the parameter counts.
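As a sanity check on the arithmetic, here is a small sketch of the parameter counts. The SAE latent dimension of 49,152 is a hypothetical figure, chosen only so the numbers roughly match the ~1.5M estimate above; the 16-dim bottleneck is itself already a low-rank (rank-16) factorization of the full latent-to-latent map:

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a single dense layer."""
    return d_in * d_out + (d_out if bias else 0)

# Hypothetical SAE latent dimension, chosen so the totals roughly
# match the ~1.5M figure discussed above.
d_latent = 49_152
d_hidden = 16

# Two-layer FFNN mapping SAE latents -> 16-dim hidden -> SAE latents.
ffnn = linear_params(d_latent, d_hidden) + linear_params(d_hidden, d_latent)

# A full d_latent x d_latent linear map, for comparison: the 16-dim
# bottleneck replaces d^2 weights with roughly 2 * d * 16.
full = linear_params(d_latent, d_latent)

print(f"bottleneck FFNN: {ffnn:,} params")   # about 1.6 million
print(f"full linear map: {full:,} params")   # about 2.4 billion
```

So the bottleneck is doing the low-rank work already; shrinking the rank further (or dropping biases) only trims it at the margin.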
Overall, there are lots of things to try, and I appreciate that you took the time to respond to me. Keep up the great work!
Why do you need to have all feature descriptions at the outset? Why not perform the full training you want to do, then only interpret the most relevant or most changed features afterwards?
That is a sensible way to save compute resources. Thank you.
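The interpret-afterwards idea could be sketched like this: rank the SAE features by how much their decoder directions moved during training, and only write descriptions for the top movers. All the names here are hypothetical, and cosine distance on decoder rows is just one reasonable choice of "most changed":

```python
import numpy as np

def most_changed_features(decoder_before, decoder_after, k=20):
    """Rank SAE features by how far their decoder directions moved
    during training, returning the indices of the top-k movers.

    decoder_before / decoder_after: (n_features, d_model) arrays of
    decoder weights before and after training (hypothetical names;
    adapt to however the SAE stores its dictionary).
    """
    # Normalize each feature's direction, then take cosine distance
    # between its old and new direction.
    before = decoder_before / np.linalg.norm(decoder_before, axis=1, keepdims=True)
    after = decoder_after / np.linalg.norm(decoder_after, axis=1, keepdims=True)
    change = 1.0 - np.sum(before * after, axis=1)
    # Only these top-k features need a description afterwards.
    return np.argsort(change)[::-1][:k]

# Tiny synthetic check: perturb 5 of 1000 features and recover them.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(1000, 64))
W1 = W0.copy()
W1[:5] += rng.normal(size=(5, 64))
top = most_changed_features(W0, W1, k=5)
print(sorted(int(i) for i in top))  # indices of the perturbed features
```

This keeps the interpretation cost proportional to k rather than to the full dictionary size, which is the compute saving you describe.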