heh, unfortunately a single SAE has a dictionary of 768 * 60 = 46,080 latents. The residual stream in GPT-2 is 768 dims, and SAEs are big. You probably want to test this out on smaller models.
I can’t recall the compute costs for that script, sorry. A couple of things to note:
For a single SAE you will need to run it on ~25k latents (46k minus the dead ones) instead of the 200 we did.
You will only need to produce explanations for activations, and won’t have to do the second step of asking the model to produce activations given the explanations.
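For a rough sense of the scale-up, here’s a back-of-envelope sketch. All numbers come from this thread; nothing here is taken from the actual script.

```python
# Back-of-envelope scale for the auto-interp run described above.
d_model = 768                      # GPT-2 small residual stream width
expansion = 60                     # SAE dictionary is 60x the residual stream
n_latents = d_model * expansion    # 46,080 latents in one SAE
n_alive = 25_000                   # ~46k minus the dead latents

n_explained_before = 200           # latents explained in the original run
print(f"{n_latents:,} latents, ~{n_alive:,} alive -> "
      f"{n_alive / n_explained_before:.0f}x more explanation calls")
# Only the explanation step is needed; the simulate-activations scoring step is skipped.
```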
It’s a fun idea, though a serious issue is that your external LoRA weights are going to be very large: their input and output need to match the size of your SAE dictionary, which could be 10-100x (or more, nobody knows) the residual stream size. So this could be a very expensive setup to finetune.
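A quick sizing sketch of why (the 46,080 dictionary size is the GPT-2 SAE from this thread; the rank-16 adapter layout is an illustrative assumption, not a specific proposal):

```python
# Parameter counts for a rank-r LoRA-style adapter (two factors: d_in x r and r x d_out),
# placed either in the residual stream or in SAE latent space.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return d_in * r + r * d_out

d_model, d_sae, r = 768, 46_080, 16
print(lora_params(d_model, d_model, r))  # 24,576 params in the residual stream
print(lora_params(d_sae, d_sae, r))      # 1,474,560 params in SAE latent space, 60x larger
print(d_sae * d_sae)                     # ~2.1B params for a dense latent->latent map
```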
Thank you again.
I’ll look for a smaller model with SAEs that have smaller hidden dimensions and more thoroughly labeled latents, even though they won’t be end2end. If I don’t find anything that fits my purposes, I might try using your code to train my own end2end SAEs of a more convenient dimension. I may want to do this anyway, since I expect the technique I described would work best for turning a helpful-only model into a helpful-harmless model, and I don’t see such a helpful-only model on Neuronpedia.
If the FFNN has a hidden dimension of 16, then it would have around 1.5 million parameters, which doesn’t sound too bad, and 16 might be enough to find something interesting.
Low-rank factorization might help with the parameter counts.
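Something like this is what I have in mind; a minimal sketch, assuming the ~46k-latent GPT-2 SAE from above, and it confirms the ~1.5 million figure:

```python
import torch.nn as nn

# Bottleneck FFNN mapping SAE latents -> 16 dims -> SAE latents.
d_sae, d_hidden = 46_080, 16
ffnn = nn.Sequential(
    nn.Linear(d_sae, d_hidden),  # down-projection
    nn.ReLU(),
    nn.Linear(d_hidden, d_sae),  # up-projection
)
n_params = sum(p.numel() for p in ffnn.parameters())
print(f"{n_params:,}")  # 1,520,656 -- the ~1.5M figure above
# Dropping the ReLU makes this exactly a rank-16 factorization of a d_sae x d_sae map.
```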
Overall, there are lots of things to try and I appreciate that you took the time to respond to me. Keep up the great work!
Why do you need to have all feature descriptions at the outset? Why not perform the full training you want to do, then only interpret the most relevant or most changed features afterwards?
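E.g., something like this hypothetical sketch for the post-hoc selection (the activation tensors and the mean-activation-change criterion are assumptions for illustration, not anything from your setup):

```python
import torch

# Rank latents by how much their mean activation changed after training,
# and only send the top-k to auto-interp.
# acts_before / acts_after: assumed tensors of shape (n_tokens, n_latents),
# collected on the same eval prompts before and after finetuning.
def most_changed_latents(acts_before: torch.Tensor,
                         acts_after: torch.Tensor,
                         k: int = 200) -> torch.Tensor:
    delta = (acts_after.mean(dim=0) - acts_before.mean(dim=0)).abs()
    return torch.topk(delta, k).indices  # latent ids worth interpreting

# Then you interpret only these ~200 latents instead of all ~25k alive ones.
```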
That is a sensible way to save compute resources. Thank you.