heh, unfortunately a single SAE has a dictionary of 768 * 60 = 46,080 latents. The residual stream in GPT-2 is 768 dims, and SAEs are big. You probably want to test this out on smaller models.
I can’t recall the compute costs for that script, sorry. A couple of things to note:
For a single SAE you will need to run it on ~25k latents (46k minus the dead ones) instead of the 200 we did.
You will only need to produce explanations for activations, and won’t have to do the second step of asking the model to produce activations given the explanations.
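For a rough sense of the scale-up, here’s a back-of-envelope sketch. All numbers come from this thread; nothing here is taken from the actual script.

```python
# Back-of-envelope scale for the auto-interp run described above.
d_model = 768                      # GPT-2 small residual stream width
expansion = 60                     # SAE dictionary is 60x the residual stream
n_latents = d_model * expansion    # 46,080 latents in one SAE
n_alive = 25_000                   # ~46k minus the dead latents

n_explained_before = 200           # latents explained in the original run
print(f"{n_latents:,} latents, ~{n_alive:,} alive -> "
      f"{n_alive / n_explained_before:.0f}x more explanation calls")
# Only the explanation step is needed; the simulate-activations scoring step is skipped.
```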
It’s a fun idea, though a serious issue is that your external LoRA weights are going to be very large: their input and output need to match the size of your SAE dictionary, which could be 10-100x (or more, nobody knows) the residual stream size. So this could be a very expensive setup to finetune.
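A quick sizing sketch of why (the 46,080 dictionary size is the GPT-2 SAE from this thread; the rank-16 adapter layout is an illustrative assumption, not a specific proposal):

```python
# Parameter counts for a rank-r LoRA-style adapter (two factors: d_in x r and r x d_out),
# placed either in the residual stream or in SAE latent space.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return d_in * r + r * d_out

d_model, d_sae, r = 768, 46_080, 16
print(lora_params(d_model, d_model, r))  # 24,576 params in the residual stream
print(lora_params(d_sae, d_sae, r))      # 1,474,560 params in SAE latent space, 60x larger
print(d_sae * d_sae)                     # ~2.1B params for a dense latent->latent map
```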
Thank you again.
I’ll look for a smaller model with SAEs that have smaller hidden dimensions and more thoroughly labeled latents, even though they won’t be end2end. If I don’t find anything that fits my purposes, I might try using your code to train my own end2end SAEs of a more convenient dimension. I may want to do this anyway, since I expect the technique I described would work best for turning a helpful-only model into a helpful-harmless model, and I don’t see such a helpful-only model on Neuronpedia.
If the FFNN has a hidden dimension of 16, then it would have around 1.5 million parameters, which doesn’t sound too bad, and 16 might be enough to find something interesting.
Low-rank factorization might help with the parameter counts.
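Something like this is what I have in mind; a minimal sketch, assuming the ~46k-latent GPT-2 SAE from above, and it confirms the ~1.5 million figure:

```python
import torch.nn as nn

# Bottleneck FFNN mapping SAE latents -> 16 dims -> SAE latents.
d_sae, d_hidden = 46_080, 16
ffnn = nn.Sequential(
    nn.Linear(d_sae, d_hidden),  # down-projection
    nn.ReLU(),
    nn.Linear(d_hidden, d_sae),  # up-projection
)
n_params = sum(p.numel() for p in ffnn.parameters())
print(f"{n_params:,}")  # 1,520,656 -- the ~1.5M figure above
# Dropping the ReLU makes this exactly a rank-16 factorization of a d_sae x d_sae map.
```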
Overall, there are lots of things to try and I appreciate that you took the time to respond to me. Keep up the great work!
Why do you need to have all feature descriptions at the outset? Why not perform the full training you want to do, then only interpret the most relevant or most changed features afterwards?
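E.g., something like this hypothetical sketch for the post-hoc selection (the activation tensors and the mean-activation-change criterion are assumptions for illustration, not anything from your setup):

```python
import torch

# Rank latents by how much their mean activation changed after training,
# and only send the top-k to auto-interp.
# acts_before / acts_after: assumed tensors of shape (n_tokens, n_latents),
# collected on the same eval prompts before and after finetuning.
def most_changed_latents(acts_before: torch.Tensor,
                         acts_after: torch.Tensor,
                         k: int = 200) -> torch.Tensor:
    delta = (acts_after.mean(dim=0) - acts_before.mean(dim=0)).abs()
    return torch.topk(delta, k).indices  # latent ids worth interpreting

# Then you interpret only these ~200 latents instead of all ~25k alive ones.
```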
That is a sensible way to save compute resources. Thank you.