That is really pretty cool. I had a similar project that tried to do the same thing, except it would use gradients. So you’d have a set of feature directions $W_{\text{feats}}$ that define a CLT or auto-encoders for each layer, and then you’d train those feature directions using
1) $\nabla_{W_{\text{feats}}}\Big(\sum_{s,i,j}\Big|\frac{\partial f_{s,i}(x)}{\partial f_{s,j}(x)}\Big| + L_{W_{\text{feats}}}(x)\Big)$
or
2) $\nabla_{W_{\text{feats}}}\Big(\sum_{s,i,j}\Big|f_{s,j}(x)\,\frac{\partial f_{s,i}(x)}{\partial f_{s,j}(x)}\Big| + L_{W_{\text{feats}}}(x)\Big)$
with the $f_{s,i}$, $f_{s,j}$ being feature activations across features $i, j$, and $L_{W_{\text{feats}}}$ being ordinary cross-entropy loss on some prediction task.
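Concretely, the idea is roughly something like this minimal sketch (the two-layer ReLU stack, shapes, penalty weight, and JAX framing are illustrative assumptions on my part, not what I actually had):

```python
# Sketch of penalties (1) and (2): sparsity on the Jacobian between feature
# activations of adjacent layers, plus a cross-entropy prediction loss.
import jax
import jax.numpy as jnp

def feats_l1(W1, x):
    # upstream feature activations f_{s,j}(x), one row per index s
    return jax.nn.relu(x @ W1)

def feats_l2(W2, f1):
    # downstream feature activations f_{s,i}(x), computed from the layer-1 features
    return jax.nn.relu(f1 @ W2)

def attribution_penalty(W1, W2, x, weight_by_activation=False):
    f1 = feats_l1(W1, x)                                            # (S, d1)
    # per-row Jacobian d f_{s,i} / d f_{s,j} of downstream w.r.t. upstream features
    jac = jax.vmap(jax.jacrev(lambda f: jax.nn.relu(f @ W2)))(f1)   # (S, d2, d1)
    if weight_by_activation:
        # penalty (2): attribution-style weighting by the upstream activation f_{s,j}
        jac = jac * f1[:, None, :]
    # sum of absolute entries of the (optionally weighted) Jacobian
    return jnp.sum(jnp.abs(jac))

def loss(params, x, targets, lam=1e-3):
    W1, W2, W_out = params
    logits = feats_l2(W2, feats_l1(W1, x)) @ W_out
    log_probs = jax.nn.log_softmax(logits)
    ce = -jnp.mean(jnp.take_along_axis(log_probs, targets[:, None], axis=1))
    return ce + lam * attribution_penalty(W1, W2, x)

# The expressions above are then just the gradient of this loss with respect
# to the feature directions.
grad_wrt_feature_dirs = jax.grad(loss)
```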
The reasoning being that this is like training a vanilla SAE, except you’re encouraging sparsity in the attribution graph. My hope was that, like you said, it “felt right”.
I thought this might actually fix some of the issues you mentioned with the gerrymandered features you get when you just train for sparsity at the feature level. Because those gerrymandered features are more sparse, but they make the computational graph more complicated. In the feature-absorption example, training might gerrymander the “animal” category, but with this setup, downstream computation about animals should respond gradient-wise both to the exclusive “animal” feature and to the gerrymandered “cow” feature (the one that has absorbed “animal”).
Hahaha, I didn’t get very far, because I was discouraged by the same observation you made: that this would be insanely computationally prohibitive. It was also just finicky to implement, because I think to get it to work properly you’d have to make many of the same modifications people made to vanilla SAEs to get them to work better.
Did you think about this approach? My thought was that it would be easier to implement and would automatically handle all dependencies between all features without any issues. However, it is just an approximation of the true causal relationships, so it might not work that well. I mean, (2) above is a better approximation of what’s actually happening in the computation, but (1) addresses more of the gerrymandering issue.
The approach you suggest feels similar in spirit to Farnik et al., and I think it’s a reasonable thing to try. However, I opted for my approach since it produces an exactly sparse forward pass, rather than just suppressing the contribution of weak connections. So no arbitrary threshold must be chosen when building a feature circuit/attribution graph: either two latents are connected, or they are not.
I also like the fact that my approach gives us sparse global virtual weights, which allows us to study global circuits—something Anthropic had problems with due to interference weights in their approach.
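To spell out what I mean by global virtual weights, assuming the usual construction of composing an upstream latent’s decoder direction with a downstream latent’s encoder direction through the residual stream (the names and shapes below are illustrative, not my exact setup):

```python
import jax.numpy as jnp

def global_virtual_weights(W_dec_up, W_enc_down):
    # W_dec_up:   (n_up, d_model)   decoder directions of the upstream latents
    # W_enc_down: (d_model, n_down) encoder directions of the downstream latents
    # V[j, i] is the direct latent-to-latent weight: how strongly upstream
    # latent j drives downstream latent i through the residual stream,
    # ignoring nonlinearities and interference from other latents.
    return W_dec_up @ W_enc_down  # (n_up, n_down)
```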