The approach you suggest feels similar in spirit to Farnik et al., and I think it’s a reasonable thing to try. However, I opted for my approach because it produces an exactly sparse forward pass, rather than just suppressing the contribution of weak connections. This means no arbitrary threshold needs to be chosen when building a feature circuit or attribution graph: either two latents are connected, or they are not.
I also like the fact that my approach gives us sparse global virtual weights, which allows us to study global circuits—something Anthropic had problems with due to interference weights in their approach.
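To make the threshold point concrete, here is a toy sketch (not the actual method; the matrices, values, and function names are made up for illustration). With dense weights where weak connections are merely suppressed, the set of edges you read off depends on the cutoff you pick; with exactly sparse weights, the edges are just the nonzero entries.

```python
import numpy as np

def edges_from_dense(weights, threshold):
    """Edges read off a dense matrix: the result depends on the chosen threshold."""
    src, dst = np.nonzero(np.abs(weights) > threshold)
    return set(zip(src.tolist(), dst.tolist()))

def edges_from_sparse(weights):
    """Edges read off an exactly sparse matrix: the nonzero entries, no threshold."""
    src, dst = np.nonzero(weights)
    return set(zip(src.tolist(), dst.tolist()))

# Hypothetical weights between 3 upstream and 3 downstream latents.
dense = np.array([[0.90, 0.04, 0.00],
                  [0.02, 0.70, 0.08],
                  [0.05, 0.01, 0.60]])
sparse = np.array([[0.9, 0.0, 0.0],
                   [0.0, 0.7, 0.0],
                   [0.0, 0.0, 0.6]])

loose = edges_from_dense(dense, threshold=0.03)   # keeps some weak connections
strict = edges_from_dense(dense, threshold=0.1)   # drops them
exact = edges_from_sparse(sparse)                 # no choice to make
```

In the dense case the graph changes between the two cutoffs, while in the sparse case there is only one possible graph.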