Nikita Balagansky comments on [PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Nikita Balagansky 27 Feb 2025 6:05 UTC
2 points
While the authors claim that their approach is fundamentally different from transcoders, from my perspective it addresses the same problem: finding interpretable circuits. I agree that it handles MLPs differently (e.g., connecting sparse inputs to sparse outputs rather than modeling the MLP directly). However, it would be great to see a comparison between the circuits identified by transcoders and those found by JSAE.
We also discuss this similarity in Appendix F of our recent work (where the transition T is analogous to f), though we do not consider gradient-based approaches. Nevertheless, simply matching the most similar features between input and output can still explain feature dynamics well enough to serve as a good baseline.
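To make the similarity baseline concrete, here is a minimal sketch, assuming each feature is represented by a direction vector (for example, a row of an SAE decoder matrix) and that "most similar" means highest cosine similarity. The function name `top_matches` and the toy matrices are hypothetical and are not taken from either paper.

```python
import numpy as np

def top_matches(dec_in, dec_out, k=1, eps=1e-12):
    """For each input-side feature, find the k most similar output-side features.

    dec_in:  (n_in, d) array, one direction per input-side feature (assumed layout).
    dec_out: (n_out, d) array, one direction per output-side feature (assumed layout).
    Returns (indices, similarities), each of shape (n_in, k).
    """
    # Normalize rows so that the dot product equals cosine similarity.
    a = dec_in / (np.linalg.norm(dec_in, axis=1, keepdims=True) + eps)
    b = dec_out / (np.linalg.norm(dec_out, axis=1, keepdims=True) + eps)
    sims = a @ b.T  # (n_in, n_out) cosine similarity matrix
    # Sort each row by descending similarity and keep the top k columns.
    idx = np.argsort(-sims, axis=1)[:, :k]
    return idx, np.take_along_axis(sims, idx, axis=1)

# Toy data: output features are a shuffled, slightly noisy copy of the input features.
rng = np.random.default_rng(0)
dec_in = rng.normal(size=(8, 16))
perm = rng.permutation(8)
dec_out = dec_in[perm] + 0.01 * rng.normal(size=(8, 16))

idx, sims = top_matches(dec_in, dec_out)
```

In this toy setup the top match for each input feature should recover the shuffle, which is the sense in which plain similarity matching can act as a baseline; whether it stays competitive on real models is exactly the comparison asked for above.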