We propose a simple fix: use an L_p sparsity penalty with 0 < p < 1 instead of L1. This appears to be a Pareto improvement over L1 in terms of the number of features required to achieve a given reconstruction error, at least in some real models, though results may be mixed.
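As a concrete illustration, here is a minimal sketch of what such a loss might look like for a sparse autoencoder, assuming the penalty is simply the sum of |f_i|^p over the feature activations f. The function name and the values of `p` and `lam` are illustrative placeholders, not settings from our experiments.

```python
import torch

def sae_loss(x, x_hat, f, p=0.5, lam=1e-3):
    """Reconstruction error plus an L_p sparsity penalty on the SAE
    feature activations f. Setting p=1 recovers the usual L1 penalty;
    0 < p < 1 penalizes small activations more aggressively."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().pow(p).sum(dim=-1).mean()
    return recon + lam * sparsity
```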
When I discussed better sparsity penalties with Lawrence, and mentioned that I had observed some instability with L_p (0 < p < 1) in toy models of superposition, he pointed out that the gradient of the L_p norm explodes near zero. This means that features with “small errors”, which give them a very small but non-zero overlap with some activations, might be killed off entirely rather than merely having the overlap penalized.
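To see why, note that for a > 0 the derivative of a^p is p·a^(p-1), which diverges as a → 0 when p < 1, whereas the L1 gradient stays at 1. The short PyTorch snippet below (with p = 0.5 chosen purely for illustration) makes the blow-up concrete.

```python
import torch

# Gradient of the L_p penalty |a|^p (here p = 0.5) as the activation a
# approaches zero: d/da |a|^p = p * |a|^(p-1), which blows up for p < 1.
p = 0.5
for a0 in [1.0, 1e-1, 1e-2, 1e-4]:
    a = torch.tensor(a0, requires_grad=True)
    a.abs().pow(p).backward()
    print(f"a = {a0:g}: dL_p/da = {a.grad.item():.2f}   (L1 gradient = 1)")
```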
See here for a brief write-up and animations.