We propose a simple fix: use an L_p sparsity penalty with 0 < p < 1 instead of L1. This appears to be a Pareto improvement over L1 in terms of the number of features required to achieve a given reconstruction error, at least in some real models, though results may be mixed.
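As a concrete illustration, here is a minimal sketch of what such a loss might look like for a sparse autoencoder, assuming the penalty is simply the sum of |f_i|^p over the feature activations f. The function name and the values of `p` and `lam` are illustrative placeholders, not settings from our experiments.

```python
import torch

def sae_loss(x, x_hat, f, p=0.5, lam=1e-3):
    """Reconstruction error plus an L_p sparsity penalty on the SAE
    feature activations f. Setting p=1 recovers the usual L1 penalty;
    0 < p < 1 penalizes small activations more aggressively."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().pow(p).sum(dim=-1).mean()
    return recon + lam * sparsity
```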
When I discussed better sparsity penalties with Lawrence, and mentioned that I had observed some instability with L_p (0 < p < 1) in toy models of superposition, he pointed out that the gradient of the L_p norm explodes near zero. This means that features with “small errors”, which give them a very small but non-zero overlap with some activations, might be killed off entirely rather than merely having the overlap penalized.
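To see why, note that for a > 0 the derivative of a^p is p·a^(p-1), which diverges as a → 0 when p < 1, whereas the L1 gradient stays at 1. The short PyTorch snippet below (with p = 0.5 chosen purely for illustration) makes the blow-up concrete.

```python
import torch

# Gradient of the L_p penalty |a|^p (here p = 0.5) as the activation a
# approaches zero: d/da |a|^p = p * |a|^(p-1), which blows up for p < 1.
p = 0.5
for a0 in [1.0, 1e-1, 1e-2, 1e-4]:
    a = torch.tensor(a0, requires_grad=True)
    a.abs().pow(p).backward()
    print(f"a = {a0:g}: dL_p/da = {a.grad.item():.2f}   (L1 gradient = 1)")
```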
See here for a brief write-up and animations.