Thank you for writing this up! I experimented briefly with group sparsity as well, but with the goal of learning the “hierarchy” of features rather than learning circular features like you’re doing here. I also struggled to get it to work in toy settings, but didn’t try extensively and ended up moving on to other things. I still think there must be something to group sparsity, since it’s so well studied in sparse coding and clearly does work in theory.
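For concreteness, the traditional fixed-group penalty I have in mind is roughly an L2,1 norm over pre-defined groups of SAE latents. A minimal PyTorch sketch (the function name and the assumption that activations come in as a `(batch, n_latents)` tensor are mine, not from the post):

```python
import torch

def group_sparsity_penalty(acts: torch.Tensor, groups: list[list[int]]) -> torch.Tensor:
    """L2,1-style group sparsity penalty.

    acts:   (batch, n_latents) SAE latent activations
    groups: fixed partition of latent indices, chosen before training
    """
    penalty = acts.new_zeros(())
    for idxs in groups:
        # L2 norm within each group, summed (L1-style) across groups,
        # so whole groups tend to switch off together.
        penalty = penalty + acts[:, idxs].norm(dim=-1).mean()
    return penalty
```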
I also struggled with the problem of how to choose groups, since traditional group sparsity requires you to set the groups beforehand. I like your idea of trying to learn the group space. For using group sparsity to recover hierarchy, I wonder if there’s a way to learn a direction for the group as a whole and project that direction out of each member of the group. The idea would be that if latents share common components, those common components should probably be their own “group” representation, and this should be repeated until the leaf nodes are mostly orthogonal to each other. There are definitely overlapping hierarchies too, which is a challenge.
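To make the “learn a group direction and project it out of each member” idea concrete, here’s a rough sketch of what I’m picturing. The choice of the normalized mean of the members as the group direction is just the simplest option (a learned direction could work too), and all the names here are mine:

```python
import torch

def split_out_group_direction(member_dirs: torch.Tensor):
    """member_dirs: (n_members, d_model) decoder directions of one group's latents.

    Returns the shared group direction plus each member with that shared
    component projected out, so the leaf directions are orthogonal to it.
    """
    # One simple choice of shared direction: the normalized mean of the members.
    group_dir = member_dirs.mean(dim=0)
    group_dir = group_dir / group_dir.norm()

    # Remove the shared component from every member direction.
    coeffs = member_dirs @ group_dir                      # (n_members,)
    leaf_dirs = member_dirs - coeffs[:, None] * group_dir

    return group_dir, leaf_dirs
```

The loop I’m imagining would repeat this up the tree until the remaining leaf directions are mostly orthogonal to each other.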
Regardless, thank you for sharing this! There are a lot of great ideas in this post.
I view SAE width and SAE L0 as two separate parameters we should try to get right if we can. In toy models, failure modes similar to what we see with low-L0 SAEs also happen if the SAE is narrower than the number of true features: the SAE tries to “cheat” and get better MSE loss by mixing correlated features together. If we can’t make the SAE as wide as the number of true features, I’d still expect wider SAEs to learn cleaner features than narrower SAEs. But then wider SAEs make feature absorption a lot worse, so that’s a problem. I don’t think multi-L0 SAEs would help or hurt in this case, though: capturing near-infinite features requires a near-infinite-width SAE regardless of the L0.
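As a rough illustration of the narrow-SAE failure mode, here’s a toy sketch with made-up numbers (20 sparse true features, two of them correlated, and an SAE that’s deliberately too narrow); this is not the actual setup from the post or from our toy-model experiments:

```python
import torch

torch.manual_seed(0)

# 20 sparse "true" features in a 10-dim space; the SAE only gets 10 latents.
n_true, d_model, n_latents = 20, 10, 10
true_dirs = torch.nn.functional.normalize(torch.randn(n_true, d_model), dim=-1)

def sample_batch(batch_size=512, p_active=0.05, p_correlated=0.5):
    # Features fire sparsely; feature 1 co-fires with feature 0 half the time,
    # giving a too-narrow SAE an incentive to merge them into one latent.
    active = (torch.rand(batch_size, n_true) < p_active).float()
    co_fire = active[:, 0] * (torch.rand(batch_size) < p_correlated).float()
    active[:, 1] = torch.maximum(active[:, 1], co_fire)
    return active @ true_dirs

enc = torch.nn.Linear(d_model, n_latents)
dec = torch.nn.Linear(n_latents, d_model, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(5000):
    x = sample_batch()
    acts = torch.relu(enc(x))
    recon = dec(acts)
    loss = (recon - x).pow(2).mean() + 3e-3 * acts.abs().mean()  # MSE + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Compare learned decoder directions to the true features: with too few latents,
# some decoder columns can end up aligned with several correlated true features
# rather than cleanly matching one each.
dec_dirs = torch.nn.functional.normalize(dec.weight.T, dim=-1).detach()
print((dec_dirs @ true_dirs.T).round(decimals=2))
```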
For setting the correct L0 for a given SAE width, I don’t think there’s a trade-off with absorption: getting the L0 correct should always improve things. I view the feature-completeness stuff as also being somewhat separate from the choice of L0, since L0 is about how many features are active at the same time, regardless of the total number of features. Even if there are infinitely many features, there’s still hopefully only a small, finite number of features active for any given input.
In all the experiments across all 3 cases, the SAEs have the same width (20), so the higher-L0 SAEs don’t learn any more features than the lower-L0 SAEs.
In an earlier post, we looked into what happens in toy models when SAEs are wider than the number of true features, and found exactly what you suspect: the SAE starts inventing arbitrary combo latents (e.g. a “red triangle” latent in addition to “red” and “triangle” latents), creating duplicate latents, or just killing off some of the extra latents.
For both L0 and width, it seems like giving the SAE more capacity than it needs to model the underlying data results in the SAE misusing the extra capacity and finding degenerate solutions.