Reposting my Slack comment here for the record: I’m excited to see challenges to our fundamental assumptions and exploration of alternatives!
Unfortunately, I think that the modified loss function makes the task a lot easier, and the results not applicable to superposition. (I think @Alex Gibson makes a similar point above.)
From the post: "In this post, we use a loss function that focuses only on reconstructing active features"
It is much easier to reconstruct the active features without regard for interference (inactive features also appearing active).
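To make the distinction concrete, here is a rough sketch of the two losses as I understand them (my own illustrative code and names, not the post's):

```python
import numpy as np

# Illustrative only: x is a batch of sparse feature activations,
# x_hat is the model's reconstruction of them.
def full_mse_loss(x_hat, x):
    # Standard superposition setup: errors on inactive features
    # ("interference") are penalized alongside errors on active ones.
    return np.mean((x_hat - x) ** 2)

def active_only_loss(x_hat, x):
    # Loss restricted to active features: interference on inactive
    # features costs nothing, which makes the task much easier.
    mask = (x != 0).astype(float)
    return np.sum(mask * (x_hat - x) ** 2) / max(mask.sum(), 1)
```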
In general, I find that the issue in NNs is that you not only need to "store" features in superposition, but also be able to read them off with low error / interference. Chris Olah's note on "linear readability" here (inspired by the Computation in Superposition work) describes that somewhat.
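Roughly what I mean by read-off interference, as a toy numerical sketch (again my own illustrative code, not anyone's actual setup):

```python
import numpy as np

# Pack n_features random unit directions into d < n_features dimensions,
# then try to read each feature back with a dot product against its direction.
rng = np.random.default_rng(0)
d, n_features = 32, 256
W = rng.normal(size=(n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

x = np.zeros(n_features)
active = rng.choice(n_features, size=4, replace=False)
x[active] = 1.0                 # a few active features
h = x @ W                       # "stored" in superposition
readout = h @ W.T               # linear read-off of every feature

inactive = np.setdiff1d(np.arange(n_features), active)
print(readout[active])                    # close to 1, plus some noise
print(np.abs(readout[inactive]).mean())   # inactive features don't read as 0:
                                          # interference of order sqrt(k/d)
```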
We’ve experimented with similar loss functions at Apollo (almost the same as your loss actually, for APD), but we always found that ignoring inactive features makes the task unrealistically easy.