Nice work! Could you elaborate on how you think your findings relate to Engels et al.’s paper, “Decomposing The Dark Matter of Sparse Autoencoders”, which you cite? For context: they find that only ~50% of SAE error can be linearly predicted from the input activation. They hypothesize that the linearly predictable half contains features not yet learned, while the remaining half consists of errors introduced by the SAE itself rather than meaningful model features (@josh-engels, correct me if I’m misrepresenting the paper’s findings).
It would be interesting to see whether your restoration results come from the linear or the nonlinear component. If only the nonlinear part matters for restoration, then the explanation for why the error matters is pretty boring: it just prevents the reconstruction from being out of distribution.
However, if the restoration comes from the linear component, one hypothesis would be that the intermediate representations you describe live in multidimensional subspaces (kind of like high-frequency features) but have low norm. This would make them economically unattractive for SAEs to learn: little reconstruction gain because of their low norm, while consuming multiple dictionary elements because they are multidimensional.
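Concretely, here’s a minimal sketch of the decomposition I have in mind: fit a least-squares linear map from the input activation to the SAE error, and treat the residual as the nonlinear component. All names, shapes, and placeholder tensors below are illustrative assumptions, not Engels et al.’s actual code:

```python
import torch

# Illustrative shapes: x is a batch of residual-stream activations,
# x_hat the SAE reconstruction of x; both [n_samples, d_model].
x = torch.randn(10_000, 768)             # placeholder activations
x_hat = x + 0.1 * torch.randn_like(x)    # placeholder SAE reconstruction

error = x - x_hat                        # SAE error term

# Fit a linear map (with bias) predicting the error from the input
# activation by least squares.
X = torch.cat([x, torch.ones(x.shape[0], 1)], dim=1)
W = torch.linalg.lstsq(X, error).solution

linear_error = X @ W                     # linearly predictable component
nonlinear_error = error - linear_error   # residual "dark matter"

# Fraction of error variance that is linearly predictable
total_var = (error - error.mean(0)).pow(2).sum()
frac_linear = 1 - nonlinear_error.pow(2).sum() / total_var
print(f"linearly predictable fraction: {frac_linear.item():.2f}")
```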
Thank you Clément, your hypothesis about the linear component is quite intriguing! I read Josh’s Dark Matter paper a while ago, and I remember there were multiple versions floating around at the time, so I’d definitely like to revisit the latest one before responding in depth.
That said, I can comment based on your explanation already. My main motivation for the restoration experiment was to test the idea that, without the error nodes, the key features in the circuit simply can’t be computed. So when you say that the restoration effect may come from the non-linear component, do you mean that “the key features can’t be computed” might simply be because the pure SAE reconstruction distorts the activation-space geometry in some weird (non-linear) way, such that downstream components like MLPs, which would normally compute those features, now receive an OOD input and therefore fail?
That sounds quite plausible, if I’m understanding correctly. But I don’t immediately see how it would explain the fact that patching these error nodes has a non-trivial effect by itself (i.e. boosting the logit for the alternative verb completion), as if they contained some prompt-specific features/information. I think that’s mostly why we didn’t consider that possibility, but I’d definitely be excited to try ruling it out experimentally (maybe after revisiting the paper I’ll have some ideas).
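For concreteness, the error-node intervention I’m describing looks roughly like this. It’s a hedged sketch assuming a TransformerLens-style `model` and an `sae` on one layer’s residual stream; the helper, hook-point name, and layer index are illustrative, not our actual implementation:

```python
def run_with_sae_patch(model, sae, tokens, keep_error: bool):
    """Replace one layer's residual stream with its SAE reconstruction,
    optionally adding back the error term (the 'error node')."""
    def hook(act, hook):
        recon = sae(act)        # pure SAE reconstruction
        error = act - recon     # error node
        # In the cross-prompt patching variant, one would instead add an
        # error term cached from a different (counterfactual) prompt.
        return recon + error if keep_error else recon

    # Hook-point name is an assumption; pick the layer under study.
    return model.run_with_hooks(
        tokens, fwd_hooks=[("blocks.8.hook_resid_post", hook)]
    )
```

Comparing logits from `keep_error=True` vs `keep_error=False` isolates the causal effect of the error term, e.g. the boost to the alternative verb completion.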
Yes, that’s what I meant. I agree that the fact that patching the error boosts the alternative completion makes this explanation much weaker (although it could still be a combination of the two).
I think it’d be super interesting to expand your analysis to linear vs non-linear error, to understand which part matters and then explore why!
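Building on the two sketches above (again, all names are assumed and illustrative), that follow-up could add back only the linear or only the nonlinear error component and check which one restores the behaviour:

```python
import torch

def run_with_error_component(model, sae, tokens, W, component="linear"):
    """Patch in the SAE reconstruction plus only one component of the
    error, using the least-squares map W fit in the first sketch."""
    def hook(act, hook):
        recon = sae(act)
        error = act - recon
        # Linearly predictable part of the error, from the fitted map W.
        flat = act.reshape(-1, act.shape[-1])
        ones = torch.ones(flat.shape[0], 1, device=flat.device)
        linear = (torch.cat([flat, ones], dim=1) @ W).reshape(act.shape)
        part = linear if component == "linear" else error - linear
        return recon + part

    return model.run_with_hooks(
        tokens, fwd_hooks=[("blocks.8.hook_resid_post", hook)]
    )
```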