Nice work! I’m not sure I fully understand what the “gated-ness” is adding, i.e. what role the Heaviside step function is playing. What would happen if we did away with it? Namely, consider this setup:
Let $f$ and $\hat{x}$ be the encoder and decoder functions, as in your paper, and let $x$ be the model activation that is fed into the SAE.

The usual SAE reconstruction is $\hat{x}(f(x))$, which suffers from the shrinkage problem.
Now, introduce a new learned parameter $t \in \mathbb{R}^{d_{\text{sae}}}$, and define an “expanded” reconstruction $\hat{x}_{\text{exp}}(x) = \hat{x}(t \odot f(x))$, where $\odot$ denotes elementwise multiplication.
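In code, the setup I have in mind is something like the following (a minimal PyTorch sketch, not your implementation: the class name `RescaledSAE`, the parameter names `W_enc`/`W_dec`/`b_enc`/`b_dec`, the pre-encoder bias subtraction, and the choice to keep $t$ positive by storing it in log space are all my own guesses):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RescaledSAE(nn.Module):
    """The setup above: a standard SAE plus a learned per-feature
    rescaling t, with no Heaviside gate anywhere."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # The new learned parameter t, stored in log space so that
        # t = exp(log_t) stays positive (mirroring r_mag in the paper;
        # whether positivity is needed here is part of my question).
        self.log_t = nn.Parameter(torch.zeros(d_sae))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # f(x): the usual SAE encoder.
        return F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # x_hat(f): the usual SAE decoder.
        return f @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        x_hat = self.decode(f)                         # usual reconstruction
        x_hat_exp = self.decode(self.log_t.exp() * f)  # "expanded" reconstruction
        return f, x_hat, x_hat_exp
```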
Finally, take the loss to be:
$$\mathcal{L} = \|x - \hat{x}_{\text{frozen}}(f(x))\|_2^2 + \lambda \|f(x)\|_1 + \|x - \hat{x}_{\text{exp}}(x)\|_2^2,$$
where the frozen decoder copy $\hat{x}_{\text{frozen}}$ ensures the decoder gets no gradients from the first term. As I understand it, this is exactly the loss appearing in your paper. The only difference in the setup is the lack of the Heaviside step function.
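Sketching the loss under the same assumptions as above (`l1_coeff` plays the role of $\lambda$, and I implement the frozen decoder by detaching its parameters, which seemed like the simplest way to realize $\hat{x}_{\text{frozen}}$):

```python
def sae_loss(sae: RescaledSAE, x: torch.Tensor, l1_coeff: float) -> torch.Tensor:
    f = sae.encode(x)

    # Term 1: reconstruct from f(x) through a frozen copy of the decoder,
    # so the decoder parameters get no gradients from this term
    # (the encoder still does).
    x_hat_frozen = f @ sae.W_dec.detach() + sae.b_dec.detach()
    l_recon_frozen = (x - x_hat_frozen).pow(2).sum(-1).mean()

    # Term 2: L1 sparsity penalty on f(x).
    l_sparsity = l1_coeff * f.abs().sum(-1).mean()

    # Term 3: the expanded reconstruction x_hat(t ⊙ f(x)) through the
    # live decoder; this is the only term that trains the decoder and t.
    x_hat_exp = sae.decode(sae.log_t.exp() * f)
    l_recon_exp = (x - x_hat_exp).pow(2).sum(-1).mean()

    return l_recon_frozen + l_sparsity + l_recon_exp
```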
Did you try this setup? Or does it fail for an obvious reason I missed?
I’m a bit confused here. First, I take it that $\alpha$ labels coordinate patches? Second, consider the very simple case with $d = 2$ and $K(w) = w_1^2 + w_2^2$. What $g$ would put $K$ into the stated form?
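For concreteness, here is my guess at such a $g$ via the standard blow-up of the origin, in case I’m just misreading the notation:

$$g(u, v) = (u,\, uv), \qquad K(g(u, v)) = u^2 + u^2 v^2 = u^2\,(1 + v^2).$$

Since $1 + v^2$ is a strictly positive unit, absorbing it via $u' = u\sqrt{1 + v^2}$, $v' = v$ gives $K = (u')^2 (v')^0$, i.e. normal crossing form with $(k_1, k_2) = (1, 0)$; the symmetric chart $g(u, v) = (uv, v)$ covers the rest of the blow-up. Is that the intended construction?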