Hey, I was wondering whether you are using any weight decay when training the SAE? Since weight decay amounts to an implicit L2 penalty on the weights, it seems to me that it could also be the culprit. And if you aren't, how come the first layer doesn't collapse to 0 while the second layer grows to infinity? Thanks if you take the time to answer :)