What is the original SAE like, and why discard it? Because it’s co-evolved with the model, and therefore likely to seem more interpretable than it actually is?
The original SAE is actually quite good, and in my experiments with Gated SAEs I’m using its values. To frame this technique as “regularization”, though, I needed to show that the model weights themselves are affected, which is why the metrics in my graphs come from freshly trained SAEs rather than from the original one.