Jordan Taylor comments on SAE regularization produces more interpretable models

Jordan Taylor 31 Jan 2025 1:30 UTC
1 point
What is the original SAE like, and why discard it? Because it’s co-evolved with the model, and therefore likely to seem more interpretable than it actually is?
- Peter Lai 31 Jan 2025 17:15 UTC
  1 point
  The original SAE is actually quite good, and in my experiments with Gated SAEs I’m using its values directly. To frame this technique as a “regularization” technique, though, I needed to show that the model weights themselves are affected, which is why my graphs use metrics extracted from freshly trained SAEs.
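
For concreteness, here is a minimal sketch of what evaluating with a freshly trained SAE could look like, assuming a plain ReLU SAE with an L1 sparsity penalty trained by full-batch gradient descent in NumPy. It is not the setup from the post: the function names, the hyperparameters, and the random `acts` array standing in for activations collected from the frozen model are all illustrative assumptions.

```python
import numpy as np


def init_sae(d_model, n_latents, seed=0):
    """Randomly initialise a small ReLU SAE (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.1, (d_model, n_latents)),
        "b_enc": np.zeros(n_latents),
        "W_dec": rng.normal(0.0, 0.1, (n_latents, d_model)),
        "b_dec": np.zeros(d_model),
    }


def sae_forward(acts, p):
    """Encode activations into non-negative latents, then decode them."""
    pre = acts @ p["W_enc"] + p["b_enc"]
    latents = np.maximum(pre, 0.0)
    recon = latents @ p["W_dec"] + p["b_dec"]
    return pre, latents, recon


def train_fresh_sae(acts, n_latents, l1_coeff=1e-3, lr=0.01, steps=2000, seed=0):
    """Train a new SAE from scratch on fixed activations.

    Loss per sample: squared reconstruction error + l1_coeff * sum(latents).
    The model that produced `acts` is never updated here.
    """
    n, d_model = acts.shape
    p = init_sae(d_model, n_latents, seed)
    for _ in range(steps):
        pre, latents, recon = sae_forward(acts, p)
        d_recon = 2.0 * (recon - acts) / n                   # dLoss/d recon
        d_latents = d_recon @ p["W_dec"].T + l1_coeff / n    # adds L1 term
        d_pre = d_latents * (pre > 0)                        # ReLU gate
        grads = {
            "W_enc": acts.T @ d_pre,
            "b_enc": d_pre.sum(axis=0),
            "W_dec": latents.T @ d_recon,
            "b_dec": d_recon.sum(axis=0),
        }
        for name in p:
            p[name] -= lr * grads[name]
    return p


def sae_metrics(acts, p):
    """Mean squared reconstruction error and mean number of active latents."""
    _, latents, recon = sae_forward(acts, p)
    return {
        "mse": float(np.mean(np.sum((recon - acts) ** 2, axis=1))),
        "l0": float(np.mean(np.sum(latents > 0, axis=1))),
    }


# Stand-in for activations from the frozen, already-regularized model.
acts = np.random.default_rng(1).normal(size=(256, 8))

untrained = sae_metrics(acts, init_sae(8, 16))
fresh = train_fresh_sae(acts, n_latents=16)   # co-trained SAE is not used
metrics = sae_metrics(acts, fresh)
print(untrained, metrics)
```

The SAE that was trained alongside the model never enters this computation, so any difference in these metrics between two models has to come from the activations, and therefore from the model weights.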