Neel Nanda comments on Improving SAE’s by Sqrt()-ing L1 & Removing Lowest Activating Features

Neel Nanda 15 Mar 2024 17:31 UTC
2 points
2
1. Global Threshold—Let’s treat all features the same. Set all feature activations less than [0.1] to 0 (this is equivalent to adding a constant to the encoder bias).
The bolded part seems false? Thresholding maps 0.2 original act → 0.2 new act, while shifting the encoder bias down by 0.1 maps 0.2 original act → 0.1 new act. I.e., changing the encoder bias changes the value of every nonzero activation, while thresholding only affects small ones.
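A minimal sketch of the difference, assuming a ReLU encoder so that a positive activation equals its pre-activation. The helper names `threshold` and `shift_bias` are hypothetical and chosen only for illustration; they are not from the post.

```python
def relu(x):
    """Standard ReLU: clamp negative values to zero."""
    return max(x, 0.0)

def threshold(act, cutoff=0.1):
    """Global threshold: zero out activations below `cutoff`, keep the rest unchanged."""
    return act if act >= cutoff else 0.0

def shift_bias(act, shift=0.1):
    """Lower the encoder bias by `shift` (hypothetical helper).

    Assumes `act` is a positive post-ReLU activation, so the pre-activation
    equals `act` and the new activation is relu(act - shift).
    """
    return relu(act - shift)

acts = [0.05, 0.2, 1.0]
thresholded = [threshold(a) for a in acts]
shifted = [shift_bias(a) for a in acts]

for a, t, s in zip(acts, thresholded, shifted):
    print(f"act={a:.2f}  thresholded={t:.2f}  bias-shifted={s:.2f}")
```

Both operations zero out the 0.05 activation, but only the bias shift also shrinks the 0.2 and 1.0 activations, which is why the two are not equivalent.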
- Logan Riggs 15 Mar 2024 18:23 UTC
  2 points
  0
  Ah, you’re right. I’ve updated it.