[Question] SAE sparse feature graph using only residual layers

Jaehyuk Lim, 23 May 2024 13:32 UTC

0 points

Sparse Autoencoders (SAEs), Interpretability (ML & AI), Inner Alignment, AI

Does it make sense to extract a sparse feature graph for a behavior from only the residual-stream layers of GPT-2 small, or do we need the MLP and attention layers as well?
