[Question] SAE sparse feature graph using only residual layers
Does it make sense to extract a sparse feature graph for a behavior from only the residual-stream layers of GPT-2 small, or do we need all the MLP and attention layers as well?