Neel Nanda comments on [Interim research report] Taking features out of superposition with sparse autoencoders

Neel Nanda 15 Dec 2022 20:48 UTC
Dumb question: You say that your toy model generation process gets correlated features. But doesn’t it just get correlated feature probabilities? That is, once you know the probabilities of feature 1 and feature 2 being present, doesn’t knowing that feature 1 is actually present tell you nothing about feature 2?
- Lee Sharkey 16 Dec 2022 2:14 UTC
  That’s correct. ‘Correlated features’ could ambiguously mean “Feature x tends to activate when feature y activates” OR “When we generate feature direction x, its distribution is correlated with feature y’s”. I don’t know if both happen in LMs. The former almost certainly does. The second doesn’t really make sense in the context of LMs since features are learned, not sampled from a distribution.
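To make the difference between "correlated feature probabilities" and "feature x tends to activate when feature y activates" concrete, here is a minimal NumPy sketch. It is a hypothetical generation process, not the report's actual code. The two-feature setup, the covariance `cov`, the logit mean, the threshold, and the helper `presence_corr` are all illustrative assumptions. In case (a), the per-feature probabilities are drawn once from a correlated distribution, and then presence is sampled independently for each datapoint. In case (b), a correlated latent is drawn fresh for every datapoint, so the features actually co-activate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000
# Assumed covariance for the illustration: strong positive correlation.
cov = np.array([[1.0, 0.9], [0.9, 1.0]])

# (a) Hypothetical "correlated feature probabilities": two per-feature
# probabilities are drawn ONCE from a correlated distribution. Each datapoint
# then samples presence independently given those fixed probabilities.
logits = rng.multivariate_normal(mean=[-1.0, -1.0], cov=cov)
probs = 1.0 / (1.0 + np.exp(-logits))
present_a = rng.random((n_samples, 2)) < probs  # independent per feature

# (b) Correlated activations: a new correlated latent is drawn per datapoint,
# so feature 1 being present makes feature 2 more likely to be present.
latents = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n_samples)
present_b = latents > 1.0


def presence_corr(present):
    """Pearson correlation between the two features' presence indicators."""
    return np.corrcoef(present[:, 0].astype(float), present[:, 1].astype(float))[0, 1]


print(f"(a) correlated probabilities: corr = {presence_corr(present_a):.3f}")
print(f"(b) correlated activations:   corr = {presence_corr(present_b):.3f}")
```

Under these assumptions, case (a) gives a presence correlation near zero, which matches the point in the question: knowing the probabilities, one feature's presence tells you nothing about the other's. Case (b) gives a clearly positive correlation, which corresponds to the first meaning in the reply.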