Dumb question: You say that your toy model generation process gets correlated features. But doesn’t it just get correlated feature probabilities? That is, given that you know the probabilities of feature 1 and feature 2 being present, knowing that feature 1 is actually present tells you nothing about feature 2?
That’s correct. ‘Correlated features’ is ambiguous: it could mean “feature x tends to activate when feature y activates” or “when we generate feature direction x, its distribution is correlated with feature y’s”. I don’t know whether both happen in LMs. The former almost certainly does. The latter doesn’t really make sense in the context of LMs, since features are learned, not sampled from a distribution.
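To make the distinction concrete, here is a minimal sketch. It is not the toy model's actual generation code. It assumes each feature's presence is a Bernoulli draw, and every name and number in it (`p1`, `p2`, the 0.8/0.2 coupling) is made up for illustration. Case A draws the two features independently once their probabilities are fixed, which is the situation the question describes. Case B couples the activations directly, which is the "x tends to activate when y activates" reading.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # number of samples (arbitrary)

# Case A: probabilities are known and fixed (hypothetical values),
# and each feature is then drawn independently of the other.
p1, p2 = 0.3, 0.6
f1 = rng.random(n) < p1
f2 = rng.random(n) < p2
p_f2 = f2.mean()               # estimate of P(f2 present)
p_f2_given_f1 = f2[f1].mean()  # estimate of P(f2 present | f1 present)

# Case B: activations are coupled (hypothetical coupling strengths):
# g2 fires with probability 0.8 when g1 fired, and 0.2 otherwise.
g1 = rng.random(n) < p1
g2 = rng.random(n) < np.where(g1, 0.8, 0.2)
p_g2 = g2.mean()
p_g2_given_g1 = g2[g1].mean()

print(f"A: P(f2) = {p_f2:.3f}, P(f2 | f1) = {p_f2_given_f1:.3f}")
print(f"B: P(g2) = {p_g2:.3f}, P(g2 | g1) = {p_g2_given_g1:.3f}")
```

In case A the two estimates should agree up to sampling noise, so observing feature 1 tells you nothing about feature 2. In case B the conditional estimate should be clearly higher than the marginal one.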