I’m with @chanind: If elephant is fully represented by a sum of its attributes, then it’s quite reasonable to say that the model has no fundamental notion of an elephant in that representation. ...
This is not a load-bearing detail of the example. If you like, you can instead imagine a model that embeds 1000 animals in, e.g., a 100-dimensional subspace, made up of a 50-dimensional sub-subspace where the embedding directions correspond to 50 attributes and a 50-dimensional sub-subspace where the embeddings are just random.
This should still get you basically the same issues the original example did, I think. For any dictionary decomposition of the activations you pick, some of the circuits will end up looking like a horrible mess, even though they’re secretly taking in a very low-rank subspace of the activations that’d make sense to us if we looked at it. I should probably double-check that when I’m more awake, though.[1]
I think the central issue here is mostly just having some kind of non-random, ‘meaningful’ feature embedding geometry that the circuits care about, instead of random feature embeddings.
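For concreteness, here is a minimal NumPy sketch of the modified setup described above. The ambient width (512), the binary attribute vectors, the 20% attribute density, and all variable names are assumptions made up for illustration; none of them come from the original example. It also only shows one side of the claim: a circuit that reads a single attribute direction is rank one in activation space, but if the dictionary has one element per animal, that same circuit connects to every animal that has the attribute.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes from the comment; d_model is an assumed ambient width.
n_animals, n_attrs, d_rand, d_model = 1000, 50, 50, 512

# Orthonormal basis for a 100-dimensional subspace of activation space.
basis, _ = np.linalg.qr(rng.standard_normal((d_model, n_attrs + d_rand)))
attr_dirs = basis[:, :n_attrs]    # 50 directions, one per attribute
rand_basis = basis[:, n_attrs:]   # 50 directions for the "just random" part

# Assumption: each animal has a sparse binary attribute vector.
attrs = (rng.random((n_animals, n_attrs)) < 0.2).astype(float)

# Assumption: each animal's random component is a unit vector
# in the 50-dimensional random sub-subspace.
rand_coords = rng.standard_normal((n_animals, d_rand))
rand_coords /= np.linalg.norm(rand_coords, axis=1, keepdims=True)

# Animal embedding = attribute part + random part.
embeddings = attrs @ attr_dirs.T + rand_coords @ rand_basis.T

# A hypothetical circuit that only reads attribute 0: one direction.
w_attr0 = attr_dirs[:, 0]

# The same circuit expressed in a per-animal dictionary:
# one connection weight per animal feature.
per_animal_weights = embeddings @ w_attr0
n_animals_touched = int(np.count_nonzero(np.abs(per_animal_weights) > 1e-8))

print("rank of embedding matrix:", np.linalg.matrix_rank(embeddings))
print("animal features the attribute-0 circuit connects to:", n_animals_touched)
```

In a per-animal dictionary the circuit shows up as a couple hundred separate connections, even though it reads a single direction. Switching to an attribute dictionary would tidy up this circuit, but circuits that read the random sub-subspace would presumably then be the ones that look messy.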
EDIT: I am now more awake. I still think this is right.