Technically you didn’t specify that $c(x)$ can’t be an arbitrary function, so you’d be able to reconstruct activations combining different bases, but it’d be horribly convoluted in practice.
I wouldn’t even be too fussed about ‘horribly convoluted’ here. I’m saying it’s worse than that. We would still have a problem even if we allowed ourselves arbitrary encoder functions to define the activations in the dictionary and magically knew which ones to pick.
The problem here isn’t that we can’t make a dictionary that includes all the 1050 feature directions $\vec{f}$ as dictionary elements. We can do that. For example, while we can’t write
$$\vec{a}(x)=\sum_{i=1}^{1000}c_i(x)\,\vec{f}_i+\sum_{i=1}^{50}c'_i(x)\,\vec{f}'_i$$
because those sums each already equal $\vec{a}(x)$ on their own, we can write
$$\vec{a}(x)=\sum_{i=1}^{1000}\frac{c_i(x)}{2}\,\vec{f}_i+\sum_{i=1}^{50}\frac{c'_i(x)}{2}\,\vec{f}'_i.$$
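To spell out why the halved coefficients still reconstruct the activation, using only the fact that each sum equals $\vec{a}(x)$ on its own:
$$\sum_{i=1}^{1000}\frac{c_i(x)}{2}\,\vec{f}_i+\sum_{i=1}^{50}\frac{c'_i(x)}{2}\,\vec{f}'_i=\frac{1}{2}\,\vec{a}(x)+\frac{1}{2}\,\vec{a}(x)=\vec{a}(x).$$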
The problem is instead that we can’t make a dictionary that has the 1050 feature activations $c_i(x),c'_i(x)$ as the coefficients in the dictionary. This is bad because it means our dictionary activations cannot equal the scalar variables the model’s own circuits actually care about. They cannot equal the ‘features of the model’ in the sense defined at the start, the scalar features comprising its ontology. As a result, if we were to look at a causal graph of the model, using the 1050 halved dictionary feature activations we picked as the graph nodes, a circuit taking in the feature $c_{\text{elephant}}(x)$ through a linear read-off along the direction $\vec{f}_{\text{elephant}}$ would have edges in our graph connecting it to both the elephant direction, making up about 50% of the total contribution, and the fifty attribute directions, making up the remaining 50%. The same holds the other way around: any circuit reading in even a single attribute feature will have 1000 edges connecting it to all of the animal features[1], making up 50% of the total contribution. It’s the worst of both worlds. Every circuit looks like a mess now.
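To make that contribution split concrete, here is a minimal numerical sketch. The specific choices are purely illustrative and not from the setup above: an orthonormal basis of 50 attribute directions, 1000 animal directions that are each a random unit-norm mix of the attributes, and animal index 0 standing in for ‘elephant’.

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_attrs = 1000, 50          # toy sizes matching the 1000 animals + 50 attributes

# Attribute directions: an orthonormal basis of the 50-dimensional activation space.
f_attr = np.eye(n_attrs)                                   # row j is attribute direction f'_j

# Illustrative assumption: each animal direction is a fixed unit-norm mix of the attributes.
A = rng.normal(size=(n_animals, n_attrs))
A /= np.linalg.norm(A, axis=1, keepdims=True)
f_animal = A @ f_attr                                      # row i is animal direction f_i

# A sparse input x: a few active animals, with index 0 standing in for "elephant".
c_animal = np.zeros(n_animals)
c_animal[[0, 17, 512]] = [1.3, 0.7, 0.4]
c_attr = c_animal @ A                                      # attribute activations induced by x

# Both decompositions give the same activation, and so does the 1050-element
# dictionary with every coefficient halved.
a = c_animal @ f_animal
assert np.allclose(a, c_attr @ f_attr)
assert np.allclose(a, 0.5 * c_animal @ f_animal + 0.5 * c_attr @ f_attr)

# A circuit reading off c_elephant(x) along the elephant direction:
r = f_animal[0]
readoff = r @ a

# Contribution of each dictionary element to that read-off under the halved dictionary.
contrib_animal = 0.5 * c_animal * (f_animal @ r)           # one term per animal element
contrib_attr = 0.5 * c_attr * (f_attr @ r)                 # one term per attribute element

print("elephant element share:    ", contrib_animal[0] / readoff)
print("other animal elements:     ", (contrib_animal.sum() - contrib_animal[0]) / readoff)
print("all 50 attribute elements: ", contrib_attr.sum() / readoff)
```

In this sketch the attribute half of the dictionary carries exactly half of the read-off value, and the elephant element carries roughly the other half, up to interference from whichever other animals happen to be active.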
[1] Since the animals are sparse, in practice this usually means edges to a small set of different animals for every data point: whichever ones happen to be active at the time.