I think you're saying:

Sometimes it's simpler (fewer edges) to use the attributes (Cute) or animals (Bunny) or both (e.g. a particularly cute bunny). Assumption 3 doesn't allow mixing different bases together.
So here we have 2 attributes (for $\vec{f}_{\text{att}}$) & 4 animals (for $\vec{f}_{\text{animal}}$).
If the downstream circuit (let's assume a linear + ReLU) reads from the "Cute" direction, then:
1. If we are only using $\vec{f}_{\text{animal}}$: Bunny + Dolphin (interpretable, but add 100 more animals & it'll take a lot more work to interpret)
2. If we are only using $\vec{f}_{\text{att}}$: Cute (simple)
If a downstream circuit reads from the "Bunny" direction, then the reverse:
1. Only $\vec{f}_{\text{animal}}$: Bunny (simple)
2. Only $\vec{f}_{\text{att}}$: Cute + Furry (+ 48 more attributes makes it more complex)
However, what if there’s a particularly cute rabbit?
1. Only $\vec{f}_{\text{animal}}$: Bunny + 0.2*Dolphin(?) (+ many more animals)
2. Only $\vec{f}_{\text{att}}$: 2*Cute + Furry (+ many more attributes)
Neither of the above works! BUT what if we mixed them:
3. Bunny + 0.2*Cute (simple; see the numeric sketch below)
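Here's a minimal numeric sketch of that example. The vector assignments are my own made-up assumptions (e.g. Bunny = Cute + Furry, Dolphin = Cute), not anything specified in the post:

```python
import numpy as np

# Hypothetical toy: attribute directions are orthonormal, and each
# animal direction is the sum of the attributes that animal has.
cute  = np.array([1.0, 0.0])
furry = np.array([0.0, 1.0])
bunny   = cute + furry   # cute and furry
dolphin = cute           # cute, not furry

# "A particularly cute rabbit": nudge the activation along Cute.
a = bunny + 0.2 * cute

# 1. Animals only: Bunny + 0.2*Dolphin
print(np.allclose(a, bunny + 0.2 * dolphin))  # True
# 2. Attributes only: 1.2*Cute + Furry
print(np.allclose(a, 1.2 * cute + furry))     # True
# 3. Mixed bases: Bunny + 0.2*Cute (the simple description)
print(np.allclose(a, bunny + 0.2 * cute))     # True

# A downstream linear + ReLU circuit reading the Cute direction sees
# the same scalar no matter which description we chose:
print(max(0.0, cute @ a))                     # 1.2
```

All three decompositions reconstruct the same activation; they differ only in how many terms (edges) it takes to describe what the circuit reads.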
I believe you're claiming that something like APD would, when given the very cute rabbit input, activate the Bunny & Cute components (or whatever directions the model is actually using), which can be in different bases and so can't form a dictionary/basis. [1]
[1] Technically you didn't specify that $c(x)$ can't be an arbitrary function, so you'd be able to reconstruct activations by combining different bases, but it'd be horribly convoluted in practice.
I wouldn’t even be too fussed about ‘horribly convoluted’ here. I’m saying it’s worse than that. We would still have a problem even if we allowed ourselves arbitrary encoder functions to define the activations in the dictionary and magically knew which ones to pick.
The problem here isn't that we can't make a dictionary that includes all the 1050 feature directions $\vec{f}$ as dictionary elements. We can do that. For example, while we can't write

$$\vec{a}(x)=\sum_{i=1}^{1000} c_i(x)\,\vec{f}_i+\sum_{i=1}^{50} c'_i(x)\,\vec{f}'_i$$

because those sums each already equal $\vec{a}(x)$ on their own, we can write

$$\vec{a}(x)=\sum_{i=1}^{1000} \frac{c_i(x)}{2}\,\vec{f}_i+\sum_{i=1}^{50} \frac{c'_i(x)}{2}\,\vec{f}'_i.$$
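A quick numerical check of those two equations, under an assumed random construction of my own (orthonormal attribute directions, animals as sparse attribute mixtures):

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_att = 1000, 50

# 50 orthonormal attribute directions f'_i (rows), and 1000 animal
# directions f_i that are sparse mixtures of attributes, so both
# families span the same 50-dimensional space.
F_att = np.linalg.qr(rng.normal(size=(n_att, n_att)))[0]
has_att = (rng.random((n_animals, n_att)) < 0.1).astype(float)
F_animal = has_att @ F_att

# Sparse input: three active animals with random intensities.
c = np.zeros(n_animals)
c[rng.choice(n_animals, size=3, replace=False)] = rng.random(3)
a = c @ F_animal                 # a(x) = sum_i c_i(x) f_i

# Attribute feature activations: c'_i(x) = a(x) . f'_i
c_prime = a @ F_att.T
print(np.allclose(a, c_prime @ F_att))  # True: attributes alone already give a(x)

# Summing both full decompositions double-counts a(x)...
print(np.allclose(c @ F_animal + c_prime @ F_att, 2 * a))          # True
# ...while halving every coefficient gives a valid 1050-element dictionary:
print(np.allclose(0.5 * c @ F_animal + 0.5 * c_prime @ F_att, a))  # True
```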
The problem is instead that we can't make a dictionary that has the 1050 feature activations $c_i(x), c'_i(x)$ as the coefficients in the dictionary. This is bad because it means our dictionary activations cannot equal the scalar variables the model's own circuits actually care about. They cannot equal the 'features of the model' in the sense defined at the start, the scalar features comprising its ontology. As a result, if we were to look at a causal graph of the model, using the 1050 half-size dictionary feature activations we picked as the graph nodes, a circuit taking in the feature $c_{\text{elephant}}(x)$ through a linear read-off along the direction $\vec{f}_{\text{elephant}}$ would have edges in our graph connecting it to both the elephant direction, making up about 50% of the total contribution, and the fifty attribute directions, making up the remaining 50%. The same holds the other way around: any circuit reading in even a single attribute feature will have 1000 edges connecting it to all of the animal features[1], making up 50% of the total contribution. It's the worst of both worlds. Every circuit looks like a mess now.
[1] Since the animals are sparse, in practice this usually means edges to a small set of different animals for every data point, whichever ones happen to be active at the time.
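To make the 50/50 edge split concrete, here's a sketch (same assumed construction as above) of the edge attributions for a circuit that reads linearly along $\vec{f}_{\text{elephant}}$, using the halved dictionary coefficients as graph nodes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_att = 1000, 50

# Same hypothetical setup: orthonormal attributes, animals as sparse
# attribute mixtures ('elephant' is animal index 0).
F_att = np.linalg.qr(rng.normal(size=(n_att, n_att)))[0]
has_att = (rng.random((n_animals, n_att)) < 0.1).astype(float)
has_att[0, :3] = 1.0                     # ensure elephant has some attributes
F_animal = has_att @ F_att

# Input where only the elephant feature is active.
c_elephant = 1.7
a = c_elephant * F_animal[0]

# Downstream circuit: linear read-off along the elephant direction.
w = F_animal[0]
readoff = w @ a

# Edge weight of dictionary node j = (halved coefficient) * (w . direction).
edge_elephant = 0.5 * c_elephant * (w @ F_animal[0])  # the one active animal node
edges_att = 0.5 * (a @ F_att.T) * (F_att @ w)         # one edge per attribute node

print(edge_elephant / readoff)    # ~0.5: the elephant node carries half
print(edges_att.sum() / readoff)  # ~0.5: the 50 attribute nodes carry the rest
```

So even this single-feature circuit picks up edges to all fifty attribute nodes, exactly as described above.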