If I understand correctly, it sounds like you’re saying there is a “label” direction for each animal that’s separate from each of the attributes.
No, the animal vectors are all fully spanned by the fifty attribute features.
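Concretely, here is a minimal NumPy sketch of the toy setup as I am picturing it (the ambient dimension, the random construction, and every specific number below are my own illustrative choices, not anything from the post): each animal direction is a fixed linear combination of the fifty attribute directions, so the thousand animal directions add no dimensions beyond the fifty-dimensional attribute subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_animals, n_attributes = 512, 1000, 50   # d_model is an arbitrary illustrative choice

# Fifty orthonormal attribute directions spanning a 50-dimensional subspace.
attribute_dirs = np.linalg.qr(rng.normal(size=(d_model, n_attributes)))[0].T   # (50, d_model)

# Each animal direction is a fixed linear combination of the attribute directions;
# there is no separate per-animal "label" direction outside this subspace.
animal_profiles = rng.normal(size=(n_animals, n_attributes))   # attribute values defining each animal
animal_dirs = animal_profiles @ attribute_dirs                 # (1000, d_model)

# All one thousand animal directions lie inside the 50-dimensional attribute span.
print(np.linalg.matrix_rank(animal_dirs))   # 50, not 1000
```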
I’m confused about why a dictionary that consists of a feature direction for each attribute and each animal label can’t explain these activations. These activations are just a (sparse) sum of these respective features, namely an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal.
The animal features are sparse. The attribute features are not sparse.[1]
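And here is what the sparsity pattern looks like on a single input in that same toy construction (again, made-up numbers purely for illustration): only one of the thousand animal features is active, but a linear probe along each attribute direction reads off a nonzero value for essentially all fifty attributes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_animals, n_attributes = 512, 1000, 50
attribute_dirs = np.linalg.qr(rng.normal(size=(d_model, n_attributes)))[0].T
animal_profiles = rng.normal(size=(n_animals, n_attributes))
animal_dirs = animal_profiles @ attribute_dirs

# An input mentioning a single animal: the animal features are sparse
# (1 of 1000 nonzero), with some scalar intensity.
animal_idx, intensity = 123, 1.7
x = intensity * animal_dirs[animal_idx]

# The attribute features on this same input are dense: probing along each
# attribute direction gives a nonzero value for essentially every attribute.
attribute_readout = attribute_dirs @ x             # (50,)
print(np.sum(np.abs(attribute_readout) > 1e-6))    # ~50 of 50 attributes nonzero
```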
In this sense the activations are just the sum of the various elements of the dictionary, each multiplied by a magnitude, so it seems like you should be able to explain these activations using dictionary learning.
The magnitudes in a dictionary seeking to decompose the activation vector into these 1050 features will not be able to match the actual magnitudes of the features $c_i(x),\ i=1\ldots 1000$ and $c'_i(x),\ i=1\ldots 50$ as seen by linear probes and the network’s own circuits.
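To illustrate one facet of this, here is a sketch that uses plain L1-penalized sparse coding over the fixed set of 1050 candidate directions as a rough stand-in for dictionary learning (the penalty strength and all numbers are arbitrary choices of mine, and a trained SAE would differ in detail): on an activation containing a single animal, the sparsity penalty puts essentially all of the weight on that animal's atom and none on the attribute atoms, so the magnitudes the dictionary assigns to the attributes do not match the attribute magnitudes a linear probe reads off.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d_model, n_animals, n_attributes = 512, 1000, 50
attribute_dirs = np.linalg.qr(rng.normal(size=(d_model, n_attributes)))[0].T
animal_profiles = rng.normal(size=(n_animals, n_attributes))
animal_dirs = animal_profiles @ attribute_dirs
animal_dirs /= np.linalg.norm(animal_dirs, axis=1, keepdims=True)   # unit-norm atoms

# Activation containing one animal. The "true" feature magnitudes are:
#   c_i(x): 1.7 on animal 123, zero on the other 999 animals
#   c'_j(x): the fifty attribute values a linear probe reads off
animal_idx, intensity = 123, 1.7
x = intensity * animal_dirs[animal_idx]
true_attr_mags = attribute_dirs @ x                     # all ~50 entries nonzero

# A dictionary with 1050 atoms: 1000 animal directions plus 50 attribute directions.
dictionary = np.vstack([animal_dirs, attribute_dirs])   # (1050, d_model)

# Sparse decomposition of x over the dictionary (L1 penalty as a crude SAE stand-in).
coder = Lasso(alpha=1e-4, fit_intercept=False, max_iter=10_000)
coder.fit(dictionary.T, x)                              # columns of the design matrix = atoms
animal_coefs, attr_coefs = coder.coef_[:n_animals], coder.coef_[n_animals:]

print(animal_coefs[animal_idx])        # most of the true 1.7 (slightly shrunk by the penalty)
print(np.abs(attr_coefs).max())        # ~0: the dictionary assigns the attributes no magnitude
print(np.abs(true_attr_mags).max())    # clearly nonzero: the magnitudes a probe actually sees
```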
Is the idea that the 1000 animals and 50 attributes form an overcomplete basis, so that you can come up with infinitely many ways to span the space using these basis components?

No, that is not the idea.

[1] Relative to the animal features at least. They could still be sparse relative to the rest of the network if this 50-dimensional animal subspace is rarely used.
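For instance (an illustrative sketch assuming that only 2% of inputs are about an animal at all; that rate is a number I made up): each attribute would then be nonzero on only about 2% of inputs overall, even though it is nonzero on essentially every input where some animal is active.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_animals, n_attributes = 10_000, 1000, 50
p_animal = 0.02                       # assumed: only 2% of inputs mention an animal

animal_profiles = rng.normal(size=(n_animals, n_attributes))

# Attribute values per input: zero unless the input mentions some animal.
has_animal = rng.random(n_inputs) < p_animal
which_animal = rng.integers(n_animals, size=n_inputs)
attr_vals = np.where(has_animal[:, None], animal_profiles[which_animal], 0.0)

# Dense within the animal subspace, but sparse relative to the full input distribution.
print(np.mean(np.abs(attr_vals[has_animal]) > 0))   # ~1.0: every attribute active when an animal is
print(np.mean(np.abs(attr_vals) > 0))               # ~0.02: rarely active overall
```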
No, the animal vectors are all fully spanned by the fifty attribute features.
Is this just saying that there’s superposition noise, so everything is spanning everything else? If so, that doesn’t seem like it should conflict with being able to use a dictionary; dictionary learning should still work with superposition noise as long as the interference doesn’t get too massive.
The animal features are sparse. The attribute features are not sparse.
If you mean that the attributes are a basis in the sense that the neurons are a basis, then I don’t see how you can say there’s a unique “label” direction for each animal that’s separate from the underlying attributes, such that you can set any arbitrary combination of attributes (including all attributes turned on at once, or all turned off, since they’re not sparse) and still read off the animal label without interference. It seems like that would be like saying that the elephant direction = [1, 0, −1], but you can change all three of those numbers to arbitrary other values and it would still be the elephant direction.