Technically you didn’t specify that $c(x)$ can’t be an arbitrary function, so you’d be able to reconstruct activations combining different bases, but it’d be horribly convoluted in practice.
I wouldn’t even be too fussed about ‘horribly convoluted’ here. I’m saying it’s worse than that. We would still have a problem even if we allowed ourselves arbitrary encoder functions to define the activations in the dictionary and magically knew which ones to pick.
The problem here isn’t that we can’t make a dictionary that includes all the 1050 feature directions $\vec{f}$ as dictionary elements. We can do that. For example, while we can’t write
$$\vec{a}(x)=\sum_{i=1}^{1000}c_i(x)\,\vec{f}_i+\sum_{i=1}^{50}c'_i(x)\,\vec{f}'_i$$
because those sums each already equal $\vec{a}(x)$ on their own, we can write
$$\vec{a}(x)=\sum_{i=1}^{1000}\frac{c_i(x)}{2}\,\vec{f}_i+\sum_{i=1}^{50}\frac{c'_i(x)}{2}\,\vec{f}'_i.$$
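To spell out why the halved coefficients still reconstruct the activation, using only the fact that each sum equals $\vec{a}(x)$ on its own:
$$\sum_{i=1}^{1000}\frac{c_i(x)}{2}\,\vec{f}_i+\sum_{i=1}^{50}\frac{c'_i(x)}{2}\,\vec{f}'_i=\frac{1}{2}\,\vec{a}(x)+\frac{1}{2}\,\vec{a}(x)=\vec{a}(x).$$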
The problem is instead that we can’t make a dictionary that has the 1050 feature activations $c_i(x),c'_i(x)$ as the coefficients in the dictionary. This is bad because it means our dictionary activations cannot equal the scalar variables the model’s own circuits actually care about. They cannot equal the ‘features of the model’ in the sense defined at the start, the scalar features comprising its ontology. As a result, if we were to look at a causal graph of the model, using the 1050 halved dictionary feature activations we picked as the graph nodes, a circuit taking in the feature $c_{\text{elephant}}(x)$ through a linear read-off along the direction $\vec{f}_{\text{elephant}}$ would have edges in our graph connecting it to both the elephant direction, making up about 50% of the total contribution, and the fifty attribute directions, making up the remaining 50%. The same holds the other way around: any circuit reading in even a single attribute feature will have 1000 edges connecting it to all of the animal features[1], making up 50% of the total contribution. It’s the worst of both worlds. Every circuit looks like a mess now.
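To make that contribution split concrete, here is a minimal numerical sketch. The specific choices are purely illustrative and not from the setup above: an orthonormal basis of 50 attribute directions, 1000 animal directions that are each a random unit-norm mix of the attributes, and animal index 0 standing in for ‘elephant’.

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_attrs = 1000, 50          # toy sizes matching the 1000 animals + 50 attributes

# Attribute directions: an orthonormal basis of the 50-dimensional activation space.
f_attr = np.eye(n_attrs)                                   # row j is attribute direction f'_j

# Illustrative assumption: each animal direction is a fixed unit-norm mix of the attributes.
A = rng.normal(size=(n_animals, n_attrs))
A /= np.linalg.norm(A, axis=1, keepdims=True)
f_animal = A @ f_attr                                      # row i is animal direction f_i

# A sparse input x: a few active animals, with index 0 standing in for "elephant".
c_animal = np.zeros(n_animals)
c_animal[[0, 17, 512]] = [1.3, 0.7, 0.4]
c_attr = c_animal @ A                                      # attribute activations induced by x

# Both decompositions give the same activation, and so does the 1050-element
# dictionary with every coefficient halved.
a = c_animal @ f_animal
assert np.allclose(a, c_attr @ f_attr)
assert np.allclose(a, 0.5 * c_animal @ f_animal + 0.5 * c_attr @ f_attr)

# A circuit reading off c_elephant(x) along the elephant direction:
r = f_animal[0]
readoff = r @ a

# Contribution of each dictionary element to that read-off under the halved dictionary.
contrib_animal = 0.5 * c_animal * (f_animal @ r)           # one term per animal element
contrib_attr = 0.5 * c_attr * (f_attr @ r)                 # one term per attribute element

print("elephant element share:    ", contrib_animal[0] / readoff)
print("other animal elements:     ", (contrib_animal.sum() - contrib_animal[0]) / readoff)
print("all 50 attribute elements: ", contrib_attr.sum() / readoff)
```

In this sketch the attribute half of the dictionary carries exactly half of the read-off value, and the elephant element carries roughly the other half, up to interference from whichever other animals happen to be active.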
[1] Since the animals are sparse, in practice this usually means edges to a small set of different animals for every data point: whichever ones happen to be active at the time.