I think you're saying:

Sometimes it's simpler (fewer edges) to use the attributes (Cute) or animals (Bunny) or both (e.g. a particularly cute bunny). Assumption 3 doesn't allow mixing different bases together.
So here we have 2 attributes (for $\vec{f}_{\text{att}}$) & 4 animals (for $\vec{f}_{\text{animal}}$).
If the downstream circuit (let's assume a linear + ReLU) reads from the "Cute" direction, then:
1. If we are only using $\vec{f}_{\text{animal}}$: Bunny + Dolphin (interpretable, but add 100 more animals & it'll take a lot more work to interpret)
2. If we are only using $\vec{f}_{\text{att}}$: Cute (simple)
If a downstream circuit reads from the "Bunny" direction, then the reverse:
1. Only $\vec{f}_{\text{animal}}$: Bunny (simple)
2. Only $\vec{f}_{\text{att}}$: Cute + Furry (+ 48 more attributes makes it more complex)
However, what if there’s a particularly cute rabbit?
1. Only $\vec{f}_{\text{animal}}$: Bunny + 0.2*Dolphin(?) (+ many more animals)
2. Only $\vec{f}_{\text{att}}$: 2*Cute + Furry (+ many more attributes)
Neither of the above works! BUT what if we mixed them:
3. Bunny + 0.2*Cute (simple; see the numeric sketch below)
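Here's a minimal numeric sketch of that example. The vector assignments are my own made-up assumptions (e.g. Bunny = Cute + Furry, Dolphin = Cute), not anything specified in the post:

```python
import numpy as np

# Hypothetical toy: attribute directions are orthonormal, and each
# animal direction is the sum of the attributes that animal has.
cute  = np.array([1.0, 0.0])
furry = np.array([0.0, 1.0])
bunny   = cute + furry   # cute and furry
dolphin = cute           # cute, not furry

# "A particularly cute rabbit": nudge the activation along Cute.
a = bunny + 0.2 * cute

# 1. Animals only: Bunny + 0.2*Dolphin
print(np.allclose(a, bunny + 0.2 * dolphin))  # True
# 2. Attributes only: 1.2*Cute + Furry
print(np.allclose(a, 1.2 * cute + furry))     # True
# 3. Mixed bases: Bunny + 0.2*Cute (the simple description)
print(np.allclose(a, bunny + 0.2 * cute))     # True

# A downstream linear + ReLU circuit reading the Cute direction sees
# the same scalar no matter which description we chose:
print(max(0.0, cute @ a))                     # 1.2
```

All three decompositions reconstruct the same activation; they differ only in how many terms (edges) it takes to describe what the circuit reads.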
I believe you're claiming that something like APD would, when given the very cute rabbit input, activate the Bunny & Cute components (or whatever directions the model is actually using), which can be in different bases and so can't form a dictionary/basis. [1]
[1] Technically you didn't specify that $c(x)$ can't be an arbitrary function, so you'd be able to reconstruct activations by combining different bases, but it'd be horribly convoluted in practice.
I wouldn’t even be too fussed about ‘horribly convoluted’ here. I’m saying it’s worse than that. We would still have a problem even if we allowed ourselves arbitrary encoder functions to define the activations in the dictionary and magically knew which ones to pick.
The problem here isn't that we can't make a dictionary that includes all the 1050 feature directions $\vec{f}$ as dictionary elements. We can do that. For example, while we can't write

$$\vec{a}(x)=\sum_{i=1}^{1000} c_i(x)\,\vec{f}_i+\sum_{i=1}^{50} c'_i(x)\,\vec{f}'_i$$

because those sums each already equal $\vec{a}(x)$ on their own, we can write

$$\vec{a}(x)=\sum_{i=1}^{1000} \frac{c_i(x)}{2}\,\vec{f}_i+\sum_{i=1}^{50} \frac{c'_i(x)}{2}\,\vec{f}'_i.$$
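A quick numerical check of those two equations, under an assumed random construction of my own (orthonormal attribute directions, animals as sparse attribute mixtures):

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_att = 1000, 50

# 50 orthonormal attribute directions f'_i (rows), and 1000 animal
# directions f_i that are sparse mixtures of attributes, so both
# families span the same 50-dimensional space.
F_att = np.linalg.qr(rng.normal(size=(n_att, n_att)))[0]
has_att = (rng.random((n_animals, n_att)) < 0.1).astype(float)
F_animal = has_att @ F_att

# Sparse input: three active animals with random intensities.
c = np.zeros(n_animals)
c[rng.choice(n_animals, size=3, replace=False)] = rng.random(3)
a = c @ F_animal                 # a(x) = sum_i c_i(x) f_i

# Attribute feature activations: c'_i(x) = a(x) . f'_i
c_prime = a @ F_att.T
print(np.allclose(a, c_prime @ F_att))  # True: attributes alone already give a(x)

# Summing both full decompositions double-counts a(x)...
print(np.allclose(c @ F_animal + c_prime @ F_att, 2 * a))          # True
# ...while halving every coefficient gives a valid 1050-element dictionary:
print(np.allclose(0.5 * c @ F_animal + 0.5 * c_prime @ F_att, a))  # True
```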
The problem is instead that we can't make a dictionary that has the 1050 feature activations $c_i(x), c'_i(x)$ as the coefficients in the dictionary. This is bad because it means our dictionary activations cannot equal the scalar variables the model's own circuits actually care about. They cannot equal the 'features of the model' in the sense defined at the start, the scalar features comprising its ontology. As a result, if we were to look at a causal graph of the model, using the 1050 half-size dictionary feature activations we picked as the graph nodes, a circuit taking in the feature $c_{\text{elephant}}(x)$ through a linear read-off along the direction $\vec{f}_{\text{elephant}}$ would have edges in our graph connecting it to both the elephant direction, making up about 50% of the total contribution, and the fifty attribute directions, making up the remaining 50%. The same holds the other way around: any circuit reading in even a single attribute feature will have 1000 edges connecting it to all of the animal features[1], making up 50% of the total contribution. It's the worst of both worlds. Every circuit looks like a mess now.
[1] Since the animals are sparse, in practice this usually means edges to a small set of different animals for every data point, whichever ones happen to be active at the time.
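To make the 50/50 edge split concrete, here's a sketch (same assumed construction as above) of the edge attributions for a circuit that reads linearly along $\vec{f}_{\text{elephant}}$, using the halved dictionary coefficients as graph nodes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_att = 1000, 50

# Same hypothetical setup: orthonormal attributes, animals as sparse
# attribute mixtures ('elephant' is animal index 0).
F_att = np.linalg.qr(rng.normal(size=(n_att, n_att)))[0]
has_att = (rng.random((n_animals, n_att)) < 0.1).astype(float)
has_att[0, :3] = 1.0                     # ensure elephant has some attributes
F_animal = has_att @ F_att

# Input where only the elephant feature is active.
c_elephant = 1.7
a = c_elephant * F_animal[0]

# Downstream circuit: linear read-off along the elephant direction.
w = F_animal[0]
readoff = w @ a

# Edge weight of dictionary node j = (halved coefficient) * (w . direction).
edge_elephant = 0.5 * c_elephant * (w @ F_animal[0])  # the one active animal node
edges_att = 0.5 * (a @ F_att.T) * (F_att @ w)         # one edge per attribute node

print(edge_elephant / readoff)    # ~0.5: the elephant node carries half
print(edges_att.sum() / readoff)  # ~0.5: the 50 attribute nodes carry the rest
```

So even this single-feature circuit picks up edges to all fifty attribute nodes, exactly as described above.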