[edit: I’m now thinking that the optimal probe vector is actually also orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$, so maybe the point doesn’t stand. In general, I think it is probably a mistake to talk about activation vectors as linear combinations of feature vectors, rather than as vectors that can be projected onto a set of interpretable read-off directions. See here for more.]
Yes, I’m calling the representation vector the same as the probing vector. Suppose my activation vector can be written as $\vec{a} = \sum_i f_i \vec{v}_i$, where the $f_i$ are feature values and the $\vec{v}_i$ are feature representation vectors. Then the probe vector which minimises MSE (explains the most variance) is just $\vec{v}_i$. To avoid off-target effects, the vector $\vec{s}_i$ you want to steer with for feature $i$ might be the most ‘surgical’ one: it changes the value of this feature and no others. In that case it should be the vector that lies orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$, which coincides with $\vec{v}_i$ only if the set $\{\vec{v}_i\}$ is orthogonal.
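To make the probe/steering distinction concrete, here is a minimal numpy sketch for the invertible, non-overcomplete case (names and dimensions are illustrative). The dual vectors $\vec{w}_i$ (rows of $V^{-1}$) satisfy $\vec{w}_i \cdot \vec{v}_j = \delta_{ij}$, so each is orthogonal to $\mathrm{span}\{\vec{v}_j \mid j \neq i\}$; which direction counts as ‘surgical’ then depends on whether features are read off with the $\vec{w}_j$ or with the $\vec{v}_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Columns of V: linearly independent but deliberately non-orthogonal
# feature representation vectors v_i.
V = rng.normal(size=(d, d))

f_true = rng.normal(size=d)   # feature values f_i
a = V @ f_true                # activation a = sum_i f_i v_i

# Dual basis: row w_i of inv(V) satisfies w_i . v_j = delta_ij,
# so w_i is orthogonal to span{v_j : j != i}.
W = np.linalg.inv(V)

# Probing with the dual vectors recovers the feature values exactly.
assert np.allclose(W @ a, f_true)

# Under dual read-off, steering along v_0 itself changes only feature 0.
delta = W @ (a + 2.0 * V[:, 0]) - f_true
assert np.allclose(delta, 2.0 * np.eye(d)[0])

# If instead features were read off by naive dot products with the v_j,
# the surgical steering direction would be w_0 (orthogonal to the other v_j).
delta_naive = V.T @ (a + 2.0 * W[0]) - V.T @ a
assert np.allclose(delta_naive, 2.0 * np.eye(d)[0])
```

The point of the sketch is that probing and steering pick out dual objects: whichever family you probe with, the surgical steering direction comes from the other family, and the two coincide only when the $\{\vec{v}_i\}$ are orthogonal.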
Obviously I’m working with a non-overcomplete basis of feature representation vectors here. If we’re dealing with the overcomplete case, then it’s messier. People normally talk about ‘approximately orthogonal vectors’, in which case the most surgical steering vector $\vec{s}_i \approx \vec{v}_i$, but (handwaving) you can also talk about something like ‘approximately linearly independent vectors’, in which case I think my point stands (note that SAE decoder directions are definitely not approximately orthogonal). For something less handwavey, see this appendix.
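For intuition on the ‘approximately orthogonal’ regime, here is a quick illustrative check (the dimensions are arbitrary): random unit vectors in high dimension have pairwise dot products concentrated around $0$ with scale $\sim 1/\sqrt{d}$, which is what lets an overcomplete dictionary behave almost like an orthogonal one:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 1024, 4096  # overcomplete: n > d directions in R^d

# Random unit vectors in R^d are nearly orthogonal in high dimension.
U = rng.normal(size=(n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)

# Sample of pairwise dot products between distinct directions:
# typical magnitude is ~ 1/sqrt(d), i.e. small but nonzero.
cross = U[:100] @ U[100:200].T
print(np.abs(cross).mean(), np.abs(cross).max())
```

This is only the generic random-vector picture; as noted above, trained SAE decoder directions need not look like this, which is why the approximately-linearly-independent framing may matter.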
Makes sense—agreed!