I’m not very enlightened by what tokens most excite the component directions in a vacuum. Interpreting text models is hard.
Maybe something like network dissection could work? What I’d want is a dataset of text samples labeled by properties that you want to find features to track.
E.g. suppose you want features that track “calm text” vs. “upset text.” Then you want each snippet labeled as either calm or upset—or even better, you could collect a squiggly curve for how “calm” vs. “upset” labelers think the text is around any given token (maybe by showing them shorter snippets and then combining them into longer ones, or maybe by giving them a UI that lets them change levels of different features as changes happen in the text). And then you look for features that track that coarse-grained property of the text—that vary on a long timescale, in ways correlated with the variation of how calm/upset the text seems to humans.
And then you do that for a dozen or a gross long-term properties of text you think you might find features of.
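To make that concrete, here is a rough sketch of the search step (the array names, shapes, and smoothing window are my own illustrative assumptions, not anything from the post): given per-token calm/upset ratings and per-token activations for each learned feature, you could smooth both to the timescale you care about and rank features by how well their traces track the human curve.

```python
import numpy as np

def rank_features_by_label_tracking(activations, label_curve, smooth_window=32):
    """Rank feature directions by how well they track a human-labeled property.

    activations  : (n_tokens, n_features) feature activations over a labeled corpus.
    label_curve  : (n_tokens,) human ratings per token (e.g. 0 = calm ... 1 = upset).
    smooth_window: moving-average width in tokens, so we compare slow variation
                   rather than token-by-token noise (the "long timescale" part).
    """
    kernel = np.ones(smooth_window) / smooth_window

    def smooth(x):
        return np.convolve(x, kernel, mode="same")

    smoothed_labels = smooth(label_curve)
    scores = np.empty(activations.shape[1])
    for j in range(activations.shape[1]):
        # Pearson correlation between the smoothed activation trace and the label curve.
        scores[j] = np.corrcoef(smooth(activations[:, j]), smoothed_labels)[0, 1]

    order = np.argsort(-np.abs(scores))  # strongest correlation (either sign) first
    return order, scores

# Toy usage with random stand-ins for the real labels and activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(10_000, 512))   # hypothetical per-token feature activations
ratings = rng.uniform(size=10_000)      # hypothetical per-token calm/upset curve
top, corr = rank_features_by_label_tracking(acts, ratings)
print(top[:10], corr[top[:10]])
```

You would then repeat the ranking for each of the other labeled properties and inspect the top-scoring features by hand.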
I agree that stronger, more nuanced interpretability techniques should tell you more. But, when you see something like, e.g.,
isn’t it pretty obvious what those two autoencoder neurons were each doing?
certainly more so than

It does seem obvious[1], but I think this can easily be misleading. Are these activation directions always looking for these tokens regardless of context, or are they detecting the human-obvious theme they seem to be gesturing towards, or are they playing a more complicated functional role that merely happens to be activated by those tokens in the first position?

E.g. Is the “▁vs, ▁differently, ▁compared” direction just a brute detector for those tokens? Or is it a more general detector for comparison and counting that would have rich but still human-obvious behavior on longer snippets? Or is it part of a circuit that needs to detect comparison words but is actually doing something totally different like completing discussions about shopping lists?
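One cheap way to start pulling these hypotheses apart would be a contrast test: score the candidate direction on snippets that contain the literal trigger tokens in bland contexts, on comparison-flavored snippets that avoid those tokens, and on unrelated controls. The sketch below is hypothetical; the probe sentences and the `feature_activation` stub are placeholders I made up, and the stub would need to be replaced by a real forward pass that projects the model's activations onto the direction.

```python
from statistics import mean

# Stub: replace with a real forward pass that projects the residual stream (or the
# autoencoder's hidden layer) onto the candidate direction and returns, say, the max
# activation over the snippet's tokens. A constant is returned here only so the
# sketch runs end to end.
def feature_activation(text: str) -> float:
    return 0.0

# (a) The literal trigger tokens dropped into otherwise bland contexts.
token_probes = [
    "I went to the store vs the market.",
    "She phrased it differently this time.",
    "The results were compared across runs.",
]

# (b) Comparison-flavored passages that avoid "vs", "differently", and "compared".
theme_probes = [
    "Method A edges out method B on every benchmark we tried.",
    "Unlike last year's model, this one handles long documents with ease.",
    "The two proposals resemble each other far more than either side admits.",
]

# (c) Unrelated controls, to anchor the baseline.
control_probes = [
    "The recipe calls for two cups of flour and a pinch of salt.",
    "We left for the airport just after sunrise.",
]

for name, probes in [("tokens", token_probes), ("theme", theme_probes), ("control", control_probes)]:
    print(name, mean(feature_activation(t) for t in probes))
```

A brute token detector should light up only on the first set, while a genuine comparison feature should also fire on the second; a stranger functional role could look like either, and would probably need causal checks (ablation or patching) rather than this kind of surface scan.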