Charlie Steiner comments on Sparse Coding, for Mechanistic Interpretability and Activation Engineering

Charlie Steiner 26 Sep 2023 1:23 UTC
4 points
It does seem obvious^[1], but I think this can easily be misleading. Are these activation directions always looking for these tokens regardless of context, or are they detecting the human-obvious theme they seem to be gesturing towards, or are they playing a more complicated functional role that merely happens to be activated by those tokens in the first position?
E.g., is the “▁vs, ▁differently, ▁compared” direction just a brute detector for those tokens? Or is it a more general detector for comparison and counting that would have rich but still human-obvious behavior on longer snippets? Or is it part of a circuit that needs to detect comparison words but is actually doing something totally different, like completing discussions about shopping lists?
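One rough way to start separating the first hypothesis from the other two is to ask how much of the direction's activation is predicted by token identity alone. The sketch below is only an illustration: the function name `variance_explained_by_token` and the toy numbers are made up for this comment, and the activations are assumed to be scalar projections onto the direction that you have already collected across many contexts.

```python
from collections import defaultdict


def variance_explained_by_token(tokens, activations):
    """Fraction of activation variance explained by token identity alone.

    tokens:      one token string per position (hypothetical input format)
    activations: projection onto the direction at each of those positions
    Near 1.0 is consistent with a brute token detector; much lower values
    mean the same token fires differently depending on context.
    """
    if not tokens or len(tokens) != len(activations):
        raise ValueError("tokens and activations must be non-empty and equal length")

    grand_mean = sum(activations) / len(activations)
    total = sum((a - grand_mean) ** 2 for a in activations)
    if total == 0:
        return 1.0  # constant activations: nothing left to explain

    # Group activations by token and sum the squared deviations
    # from each token's own mean (the within-token variation).
    by_token = defaultdict(list)
    for tok, act in zip(tokens, activations):
        by_token[tok].append(act)

    within = 0.0
    for values in by_token.values():
        mean = sum(values) / len(values)
        within += sum((v - mean) ** 2 for v in values)

    return 1.0 - within / total


# Toy data, not real measurements.
toks = ["▁vs", "▁vs", "▁the", "▁the"]
brute = variance_explained_by_token(toks, [2.0, 2.0, 0.0, 0.0])
contextual = variance_explained_by_token(toks, [4.0, 0.0, 4.0, 0.0])
print(brute, contextual)
```

A high score only rules out context dependence on the data you sampled, and tokens that appear once are trivially "explained", so this says nothing about the third possibility (a direction whose functional role is something else entirely). That would need interventions rather than correlations.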
1. ^ certainly more so than “31892 ▁she, bian, ▁recently, ▁means, ▁Because, ▁experienced”