David Scott Krueger (formerly: capybaralet) comments on “Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability

David Scott Krueger (formerly: capybaralet) 1 Nov 2022 10:52 UTC
LW: 7 AF: 3
I think that it implicitly uses the words ‘mechanistic interpretability’ differently than people typically do.
I disagree. I think in practice people say mechanistic interpretability all the time and almost never say these other more specific things. This feels a bit like moving the goalposts to me. And I already said in the caveats that it could be useful even if the most ambitious version doesn’t pan out.
For instance, we don’t need to understand how models predict whether the next token is ‘ is’ or ‘ was’ in order to be able to gain some signal on whether or not the model is lying with interp.
This statement is almost trivially true, but we likely disagree on how much signal. It seems like much of mechanistic interpretability is predicated on something like weak tractability (e.g. that we can understand what deep networks are doing via simple modular/abstract circuits). I disagree with this, and think that we probably do need to understand “how models predict whether the next token is ‘ is’ or ‘ was’” to determine if a model was “lying” (whatever that means...).
But to the extent that weak/strong tractability are true, this should also make us much more optimistic about engineering modular systems. That is the main point of the post.