Strong tractability: We can build interpretable AGI-level systems without sacrificing too much competitiveness.
Interesting argument! I think my main pushback would be on clarifying exactly what “interpretable” means here. If you mean “we reverse engineer a system so well, and understand it so clearly, that we can use this understanding to build the system from scratch ourselves”, then I find your argument somewhat plausible, but I also think it’s pretty unlikely that we live in that world. My personal definition of strong tractability would be something like “AGI-level systems are made up of interpretable pieces, which correspond to understandable concepts. We can localise any model behaviour to the combination of several of these pieces, and understand the computation by which they fit together to produce that behaviour”. I think this still seems pretty hard, and probably not true! And if it is true, it would be a massive win for alignment. But even in that world, I think it’s reasonable to expect us to be unable to define these pieces and figure out how to assemble them ourselves; there’s likely to be a lot of complexity and subtlety in exactly which pieces form and why, how they’re connected together, etc., which seems much more easily done by a big-blob-of-compute-style approach than by human engineering.
I agree it’s a spectrum. I would put it this way:
For any point on the spectrum there is some difficulty in achieving it.
We can approach that point from either direction: (1) starting with a “big blob of compute” and running into the difficulty of extracting these pieces from the blob, or (2) starting by assembling the pieces ourselves and running into the difficulty of figuring out how to assemble them (a toy sketch of direction (2) follows below).
It’s not at all clear that (1) would be easier than (2).
Probably it’s best to do some of both.
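To make direction (2) concrete, here is a minimal, purely illustrative Python sketch: a toy behaviour hand-assembled out of named, interpretable pieces. The task, the piece boundaries, and the combining logic are all hypothetical choices made up for this example, not anything proposed above; the point is just that in (2) the pieces are explicit from the start, whereas in (1) the same behaviour would sit implicitly in trained weights and would have to be extracted afterwards.

```python
# Illustrative sketch of direction (2): hand-assembling a behaviour out of
# named, interpretable pieces. The task and the piece boundaries are
# hypothetical choices made purely for illustration.

NEGATIONS = {"not", "never", "no"}
POSITIVE = {"good", "great", "useful"}

def detect_negation(tokens):
    """Piece 1: does the sentence contain a negation word?"""
    return any(t in NEGATIONS for t in tokens)

def detect_positive(tokens):
    """Piece 2: does the sentence contain a positive-sentiment word?"""
    return any(t in POSITIVE for t in tokens)

def sentiment(tokens):
    """Combiner: call it positive unless the positive word is negated.

    Each piece corresponds to a human concept, and the behaviour is an
    explicit composition of those pieces. The difficulty in direction (2)
    is working out which pieces, and which composition, actually suffice
    at scale; in direction (1) a trained network implements some version
    of this implicitly, and the difficulty is extracting the pieces from
    the weights afterwards.
    """
    if detect_positive(tokens) and not detect_negation(tokens):
        return "positive"
    return "not-positive"

print(sentiment("this is a good idea".split()))      # -> positive
print(sentiment("this is not a good idea".split()))  # -> not-positive
```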
Regarding the difficulty of (1) vs. (2): off the top of my head, there may be some sort of complexity-style argument that engineering, say, a circuit is harder than recognizing it. However, the DNN doesn’t hand us the circuit in an explicit form; we still need to extract it ourselves using interpretability techniques. So I’m not sure how I feel about this argument.
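To give a rough sense of what “recognizing” a circuit with interpretability techniques might look like, here is a toy, entirely hypothetical numpy sketch: a tiny network with a hand-planted AND unit, plus a zero-ablation sweep that localises the behaviour to that unit. In a real setting the weights would come from training and the circuit would not be known in advance; the sweep just stands in for the interpretability work the DNN doesn’t do for us.

```python
# Toy sketch of "recognizing" a circuit via zero-ablation. Entirely
# hypothetical setup: the circuit is hand-planted here so we can check that
# the sweep finds it; in practice the weights would come from training.
import numpy as np

rng = np.random.default_rng(0)

# 2 inputs -> 4 ReLU hidden units -> 1 sigmoid output.
# Hidden unit 2 is planted to compute roughly AND(x0, x1); the rest are noise.
W1 = rng.normal(scale=0.1, size=(4, 2))
b1 = rng.normal(scale=0.1, size=4)
W1[2] = [4.0, 4.0]
b1[2] = -6.0
w2 = np.array([0.05, -0.03, 3.0, 0.02])
b2 = -3.0

def forward(x, ablate=None):
    h = np.maximum(0.0, W1 @ x + b1)        # ReLU hidden layer
    if ablate is not None:
        h[ablate] = 0.0                      # zero-ablate one hidden unit
    return 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))

# The behaviour we care about: output is high only when both inputs are 1.
inputs = [np.array(x, dtype=float) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
targets = np.array([0.0, 0.0, 0.0, 1.0])

def behaviour_error(ablate=None):
    preds = np.array([forward(x, ablate) for x in inputs])
    return float(np.mean((preds - targets) ** 2))

print("no ablation :", round(behaviour_error(), 4))
for unit in range(4):
    print(f"ablate unit {unit}:", round(behaviour_error(unit), 4))
# Only ablating unit 2 wrecks the AND behaviour, localising it to that piece.
# Note that the sweep had to be run: the network never presented the circuit
# explicitly.
```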