Mechanistic Interpretability as Reverse Engineering (follow-up to “Cars and Elephants”)

I think the distinction I was trying to make in my previous post, “Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability, is basically the distinction between engineering and reverse engineering.

Reverse engineering is analogous to mechanistic interpretability; engineering is analogous to “well-founded AI” (to borrow Stuart Russell’s term).

So it seems worth exploring the pros and cons of these two approaches to understanding x-safety-relevant properties of advanced AI systems.

As a gross simplification,[1] we could view the situation this way:

  • Using deep learning approaches, we can build advanced AI systems that are not well understood. Better reverse engineering would make them better understood.

  • Using “well-founded AI” approaches, we can build AI systems that are well understood, but not as advanced. Better engineering would make them more advanced.

Under this view, these two approaches are working towards the same end from different starting points.

A few more thoughts:

  • Competitiveness arguments favor reverse engineering. Safety arguments favor engineering.

  • We don’t have to choose one. We can work from both ends, and look for ways to combine approaches.

  • I’m not sure which end is easier to start from. My intuition says that there is the same underlying difficulty that needs to be addressed regardless of where you start from,[2] but the perspective I’m presenting seems to suggest otherwise.

  • There may be a P vs. NP-style argument in favor of reverse engineering (roughly, that checking a proposed mechanistic interpretation is easier than building a well-understood system from scratch), but it seems likely to rely on unverifiable assumptions (e.g. that we will in fact reliably recognize good mechanistic interpretations).

  1. I know people will say that we don’t actually understand how “well-founded AI” approaches work any better. I don’t feel equipped to evaluate that claim beyond extremely simple cases, and don’t expect most readers are either.

  2. At least if your goal is to get something like an AGI system whose safety we have justified confidence in. This is perhaps too ambitious a goal.