“Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability

TL;DR: If we can build competitive AI systems that are interpretable, then (I argue via analogy) trying to extract such systems from messy deep learning systems seems less promising than engineering them directly.

ETA: here’s a follow-up post: Mechanistic Interpretability as Reverse Engineering (follow-up to “cars and elephants”)

Preliminaries:
Distinguish weak and strong tractability of (mechanistic) interpretability as follows:

  • Weak tractability: AGI-level systems are interpretable in principle, i.e. humans could, in practice, fully understand their workings given the right instructions and tools. This would be false if intelligence involves irreducible complexity, e.g. because it involves concepts that are not “crisp”, modular, or decomposable; for instance, there might not be a crisp conceptual core to various aspects of perception or to abstract concepts such as “fairness”.[1]

  • Strong tractability: We can build interpretable AGI-level systems without sacrificing too much competitiveness.

The claim:
If strong tractability is true, then mechanistic interpretability is likely not the best way to engineer safe, competitive AGI-level systems.

The analogy:
1) Suppose we have a broken-down car with some bad parts, and we want a car that is safe to drive. We could try to fix the car and replace the bad parts.
2) But we also have a perfectly functioning elephant. So instead, we could try to tinker with the elephant to understand how it works and make its behavior safer and more predictable.
I claim (2) is roughly analogous to mechanistic interpretability, and (1) to pursuing something more like what Stuart Russell seems to be aiming for: a neurosymbolic approach to AI safety based on modularity, probabilistic programming, and formal methods.[2]
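
To make the contrast a bit more concrete, here is a toy, hypothetical sketch (not from this post, and not using any real probabilistic-programming library) of what “interpretable by construction” might look like: the decision logic is an explicit, modular program whose individual pieces can be inspected, tested, and formally checked, rather than a learned function we must reverse engineer afterwards.

```python
# Toy illustration (hypothetical): an "interpretable by construction" decision module.
# Each piece of the logic is an explicit, separately auditable function, in contrast to
# recovering such structure from a trained network after the fact.

from dataclasses import dataclass


@dataclass
class Percept:
    obstacle_ahead: bool
    obstacle_distance_m: float  # estimated distance to the obstacle, in metres


def hazard_probability(p: Percept) -> float:
    """Explicit, human-readable hazard model: nearer obstacles are riskier."""
    if not p.obstacle_ahead:
        return 0.0
    return min(1.0, 5.0 / max(p.obstacle_distance_m, 0.1))


def decide(p: Percept, risk_threshold: float = 0.5) -> str:
    """Modular decision rule that can be verified independently of perception."""
    return "brake" if hazard_probability(p) > risk_threshold else "proceed"


print(decide(Percept(obstacle_ahead=True, obstacle_distance_m=3.0)))  # -> "brake"
```

Nothing in the sketch is meant to be realistic; the point is only that in this style of engineering, interpretability is a property of how the system is built, not something recovered afterwards by probing weights.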

Fleshing out the argument a bit more:
To the extent that strong tractability is true, there must be simple principles underlying intelligent behavior that we can recognize. If there are no such simple principles, then we shouldn’t expect mechanistic interpretability methods to yield safe, competitive systems. We already have many ideas about what some of those principles might be (from GOFAI and other areas). Why would we expect it to be easier to recognize and extract these principles from neural networks than to deliberately incorporate them into the way we engineer systems?

Epistemic status: I seem to be the biggest interpretability hater/skeptic I’ve encountered in the AI x-safety community. This is an argument I came up with a few days ago that seems to capture some of my intuitions, although it is hand-wavy. I haven’t thought about it much and spent only about an hour writing this, but I’m publishing it anyway because I don’t express my opinions publicly as often as I’d like, due to limited bandwidth.

Caveats: I made no effort to anticipate and respond to counter-arguments (e.g. “those methods aren’t actually more interpretable”). There are lots of different ways that interpretability might be useful for AI x-safety. It makes sense as part of a portfolio approach. It makes sense as an extra “danger detector” that might produce some true positives (even if there are a lot of false negatives), or as one of many hacks that might be stacked. I’m not arguing that Stuart Russell’s approach is clearly superior to mechanistic interpretability. But it seems like roughly the entire AI existential safety community is very excited about mechanistic interpretability and entirely dismissive of Stuart Russell’s approach, and this seems bizarre.

Unrelated bonus reason to be skeptical of interpretability (hopefully many more to come!): when you deploy a reasonably advanced system in the real world, it will likely recruit resources outside itself in various ways (e.g. the way people write things down on paper as a way of augmenting their memory), meaning that we will need to understand more than just the model itself, making the whole endeavor way less tractable.

  1. ^

    For what it’s worth, I think weak tractability is probably false, and this may be a bigger source of my skepticism about interpretability than the argument presented in this post.

  2. ^

    Perhaps well-summarized here, although I haven’t watched the talk yet: