“Gene”, “species”, and even “concept” are also fuzzy concepts, but despite that, we managed to substantially improve our understanding of the-things-in-the-world-they-point-to and the phenomena they interact with. Using these fuzzy concepts even made us realize how fuzzy they are, what the nature of their fuzziness is, and what other (more natural/appropriate/useful/reality-at-joint-carving) abstractions we might replace them with.[1] In other words, we can use fuzzy concepts as a ladder/provisional scaffold for understanding. Once our understanding is good enough, we may realize there’s a better foundation for the theory than the one that guided us to it in the first place. (See: Context of Discovery and Context of Justification.)
Or maybe interp could be useful for retargeting the search? This idea suggests that if we find a goal in a system, we can simply change the system’s goal and redirect it towards a better goal.
I think this is a promising quest, even if there are still difficulties:
One difficulty you don’t list is that it is not clear ex ante that the models we want to steer/retarget are going to have a “goal slot” or, more generally, something that could be used as a motivational API (a “telopheme” in Tsvi’s terminology; see the toy sketch below for what such a slot could look like). This does seem to be the case (at least to a significant extent) in the cases studied by Turner et al., but as you point out, the results from smaller models already fail to translate to/predict what we’re finding in bigger models (induction heads being a notable exception).
Instrumental convergence makes this problem even murkier. On the one hand, it may lead you to expect that the “goal part”/utility function of the agent will be separated from the rest in order to facilitate goal preservation. At the same time, (1) if this made it easier for us to steer/retarget the AI, then it would be advantageous for the AI to make this part of itself more obscure/less understandable to us; and (2) an AI need not have a clearly factored-out goal to be sufficiently smarter than humans to pose an x-risk (see Soares).
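To make the “goal slot” picture a bit more concrete, here is a minimal, purely illustrative sketch in PyTorch. It assumes the best case the retargeting idea hopes for: a toy policy whose goal really is a single, cleanly factored vector that the rest of the network conditions on. All names, shapes, and the architecture are invented for the illustration; nothing here is a claim about what real trained models look like inside.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim: int = 16, goal_dim: int = 8, n_actions: int = 4):
        super().__init__()
        # The hypothetical "goal slot": a single vector the rest of the network reads.
        self.goal = nn.Parameter(torch.randn(goal_dim))
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Broadcast the goal vector across the batch; the policy conditions on it.
        goal = self.goal.expand(obs.shape[0], -1)
        return self.net(torch.cat([obs, goal], dim=-1))

policy = GoalConditionedPolicy()
obs = torch.randn(2, 16)
print(policy(obs).argmax(dim=-1))  # actions chosen under the original goal

# "Retargeting the search": overwrite the goal slot and the same machinery now
# pursues something else. The open question is whether trained models expose
# anything like this slot at all, and whether we could find it if they did.
with torch.no_grad():
    policy.goal.copy_(torch.randn_like(policy.goal))
print(policy(obs).argmax(dim=-1))  # actions chosen under the new goal
```

If real models had anything like this, retargeting would reduce to locating and overwriting one component; the difficulties above are precisely that such a component may not exist, may be obscured, or may not be needed for the system to be dangerous.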
I am skeptical that we can gain radically new knowledge, things we did not already know, from the weights/activations/circuits of a neural network, especially considering how difficult it can be to learn things from English textbooks alone.
One way this could work is: if we have some background knowledge/theory of the domain the AI learns about, then the AI may learn some things that we didn’t know but that (conditional on sufficiently good transparency/interpretability/ELK)[2] we can extract from it in order to enrich our understanding.
The important question here is: will interp be better for that than more mundane/behavioral methods? Will there be something that interp finds that behavioral methods won’t find at all, or that interp finds more efficiently (for whatever measure of efficiency)?
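For concreteness, here is one very simple member of the family of techniques people have in mind when they talk about “extracting” what a model represents: a linear probe on cached activations. Everything below (the synthetic activations, the labels, the dimensions) is made up for illustration; it shows the general shape of the technique, not anyone’s actual pipeline, and it sidesteps the hard part, which is knowing what to probe for in the first place.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim, n_samples = 128, 512

# Stand-ins for activations cached from some layer of a trained model, and a
# binary property of the inputs that we suspect the model represents internally.
activations = torch.randn(n_samples, hidden_dim)
true_direction = torch.randn(hidden_dim)
labels = (activations @ true_direction > 0).float()

# A linear probe: if a simple linear readout of the activations predicts the
# property well, the information is at least linearly decodable from that layer.
probe = nn.Linear(hidden_dim, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(300):
    opt.zero_grad()
    loss = loss_fn(probe(activations).squeeze(-1), labels)
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = (probe(activations).squeeze(-1) > 0).float()
print(f"probe accuracy: {(preds == labels).float().mean().item():.2f}")
```

Note that a behavioral method could often get at the same property by just asking the model the right questions; the interesting cases for interp are the ones where no such behavioral handle exists.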
Total explainability of complex systems with great power is not sufficient to eliminate risks.
Also, a major theme of Inadequate Equilibria.
Obvious counterpoint: in many subdomains of many domains, you need a tight feedback loop with reality to make conceptual progress. Sometimes you need a very tight feedback loop to rapidly iterate on your hypotheses. Also, getting acquainted with low-level aspects of the system lets you develop some tacit knowledge that usefully guides your thinking about the system.
Obvious counter-counterpoint: interp is nowhere near the level of being useful for informing conceptual progress on the things that really matter for AInotkillingeveryone.
[1] My impression is that most biologists agree that the concept of “species” is “kinda fake”, but less so when it comes to genes and concepts.
[2] Which may mean much better than what we should expect to have in the next N years.