Barriers to Mechanistic Interpretability for AGI Safety

Link post

I gave a talk at MIT in March of this year on barriers to mechanistic interpretability (MI) being helpful for AGI/ASI safety, and on why, by default, it is likely to be net harmful. Several people seem to have been coming to similar conclusions recently (e.g., this recent post).

I discuss two major points (by no means exhaustive), one technical and one political, that stand in the way of MI addressing AGI risk:

  1. AGI cognition is interactive. AGI systems interact with their environment, learn online, and will externalize large parts of their cognition into that environment. To reason about such a system, you therefore also need a model of the environment. Worse still, AGI cognition is reflective, so you will need a model of cognition/learning as well.

  2. (Most) MI will lead to capabilities, not oversight. Institutions are not set up, and do not have the incentives, to resist using capability gains or to submit to monitoring and control.

That said, this opinion has more nuance than the two points above, and much of it is downstream of the lack of coordination and the downsides of publishing in an adversarial environment like the one we are in right now. I still regard the work done by e.g. Chris Olah’s team as brilliant but extremely early scientific work with steep epistemological hurdles still to overcome. Unfortunately, I also believe that, on net, such work is at the moment more useful as a safety-washing tool for AGI labs like Anthropic than as a real dent in existential risk concerns.

Here are the slides from my talk, and you can find the video here.