Barriers to Mechanistic Interpretability for AGI Safety

Connor Leahy 29 Aug 2023 10:56 UTC

LW: 63 AF: 19

Interpretability (ML & AI), AI, Conjecture (org)

I gave a talk at MIT in March this year on the barriers to mechanistic interpretability (MI) being helpful for AGI/ASI safety, and on why by default it will likely be net dangerous. Several people seem to be coming to similar conclusions recently (e.g., this recent post).

I discuss two major points (by no means exhaustive), one technical and one political, that present barriers to MI addressing AGI risk:

1. AGI cognition is interactive. AGI systems interact with their environment, learn online and will externalize massive parts of their cognition into the environment. If you want to reason about such a system, you also need a model of the environment. Worse still, AGI cognition is reflective, and you will also need a model of cognition/learning.
2. (Most) MI will lead to capabilities, not oversight. Institutions are not set up and do not have the incentives to resist using capabilities gains and submit to monitoring and control.

This being said, there are more nuances to this opinion, and a lot of it is downstream of the lack of coordination and the downsides of publishing in an adversarial environment like the one we are in right now. I still endorse the work done by e.g. Chris Olah’s team as brilliant but extremely early scientific work that has a lot of steep epistemological hurdles to overcome. Unfortunately, I also believe that, on net, work such as Olah’s is at the moment more useful as a safety-washing tool for AGI labs like Anthropic than for actually making a dent in existential risk concerns.

Here are the slides from my talk, and you can find the video here.


Gunnar_Zarncke 29 Aug 2023 17:12 UTC
9 points
1
“AGI systems interact with their environment, learn online and will externalize massive parts of their cognition into the environment.”
This comment is not about interpretability but a generalization of the question.
What is the AGI system and what is the environment? Where does the AGI system draw the boundary when reasoning about itself?
For humans, there is a clearer agent-environment distinction because we have bodies with a relatively clear physical boundary (though some people might already see their body as part of the environment and only count their brain or even their mind, however delineated). For AGI systems it is less clear. Is it the running software, the computers, the whole compute center, or even the organization keeping the machines running?
- Connor Leahy 30 Aug 2023 8:07 UTC
  8 points
  0
  Parent
  Yep, you see the problem! It’s tempting to think of an AI as “just the model” and study that in isolation, but that won’t be good enough in the long term.
  - mesaoptimizer 7 Sep 2023 13:25 UTC
    2 points
    −1
    Parent
    I see—you are implying that an AI model will leverage external system parts to augment itself. For example, a neural network would use an external scratch-pad as a different form of memory for itself. Or instantiate a clone of itself to do a certain task for it. Or perhaps use some sort of scaffolding.
    
    I think these concerns probably don’t matter for an AGI, because I expect that data transfer latency would be a non-trivial blocker for storing data outside the model itself, and it is more efficient to self-modify and improve one’s own intelligence than to use some form of ‘factored cognition’. Perhaps these things are issues for an ostensibly boxed AGI, and if that is the case, then this makes a lot of sense.
    - Connor Leahy 8 Sep 2023 8:59 UTC
      3 points
      0
      Parent
      I strongly disagree and do not think that is how AGI will look; AGI isn’t magic. But this is a crux, and I might be wrong, of course.
    - Noosphere89 7 Sep 2023 14:41 UTC
      2 points
      0
      Parent
      Yep, latency and performance are real killers for embodied-type cognition. I remember a tweet that suggested the entire Internet was not enough to train the model.
  - Gunnar_Zarncke 30 Aug 2023 14:25 UTC
    2 points
    0
    Parent
    It would be nice if the AGI saw the humans running its compute resources as part of its body that it wants to protect. The problem is that we humans also tamper with our bodies… Humans are like hair on the body of the AGI, and maybe it wants to shave and use a wig.
- DusanDNesic 29 Aug 2023 20:22 UTC
  5 points
  3
  Parent
  Even for humans—are my nails me? Once clipped, are they me? Is my phone me? I feel like my phone is more me than my hair, for example. Is my child me, are my memes me, is my country me, etc etc… There are many reasons why agent boundaries are problematic, and that problem continues in AI Safety research.
- Carl Feynman 29 Aug 2023 21:08 UTC
  4 points
  1
  Parent
  Even worse: existing AI systems can call systems under the control of other companies, can write their own software and call it, or can be called by systems that are not themselves AI. How do you ensure they are safe under all permutations of such activities?
  You could say “Well, don’t do that, then,” but that horse has left the barn.
- Logan Riggs 31 Aug 2023 16:39 UTC
  2 points
  0
  Parent
  Wait, I don’t understand this at all. For language models, the environment is the text. For different environments, those training datasets will be the environment.
  - Gunnar_Zarncke 31 Aug 2023 20:06 UTC
    2 points
    0
    Parent
    This is not primarily about LLMs, which are Simulators (see also Janus’ Simulators), but about more general systems—AGIs.
    - Logan Riggs 31 Aug 2023 23:26 UTC
      2 points
      0
      Parent
      I meant to cover this in the “for different environments” parts. Like if we self-play on certain games, we’ll still have access to those games.
scasper 29 Aug 2023 21:50 UTC
LW: 2 AF: 2
1
AF
“Several people seem to be coming to similar conclusions recently (e.g., this recent post).”
I’ll add that I have as well and wrote a sequence about it :)
MiguelDev 1 Sep 2023 1:20 UTC
1 point
0
I agree with Connor and Charbel’s post. The next step is to establish a new method for sharing results with safety-focused companies, groups, and independent researchers. This requires:
1. Developing a screening method for inclusion.
2. Tracking people within the network, which becomes challenging, especially if they are recruited by capabilities-focused companies.
Continuing this line of thought, we can’t be 100% sure that such a network will consistently serve its intended purpose. So, if anyone has insights that could improve this idea, I’d like to hear them.