What exactly do you mean by ambitious mech interp, and what does it enable? You focus on debugging here, but you didn’t title the post “an ambitious vision for debugging”, and indeed I think a vision for debugging would look quite different.
For example, you might say that the goal is to have “full human understanding” of the AI system, such that some specific human can answer arbitrary questions about the AI system (without just delegating to some other system). To this I’d reply that this seems like an unattainable goal; reality is very detailed, AIs inherit a lot of that detail, a human can’t contain all of it.
Maybe you’d say “actually, the human just has to be able to answer any specific question given a lot of time to do so”, so that the human doesn’t have to contain all the detail of the AI, and can just load in the relevant detail for a given question. To do this perfectly, you still need to contain the detail of the AI, because you need to argue that there’s no hidden structure anywhere in the AI that invalidates your answer. So I still think this is an unattainable goal.
Maybe you’d then say “okay fine, but come on, surely via decent heuristic arguments, the human’s answer can get way more robust than via any of the pragmatic approaches, even if you don’t get something like a proof”. I used to be more optimistic about this, but things like self-repair and negative heads make it hard in practice, not just in theory. Perhaps more fundamentally, if you’ve retreated this far back, it’s unclear to me why we’re calling this “ambitious mech interp” rather than “pragmatic interp”.
To be clear, I like most of the agendas in AMI and definitely want them to be a part of the overall portfolio, since they seem especially likely to provide new affordances. I also think many of the directions are more future-proof (i.e. more likely to generalize to future, very different AI systems). So it’s quite plausible that we don’t disagree much on what actions to take. I mostly just dislike gesturing at “it would be so good if we had <probably impossible thing>, let’s try to make it happen”.
I think it’s very good that OpenAI and Google DeepMind are pursuing complementary approaches in this sense.
As long as both teams keep publishing their advances, this sounds like a win-win: the whole field makes faster progress than it would if everyone followed the same general approach.
To be clear, this post is just my personal opinion, and is not necessarily representative of the beliefs of the OpenAI interpretability team as a whole.
Thanks, sure.
And I am simplifying quite a bit (the whole field is much larger anyway). Mostly, I mean that I hope people diversify rather than converge in their approaches to interpretability.