Thanks for writing this! I have been thinking about many of the issues in your Why Won’t Interpretability Be Reliable section lately, and mostly agree that this is the state of affairs. I often think of this from the perspective of the field of neuroscience. My experience there (in the subsection of neuro research that I believe is the most analogous to mech interp) is that these are basically the same fundamental issues that keep that field from progressing (though not the only ones).
Many in the interpretability field seem to (implicitly) think that if you took neuroscience and gave it much easier access to neural activities, the ability to arbitrarily intervene on the system, and the ability to easily run a lot more experiments, then all of neuroscience would be solved. From that set of beliefs it follows that, because neural networks don’t have these issues, mech interp will be able to more or less apply the current neuroscience approach to neural networks and “figure it all out.” While these points about ease of experiments and access to internals are important differences between neuro. research and mech. interp., I do not think they get past the fundamental issues. In other words: mech. interp. has more to learn from neuroscience’s failures than from its successes (public post/rant coming soon!).
Seeing this post from you makes me positively update about the ability of interp. to contribute to AI Safety—it’s important we see clearly the power and weaknesses of our approaches. A big failure mode I worry about is being overconfident that our interp. methods are able to catch everything, and then making decisions based on that overconfidence. One thing to do about such a worry is to put serious effort into understanding the limits of our approaches. This of course does happen to some degree already (e.g. there’s been a bunch of stress testing of SAEs from various places lately), which is great! I hope when decisions are made about safety/deployment/etc., that the lessons we’ve learned from those types of studies are internalized and brought to bear, alongside the positives about what our methods do let us know/monitor/control, and that serious effort continues to be made to understand what our approaches miss.
Thanks!
I would be very interested in this post; I’m looking forward to it.
I look forward to that post!