I’d disagree there. The idea here is that you tell which research ideas do and do not work by seeing if they let you do anything new. What failure looks like is totally different, it’s about TRAINING systems and the objectives you train for, not about iterating on finding the right interpretability techniques
Thanks for the response.

If I’m understanding you correctly, this is only true if you do not view the process of gathering researchers around a shared set of informational artifacts, including domain knowledge and software artifacts, as an optimization process similar in nature to training. I do view research progress in the field of ML as a form of genetic algorithm, and so I think “what failure looks like” does apply. The situation is more complicated with MI, since the focus there is on understanding models rather than improving their capabilities, but I still think “what failure looks like” applies. Maybe failure here looks like creating methods that let us tweak model behaviour, produce clear-seeming models, or make pretty visualizations, all without noticing that there is more going on than we are aware of.
I should note that my intuition here rests on assigning a high credence to the risk that “using prosaic methods to create misaligned superintelligence” is the path social forces will follow without conscious intervention. If you don’t assign much credence to that, then this concern might not make as much sense.
To offer a competing focus for staying empirically grounded: I prefer not to focus on what is easiest. “Easiest” should be avoided, and “useful for some task” should be regarded with suspicion, as though someone were letting the ROI incentive pervert the careful incentivization we need around the study of robust alignment for superintelligence. I feel the incentive structure surrounding AI alignment research must itself be carefully aligned. So instead, the focus should be on knowing true things about models. I would expect that to often be useful, but it is a distinct target, and not an easy one to build an incentivization structure for. Some related ideas from my mini review of your Concrete Steps to Get Started in Transformer Mechanistic Interpretability (thanks for writing that, by the way):
How to think about evidence in MI: Neel addresses this in various places throughout the article. Unfortunately, there is probably no easy answer, but here are some helpful hints and things to think about:
What techniques are used to find evidence?
How do we distinguish between true and false beliefs about models?
Look for flaws. Look for evidence to falsify your hypothesis, not just evidence to confirm it. Watch out for cherry-picking.
Use janky hacking! Make minor edits and see how things change: what does and doesn’t break a model’s behaviour? (A minimal sketch of this kind of edit is below.) Open-ended exploration can be useful for hypothesis formation, but don’t rely on non-rigorous techniques as if they were strong evidence.
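For concreteness, here is a minimal sketch of what a “janky” edit might look like, using TransformerLens and GPT-2 small. The layer, head, and prompt are arbitrary placeholders for illustration, not a claim about where any particular behaviour lives:

```python
# Sketch: zero-ablate one attention head and see how the model's prediction shifts.
# Assumes TransformerLens; the layer/head/prompt below are illustrative placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

def logit_diff(logits):
    # How strongly the model prefers " Mary" over " John" at the final position.
    mary = model.to_single_token(" Mary")
    john = model.to_single_token(" John")
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

clean_logits = model(tokens)

def zero_head(z, hook, head=9):
    # z has shape [batch, pos, head_index, d_head]; wipe out one head's output.
    z[:, :, head, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.9.attn.hook_z", zero_head)],
)

print("clean logit diff:  ", logit_diff(clean_logits))
print("ablated logit diff:", logit_diff(ablated_logits))
```

The point is not that this particular head matters; it is that edits like this are cheap to run and useful for forming hypotheses, with the caveat above that they are not, by themselves, strong evidence.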