Curated. While I don’t agree with every single positive claim advanced in the post (in particular, I’m less confident that chain-of-thought monitoring will survive to be a useful technique in the regime of transformative AI), this is an excellent distillation of the reasons for skepticism re: interpretability as a cure-all for identifying deceptive AIs. I also happen to think that those reasons generalize to many other agendas.
Separately, it’s virtuous to publicly admit to changing one’s mind, especially when the incentives are stacked the way they are—given Neel’s substantial role in popularizing interpretability as a research direction, I can only imagine this would have been harder for him to write than for many other people.
I agree it’s a good post, and it does take guts to tell people when you think that a research direction you’ve been championing hard actually isn’t the Holy Grail. This is a bit of a nitpick, but not an insubstantial one:
Neel is talking about interpretability in general, not just mech-interp. He claims to be accounting in his predictions for other non-mech interp approaches to interpretability that seem promising to some other researchers, such as representation engineering (RepE), which Dan Hendrycks among others has been advocating for recently.
This is true, though to be clear, I’m specifically making the point that interpretability will not be a highly reliable method on its own for establishing the lack of deception. While this is true of all current approaches, in my opinion, only mech interp people have ever seriously claimed that their approach might succeed to that degree.
Whoops, yes, thanks, edited.