Evan R. Murphy comments on Interpretability Will Not Reliably Find Deceptive AI

Evan R. Murphy 28 May 2025 4:49 UTC
3 points
1
I agree it’s a good post, and it does take guts to tell people when you think that a research direction that you’ve been championing hard actually isn’t the Holy Grail. This is a bit of a nitpick but not insubstantial:
Neel is talking about interpretability in general, not just mech-interp. He claims his predictions account for other, non-mech-interp approaches to interpretability that seem promising to some other researchers, such as representation engineering (RepE), which Dan Hendrycks among others has been advocating for recently.
- Neel Nanda 28 May 2025 8:37 UTC
  4 points
  0
  Parent
  This is true, though to be clear I’m specifically making the point that interpretability will not be a highly reliable method on its own for establishing the lack of deception. In my opinion this is true of all current approaches, but I think only mech interp people have ever seriously claimed that their approach might succeed this hard.
- RobertM 28 May 2025 6:18 UTC
  2 points
  0
  Parent
  Whoops, yes, thanks, edited.