Evan R. Murphy comments on Interpretability Will Not Reliably Find Deceptive AI

Evan R. Murphy 20 May 2025 20:53 UTC
2 points
Does representation engineering (RepE) seem like a game-changer for interpretability? I don’t see it mentioned in your post, so I’m trying to figure out if it is baked into your predictions or not.
It seemed like Apollo was able to spin up a pretty reliable strategic deception detector (95-99% accurate) using linear probes, even though the techniques are new. More generally, it sounds like RepE is getting traction on some things that have been a slog for mech interp. Does it look plausible that RepE could get us to high-reliability interpretability on workable timelines, or are we likely to hit similar walls with that approach?
Thanks for your post, Neel (and Gemini 2.5) - really important perspective on all this.
- Neel Nanda 20 May 2025 23:34 UTC
  3 points
  It’s baked into my predictions. I would be shocked if probes could get us to >99% confidence in detecting things out of distribution on new generations of models. Within well-studied domains, with a reasonable ground truth and on a well-studied model, maybe, though 99.9% would still be impressive. But models are super messy.