Neel Nanda comments on Interpretability Will Not Reliably Find Deceptive AI

Neel Nanda 20 May 2025 23:34 UTC
3 points
It’s baked into my predictions. I would be shocked if probes could get us to >99% confidence in detecting things out of distribution on new generations of models. Within well-studied domains, with reasonable ground truth, on a well-studied model, maybe, though 99.9% would still be impressive. But models are super messy.