Neel Nanda comments on Interpretability Will Not Reliably Find Deceptive AI

Neel Nanda 28 May 2025 8:37 UTC
4 points
This is true. To be clear, though, I'm specifically making the point that interpretability will not, on its own, be a highly reliable method for establishing the absence of deception. In my opinion the same is true of all current approaches, but I think only mech interp people have ever seriously claimed that their approach might succeed this hard.