I disagree re: the way we currently use "understand" — e.g. I think that SAE reconstructions have the potential to smuggle in lots of things via, e.g., the exact values of the continuous activations, latents that don't quite mean what we think, etc.
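To make the first worry concrete, here's a minimal toy sketch (untrained random weights, purely illustrative, not any real SAE) of how the continuous activation values can carry information that a latent-level story never mentions: two inputs can activate exactly the same set of latents, so any "this latent means X" interpretation treats them identically, while their reconstructions differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32

# Hypothetical toy SAE weights (random, untrained; for illustration only).
W_enc = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_model, d_sae))

def sae_reconstruct(x):
    latents = np.maximum(W_enc @ x + b_enc, 0.0)  # ReLU latent activations
    return latents, W_dec @ latents

x1 = rng.normal(size=d_model)
x2 = 1.5 * x1  # same direction in activation space, different magnitude

l1, r1 = sae_reconstruct(x1)
l2, r2 = sae_reconstruct(x2)

# The same latents are "active", so a latent-level interpretation
# ("these features fired") says the same thing about both inputs...
assert np.array_equal(l1 > 0, l2 > 0)

# ...yet the reconstructions differ: the exact continuous values carry
# extra information that the qualitative story quietly smuggles through.
assert not np.allclose(r1, r2)
```

This is the simplest version of the problem (magnitude alone); real SAEs have far more room for this kind of unaccounted-for information.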
It’s plausible that a future, stricter definition of "understand" fixes this, in which case I might agree? But I would still be concerned that 99.9% understanding involves a really long tail of heuristics, and I don’t know what may emerge from combining many things that individually make sense. And I’d probably put >0.1% on a superintelligence being able to adversarially smuggle things we don’t like into a system we think we understand.
Anyway, all that pedantry aside, my actual concern is tractability. If addressed, this seems plausibly helpful!