Neel Nanda comments on Interpretability Will Not Reliably Find Deceptive AI

Neel Nanda 5 May 2025 12:11 UTC
2 points
I’m not trying to comment on other theories of change in this post, so no disagreement there.