Buck comments on Interpretability Will Not Reliably Find Deceptive AI

Buck 5 May 2025 5:14 UTC
LW: 23 AF: 13
I agree in principle, but as far as I know, no interp explanation produced so far explains more than something like 20-50% of the (tiny) parts of the model it’s trying to explain (e.g. see the causal scrubbing results, or our discussion with Neel). See that dialogue with Neel for more on the question of how much of the model we understand.