Dusto comments on Interpretability Will Not Reliably Find Deceptive AI

Dusto 5 May 2025 23:03 UTC
3 points
0
Agreed. My comment was not a criticism of the post. I think the depth of deception makes interpretability nearly impossible in the sense that you are going to find deception triggers in nearly all actions as models become increasingly sophisticated.