I think making an AI literally incapable of imitating a deceptive human is likely impossible, and probably not desirable. What I care about is whether we could detect it actively scheming against its operators. And my post is focused solely on detection, not fixing (though obviously fixing is very important too).
Agreed. My comment was not a criticism of the post. I think the depth of deception makes interpretability nearly impossible, in the sense that you will find deception triggers in almost all actions as models become increasingly sophisticated.