I think making an AI literally incapable of imitating a deceptive human is likely impossible, and probably not desirable. What I care about is whether we could detect it actively scheming against its operators. And my post is focused solely on detection, not fixing (though obviously fixing is very important too).
Agreed. My comment was not a criticism of the post. I think the depth of deception makes interpretability nearly impossible, in the sense that you will find deception triggers in almost all actions as models become increasingly sophisticated.