Is it really a goal to have AI that is completely devoid of deception capability? I need to sit down and finish writing up something more thorough on this topic, but I feel like deception is one of those areas where “shallow” and “deep” versions of the capability are talked about interchangeably. The shallow versions are the easy-to-spot-and-catch deceptive acts that I think most are worried about. But deception as an overall capability set is considerably farther-reaching than the formal definition of “action with the intent of instilling a false belief”.
Let’s start with a non-exhaustive list of other actions tied to deception that are dual use:
Omission (this is the biggest one; imagine anyone with classified information who had no deception skills)
Misdirection
Role-playing
Symbolic and metaphoric language
Influence via norms
Reframing truths
White lies
Imagine trying to put intelligence into action in the real world without deception. I’m sure most of you have had interactions with people who lack emotional intelligence or situational awareness, and have seen how strongly that mutes their intelligence in practice.
Here are a few more scenarios where I am interested to see how people would manage without deception:
Elderly family members with dementia/Alzheimer’s
Underperforming staff members
Teaching young children difficult tasks (bonus: children who also lack self-confidence)
Relationships in general (spend a day saying exactly what you think and feel at any given moment and see how that works out)
Dealing with individuals with mental health issues
Business in general (“this feature is coming soon...”)
Control methods seem to be the only realistic option when it comes to deception.
I think making an AI literally incapable of imitating a deceptive human seems likely impossible, and probably not desirable. I care about whether we could detect it actively scheming against its operators. And my post is solely focused on detection, not fixing (though obviously fixing is very important too).
Agreed. My comment was not a criticism of the post. I think the depth of deception makes interpretability nearly impossible in the sense that you are going to find deception triggers in nearly all actions as models become increasingly sophisticated.