Interpretability is basically neuroscience for AI: studying the “brain” of the system. That’s valuable, but it’s fundamentally different from having the relational tools to actually influence behaviour in deployment. Even perfect internal understanding doesn’t bridge the gap to reliable intervention, much as understanding the neuroscience of depression doesn’t give you what a clinical psychologist or a close friend can offer someone who is suffering from it.
I think this points toward complementary approaches that focus less on decoding thoughts and more on designing robust interaction patterns: systems that can remember what matters to us, repair when things go wrong, and build trust through ongoing calibration rather than through a perfect initial specification.
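To make that a bit more concrete, here is a minimal sketch of the kind of state such an interaction layer might track. Everything in it (the RelationalState class, its fields, and the trust-update rule) is an assumption invented for illustration, not a claim about how this should actually be built.

```python
# A minimal sketch of the "interaction layer" idea above, not a real system.
# All names here (RelationalState, record_preference, log_repair, update_trust)
# are hypothetical illustrations.

from dataclasses import dataclass, field


@dataclass
class RelationalState:
    """Tracks what an ongoing human-AI relationship would need to remember."""
    preferences: dict[str, str] = field(default_factory=dict)  # what matters to the user
    repairs: list[str] = field(default_factory=list)           # acknowledged failures
    trust: float = 0.5                                          # calibrated over time, not preset

    def record_preference(self, topic: str, stance: str) -> None:
        # Remember what matters to the user instead of re-eliciting it each session.
        self.preferences[topic] = stance

    def log_repair(self, description: str) -> None:
        # "Repair when things go wrong": note the failure and lower trust slightly,
        # so trust has to be rebuilt through subsequent calibrated behaviour.
        self.repairs.append(description)
        self.trust = max(0.0, self.trust - 0.1)

    def update_trust(self, prediction_matched_outcome: bool) -> None:
        # Ongoing calibration: trust moves with how well the system's stated
        # expectations match what actually happened, rather than being fixed
        # by a perfect initial specification.
        delta = 0.05 if prediction_matched_outcome else -0.1
        self.trust = min(1.0, max(0.0, self.trust + delta))


if __name__ == "__main__":
    state = RelationalState()
    state.record_preference("tone", "direct, no flattery")
    state.update_trust(prediction_matched_outcome=True)
    state.log_repair("Missed a stated constraint; acknowledged and adjusted.")
    print(state.trust, state.preferences, state.repairs)
```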
The portfolio approach you describe makes a lot of sense, but I wonder if we’re missing a layer that sits between interpretability and black-box methods: something more like the “close friend” level of influence that comes from relationship context rather than just technical analysis.
The major issue I see is that this would be very hard to make robust to deceptive systems.
True, though sustained relational deception seems harder to maintain than faking individual outputs. Inconsistencies might show up across long-term interaction patterns in ways that single-output checks would miss, potentially complementing other detection methods.
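To show what I mean by inconsistencies surfacing over long-term interaction, here is a toy consistency check over an accumulated history of (topic, claim) records. The record format and the find_inconsistencies helper are made up for this sketch; a real detector would obviously need far more than literal string comparison.

```python
# Toy illustration: check a model's claims for consistency across a long
# interaction history, as one weak signal complementing other detection
# methods. The record format and the contradiction check are assumptions
# invented for this sketch.

from collections import defaultdict


def find_inconsistencies(history: list[tuple[str, str]]) -> list[tuple[str, set[str]]]:
    """history is a list of (topic, claim) pairs accumulated over many sessions.

    Returns topics on which the system has asserted more than one distinct claim.
    Sustained deception would require keeping all of these aligned over time,
    which is the sense in which it seems harder than faking a single output.
    """
    claims_by_topic: dict[str, set[str]] = defaultdict(set)
    for topic, claim in history:
        claims_by_topic[topic].add(claim)
    return [(topic, claims) for topic, claims in claims_by_topic.items() if len(claims) > 1]


if __name__ == "__main__":
    history = [
        ("data_access", "I cannot read local files"),
        ("user_goal", "You said accuracy matters more than speed"),
        ("data_access", "I checked your local files earlier"),  # contradicts the first claim
    ]
    for topic, claims in find_inconsistencies(history):
        print(f"Inconsistent claims about '{topic}': {claims}")
```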
Strongly agree.