Charbel-Raphaël comments on Interpretability Will Not Reliably Find Deceptive AI

Charbel-Raphaël 8 May 2025 8:47 UTC
LW: 4 AF: 2
0
AF
Thanks a lot for writing this, this is an important consideration, and it would be sweet if Anthropic updated accordingly.
Some remarks:
- I’m still not convinced that Deceptive AI following scheming is the main risk compared to other risks (gradual disempowerment, concentration of power & value Lock in, a nice list of other risks from John).
- “Should we give up on interpretability? No!”—I think this is at least a case for reducing the focus a bit, and more diversification of approaches
- On the theories of impacts suggested:
  - “A Layer of Swiss Cheese”—why not! This can make sense in DeepMind’s plan, that was really good by the way.
  - “Enhancing Black-Box Evaluations”—I think a better theory is interp to complement AI Control techniques. Example: Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals).
    Maybe Simple probes can catch sleeper agents \ Anthropic could also be interesting, in the sense that the probe seems to generalize surprisingly well (I would really like to know if this generalizes to a model that was not trained to be harmful in the first place).
  - “Debugging mysterious behaviour”—Might be interesting, might help marginally to get better understanding, but this is not very central for me.