Hey Neel! I just wanted to say thank you for writing this. It’s honestly one of the most grounded and helpful takes I’ve seen in a while. I really appreciate your pragmatism, and the way you frame interpretability as a useful tool that still matters for early transformative systems (and real-world auditing!).
Quick question: do you plan to share more resources or thoughts on how interpretability can support black-box auditing and benchmarking for safety evaluations? I’m thinking a lot about this in the context of the General-Purpose AI Codes of Practice and how we can build technically grounded evaluations into policy frameworks.
Thanks again!
Thanks, that’s very kind!
I don’t have any current plans, sorry