Sodium comments on Interpretability Will Not Reliably Find Deceptive AI

Sodium 5 May 2025 1:54 UTC
1 point
0
AF
Re: Black box methods like “asking the model if it has hidden goals.”

I’m worried that these methods seem very powerful (e.g., Evans group’s Tell me about yourself, the pre-fill black box methods in Auditing language models) because the text output of the model in those papers haven’t undergone a lot of optimization pressure.

Outputs from real world models might undergo lots of scrutiny/optimization pressure^[1] so that the model appears to be a “friendly chatbot.” AI companies put much more care into crafting those personas than model organisms researchers would, and thus the AI could learn to “say nice things” much better.

So it’s possible that model internals will much more faithful relative to model outputs in real world settings compared to in academic settings.
1. ^
  or maybe they’ll just update GPT-4o to be a total sycophant and ship it to hundreds of millions people. Honestly hard to say nowadays.
- Buck 5 May 2025 5:12 UTC
  LW: 11 AF: 10
  6
  AF Parent
  This is why AI control research usually assumes that none of the methods you described work, and relies on black-box properties that are more robust to this kind of optimization pressure (mostly “the AI can’t do X”).
- Neel Nanda 5 May 2025 8:04 UTC
  LW: 3 AF: 2
  0
  AF Parent
  Agreed in principle, my goal in the section on black box stuff is to lay out ideas that I think could work in spite of this optimisation pressure