I agree with most of this, thanks for saying it. I’ve been dismayed for the last several years by the continued, unreasonable level of emphasis on interpretability techniques as a safety strategy.
My main disagreement is that you place more emphasis than I would on chain-of-thought monitoring compared to other AI control methods. CoT monitoring seems like a great control method when available, but I think it’s reasonably likely that it won’t work on the AIs that we’d want to control, because those models will have access to some kind of “neuralese” that allows them to reason in ways we can’t observe. This is why I mostly focus on control measures other than CoT monitoring. (All of our control research to date has basically been assuming that CoT monitoring is unavailable as a strategy.)
Another note is that you might have other goals than finding deceptive AI, e.g. you might want to be able to convince other people that you’ve found deceptive AI (which I’m somewhat skeptical you’ll be able to do with non-behavioral methods), or you might want to be able to safely deploy known-scheming models. Interp doesn’t obviously help much with those, which makes it a worse target for research effort.
> CoT monitoring seems like a great control method when available, but I think it’s reasonably likely that it won’t work on the AIs that we’d want to control, because those models will have access to some kind of “neuralese” that allows them to reason in ways we can’t observe.
Small point, but I think that “neuralese” is likely to still be somewhat interpretable.
1. We might make progress on regular LLM interpretability, in which case those lessons might carry over.
2. We might pressure LLM systems to only use CoT neuralese that we can inspect.
There’s also a question of how much future LLM agents will rely on CoT versus more conventional storage formats. For example, I believe many agents today save information into knowledge bases of various kinds in plain English. It’s far easier for engineers working with complex LLM workflows to keep the intermediate formats in languages they can read.
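For intuition, here’s a minimal sketch of the kind of plain-English knowledge base I mean; the `KnowledgeBase` class, file format, and helper names are invented for illustration rather than taken from any real agent framework.

```python
import json
import time
from pathlib import Path


class KnowledgeBase:
    """Hypothetical store where an agent keeps intermediate notes in plain English,
    so a human (or an automated monitor) can read and grep them directly."""

    def __init__(self, path: str = "agent_notes.jsonl"):
        self.path = Path(path)

    def add_note(self, topic: str, text: str) -> None:
        # Append a timestamped, human-readable entry; no opaque embeddings stored here.
        entry = {"ts": time.time(), "topic": topic, "text": text}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

    def search(self, keyword: str) -> list[dict]:
        # Naive keyword search; the point is that the stored content itself is inspectable English.
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            entries = [json.loads(line) for line in f]
        return [e for e in entries if keyword.lower() in e["text"].lower()]


# Example: the agent records its working notes in English rather than latent vectors.
kb = KnowledgeBase()
kb.add_note("planning", "User asked for a cake recipe; I will look up baking times next.")
print(kb.search("cake"))
```

Nothing stops an agent from also caching opaque embeddings, but keeping the canonical record in English is what makes human spot-checks cheap.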
All that said, personally, I’m excited about a multi-layered approach, especially now, while things still seem fairly early.
> CoT monitoring seems like a great control method when available
As I posted in a top-level comment, I’m not convinced that even success would be a good outcome. I think that even if we get this working 99.999% reliably, we still end up delegating parts of the oversight in ways that have other alignment failure modes, such as via hyper-introspection.
> models will have access to some kind of “neuralese” that allows them to reason in ways we can’t observe
Only modest confidence, but while there’s an observability gap between neuralese and CoT monitoring, I suspect it’s smaller than the gap between reasoning traces that haven’t been trained against oversight and reasoning traces that have.
Makes sense! As we’ve seen, while Chain-of-Thought (CoT) reasoning is highly interpretable, it doesn’t necessarily reflect the model’s true internal reasoning. Interpretability techniques might help us decode “neuralese” enough to verify whether a CoT explanation is faithful to the model’s actual reasoning process. This could also help detect covert communication between cooperating, unaligned agents. As Nanda pointed out, though, current methods are noisy and can be gamed by the system (for example, by hiding its intentions through more complex forms of superposition).
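As a toy illustration of that kind of cross-check (a sketch, not a working method): suppose we had per-concept linear probes over a model’s activations. We could then compare which concepts the probes detect against which ones the CoT actually verbalizes. The probe dictionary, the threshold, and the assumption that a single linear direction captures a concept are all placeholders.

```python
import numpy as np


def detected_concepts(hidden_state: np.ndarray,
                      concept_probes: dict[str, np.ndarray],
                      threshold: float = 0.8) -> list[str]:
    """Concepts whose (hypothetical) linear probe fires on this activation vector."""
    return [name for name, probe in concept_probes.items()
            if float(probe @ hidden_state) > threshold]


def unverbalized_concepts(hidden_state: np.ndarray,
                          concept_probes: dict[str, np.ndarray],
                          cot_text: str,
                          threshold: float = 0.8) -> list[str]:
    """Crude faithfulness signal: concepts the probes detect internally
    but that the chain-of-thought never mentions."""
    cot_lower = cot_text.lower()
    return [c for c in detected_concepts(hidden_state, concept_probes, threshold)
            if c.lower() not in cot_lower]


# Toy usage with made-up probes and a random "activation"; real probes would be
# trained directions, and real thresholds would need calibration.
rng = np.random.default_rng(0)
d_model = 16
probes = {"baking": rng.normal(size=d_model),
          "bioengineering": rng.normal(size=d_model)}
h = rng.normal(size=d_model)
print(unverbalized_concepts(h, probes, cot_text="First, preheat the oven for the cake."))
```

In practice the hard part is getting probes reliable enough to trust, which is exactly where the noisiness Nanda describes bites.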
I agree with both of you that interpretability is a powerful tool, albeit one with limitations. However, given the uncertainty around its scalability, our best strategy may be to use it to help ensure incremental alignment. That, in turn, could help accelerate progress in both interpretability and control research.
> Another note is that you might have other goals than finding deceptive AI, e.g. you might want to be able to convince other people that you’ve found deceptive AI (which I’m somewhat skeptical you’ll be able to do with non-behavioral methods), or you might want to be able to safely deploy known-scheming models.

It also seems that spotting traces of suspicious reasoning through mech interpretability would be useful both for convincing people and in deployment contexts. For example, if a model is generating a cake recipe, it shouldn’t be reasoning about complex bioengineering concepts. If such concepts are present, interpretability methods might flag them as potential signs of misalignment. The same mechanism could serve as a red flag in production, identifying when a model is acting in unexpected or unsafe ways, as part of the layered approach Nanda mentioned.
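A minimal sketch of what that production-time red flag could look like, assuming some upstream interpretability pass has already produced per-concept scores for a generation; the task name, allowlist, and threshold here are invented for illustration.

```python
# Hypothetical allowlists: concepts we would expect probes to fire on for each task type.
TASK_ALLOWLISTS = {
    "recipe_generation": {"baking", "ingredients", "cooking"},
}


def check_generation(task: str, concept_scores: dict[str, float],
                     threshold: float = 0.8) -> list[str]:
    """Return concepts that scored above threshold but aren't expected for this task."""
    allowed = TASK_ALLOWLISTS.get(task, set())
    return [c for c, score in concept_scores.items()
            if score > threshold and c not in allowed]


# Example: a cake-recipe request where a "bioengineering" feature lights up unexpectedly.
scores = {"baking": 0.95, "bioengineering": 0.91, "pets": 0.10}
flags = check_generation("recipe_generation", scores)
if flags:
    print(f"ALERT: escalate for human review, unexpected concepts: {flags}")
```

The design choice here is just to keep the policy layer (what counts as off-topic for a task) separate from whatever interpretability layer produces the scores.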