models will have access to some kind of “neuralese” that allows them to reason in ways we can’t observe
Only modest confidence here, but while there's an observability gap between neuralese and CoT monitoring, I suspect it's smaller than the gap between reasoning traces that haven't been trained against oversight and those that have.