models will have access to some kind of “neuralese” that allows them to reason in ways we can’t observe
Only modest confidence here, but while there's an observability gap between neuralese and CoT monitoring, I suspect it's smaller than the gap between reasoning traces that haven't been trained against oversight and those that have.