Nathan Helm-Burger comments on Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

Nathan Helm-Burger 3 Jun 2025 1:06 UTC
2 points
That makes more sense! Yes, I think that’s possibly true, unless the malicious model gets good at jailbreaking, in which case even fine-tuning the monitor to mostly ignore the CoT would be risky.

And if the monitor completely ignores the CoT, then it loses the benefits of seeing it. The CoT can be important for detecting subtle sabotage by a stealthy malicious model. For example, the fine-tuned malicious model had a 0% chance of getting caught on subtle tasks when the monitor was not using the CoT. With the CoT, the catch rate went up to about 3%. A small but important difference.