Sorry I miscommunicated—when I said “this problem would go away” I didn’t mean “the problem of AIs being unfaithful in their CoT and fooling monitors.” I don’t expect that problem to go away nearly so easily!
All I meant was the problem of the monitor which can see the CoT performing worse than the monitor which can only see the actions. If the monitor which can see both the actions and the CoT were trained, it would eventually learn to just ignore the CoT and focus on the actions, and thereby perform at least as well as the monitor which can only see the actions.
That makes more sense! Yes, I think that’s possibly true… Unless the malicious model gets good at jailbreaking, in which case even fine-tuning the monitor to mostly ignore the CoT would be risky.
And if the monitor completely ignores the CoT, then it loses out on the benefits of seeing the CoT. The CoT can be importantly helpful for detecting subtle sabotage by a stealthy malicious model. For example, the fine-tuned malicious model had a 0% chance of getting caught on subtle tasks by the action-only monitor. With CoT, that went up to roughly a 3% catch rate. Small but important difference.