Florian_Dietz comments on Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

Florian_Dietz 30 Jul 2025 21:23 UTC
4 points
1
This is an interesting finding. I wonder how often it happened that the Assistant straight up lies in the COT. Does this happen often enough that it explains a large chunk of the reduced performance of “COT and Action Monitor” on non-subtle side tasks? It’s funny to think about: The Assistant is lying so audaciously that the monitor just takes it at its word and doesn’t double-check the actual output.

This finding suggests that there might be very low-hanging fruits we are missing: How many of the most problematic cases in this paper could be solved if the Monitor’s prompt just included a simple hint that the COT can lie and that it needs to double-check the actual outputs?

That experiment should be simple to run, and if the number is high I think this hotfix would be worth including in monitor models everywhere.