Nathan Helm-Burger comments on Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

Nathan Helm-Burger 3 Jun 2025 22:00 UTC
2 points
Yes, I quite agree the next step is to try fine-tuning a monitor model. We did try quite a lot of prompting strategies, of which we only show a few.

We would’ve done it in this paper, but we only had 3 months in the LASR fellowship to complete it. Even two more weeks would have been enough to run some monitor training experiments!

We also ran up against budget constraints on compute, so big thanks to LASR for the support on this!

This is just a first step in exploring the space of CoT monitoring. I wouldn’t recommend that anybody get either optimistic or pessimistic at this point, just curious enough to ask the next obvious questions!