Nathan Helm-Burger comments on Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

Nathan Helm-Burger 4 Jun 2025 1:25 UTC
2 points
Note: these are good points, and they also occurred to us over the course of our 3-month project. We didn’t have time to address them, but future work definitely should.

Key ToDo items:
1. Better, more realistic side tasks. (This will get easier once the models get better. Getting the Red Team models to accomplish both the main task and a realistic side task in a stealthy way was quite challenging.)
2. Strategic side task selection by the Red Team (defecting only when there’s actually a strategically good opportunity to do so; forcing defections in dumb scenarios makes things too easy for the Blue Team).
3. Fine-tuning the monitor (we almost had time, but not quite!)