Simon Storf comments on Building Black-box Scheming Monitors

Simon Storf 30 Jul 2025 6:45 UTC
Appreciate the feedback, and definitely keen to discuss follow-ups; we’ll drop you a message.
A few thoughts in response to your points:
- Our MVP monitor is trained on transcripts that include CoT, but we’re not committed to building “thoughts + actions” monitors over “action-only” ones. We’re keeping both options open for now and are currently running the “action-only” experiments mentioned in our “immediate priorities” section. We’re still unsure which direction would be most impactful.
- Since models might start reasoning internally soon, it may make sense to start exploring monitors now that don’t rely on seeing their thoughts explicitly. On the other hand, CoT could help us catch scheming in cases where models are able to scheme but still have to rely heavily on explicit reasoning traces to do so. It sounds like you’re placing more weight on the former, which is interesting.
- We’ve made it clearer in the blog post that although we’re currently using CoT, we’re considering both approaches.