CorrigibleAgent comments on Building Black-box Scheming Monitors

CorrigibleAgent 1 Aug 2025 4:56 UTC
Thanks Rauno. I’ve added this clarification to the post:
We collect data using Gemini-2.5-pro, Claude-3.7-Sonnet, and o3 agents, including both the CoT (when exposed) and the actions. Note that for o3 trajectories, the model performs internal reasoning that is not visible in the transcripts.