Thanks for elaborating, this makes me somewhat less concerned about the label bias problem!
We’re using o3, Gemini 2.5 Pro, and Claude 3.7 Sonnet.
In this case, I recommend being very clear in the post and in a potential future paper that, at least for o3, the monitor doesn’t have access to the entire CoT. This is something I wish existing work on CoT monitoring was clearer about. For example, Arnav et al. only use models for which the full CoT is visible and give the monitor access to the full CoT, including the thinking tokens of reasoning models. Kutasov et al. also give the monitor the full CoT but, afaik, don’t have thinking mode on when generating trajectories with reasoning models such as 3.7 Sonnet. Ideally, I’d like all CoT monitoring evals to use the same practices regarding thinking models, to make it easier to compare results across different benchmarks.
Thanks Rauno. I’ve added this clarification to the post:
We collect data using Gemini-2.5-pro, Claude-3.7-Sonnet, and o3 agents, including both the CoT (when exposed) and the actions. Note that for o3 trajectories, the model performs internal reasoning that is not visible in the transcripts.