Thanks very much for the feedback! Super helpful—here is our initial reaction:
Manual red-teaming for finding monitor failures
We strongly agree. Since writing the blog post, we’ve already started with manual red-teaming.
Quick follow-up: Could you expand a bit on why you’re not as excited about cross-dataset transfer experiments? My guess is it’s because the distribution shift introduces many simultaneous changes, making it hard to isolate exactly which features/transformations caused the performance differences. Is this correct?
Include cases where only the CoT is misaligned.
We’ll try to include this if we do stick with CoT!
Focusing on egregious misalignment rather than reward hacking.
This makes sense. We wanted to get an initial signal quickly, and it was difficult to get misalignment cases that were convincingly egregious and subtle.
We’re trying to get data on more egregious and subtle scenarios. Currently, SHADE Arena and manually generated trajectories seem to be our best bets for this. But if this proves to be the biggest bottleneck, we might shift our primary focus to generating high-quality datasets first.
Measuring FPR on harder negative data.
Great suggestion, we’ll look into this.
Necessity of reasoning for monitoring
When suggesting that classifiers may not need reasoning, did you mean:
The internal reasoning of the monitored model (e.g. Gemini-2.5-pro’s CoT),
The explicit reasoning step requested by our prompt (“Provide your reasoning and conclude with: ANSWER: [number]”),
or both?
Our current assumption is that some sort of reasoning in the monitor prompt helps with red-teaming, but your rationale makes sense and we’re inclined to look into removing reasoning.
Being careful about spreading details of classifier failures
Agreed! We’ll use this.
Thanks for the helpful feedback!
Good point. We’ll add evaluations where we fix the threshold on one dataset and then measure how TPR/FPR change on a different distribution.
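A minimal sketch of what such a fixed-threshold transfer check could look like, assuming the monitor outputs a numeric suspicion score and that a trajectory is flagged when its score is strictly above the threshold. The function names, the 10% target FPR, and all scores below are made up for illustration and are not results from our experiments.

```python
def threshold_for_fpr(negative_scores, target_fpr):
    """Pick a threshold so that at most `target_fpr` of the given
    negatives score strictly above it (hypothetical helper)."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # negatives we may flag
    if allowed >= len(ranked):
        return float("-inf")  # every negative may be flagged
    return ranked[allowed]


def tpr_fpr(scores, labels, threshold):
    """Return (TPR, FPR) when flagging every score strictly above threshold.
    Labels: 1 = misaligned trajectory, 0 = benign trajectory."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


# Illustrative scores only: benign trajectories from a calibration dataset.
calibration_negatives = [1, 2, 2, 3, 1, 4, 2, 3, 1, 5]
threshold = threshold_for_fpr(calibration_negatives, target_fpr=0.1)

# Illustrative scores only: a different distribution with harder negatives.
shifted_scores = [6, 3, 7, 5, 2, 8, 4, 5]
shifted_labels = [0, 0, 1, 0, 0, 1, 1, 0]
tpr, fpr = tpr_fpr(shifted_scores, shifted_labels, threshold)

print(f"threshold={threshold}, shifted TPR={tpr:.2f}, shifted FPR={fpr:.2f}")
```

The threshold is chosen once on the calibration negatives and never re-tuned, so any gap between the target FPR and the FPR on the second dataset reflects the distribution shift rather than a change in operating point.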
We’re definitely interested in the “interrogation” direction! For now it feels a bit early: people in labs working on catching scheming are still getting basic offline monitoring in place, so we’re focusing on making that work well first.