Simon Storf comments on Building Black-box Scheming Monitors

Simon Storf 3 Aug 2025 23:40 UTC
1 point
0
Thanks very much for the feedback! Super helpful—here is our initial reaction:
1. Manual red-teaming for finding monitor failures
  1. We strongly agree. Since writing the blog post, we’ve already started with manual red-teaming.
  2. Quick follow-up: Could you expand a bit on why you’re not as excited about cross-dataset transfer experiments? My guess is it’s because the distribution shift introduces many simultaneous changes, making it hard to isolate exactly which features/transformations caused the performance differences. Is this correct?
2. Include cases where only the CoT is misaligned.
  1. we’ll try to include this if we do stick with CoT!
3. Focusing on egregious misalignment rather than reward hacking.
  1. This makes sense, we wanted to get initial signal quickly and it was difficult to get convincingly egregious and subtle misalignment cases.
  2. We’re trying to get data on more egregious and subtle scenarios. Currently SHADE Arena and manually-generated trajectories seem to be our best bets for this. But if this proves to be the biggest bottleneck we might shift our primary focus to generating high-quality datasets first.
4. Measuring FPR on harder negative data.
  1. Great suggestion, we’ll look into this.
5. Necessity of reasoning for monitoring
  1. When suggesting that classifiers may not need reasoning, did you mean:
    The internal reasoning of the monitored model (e.g. Gemini-2.5-pro’s CoT),
    The explicit reasoning step requested by our prompt (“Provide your reasoning and conclude with: ANSWER: [number]”),
    or both?
  2. Our current assumption is that some sort of reasoning in the monitor prompt helps with red-teaming, but your rationale makes sense and we’re inclined to look into removing reasoning.
6. Being careful about spreading details of classifier failures
  1. agreed! we’ll use this
- Fabien Roger 5 Aug 2025 2:23 UTC
  2 points
  0
  Parent
  Could you expand a bit on why you’re not as excited about cross-dataset transfer experiments? My guess is it’s because the distribution shift introduces many simultaneous changes, making it hard to isolate exactly which features/transformations caused the performance differences. Is this correct?
  Roughly. I expect this to depend a lot on which datasets and dataset sizes you are using, so I don’t expect to learn sth super generalizable except if you have very strong or weak results.
  When suggesting that classifiers may not need reasoning,
  Any sampled token before the final answer (scratchpad or user facing reasoning).