Try to find failure cases of classifiers with more manual red-teaming. In my own work, I learned way more by building transformations that made classifier fail more often and probing for edge cases than by building bigger datasets and fine-tuning on them
I will learn more from your work if you important find cases where the model surprisingly fails >70% of the time and therefore needs to be explicitly prompted / fine-tuned than if you exhibit a broad distribution where you can push the recall from 95% to 99% (and I also expect the latter to be way more painful because of label noise issues)
Knowing which transformations make your classifier fail sets you up for doing actually useful classifier training—Constitutional-classifier-style. I don’t expect cross-dataset transfer experiments to be very useful, except if the performance increase is surprisingly large
To the extent you are using CoT classifiers, please include situations where only the CoT is misaligned (e.g. where the model did not decide to do sth bad, or only decided on doing sth very slightly bad such as light sandbagging). I also want to catch those, and I have no clue how good “default” classifiers are at this.
Focus on extremely egregious but hard to catch instances of misalignment
Borderline cases such as reward hacking are not that interesting to me—people know these things exist, I am not very interested in catching more of them.
On the other hand, I really don’t want to miss the first situations of very clear misalignment. And classifiers might (or might not) be bad at catching those—in particular if the reasons why the model goes for misalignment is a good-sounding elaborate justification (alignment-faking / blackmail demo style) or if the situation is particularly OOD and weird (which might be the case for the first instances of scary misalignment).
More ambitiously, I don’t want to miss a first attempt, even if that first attempt tried to be subtle or used a bunch of implicit justifications or used prompt injections / jailbreaks.
When measuring FPR, use some harder negative data if possible. Some classic difficulties include:
Data points where the user asks for sth a bit sus (very classic in alignment science / red-teaming work)
Long trajectories where the user is annoyed at AIs being stupid or AIs hitting some sandbox / classifier limitations (might be less common in the future, unsure how important)
I bet you will need to adapt your prompt. Are there things a model could do to exploit the differences between the old and new prompt?
Do you really need the classifier to use reasoning? I bet it’s often not, and it makes comparisons between fine-tuned and prompted classifiers easier + it’s more realistic for low-latency classifiers.
Thanks very much for the feedback! Super helpful—here is our initial reaction:
Manual red-teaming for finding monitor failures
We strongly agree. Since writing the blog post, we’ve already started with manual red-teaming.
Quick follow-up: Could you expand a bit on why you’re not as excited about cross-dataset transfer experiments? My guess is it’s because the distribution shift introduces many simultaneous changes, making it hard to isolate exactly which features/transformations caused the performance differences. Is this correct?
Include cases where only the CoT is misaligned.
we’ll try to include this if we do stick with CoT!
Focusing on egregious misalignment rather than reward hacking.
This makes sense, we wanted to get initial signal quickly and it was difficult to get convincingly egregious and subtle misalignment cases.
We’re trying to get data on more egregious and subtle scenarios. Currently SHADE Arena and manually-generated trajectories seem to be our best bets for this. But if this proves to be the biggest bottleneck we might shift our primary focus to generating high-quality datasets first.
Measuring FPR on harder negative data.
Great suggestion, we’ll look into this.
Necessity of reasoning for monitoring
When suggesting that classifiers may not need reasoning, did you mean:
The internal reasoning of the monitored model (e.g. Gemini-2.5-pro’s CoT),
The explicit reasoning step requested by our prompt (“Provide your reasoning and conclude with: ANSWER: [number]”),
or both?
Our current assumption is that some sort of reasoning in the monitor prompt helps with red-teaming, but your rationale makes sense and we’re inclined to look into removing reasoning.
Being careful about spreading details of classifier failures
Could you expand a bit on why you’re not as excited about cross-dataset transfer experiments? My guess is it’s because the distribution shift introduces many simultaneous changes, making it hard to isolate exactly which features/transformations caused the performance differences. Is this correct?
Roughly. I expect this to depend a lot on which datasets and dataset sizes you are using, so I don’t expect to learn sth super generalizable except if you have very strong or weak results.
When suggesting that classifiers may not need reasoning,
Any sampled token before the final answer (scratchpad or user facing reasoning).
Cool work, excited about it!
Some suggestions for next steps:
Try to find failure cases of classifiers with more manual red-teaming. In my own work, I learned way more by building transformations that made classifier fail more often and probing for edge cases than by building bigger datasets and fine-tuning on them
I will learn more from your work if you important find cases where the model surprisingly fails >70% of the time and therefore needs to be explicitly prompted / fine-tuned than if you exhibit a broad distribution where you can push the recall from 95% to 99% (and I also expect the latter to be way more painful because of label noise issues)
Knowing which transformations make your classifier fail sets you up for doing actually useful classifier training—Constitutional-classifier-style. I don’t expect cross-dataset transfer experiments to be very useful, except if the performance increase is surprisingly large
To the extent you are using CoT classifiers, please include situations where only the CoT is misaligned (e.g. where the model did not decide to do sth bad, or only decided on doing sth very slightly bad such as light sandbagging). I also want to catch those, and I have no clue how good “default” classifiers are at this.
Focus on extremely egregious but hard to catch instances of misalignment
Borderline cases such as reward hacking are not that interesting to me—people know these things exist, I am not very interested in catching more of them.
On the other hand, I really don’t want to miss the first situations of very clear misalignment. And classifiers might (or might not) be bad at catching those—in particular if the reasons why the model goes for misalignment is a good-sounding elaborate justification (alignment-faking / blackmail demo style) or if the situation is particularly OOD and weird (which might be the case for the first instances of scary misalignment).
More ambitiously, I don’t want to miss a first attempt, even if that first attempt tried to be subtle or used a bunch of implicit justifications or used prompt injections / jailbreaks.
When measuring FPR, use some harder negative data if possible. Some classic difficulties include:
Data points where the user asks for sth a bit sus (very classic in alignment science / red-teaming work)
Long trajectories where the user is annoyed at AIs being stupid or AIs hitting some sandbox / classifier limitations (might be less common in the future, unsure how important)
I bet you will need to adapt your prompt. Are there things a model could do to exploit the differences between the old and new prompt?
Do you really need the classifier to use reasoning? I bet it’s often not, and it makes comparisons between fine-tuned and prompted classifiers easier + it’s more realistic for low-latency classifiers.
For the most concrete examples of patterns that fool default prompted classifiers, I’d prefer if you were somewhat careful with spreading information in the pretraining corpus as not everyone might use fine-tuned classifiers, maybe use precautions like canary strings and https://www.lesswrong.com/posts/DA3vbSEfABLdoCt59/we-built-a-tool-to-protect-your-dataset-from-simple-scrapers
Thanks very much for the feedback! Super helpful—here is our initial reaction:
Manual red-teaming for finding monitor failures
We strongly agree. Since writing the blog post, we’ve already started with manual red-teaming.
Quick follow-up: Could you expand a bit on why you’re not as excited about cross-dataset transfer experiments? My guess is it’s because the distribution shift introduces many simultaneous changes, making it hard to isolate exactly which features/transformations caused the performance differences. Is this correct?
Include cases where only the CoT is misaligned.
we’ll try to include this if we do stick with CoT!
Focusing on egregious misalignment rather than reward hacking.
This makes sense, we wanted to get initial signal quickly and it was difficult to get convincingly egregious and subtle misalignment cases.
We’re trying to get data on more egregious and subtle scenarios. Currently SHADE Arena and manually-generated trajectories seem to be our best bets for this. But if this proves to be the biggest bottleneck we might shift our primary focus to generating high-quality datasets first.
Measuring FPR on harder negative data.
Great suggestion, we’ll look into this.
Necessity of reasoning for monitoring
When suggesting that classifiers may not need reasoning, did you mean:
The internal reasoning of the monitored model (e.g. Gemini-2.5-pro’s CoT),
The explicit reasoning step requested by our prompt (“Provide your reasoning and conclude with: ANSWER: [number]”),
or both?
Our current assumption is that some sort of reasoning in the monitor prompt helps with red-teaming, but your rationale makes sense and we’re inclined to look into removing reasoning.
Being careful about spreading details of classifier failures
agreed! we’ll use this
Roughly. I expect this to depend a lot on which datasets and dataset sizes you are using, so I don’t expect to learn sth super generalizable except if you have very strong or weak results.
Any sampled token before the final answer (scratchpad or user facing reasoning).