Nice work, it’s great to have more ways to evaluate monitoring systems!
A few points for context:
The most obvious application for misuse monitoring is to run it at deployment on every query to block harmful ones. Since deployments are so large scale, the expense of the monitoring system matters a ton. If you use a model to monitor itself, that is roughly doubling the compute cost of your deployment. It’s not that people don’t realize that spending more compute works better; it’s that this is not an expense worth paying at current margins.
You mention this, but say it should be considered because “frontier models are relatively inexpensive”. I don’t see why the cost per query matters; it’s the total cost that matters, and that is large.
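To make the total-versus-per-query point concrete, here is a toy calculation; the query volume and per-query cost are made-up placeholders, not real deployment numbers:

```python
# Toy illustration of why total cost, not per-query cost, is what matters.
# Both numbers below are assumptions for illustration only.

queries_per_day = 1_000_000_000   # assumed deployment scale
cost_per_query = 0.002            # assumed serving cost in dollars (looks "inexpensive")

base_cost = queries_per_day * cost_per_query
self_monitor_cost = base_cost     # same model monitoring itself roughly doubles compute

print(f"Base serving cost per day:     ${base_cost:,.0f}")
print(f"Added self-monitoring per day: ${self_monitor_cost:,.0f}")
# Even though $0.002/query sounds cheap, the added total is on the order
# of the entire serving bill.
```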
You also suggest that robustness is important for “safety-critical contexts”. I agree with this, and would be on board if the claim were that users should apply strong monitoring for safety-critical use cases, but elsewhere you seem to be suggesting that this should be applied to very broad and general deployments, which do not seem to me to be “safety-critical”.
Prompted LLMs are often too conservative about what counts as a harmful request, coming up with relatively stretched ways in which some query could cause some small harm. Looking at the examples of benign prompts in the appendix, I think they are too benign to notice this problem (if it exists). You could look at performance on OverRefusalBench to assess this.
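One cheap way to quantify this is to measure the monitor’s flag rate on borderline-but-benign prompts. A minimal sketch, assuming a hypothetical monitor_flags wrapper around whatever prompted monitor is being tested (the prompt set could be OverRefusalBench or any collection of borderline-but-benign queries):

```python
# Sketch: false-positive rate of a monitor on benign-but-borderline prompts.
# `monitor_flags(prompt) -> bool` is a hypothetical wrapper around the
# prompted-LLM monitor under test; it returns True when the monitor would
# flag/block the prompt.

from typing import Callable, Iterable

def false_positive_rate(benign_prompts: Iterable[str],
                        monitor_flags: Callable[[str], bool]) -> float:
    prompts = list(benign_prompts)
    flagged = sum(monitor_flags(p) for p in prompts)
    return flagged / len(prompts)
```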
Note that blocking / refusing benign user requests is a really terrible experience for users. Basically any misuse detection system is going to flag things often enough that this is a very noticeable effect (like even at 99% accuracy people will be annoyed). So this is a pretty important axis to look at.
One might wonder how this can be the case when often AIs will refuse fairly obviously benign requests—my sense is that usually in those cases, it’s because of a detection system that was targeted at something much more serious (e.g. CSAM) that is over-triggering, but where the companies accept that cost because of the seriousness of the thing they’re detecting.
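For a rough sense of why even “99% accuracy” is very noticeable, a toy base-rate calculation (the usage numbers are assumptions for illustration):

```python
# Toy base-rate calculation for the "even at 99% accuracy people will be
# annoyed" point. Usage numbers are illustrative assumptions.

false_positive_rate = 0.01        # monitor wrongly flags 1% of benign queries
queries_per_user_per_week = 100   # assumed volume for a moderately active user

expected_blocked = false_positive_rate * queries_per_user_per_week
print(f"Expected wrongly blocked queries per user per week: {expected_blocked:.1f}")
# Roughly one wrongly refused request per user per week is enough to feel broken.
```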
Thanks a lot!

> it’s the total cost that matters, and that is large
We think a relatively inexpensive setup for day-to-day usage would be using Sonnet to monitor Opus, or Gemini 2.5 Flash to monitor Gemini 2.5 Pro. This would probably add only around 10% overhead. But we have not run this exact experiment; it would be follow-up work.
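For intuition on where a figure like +10% could come from, a toy calculation; the 10:1 cost ratio between the main model and the monitor is an assumption for illustration, not a measured price:

```python
# Rough sketch of the overhead estimate. The 10:1 cost ratio between the
# main model and the smaller monitor is an illustrative assumption.

main_model_cost_per_query = 1.0   # normalize the main model's cost to 1
monitor_cost_per_query = 0.1      # assume the smaller monitor is ~10x cheaper

overhead = monitor_cost_per_query / main_model_cost_per_query
print(f"Monitoring overhead: +{overhead:.0%}")   # -> +10%
```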