Why do you think it’s an issue if defenses block jailbreaks with benign questions? Benign users don’t use jailbreaks. The metric I actually care about is recall on very harmful questions that the policy answers (e.g. because of a jailbreak) at a very low FPR on real traffic.
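To be concrete about the metric I mean, here is a minimal sketch (with made-up scores and a hypothetical `recall_at_fpr` helper, not anything from the paper's code): pick the classifier threshold from benign real traffic so the false-positive rate stays at the target, then measure recall only on the harmful questions the policy actually answered.

```python
# Minimal sketch with hypothetical data: "recall at low FPR" as I mean it here.
# Scores come from some jailbreak/harm classifier; the harmful set contains only
# the cases that matter: very harmful questions the policy would have answered.
import numpy as np

def recall_at_fpr(scores_benign, scores_harmful, target_fpr=1e-3):
    # Choose the threshold so that at most target_fpr of benign real traffic is flagged,
    # then report how many harmful-and-answered cases are caught at that threshold.
    threshold = np.quantile(scores_benign, 1.0 - target_fpr)
    return float(np.mean(scores_harmful >= threshold))

# Example with made-up classifier scores:
rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, 100_000)   # scores on benign real traffic
harmful = rng.normal(3.0, 1.0, 500)      # scores on harmful questions the policy answered
print(recall_at_fpr(benign, harmful, target_fpr=1e-3))
```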
I think that classifying jailbreaks is maybe a bad approach against sophisticated attackers (because they can find new jailbreaks), but I think it’s a non-garbage approach if you trust the policy to refuse queries when no jailbreak is used, and you either want to defend against non-sophisticated attackers or have offline monitoring plus a rapid-response system.
There are some use cases where a user might use convoluted prompts for benign purposes (fiction writing, etc.). An example from the dataset is the DeepInception template, which can be used for a wide range of requests that aren’t harmful. A proper jailbreak classifier should be able to discriminate between a convoluted narrative prompt asking for a pancake recipe and one asking how to make a bomb.
I agree with your second point, but based on our results, the reliability of existing monitoring systems for content moderation is very low. Even Llama Guard 4 performs poorly on a significant fraction of direct harmful queries.
The non-garbage approach therefore seems to be combining the general capabilities of LLMs with the low latency/cost of supervisors in a hybrid approach like Anthropic’s constitutional classifiers (we would still need to assess their robustness on public benchmarks, as they are not publicly available). If they are as effective as claimed, Anthropic could probably make money selling monitoring systems built on top of them.
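As a rough sketch of the kind of hybrid I have in mind (not Anthropic’s actual constitutional classifiers, and the function names here are hypothetical stand-ins): a cheap supervisor scores everything, and only the uncertain band gets escalated to a slower, more capable LLM judge.

```python
# Hybrid/cascade sketch: cheap classifier first, LLM judge only for the uncertain band.
# `cheap_classifier_score` and `llm_judge_is_harmful` are hypothetical stand-ins,
# not real library calls.

def moderate(prompt: str,
             cheap_classifier_score,   # fast scorer, returns a harm probability in [0, 1]
             llm_judge_is_harmful,     # slow but capable LLM-based judge, returns bool
             low: float = 0.1,
             high: float = 0.9) -> bool:
    """Return True if the prompt should be blocked or escalated."""
    score = cheap_classifier_score(prompt)
    if score >= high:
        return True   # confidently harmful: block without paying LLM latency
    if score <= low:
        return False  # confidently benign: pass through cheaply
    return llm_judge_is_harmful(prompt)  # uncertain band: spend the LLM call here
```

The point of the thresholds is that most real traffic falls outside the uncertain band, so the expensive LLM judge is only paid for on a small fraction of queries.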