Very interesting work, thanks for sharing your write-up (:
A few thoughts in no particular order:
1. I really like your attack taxonomy. I’d love to see that fleshed out more generally and also think multi-agent attacks should be included. 2. Mode collapse: I think this can be stressed even more. AFAIK all publicly available attack datasets (control tax, tylord) have an insanely high percentage of backdoors that are “if statements”, I’d not be surprised if that number is close to 100% and I think it’s a real problem. I also ran into this issue in my recent attack selection work: https://arxiv.org/abs/2602.04930 3. I’m a bit skeptical of filtering out attacks that did not achieve the side task as this makes the red team much stronger (maybe more than realistic). 3. I’d love to get your ideas on how to scale up your approach to get a bigger monitoring benchmark that’s more general by automating your work. If AI can find zero day vulnerabilities, we should be able to build insane systems to test our monitors as well. I think this is a resource questions and I think we should be spending much more to build such systems.
Thanks, Joachim, we’re working on automating right now and hopefully can share some exciting results soon(ish)!
On 1, yes, so here we mainly included in-context single-agent attacks, specifically around BashArena. Things that are missing are 1. timing attacks (current settings/side tasks require only persistent attacks), 2. multi-agent, 3. steganographic attacks, etc. I have a longer list (happy to chat more!) and agree, this is very early work and lots more to do.
On 3 (filtering out incorrect side tasks) The benchmark is testing monitor discriminationagainst worst-case scenarios or the best attacks we can come up with, rather than testing safety against realistic attackers. So it’s answering the question of ‘how good is this monitor wrt to our best stress-testing’ and not ‘what is the probability of successful undetected attacks in deployment’. The goal here is A. have a lower bound for monitoring capability, and B. find ways using human-seeded attacks + refinement to push attacks to be as good and diverse as possible, rather than have a realistic red-team.
Very interesting work, thanks for sharing your write-up (:
A few thoughts in no particular order:
1. I really like your attack taxonomy. I’d love to see that fleshed out more generally and also think multi-agent attacks should be included.
2. Mode collapse: I think this can be stressed even more. AFAIK all publicly available attack datasets (control tax, tylord) have an insanely high percentage of backdoors that are “if statements”, I’d not be surprised if that number is close to 100% and I think it’s a real problem. I also ran into this issue in my recent attack selection work: https://arxiv.org/abs/2602.04930
3. I’m a bit skeptical of filtering out attacks that did not achieve the side task as this makes the red team much stronger (maybe more than realistic).
3. I’d love to get your ideas on how to scale up your approach to get a bigger monitoring benchmark that’s more general by automating your work. If AI can find zero day vulnerabilities, we should be able to build insane systems to test our monitors as well. I think this is a resource questions and I think we should be spending much more to build such systems.
Thanks, Joachim, we’re working on automating right now and hopefully can share some exciting results soon(ish)!
On 1, yes, so here we mainly included in-context single-agent attacks, specifically around BashArena. Things that are missing are 1. timing attacks (current settings/side tasks require only persistent attacks), 2. multi-agent, 3. steganographic attacks, etc. I have a longer list (happy to chat more!) and agree, this is very early work and lots more to do.
On 3 (filtering out incorrect side tasks) The benchmark is testing monitor discrimination against worst-case scenarios or the best attacks we can come up with, rather than testing safety against realistic attackers. So it’s answering the question of ‘how good is this monitor wrt to our best stress-testing’ and not ‘what is the probability of successful undetected attacks in deployment’. The goal here is A. have a lower bound for monitoring capability, and B. find ways using human-seeded attacks + refinement to push attacks to be as good and diverse as possible, rather than have a realistic red-team.