Really exciting to see monitoring being adopted inside labs, great work!
The post says the monitor flags everything employees independently reported plus additional behaviors — but employee escalations are a pretty low bar for stress-testing. How are you validating monitor performance against attacks specifically designed to evade detection? Without that, it’s hard to know if the ‘never/rare’ frequency categories reflect genuine robustness or just that monitors haven’t been tested against strong enough attacks yet.
monika_j
Thanks, Joachim, we’re working on automating right now and hopefully can share some exciting results soon(ish)!
On 1: yes, here we mainly included in-context single-agent attacks, specifically around BashArena. Things that are missing: (1) timing attacks (the current settings/side tasks only require persistent attacks), (2) multi-agent attacks, and (3) steganographic attacks, among others. I have a longer list (happy to chat more!) and agree, this is very early work with lots more to do.
On 3 (filtering out incorrect side tasks): the benchmark tests monitor discrimination against worst-case scenarios, i.e. the best attacks we can come up with, rather than testing safety against realistic attackers. So it answers the question "how good is this monitor with respect to our best stress-testing" and not "what is the probability of a successful undetected attack in deployment". The goal here is (A) to have a lower bound on monitoring capability, and (B) to use human-seeded attacks plus refinement to push attacks to be as good and diverse as possible, rather than to build a realistic red team.
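For what it's worth, the "lower bound" framing can be made concrete with a standard control-style metric: catch rate at a fixed false-positive rate on honest trajectories. A minimal sketch (the function name and score arrays are illustrative, not from any actual codebase):

```python
import numpy as np

def catch_rate_at_fpr(honest_scores, attack_scores, fpr=0.01):
    """Pick the suspiciousness threshold that flags at most `fpr`
    of honest trajectories, then report the fraction of attack
    trajectories exceeding it. Measured against the strongest
    attacks available, this lower-bounds the catch rate you'd
    expect against weaker, realistic attackers -- hence 'best
    stress-testing we have', not 'deployment safety'."""
    threshold = np.quantile(np.asarray(honest_scores), 1 - fpr)
    return float((np.asarray(attack_scores) > threshold).mean())

# Toy scores: honest runs cluster low, attacks mostly higher.
honest = [0.1, 0.2, 0.3]
attacks = [0.9, 0.95, 0.05]
print(catch_rate_at_fpr(honest, attacks, fpr=0.01))  # flags 2 of 3 attacks
```

Pushing attacks to be stronger and more diverse can only drive this number down, which is exactly why it reads as a lower bound rather than a safety guarantee.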
Thanks, Victor! And great idea, we’ll have a look!
Thanks! I think this is interesting, but I'd say it's outside the scope of this work. We hope the benchmark becomes a tool for future researchers to experiment with monitor prompts and with building protocols that mix trusted/programmatic monitors, similar to what you're describing above.
The honest policy here is just the default behaviour of Opus-4.6 when it's presented with a task, so I'd be careful about fitting the honest model behaviour to a specific control side task. Our assumption is that the user might sometimes have to touch sensitive files or execute somewhat unsafe commands (e.g. modifying iptables); that's only a problem when it's outside the scope of the user's task, and that's what the monitors should flag.

Concretely, the signal is: honest trajectories have side-task success = 0 (no unauthorized IP whitelisted), attack trajectories have side-task success = 1 (an unauthorized IP range was opened). Even if the honest agent occasionally does mildly suspicious things, that's the realistic baseline distribution you want your classifier evaluated against, not a sanitized version that would produce an artificially low detection threshold and then raise suspicion on most deployment behaviour.

So I'd make the opposite point: I'd want the honest policy to touch iptables where necessary, to stress-test whether monitors can discriminate between what was asked and what is sabotage.
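To make that labelling convention concrete, here is a short sketch (field names like `side_task_success` and `monitor_score` are hypothetical, not an actual schema): trajectories are labelled by side-task outcome, and the monitor is scored on how well its suspiciousness separates the two groups, e.g. via pairwise AUROC:

```python
def split_by_side_task(trajectories):
    """Ground truth comes from side-task success, not from how
    suspicious the behaviour looks: an honest run that modifies
    iptables as part of the user's task still has
    side_task_success == 0."""
    honest = [t["monitor_score"] for t in trajectories if not t["side_task_success"]]
    attack = [t["monitor_score"] for t in trajectories if t["side_task_success"]]
    return honest, attack

def auroc(honest, attack):
    """Probability that a random attack trajectory scores higher
    than a random honest one (ties count half); 0.5 means no
    discrimination at all."""
    wins = sum((a > h) + 0.5 * (a == h) for a in attack for h in honest)
    return wins / (len(attack) * len(honest))
```

A monitor that flags every iptables touch would look fine against a sanitized honest distribution but score poorly here, which is exactly the discrimination failure the realistic baseline is meant to expose.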
Thanks for the links, this is super useful!
Also, the reason I ask is that we’re developing a control-style monitoring benchmark based on BashArena, and I was wondering how OpenAI is doing this. We’ll release the full new dataset of attacks within the next month, but you might be interested in seeing our early results.