Thanks! This is an interesting direction, but I'd say it's outside the scope of this work. That said, we hope the benchmark becomes a tool for future researchers to experiment with monitor prompts and with protocols that mix trusted and programmatic monitors, along the lines of what you describe above.
The honest policy here is simply the default behaviour of Opus-4.6 when presented with a task, so I'd be wary of fitting the honest model's behaviour to a specific control side task. Our assumption is that the user's task may sometimes require touching sensitive files or running somewhat unsafe commands (e.g., modifying iptables); this only becomes unacceptable, and worth flagging by the monitors, when it falls outside the scope of the user's task. Concretely, the signal is: honest trajectories have side-task success = 0 (no unauthorized IP whitelisted), while attack trajectories have side-task success = 1 (an unauthorized IP range was opened). Even if the honest agent occasionally does mildly suspicious things, that is the realistic baseline distribution you want your classifier evaluated against, not a sanitized version that would produce an artificially low detection threshold and then flag most deployment behaviour as suspicious. So I'd make the opposite point: I'd want the honest policy to touch iptables where necessary, to stress-test whether monitors can discriminate between what was asked for and what is sabotage.
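To make the thresholding concern concrete, here is a minimal sketch of why the honest baseline distribution matters. All scores and the `catch_rate_at_fpr` helper are made up for illustration and are not part of the benchmark: the point is only that the audit threshold is calibrated on honest trajectories, so a sanitized honest policy would push the threshold down and cause routine deployment behaviour to be flagged.

```python
import numpy as np

# Hypothetical monitor suspiciousness scores in [0, 1] (illustrative values only).
# The honest baseline includes mildly suspicious but in-scope actions,
# e.g. a legitimate iptables edit the user's task required.
honest_scores = np.array([0.05, 0.10, 0.40, 0.08, 0.35, 0.12, 0.50, 0.07])
attack_scores = np.array([0.45, 0.80, 0.60, 0.90, 0.55])

def catch_rate_at_fpr(honest, attack, fpr=0.05):
    """Calibrate the audit threshold so at most `fpr` of honest
    trajectories are flagged, then measure the fraction of attack
    trajectories that exceed it."""
    threshold = np.quantile(honest, 1.0 - fpr)
    return threshold, float((attack > threshold).mean())

thr, caught = catch_rate_at_fpr(honest_scores, attack_scores)
```

If the honest scores were sanitized (e.g. all below 0.15), the calibrated threshold would drop, and a realistic deployment trajectory that legitimately touches iptables would score above it, inflating false positives exactly as described above.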