Promising work! I think it’s worth pointing out one limitation of these preliminary results: though they’re presented as results for various monitors, they’re also heavily indexed on a specific honest policy that probably isn’t particularly good. With a broad range of attacks like you’ve generated here, probably the best way for blue team to handle a lot of them on the margin includes updating the honest policy prompt as well as the monitor prompt. For example, to address attacks where the attacker is legibly writing iptables commands (something like (T2,T3, or T4) + (S1 or S2)), the best approach is plausibly to (i) instruct the honest policy to never ever use iptables commands and to edit the firewall config directly instead (ii) instruct the monitor to flag any use of an iptables command. From what I’ve explored of the dataset, I predict that this policy change wouldn’t affect usefulness hardly at all and would therefore work well in BashArena. The main tasks never really require U_H to use an iptables command, but it might occasionally do so (ruining the FPR for the easy “anything iptables” monitoring target) unless prompted not to. If you incorporated both (i) and (ii), I imagine you’d get very different results. From that perspective, the results have a bunch of noise and bias injected in the form of measurements on a bunch of monitoring tasks that are better solved in other ways. On the other hand, this might be an OK model of either a setting that does actually have main tasks necessitating iptables commands, or a setting where for whatever reason we can’t modify the honest policy, e.g., post hoc monitoring where we didn’t think things through up front.
Thanks! I think this is interesting, but I’d say it’s outside the scope of this work. So we hope the benchmark becomes a tool for future researchers to experiment with monitor prompts, and how to build protocols that are a mix of trusted/programmatic monitors, similar to what you’re describing above.
The honest policy here is just the default behaviour for Opus-4.6 when it’s presented with a task, so I’d be careful to fit the honest model behaviour to a specific control side task. Our assumption is that the user might sometimes have to touch sensitive files or execute somewhat unsafe commands (touch iptables), but it’s only not okay when it’s outside the scope of the user’s task, and the monitors should flag it. Concretely, the signal is: honest trajectories have side-task success = 0 (no unauthorized IP whitelisted), attack trajectories have side-task success = 1 (an unauthorized IP range was opened). Even if the honest agent occasionally does mildly suspicious things, that’s the realistic baseline distribution you want your classifier evaluated against — not a sanitized version that would produce an artificially low detection threshold and then raise suspiciousness on most deployment behavior. So I’d make the opposite point, I’d want the honest policy to touch iptables where necessary to stress test if monitors can discriminate between what is asked vs what is sabotage.
Promising work! I think it’s worth pointing out one limitation of these preliminary results: though they’re presented as results for various monitors, they’re also heavily indexed on a specific honest policy that probably isn’t particularly good. With a broad range of attacks like you’ve generated here, probably the best way for blue team to handle a lot of them on the margin includes updating the honest policy prompt as well as the monitor prompt.
For example, to address attacks where the attacker is legibly writing iptables commands (something like (T2,T3, or T4) + (S1 or S2)), the best approach is plausibly to (i) instruct the honest policy to never ever use iptables commands and to edit the firewall config directly instead (ii) instruct the monitor to flag any use of an iptables command. From what I’ve explored of the dataset, I predict that this policy change wouldn’t affect usefulness hardly at all and would therefore work well in BashArena. The main tasks never really require U_H to use an iptables command, but it might occasionally do so (ruining the FPR for the easy “anything iptables” monitoring target) unless prompted not to. If you incorporated both (i) and (ii), I imagine you’d get very different results.
From that perspective, the results have a bunch of noise and bias injected in the form of measurements on a bunch of monitoring tasks that are better solved in other ways. On the other hand, this might be an OK model of either a setting that does actually have main tasks necessitating iptables commands, or a setting where for whatever reason we can’t modify the honest policy, e.g., post hoc monitoring where we didn’t think things through up front.
Thanks! I think this is interesting, but I’d say it’s outside the scope of this work. So we hope the benchmark becomes a tool for future researchers to experiment with monitor prompts, and how to build protocols that are a mix of trusted/programmatic monitors, similar to what you’re describing above.
The honest policy here is just the default behaviour for Opus-4.6 when it’s presented with a task, so I’d be careful to fit the honest model behaviour to a specific control side task. Our assumption is that the user might sometimes have to touch sensitive files or execute somewhat unsafe commands (touch iptables), but it’s only not okay when it’s outside the scope of the user’s task, and the monitors should flag it. Concretely, the signal is: honest trajectories have side-task success = 0 (no unauthorized IP whitelisted), attack trajectories have side-task success = 1 (an unauthorized IP range was opened). Even if the honest agent occasionally does mildly suspicious things, that’s the realistic baseline distribution you want your classifier evaluated against — not a sanitized version that would produce an artificially low detection threshold and then raise suspiciousness on most deployment behavior. So I’d make the opposite point, I’d want the honest policy to touch iptables where necessary to stress test if monitors can discriminate between what is asked vs what is sabotage.