Sabotage (probability of an attack causing harm and is uncaught) of the GPT-5 Nano monitor at a 1% action-level false positive rate (1% of honest actions would be flagged for audit). More capable and strategic attackers are harder to detect, reducing safety. Human-crafted attacks (LaStraj) evade the monitor almost entirely
Just to be sure, the human-crafted attacks do not work by making every action super suspicious and trigger the monitor, rendering the monitor useless at any reasonably low FPR right?
Yeah, we used the 1% FPR for Opus Honest policy for the bar graphs (threshold 7.5) for fair comparison. But even at other thresholds, human attacks are better
Just to be sure, the human-crafted attacks do not work by making every action super suspicious and trigger the monitor, rendering the monitor useless at any reasonably low FPR right?
Yeah, we used the 1% FPR for Opus Honest policy for the bar graphs (threshold 7.5) for fair comparison. But even at other thresholds, human attacks are better