papetoast comments on Introducing LinuxArena

papetoast 21 Apr 2026 12:10 UTC
3 points
0
Sabotage (probability of an attack causing harm and is uncaught) of the GPT-5 Nano monitor at a 1% action-level false positive rate (1% of honest actions would be flagged for audit). More capable and strategic attackers are harder to detect, reducing safety. Human-crafted attacks (LaStraj) evade the monitor almost entirely
Just to be sure, the human-crafted attacks do not work by making every action super suspicious and trigger the monitor, rendering the monitor useless at any reasonably low FPR right?
- Ram Potham 21 Apr 2026 15:01 UTC
  4 points
  0
  Parent
  Yeah, we used the 1% FPR for Opus Honest policy for the bar graphs (threshold 7.5) for fair comparison. But even at other thresholds, human attacks are better