Victor Gillioz comments on Monitoring benchmark for AI control

Victor Gillioz 4 Feb 2026 18:36 UTC
3 points
0
Cool work! Harder scheming samples are valuable for better assessing monitoring capability, and I like the fact that your refinement pipeline does not depend on adversarial iteration against a specific monitor.

I agree that better risk assessment/calibration would be a high-impact way to improve monitors. Given the detection vs. risk assessment plots and the strong FPR dependency in the prompt sensitivity table, I would be curious to know what the false-positive samples contain and how they compare to the false-negative samples you discuss.

Also, see the last point in this comment from Fabien Roger about sharing red-teaming findings.
- monika_j 22 Feb 2026 21:56 UTC
  2 points
  0
  Thanks, Victor! And great idea, we’ll have a look!