Cool work! Harder scheming samples are valuable for better assessing monitoring capability, and I like the fact that your refinement pipeline does not depend on adversarial iteration against a specific monitor.
I agree that risk assessment/calibration would be high impact in improving monitors. Given the detection vs risk assessment plots and the strong FPR dependency in the prompt sensitivity table, I would be curious to know what the false positive samples contain and how they compare to the false negative samples you discuss.
Also, see the last point in this comment from Fabien Roger about sharing red-teaming findings.
Cool work! Harder scheming samples are valuable for better assessing monitoring capability, and I like the fact that your refinement pipeline does not depend on adversarial iteration against a specific monitor.
I agree that risk assessment/calibration would be high impact in improving monitors. Given the detection vs risk assessment plots and the strong FPR dependency in the prompt sensitivity table, I would be curious to know what the false positive samples contain and how they compare to the false negative samples you discuss.
Also, see the last point in this comment from Fabien Roger about sharing red-teaming findings.
Thanks, Victor! And great idea, we’ll have a look!