You may also want to report threshold-dependent metrics, for instance the TPR/FPR on a distribution different from the one on which the threshold was fixed, or the calibration of the monitor.
Good point, we’ll add evaluations where we fix the threshold on one dataset and then measure how TPR/FPR change on a different distribution.
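A minimal sketch of that evaluation, assuming scalar monitor scores and binary labels (the function name, dataset names, and `target_fpr` value are illustrative, not from the discussion above):

```python
import numpy as np

def fixed_threshold_transfer(scores_a, labels_a, scores_b, labels_b, target_fpr=0.01):
    """Fix the monitor threshold on distribution A at a target FPR,
    then report the TPR/FPR obtained on distribution B at that same threshold."""
    neg_a = scores_a[labels_a == 0]
    # Threshold = (1 - target_fpr) quantile of negative scores on A,
    # so roughly target_fpr of A's negatives score above it.
    threshold = np.quantile(neg_a, 1 - target_fpr)
    preds_b = scores_b >= threshold
    tpr_b = preds_b[labels_b == 1].mean()  # true positive rate on B
    fpr_b = preds_b[labels_b == 0].mean()  # false positive rate on B
    return threshold, tpr_b, fpr_b

# Synthetic demo: distribution B's scores are shifted relative to A's,
# so the FPR realized on B can drift away from the target set on A.
rng = np.random.default_rng(0)
scores_a = np.concatenate([rng.normal(0, 1, 1000), rng.normal(2, 1, 1000)])
labels_a = np.concatenate([np.zeros(1000), np.ones(1000)])
scores_b = np.concatenate([rng.normal(0.5, 1, 1000), rng.normal(2.5, 1, 1000)])
labels_b = np.concatenate([np.zeros(1000), np.ones(1000)])
threshold, tpr_b, fpr_b = fixed_threshold_transfer(scores_a, labels_a, scores_b, labels_b)
```

Reporting `fpr_b` alongside the nominal `target_fpr` makes the distribution shift visible directly.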
Have you considered testing methods applied to the model powering the agent itself?
We’re definitely interested in the “interrogation” direction! For now it feels a bit early: people at labs working on catching scheming are still getting basic offline monitoring in place, so we’re focusing on making that work well first.
Thanks for the helpful feedback!