Thanks, this seems like a great work! A couple of comments:
metrics: as you correctly note, AUROC is threshold-agnostic, which is great for comparing across models. However, the other side of the coin of this is that it is insensitive to monotonic transformations of the assigned values. Thus, it is possible that, in an OOD case, AUROC remains high but the monitor assigns 10x smaller values to the samples. In operational scenarios, one would set a threshold on a known distribution (which may be different from the training one for the monitor) and then use it for the operational distribution. A high value of AUROC implies that you can determine a threshold value for which TPR/FPR values are great on that distribution, but that threshold may not be good for the new distribution. That also holds for log-AUROC and TPR at fixed FPR (as FPR also depends on the threshold). This is just to say, you may want to consider reporting threshold-dependent metrics as well, for instance the TPR/FPR on a distribution different from the one where the threshold is fixed, or the calibration of the monitor.
Besides external monitors, have you considered testing methods applied to the model powering the agent itself? Eg, querying the model with fixed questions and fitting a classifier on those (as we did in this work)?
you may want to consider reporting threshold-dependent metrics as well, for instance the TPR/FPR on a distribution different from the one where the threshold is fixed, or the calibration of the monitor.
good point, we’ll add evaluations where we fix the threshold on one dataset and then see how TPR/FPR change on a different distribution
have you considered testing methods applied to the model powering the agent itself?
we’re definitely interested in the “interrogation” direction! For now it feels a bit early. People in labs working on catching scheming are still getting basic offline monitoring in place so we’re focusing on making that work well first.
Thanks, this seems like a great work! A couple of comments:
metrics: as you correctly note, AUROC is threshold-agnostic, which is great for comparing across models. However, the other side of the coin of this is that it is insensitive to monotonic transformations of the assigned values. Thus, it is possible that, in an OOD case, AUROC remains high but the monitor assigns 10x smaller values to the samples. In operational scenarios, one would set a threshold on a known distribution (which may be different from the training one for the monitor) and then use it for the operational distribution. A high value of AUROC implies that you can determine a threshold value for which TPR/FPR values are great on that distribution, but that threshold may not be good for the new distribution. That also holds for log-AUROC and TPR at fixed FPR (as FPR also depends on the threshold). This is just to say, you may want to consider reporting threshold-dependent metrics as well, for instance the TPR/FPR on a distribution different from the one where the threshold is fixed, or the calibration of the monitor.
Besides external monitors, have you considered testing methods applied to the model powering the agent itself? Eg, querying the model with fixed questions and fitting a classifier on those (as we did in this work)?
thanks for the helpful feedback!
good point, we’ll add evaluations where we fix the threshold on one dataset and then see how TPR/FPR change on a different distribution
we’re definitely interested in the “interrogation” direction! For now it feels a bit early. People in labs working on catching scheming are still getting basic offline monitoring in place so we’re focusing on making that work well first.