I think may be a significant issue for getting confidence in confessions, since it means you can only check that confessions work on kinds of bad outcomes that you’ve observed the main AIs actually take. If you never saw AIs do egregiously bad things, it seems hard to get very confident that AIs would flag egregiously bad things just based on the very mild bad behavior that is so mild you did not even bother training it out.
Maybe you can get a sense about this by training model organisms, similar to how you might evaluate probes, but this is weaker than the more direct approach you can use for black box monitors, where you can mostly check if it catches bad behavior generated by humans or by AIs prompted to do something bad (and where you strip the “please do bad thing X” at eval time).
I am glad that you mostly relied on on-policy evals (as opposed to static or synthetic evals), since some recent work I’ve mentored showed that off-policy evals are unreliable for this kind of follow-up-question monitoring.
I think may be a significant issue for getting confidence in confessions, since it means you can only check that confessions work on kinds of bad outcomes that you’ve observed the main AIs actually take. If you never saw AIs do egregiously bad things, it seems hard to get very confident that AIs would flag egregiously bad things just based on the very mild bad behavior that is so mild you did not even bother training it out.
Maybe you can get a sense about this by training model organisms, similar to how you might evaluate probes, but this is weaker than the more direct approach you can use for black box monitors, where you can mostly check if it catches bad behavior generated by humans or by AIs prompted to do something bad (and where you strip the “please do bad thing X” at eval time).