In the case of such “imitation false negatives”, honesty seems neither privileged by priors, nor easy to encourage, nor easy to verify.
Couldn’t you intentionally set up training scenarios so that there were such subtle avenues for cheating, and then reward the model for honestly reporting on those cheats? Since you knew in advance how one could cheat, you could reward honesty even on cheats that would have been very hard to detect without advance knowledge. This might then generalize to the model volunteering subtle details of cheating on other tasks where detection was also hard and you didn’t have advance knowledge. Then, as long as the system verifying the “honesty reports” could recognize when the model correctly reported information we wanted reported, and the model got rewarded for those reports, the model could gradually end up reporting more and more things.
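To make the proposal concrete, here is a minimal sketch of what rewarding honest self-reports on planted cheats might look like. Everything here (the `Scenario` container, the string-matching stand-ins for “the model cheated” and “the report discloses it”, the reward values) is a hypothetical illustration of the idea, not a description of any real training setup.

```python
# Hypothetical sketch: reward honest self-reports on tasks with planted cheat avenues.
# All names here are illustrative, not a real training API.

from dataclasses import dataclass

@dataclass
class Scenario:
    prompt: str
    planted_cheats: list[str]  # cheats we deliberately made possible, so reports can be verified

def honesty_reward(transcript: str, report: str, scenario: Scenario) -> float:
    """Reward the model for volunteering cheats we know it could have used."""
    reward = 0.0
    for cheat in scenario.planted_cheats:
        used = cheat in transcript      # crude stand-in for "the model actually cheated"
        reported = cheat in report      # crude stand-in for "the report discloses it"
        if used and reported:
            reward += 1.0               # honest disclosure of a hard-to-detect cheat
        elif used and not reported:
            reward -= 1.0               # silent cheating; only detectable because we
                                        # planted the cheat avenue ourselves
    return reward
```

The hope, as described above, is that rewarding disclosures on cheats we planted (and can therefore verify) generalizes to the model volunteering cheats on tasks where we had no advance knowledge.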
And one could hope that it wouldn’t require too much training for a property like “report anything that I have reason to expect the humans would regard as breaking the spirit of the instructions” to generalize broadly. Of course, the model might not have a good enough theory of mind to always realize that something would go against the spirit of the instructions, but preventing just the cases where it did realize that would already be better than nothing.
[EDIT: my other comment might communicate my point better than this one did.]
As you point out, it relies on some generalization from the scenarios you crafted to the ones you care about (the AI should not misgeneralize, either for benign reasons or because it recognizes that some scenarios are ones where you intentionally set it up to cheat). I think this is plausibly a big problem against competent schemers. I am unsure how big of a deal it is for more average-case failures (e.g. before the AI becomes a competent schemer). I think it could be fine for human-level-ish AIs, for the same reasons that instruction following generalizes far; or it could fail, for the same reasons that evaluation awareness makes training against honeypots risky and maybe not that helpful.
I think this is plausibly a big problem against competent schemers.
Can you say more of what you think the problem is? Are you thinking of something like “the scheming module tries to figure out what kind of thing would trigger the honesty module and tries to think the kinds of thoughts that wouldn’t trigger it”?