As you point out, it relies on some generalization from the scenarios you crafted to the ones you care about (the AI should not misgeneralize, either for benign reasons or because it recognizes that some scenarios are ones where you intentionally tried to get it to cheat). I think this is plausibly a big problem against competent schemers. I am unsure how big a deal this is for more average-case failures (e.g. before the AI becomes a competent schemer). I think it could be fine for human-level-ish AIs, for the same reasons that instruction following generalizes far, or it could not be fine, for the same reasons that evaluation awareness makes it risky and maybe not that helpful to train against honeypots.
I think this is plausibly a big problem against competent schemers.
Can you say more of what you think the problem is? Are you thinking of something like “the scheming module tries to figure out what kind of thing would trigger the honesty module and tries to think the kinds of thoughts that wouldn’t trigger it”?