As you point out, it relies on some generalization from the scenarios you crafted to the ones you care about (the AI should not misgeneralize, either for benign reasons or because it recognizes that some scenarios are ones where you intentionally tried to get it to cheat). I think this is plausibly a big problem against competent schemers. I am unsure how big a deal this is for more average-case failures (e.g. before the AI becomes a competent schemer). I think it could be fine for human-level-ish AIs, for the same reasons that instruction following generalizes far, or it could not be fine, for the same reasons that evaluation awareness makes it risky and maybe not that helpful to train against honeypots.
I think this is plausibly a big problem against competent schemers.
Can you say more of what you think the problem is? Are you thinking of something like “the scheming module tries to figure out what kind of thing would trigger the honesty module and tries to think the kinds of thoughts that wouldn’t trigger it”?