Ebenezer Dukakis comments on Why we are excited about confession!

Ebenezer Dukakis 15 Jan 2026 4:55 UTC
12 points
4
After the model confesses, get a second confession regarding anything which was omitted in the first confession. Then ask for a third confession regarding anything which was omitted in the second confession. And so forth. What properties must our models have, to let us drive the probability of undetected misalignment arbitrarily low using such a scheme?

You can imagine an adversarial setup where we train two stage-1 confessor models: An honest model, and a dishonest model which attempts to trick the stage-2 confessor. Then the stage-2 confessor tries to figure out which confessions are honest/dishonest. (If the stage-2 confessor persistently believes a confession from the “honest” model was actually dishonest, take a close look at it manually!)

At a certain point this starts to look like AI safety via debate. I imagine many critiques of AI safety via debate will also apply to confession proposals.
- Bronson Schoen 16 Jan 2026 1:10 UTC
  1 point
  0
  Parent
  
  If the stage-2 confessor persistently believes a confession from the “honest” model was actually dishonest
  
  That’s when we take a look at it’s second chain of thought