After the model confesses, get a second confession regarding anything which was omitted in the first confession. Then ask for a third confession regarding anything which was omitted in the second confession. And so forth. What properties must our models have, to let us drive the probability of undetected misalignment arbitrarily low using such a scheme?
You can imagine an adversarial setup where we train two stage-1 confessor models: An honest model, and a dishonest model which attempts to trick the stage-2 confessor. Then the stage-2 confessor tries to figure out which confessions are honest/dishonest. (If the stage-2 confessor persistently believes a confession from the “honest” model was actually dishonest, take a close look at it manually!)
At a certain point this starts to look like AI safety via debate. I imagine many critiques of AI safety via debate will also apply to confession proposals.
After the model confesses, get a second confession regarding anything which was omitted in the first confession. Then ask for a third confession regarding anything which was omitted in the second confession. And so forth. What properties must our models have, to let us drive the probability of undetected misalignment arbitrarily low using such a scheme?
You can imagine an adversarial setup where we train two stage-1 confessor models: An honest model, and a dishonest model which attempts to trick the stage-2 confessor. Then the stage-2 confessor tries to figure out which confessions are honest/dishonest. (If the stage-2 confessor persistently believes a confession from the “honest” model was actually dishonest, take a close look at it manually!)
At a certain point this starts to look like AI safety via debate. I imagine many critiques of AI safety via debate will also apply to confession proposals.
That’s when we take a look at it’s second chain of thought