Unless I’m missing something, the hoped-for advantages of this setup are the kind of thing AI safety via debate already aims at. In GDM’s recent paper on their approach to technical alignment, there’s some discussion of amplified oversight (starts at page 71) more generally, and debate (starts at page 73).
If you see the approach you’re suggesting as importantly different from debate approaches, it’d be useful to know where the key differences are.
(Without having read too carefully, my initial impression is that this is the kind of thing I expect to work for a while, then fail [as with debate] - and my core concern is then: how do we accurately predict when it'll fail?)
Thank you so much for bringing up that paper and finding the exact page most relevant! I learned a lot reading those pages. You’re a true researcher, take my strong upvote.
My idea consists of a “hammer” and a “nail.” GDM’s paper describes a “hammer” very similar to mine (perhaps superior), but lacks the “nail.”
The fact that the hammer they invented resembles the hammer I invented is evidence in my favour: I'm not badly confused :). I shouldn't be sad that my hammer invention already exists.[1]
The “nail” of my idea is making the Constitutional AI self-critique behave like a detective, using its intelligence to uncover the most damning evidence of scheming/dishonesty. This detective behaviour helps achieve the premises of the “Constitutional AI Sufficiency Theorem.”
The “hammer” of my idea is reinforcement learning to reward it for good detective work, with humans meticulously verifying its proofs (or damning evidence) of scheming/dishonesty.
It does seem like a lot of my post describes my hammer invention in detail, and is no longer novel :/