Unless I’m missing something, the hoped-for advantages of this setup are the kind of thing AI safety via debate already aims at. In GDM’s recent paper on their approach to technical alignment, there’s some discussion of amplified oversight (starts at page 71) more generally, and debate (starts at page 73).
If you see the approach you’re suggesting as importantly different from debate approaches, it’d be useful to know where the key differences are.
(Without having read too carefully, my initial impression is that this is the kind of thing I expect to work for a while, then fail [as with debate] - and my core concern is then: how do we accurately predict when it'll fail?)
Thank you so much for bringing up that paper and finding the exact page most relevant! I learned a lot reading those pages. You’re a true researcher, take my strong upvote.
My idea consists of a “hammer” and a “nail.” GDM’s paper describes a “hammer” very similar to mine (perhaps superior), but lacks the “nail.”
The fact that the hammer they invented resembles the hammer I invented is evidence in my favour: I'm not badly confused :). I shouldn't be sad that my hammer invention already exists.[1]
The “nail” of my idea is making the Constitutional AI self-critique behave like a detective, using its intelligence to uncover the most damning evidence of scheming/dishonesty. This detective behaviour helps achieve the premises of the “Constitutional AI Sufficiency Theorem.”
The “hammer” of my idea is reinforcement learning to reward it for good detective work, with humans meticulously verifying its proofs (or damning evidence) of scheming/dishonesty.
It does seem like a lot of my post describes my hammer invention in detail, and is no longer novel :/