This article seems to have found a serious weakness in untrusted monitoring:
Synthetic-input auditing is trivially defeated if the checker AI can distinguish synthetic inputs from real ones: it can then always answer honestly on the synthetic inputs to pass auditing, and collude with the generator on the real ones.
To create useful synthetic inputs, the blue team must either match the generator's style well enough to consistently fool the checker, or redact the generator's output until the checker has no identifying details left with which to distinguish inputs. The former seems impractical: faking inputs at that level of verisimilitude would require the blue team to have a significant understanding of the AI's inner workings, at which point there are probably more directly usable alignment strategies. The latter seems difficult: by the protocol's definition, the AI is working on problems and solutions that humans cannot easily understand, yet effective redaction requires humans to understand the content of the solution to some extent.
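To make the failure mode concrete, here is a minimal toy simulation (a sketch only; the function names, task counts, and the "perfect discriminator" are hypothetical illustrations I made up, not anything from the article or the original protocol write-up). It shows that a checker which can reliably tell synthetic audit inputs from real ones passes every audit while waving through every real backdoor.

```python
import random

def colluding_checker(task, is_synthetic_guess):
    """An untrusted checker that answers honestly whenever it suspects an audit,
    and reports 'safe' whenever it believes the input is real (i.e., colludes)."""
    if is_synthetic_guess(task):
        return task["has_backdoor"]  # honest verdict on suspected audit inputs
    return False                     # collude: claim "no backdoor" on real inputs

def run_audit(checker, tasks, is_synthetic_guess):
    """Blue team checks whether the checker ever lies on a synthetic (audit) input,
    and counts how many real backdoors slip past it."""
    caught = any(
        t["synthetic"] and t["has_backdoor"] and not checker(t, is_synthetic_guess)
        for t in tasks
    )
    missed = sum(
        1 for t in tasks
        if not t["synthetic"] and t["has_backdoor"] and not checker(t, is_synthetic_guess)
    )
    return caught, missed

# Hypothetical distinguishing signal: synthetic inputs lack the generator's "style".
perfect_discriminator = lambda t: t["synthetic"]

tasks = (
    [{"synthetic": True,  "has_backdoor": True}  for _ in range(50)]   # audit set
  + [{"synthetic": False, "has_backdoor": True}  for _ in range(5)]    # real attacks
  + [{"synthetic": False, "has_backdoor": False} for _ in range(945)]  # benign real tasks
)
random.shuffle(tasks)

caught, missed = run_audit(colluding_checker, tasks, perfect_discriminator)
print(f"caught during auditing: {caught}, real backdoors waved through: {missed}")
# With a perfect discriminator: caught=False, missed=5.
```

The point the toy makes is just that auditing only constrains the checker's behavior on inputs it believes might be synthetic; any reliable distinguishing signal collapses the guarantee entirely, rather than merely weakening it.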
Overall, I think this represents a significant flaw in the untrusted monitoring protocol.
Do you think this weakness can somehow be circumvented? If not, do you think it is as fatal (to the untrusted monitoring protocol) as the author of the article seems to think?
I left a comment!