In broad strokes I agree with Zac. And, to be clear, I’m generally a fan of formal verification and have done part of a PhD in program synthesis.
So, GSAI addresses the band of capabilities where AI systems are potentially no longer safe to interact with directly, due to their ability to model and exploit human psychology.
This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to “don’t build such AIs”, in which case I would appreciate that being stated more directly, or to “limit the use of such AIs to tasks where we can formally verify soundness and uniqueness”, in which case I’d like an estimate of what fraction of economically valuable work this corresponds to).
Can you sketch out how one produces a sound over-approximation of human psychology? Or how one constructs a safety specification guaranteeing that the AIs won’t exploit human psychology?
Based on my understanding of the article:
The sound over-approximation of human psychology is that humans are psychologically safe from information attacks of less than N bits. “Talk Control” is real, “Charm Person” is not.
Under “Steganography, and other funny business” there is a sketched safety specification that each use of the AI will communicate at most one bit of information.
Not stated explicitly: humans will be restricted to using the AI system no more than N times.
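Putting those three readings together, a minimal sketch of the interface they imply might look like the following. The class names, the `answer_yes_no` model call, and the idea of enforcing the budget in a wrapper are my assumptions for illustration, not anything specified in the article:

```python
# Hypothetical sketch: "at most one bit per use, at most N uses in total".
# Nothing here is from the article; names and the wrapper design are assumed.

class BudgetExhausted(Exception):
    """Raised once the humans have consumed their N allotted bits."""

class OneBitOracle:
    def __init__(self, model, total_bits: int):
        self._model = model            # the untrusted AI system
        self._remaining = total_bits   # the global bit budget N

    def ask(self, question: str) -> bool:
        """Ask a yes/no question; the only output channel is a single bit."""
        if self._remaining <= 0:
            raise BudgetExhausted("no safe bits left")
        self._remaining -= 1
        answer = self._model.answer_yes_no(question)   # assumed model API
        return bool(answer)   # everything else the model produced is discarded
```

The point of the sketch is that the return type, rather than reviewer discipline, is what enforces the one-bit bound; anything richer the model produces (explanations, confidence scores, latency) has to be stripped or flattened before it reaches a human, which is where the side-channel concerns below come in.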
Comments and concerns:
Human psychology is also affected by the physical environment, e.g. drugs, diseases, being paperclipped. The safety of the physical environment must be covered by separate verifications.
There could be a side-channel for information if an AI answers some questions faster than others, uses more energy for some questions than others, etc.
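For the timing part of this, a standard mitigation is to pad every response to a fixed deadline so latency carries no information. A rough sketch, where the deadline value and the `oracle.ask` interface are assumptions carried over from the sketch above rather than anything in the article:

```python
import time

# Illustrative timing-side-channel mitigation: every query takes exactly
# FIXED_LATENCY_SECONDS from the caller's point of view, however long the
# model actually spent. Energy use and other physical side channels would
# need separate, analogous flattening.
FIXED_LATENCY_SECONDS = 3600.0   # assumed value; must exceed worst-case runtime

def constant_latency_ask(oracle, question: str) -> bool:
    start = time.monotonic()
    answer = oracle.ask(question)
    elapsed = time.monotonic() - start
    if elapsed > FIXED_LATENCY_SECONDS:
        # The bound was wrong; returning now would leak timing information.
        raise RuntimeError("worst-case latency exceeded; discard the response")
    time.sleep(FIXED_LATENCY_SECONDS - elapsed)
    return answer
```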
Machine interpretability techniques must be deployed in a side-channel-resistant way. We can’t have the AI thinking about pegasi and unicorns in a Morse-code pattern, an intern reading it, and everyone being a pony ten years later.
There probably need to be multiple values of N for different time-frames. 1,000 adversarial bits in a minute is more psychologically dangerous than the same number of bits over a year.
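That suggests a set of nested budgets rather than a single N. A sketch, with the particular windows and limits being placeholders for the example rather than proposed safe values:

```python
import time

# Illustrative multi-timescale bit budgets. The per-minute, per-day, and
# per-year limits are made up for the example, not claimed safe values of N.
BUDGETS = {
    60:               10,       # bits allowed per minute
    24 * 3600:        1_000,    # bits allowed per day
    365 * 24 * 3600:  20_000,   # bits allowed per year
}

class MultiWindowBudget:
    def __init__(self, budgets=BUDGETS):
        self._budgets = budgets
        self._timestamps = []   # one entry per bit already released to humans

    def spend_bit(self) -> None:
        """Record one released bit, refusing if any time-window budget is full."""
        now = time.monotonic()
        for window, limit in self._budgets.items():
            recent = sum(1 for t in self._timestamps if now - t < window)
            if recent >= limit:
                raise RuntimeError(f"bit budget exhausted for the {window}-second window")
        self._timestamps.append(now)
```

A wrapper like the `OneBitOracle` above could call `spend_bit()` before answering, so all the windows are enforced at a single choke point.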
Today, we don’t know good values for N, but we can spend the first few bits on establishing higher safe values of N. We can also use the Yudkowskian technique of using volunteers who are killed or put into cryonic storage after being exposed to the bits.
If we could prove that AIs cannot acausally coordinate, we could increase the bound to N bits per AI (or per AI instance). Again, a good use for the initial bits.
None of this stops us going extinct.