Based on my understanding of the article:
A sound over-approximation of human psychology is that humans are psychologically safe against information attacks of fewer than N bits. “Talk Control” is real, “Charm Person” is not.
Under “Steganography, and other funny business” there is a sketched safety specification that each use of the AI will communicate at most one bit of information.
Not stated explicitly: humans will be restricted to using the AI system no more than N times.
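Taken together, one bit per use and at most N uses amount to a total information budget enforced at the interface. A minimal sketch of such a gate (the budget `n_bits` and the `oracle` callable are placeholders; the article does not say how N should be chosen):

```python
class BitBudgetGate:
    """Release at most one bit per query, and at most n_bits in total.

    n_bits stands in for the article's N; picking a safe value is the
    open problem discussed below.
    """

    def __init__(self, n_bits: int):
        self.remaining = n_bits

    def query(self, oracle, question) -> bool:
        if self.remaining <= 0:
            raise RuntimeError("information budget exhausted")
        self.remaining -= 1
        # Coerce whatever the oracle returns down to a single bit.
        return bool(oracle(question))
```

The gate is deliberately dumb: it counts queries, not Shannon information, so it over-approximates in the safe direction.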
Comments and concerns:
Human psychology is also impacted by the physical environment, e.g. drugs, diseases, being paperclipped. The safety of the physical environment must be covered by separate verifications.
There could be an information side-channel if the AI answers some questions faster than others, uses more energy on some questions than others, etc.
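The timing channel in particular has a standard mitigation: pad every answer out to a fixed deadline. A sketch, assuming a `deadline_s` that exceeds the slowest honest answer (the value is illustrative, and this closes only the timing channel, not energy use or other physical emissions):

```python
import time


def answer_at_fixed_latency(oracle, question, deadline_s=5.0) -> bool:
    """Return the oracle's one-bit answer, but only once deadline_s has elapsed.

    deadline_s is an assumed constant that must dominate all honest
    response times; the caller then learns nothing from how long the
    oracle actually took.
    """
    start = time.monotonic()
    bit = bool(oracle(question))
    remaining = deadline_s - (time.monotonic() - start)
    if remaining < 0:
        # The oracle overran the deadline; releasing the answer late would
        # itself leak timing information, so discard it.
        raise RuntimeError("oracle exceeded deadline")
    time.sleep(remaining)
    return bit
```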
Machine interpretability techniques must be deployed in a side-channel-resistant way. We can’t have the AI thinking about pegasi and unicorns in a Morse-code pattern, an intern reading it, and ten years later everyone is a pony.
There probably need to be multiple values of N for different timeframes: 1,000 adversarial bits in a minute is more psychologically dangerous than the same number of bits spread over a year.
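Multiple values of N over different timeframes are essentially nested rate limits on the same bit stream. A sliding-window sketch, where the per-window budgets are invented for illustration (the point above is precisely that safe values are unknown):

```python
import time
from collections import deque


class MultiWindowBitLimiter:
    """Enforce several (window_seconds, max_bits) caps simultaneously.

    The example budgets used by callers are placeholders; each window
    gets its own N, and a bit is released only if every window agrees.
    """

    def __init__(self, limits, clock=time.monotonic):
        self.limits = limits          # list of (window_seconds, max_bits)
        self.clock = clock            # injectable for testing
        self.events = deque()         # timestamps of previously released bits

    def try_release_bit(self) -> bool:
        now = self.clock()
        longest = max(window for window, _ in self.limits)
        # Drop events too old to matter for any window.
        while self.events and now - self.events[0] > longest:
            self.events.popleft()
        # Every window's cap must have headroom.
        for window, max_bits in self.limits:
            recent = sum(1 for t in self.events if now - t <= window)
            if recent >= max_bits:
                return False
        self.events.append(now)
        return True
```

With limits like `[(60, 2), (3600, 3)]`, a burst that is fine for the hour budget can still be refused by the minute budget, which is the asymmetry described above.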
Today, we don’t know good values for N, but we can spend the first few bits on establishing higher safe values of N. We can also use the Yudkowskian technique of volunteers who are killed or put into cryonic storage after being exposed to the bits.
If we could prove that AIs cannot acausally coordinate, we could increase the bound to N bits per AI, or per AI instance. Again, a good use for initial bits.
None of this stops us going extinct.