An AGI kill switch with defined security properties

This is a proposal for an AGI kill switch with definable security properties.

Combine symmetric fully homomorphic encryption (FHE) with indistinguishability obfuscation (iO), and use the combination to train and run inference for an AGI entirely under encryption.
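
To make "computing on ciphertexts" concrete, here is a toy sketch of an additively homomorphic scheme (textbook Paillier). This is an assumption-laden illustration, not the proposal's actual cryptosystem: the primes are deliberately tiny and insecure, and Paillier supports only addition on ciphertexts, whereas full FHE supports arbitrary circuits.

```python
import math
import random

def paillier_keygen(p=101, q=103):
    """Toy Paillier key pair from two small (insecure) primes."""
    n = p * q
    n2 = n * n
    g = n + 1
    lam = math.lcm(p - 1, q - 1)
    # mu = L(g^lam mod n^2)^-1 mod n, where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:          # r must be a unit mod n
        r = random.randrange(1, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    n2 = n * n
    return (pow(c, lam, n2) - 1) // n * mu % n

pub, priv = paillier_keygen()
c1, c2 = encrypt(pub, 5), encrypt(pub, 7)
# Multiplying ciphertexts adds the plaintexts -- no key needed.
c_sum = c1 * c2 % (pub[0] ** 2)
print(decrypt(priv, c_sum))  # 12
```

The operator running the AGI would play the role of the party holding only `pub`: it can drive the computation forward but learns nothing about the plaintexts without the private key.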

Take the encryption key and split it so that a quorum of m of n shards is required to submit clear-text data to, or receive it from, the FHE AGI.
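
The standard construction for this is Shamir's secret sharing: encode the key as the constant term of a random degree-(m-1) polynomial over a prime field, hand out evaluations as shards, and recover the key by Lagrange interpolation. A minimal sketch (field size and the 3-of-5 parameters are illustrative choices, not part of the proposal):

```python
import random

PRIME = 2**127 - 1  # a Mersenne prime; the field for all shard arithmetic

def split(secret, m, n):
    """Split `secret` into n shards, any m of which reconstruct it."""
    # Random polynomial of degree m-1 with constant term = secret.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(m - 1)]
    shards = []
    for x in range(1, n + 1):
        y = sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
        shards.append((x, y))
    return shards

def reconstruct(shards):
    """Lagrange-interpolate the shard polynomial at x=0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shards):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shards):
            if i != j:
                num = num * -xj % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

key = random.randrange(PRIME)          # stand-in for the FHE key
shards = split(key, m=3, n=5)
assert reconstruct(shards[:3]) == key  # any 3 of the 5 shards suffice
assert reconstruct(shards[2:]) == key
```

Any m-1 shards reveal information-theoretically nothing about the key, which is what makes shard destruction (below) an effective kill switch.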

Key reassembly and use take place inside a program that has been obfuscated with iO, so the clear-text key is never exposed.

Secretly distribute the key shards to diverse custodians, who monitor the AGI’s inputs and outputs for signs of bad behavior.

If bad behavior by the FHE AGI is detected, the custodians destroy enough shards to make a successful key quorum impossible. Ideally, this key destruction leaves the AGI unable to legibly understand or manipulate the world.
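
The kill-switch mechanic can be sketched at the protocol level, treating shards as opaque (in the real design they would be Shamir shards and reconstruction would happen inside the iO-obfuscated program). The class name and interface here are hypothetical, purely for illustration:

```python
class KillSwitch:
    """Custodian-held shards gate every decryption; destroying shards
    below the quorum threshold m disables clear-text I/O permanently."""

    def __init__(self, shards, m):
        self.shards = dict(shards)  # custodian id -> shard (opaque here)
        self.m = m

    def destroy(self, custodian_id):
        """A custodian irreversibly destroys their shard."""
        self.shards.pop(custodian_id, None)

    def quorum_available(self):
        return len(self.shards) >= self.m

switch = KillSwitch(shards={i: f"shard-{i}" for i in range(5)}, m=3)
assert switch.quorum_available()
for bad in (0, 1, 2):  # custodians detect bad behavior, destroy their shards
    switch.destroy(bad)
assert not switch.quorum_available()  # clear-text I/O is now impossible
```

Note that destruction only needs n - m + 1 honest custodians to succeed, which is the dual of needing m custodians to cooperate for normal operation.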

Limitations include:

  1. Someone breaking iO, stealing m of n key shards, and releasing the AGI.

  2. The AGI breaking the assumptions of the relevant cryptographic systems to escape.

  3. Knowing what intelligence threshold is required to break the cryptographic system and keeping the AI below that.

  4. The difficulty of predicting the consequences of the AGI’s decrypted real world behavior to screen for malfeasance.

  5. Extremely high computational requirements for FHE and iO, making the overall system economically noncompetitive with a normal (unencrypted) AGI.

  6. Deciding who the key shard guardians are and what properties they have.

  7. Deciding and recognizing what behaviors count as malignant.

  8. The bottleneck of running all FHE AGI queries and answers through a quorum of key guardians and malignancy filters.

  9. The tradeoff between the security of the system (more keys is better) and the usability/stability of the system (fewer keys is better).

I would love feedback. Is it impractical, impossible, not even wrong? What other limitations are there to the idea?

Can you nest different FHE systems with differing assumptions so that if the AI breaks out of the inner system it is still stuck in an outer system? Assuming the AI can break out of the inner system, would it be unable to tell if it was in the real world versus another outer FHE system? If the FHE AGI is unable to tell if it is in an encrypted world or the real world, wouldn’t a smart and sneaky AI want to discover vulnerabilities for all practical FHE systems before trying to escape?