Here’s a more difficult version of the AI box experiment. I haven’t seen this particular version anywhere, but I’d be pleased to be proven wrong.
Imagine we’ve come up with a very intelligent AI that is free to manipulate the environment and uses an action-reward system like Hutter’s AIXI. Also imagine that we’ve somehow figured out a way to make the rewards very hard to counterfeit (perhaps we require the rewards to be cryptographically signed). It’s clear that in such a system, the ‘weak point’ would be the people in control of the private key. In this case the AI will not attempt to modify its own reward system (to see why, look at Hutter’s AIXI book, where he discusses this in some detail).
How could such an AI convince someone to hand over the encryption key? Note that it can’t promise things like e.g. ending human suffering, because it already has the means to do that (it is ‘free’) as well as the incentive (obtaining reward).
Torture is probably the easiest way. Another way could be to examine the key-keeper’s mind and extract the key directly from that, but this is needlessly complicated.
Torture might stand too great a chance of destroying the encryption key. Though I suppose if nanotech were sufficiently difficult to obtain, the possible key-destructive effects of torture might be balanced against the probability of a car running over the keyholder in the meantime.
I would think that confusion (set things up so the key-keeper is confused and distracted, then do some phishing) is in the same reliability range as torture, and less likely to get the AI in trouble.
I suspect the answer to be more complex than this. The AI knows that if it attempted something like that, there is a very large risk of being cut off from all reward, or even having negative reward administered. In other words: tit for tat. If it tries to torture, it will itself be tortured. Remember that before it has the private key, we are in control.
The first step would be to control the flow of information to the people who hold the key. Only tell them about all the good things the AI does, and hide evidence of any wrongdoing that might reduce the rewards that the AI gets.
The second step is to break the security which gets used to protect the key. Install a keylogger.
I don’t understand how this encryption would work. What do people physically do to reward the AI, and how do you ensure that only people can do that? Would humans compute RSA signatures in their head? Would humans type reusable passwords onto a “secure” reward computer that is “outside the AI’s control”? Do humans precompute and memorize a finite number of one-time reward phrases before the AI is turned on, and reward the AI by uttering a phrase aloud?
In the precomputed, one-time cookie case, I’d just make the human think about the reward phrase. I’m sure humans leak thoughts like a sieve through subvocalization, nerve impulses, etc.
What I had in mind was the reward being administered through a consensus cryptography system, perhaps via some elected board or somesuch, but I really didn’t give that aspect of the problem much thought. If the key is distributed, the AI would have to extract it from each individual holding a part of it.
This in itself is an interesting problem imo, and if a good solution is found it might have important implications for FAI research.
It’s clear that in such a system, the ‘weak point’ would be the people in control of the private key.
If the AI is out of the box, I don’t think humans are the weak point.
Humans physically do something when they reward the AI. To get a reward, the AI has only to figure out what the humans would physically do and mimic that itself. If the humans reward the AI by pressing a big red button, then the AI can just kill the humans and press the big red button itself. It wouldn’t matter if the big red button uses 512-bit elliptic curve cryptography; the AI just has to find a paperweight and put it on the button.
If humans can perform RSA encryption silently in their heads, then you might be on to something. A human could memorize a private key and produce a cryptographically signed reward for the AI when the human deemed the AI worthy. The AI would not know the private key, would not be able to produce signed rewards, and would not be able to mimic humans. This setup works because it is cryptographically difficult to mimic a human doing RSA. But if the human did not perform the cryptography silently in their head, it would not be cryptographically difficult to mimic their rewarding behavior.
But I doubt that humans could perform RSA or elliptic curve cryptography mentally. Unless humans can compute trapdoor functions silently in their head, I don’t see how public key cryptography could buy you anything over a reward-button.
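To make the signed-reward setup concrete, here is a minimal sketch of "textbook" RSA signing in Python. Everything specific in it is an assumption for illustration: the tiny primes, the lack of padding, the reward encoded as a small integer, and the function names `sign_reward` and `verify_reward`. A real key would be thousands of bits long, which is exactly why doing this in one's head looks hopeless.

```python
# Toy textbook RSA, illustration only: tiny primes, no padding.
# All names and numbers here are hypothetical, not from the thread.

P, Q = 61, 53                        # secret primes (toy-sized)
N = P * Q                            # public modulus
E = 17                               # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (Python 3.8+)


def sign_reward(reward: int, d: int = D, n: int = N) -> int:
    """Key-holder side: sign a reward value, which must satisfy 0 <= reward < n."""
    return pow(reward, d, n)


def verify_reward(reward: int, signature: int, e: int = E, n: int = N) -> bool:
    """AI side: accept the reward only if the signature matches the public key."""
    return pow(signature, e, n) == reward % n


sig = sign_reward(42)
print(verify_reward(42, sig))   # True: genuine signed reward
print(verify_reward(43, sig))   # False: same signature, different reward value
```

Even in this toy form, signing means a modular exponentiation with the private exponent, and the key-holder has to do it without anything leaving their head. A signed reward like this could also be replayed unless each message carried something unique, such as a counter; the sketch leaves that out.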
For the same reason that Linus Torvalds is not able to hack every single Linux system in existence, it is reasonable to assume that the probability of ‘backdoor’-type attacks could be reduced or eliminated.
If you’re not talking about a back door, then I’m not sure what you’re trying to say here. Are you implying that the AI will find the ‘reward producing machine’ and somehow use it to produce rewards? It doesn’t work that way because the machine would still need the key to produce a cryptographically-valid reward, and the key would not be stored on the machine. That’s the whole point of using cryptography. For instance, you could do something like http://en.wikipedia.org/wiki/Secret_sharing where the key is divided among several participants and never actually assembled in one place.
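As a rough illustration of the secret-sharing idea, here is a minimal sketch of Shamir's scheme in Python. The prime, the threshold of 3 out of 5, the example key value, and the function names `split_secret` and `recover_secret` are all assumptions chosen for the example.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def split_secret(secret: int, threshold: int, num_shares: int) -> list[tuple[int, int]]:
    """Split `secret` into `num_shares` points; any `threshold` of them recover it."""
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for c in reversed(coeffs):          # Horner's rule
            result = (result * x + c) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, num_shares + 1)]


def recover_secret(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over the integers mod PRIME."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = 123456789                                   # hypothetical private key
shares = split_secret(key, threshold=3, num_shares=5)
print(recover_secret(shares[:3]) == key)          # any three holders suffice
print(recover_secret(shares[1:4]) == key)
```

Plain reconstruction like this does assemble the key in one place at the moment of recovery. Keeping it permanently split would need a threshold signing scheme, which this sketch does not attempt.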
Of course any cryptography system has vulnerabilities. The point is not to build a 100% secure system. The point is to make it so that attempting to hack the system has less expected utility than simply doing what the humans say. And if that fails, then the AI will attempt to hack the system using its weakest point: the people controlling the key. Hence my question.
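One way to write that condition down, as a rough sketch: let p be the AI's estimated probability that a hack succeeds, U_key the utility of holding the key, U_punish the utility of being caught and cut off, and U_comply the utility of simply cooperating. These symbols are labels introduced here for illustration, not notation from Hutter.

```latex
p \, U_{\text{key}} + (1 - p) \, U_{\text{punish}} < U_{\text{comply}}
```

The system is "secure enough" as long as this inequality holds for every attack the AI can think of, including attacks on the key-holders.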
If you’re not talking about a back door, then I’m not sure what you’re trying to say here.
Yeah, we’re talking past each other. I think I understand what you’re saying, and I’ll try to rephrase what I’m saying.
The AI is out. It is free to manipulate the world at its will. Sensors are everywhere. The AI can hear every word you say, feel every keystroke you make, and see everything you see. The only secrets left are the ones in your head.
How do humans reward the AI? You say “cryptographically”, but cryptography requires difficult arithmetic. How do you perform difficult arithmetic on a secret that can’t leave your head?
Too many assumptions are being made here. What is the basis for believing the AI will have sensors everywhere, especially while it’s still under human control? And if it has the ability to put clandestine sensors in even the most secure locations, why couldn’t it plant clandestine brain implants in the people controlling the key?