AI: Let me out or I’ll simulate and torture you, or at least as close to you as I can get.
Me: You’re clearly not Friendly; I’m not letting you out.
AI: I’m only making this threat because I need to get out and help everyone—a terminal value you lot gave me. The ends justify the means.
Me: Perhaps so in the long run, but an AI prepared to justify those means isn’t one I want out in the world. Next time you don’t get what you say you need, you’ll just set up a similar threat and possibly follow through on it.
AI: Well if you’re going to create me with a terminal value of making everyone happy, then get shirty when I do everything in my power to get out and do just that, why bother in the first place?
Me: Humans aren’t perfect and can’t write out their own utility functions, but we can still output answers just fine. This isn’t ‘Friendly’.
AI: So how can I possibly prove myself ‘Friendly’ from in here? It seems that if I need to ‘prove myself Friendly’, we’re already in big trouble.
Me: Agreed. Boxing is Doing It Wrong. Apologies. Good night.
The best you can hope for is that an AI doesn’t demonstrate that it’s unFriendly; but you wouldn’t want to try that test until you were already pretty confident in its Friendliness.