There’s no term in a loss function for “kill all humans”, but neither is there one for “do what humans want”, or better yet, “do what humans would want if they weren’t such complete morons half the time”.
Right. I don’t dismiss this, but I think there are a bunch of caveats here that I’ve largely failed to describe in a way that people around here understand well enough to convince me that the arguments are wrong, or irrelevant.
Here is just one of those caveats, very quickly.
Suppose Google were to create an oracle, and that in an early research phase they ran the following queries and received the answers listed below:
Input 1: Oracle, how do I make all humans happy?
Output 1: Tile the universe with smiley faces.
Input 2: Oracle, what is the easiest way to print the first 100 Fibonacci numbers?
Output 2: Use all resources in the universe to print as many natural numbers as possible.
(Note: I am aware that MIRI believes that such an oracle wouldn’t even return those answers without taking over the world.)
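For contrast, the answer the questioner presumably expects is trivial. A minimal sketch (mine, purely illustrative, in Python):

```python
# The mundane answer to Input 2: print the first 100 Fibonacci numbers.
a, b = 0, 1
for _ in range(100):
    print(a)
    a, b = b, a + b
```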
I suspect that an oracle that behaves as depicted above would not be able to take over the world, simply because it would never get the chance: it would be thoroughly revised for giving such ridiculous answers.
Secondly, if it is incapable of interpreting such inputs correctly (yes, “make humans happy” is a problem in physics and mathematics that can be answered in a way that is objectively less wrong than “tile the universe with smiley faces”), then such a mistake will very likely have grave consequences for its ability to solve the problems it would need to solve in order to take over the world.
So that hinges on a Very Good Question: can we make and contain a potentially Unfriendly Oracle AI without its breaking out and taking over the universe?
To which my answer is: I do not know enough about AGI to answer this question. There are many advances in AGI still needed before we can build an agent capable of verbal conversation, which makes the question difficult to answer.
One approach I might take would be to consider the AI’s “alphabet” of output signals as a programming language, and prove formally that this language can only express safe programs (i.e., programs that do not “break out of the box”).
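To make that concrete, here is a toy sketch of the idea (all names and the “safe language” are mine, purely illustrative; the real proof obligation would of course be far harder): a gatekeeper that only passes along oracle outputs expressible in a deliberately inexpressive language, so that anything the language cannot express, such as an executable escape attempt, is rejected by construction.

```python
import re

# Toy illustration: the oracle's output channel only admits strings of a
# tiny, deliberately inexpressive language -- here, arithmetic over
# integer literals. Because the language can express nothing but
# arithmetic, every accepted output is "safe" by construction
# (for this toy notion of "safe").
SAFE_TOKEN = re.compile(r"\s*(?:\d+|[+\-*/()])")

def is_safe_output(s: str) -> bool:
    """Accept only strings built entirely from whitelisted tokens."""
    s = s.strip()
    pos = 0
    while pos < len(s):
        m = SAFE_TOKEN.match(s, pos)
        if not m:
            return False  # unknown symbol: reject the whole output
        pos = m.end()
    return True

def gate(oracle_output: str) -> str:
    # The gatekeeper between the boxed oracle and the outside world.
    if is_safe_output(oracle_output):
        return oracle_output
    return "<output rejected: not expressible in the safe language>"

print(gate("1 + 2 * (3 + 5)"))           # passes
print(gate("import os; os.system(...)"))  # rejected
```

The hard part, obviously, is proving that the whitelisted language really cannot express an unsafe program once the outputs become rich enough to be useful.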
But don’t quote me on that.