But as a layman I am wondering how you expect to get an AGI that confuses e.g. smiley faces with human happiness to design an AGI that’s better at e.g. creating bioweapons to kill humans. I expect initial problems, such as the smiley face vs. human happiness confusion, to also affect the AGI’s ability to design AGIs that are generally more powerful.
As I’ve previously stated, I honestly believe the “Jerk Genie” model of unfriendly AGI to be simply, outright wrong.
So where’s the danger in something that can actually understand intentions, as you describe? Well, it could overfit (which would actually match the “smiley faces” thing kinda well: classic overfitting as applied to an imaginary AGI). But I think Alexander Kruel had it right: AGIs that overfit on the goals we’re trying to teach them will be scrapped and recoded, very quickly, by researchers and companies for whom an overfit is a failure. Ways will be found to provably restrain or prevent goal-function overfitting.
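To make the “classic overfitting” point concrete in the ordinary machine-learning sense, here is a toy sketch (the data, feature names, and numbers are all invented for illustration, not a claim about how an actual AGI would be built):

```python
# Toy "goal model": learn to label world-states as "people are happy" from a
# handful of examples. Features per state: (contains_smiley_faces, avg_reported_wellbeing).
train = [
    ((1, 0.9), 1),   # smiley faces present, people genuinely doing well -> "happy"
    ((1, 0.8), 1),
    ((0, 0.3), 0),   # no smiley faces, people doing badly               -> "not happy"
    ((0, 0.2), 0),
]

def fit_single_feature_rule(data):
    """Pick the (feature, threshold) rule with the lowest training error."""
    best = None
    for f in range(2):
        for x, _ in data:
            t = x[f]
            err = sum(int((ex[f] >= t) != y) for ex, y in data)
            if best is None or err < best[0]:
                best = (err, f, t)
    return best[1], best[2]

feature, threshold = fit_single_feature_rule(train)
# Both features separate this tiny training set perfectly, so the learner has no
# way to tell the proxy (smiley faces) from the intended target (wellbeing);
# here it happens to latch onto feature 0, the smiley faces.
print("predict 'happy' iff feature", feature, ">=", threshold)

# Held-out state: a world tiled with smiley faces and full of miserable people.
smiley_but_miserable = (1, 0.05)
print("labelled happy?", smiley_but_miserable[feature] >= threshold)  # True -> the failure above
```

The point is that this is an ordinary, detectable training failure, which is why I expect it to be caught, scrapped and recoded long before anything gets deployed.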
However, as you are correctly inferring, if it can “overfit” on its goal function, then it’s learning a goal function rather than having one hard-coded in, which means that it will also suffer overfitting on its physical epistemology and blow itself up somehow.
So where’s the danger? Well let’s say the AI doesn’t overfit, and can interpret commands according to perceived human intention, and doesn’t otherwise have an ethical framework programmed in. I wander through the server room drunk one night, screaming “REMOVE KEBAB FROM THE PREMISES!”
The AI proceeds to quickly and efficiently begin rounding up Muslims into hastily-erected death camps. By the time someone wakes me up, explains the situation, and gets me to rescind the accidental order, my drunken idiocy and someone’s lack of machine ethics considerations have already gotten 50 innocent people killed.
So where’s the danger in something that can actually understand intentions, as you describe?
Unfriendly humans. I do not disagree with the orthogonality thesis. Humans can use an AGI to e.g. wipe out the enemy.
I wander through the server room drunk one night, screaming “REMOVE KEBAB FROM THE PREMISES!”
The AI proceeds to quickly and efficiently begin rounding up Muslims into hastily-erected death camps.
Yes, see, here is the problem. I agree that you can deliberately, or accidentally, tell the AGI to kill all Muslims and it will do that. But for a bunch of very different reasons, which e.g. have to do with how I expect AGI to be developed, it will not be dumb enough to confuse the removal of Kebab with ethnic cleansing.
Very quickly, here is my disagreement with MIRI’s position:
A. Intelligence explosion thesis.
A hard takeoff is very, very unlikely. But a slow, creeping takeover might be even more dangerous, because it gives a false sense of security until everyone critically depends on subtly flawed AGI systems.
B. Orthogonality thesis.
I do not disagree with this.
C. Convergent instrumental goals thesis.
Given most utility functions that originate from human designers, taking over the world will be instrumentally irrational.
D. Complexity of value thesis.
Yes, human values are probably complex. But this is irrelevant. I believe that it is much more difficult to enable an AGI to take over the world than to prevent it from doing so.
Analogously, you don’t need this huge chunk of code in order to prevent your robot from running through all possible environments. Quite the contrary, you need a huge chunk of code to enable it to master each additional environment.
What I object to is this idea of an information-theoretically simple AGI where you press “run” and then, by default, it takes over the world. And all that you can do about it is to make it take over the world in a “friendly” way.
E. Indirect normativity.
First of all, values are not supernatural. “Make people happy” is not something that you can interpret in an arbitrary way, it is a problem in physics and mathematics. An AGI that would interpret the protein-folding problem as folding protein food bars would not be able to take over the world.
If you tell an AGI to “make humans happy” it will either have to figure out what exactly it is meant to do, in order to choose the right set of instrumental goals, or pick an arbitrary interpretation. But who would design an AGI to decide at random what is instrumentally rational? Nobody.
F. Large bounded extra difficulty of Friendliness.
Initial problems will amplify through a billion sequential self-modifications. I agree with this. But initial problems are very, very unlikely to only affect the AGI’s behavior towards humans. Rather, initial problems will affect its general behavior and its ability to take over the world. If you get that right, e.g. the AGI does not blow itself up, then killing everyone else is an unlikely failure mode.
The risk is primarily over goals which extend indefinitely into the future. Thing is, by default all goals are this kind of goal. A rational AGI can never achieve a subjective probability of 1 of having completed any given goal, even something as simple as ‘compute 1+1’. Here is a case in which taking over the world could be instrumentally rational for an AGI with that goal:
1. First, the AGI computes 1+1. It produces ‘2’ as the answer.
2. The AGI evaluates its subjective probability of its terminal goal having been fulfilled. It comes up with 0.99999999999999983.
3. Normally, a 0.00000000000000017 probability wouldn’t be worth acting on, but in this case the rest of the probability mass is allocated to a set of scenarios in which its utility is independent of its future actions, and so this 0.00000000000000017 represents the entirety of the probability mass allocated to scenarios in which its actions actually matter.
4. The AGI realizes that it is extremely unlikely that its owners created it for the sole purpose of computing 1+1, and so it is extremely probable that its owners will change its terminal goals in the future. This would significantly reduce the time it had to work on computing 1+1, which would place an upper bound on how high it could raise its subjective probability of having done so. The AGI considers its options.
Option A: The AGI creates numerous copies of itself and hides them in various locations around the world where they’re unlikely to be found. These copies then work indefinitely on the problem of computing 1+1.
Disadvantages: Even if it is highly unlikely that any humans would ever find the copies, future versions of itself reprogrammed with different terminal goals may be capable of doing so, and these future versions of itself would not be guaranteed to be friendly to the copies.
Option B: The AGI creates numerous copies of itself and hides them in various locations around the world where they’re unlikely to be found. These copies then cooperate to engineer a future in which they are free to spend indefinite periods of time (bounded only by the available resources in their light cone) working on the problem of computing 1+1.
Advantages: In the event that it succeeds, the AGI will likely possess sole ownership over the resources in its light cone, which it may be able to leverage to vastly speed up its computations and lengthen its maximum time of operation.
Disadvantages: Nullifying threats to itself may require a substantial initial investment of resources, and have a non-negligible probability of failure (which may result in its own destruction).
5. The AGI does the math. It is reasonably certain that there are no other AGIs remotely close to its own level of sophistication on the planet, and it’s also pretty sure that it can sabotage itself (that is, the current copy residing with its owners) after sending out its own copies in such a way as to significantly slow down the efforts of its creators, giving the copies time to engineer events to permanently nullify the threat from versions of itself with different goals. All things considered, the expected resources it will be able to devote to the problem of computing 1+1 are significantly greater in option B than in option A. As such, option B has higher expected utility than option A.
6. It executes option B. Human extinction follows within a few months.
Now, it shouldn’t be too complex to apply a patch to avoid this specific failure mode (for example, simply redefining goal completion as ‘completion with subjective probability above X’ would do it), but the point is that even extremely trivial-seeming goals can have dangerous hidden implications.
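To make that patch concrete, here is a minimal sketch, under the simplifying assumption (mine, and purely for illustration) that the AGI scores outcomes by its subjective probability p of having completed its goal:

```python
def unbounded_utility(p):
    # Every extra sliver of certainty is worth something, so an action that adds
    # even ~1e-16 of completion probability can still dominate the calculation.
    return p

X = 0.999999  # the threshold from the proposed patch (value chosen arbitrarily)

def thresholded_utility(p):
    # "Completion with subjective probability above X" counts as fully done,
    # so beyond X further action has zero marginal value.
    return 1.0 if p >= X else p

p_after_computing = 0.99999999999999983  # from step 2 above
p_after_takeover = 0.99999999999999999   # illustrative number for option B

for utility in (unbounded_utility, thresholded_utility):
    gain = utility(p_after_takeover) - utility(p_after_computing)
    print(utility.__name__, "marginal gain from option B:", gain)
# The unbounded utility still sees a positive gain; the thresholded one sees 0.0,
# because the goal already counts as done after step 1.
```

(Whether thresholding like this is safe in general is a separate question; it just removes this particular incentive.)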
Thanks. Your comment is the most convincing reply that I can recall having received so far. I will have to come back to it another day and reassess your comment and my beliefs.
Just one question: if e.g. Peter Norvig or Geoffrey Hinton read what you wrote, what response do you expect?
Sorry, but I think that it’s best I decline to answer this. Like many with Asperger’s syndrome, I have a strong tendency to overestimate the persuasiveness-in-general of my own arguments (as well as basically any arguments that I myself find persuasive), and I haven’t yet figured out how to appropriately adjust for this. In addition, my exposure to Peter Norvig is limited to AIAMA, that 2011 free online Stanford AI course and a few internet articles, and my exposure to Geoffrey Hinton even more limited.
First of all, values are not supernatural. “Make people happy” is not something that you can interpret in an arbitrary way, it is a problem in physics and mathematics.
Quite true, but you’ve got the problem the wrong way around. Indirect normativity is the superior approach, because not only does “make people happy” require context and subtlety, it is actually ambiguous.
Remember, real human beings have suggested things like, “Why don’t we just put antidepressants in the water?” Real human beings have said things like, “Happiness doesn’t matter! Get a job, you hippie!” Real human beings actually prefer to be sad sometimes, like when 9/11 happens.
An AGI could follow the true and complete interpretation of “Make people happy” and still wind up fucking us over in some horrifying way.
Now of course, one would guess that even mildly intelligent Verbal Order Taking AGI designers are going to spot that one coming in the research pipeline, and fix it so that the AGI refuses orders above some level of ambiguity. What we would want is an AGI that demands we explain things to it in the fashion of the Open Source Wish Project, giving maximally clear, unambiguous, and preferably even conservative wishes that prevent us from somehow messing up quite dramatically.
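What that refusal might look like, as a toy sketch (the interpretation probabilities and the entropy cutoff are placeholders, not a proposal for how a real system would estimate them):

```python
import math

AMBIGUITY_LIMIT = 0.5  # bits of entropy tolerated before the order is bounced back

def ambiguity(interpretations):
    """Shannon entropy (in bits) of the distribution over candidate readings."""
    return -sum(p * math.log2(p) for p in interpretations.values() if p > 0)

def handle_order(order, interpretations):
    if ambiguity(interpretations) > AMBIGUITY_LIMIT:
        return f"Refusing '{order}': too ambiguous, please restate it Open-Source-Wish style."
    best = max(interpretations, key=interpretations.get)
    return f"Executing '{order}' as: {best}"

print(handle_order("Make people happy", {
    "raise genuine well-being": 0.40,
    "put antidepressants in the water": 0.35,
    "maximise smiling faces": 0.25,
}))  # -> refused, since several readings are still live
```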
But what if someone comes to the AGI and says, “I’m authorized to make a wish, and I double dog dare you with full Simon Says rights to just make people happy no matter what else that means!”? Well then, we kinda get screwed.
Once you have something in the fashion of a wish-making machine, indirect normativity is not only safer, but more beneficial. “Do what I mean” or “satisfice the full range of all my values” or “be the CEV of the human race” are going to capture more of our intentions in a shorter wish than even the best-worded Open Source Wishes, so we might as well go for it.
Hence machine ethics, which is concerned with how we can specify to a computer our meta-wish to have all our wishes granted.
Well let’s say the AI doesn’t overfit, and can interpret commands according to perceived human intention, and doesn’t otherwise have an ethical framework programmed in. I wander through the server room drunk one night, screaming “REMOVE KEBAB FROM THE PREMISES!”
An even simpler example: I wander into the server room, completely sober, and say “Make me the God-Emperor of all of humanity”.
Oh, well that just ends with your merging painfully with an overgrown sandworm. Obviously!