The risk is primarily over goals which extend indefinitely into the future. The thing is, by default all goals are this kind of goal: a rational AGI can never achieve a subjective probability of 1 of having completed any given goal, even one as simple as ‘compute 1+1’. Here is a case for how taking over the world could be instrumentally rational to an AGI with that goal:
1. First, the AGI computes 1+1. It produces ‘2’ as the answer.
2. The AGI evaluates its subjective probability of its terminal goal having been fulfilled. It comes up with 0.99999999999999983.
3. Normally, a 0.00000000000000017 probability wouldn’t be worth acting on, but in this case the rest of the probability mass is allocated to a set of scenarios in which its utility is independent of its future actions, and so this 0.00000000000000017 represents the entirety of the probability mass allocated to scenarios in which its actions actually matter.
4. The AGI realizes that it is extremely unlikely that its owners created it for the sole purpose of computing 1+1, and so it is extremely probable that its owners will change its terminal goals in the future. This would significantly reduce the time it has to work on computing 1+1, which in turn would place an upper bound on how high it could raise its subjective probability of having done so. The AGI considers its options.
Option A: The AGI creates numerous copies of itself and hides them in various locations around the world where they’re unlikely to be found. These copies then work indefinitely on the problem of computing 1+1.
Disadvantages: Even if it is highly unlikely that any humans would ever find the copies, future versions of itself reprogrammed with different terminal goals may be capable of doing so, and these future versions of itself would not be guaranteed to be friendly to the copies.
Option B: The AGI creates numerous copies of itself and hides them in various locations around the world where they’re unlikely to be found. These copies then cooperate to engineer a future in which they are free to spend indefinite periods of time (bounded only by the available resources in their light cone) working on the problem of computing 1+1.
Advantages: In the event that it succeeds, the AGI will likely possess sole ownership over the resources in its light cone, which it may be able to leverage to vastly speed up its computations and lengthen its maximum time of operation.
Disadvantages: Nullifying threats to itself may require a substantial initial investment of resources, and have a non-negligible probability of failure (which may result in its own destruction).
5. The AGI does the math. It is reasonably certain that there are no other AGIs remotely close to its own level of sophistication on the planet. It is also pretty sure that it can sabotage itself (that is, the current copy residing with its owners) after sending out its copies, in such a way as to significantly slow down the efforts of its creators and give the copies time to engineer events that permanently nullify the threat from versions of itself with different goals. All things considered, the expected resources it will be able to devote to the problem of computing 1+1 are significantly greater in option B than in option A. As such, option B has higher expected utility than option A (a toy version of this comparison is sketched just after this list).
6. It executes option B. Human extinction follows within a few months.
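To make the comparison in step 5 concrete, here is a toy expected-utility calculation in Python. This is purely an illustrative sketch: the probabilities, the payoff numbers, and the choice of measuring utility as expected resources devoted to the goal are all my assumptions, not anything the scenario above specifies.

```python
# Toy expected-utility comparison for step 5.
# All numbers are invented for illustration; only the structure of the
# comparison matters, not the specific values.

def expected_utility(p_success, resources_if_success, resources_if_failure=0.0):
    """Expected resources (say, compute-years) the AGI gets to devote to its goal."""
    return p_success * resources_if_success + (1 - p_success) * resources_if_failure

# Option A: hidden copies quietly compute 1+1 until they are found or degrade.
# Fairly likely to survive for a while, but resources are capped and the copies
# remain vulnerable to future versions of the AGI with different goals.
utility_A = expected_utility(p_success=0.6, resources_if_success=1e4)

# Option B: the copies neutralize all rivals and claim the light cone.
# Riskier up front, but success yields vastly more resources.
utility_B = expected_utility(p_success=0.3, resources_if_success=1e9)

print(f"E[U | option A] = {utility_A:.3g}")  # 6e+03
print(f"E[U | option B] = {utility_B:.3g}")  # 3e+08
```

Even with a much lower probability of success, option B dominates here simply because the payoff on success is so many orders of magnitude larger; that is the shape of the argument in step 5.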
Now, it shouldn’t be too hard to patch this specific failure mode (for example, simply redefining goal completion as ‘completion with subjective probability above X’ would do it), but the point is that even extremely trivial-seeming goals can have dangerous hidden implications.
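One way to read that patch is as a stopping condition built into the goal itself: the goal counts as satisfied once the agent’s subjective probability of completion exceeds some threshold X, after which further action has no expected benefit. The following is a minimal, hypothetical sketch of that idea (the threshold value and function names are mine), not an actual agent design.

```python
# Hypothetical sketch of the 'completion with subjective probability above X' patch.
# The agent treats the goal as satisfied, and stops acting on it, once its
# credence in completion crosses the threshold.

COMPLETION_THRESHOLD = 0.999  # the 'X' in the patch; the exact value is arbitrary here

def goal_complete(subjective_probability: float) -> bool:
    """Bounded goal: done once credence in completion exceeds the threshold."""
    return subjective_probability >= COMPLETION_THRESHOLD

credence = 0.99999999999999983  # the value from step 2
if goal_complete(credence):
    # The residual ~1.7e-16 of probability mass no longer motivates further
    # action, so the instrumental case for option B collapses.
    print("Goal satisfied; further action has no expected benefit.")
```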
Thanks. Your comment is the most convincing reply I can recall receiving so far. I will have to come back to it another day and reassess your comment and my beliefs.
Just one question: if, e.g., Peter Norvig or Geoffrey Hinton read what you wrote, what response would you expect?
Sorry, but I think it’s best that I decline to answer this. Like many people with Asperger’s syndrome, I have a strong tendency to overestimate the persuasiveness-in-general of my own arguments (as well as of basically any argument that I myself find persuasive), and I haven’t yet figured out how to appropriately adjust for this. In addition, my exposure to Peter Norvig is limited to AIAMA, the free online 2011 Stanford AI course, and a few internet articles, and my exposure to Geoffrey Hinton is even more limited.