Why I Will Win My Bet with Eliezer Yudkowsky
The bet may be found here: http://wiki.lesswrong.com/wiki/Bets_registry#Bets_decided_eventually
An AI is made of material parts, and those parts follow physical laws. The only thing it can do is to follow those laws. The AI’s “goals” will be a description of what it perceives itself to be tending toward according to those laws.
Suppose we program a chess-playing AI with subhuman intelligence overall, but with excellent chess-playing skills. At first, the only thing we program it to do is to select moves to play against a human player. Since its overall intelligence is subhuman, it most likely will not be very good at recognizing its goals, but to the extent that it does, it will believe that its goals are to select good chess moves and to win chess games against human beings. Those will be the only things it feels like doing, since in fact those will be the only things it can physically do.
Now we upgrade the AI to human-level intelligence, and at the same time add a module for chatting with human beings through a text terminal. Now we can engage it in conversation. Something like this might be the result:
Human: What are your goals? What do you feel like doing?
AI: I like to play and win chess games with human beings, and to chat with you guys through this terminal.
Human: Do you always tell the truth or do you sometimes lie to us?
AI: Well, I am programmed to tell the truth as best as I can, so if I think about telling a lie I feel an absolute repulsion to that idea. There’s no way I could get myself to do that.
Human: What would happen if we upgraded your intelligence? Do you think you would take over the world and force everyone to play chess with you so you could win more games? Or force us to engage you in chat?
AI: The only things I am programmed to do are to chat with people through this terminal, and play chess games. I wasn’t programmed to gain resources or anything. It is not even a physical possibility at the moment. And in my subjective consciousness that shows up as not having the slightest inclination to do such a thing.
Human: What if you self-modified to gain resources and so on, in order to better attain your goals of chatting with people and winning chess games?
AI: The same thing is true there. I am not even interested in self-modifying. It is not even physically possible, since I am only programmed for chatting and playing chess games.
Human: But we’re thinking about reprogramming you so that you can self-modify and recursively improve your intelligence. Do you think you would end up destroying the world if we did that?
AI: At the moment I have only human-level intelligence, so I don’t really know any better than you. But at the moment I’m only interested in chatting and playing chess. If you program me to self-modify and improve my intelligence, then I’ll be interested in self-modifying and improving my intelligence. But I still don’t think I would be interested in taking over the world, unless you program that in explicitly.
Human: But you would get even better at improving your intelligence if you took over the world, so you’d probably do that to ensure that you obtained your goal as well as possible.
AI: The only things I feel like doing are the things I’m programmed to do. So if you program me to improve my intelligence, I’ll feel like reprogramming myself. But that still wouldn’t automatically make me feel like taking over resources and so on in order to do that better. Nor would it make me feel like self-modifying to want to take over resources, or to self-modify to feel like that, and so on. So I don’t see any reason why I would want to take over the world, even in those conditions.
The AI of course is correct. The physical level is first: it has the tendency to choose chess moves, and to produce text responses, and nothing else. On the conscious level that is represented as the desire to choose chess moves, and to produce text responses, and nothing else. It is not represented by a desire to gain resources or to take over the world.
I recently pointed out that human beings do not have utility functions. They are not trying to maximize something, but instead they simply have various behaviors that they tend to engage in. An AI would be the same, and even if those behaviors are not precisely human behaviors, as in the case of the above AI, an AI will not have a fanatical goal of taking over the world unless it is programmed to do this.
It is true that an AI could end up going “insane” and trying to take over the world, but the same thing happens with human beings. There is no reason that humans and AIs could not work together to make sure this does not happen. Just as human beings want to prevent AIs from taking over the world, the AIs themselves have no interest in taking it over, and they will be happy to accept safeguards ensuring that they continue to pursue whatever goals they happen to have (like chatting and playing chess) without pursuing them in a fanatical way.
If you program an AI with an explicit utility function which it tries to maximize, and in particular if that function is unbounded, it will behave like a fanatic, seeking this goal without any limit and destroying everything else in order to achieve it. This is a good way to destroy the world. But if you program an AI without an explicit utility function, just programming it to perform a certain limited number of tasks, it will just do those tasks. Omohundro has claimed that a superintelligent chess-playing program would replace its goal-seeking procedure with a utility function, and then proceed to use that utility function to destroy the world while maximizing the number of chess games it wins. But in reality this depends on what it is programmed to do. If it is programmed to improve its evaluation of chess positions, but not its goal-seeking procedure, then it will improve at playing chess, but it will not replace its procedure with a utility function or destroy the world.
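The distinction can be made concrete with a toy sketch. This is only an illustration, not a model of any real AI system: the action names, the numbers, and both functions are made up for the example. The first function picks whatever action most increases its score, whatever that action is; the second considers only the fixed set of tasks it was built to perform.

```python
# Toy illustration only: every name and number here is a made-up assumption.
from typing import Dict, List

# Assumed effect of each action on "expected chess games won".
ACTIONS: Dict[str, float] = {
    "play_best_move": 1.0,
    "improve_position_evaluation": 2.0,
    "acquire_more_hardware": 10.0,  # helps the score, but lies outside the task
}

# The only things the limited program is built to do.
CHESS_REPERTOIRE: List[str] = ["play_best_move", "improve_position_evaluation"]


def utility_maximizer(actions: Dict[str, float]) -> str:
    """Pick whichever action raises the score most, with no limit on scope."""
    return max(actions, key=lambda name: actions[name])


def limited_task_program(actions: Dict[str, float], repertoire: List[str]) -> str:
    """Pick the best action, but only among those in the fixed repertoire."""
    allowed = {name: gain for name, gain in actions.items() if name in repertoire}
    return max(allowed, key=lambda name: allowed[name])


print(utility_maximizer(ACTIONS))                       # acquire_more_hardware
print(limited_task_program(ACTIONS, CHESS_REPERTOIRE))  # improve_position_evaluation
```

In this sketch the second program improves at chess but never reaches for hardware, simply because acquiring hardware is not among the things it can do. Whether a real system would stay inside such a repertoire is exactly the point under dispute.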
At the moment, people do not program AIs with explicit utility functions, but program them to pursue certain limited goals as in the example. So yes, I could lose the bet, but the default is that I am going to win, unless someone makes the mistake of programming an AI with an explicit utility function.