1: Imagine a utility function as a function that takes as input a description of the world in some standard format, and outputs a “goodness rating” between 0 and 100. The AI can then take actions that it predicts will make the world have a higher goodness rating.
Lots of utility functions are possible. Suppose there’s one possible future where I get cake, and one where I get pie. I have a very strong opinion on these futures’ goodness, and I will take actions that I predict will make the world more likely to turn out pie. But this is not a priori necessary—we could define a utility function that swaps the goodness ratings of cake and pie, and an AI using that utility function would take actions that it predicts will lead to worlds with higher goodness rating, i.e. cake. There is no objective standard that it could use to realize that pie is better—it is merely a computer program that makes predictions and then picks the action that it predicts maximizes some function.
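The cake/pie setup above can be sketched in a few lines of code. This is a toy illustration with hypothetical names (`pie_utility`, `cake_utility`, `choose_action`, the `predicted_outcome` table), not any real AI architecture: two agents that run the identical action-selection procedure but maximize different goodness ratings.

```python
def pie_utility(world: str) -> int:
    """Goodness rating between 0 and 100; prefers pie."""
    return {"cake": 10, "pie": 90}.get(world, 50)

def cake_utility(world: str) -> int:
    """Same ratings with cake and pie swapped."""
    return {"cake": 90, "pie": 10}.get(world, 50)

# The agent's model: which world-state each available action
# is predicted to lead to.
predicted_outcome = {"bake_pie": "pie", "bake_cake": "cake"}

def choose_action(utility):
    """Pick the action whose predicted outcome scores highest."""
    return max(predicted_outcome, key=lambda a: utility(predicted_outcome[a]))

print(choose_action(pie_utility))   # bake_pie
print(choose_action(cake_utility))  # bake_cake
```

Note that `choose_action` is the same code in both cases and executes flawlessly for both; nothing inside it could "check its work" and discover that one utility function is better than the other.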
Utopias are like cake and pie. If I give the pie utopia a higher goodness rating, and the AI gives the cake utopia a higher goodness rating, it’s not “wrong” in the sense of being able to check its work and find a mistake. The AI can prefer the cake utopia even while operating perfectly.
This is what happens in the case of Failed Utopia 4-2. The AI has some preferences about the world. And those preferences are very close to, but not quite, human preferences. And so the main character ends up in the cake utopia. Even if the AI does a lot more research and checks its reasoning carefully, it is not a priori necessary that it should realize the error of its ways and make the world a pie utopia instead. It’s wrong(2), but not wrong(1).
Similar problems show up when you try to make any sort of AI that just “does what humans want.” Eventually, somewhere, you have to turn this vague verbal statement into a precise specification (like the code of the AI), which is used to compute something like a goodness rating. And it turns out that when you actually try to do this, it’s pretty tricky to make the AI’s goodness ratings similar to a human’s goodness ratings. Basically every easy way has some critical flaw, and the ways that seem promising are not very easy.
So sure, we want to make an AI that just does what humans want (sort of). But this is like “make an AI that recognizes pictures of cats”—an admirable goal, but a nontrivial one. And one that might have bad consequences even if only slightly wrong.