What does this robot “actually want”, given that the world is not really a 2D grid of cells that have intrinsic color?
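For concreteness, here is a minimal sketch of the kind of robot under discussion. This is my own illustration rather than code from the original post, and the names (`utility`, `act`, the color strings) are invented for the example; the point is just that the robot's goal is defined directly over a 2D grid of intrinsically colored cells, so asking what it "wants" outside that ontology has no built-in answer.

```python
from typing import List

Grid = List[List[str]]  # each cell holds a color name, e.g. "blue" or "red"

def utility(grid: Grid) -> int:
    """What the robot maximizes: the number of cells that are not blue."""
    return sum(cell != "blue" for row in grid for cell in row)

def act(grid: Grid) -> Grid:
    """One step of behavior: zap (recolor) the first blue cell it finds."""
    new_grid = [row[:] for row in grid]
    for i, row in enumerate(new_grid):
        for j, cell in enumerate(row):
            if cell == "blue":
                new_grid[i][j] = "gray"
                return new_grid
    return new_grid

world = [["red", "blue"], ["blue", "green"]]
print(utility(world))       # 2
print(utility(act(world)))  # 3 -- zapping a blue cell raises this utility
# But `utility` is only defined for Grid values; if the real world is not a
# 2D grid of intrinsically colored cells, nothing in the robot says what it
# "actually wants" there.
```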
Who cares about the question of what the robot “actually wants”? Certainly not the robot. Humans care about the question of what they “actually want”, but that’s because they have additional structure that this robot lacks. And with humans, you’re not limited to just looking at what they do on auto-pilot; you can ask that additional structure directly when you run into problems like this. For example, if you asked me what I really wanted under some weird ontology change, I could say, “I have some guesses, but I don’t really know; I would like to defer to a smarter version of me”. That’s how I understand preference extrapolation: not as something that looks at what your behavior suggests that you’re trying to do and then does it better, but as something that poses the question of what you want to some system you’d like to answer the question for you.
It looks to me like there’s a mistaken tendency among many people here, including some very smart people, to say that I’d be irrational to let my stated preferences deviate from my revealed preferences; that just because I seem to be trying to do something (in some sense like: when my behavior isn’t being controlled much by the output of moral philosophy, I can be modeled as a relatively good fit to a robot with some particular utility function), that’s a reason for me to do it even if I decide that I don’t want to. But rational utility maximizers get to be indifferent to whatever the heck they want, including their own preferences, so it’s hard for me to see why the underdeterminedness of the true preferences of robots like this should bother me at all.
Insert standard low confidence about me posting claims on complicated topics that others seem to disagree with.
In other words, our “actual values” come from our being philosophers, not our being consequentialists.
It seems plausible to me, and I’m not sure that “many” others do disagree with you.
That would imply a great diversity of value systems, because philosophical intuitions differ much more from person to person than primitive desires. Some of these value systems (maybe including yours) would be simple, some wouldn’t. For example, my “philosophical” values seem to give large weight to my “primitive” values.
preference extrapolation: not as something that looks at what your behavior suggests that you’re trying to do and then does it better, but as something that poses the question of what you want to some system you’d like to answer the question for you

That might be a procedure that generates human preference, but it is not a general preference extrapolation procedure. E.g., suppose we replace Wei Dai’s simple consequentialist robot with a robot that has similar behavior, but that also responds to the question, “What system do you want to answer the question of what you want for you?” with the answer, “A version of myself better able to answer that question. Maybe it should be smarter and know more things and be nicer to strangers and not have scope insensitivity and be less prone to skipping over invisible moral frameworks and have concepts that are better defined over attribute space and be automatically strategic and super committed and stuff like that? But since I’m not that smart and I pass over moral frameworks and stuff, everything I just said is probably insufficient to specify the right thing. Maybe you can look at my source code and figure out what I mean by right and then do the thing that a person who better understood that would do?” And then goes right back to zapping blue.
Suppose we replace Wei Dai’s simple consequentialist robot with a robot that has similar behavior, but that also responds to the question, “What system do you want to answer the question of what you want for you?” with the answer, “I want to decide for myself” and responds to the question, “What do you want to do?” with the answer, “I want to make babies happy. Oh, and help grandmother out of the burning building. Oh, and without killing her. Oh, and to preserve complex novelty. Oh, and boredom. Oh, and there should still be people in the world who are trying to improve it. Oh, and... dammit, this is complicated. Okay, never mind, I want you to ask the version of myself who I presently think is smart enough to answer this question and who knows what the right thing to do is even better than me.”
It can answer those two questions, but if you ask it to clarify the last response, it just blows up.