Because this strikes me as a nightmare scenario. Besides, we’re relying on the models to self-report total happiness. Leaving it on an unbounded scale creates incentives for abuse.
Asking people how much utility they have won’t give you a utility function because, for one thing, humans don’t have preferences that are consistent with a utility function.
The question would be more like ‘assuming you understand standard deviation units, how satisfied with your life are you right now, in standard deviation units, relative to the average?’ Happy, satisfied people give the machine more utility.
Utilities are determined up to an additive constant and a positive multiplicative constant, so there is no canonical way of comparing utilities between people, so there is no canonical way of averaging utilities.
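To make the rescaling point concrete, here is a minimal sketch. The two people, the options X and Y, and all the numbers are invented for illustration. Applying a positive affine transform to one person's utilities leaves that person's own ranking untouched, yet it flips which option has the higher average.

```python
def best_option(utilities):
    """Return the option with the highest average utility across people."""
    options = list(next(iter(utilities.values())).keys())
    n = len(utilities)
    return max(options, key=lambda o: sum(u[o] for u in utilities.values()) / n)


def rescale(u, a, b):
    """Positive affine transform a*u + b (a > 0): same preferences, new numbers."""
    if a <= 0:
        raise ValueError("a must be positive to preserve the preference ordering")
    return {o: a * v + b for o, v in u.items()}


# Hypothetical utilities, chosen only to illustrate the point.
alice = {"X": 1.0, "Y": 0.0}
bob = {"X": 0.0, "Y": 0.8}
bob_rescaled = rescale(bob, 10.0, 5.0)  # still prefers Y to X

before = best_option({"alice": alice, "bob": bob})
after = best_option({"alice": alice, "bob": bob_rescaled})
print(before, after)
```

Both versions of Bob describe the same preferences, so nothing in the utilities themselves says which average is the right one.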
Okay, but that doesn’t mean you can’t build a machine that maximizes the number of happy people, under these conditions. Calling it utility is just shorthand.
I need to go to class right now, but I’ll get into population changes when I get home this evening.
Presumably, the reflective consistency criterion would be something along the lines of ‘hey, model, here’s this other model—does he seem like a valid continuation of you?’ No value judgments involved.
EDIT:
Okay, here’s how you handle agents being created or destroyed in your predicted future. For agents that die, you feed that fact back into the original state of the model, and allow it to determine utility for that state. So, if you want to commit suicide, that’s fine—dying becomes positive utility for the machine.
Creating people is a little more problematic. If new people’s utility is naively added, well, that’s bad, because then the fastest way for the machine to maximize its utility function is to kill the whole human race, and then start building resource-cheap barely-sapient happy monsters that report maximum happiness all the time. So you need to add a necessary-but-not-sufficient condition that any action taken has to maximize both the utility of all foreseeable minds, AND the utility of all minds currently alive. That means that happy monsters are no good (insofar as they eat resources that we’ll eventually need), and it means that Dr. Evil won’t be allowed to make billions of clones of himself and take over the world. This should also eliminate repugnant conclusion scenarios.
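A rough sketch of that condition, under one possible reading: treat "best for the minds currently alive" as a filter, then pick the highest total over all foreseeable minds among whatever passes. The `Action` type, the `choose` function, and every number below are hypothetical and only there to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    alive_utility: float  # predicted utility summed over minds alive now
    total_utility: float  # predicted utility summed over all foreseeable minds


def choose(actions, tol=1e-9):
    """Keep only actions that are best for current minds, then maximize the total."""
    best_alive = max(a.alive_utility for a in actions)
    allowed = [a for a in actions if a.alive_utility >= best_alive - tol]
    return max(allowed, key=lambda a: a.total_utility).name


# Made-up predictions for three candidate actions.
actions = [
    Action("status_quo", 10.0, 10.0),
    Action("happy_monsters", -100.0, 1000.0),  # current humans are wiped out
    Action("improve_lives", 15.0, 20.0),
]

naive = max(actions, key=lambda a: a.total_utility).name  # naive summing
chosen = choose(actions)
print(naive, chosen)
```

With these numbers, naive summing picks the happy monsters, while the filtered version does not. A stricter reading, where an action must top both lists at once, could leave no admissible action at all.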
Presumably, the reflective consistency criterion would be something along the lines of ‘hey, model, here’s this other model—does he seem like a valid continuation of you?’ No value judgments involved.
So this looks like the crucial part of your proposal. By what criteria should an agent judge another agent to be a “valid continuation” of it? That is, what do you mean by “valid continuation”? What kinds of judgments do you want these models to make?
There are a few very different ways you could go here. For the purpose of illustration, consider this: If I can veto a wireheaded version of me because I know that I don’t want to be wireheaded, then it stands to reason that a racist person can veto a non-racist version of themselves because they know they don’t want to be racist. So the values that the future model holds cannot be a criterion in our judgment of whether the future model is a “valid continuation”. What criteria, then, can we use? Maybe we are to judge an agent a “valid continuation” if they are similar to us in core personality traits. But surely we expect long-lived people to have evolving core personality traits. The Nisan of 200 years from now would be very different from me.
Like I said, that part is tricky to formalize. But, ultimately, it’s an individual choice on the part of the model (and, indirectly, the agent being modeled). I can’t formalize what counts as a valid continuation today, let alone in all future societies. So, leave it up to the agents in question.
As for the racism thing: yeah, so? You would rather we encode our own morality into our machine, so that it will ignore aspects of people’s personality we don’t like? I suppose you could insist that the models behave as though they had access to the entire factual database of the AI (so, at least, they couldn’t be racist simply out of factual inaccuracy), but that might be tricky to implement.
I can’t formalize what counts as a valid continuation today, let alone in all future societies. So, leave it up to the agents in question.
I think you use the words “valid continuation” to refer to a confused concept. That’s why it seems hard to formalize. There is no English sentence that successfully refers to the concept of valid continuation, because it is a confused concept.
If you propose to literally ask models “is this a valid continuation of you?” and simulate them sitting in a room with the future model, then you’ve got to think about how the models will react to those almost-meaningless words. You might as well ask them “is this a wakalix?”.
Which scenario are you affirming? I’m trying to understand your intention here. Would a racist get to veto a nonracist future version of themselves?