Suppose an AGI sovereign models the preferences of its citizens using the assumption of normative reductionism. Then it might cover up its past evil actions because it reasons that once all evidence of them is gone, they cannot have an adverse effect on present utility.
Well, yes. Once all evidence (including any impact or detectable difference in the state of the universe) is gone, it CANNOT have a further adverse effect on utility.
I would prefer someone not completely lie to me about the world, even if they’re confident I won’t ever find out.
Your satisfaction of that preference has nothing to do with their confidence, it’s all about whether you actually find out. You could get into philosophy about what “true” even means for something you have no evidence for or against, but that’s not necessary to talk about the impact on your utility. Without some perceptible difference, your utility cannot be different.
I think you are assuming that “utility” means something like “happiness”. That is not the only possible way to use the word.
If there is a term in my utility function (to whatever extent I have a utility function) for accurate knowledge, then there can be situations indistinguishable to me to which I assign different utility, because I may be unable to tell whether some bit of my “knowledge” is actually accurate or not.
I think maybe you think there is something impossible or incoherent about this, perhaps on the grounds that it’s absurd to say you care about the difference between X and Y when you cannot actually discern the difference between X and Y. I disagree. If you tell me that you are either going to shoot me in the head or shoot me in the head and then murder a million other people, I prefer the former even though, being dead, I will be unable to tell whether you’ve murdered the million others or not. If you tell me that you will either slap me in the face and then shoot me dead, or else shoot me dead and then murder a million others, and if I believe you, then I will gladly take that slap in the face. If you tell me that you will either slap me in the face, convince me that you aren’t going to murder anyone else, kill me, and then murder a million others, or else just kill me and the million others, I will not take the slap in the face even if I am confident that you could convince me. (Er, unless I think that the time you take convincing me makes it more likely that somehow you never actually get to murder me.)
My utility function (to whatever extent I have a utility function) maps world-states to utilities, not my-experience-states to utilities. There is of course another function that maps my-experience states to utilities, or maybe to something like probability distributions over utilities (it goes: experience-state → my beliefs about the state of the world → my estimate of my utility function), but it isn’t the same function and it isn’t what I care about even if in some sense it’s necessarily what I act on: if you propose to change the world-state and the experience-state in ways that don’t match, then my preferences track what you propose to do to the world-state, not the experience-state.
(Of course my experiences are among the things I care about, and I care about some of them a lot. If you threaten to make me wrongly think you have murdered my family then that’s a very negative outcome for me and I will try hard to prevent it. But if I have to choose between that and having my family actually murdered, I pick the former.)
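To make the two-function picture concrete, here is a minimal sketch in Python; the world-states, probabilities, and numbers are invented purely for illustration, not a claim about anyone’s actual utility function:

```python
# A minimal, purely illustrative sketch (the world-states, probabilities, and
# numbers are all invented for this example).

# The utility function proper: world-state -> utility.
def utility(world_state):
    # I care about whether the million people are actually murdered,
    # not just about what I experience before dying.
    return {
        "shot": -1.0,
        "shot_then_million_murdered": -1_000_001.0,
    }[world_state]

# The other function: experience-state -> estimated utility.
# It goes experience -> beliefs about the world -> expected utility.
def estimated_utility(experience, beliefs_given_experience):
    beliefs = beliefs_given_experience[experience]  # P(world-state | experience)
    return sum(p * utility(w) for w, p in beliefs.items())

# The two world-states are indistinguishable from the inside: either way,
# all I ever experience is being shot.
beliefs_given_experience = {
    "experience_of_being_shot": {"shot": 0.5, "shot_then_million_murdered": 0.5},
}

# The experience-based estimate is a single number...
print(estimated_utility("experience_of_being_shot", beliefs_given_experience))
# ...but the utilities of the two world-states themselves differ enormously.
print(utility("shot"), utility("shot_then_million_murdered"))
```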
Without some perceptible difference, your utility cannot be different.
This definition of “utility” (and your definition of “preference”) is different from the one that most LWers use, different from the one that economists use, and different from the one that (at least some) professional philosophers use.
Economists use it to denote any preference ordering over worlds, and don’t require it to be defined only over your own experiences. Some ethical theories in philosophy (e.g. hedonistic utilitarianism) define it as a direct function of your experiences, but others (e.g. preference utilitarianism) define it as something that can be affected by things you don’t know about. As evidence for the latter, this SEP page states:
If a person desires or prefers to have true friends and true accomplishments and not to be deluded, then hooking this person up to the experience machine need not maximize desire satisfaction. Utilitarians who adopt this theory of value can then claim that an agent morally ought to do an act if and only if that act maximizes desire satisfaction or preference fulfillment (that is, the degree to which the act achieves whatever is desired or preferred). What maximizes desire satisfaction or preference fulfillment need not maximize sensations of pleasure when what is desired or preferred is not a sensation of pleasure. This position is usually described as preference utilitarianism.
If you’re a hedonistic utilitarian, feel free to argue for hedonistic utilitarianism, but do that directly instead of making claims about what other people are or aren’t allowed to have preferences about.
I will admit that I find the concept of preferences over indistinguishable / imaginary universes or differences in hypothetical universes to be incoherent. One can have a preference for invisible pink unicorns, but that preference is neither more nor less satisfied by any actual-world time segment.
If you have a pointer to any literature about utility impact of irrelevant preferences, I’d like to take a look. All I’ve seen in the past is about how preferences irrelevant to a decision should not impact an aggregation result.
Does it help if you don’t think about a ‘preference’ as something ontologically fundamental, but just as a convenient shorthand for something that an agent is optimising for? It’s certainly possible for an agent to optimise for something even if they’ll never receive any evidence of whether they succeeded. gjm gives a few examples in the sibling-comment to mine.
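As a toy illustration of that reading of ‘preference’ (the actions, the vault, and the probabilities are all invented for the example):

```python
# A toy agent that chooses an action by its modelled effect on something it
# will never observe. All names and numbers here are illustrative assumptions.

# The agent's model: probability that a sealed, never-to-be-opened vault ends
# up containing a backup of some valued records, given each available action.
model = {
    "store_backup": 0.99,
    "do_nothing": 0.01,
}

def choose_action(model):
    # "Preferring" that the backup exist just means this is what gets optimised,
    # even though no evidence about the vault's contents will ever arrive.
    return max(model, key=model.get)

print(choose_action(model))  # -> "store_backup"
```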
That’s roughly how I think of preferences. It’s absolutely possible (and, in fact, common) for humans to make choices based on things that have no perceptible existence. It’s harmless (but silly (note: I _LIKE_ silly, in part because it’s silly to do so)) to have such preferences, and usually harmless to act on them.
In the context of the OP, and world-value comparisons across distinguishable segments of universes, there is simply no impact from unrealized/undetectable preferences across those universe-segments that don’t contain any variation on that preference.
It’s harmless (but silly (note: I LIKE silly, in part because it’s silly to do so)) to have such preferences, and usually harmless to act on them.
I don’t really understand why preferences about things that you can’t observe are more silly than other preferences, but that’s ok. I mostly wanted to clear up the terminology, and note that it seems more like common usage of ‘preference’ and ‘utility’ to say “That’s a silly preference to have, because X, Y, Z” and “I think we should only care about things that can affect us” instead of saying “Your satisfaction of that preference has nothing to do with their confidence, it’s all about whether you actually find out” and “Without some perceptible difference, your utility cannot be different”.
This assumption can’t capture a preference that one’s beliefs about the past are true.
Of course it can—the value of that preference is determined by what (counter)evidence is discovered when.