It seems important to acknowledge that there’s a version of the Bomb argument that actually works, at least if we want to apply UDT to humans as opposed to AIs, and this may be part of what’s driving Will’s intuitions. (I’ll use “UDT” here because that’s what I’m more familiar with, but presumably everything transfers to FDT.)
First, there’s an ambiguity in Bomb as written: what does my simulation see? Does it see a bomb in Left, or no bomb? Suppose the setup is that the simulation sees no bomb in Left. In that case, since obviously I should take Left when there’s no bomb in it (and that’s what my simulation would do), if I am seeing a bomb in Left it must mean I’m in the 1 in a trillion trillion situation where the predictor made a mistake, and therefore I should (intuitively) take Right. UDT also says I should take Right, so there’s no problem here.
Now suppose the simulation is set up to see a bomb in Left. In that case, when I see a bomb in Left, I don’t know if I’m a simulation or a real person. If I were selfish in an indexical way, I would think something like: “If I’m a simulation then it doesn’t matter what I choose. The simulation will end as soon as I make a choice, so my choice is inconsequential. But if I’m a real person, choosing Left will cause me to be burned. So I should choose Right.” The thing is, UDT is incompatible with this kind of selfish values, because UDT takes a utility function that is defined over possible histories of the world, not over possible centered histories of the world (i.e., histories with an additional pointer that says this is “me”). UDT essentially forces an agent to be altruistic toward its copies, and is therefore unable to give the intuitively correct answer in this case.
If we’re doing decision theory for humans, then the incompatibility with this kind of selfish values would be a problem: humans plausibly do have such selfish values as part of our complex values, and whatever decision theory we use should perhaps be able to handle them. However, if we’re building an AI, it doesn’t seem to make sense to give it selfish values (i.e., a utility function over centered rather than uncentered histories), so UDT seems fine (at least as far as this issue is concerned) for thinking about how AIs should ideally make decisions.
Now suppose the simulation is set up to see a bomb in Left. In that case, when I see a bomb in Left, I don’t know if I’m a simulation or a real person. If I were selfish in an indexical way, I would think something like: “If I’m a simulation then it doesn’t matter what I choose. The simulation will end as soon as I make a choice, so my choice is inconsequential. But if I’m a real person, choosing Left will cause me to be burned. So I should choose Right.”
It seems to me that even in this example, a person (who is selfish in an indexical way) would prefer, before opening their eyes, to make a binding commitment to choose Left. If so, the “intuitively correct answer” that UDT is unable to give is actually just the result of a failure to make a beneficial binding commitment.
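That ex ante preference can be made concrete with a rough expected-cost sketch. The numbers below, including the finite dollar cost placed on being burned, are illustrative assumptions rather than part of the scenario as stated:

```python
# Illustrative expected-cost comparison for the Bomb scenario, from the
# ex ante (before opening your eyes) perspective. The burn cost is a
# stipulated finite number, not something the original scenario supplies.
P_ERR = 1e-24       # predictor's error rate ("1 in a trillion trillion")
COST_RIGHT = 100    # taking Right always costs $100
COST_BURN = 1e9     # assumed dollar cost of being burned

# Policy "always take Left": the predictor almost always predicts Left
# and leaves Left empty; only on a prediction error is there a bomb.
ev_left = P_ERR * COST_BURN

# Policy "always take Right": you pay $100 regardless of the prediction.
ev_right = COST_RIGHT

print(ev_left)   # on the order of 1e-15 dollars
print(ev_right)  # 100 dollars
```

On these assumptions, committing to Left is cheaper ex ante by many orders of magnitude, which is why the binding commitment looks attractive before opening one’s eyes; the dispute in the thread is about what to do after the bomb is seen.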
That’s true, but they could say, “Well, given that no binding commitment was in fact made, and given my indexically selfish values, it’s rational for me to choose Right.” And I’m not sure how to reply to that, unless we can show that such indexically selfish values are wrong somehow.
I agree. It seems that in that situation the person would be “rational” to choose Right.
I’m still confused about the “UDT is incompatible with this kind of selfish values” part. It seems that an indexically selfish person, after failing to make a binding commitment and seeing the bomb, could still rationally commit to UDT from that moment on, by defining the utility function such that only copies that found themselves in that situation (i.e., those who failed to make a binding commitment and saw the bomb) matter. That utility is a function over uncentered histories of the world, and would result in UDT choosing Right.
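As a sketch of the proposed redefinition (all names and the toy numbers here are invented for illustration): sum the centered utility only over the copies that satisfy the relevant condition, yielding a function of uncentered histories alone:

```python
# Sketch: build an uncentered utility that counts only copies satisfying
# some condition (e.g. "failed to commit and saw the bomb"). The toy
# history format and the stipulated condition are illustrative only.

def restricted_uncentered(u, agent_indices, in_situation):
    """Return u'(h) = sum of u(h, i) over copies i with in_situation(h, i)."""
    def u_prime(history):
        return sum(u(history, i)
                   for i in agent_indices
                   if in_situation(history, i))
    return u_prime

# Toy example: a history is a tuple of per-copy payoffs.
def u(history, i):
    # The centered agent cares only about the copy the pointer picks out.
    return history[i]

def saw_bomb(history, i):
    return i == 1   # stipulate: only copy 1 is in the relevant situation

u_prime = restricted_uncentered(u, [0, 1], saw_bomb)
print(u_prime((10, -3)))   # -3: only copy 1's payoff counts
```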
I don’t see anything wrong with what you’re saying, but if you did that you’d end up not being an indexically selfish person anymore. You’d be selfish in a different, perhaps alien or counterintuitive way. So you might be reluctant to make that kind of commitment until you’ve thought about it for a much longer time, and UDT isn’t compatible with your values in the meantime. Also, without futuristic self-modification technologies, you are probably not able to make such a commitment truly binding even if you wanted to and tried.
Some tangentially related thoughts:

It seems that in many simple worlds (such as the Bomb world), an indexically selfish agent with a utility function u over centered histories would prefer to commit to UDT with a utility function u′ over uncentered histories, where u′ is defined as the sum of all the “uncentered versions” of u (version i corresponds to u when the pointer is assumed to point to agent i).
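That construction can be sketched directly (the types, names, and toy utility below are invented for illustration): a centered utility takes a pair of a history and an agent index, and u′ sums over every position the “me” pointer could take.

```python
# Sketch of converting a centered utility u(history, i) into an
# uncentered u'(history) by summing over all values of the pointer.
# History format and the toy utility are illustrative assumptions.

def uncentered(u, agent_indices):
    """Return u'(h) = sum over i of u(h, i)."""
    def u_prime(history):
        return sum(u(history, i) for i in agent_indices)
    return u_prime

# Toy example: two copies; a history records each copy's payoff.
def u(history, i):
    # The centered agent cares only about the copy the pointer picks out.
    return history[i]

u_prime = uncentered(u, agent_indices=[0, 1])

history = (10, -3)          # copy 0 gets 10, copy 1 gets -3
print(u_prime(history))     # 7: each copy's payoff counted once
```

Note that u′ no longer distinguishes “me” from my copies, which is what makes it usable by UDT and also what makes the resulting agent no longer indexically selfish.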
Things seem to get more confusing in messier worlds, in which an agent’s inability to define a utility function (over uncentered histories) that distinguishes between agent 1 and agent 2 does not entail that the two agents are about to make the same decision.
By the way, selfish values seem related to the reward vs. utility distinction. An agent that pursues a reward that’s about particular events in the world, rather than a more holographic valuation, seems more like a selfish agent in this sense than like a maximizer of a utility function with small-in-space support. If a reward-seeking agent looks for reward-channel-shaped patterns in general, instead of just the instance of a reward channel in front of it, it might tile the world with reward channels or search the world for more of them.
I don’t know if I’m a simulation or a real person.
A possible response to this argument is that the predictor may be able to accurately predict the agent without explicitly simulating them. A possible counter-response to this is to posit that any sufficiently accurate model of a conscious agent is necessarily conscious itself, whether the model takes the form of an explicit simulation or not.
if I am seeing a bomb in Left it must mean I’m in the 1 in a trillion trillion situation where the predictor made a mistake, and therefore I should (intuitively) take Right. UDT also says I should take Right, so there’s no problem here.
It is more probable that you are misinformed about the predictor. But your conclusion is correct: take the Right box.