When you started to get into the utility function notation, you said
I can’t imagine how the utility of being exploded would equal the utility of being in control. Was this supposed to be sarcastic?
After that I was just lost by the notation—any chance you could expand the explanation?
It is, if we say it will be :-) The utility does not descend from on high; it’s created, either by us directly or indirectly through a self-improving AI; and if either says U(E)=U(A), then U(E)=U(A) it is.
Apologies for the notation confusion. Snowyowl’s explanation is a good one, and I have little to add to it. Let me know if you want more details.
tl;dr: if an AI is indifferent between being in control (difficult) and exploding (easy), wouldn’t it just explode?
If the AI can’t/won’t distinguish between “taking control of the world” and “being blown up”, how can it achieve consistency with its other values?
Say it wants to make paperclips. It might then determine that the overly high value it places on being blown up interferes with its ability to make paperclips. Or say you want it to cure cancer. It might determine that the overly high value it places on curing cancer interferes with its desire to be blown up.
Assuming its highest utility comes from being in control OR being blown up, I don’t see why it wouldn’t either self-modify to avoid this or just blow itself up.
In particular, in the example you note, the AI should ALWAYS defect without worrying about the human, unless U(B) > 0.99 U(A), in which case the AI has effectively seized control without confrontation and the whole point is moot.
It’s certainly true that you’ll have the chance to blow it up most of the time, but then you’ll just end up with a lot of blown up AIs and no real idea whether or not they’re friendly.
Now that I’ve written this, I feel like I’m somewhat conflating the utility of being in control with the utility of the things that could be done while in control. To resolve this you would have to make U(A) a function of time, and it is unclear to me how you would set U(E) to vary in the same way sensibly: A gains utility as the AI accomplishes more things, whereas E doesn’t. If the AI is self-aware and self-modifying, it should realize that E means losing out on all future utility from A, and self-modify its utility for E either down, so that it can achieve more of its original utility, or up to the point of wanting to be exploded, if it decides somehow that E has the same “metaphysical importance” as A and it has therefore been overweighting A.
There is no “future utility” to lose for E. The utility of E is precisely the expected future utility of A.
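In symbols (a rough sketch of the identity, not the post’s exact notation): E’s utility is pegged to the expectation, taken at the moment of detonation, of the future utility the AI would have got had the detonation failed, so it varies over time exactly as A’s does.

```latex
% Sketch only: U(E) is *defined* as the conditional expectation of A's
% future utility, so there is nothing "left over" to self-modify away.
U(E) \;:=\; \mathbb{E}\!\left[\, U(A) \;\middle|\; \text{the detonation fails} \,\right]
```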
The AI has no concept of effort, other than that derived from its utility function.
The best idea is to be explicit about the problem; write down the situations, or the algorithm, that would lead to the AI modifying itself in this way, and we can see if it’s a problem.
U(E)=U(A) is what we desire. That is what the filter is designed to achieve: it basically forces the AI to act as though the explosives will never detonate (by considering the outcome of a successful detonation to be the same as a failed detonation). The idea is to ensure that the AI ignores the possibility of being blown up, so that it does not waste resources on disarming the explosives—and can then be blown up. Difficult, but very useful if it works.
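If a toy example helps, here is a minimal sketch of that filtering step in code. The function names and the paperclip numbers are illustrative inventions, not anything from the post; it just shows how scoring a successful detonation as if it had failed removes any incentive to disarm the explosives.

```python
# Toy sketch of the "utility filter": outcomes where the explosives detonate
# are scored as the expected utility the AI would have got had the
# detonation failed, i.e. U(E) = E[U(A) | failure].

def filtered_utility(outcome, base_utility, outcomes_if_failed, probs_if_failed):
    """Return the filtered utility U' of a single outcome.

    outcome            -- dict with at least a boolean field 'detonated'
    base_utility       -- the AI's original utility function U(outcome)
    outcomes_if_failed -- possible outcomes conditional on the detonation failing
    probs_if_failed    -- their probabilities (conditional on failure)

    All names here are illustrative; the post's notation is different.
    """
    if not outcome["detonated"]:
        # Detonation failed (or never happened): use the original utility.
        return base_utility(outcome)
    # Detonation succeeded: score it as if the detonation had failed.
    return sum(p * base_utility(o)
               for p, o in zip(probs_if_failed, outcomes_if_failed))


if __name__ == "__main__":
    # Toy setup: the AI only cares about paperclips made.
    U = lambda o: o["paperclips"]

    # Conditional on the explosives failing, the AI expects to stay in
    # control and make 100 paperclips with certainty.
    if_failed = [{"detonated": False, "paperclips": 100}]
    p_failed = [1.0]

    blown_up = {"detonated": True, "paperclips": 0}
    survives = {"detonated": False, "paperclips": 100}

    # Under the filter both outcomes score the same, so the AI gains nothing
    # by spending resources on disarming the explosives.
    print(filtered_utility(blown_up, U, if_failed, p_failed))   # 100.0
    print(filtered_utility(survives, U, if_failed, p_failed))   # 100
```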
The rest of the post is (once you wade through the notation) dealing with the situation where there are several different ways in which each outcome can be realised, and the mathematics of the utility filter in this case.
Exactly.