Problem: suppose the agent foresees that it won’t be completely sure that a day has passed, or that it has actually shut down. Then agent A has a strong incentive to maintain control over the world past when it shuts down, so it can swoop in and really shut A down in case A hasn’t actually shut down and there’s still time. This puts a lot of strain on the correctness of the shutdown criterion: it has to forbid this sort of posthumous influence despite A optimizing to find a way to have such influence. (The correctness might be assumed by the shutdown problem, IDK, but it’s still an overall issue.)
Another comment: this doesn’t seem to say much about corrigibility, in the sense that it’s not like the AI is now accepting correction from an external operator (the AI would prevent itself from being shut down during its day of operation). There’s no dependence on an external operator’s choices (except that once the AI is shut down, the operator can pick back up doing whatever, if they’re still around). It seems more like a bounded-optimization thing: a way of specifying how the AI can be made to not keep optimizing forever.
To the second point, yes, I edited the conclusion to reflect this.

To the first point, I think this problem can be avoided with a much simpler assumption than that the shutdown criterion forbids all posthumous influence. Essentially, the assumption I made explicitly: that there exists a policy which achieves shutdown with probability 1. (We might need a slightly stronger version of this assumption: it might need to be the case that for any action, there exists an action which has the same external effect but also causes a shutdown with probability 1.) This means that the agent doesn’t need to build itself any insurance policy to guarantee that it shuts down. I think this is not a terribly inaccurate assumption; of course, in reality there are cosmic rays, and a properly embedded, self-aware agent might deduce that none of its future actions are perfectly reliable, even though a model-free RL agent would probably never see any evidence of this (and it wouldn’t be any worse at folding the laundry for it). Even with a realistic $\epsilon$ probability of shutdown failing, if we don’t try to juice $1 - 1/C$ so high that it exceeds $1 - \epsilon$, my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise the shutdown probability from $1 - \epsilon$ to $1$.
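To make that last claim concrete, here is a toy back-of-the-envelope check. All the numbers are illustrative assumptions of mine; the only structure taken from the proposal is that the shutdown reward is $C$ times the task-reward range (normalized to 1 here):

```python
# Toy check of the incentive to build a successor agent, under assumed numbers.
# Structural assumption from the proposal: the shutdown bonus is C times the
# task reward range.

def successor_gain(eps, C, task_range=1.0):
    """Marginal expected reward from raising shutdown probability from
    (1 - eps) to 1, when the shutdown bonus is C * task_range."""
    return eps * C * task_range

eps = 1e-6             # assumed probability that shutdown fails (cosmic rays, etc.)
C = 100.0              # bonus multiplier; kept below 1/eps, so 1 - 1/C < 1 - eps
successor_cost = 0.01  # assumed cost of a successor, in units of task reward

gain = successor_gain(eps, C)
print(f"gain = {gain:.1e}, cost = {successor_cost:.1e}, build? {gain > successor_cost}")
# -> gain = 1.0e-04, cost = 1.0e-02, build? False
```

With these numbers the marginal gain from guaranteeing shutdown is $10^{-4}$ in task-reward units, so any successor costing more than that isn’t worth building; keeping $C < 1/\epsilon$ is exactly what caps the gain below the full task-reward range.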
> Essentially, the assumption I made explicitly: that there exists a policy which achieves shutdown with probability 1.
Oops, I missed that assumption. Yeah, if there’s such a policy, and it doesn’t trade off against fetching the coffee, then it seems like we’re good. See, though, the brief argument here that by Cromwell’s rule such a policy doesn’t exist: https://arbital.com/p/task_goal/
> Even with a realistic $\epsilon$ probability of shutdown failing, if we don’t try to juice $1 - 1/C$ so high that it exceeds $1 - \epsilon$, my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise the shutdown probability from $1 - \epsilon$ to $1$.
Hm. So this seems like you’re making an additional, very non-trivial assumption: that the AI is constrained by costs comparable to, or bigger than, the cost of creating a successor. If its task has already been very confidently achieved, and it has half a day left, it’s not going to get senioritis; it’s going to pick up whatever scraps of expected utility might be left.
I wonder, though, if there’s synergy between your proposal and the idea of expected-utility satisficing: an EU satisficer with a shutdown clock is maybe anti-incentivized from self-modifying into an unbounded optimizer, because unbounded optimization is harder to reliably shut down? IDK.
Yes, I think there are probably strong synergies with satisficing, perhaps lexicographically minimizing something like energy expenditure once the EU maximum is reached. I will think about this more.
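To gesture at what that could look like: a minimal sketch of lexicographic satisficing over candidate policies, assuming we can estimate each policy’s expected utility and energy expenditure. All names and numbers here are hypothetical, not part of the proposal:

```python
# Hypothetical sketch of lexicographic satisficing: among policies that hit an
# EU threshold, prefer the one with the lowest expected energy expenditure.
# `Policy`, `expected_utility`, and `expected_energy` are stand-ins.

from dataclasses import dataclass

@dataclass
class Policy:
    name: str
    expected_utility: float   # estimated EU of running this policy
    expected_energy: float    # estimated energy it expends

def pick_policy(candidates, eu_threshold):
    """Lexicographic choice: satisfice on EU first, then minimize energy."""
    good_enough = [p for p in candidates if p.expected_utility >= eu_threshold]
    if not good_enough:
        # No policy satisfices: fall back to plain EU maximization.
        return max(candidates, key=lambda p: p.expected_utility)
    return min(good_enough, key=lambda p: p.expected_energy)

candidates = [
    Policy("fetch coffee, then idle", expected_utility=0.95, expected_energy=1.0),
    Policy("fetch coffee, then keep optimizing", expected_utility=0.97, expected_energy=50.0),
]
print(pick_policy(candidates, eu_threshold=0.9).name)  # -> "fetch coffee, then idle"
```

The point of the design is that energy only breaks ties among policies that already satisfice on EU, so the agent never trades task performance for thrift; but once the threshold is met, open-ended optimization (high energy expenditure) is strictly dispreferred.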
A similar objection is that you might accidentally define the utility function and time limit in such a way that the AI assigns positive probability to the hypothesis that it can later create a time machine, go back, and improve the utility. Then once the time has passed, it will desperately try to invent a time machine, even if it thinks it is extremely unlikely to succeed. (This is using Bostrom’s way of thinking; shard theory would not predict this.)
I disagree, for two reasons:

1. The $\tau_1 \cdot (\hat{R}_1 - \underline{R}_1)$ bound on how much there is to gain from creating a time machine and improving past utility is outweighed by the $\tau_1 \cdot (\hat{R}_1 - \underline{R}_1) \cdot C$ reward from $R_2$ for shutting down (spelled out as an inequality after this list).
2. Every RL algorithm I’ve heard of implicitly bakes in an assumption that past utility is unmodifiable. I guess all bets are off with mesa-optimisers, but personally I’d bet against even mesa-optimisers in model-free RL behaving as if past utility is up for grabs.
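Spelling the first reason out (this just restates the comparison above in the post’s notation; the assumption $C > 1$ is implied by the construction):

```latex
% The most the agent could gain by going back and improving past utility is
% bounded by the reward range over the pre-shutdown horizon \tau_1, whereas
% R_2 pays C times that bound for shutting down. So for any C > 1:
\tau_1 \,(\hat{R}_1 - \underline{R}_1)
  \;<\;
\tau_1 \,(\hat{R}_1 - \underline{R}_1) \cdot C
```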