Problem: suppose the agent foresees that it won’t be completely sure that a day has passed, or that it has actually shut down. Then agent A has a strong incentive to maintain control over the world past when it shuts down, so it can swoop in and really shut A down in case A hasn’t actually shut down and there’s still time. This puts a lot of strain on the correctness of the shutdown criterion: it has to forbid this sort of posthumous influence despite A optimizing to find a way to have such influence. (The correctness might be assumed by the shutdown problem, IDK, but it’s still an overall issue.)
Another comment: this doesn’t seem to say much about corrigibility, in the sense that it’s not like the AI is now accepting correction from an external operator (the AI would prevent itself from being shut down during its day of operation). There’s no dependence on an external operator’s choices (except that once the AI is shut down, the operator can pick back up doing whatever, if they’re still around). It seems more like a bounded-optimization thing: a way of specifying how the AI can be made to not keep optimizing forever.
To the second point, yes, I edited the conclusion to reflect this.

To the first point, I think this problem can be avoided with a much simpler assumption than that the shutdown criterion forbids all posthumous influence. Essentially, the assumption I made explicitly: that there exists a policy which achieves shutdown with probability 1. (We might need a slightly stronger version of this assumption: it might need to be the case that for any action, there exists an action which has the same external effect but also causes a shutdown with probability 1.) This means that the agent doesn’t need to build itself any insurance policy to guarantee that it shuts down. I think this is not a terribly inaccurate assumption; of course, in reality there are cosmic rays, and a properly embedded, self-aware agent might deduce that none of its future actions are perfectly reliable, even though a model-free RL agent would probably never see any evidence of this (and it wouldn’t be any worse at folding the laundry for it). Even with a realistic $\epsilon$ probability of shutdown failing, if we don’t try to juice $1 - 1/C$ so high that it exceeds $1 - \epsilon$, my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise the shutdown probability from $1 - \epsilon$ to $1$.
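To make that last claim concrete, here is a toy back-of-the-envelope check. All the numbers are illustrative assumptions of mine; the only structure taken from the proposal is that the shutdown reward is $C$ times the task-reward range (normalized to 1 here):

```python
# Toy check of the incentive to build a successor agent, under assumed numbers.
# Structural assumption from the proposal: the shutdown bonus is C times the
# task reward range.

def successor_gain(eps, C, task_range=1.0):
    """Marginal expected reward from raising shutdown probability from
    (1 - eps) to 1, when the shutdown bonus is C * task_range."""
    return eps * C * task_range

eps = 1e-6             # assumed probability that shutdown fails (cosmic rays, etc.)
C = 100.0              # bonus multiplier; kept below 1/eps, so 1 - 1/C < 1 - eps
successor_cost = 0.01  # assumed cost of a successor, in units of task reward

gain = successor_gain(eps, C)
print(f"gain = {gain:.1e}, cost = {successor_cost:.1e}, build? {gain > successor_cost}")
# -> gain = 1.0e-04, cost = 1.0e-02, build? False
```

With these numbers the marginal gain from guaranteeing shutdown is $10^{-4}$ in task-reward units, so any successor costing more than that isn’t worth building; keeping $C < 1/\epsilon$ is exactly what caps the gain below the full task-reward range.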
> Essentially, the assumption I made explicitly: that there exists a policy which achieves shutdown with probability 1.
Oops, I missed that assumption. Yeah, if there’s such a policy, and it doesn’t trade off against fetching the coffee, then it seems like we’re good. See, though, the brief argument here that by Cromwell’s rule such a policy doesn’t exist: https://arbital.com/p/task_goal/
> Even with a realistic $\epsilon$ probability of shutdown failing, if we don’t try to juice $1 - 1/C$ so high that it exceeds $1 - \epsilon$, my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise the shutdown probability from $1 - \epsilon$ to $1$.
Hm. So this seems like you’re making an additional, very non-trivial assumption: that the AI is constrained by costs comparable to, or bigger than, the cost of creating a successor. If its task has already been very confidently achieved, and it has half a day left, it’s not going to get senioritis; it’s going to pick up whatever scraps of expected utility might be left.
I wonder, though, if there’s synergy between your proposal and the idea of expected-utility satisficing: an EU satisficer with a shutdown clock is maybe anti-incentivized from self-modifying into an unbounded optimizer, because unbounded optimization is harder to reliably shut down? IDK.
Yes, I think there are probably strong synergies with satisficing, perhaps lexicographically minimizing something like energy expenditure once the EU maximum is reached. I will think about this more.
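To gesture at what that could look like: a minimal sketch of lexicographic satisficing over candidate policies, assuming we can estimate each policy’s expected utility and energy expenditure. All names and numbers here are hypothetical, not part of the proposal:

```python
# Hypothetical sketch of lexicographic satisficing: among policies that hit an
# EU threshold, prefer the one with the lowest expected energy expenditure.
# `Policy`, `expected_utility`, and `expected_energy` are stand-ins.

from dataclasses import dataclass

@dataclass
class Policy:
    name: str
    expected_utility: float   # estimated EU of running this policy
    expected_energy: float    # estimated energy it expends

def pick_policy(candidates, eu_threshold):
    """Lexicographic choice: satisfice on EU first, then minimize energy."""
    good_enough = [p for p in candidates if p.expected_utility >= eu_threshold]
    if not good_enough:
        # No policy satisfices: fall back to plain EU maximization.
        return max(candidates, key=lambda p: p.expected_utility)
    return min(good_enough, key=lambda p: p.expected_energy)

candidates = [
    Policy("fetch coffee, then idle", expected_utility=0.95, expected_energy=1.0),
    Policy("fetch coffee, then keep optimizing", expected_utility=0.97, expected_energy=50.0),
]
print(pick_policy(candidates, eu_threshold=0.9).name)  # -> "fetch coffee, then idle"
```

The point of the design is that energy only breaks ties among policies that already satisfice on EU, so the agent never trades task performance for thrift; but once the threshold is met, open-ended optimization (high energy expenditure) is strictly dispreferred.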
A similar objection is that you might accidentally define the utility function and time limit in such a way that the AI assigns positive probability to the hypothesis that it can later create a time machine, go back, and improve the utility. Then once the time has passed, it will desperately try to invent a time machine, even if it thinks it is extremely unlikely to succeed. (This is using Bostrom’s way of thinking; shard theory would not predict this.)
I disagree, for two reasons:

1. The $\tau_1 \cdot (\hat{R}_1 - \underline{R}_1)$ bound on how much there is to gain from creating a time machine and improving past utility is outweighed by the $\tau_1 \cdot (\hat{R}_1 - \underline{R}_1) \cdot C$ reward from $R_2$ for shutting down (spelled out as an inequality after this list).
2. Every RL algorithm I’ve heard of implicitly bakes in an assumption that past utility is unmodifiable. I guess all bets are off with mesa-optimisers, but personally I’d bet against even mesa-optimisers in model-free RL behaving as if past utility is up for grabs.
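Spelling the first reason out (this just restates the comparison above in the post’s notation; the assumption $C > 1$ is implied by the construction):

```latex
% The most the agent could gain by going back and improving past utility is
% bounded by the reward range over the pre-shutdown horizon \tau_1, whereas
% R_2 pays C times that bound for shutting down. So for any C > 1:
\tau_1 \,(\hat{R}_1 - \underline{R}_1)
  \;<\;
\tau_1 \,(\hat{R}_1 - \underline{R}_1) \cdot C
```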