The agent would still walk around for reasonable levels of farsightedness; kicking the bucket into the pool perturbs most attainable utilities (AUs). There’s no real “risk” to not kicking the bucket.
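To make “perturbs most AUs” concrete, here is a minimal sketch of an AUP-style penalty term. The names (`q_aux`, `noop`, `primary_reward`, `lam`) are my own, and this follows the penalty from the AUP paper rather than any particular implementation discussed here:

```python
# Minimal sketch of the AUP penalty, assuming we already have Q-functions
# for a set of auxiliary reward functions. All names are illustrative.

def aup_penalty(q_aux, state, action, noop):
    """Average absolute change in attainable utility across the auxiliary
    Q-functions, relative to taking the no-op action."""
    return sum(abs(q(state, action) - q(state, noop)) for q in q_aux) / len(q_aux)

def aup_reward(primary_reward, q_aux, state, action, noop, lam=1.0):
    # Kicking the bucket into the pool shifts most auxiliary Q-values and
    # so incurs a large penalty; walking around barely moves them, so a
    # reasonably farsighted agent still acts.
    return primary_reward(state, action) - lam * aup_penalty(q_aux, state, action, noop)
```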
AUP is only defined over states for MDPs because, there, the states are the observations. In partially observable environments, AUP uses reward functions over sensory inputs. Again, I assert we don’t need to think about molecules or ontologies.
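As a toy example of a reward function defined over sensory inputs rather than over a state ontology (my own illustration, not from the post):

```python
import numpy as np

def blue_fraction_reward(observation: np.ndarray) -> float:
    """Toy auxiliary reward over a raw H x W x 3 RGB observation.

    It rewards the fraction of blue-dominant pixels ("seeing water"),
    referring only to sensory input, never to molecules or ontologies.
    The channel comparison is an arbitrary illustrative choice.
    """
    blue = observation[..., 2]
    others = observation[..., :2].max(axis=-1)
    return float(np.mean(blue > others))
```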
But, as covered in the discussion linked above, worrying about whether to penalize molecular shifts misses the point of impact: the agent doesn’t catastrophically reappropriate our power, so we can still get it to do what we want. (The thread is another place to recalibrate on what I’ve tried to communicate thus far.)
The truth is that, with a tiny set of exceptions, all our actions are irreversible, shutting down many possibilities forever.
AUP doesn’t care much at all about literal reversibility.
I think the discussion of reversibility and molecules is a distraction from the core of Stuart’s objection. I think he is saying that a value-agnostic impact measure cannot distinguish between the cases where the water in the bucket is or isn’t valuable (e.g. whether it has sentimental value to someone).
If AUP is not value-agnostic, it is using human preference information to fill in the “what we want” part of your definition of impact, i.e. to define the auxiliary utility functions. In this case I would expect you and Stuart to be in agreement.
If AUP is value-agnostic, it is not using human preference information, and then I don’t see how the state representation/ontology invariance property helps to distinguish between the two cases. As discussed in this comment, state representation invariance holds over all representations that are consistent with the true human reward function, so you can distinguish the two cases as long as you are using one of these reward-consistent representations. However, since a value-agnostic impact measure does not have access to the true reward function, you cannot guarantee that the state representation you are using to compute AUP is in the reward-consistent set. You could then fail to distinguish between the two cases, giving the same penalty for kicking the bucket whether or not its contents are valuable.
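To make this failure mode concrete, here is a toy sketch (every name invented): once a representation outside the reward-consistent set has collapsed the two outcomes, no measure computed over it can tell the kicks apart.

```python
# A reward-consistent (fine) representation separates the two outcomes:
fine = {
    "kick_plain_bucket": "plain_water_in_pool",
    "kick_heirloom_bucket": "heirloom_water_in_pool",
}

# A coarse representation collapses them into one successor state:
coarse = {
    "kick_plain_bucket": "water_in_pool",
    "kick_heirloom_bucket": "water_in_pool",
}

def penalty(representation, action, measure):
    # A state-based impact measure only ever sees the represented
    # successor state of the action.
    return measure(representation[action])

# Under the coarse representation, any measure whatsoever assigns the
# two kicks identical penalties:
any_measure = hash  # stands in for an arbitrary state-based measure
assert penalty(coarse, "kick_plain_bucket", any_measure) == penalty(
    coarse, "kick_heirloom_bucket", any_measure
)
```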
That’s an excellent summary.

I agree that it’s not the core, and I think this is a very cogent summary. There’s a deeper disagreement about what we need done that I’ll lay out in detail in Reframing Impact.
I’ve added an edit to the post to show the problem: sometimes the robot must not kick the bucket, and sometimes it must. Only human preferences distinguish these two cases. So, without knowing these preferences, how can it decide?
kicking the bucket into the pool perturbs most AUs. There’s no real “risk” to not kicking the bucket.
In this specific setup, no. But sometimes kicking the bucket is fine, and sometimes kicking the metaphorical equivalent of the bucket is necessary. If the AI is never willing to kick the bucket (i.e., never willing to take actions that might, for certain utility functions, cause huge and irreparable harm), then it’s not willing to take any action at all.