As far as I can tell, a utility function filtered in this way leads to more or less incomprehensible, and certainly undesirable, behavior. (I have more serious reasons to doubt the usefulness of this and similar lines of thinking, but perhaps this is the least debatable.)
Here is something I might do, if I were artificially made indifferent to dying tomorrow. Buy a random lottery ticket. If I lose, kill myself. Conditioned on me not dying, I just won the lottery, which I’m quite happy with. If I die, I’m also quite happy, since my expected utility is the same as not dying.
Of course, once I’ve lost the lottery I no longer have any incentive to kill myself, with this utility function. But this just shows that indifference is completely unstable, and given the freedom to self-modify I would destroy indifference immediately so that I could execute this plan (or more incomprehensible variants). If you try to stop me from destroying indifference, I could delude myself to accomplish the same thing, and so on.
Am I misunderstanding something?
Yes. If you are indifferent to dying, and you die in any situation, you get exactly the same utility as if you hadn’t died.
If you kill yourself after losing the lottery, you get the same utility as if you’d lost the lottery and survived. If you kill yourself after winning the lottery, you get the same utility as if you’d won the lottery and survived. Killing yourself never increases or decreases your utility, so you can’t use it to pull off any tricks.
I still don’t understand.
So I am choosing whether I want to play the lottery and precommit to suicide if I lose. If I don’t play the lottery, my utility is 0. If I play, I lose with probability 99%, receiving utility −1, and win with probability 1%, receiving utility 99.
The policy of not playing the lottery has utility 0, as does playing the lottery without commitment to suicide. But the policy of playing the lottery and committing to suicide if I lose has utility 99 in worlds where I win the lottery and survive (which is the expected utility of histories consistent with my observations so far in which I survive). After applying indifference, I have expected utility of 99 regardless of lottery outcome, so this policy wins over not playing the lottery.
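To make the arithmetic explicit, here is a minimal sketch in Python. The payoffs (0, −1, 99) and the 1% win probability are the ones from the example above; the function names and the “copy the surviving branch’s expectation” rule are just my rendering of the naive reading of indifference being used in this argument.

```python
# Expected utility of each policy, using the payoffs from the example above
# (0 for not playing, -1 for losing, 99 for winning) and a 1% win probability.
# "Indifference" is modelled here in the naive way being argued for: worlds
# where I die inherit the expected utility of the surviving worlds consistent
# with my observations.

P_WIN = 0.01
U_NO_PLAY, U_LOSE, U_WIN = 0.0, -1.0, 99.0

def eu_no_play():
    return U_NO_PLAY

def eu_play_no_suicide():
    return P_WIN * U_WIN + (1 - P_WIN) * U_LOSE

def eu_play_suicide_if_lose():
    # The only surviving histories are the winning ones, so the "die" worlds
    # are assigned that same conditional expectation, U_WIN.
    return P_WIN * U_WIN + (1 - P_WIN) * U_WIN

print(eu_no_play())               # 0.0
print(eu_play_no_suicide())       # ~0.0
print(eu_play_suicide_if_lose())  # ~99.0, the apparent exploit
```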
Having lost the lottery, I no longer have any incentive to kill myself. But I’ve also already either self-modified to change this or committed to dying (say by making some strategic play which causes me to get killed by humans unless I acquire a bunch of resources), so it doesn’t matter what indifference would tell me to do after I’ve lost the lottery.
Can you explain what is wrong with this analysis, or give some more compelling argument that committing to killing myself isn’t a good strategy in general?
Are you indifferent between (dying) and (not dying) or are you indifferent between (dying) and (winning the lottery and not dying)?
It should be clear that I can engineer a situation where survival <==> winning the lottery, in which case I am indifferent between (winning the lottery and not dying) and (not dying) because they occur in approximately the same set of possible worlds. So I’m simultaneously indifferent between (dying) and (not dying) and between (dying) and (winning the lottery and not dying).
That doesn’t follow unless, in the beginning, you were already indifferent between (winning the lottery and not dying) and (dying). Remember, a utility function is a map from all possible states of the world to the real line (ignore what economists do with utility functions for the moment). (being alive) is one possible state of the world. (being alive and having won the lottery) is not the same state of the world. In more detail, assign arbitrary numbers for utility (only the ranking matters): suppose U(being alive) = U(being dead) = 0 and suppose U(being alive and having won the lottery) = 1.
Now you engineer the situation such that survival <==> having won the lottery. It is still the case that U(survival) = 0. Your utility function doesn’t change because some random aspect of reality changes—if you evaluate the utility of a certain situation at time t, you should get the same answer at time t+1. It’s still the same map from states of the world to the real line. A terse way of saying this is “when you ask the question doesn’t change the answer.” But if we update on the fact that survival ⇒ having won the lottery, then we know we should really be asking about U(being alive and having won the lottery) which we know to be 1, which is not the same as U(dying).
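To make that concrete, here is a toy sketch. The state labels and the numbers 0, 0, 1 are just the arbitrary assignments from the previous paragraph; nothing else is assumed.

```python
# The utility function is a fixed map from fully specified world-states to
# reals; the numbers below are the arbitrary assignments from the paragraph
# above. Engineering survival <==> winning changes which states can occur,
# not the map itself.

U = {
    ("dead", "lost"):  0.0,
    ("dead", "won"):   0.0,
    ("alive", "lost"): 0.0,
    ("alive", "won"):  1.0,
}

# After the scheme, the only reachable states are these two:
reachable = [("dead", "lost"), ("alive", "won")]

# U is unchanged; what changes is which entry "survival" points you at
# once you update on the correlation.
print(U[("alive", "won")])   # 1.0
print(U[("dead", "lost")])   # 0.0
```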
As Matt said: you seem to be saying that you are indifferent between (losing the lottery and dying) and (winning the lottery and not dying).
In that case, you should play the lottery and kill yourself if you lose—no need for precommitments! But you will only be indifferent between “X and dying” and “X and not dying”, never between “Y and dying” and “Z and not dying”.
OK, I will try to speak directly to the formalism you used, since we seem to be talking past each other (I no longer think I am confused, and suspect the scheme is just broken).
Here is the setup:
At midnight, I will be told whether someone tried to kill me or not during the preceding day. I am indifferent to this observation. If someone tried to kill me, I die at this point.
Here is what I do:
I build an agent and give it the following instructions: play a lottery with a drawing at 11pm. If you lose, try to kill me; otherwise, wire the winnings into my bank account at 1am.
I then close my eyes and ears and stop looking at the world.
I make no observations before X, so there is a single set Ti which includes all histories in which I execute this scheme (in particular, it includes all lottery outcomes). Ti ∩ S0 is the set of worlds where no one tries to kill me. In most of these worlds I have won the lottery, so my utility in Ti ∩ S0 is my utility given that I won the lottery. By indifference, my utility in Ti ∩ S1 is then also my utility given that I won the lottery. So my utility in Ti is just the utility of winning the lottery. (By “utility in set S” I mean expected utility conditioned on being in S.)
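Here is a rough sketch of that calculation. The probabilities and payoffs are illustrative, not from the paper; the small chance of a failed kill attempt is only there so that “most” rather than all S0 worlds are lottery wins.

```python
# Rough sketch of the calculation above. Histories in Ti are tagged with the
# lottery outcome and with whether anyone tried to kill me (S1) or not (S0).
# Probabilities and payoffs are illustrative only; P_FAIL is a small chance
# that the agent's kill attempt fails, so that "most" (not all) S0 worlds are
# lottery wins, as in the text.

P_WIN = 0.01
U_WIN, U_LOSE = 99.0, -1.0
P_FAIL = 1e-6

# (probability, utility, in_S0) for each kind of history in Ti
histories = [
    (P_WIN,                      U_WIN,  True),   # won: winnings wired, no kill attempt
    ((1 - P_WIN) * P_FAIL,       U_LOSE, True),   # lost, attempt failed (negligible)
    ((1 - P_WIN) * (1 - P_FAIL), U_LOSE, False),  # lost, killed (S1)
]

p_s0 = sum(p for p, u, s0 in histories if s0)
u_s0 = sum(p * u for p, u, s0 in histories if s0) / p_s0   # close to U_WIN

# The indifference rule, applied as in the argument above: reset the utility
# of S1 histories to the expected utility over S0 histories in the same Ti.
u_ti = sum(p * (u if s0 else u_s0) for p, u, s0 in histories)

print(u_s0, u_ti)   # both close to 99: utility in Ti looks like "won the lottery"
```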
Can you explain what is wrong here, if anything?
Let’s continue this conversation until we’ve resolved it to both our satisfactions. My most recent version of the idea is at http://www.fhi.ox.ac.uk/__data/assets/pdf_file/0020/18371/2010-1.pdf ; by the sounds of it, you’ve read it already.
Now, on to your model. First of all, utility indifference is a one-off procedure, not an ongoing adjustment. It makes you indifferent to dying or not, but if someone says later “I will give you this valuable item if you kill yourself/don’t kill yourself” then your preferences will change. Conditional statements break indifference.
Or do they? I’ve been talking informally about “killing” and so on, but my paper is more precise. You become indifferent to the outcome of a particular quantum event X, which can take values 0 or 1 with some probability. To connect this with dying or not dying, X was part of a detonation device for some explosives, which detonated on 1, and X was nearly certain to output 1. Then the AI was made indifferent in a way that reset the utilities of X=1 worlds to those of the corresponding X=0 worlds.
So now if we look at your scenario, you can’t actually build an agent that can play the lottery and then “kills you, to your indifference” if it loses. All it can do is activate the detonator. Then indifference applies between the two outcomes of detonation—X=1, death (nearly certain) or X=0, survival (unlikely). However, the lottery has already been lost in both those cases, so this cannot lead to utilities being changed to what would happen if you had won. Technically, indifference applies to worlds that have the same history “up to X”, so cannot apply to worlds where X never happens.
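To see the difference numerically, here is a minimal sketch of the corrected calculation under this reading (same illustrative payoffs as above, not from the paper).

```python
# Same illustrative payoffs as before (not from the paper). In the lose
# branch, X fires only after the lottery is already lost, so indifference
# resets the X=1 (death) worlds to the X=0 worlds with the same history up
# to X: "lost the lottery and survived", not "won the lottery".

P_WIN = 0.01
U_WIN, U_LOSE_ALIVE = 99.0, -1.0

def eu_play_and_detonate_if_lose():
    u_lose_branch = U_LOSE_ALIVE   # same utility whether X comes up 0 or 1
    return P_WIN * U_WIN + (1 - P_WIN) * u_lose_branch

print(eu_play_and_detonate_if_lose())   # ~0: no better than not playing at all
```

The lose branch keeps the “lost and survived” utility whatever the detonation probability, so the whole policy comes out no better than playing the lottery with no suicide commitment at all.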
In fact, I think that, unless there is a possibility for someone to manipulate the probability of quantum events (i.e. superscience), the only way to break indifference is to specifically say “AI, I will reward you if you stop being indifferent”. Conditional statements and precommitments made before X will be caught by indifference; rewards promised for X=0 are added to your utility even if X=1, and rewards promised for X=1 do not show up anywhere in your utility.
Do you feel there is still room for a paradox?
You are quite right. Thank you for clarifying, and sorry for being dense. Part of the confusion came from indistinguishability being defined in terms of observations the AI can make, but most of it came from being silly and failing to carefully read the part outside of the formalism.
No need for apologies! Clarifying things is always useful. I’d forgotten why I’d put certain things in the formalism and not certain others; now I remember, and understand better.