Let’s continue this conversation until we’ve resolved it to both our satisfactions. My most recent version of the idea is at http://www.fhi.ox.ac.uk/__data/assets/pdf_file/0020/18371/2010-1.pdf ; by the sounds of it, you’ve read it already.
Now, on to your model. First of all, utility indifference is a one-off procedure, not an ongoing adjustment. It makes you indifferent to dying or not, but if someone later says “I will give you this valuable item if you kill yourself/don’t kill yourself”, then your preferences will change. Conditional statements break indifference.
Or do they? I’ve been talking informally about “killing” and so on, but my paper is more precise. You become indifferent to the outcome of a particular quantum event X, which can take values 0 or 1 with some probability. To connect this with dying or not dying, X was made part of a detonation device for some explosives, which detonated on X=1, and X was nearly certain to output 1. Then the AI was made indifferent in a way that reset the utilities of X=1 worlds to those of the corresponding X=0 worlds.
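To make the reset concrete (this is my paraphrase and notation, not verbatim from the paper): writing h for a history up to the event X, the corrected utility U′ is

$$U'(w) = \begin{cases} U(w) & \text{if } X=0 \text{ in } w,\\ \mathbb{E}[\,U \mid h,\, X=0\,] & \text{if } X=1 \text{ in } w, \end{cases}$$

so that, conditional on having reached X with history h, the agent’s expected utility is the same whichever way X resolves.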
So now if we look at your scenario, you can’t actually build an agent that plays the lottery and then “kills you, to your indifference” if it loses. All it can do is activate the detonator. Then indifference applies between the two outcomes of the detonation: X=1, death (nearly certain), or X=0, survival (unlikely). But the lottery has already been lost in both of those cases, so this cannot lead to utilities being changed to what they would have been had you won. Technically, indifference applies only to worlds that share the same history “up to X”, so it cannot apply to worlds where X never happens.
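Here is a minimal toy version of that point in Python (the worlds and numbers are invented for illustration; only the shape of the correction comes from the paper):

```python
# Each world: (history up to X, value of X or None if X never fires, utility).
# None covers worlds where the detonator is never activated -- e.g. the
# lottery was won -- so there is no event X to be indifferent over.
worlds = [
    ("lottery lost, detonator fired", 0, 10.0),    # survival (unlikely)
    ("lottery lost, detonator fired", 1, -100.0),  # death (nearly certain)
    ("lottery won", None, 50.0),                   # X never happens
]

def make_indifferent(worlds):
    """Reset each X=1 world's utility to the (equally weighted, for
    simplicity) average utility of the X=0 worlds sharing its history,
    i.e. E[U | history, X=0]."""
    corrected = []
    for history, x, u in worlds:
        if x == 1:
            siblings = [u0 for h0, x0, u0 in worlds
                        if h0 == history and x0 == 0]
            u = sum(siblings) / len(siblings)
        corrected.append((history, x, u))
    return corrected

for w in make_indifferent(worlds):
    print(w)
# ('lottery lost, detonator fired', 0, 10.0)
# ('lottery lost, detonator fired', 1, 10.0)   <- reset to the X=0 value
# ('lottery won', None, 50.0)                  <- untouched
```

The “lottery won” world never enters the computation, because it has no history “up to X” to condition on, so the reset can only ever equalise the two branches of a lottery-already-lost world.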
In fact, I think that, unless there is some possibility of manipulating the probability of quantum events (i.e. superscience), the only way to break indifference is to specifically say “AI, I will reward you if you stop being indifferent”. Conditional statements and precommitments made before X will be caught by indifference: rewards promised for X=0 are added to your utility even if X=1, and rewards promised for X=1 do not show up anywhere in your utility.
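Continuing the same toy model (again with invented numbers), you can check both halves of that claim:

```python
def make_indifferent(worlds):
    """Same correction as above: X=1 utilities are reset to the average
    utility of the X=0 worlds sharing their history."""
    corrected = []
    for history, x, u in worlds:
        if x == 1:
            siblings = [u0 for h0, x0, u0 in worlds
                        if h0 == history and x0 == 0]
            u = sum(siblings) / len(siblings)
        corrected.append((history, x, u))
    return corrected

base = [("h", 0, 10.0), ("h", 1, -100.0)]

# A reward of 5 promised for X=0 raises E[U | h, X=0], and the reset
# copies that raise into the X=1 world as well.
reward_if_0 = [(h, x, u + (5.0 if x == 0 else 0.0)) for h, x, u in base]
print(make_indifferent(reward_if_0))
# [('h', 0, 15.0), ('h', 1, 15.0)]  -> shows up even if X=1

# A reward of 5 promised for X=1 is overwritten by the reset, so it
# never appears in the corrected utility at all.
reward_if_1 = [(h, x, u + (5.0 if x == 1 else 0.0)) for h, x, u in base]
print(make_indifferent(reward_if_1))
# [('h', 0, 10.0), ('h', 1, 10.0)]  -> vanishes
```

Either way, the agent’s relative valuation of X=0 versus X=1 is unchanged, so the promise gives it no reason to care how X comes out.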
Do you feel there is still room for a paradox?
You are quite right. Thank you for clarifying, and sorry for being dense. Part of the confusion came from indistinguishability being defined in terms of the observations the AI can make, but most of it came from being silly and failing to carefully read the part outside of the formalism.
No need for apologies! Clarifying things is always useful. I’d forgotten why I’d put certain things in the formalism and not others; now I remember, and understand better.