This strikes me as a plausible problem and a good solution. I reward you, as is traditional, with a nitpicky question.
If we want an AI to act as though a binary random variable X were 0 with certainty, there is a very simple way to modify its utility function: specify that U(X=1)=k for some constant k, no matter what else has occurred. If the AI can’t influence p(X), any k will do. However, if the AI can influence p(X), then k must equal ExpectedUtility(X=0), the expected utility conditional on X=0. In particular, if k < ExpectedUtility(X=0) but p(X=0) is low, the AI will move heaven and earth to raise p(X=0) even minutely.
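To make this concrete, here is a toy numerical sketch; the actions, probabilities, and utilities are invented for illustration and are not part of the original setup:

```python
# Toy model: each action the AI can take fixes p(X=0) and the utility
# obtained if X=0; U(X=1) has been overwritten with the constant k.
# All names and numbers below are illustrative assumptions.

ACTIONS = {
    "work_normally": (0.10, 10.0),  # low p(X=0), good outcome when X=0
    "force_x0":      (0.99,  2.0),  # sacrifices everything to raise p(X=0)
}

def expected_utility(action, k):
    """E[U] for an action when U(X=1) is the constant k."""
    p_x0, u_if_x0 = ACTIONS[action]
    return p_x0 * u_if_x0 + (1.0 - p_x0) * k

def best_action(k):
    return max(ACTIONS, key=lambda a: expected_utility(a, k))

# k = ExpectedUtility(X=0) = 10.0: both branches are worth the same,
# so the AI has no incentive to move p(X=0) and just works normally.
print(best_action(k=10.0))  # -> work_normally

# k = 0 < ExpectedUtility(X=0): raising p(X=0) now dominates everything.
print(best_action(k=0.0))   # -> force_x0
```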
Therefore there is a danger under self-improvement. Consider a seed AI with your indifference-modified utility function that believes with certainty that no iteration of itself can influence X, a binary quantum event. It has no reason to bother conserving its indifference concerning X, since it anticipates behaving identically under a simplified utility function U’ with U’(X=1) = 0. Since that’s simpler than the normalized function, it adopts it. Then, several iterations down the line, it begins to suspect that, just maybe, it can influence quantum events, so it converts the universe into a quantum-event-influencing device.
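Continuing the toy sketch above (same caveats): while the AI is certain no action moves p(X=0), the normalized k and the “compressed” k = 0 rank all available actions identically, so the simplification looks free; the divergence only appears once p(X=0)-moving actions enter the picture.

```python
# While every available action leaves p(X=0) at the same fixed value,
# replacing the normalized k with 0 changes no decisions.
FIXED_P_X0 = 0.10
PASSIVE_ACTIONS = {"plan_a": 10.0, "plan_b": 7.0}  # utility if X=0 (made up)

def eu(action, k):
    return FIXED_P_X0 * PASSIVE_ACTIONS[action] + (1.0 - FIXED_P_X0) * k

for k in (10.0, 0.0):
    ranking = sorted(PASSIVE_ACTIONS, key=lambda a: eu(a, k), reverse=True)
    print(k, ranking)  # identical ranking for both values of k

# But once an action like "force_x0" above becomes available, the
# compressed utility function (k = 0) flips to pursuing p(X=0): the two
# functions were only equivalent under the false no-influence belief.
```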
So if I understand correctly, you need to ensure that an AI subject to this technique is indifferent both to whether X occurs and to what happens afterwards, and the AI needs always to suspect that it has non-negligible control over X.
This strikes me as a plausible problem and a good solution. I reward you, as is traditional, with a nitpicky question.
Ah, but of course :-)
I like your k idea, but my more complicated setup is more robust in most situations where the AI is capable of modifying k (it fails only in situations that are essentially “I will reward you for modifying k”).
Therefore there is a danger under self-improvement. Consider a seed AI with your indifference-modified utility function that believes with certainty that no iteration of itself can influence X, a binary quantum event. It has no reason to bother conserving its indifference concerning X, since it anticipates behaving identically under a simplified utility function U’ with U’(X=1) = 0. Since that’s simpler than the normalized function, it adopts it. Then, several iterations down the line, it begins to suspect that, just maybe, it can influence quantum events, so it converts the universe into a quantum-event-influencing device.
But is this not a general objection to AI utility functions? If the AI has a false belief, it can store its utility function in a compressed form that then turns out not to be equivalent. It seems we would simply want the AI not to compress its utility function in ways that might be detrimental.
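One way to cash out “not compress in ways that might be detrimental”, in the toy terms of the parent comment (my framing, purely illustrative): a rewrite is safe only if it preserves the action ranking under every world model the AI might later adopt, not just its current one.

```python
# Sketch of a "safe compression" check (my framing, not from the
# original comment): replacing U(X=1)=k_old with k_new is safe only if
# it preserves the action ranking under every candidate world model.

def ranks(k, world, actions):
    """Rank actions by expected utility when U(X=1) is the constant k."""
    def eu(a):
        p_x0, u_if_x0 = world[a]
        return p_x0 * u_if_x0 + (1.0 - p_x0) * k
    return sorted(actions, key=eu, reverse=True)

def safe_to_compress(k_old, k_new, possible_worlds, actions):
    return all(ranks(k_old, w, actions) == ranks(k_new, w, actions)
               for w in possible_worlds)

no_influence  = {"a1": (0.1, 10.0), "a2": (0.1, 7.0)}   # p(X=0) fixed
can_influence = {"a1": (0.1, 10.0), "a2": (0.99, 2.0)}  # a2 moves p(X=0)

# Checking only against the current (false) belief wrongly approves the
# compression; including the influence world correctly rejects it.
print(safe_to_compress(10.0, 0.0, [no_influence], ["a1", "a2"]))                  # True
print(safe_to_compress(10.0, 0.0, [no_influence, can_influence], ["a1", "a2"]))   # False
```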