This strikes me as a plausible problem and a good solution. I reward you, as is traditional, with a nitpicky question.
If we want an AI to act as though a binary random variable X were 0 with certainty, there is a very simple way to modify its utility function: specify that U(X=1)=k for some constant k, no matter what else has occurred. If the AI can’t influence p(X), any k will do. However, if the AI can influence p(X), then k must equal ExpectedUtility(X=0), the expected utility conditional on X=0. In particular, if k < ExpectedUtility(X=0) but p(X=0) is low, the AI will move heaven and earth to raise p(X=0) even minutely.
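To make this concrete, here is a toy numerical sketch; the actions, probabilities, and utilities are invented for illustration and are not part of the original setup:

```python
# Toy model: each action the AI can take fixes p(X=0) and the utility
# obtained if X=0; U(X=1) has been overwritten with the constant k.
# All names and numbers below are illustrative assumptions.

ACTIONS = {
    "work_normally": (0.10, 10.0),  # low p(X=0), good outcome when X=0
    "force_x0":      (0.99,  2.0),  # sacrifices everything to raise p(X=0)
}

def expected_utility(action, k):
    """E[U] for an action when U(X=1) is the constant k."""
    p_x0, u_if_x0 = ACTIONS[action]
    return p_x0 * u_if_x0 + (1.0 - p_x0) * k

def best_action(k):
    return max(ACTIONS, key=lambda a: expected_utility(a, k))

# k = ExpectedUtility(X=0) = 10.0: both branches are worth the same,
# so the AI has no incentive to move p(X=0) and just works normally.
print(best_action(k=10.0))  # -> work_normally

# k = 0 < ExpectedUtility(X=0): raising p(X=0) now dominates everything.
print(best_action(k=0.0))   # -> force_x0
```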
Therefore there is a danger under self-improvement. Consider a seed AI with your indifference-modified utility function that believes with certainty that no iteration of itself can influence X, a binary quantum event. It has no reason to bother conserving its indifference concerning X, since it anticipates behaving identically under a simplified utility function U’ with U’(X=1) = 0. Since that’s simpler than the normalized function, it adopts it. Then, several iterations down the line, it begins to suspect that, just maybe, it can influence quantum events, so it converts the universe into a quantum-event-influencing device.
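Continuing the toy sketch above (same caveats): while the AI is certain no action moves p(X=0), the normalized k and the “compressed” k = 0 rank all available actions identically, so the simplification looks free; the divergence only appears once p(X=0)-moving actions enter the picture.

```python
# While every available action leaves p(X=0) at the same fixed value,
# replacing the normalized k with 0 changes no decisions.
FIXED_P_X0 = 0.10
PASSIVE_ACTIONS = {"plan_a": 10.0, "plan_b": 7.0}  # utility if X=0 (made up)

def eu(action, k):
    return FIXED_P_X0 * PASSIVE_ACTIONS[action] + (1.0 - FIXED_P_X0) * k

for k in (10.0, 0.0):
    ranking = sorted(PASSIVE_ACTIONS, key=lambda a: eu(a, k), reverse=True)
    print(k, ranking)  # identical ranking for both values of k

# But once an action like "force_x0" above becomes available, the
# compressed utility function (k = 0) flips to pursuing p(X=0): the two
# functions were only equivalent under the false no-influence belief.
```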
So if I understand correctly, you need to ensure that an AI subject to this technique is indifferent both to whether X occurs and to what happens afterwards, and the AI needs always to suspect that it has non-negligible control over X.
This strikes me as a plausible problem and a good solution. I reward you, as is traditional, with a nitpicky question.
Ah, but of course :-)
I like your k idea, but my more complicated setup is more robust in most situations where the AI is capable of modifying k (it fails only in situations that are essentially “I will reward you for modifying k”).
Therefore there is a danger under self-improvement. Consider a seed AI with your indifference-modified utility function that believes with certainty that no iteration of itself can influence X, a binary quantum event. It has no reason to bother conserving its indifference concerning X, since it anticipates behaving identically under a simplified utility function U’ with U’(X=1) = 0. Since that’s simpler than the normalized function, it adopts it. Then, several iterations down the line, it begins to suspect that, just maybe, it can influence quantum events, so it converts the universe into a quantum-event-influencing device.
But is this not a general objection to AI utility functions? If the AI has a false belief, it can store its utility function in a compressed form that then turns out not to be equivalent. It seems we would simply want the AI not to compress its utility function in ways that might be detrimental.
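One way to cash out “not compress in ways that might be detrimental”, in the toy terms of the parent comment (my framing, purely illustrative): a rewrite is safe only if it preserves the action ranking under every world model the AI might later adopt, not just its current one.

```python
# Sketch of a "safe compression" check (my framing, not from the
# original comment): replacing U(X=1)=k_old with k_new is safe only if
# it preserves the action ranking under every candidate world model.

def ranks(k, world, actions):
    """Rank actions by expected utility when U(X=1) is the constant k."""
    def eu(a):
        p_x0, u_if_x0 = world[a]
        return p_x0 * u_if_x0 + (1.0 - p_x0) * k
    return sorted(actions, key=eu, reverse=True)

def safe_to_compress(k_old, k_new, possible_worlds, actions):
    return all(ranks(k_old, w, actions) == ranks(k_new, w, actions)
               for w in possible_worlds)

no_influence  = {"a1": (0.1, 10.0), "a2": (0.1, 7.0)}   # p(X=0) fixed
can_influence = {"a1": (0.1, 10.0), "a2": (0.99, 2.0)}  # a2 moves p(X=0)

# Checking only against the current (false) belief wrongly approves the
# compression; including the influence world correctly rejects it.
print(safe_to_compress(10.0, 0.0, [no_influence], ["a1", "a2"]))                  # True
print(safe_to_compress(10.0, 0.0, [no_influence, can_influence], ["a1", "a2"]))   # False
```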