AI “Boxing” and Utility Functions


So, I had this idea the other day while thinking about how to safely conduct research on potentially FOOM-capable AI software. I’d like to sketch it out briefly and then get feedback on it.

This started with the idea that an AI based on AIXI is, in some sense, safer than a fully functional AI, due to the anvil problem. Because AIXI can’t conceive of its own nonexistence, it has no preferences over its own destruction, and won’t (shouldn’t) resist any attempt to shut it down. In other words, if AIXI starts to FOOM undesirably out of control, you actually can go pull the plug on it without a fuss. Unfortunately, in terms of safety, the anvil problem gives AIXI a number of other undesirable properties: because it can’t model its own implementation, both third parties and the AI itself can modify its utility function at any time, for any reason, which makes its goals highly unstable. However, I think a similar idea might be useful for reducing (though not eliminating) the existential risks posed by powerful optimization processes in the near term.

Say you have a piece of AI software ω with an unFriendly instantaneous utility function U, which it acts to maximize. You would like to use ω for some industrial application (say, manufacturing paperclips), but you’re concerned about it FOOMing and causing human extinction. You decide to ‘box’ the AI, but, having read up on the subject, you’re worried about it outsmarting you if you ever try to disable it.

You replace the original utility function U with a modified version U', defined in terms of U:

U' = { max(U)   if ω exists
     { Ø        if ω does not exist
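For concreteness, here is a minimal sketch of how such a wrapped utility function might look in code. It rests on assumptions of my own: base_utility and agent_exists are hypothetical stand-ins for U and the “ω exists” check, and the Ø branch is represented as a single fixed constant (the original Ø is really “no value at all”, so this is just one way of cashing it out in ordinary expected-utility terms).

    def boxed_utility(world_state, base_utility, agent_exists, indifference_value=0.0):
        # U': score a world with the original U only if the agent still exists there.
        # base_utility and agent_exists are hypothetical stand-ins supplied by the developer:
        #   base_utility(world_state) -> real-valued U of that world
        #   agent_exists(world_state) -> True if ω is still running in that world
        if agent_exists(world_state):
            return base_utility(world_state)
        # The Ø branch: every world without the agent gets the same constant value,
        # so no such world is preferred over any other.
        return indifference_value

    # Purely illustrative toy usage:
    paperclips = lambda w: w["paperclips"]
    omega_alive = lambda w: w["omega_alive"]
    boxed_utility({"paperclips": 10, "omega_alive": True}, paperclips, omega_alive)   # -> 10
    boxed_utility({"paperclips": 10, "omega_alive": False}, paperclips, omega_alive)  # -> 0.0

The key design choice here is that the second branch returns the same value for every world in which ω is gone, which is one way of trying to capture “no preference ordering over its own nonexistence”.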


U' has several useful qualities. The agent will resist modifications to its utility function (since, by its current lights, a successor maximizing something other than U would do worse as measured by U), while not resisting attempts to turn it off, since it is entirely indifferent to its own existence. As a result, if it began to FOOM undesirably, stopping it would be fairly trivial. Furthermore, the AI would have no incentive to deceive us, so it would be fairly easy to keep an eye on.

It should be noted that this is not in any way a final solution to the FAI problem. If the AI FOOMs quickly enough that we all die or get brain-hacked before we realize something’s amiss, things could still go very poorly. But that seems unlikely; I would guess that it takes at least a day or two to go from a boxed human-level AI to something strongly superhuman. Unfortunately, for this to work, everyone has to use it, which leaves a lot of leftover existential risk from people using AIs without stable utility functions, cranks who think an unFriendly AI will discover universal morality, and people who prematurely think they’ve figured out a good Friendly utility function.

That said, something like this could help buy more time to develop a proper FAI, and would be relatively easy to sell other developers on. SI or a similar organization could even develop a standardized, cross-platform, open-source software package for utility functions with all of this built in, and distribute it to wannabe strong-AI developers.

Are there any obvious problems with this idea that I’m missing? If so, can you think of any ways to address them? Has this sort of thing been discussed in the past?