A way of specifying utility functions for UDT

The original UDT post defined an agent’s preferences as a list of programs it cares about, and a utility function over their possible execution histories. That’s a good and general formulation, but how do we make a UDT agent care about our real world, seeing as we don’t know its true physics yet? It seems difficult to detect humans (or even paperclips) in the execution history of an arbitrary physics program.

But we can reframe the problem like this: the universal prior assigns a probability to every finite bitstring. If we like some bitstrings more than others, we can ask a UDT agent to influence the universal prior so as to make those bitstrings more probable. After all, the universal prior is an uncomputable mathematical object that is not completely known to us (and is even defined in terms of all possible computer programs, including ours). We already know that UDT agents can control such objects, and it turns out that this control can transfer to the real world.
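
For concreteness, here is one standard (Solomonoff-style) way such a prior is usually written down. The post deliberately leaves the exact definition open, so take this as an illustrative choice; strictly speaking M is a semimeasure rather than a probability distribution, but that detail doesn't matter for the argument.

```latex
% One standard way to define the universal prior, relative to a fixed
% prefix-free universal Turing machine U: a finite bitstring x gets the
% total weight of all programs p that output x, each program weighted
% by 2^{-|p|}.
M(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}
```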

For example, you could pick a bitstring that describes the shape of a paperclip, type it into a computer and run a UDT AI with a utility function which says “maximize the probability of this bitstring under the universal prior, mathematically defined in such-and-such way”. For good measure you could give the AI some actuators, e.g. connect its output to the Internet. The AI would notice our branch of the multiverse somewhere within the universal prior, notice itself running within that branch, seize the opportunity for control, and pave over our world so it looks like a paperclip from as many points of view as possible. The thing is like AIXI in its willingness to kill everyone, but different from AIXI in that it probably won’t sabotage itself by mining its own CPU for silicon or becoming solipsistic.
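
In that notation, the utility function handed to the AI is simply the prior weight of the chosen bitstring, with the value of M itself being the thing the agent tries to control. This is only a sketch; $x_{\mathrm{clip}}$ stands for the paperclip bitstring from the example and is not notation from the original post.

```latex
% Sketch of the "paperclip" utility from the example above:
% x_clip is the fixed bitstring describing the shape of a paperclip.
U_{\mathrm{clip}} \;=\; M(x_{\mathrm{clip}})
```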

This approach does not treat the universal prior as a fixed probability distribution over possible universes; instead, the agent can be viewed as picking from several different possible universal priors. A utility function specified this way doesn't always look like utility maximization from within the universal prior's own multiverse of programs. For example, you could ask the agent to keep a healthy balance between paperclips and staples in the multiverse, i.e. to minimize |P(A)-P(B)|, where A is a paperclip bitstring, B is a staple bitstring, and P is the universal prior.
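
To make the two flavors of objective concrete, here is a toy sketch in Python. The real universal prior is uncomputable, so the code replaces it with a tiny hand-enumerated list of "programs" weighted by 2^-length; all names (toy_prior, clip_utility, balance_utility) and the example bitstrings are hypothetical, chosen only to show the shape of the utility functions described above.

```python
# Toy illustration only: the real universal prior M is uncomputable.
# Here it is replaced by a finite, hand-enumerated "program" list,
# each entry being (program length in bits, output bitstring),
# weighted by 2^-length as in the Solomonoff-style definition above.
from typing import Dict, List, Tuple

TOY_PROGRAMS: List[Tuple[int, str]] = [
    (3, "0101"),  # pretend "paperclip" output, produced by a short program
    (5, "0101"),  # a longer program with the same output
    (4, "1111"),  # pretend "staple" output
    (6, "0011"),  # some other output
]

def toy_prior() -> Dict[str, float]:
    """Finite stand-in for M(x) = sum of 2^-|p| over programs p that output x."""
    m: Dict[str, float] = {}
    for length, output in TOY_PROGRAMS:
        m[output] = m.get(output, 0.0) + 2.0 ** -length
    return m

def clip_utility(m: Dict[str, float], clip: str = "0101") -> float:
    """'Paperclip' flavor: the prior probability of one chosen bitstring."""
    return m.get(clip, 0.0)

def balance_utility(m: Dict[str, float], a: str = "0101", b: str = "1111") -> float:
    """'Balance' flavor: minimizing |P(A) - P(B)| means maximizing its negation."""
    return -abs(m.get(a, 0.0) - m.get(b, 0.0))

if __name__ == "__main__":
    m = toy_prior()
    print("toy prior:", m)                        # {'0101': 0.15625, '1111': 0.0625, '0011': 0.015625}
    print("paperclip utility:", clip_utility(m))  # 0.15625
    print("balance utility:", balance_utility(m)) # -0.09375
```

Nothing like this can be computed in the real setting, of course; the point is only that both objectives are ordinary functions of the prior's values, which the agent then tries to influence.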

The idea is scary because well-meaning people can be tempted to use it. For example, they could figure out which computational descriptions of the human brain correspond to happiness and ask their AI to maximize the universal prior probability of those. If implemented correctly, that could work and even be preferable to unfriendly AI, but we would still lose many human values. Eliezer’s usual arguments apply in full force here. So if you ever invent a powerful math intuition device, please don’t go off building a paperclipper for human happiness. We need to solve the FAI problem properly first.

Thanks to Paul Christiano and Gary Drescher for ideas that inspired this post.