Self-modification as a game theory problem

In this post I’ll try to show a surprising link between two research topics on LW: game-theoretic cooperation between AIs (quining, Loebian cooperation, modal combat, etc.) and stable self-modification of AIs (tiling agents, Loebian obstacle, etc.).

When you’re trying to cooperate with another AI, you need to ensure that its action will serve your utility function. And when doing self-modification, you also need to ensure that the successor AI will serve your utility function. In both cases, naive utility maximization doesn’t work, because you can’t fully understand another agent that’s as powerful and complex as you. That’s a familiar difficulty in game theory, and in self-modification it’s known as the Loebian obstacle (if you only accept successors you can fully understand, each one has to be weaker than the last).

In general, any AI will face two kinds of situations. In “single player” situations, you have a choice like eating chocolate or not, where you can figure out the outcome of each action. (Most situations covered by UDT are also “single player”, involving identical copies of yourself.) In “multiplayer” situations, by contrast, your action gets combined with the actions of other agents to determine the outcome. Both cooperation and self-modification are “multiplayer” situations, and are hard for the same reason. When someone proposes a self-modification to you, you might as well evaluate it with the same code that you use for game theory contests.
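As a toy illustration of that last point, here is a minimal Python sketch in which one function, `trusts`, is used both to pick a move in a one-shot Prisoner's Dilemma and to accept or reject a proposed successor. Everything in it is an assumption made for illustration: the names `mirror_bot`, `trusts`, `play_contest` and `accept_modification` are hypothetical, agents are plain functions, "C" stands for "does what I want", and the bounded simulation with an optimistic base case is a crude stand-in for Loebian proof search, not an implementation of it.

```python
from typing import Callable

# An agent takes (opponent, depth budget) and returns "C" or "D".
# This representation is an assumption of the sketch.
Agent = Callable[..., str]


def defect_bot(opponent: Agent, depth: int) -> str:
    """Always defects, whoever it faces."""
    return "D"


def cooperate_bot(opponent: Agent, depth: int) -> str:
    """Always cooperates, whoever it faces."""
    return "C"


def mirror_bot(opponent: Agent, depth: int) -> str:
    """Copy what a bounded simulation of the opponent does against us."""
    if depth == 0:
        # Optimistic base case: out of budget, assume cooperation.
        return "C"
    return opponent(mirror_bot, depth - 1)


def trusts(me: Agent, other: Agent, depth: int = 3) -> bool:
    """Shared check: does `other`, facing `me`, take the action `me` wants?"""
    return other(me, depth) == "C"


def play_contest(me: Agent, opponent: Agent) -> str:
    """Game theory use: cooperate only with opponents that pass the check."""
    return "C" if trusts(me, opponent) else "D"


def accept_modification(me: Agent, successor: Agent) -> bool:
    """Self-modification use: the same check, applied to a proposed successor."""
    return trusts(me, successor)
```

With these definitions, `mirror_bot` cooperates with a copy of itself and accepts that copy as a successor, and it does neither with `defect_bot`. The two wrappers differ only in how they report the answer.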

If I’m right, then any good theory for cooperation between AIs will also double as a theory of stable self-modification for a single AI. That means neither problem can be much easier than the other, and in particular self-modification won’t be a special case of utility maximization, as some people seem to hope. But on the plus side, we need to solve one problem instead of two, so creating FAI becomes a little bit easier.

The idea came to me while working on this mathy post on IAFF, which translates some game theory ideas into the self-modification world. For example, Loebian cooperation (from the game theory world) might lead to a solution for the Loebian obstacle (from the self-modification world) - two LW ideas with the same name that people didn’t think to combine before!