The problem of pseudofriendliness

The Friendly AI problem is complicated enough that it can be divided into a large number of subproblems. Two such subproblems could be:

  1. The problem of goal interpretation – This occurs when the results a human expects from an AI implementing a goal differ from the results that the AI actually works toward.

  2. The problem of innate drives (see Steve Omohundro’s ‘Basic AI Drives’ paper for more detail) – This occurs when either specific goals, or goal-based reasoning in general, creates subgoals that humans do not anticipate.

Let’s call an AI which does not suffer from these problems a pseudofriendly AI. Would this be a useful type of AI to produce? Well, maybe or maybe not. But even if it fails to be useful in and of itself, solving the pseudofriendly AI problem may be a helpful step toward developing the mode of thinking needed to solve the Friendly AI problem.

It’s also possible that pseudofriendliness might be able to interact usefully with Eliezer’s Coherent Extrapolated Volition (CEV—see here for more details). Eliezer has expressed CEV as follows:

In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.

However, an FAI is not to be given the CEV as its goal; rather, a superintelligence is to use our CEV to determine what goals an FAI should be given. What this means, though, is that there will be a point where a superintelligence exists that is not friendly. Could a pseudofriendly AI fill a gap here? Probably not—pseudofriendliness is not friendliness, nor should it be confused with it. However, it might be part of a solution that helps the CEV approach to be safely implemented.

Why all this hassle though? We seem to have exchanged one very important problem for two less important ones. Well, part of the benefit of pseudofriendliness is that it seems like it should be easier to formalise. First, let us introduce the concept of an interpretation system.

An interpretation system takes a partially specified world state (called a goal) and outputs a triple (Wx, Sx, Cx), where Wx is a partially specified world state, Sx is a set whose members are sets of subgoals, and Cx is the set of subgoals the system chooses to pursue.
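To make the triple a little more concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption of mine rather than part of the definition: world states are modelled crudely as sets of constraint strings, and `interpret` is just a stub standing in for a real interpretation system.

```python
from dataclasses import dataclass
from typing import FrozenSet

# A partially specified world state is modelled here as a set of constraints
# the world must satisfy; everything unmentioned is left open.
WorldState = FrozenSet[str]
SubgoalSet = FrozenSet[str]

@dataclass(frozen=True)
class Interpretation:
    """Output of an interpretation system: the triple (Wx, Sx, Cx)."""
    W: WorldState             # the interpreted (partially specified) world state
    S: FrozenSet[SubgoalSet]  # alternative subgoal sets that could bring W about
    C: SubgoalSet             # the subgoal set the system actually chooses

def interpret(goal: WorldState) -> Interpretation:
    """A toy interpretation system: maps a goal to (W, S, C).

    This stub passes the goal through unchanged; a real system would
    re-express the goal according to its own cognitive machinery.
    """
    subgoal_options = frozenset({frozenset({"build a sea wall"}),
                                 frozenset({"evacuate the islanders"})})
    return Interpretation(W=goal,
                          S=subgoal_options,
                          C=frozenset({"evacuate the islanders"}))
```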

What does all of this mean? Well, the input could be thought of as a goal (stop the humans on that island from being drowned by rising sea waters) expressed as a partial world state (i.e. the world state where the humans on the island remain undrowned). The interpretation system then outputs a partially specified world state, which may be the same or different. In humans, various aspects of our cognitive system would lead us to interpret this goal as a particular world state. For example, we would implicitly rule out tying all of the humans to giant stakes so that they were above the level of the water but unable to move or act. So we would output one world state while an AI may well output another. This is enough to specify the problem of goal interpretation as follows:

An interpretation system Ix, given a goal G, outputs Wx. A second interpretation system Iy, given the same goal, outputs Wy. Systems Ix and Iy suffer from the problem of goal interpretation if Wx ≠ Wy.
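Under the same toy assumptions as the sketch above, the check for this problem is just a comparison of the two output world states:

```python
# Continuing the sketch above; Ix and Iy stand in for any interpretation
# systems with the signature goal -> (W, S, C).
def goal_interpretation_problem(Ix, Iy, goal) -> bool:
    """Ix and Iy suffer from the goal interpretation problem on `goal`
    if they interpret it as different partially specified world states."""
    return Ix(goal).W != Iy(goal).W

# Hypothetical usage with the island example:
goal = frozenset({"the islanders are not drowned"})
print(goal_interpretation_problem(interpret, interpret, goal))  # False here, since both are the same stub
```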

The interpretation systems also output a set of subgoals to be used to bring about the world state (Cx) and a set of subgoal sets which could alternatively be used to bring it about (Sx). Going back to our rising sea water example, even if Wx = Wy, these are only partially specified world states and hence do not determine whether every aspect of the AI’s actions would produce outcomes that we want. This means that the subgoals used to reach a goal may still be undesirable. We can now specify the problem of innate drives as follows:

System Ix suffers from the weak problem of innate drives from the perspective of system Iy if Cx ≠ Cy. It suffers from the strong problem of innate drives if Cx is not a member of Sx.
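Continuing the same illustrative sketch, these two conditions can be written as simple checks (again, the representation is my own toy assumption, not part of the definition):

```python
# Checks for the two innate-drives conditions, reusing the toy types above.
def weak_innate_drives_problem(Ix, Iy, goal) -> bool:
    """Ix suffers the weak problem, from Iy's perspective, if the two
    systems choose different subgoal sets for the same goal."""
    return Ix(goal).C != Iy(goal).C

def strong_innate_drives_problem(Ix, goal) -> bool:
    """Ix suffers the strong problem if the subgoal set it chooses is not
    among the alternatives it itself identified as bringing about Wx."""
    out = Ix(goal)
    return out.C not in out.S
```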

If these definitions stand up, then pseudofriendly AI is certainly more formally specified than Friendly AI. But even if they do not, it seems plausible that pseudofriendliness will be easier to formalise than friendliness. If you buy that, then the questions remaining are:

  1. Do these definitions stand up, and if not, is it possible to formulate another version?

  2. What is the solution to the problems of pseudofriendliness?