An overall schema for the friendly AI problems: self-referential convergence criteria

A putative new idea for AI control; index here.

After working for some time on the Friendly AI problem, it’s occurred to me that a lot of the issues seem related.

Speaking very broadly, there are two features they all share:

  • The convergence criteria are self-referential.

  • Errors in the setup are likely to cause false convergence.

What do I mean by that? Well, imagine you’re trying to reach reflective equilibrium in your morality. You do this by using good meta-ethical rules, zooming up and down at various moral levels, making decisions on how to resolve inconsistencies, etc… But how do you know when to stop? Well, you stop when your morality is perfectly self-consistent, when you no longer have any urge to change your moral or meta-moral setup. In other words, the stopping point (and the convergence to the stopping point) is entirely self-referentially defined: the morality judges itself. It does not include any other moral considerations. You input your initial moral intuitions and values, and you hope this will cause the end result to be “nice”, but the definition of the end result does not include your initial moral intuitions (note that some moral realists could see this process dependence as a positive—except for the fact that these processes have many convergent states, not just one or a small grouping).
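
To make the self-referential stopping criterion concrete, here is a deliberately toy sketch (the revision rule, the attractor values and the function names are all invented for illustration, not a model of actual extrapolation): a set of “values” gets revised until it is a fixed point of its own revision rule, and different starting intuitions settle into different, equally “stable” end points.

```python
def revise(values):
    """One step of toy 'reflective equilibrium': drift each value
    towards whichever attractor (-1, 0 or +1) it is closest to."""
    attractors = (-1.0, 0.0, 1.0)
    return [v + 0.5 * (min(attractors, key=lambda a: abs(a - v)) - v)
            for v in values]

def is_stable(values, tol=1e-6):
    """Self-referential stopping test: stop when revision no longer
    changes anything, i.e. the morality judges itself consistent.
    The initial intuitions appear nowhere in this test."""
    return all(abs(a - b) < tol for a, b in zip(values, revise(values)))

def extrapolate(values, max_steps=1000):
    for _ in range(max_steps):
        if is_stable(values):
            break
        values = revise(values)
    return values

# Two different starting intuitions, two different but equally
# "stable" end points:
print(extrapolate([0.4, -0.2]))   # converges near [0, 0]
print(extrapolate([0.6, -0.9]))   # converges near [1, -1]
```

The only thing the stopping test checks is the process’s own output; nothing in it says which of the many fixed points is the “nice” one.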

So when the process goes nasty, you’re pretty sure to have achieved something self-referentially stable, but not nice. Similarly, a nasty CEV will be coherent and have no desire to further extrapolate… but that’s all we know about it.

The second feature is that any process has errors—computing errors, conceptual errors, errors due to the weakness of human brains, etc… If you visualise these errors as noise, you can see that noise in a convergent process tends to cause premature convergence, because if the process ever reaches a stable self-referential state, it will stay there (and if the process is a long one, then early noise will cause great divergence at the end). For instance, imagine you have to reconcile your belief in preserving human cultures with your belief in human individual freedom. A complex balancing act. But if, at any point along the way, you simply jettison one of the two values completely, things become much easier—and once jettisoned, the missing value is unlikely to ever come back.
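
Here is an equally toy sketch of that point (again, the two values, the balancing rule and the error rate are all invented for illustration): an agent slowly balances two values, each step has a small chance of an error that jettisons one of them, and since a jettisoned value never re-enters the process, even rare errors end up determining where the process lands.

```python
import random

def balance_step(culture, freedom, error_rate, rng):
    """One step of reconciling two values, with a small chance of an
    error that jettisons one of them. Zero is an absorbing state:
    the balancing rule below never revives a jettisoned value."""
    if rng.random() < error_rate:
        if rng.random() < 0.5:
            culture = 0.0
        else:
            freedom = 0.0
    if culture > 0 and freedom > 0:
        mean = (culture + freedom) / 2
        culture += 0.1 * (mean - culture)
        freedom += 0.1 * (mean - freedom)
    return culture, freedom

def run(error_rate, steps=200, seed=0):
    rng = random.Random(seed)
    culture, freedom = 1.0, 0.6
    for _ in range(steps):
        culture, freedom = balance_step(culture, freedom, error_rate, rng)
    return culture, freedom

# Even a 1% error rate per step means most long runs end with one
# value gone for good:
lost = sum(1 for seed in range(1000)
           if 0.0 in run(error_rate=0.01, seed=seed))
print(f"runs ending with a jettisoned value: {lost}/1000")
```

The longer the extrapolation runs, the more chances there are for one of these absorbing errors to fire, which is why a long convergent process is so sensitive to early noise.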

Or, more simply, the system could get hacked. When exploring a potential future world, you could become so enamoured of it that you overwrite any objections you had. It seems very easy for humans to fall into these traps—and again, once you lose something of value in your system, you don’t tend to get it back.

Solutions

And again, very broadly speaking, there are several classes of solutions to deal with these problems:

  • Reduce or prevent errors in the extrapolation (eg solving the agent tiling problem).

  • Solve all or most of the problem ahead of time (eg the traditional FAI approach of specifying the correct values).

  • Make sure you don’t get too far from the starting point (eg reduced impact AI, tool AI, models as definitions).

  • Figure out the properties of a nasty convergence, and try to avoid them (eg some of the ideas I mentioned in “crude measures”, general precautions taken when defining the convergence process).
