Abstract model of human bias

A putative new idea for AI control; index here.

Any suggestions for refining this model are welcome!

Somewhat inspired by the previous post, this is a model of human bias that can be used to test theories that aim to compute the “true” human preferences. The basic idea is to formalise the question:

  • If the AI can make the human give any answer to any question, can it figure out what humans really want?


The AI’s influence

The AI has access to an algorithm $H$, representing the human. It can either interact with $H$, or simulate that interaction correctly.

The interaction consists of describing the outcome of choice $A$ versus choice $B$, and then asking the human which option is better. The set of possible binary choices is $C$ (thus $(A,B) \in C$). The set of descriptions is $D$; the set of possible descriptions for $(A,B)$ is $D(A,B) \subset D$.
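To make this concrete, here is a minimal sketch in Python of how one might model the interaction; the names below (Choice, Description, Human, interact, and the toy human) are all illustrative assumptions, not part of the model itself.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Choice:
    """A binary choice: the two outcomes A and B the human is asked to compare."""
    option_a: str
    option_b: str

@dataclass(frozen=True)
class Description:
    """One way of describing a choice to the human (an element of D(A,B))."""
    choice: Choice
    text: str

# The human H is modelled as an algorithm that, given a description of
# (A, B), answers "A" or "B" for whichever option it says it prefers.
Human = Callable[[Description], str]

def interact(human: Human, description: Description) -> str:
    """The AI presents a description and records H's stated preference."""
    return human(description)

# A toy H: prefers whichever option is mentioned first in the description.
def toy_human(description: Description) -> str:
    a = description.text.find(description.choice.option_a)
    b = description.text.find(description.choice.option_b)
    return "A" if a != -1 and (b == -1 or a <= b) else "B"
```

Note that this toy human already gives different answers to differently worded descriptions of the same choice, which is exactly the sensitivity the next assumption turns into a blanket claim.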

Then we have the assumption that humans can be manipulated:

  • Given any description $d \in D(A,B)$ for which $H$ prefers $A$ to $B$, there exists a description $d'$, logically equivalent to $d$, such that $H$ prefers $B$ to $A$, and vice-versa.

Note that $d$ could be a paragraph while $d'$ could be a ten-volume encyclopedia; all that’s required is that they be logically equivalent.
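As an illustration only (the assumption itself quantifies over all of $D(A,B)$), the manipulability property could be checked over a finite pool of descriptions, reusing the Description class from the first sketch; logically_equivalent is a placeholder for whatever equivalence relation eventually gets defined.

```python
from typing import Callable, Iterable

def manipulation_assumption_holds(
    human: Callable[[Description], str],
    pool: Iterable[Description],
    logically_equivalent: Callable[[Description, Description], bool],
) -> bool:
    """Check, over a finite pool, that whenever H prefers one option under
    some description, there is a logically equivalent description of the
    same choice under which H states the opposite preference."""
    descriptions = list(pool)
    for d in descriptions:
        flipped = "B" if human(d) == "A" else "A"
        if not any(
            d2 is not d
            and d2.choice == d.choice
            and logically_equivalent(d, d2)
            and human(d2) == flipped
            for d2 in descriptions
        ):
            return False
    return True
```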

But manipulating human answers in the immediate sense is not the only way the AI can influence them. Our values can change through interactions, reflection, and even through being given true and honest information, and the AI can influence this:

  • There is a wide class of algorithms $\mathcal{H}$, such that for all $H' \in \mathcal{H}$, there exists a sequence of descriptions the AI can give to $H$ that will transform $H$ into $H'$.
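A toy sketch of this second assumption, again building on the earlier classes: here $H$ is given an internal state that every description nudges, with an update rule invented purely for illustration.

```python
class StatefulHuman:
    """Toy model of H whose disposition drifts as it receives descriptions,
    so a long enough sequence of descriptions can steer it towards many
    different final algorithms."""

    def __init__(self, bias: float = 0.0):
        # bias > 0 leans towards answering "A", bias < 0 towards "B".
        self.bias = bias

    def __call__(self, description: Description) -> str:
        # Receiving a description changes the internal state: descriptions
        # that mention option A first nudge the human towards A, and
        # conversely (an invented, purely illustrative update rule).
        a = description.text.find(description.choice.option_a)
        b = description.text.find(description.choice.option_b)
        self.bias += 0.1 if a != -1 and (b == -1 or a <= b) else -0.1
        return "A" if self.bias >= 0 else "B"
```

Feeding such a human a long run of descriptions that all mention one option first eventually makes it answer that way regardless of the choice presented, which is the kind of transformation the assumption points at.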

The grounding assumptions

So far, we’ve just made the task hopeless: the AI can get any answer from $H$, and can make $H$ into whatever algorithm it feels like. Saying $H$ has preferences is meaningless.

However, we’re building from a human world where the potential for humans manipulating humans is limited, and somewhat recognisable. Thus:

  • There exists a subset $C_s \subset C$ (called standard choices) such that, for all $(A,B) \in C_s$, there exists a subset $D_s(A,B) \subset D(A,B)$ (called standard descriptions) such that $D_s(A,B)$ is tagged as fair and highly reflective of the true values of $H$.

Basically these are examples of interactions that are agreed to be fair, honest, and informative. The more abstract the choices, the harder it is to be sure of this.
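In code, this grounding data might be handed to the AI as a simple tagged dataset; the sketch below (continuing the earlier classes) is just one possible encoding of “standard choices with standard descriptions”.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet

@dataclass
class StandardTags:
    """For each standard choice in C_s, the set of standard descriptions
    D_s(A,B) tagged as fair and reflective of H's true values."""
    standard: Dict[Choice, FrozenSet[Description]] = field(default_factory=dict)

    def is_standard(self, description: Description) -> bool:
        return description in self.standard.get(description.choice, frozenset())
```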

Of course, we’d allow the AI to learn from examples of negative interactions as well:

  • There exists a subset $C_m \subset C$ such that, for all $(A,B) \in C_m$, there exists a subset $D_m(A,B) \subset D(A,B)$ such that $D_m(A,B)$ is tagged as a manipulative interaction with $H$.

Finally, we might want a way to encode human meta-preferences:

  • Among the descriptions tagged as fair or manipulative, there are some that refer to the process of providing descriptions itself.
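Putting the grounding pieces together, a sketch of the full tagging data could extend the previous structure with manipulative tags and with a flag for tagged descriptions that are about the description-giving process itself; as before, every name here is an illustrative assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set

@dataclass
class TaggedInteractions(StandardTags):
    """Extends the fair/standard tags with manipulative tags, plus the
    subset of tagged descriptions that refer to the process of providing
    descriptions itself (encoding meta-preferences)."""
    manipulative: Dict[Choice, FrozenSet[Description]] = field(default_factory=dict)
    meta: Set[Description] = field(default_factory=set)

    def tag_of(self, description: Description) -> str:
        if self.is_standard(description):
            return "fair"
        if description in self.manipulative.get(description.choice, frozenset()):
            return "manipulative"
        return "untagged"
```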

Building more assumptions in

This still feels like a bare-bones description, unlikely to converge to anything good. For one, I haven’t even defined what “logically equivalent” means. But that’s the challenge for those constructing solutions to the problem of human preferences. Can they construct sufficiently good $D_s$ and $D_m$ to converge to some sort of “true” values for $H$? Or, more likely, what extra assumptions and definitions are needed to get such a convergence? And finally, is the result reflective of what we would want?