The overfitting utility problem for value learning AIs

A putative new idea for AI control; index here.

Humans are biased and irrational (citation not needed) and so don’t provide consistent answers to questions.

To pick an extreme example, suppose the AI is uncertain whether humans value Cake or Death, but can phrase a sufficiently seductive or manipulative question to get humans to answer “Death” when asked. We’ll assume that more dull questions elicit “Cake” instead.


This poses a great problem for any AI doing value learning and trying to model a human-values utility function. If it assumes the human is rational, then there is no simple utility function which explains this behaviour.

Never fear, however: there is a utility function which explains this! The two universes differ in important ways: in one universe, the AI asked a seductive question; in the other, a dull one. Therefore the human values can be modelled as ranking the worlds as:

  • Death + seductive question asked ≻ Cake + seductive question asked, while Cake + dull question asked ≻ Death + dull question asked.

Any rational model of human preferences would reach this conclusion, and these would be the correct preferences as far as they could be observed: the model fits well with physical predictions about the future.
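
To make the overfitting concrete, here is a minimal sketch in Python (hypothetical features and toy numbers, not anything from the original post): a utility function defined only over outcomes cannot explain the observed answers, while one defined over (outcome, framing) pairs explains them perfectly.

```python
# Minimal sketch (hypothetical features, toy numbers): a value learner trying to
# explain the human's answers with a utility function over observed worlds.

# Observed choices: each pair is (world the human chose, world they rejected).
# The answer depends on how the question was framed, not just on the outcome.
observations = [
    ({"outcome": "cake", "framing": "dull"},
     {"outcome": "death", "framing": "dull"}),        # dull question -> "Cake"
    ({"outcome": "death", "framing": "seductive"},
     {"outcome": "cake", "framing": "seductive"}),    # seductive question -> "Death"
]

def explains(utility):
    """True if the utility ranks every chosen world above its rejected alternative."""
    return all(utility(chosen) > utility(rejected) for chosen, rejected in observations)

# Model 1: the human is rational and cares only about the outcome.
# Neither assignment of utilities to {cake, death} explains both answers.
outcome_only_models = [
    lambda w, u=u: u[w["outcome"]]
    for u in ({"cake": 1, "death": 0}, {"cake": 0, "death": 1})
]
print(any(explains(u) for u in outcome_only_models))   # False

# Model 2: utility over (outcome, framing) pairs -- the "overfit" model.
# It explains every observation, and it values death whenever the question was seductive.
def overfit_utility(w):
    table = {("cake", "dull"): 1, ("death", "dull"): 0,
             ("death", "seductive"): 1, ("cake", "seductive"): 0}
    return table[(w["outcome"], w["framing"])]

print(explains(overfit_utility))                        # True
```

Since the overfit model scores perfectly on the observed data, a learner that only rewards predictive fit has no reason to prefer the framing-independent model over it.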

And then, depending on what is easy or hard in the world, the AI could decide to ask the seductive question and start killing...

Note that we can’t avoid this problem by having the AI just count “asking the seductive question” as being part of its action set, and hence special. Once the question is asked, it’s just vibrations in the air, so the human preferences can be modelled as joint preferences over universes featuring cake or death together with certain patterns of vibration in the air.

To avoid this problem, we need the AI to:

  • Know that the human is irrational, correctly identify this situation as an example of it, and find correct meta-rational principles to decide what to do and how to ask.

It’s possible that many of the proposed designs will avoid this problem through the right learning sequence (if the AI can learn meta-principles early, this might help), but the problem can be used to show that many designs are not intrinsically safe for all initial priors over human values.
