Is my result wrong? Maths vs intuition vs evolution in learning human preferences

The mathematical result is clear: you cannot deduce human preferences merely by observing human behaviour (even with simplicity priors).

Yet many people instinctively reject this result; even I found it initially counter-intuitive. And you can make a very strong argument that it’s wrong. It would go something like this:

“I, a human $H$, can estimate what human $H'$ wants, just by observing their behaviour. And these estimations have evidence behind them: $H'$ will often agree that I’ve got their values right, and I can use this estimation to predict $H'$’s behaviour. Therefore, it seems I’ve done the impossible: gone from behaviour to preferences.”

Evolution and empathy modules

This is how I interpret what’s going on here. Humans (roughly) have empathy modules $E$ which allow them to estimate the preferences of other humans, and prediction modules $P$ which use the outcome of $E$ to predict their behaviour. Since evolution is colossally lazy, these modules don’t vary much from person to person.

So, for $b_{H''}$ a history of human $H''$’s behaviour in typical circumstances, the modules for two humans $H$ and $H'$ will give similar answers:

  • $E_H(b_{H''}) \approx E_{H'}(b_{H''})$.

Moreover, when humans turn their modules on their own behaviour, they get similar results. The human $H''$ will have privileged access to their own deliberations; so define $\hat{b}_{H''}$ as the internal history of $H''$. Thus:

  • $E_H(b_{H''}) \approx E_{H'}(b_{H''}) \approx E_{H''}(\hat{b}_{H''})$.

This idea connects with partial preferences/​partial models in the following way: $E_{H''}(\hat{b}_{H''})$ gives $H''$ access to their own internal models and preferences; so the approximately-equal signs above mean that, by observing the behaviour of other humans, we gain approximate access to their internal models.

Then $P_H$ just takes the results of $E_H$ to predict future behaviour; since $P$ and $E$ have co-evolved, it’s no surprise that $P$ would have a good predictive record.

So, given $E$, it is true that a human can estimate the preferences of another human, and, given $P$, it is true that they can use this knowledge to predict behaviour.
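
To make these relations concrete, here is a purely illustrative toy sketch (my own invention, not part of the argument above): a behaviour history is just a list of observed choices, the empathy module is a frequency count plus a small idiosyncratic term, and the prediction module picks whichever option the estimate favours.

```python
# A purely illustrative toy model of the E and P modules above (a sketch,
# not a claim about how human empathy actually works). Assumptions: a
# behaviour history b is a list of observed choices, E maps a history to
# estimated preference weights, and P picks the option that E's estimate
# favours. Different humans' E modules share the same core procedure plus
# a small idiosyncratic term.
import random
from collections import Counter

def empathy_module(history, idiosyncrasy=0.05, seed=0):
    """E: behaviour history -> estimated preference weights over options."""
    rng = random.Random(seed)
    counts = Counter(history)
    total = sum(counts.values())
    # Shared, evolution-given core: how often each option was chosen...
    weights = {option: n / total for option, n in counts.items()}
    # ...plus a little person-to-person variation.
    return {option: w + rng.uniform(-idiosyncrasy, idiosyncrasy)
            for option, w in weights.items()}

def prediction_module(preference_estimate):
    """P: uses E's output to predict the next behaviour."""
    return max(preference_estimate, key=preference_estimate.get)

b_H2 = ["tea", "tea", "coffee", "tea"]   # observed history of human H''

E_H_estimate = empathy_module(b_H2, seed=1)    # human H estimating H'''s preferences
E_H1_estimate = empathy_module(b_H2, seed=2)   # human H' estimating the same

# E_H(b_H'') is approximately E_H'(b_H''): the estimates differ only by the
# small idiosyncratic term, so P built on either one predicts the same behaviour.
assert prediction_module(E_H_estimate) == prediction_module(E_H1_estimate) == "tea"
```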

The problems

So, what are the problems here? There are three:

  1. $E$ and $P$ only function well in typical situations. If we allow humans to self-modify arbitrarily or create strange other beings (such as AIs themselves, or merged human-AIs), then our empathy and predictions will start to fail[1].

  2. It needs $E$ and $P$ to be given; but defining these for AIs is very tricky. Time and time again, we’ve found that tasks that are easy for humans to do are not easy for humans to program into AIs.

  3. The empathy and prediction modules are similar, but not identical, from person to person and culture to culture[2].

So both are correct: my result (without assumptions, you cannot go from human behaviour to preferences) and the critique (given these assumptions, which humans do in fact share, you can go from human behaviour to preferences).

And when it comes to humans predicting humans, the critique is more valid: listening to your heart/​gut is a good way to go. But when it comes to programming potentially powerful AIs that could completely transform the human world in strange and unpredictable ways, my negative result is more relevant than the critique is.

A note on assumptions

I’ve had some disagreements with people that boil down to me saying “without assuming A, you cannot deduce B”, and them responding “since A is obviously true, B is true”. I then go on to say that I am going to assume A (or define A to be true, or whatever).

At that point, we don’t actually have a disagreement. We’re saying the same thing (accept A, and thus accept B), with a slight difference of emphasis: I’m more “moral anti-realist” (we choose to accept A, because it agrees with our intuition), while they are more “moral realist” (A is true, because it agrees with our intuition). It’s not particularly productive to dig further.

In practice: debugging and injecting moral preferences

There are some interesting practical consequences to this analysis. Suppose, for example, that someone is programming a clickbait detector. They then gather a whole collection of clickbait examples, train a neural net on them, and fiddle with the hyperparameters till the classification looks decent.

But both “gathering a whole collection of clickbait examples” and “the classification looks decent” are not facts about the universe: they are judgements of the programmers. The programmers are using their own $E$ and $P$ modules to establish that certain articles are a) likely to be clicked on, but b) not what the clicker would really want to read. So the whole process is entirely dependent on programmer judgement: it might feel like “debugging”, or “making reasonable modelling choices”, but it’s actually injecting the programmers’ judgements into the system.
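
To see where the judgements enter, here is a minimal sketch of such a pipeline; the headlines, labels, and “looks decent” threshold are all invented for illustration, and scikit-learn stands in for whatever tooling the programmers actually use.

```python
# A minimal sketch of the clickbait-detector workflow (illustrative only;
# the headlines, labels, and threshold below are invented). The comments
# mark the two places where the programmers' own judgements, their E and P
# modules, enter the pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Judgement injection #1: the labels. Deciding that a headline is "clickbait"
# (likely to be clicked on, but not what the clicker really wants to read)
# is a call made by the programmers, not a fact read off the universe.
headlines = [
    "You won't believe what happened next",        # labelled clickbait
    "Doctors hate this one weird trick",           # labelled clickbait
    "Central bank raises interest rates by 0.5%",  # labelled not clickbait
    "City council approves new transit budget",    # labelled not clickbait
]
labels = [1, 1, 0, 0]  # 1 = clickbait, 0 = not, per the programmers' judgement

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(headlines, labels)

# Judgement injection #2: the acceptance criterion. "The classification looks
# decent" is again a programmer judgement, frozen here into an arbitrary
# accuracy threshold on data the programmers themselves labelled.
LOOKS_DECENT = 0.75
if model.score(headlines, labels) < LOOKS_DECENT:
    print("fiddle with the hyperparameters / gather more examples")
```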

And that’s fine! We’ve seen that different people have similar judgements. But there are two caveats: first, not everyone will agree, because there is not perfect agreement between the empathy modules. The programmers should check whether this is an area of strongly divergent judgements or not.

And second, these results will likely not generalise well to new distributions. That’s because having implicit access to categorisation modules that themselves are valid only in typical situations… is not a way to generalise well. At all.

Hence we should expect poor generalisation from such methods, to other situations and (sometimes) to other humans. In my opinion, if programmers are more aware of these issues, they will have better generalisation performance.
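
One practical upshot, continuing the toy clickbait sketch above (and reusing its `model`, `headlines`, and `labels`; the new headlines and labels are again invented): before relying on such a classifier elsewhere, it is worth scoring it on examples drawn from a different source and labelled by different people, so that both distribution shift and judgement divergence can show up as a gap.

```python
# Continuation of the clickbait sketch above: check the model on examples
# from a different source, labelled by different people, before trusting it
# outside its training distribution. (Purely illustrative data.)
ood_headlines = [
    "Ten photos that will restore your faith in humanity",  # a new style of clickbait
    "Quarterly earnings call: full transcript",             # a new style of non-clickbait
]
ood_labels = [1, 0]  # labelled by people other than the original programmers

in_dist_accuracy = model.score(headlines, labels)
shifted_accuracy = model.score(ood_headlines, ood_labels)
print(f"in-distribution accuracy: {in_dist_accuracy:.2f}, "
      f"shifted-distribution accuracy: {shifted_accuracy:.2f}")
```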


  1. ↩︎

    I’d consider the Star Trek universe to be much more typical than, say, 7th-century China. The Star Trek universe is filled with beings that are slight variants or exaggerations of modern humans, while people in 7th-century China had very alien ways of thinking about society, hierarchy, good behaviour, and so on. But that is still very typical compared with the truly alien beings that could exist in the space of all possible minds.

  2. ↩︎

    For instance, Americans will typically explain a certain behaviour by intrinsic features of the actor, while Indians will give more credit to the circumstances (Miller, Joan G. “Culture and the development of everyday social explanation.” Journal of Personality and Social Psychology 46.5 (1984): 961).