Ultra-simplified research agenda

This is an ultra-condensed version of the research agenda on synthesising human preferences (video version here):

In order to infer what a human wants from what they do, an AI needs a theory of mind for humans: a model connecting their behaviour to the preferences behind it.
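
To see why, consider a minimal sketch (all names and values invented for illustration) of the underdetermination problem: two different combinations of planner and reward predict exactly the same observed behaviour, so behaviour alone cannot distinguish them without theory-of-mind assumptions.

```python
# Minimal sketch: two (planner, reward) hypotheses that predict the
# same action. Without assumptions about how the human plans (a
# "theory of mind"), an observer cannot tell which reward is real.

def rational_planner(reward):
    """A human who takes the action they value most."""
    return max(reward, key=reward.get)

def anti_rational_planner(reward):
    """A human who (perversely) takes the action they value least."""
    return min(reward, key=reward.get)

likes_cake = {"eat_cake": 1.0, "eat_salad": 0.0}
hates_cake = {"eat_cake": 0.0, "eat_salad": 1.0}

# Both hypotheses predict the identical observation "eat_cake":
assert rational_planner(likes_cake) == anti_rational_planner(hates_cake) == "eat_cake"
```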

Theory of mind is something that humans have instinctively and subconsciously, but it isn’t easy to spell out explicitly; by Moravec’s paradox (skills that come instinctively to humans tend to be the hardest to reproduce in machines), it will therefore be very hard to build into an AI, and this will need to be done deliberately.

One way of defining theory of mind is to look at how humans internally model the value of various hypothetical actions and events (happening to themselves and to others).
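
As a toy illustration (every entry here is hypothetical), such an internal model can be represented as a valuation over (event, subject) pairs:

```python
# Toy representation of a human's internal value model: a map from
# hypothetical (event, subject) pairs to how good or bad the human
# judges that event to be. All entries are invented for illustration.

internal_model = {
    ("wins_prize", "self"):     +0.9,
    ("wins_prize", "rival"):    -0.2,  # a touch of envy
    ("gets_injured", "self"):   -0.9,
    ("gets_injured", "friend"): -0.7,  # others' harm also registers
}

def modelled_value(event, subject):
    """Value the human's internal model assigns to a hypothetical event."""
    return internal_model.get((event, subject), 0.0)  # unknown events are neutral

# The same event can carry different values depending on whom it befalls:
assert modelled_value("wins_prize", "self") > modelled_value("wins_prize", "rival")
```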

Finally, even once we have a full theory of mind, we still need some way of dealing with the fact that humans have meta-preferences over their preferences, and that these preferences and meta-preferences are often contradictory, changeable, manipulable, and (more worryingly) underdefined in the exotic worlds that AIs could produce.

Any way of dealing with that fact will be contentious, but it is necessary to sketch out an explicit method, so that it can be critiqued and improved; a deliberately crude example follows.
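
For concreteness, here is one such rule (purely illustrative, with invented names and numbers, not the agenda's actual proposal): weight each object-level preference by how strongly the person's meta-preferences endorse it.

```python
# Crude, illustrative synthesis: scale each preference's strength by the
# person's meta-level endorsement of it, so that disavowed preferences
# ("I wish I didn't crave cake") are down-weighted in the final values.

preferences = {          # object-level preference -> raw strength
    "eat_cake_now": 0.8,
    "stay_healthy": 0.6,
}

meta_endorsement = {     # endorsement of acting on each preference, in [0, 1]
    "eat_cake_now": 0.2, # "I wish I cared less about cake"
    "stay_healthy": 0.9,
}

def synthesise(prefs, endorsements):
    """Aggregate contradictory preferences using meta-preference weights."""
    return {p: round(strength * endorsements.get(p, 0.5), 3)  # 0.5 = neutral default
            for p, strength in prefs.items()}

print(synthesise(preferences, meta_endorsement))
# {'eat_cake_now': 0.16, 'stay_healthy': 0.54}
```

Even this toy rule makes its contentious choices explicit (the multiplicative weighting, the neutral default of 0.5), which is exactly what allows them to be critiqued and improved.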

A toy model for this research agenda can be found here.