Ultra-simplified research agenda

This is an ultra-condensed version of the research agenda on synthesising human preferences (video version here).

In order to infer what a human wants from what they do, an AI needs to have a human theory of mind.

Theory of mind is something that humans have instinctively and subconsciously, but that isn’t easy to spell out explicitly; therefore, by Moravec’s paradox, it will be very hard to implant it into an AI, and this needs to be done deliberately.
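To see why observation alone is not enough, here is a minimal illustrative sketch (a toy example made up for this summary, not something from the agenda itself): the same observed behaviour is explained equally well by a rational agent with one reward and by an anti-rational agent with the opposite reward, so some assumed model of how humans choose, a theory of mind, is needed to break the tie.

```python
# Toy sketch (illustrative assumption, not from the agenda): the same observed
# behaviour is compatible with opposite (rationality model, reward) pairs.

def behaviour(rationality: str, reward: dict) -> str:
    """Return the action an agent with this rationality model and reward picks."""
    best = max(reward, key=reward.get)
    worst = min(reward, key=reward.get)
    # A "rational" agent picks the best-rewarded action; an "anti-rational"
    # agent picks the worst-rewarded one.
    return best if rationality == "rational" else worst

observed_action = "save money"

# Two very different explanations of the same observation:
hypothesis_1 = ("rational", {"save money": 1.0, "spend money": 0.0})
hypothesis_2 = ("anti-rational", {"save money": 0.0, "spend money": 1.0})

for rationality, reward in (hypothesis_1, hypothesis_2):
    assert behaviour(rationality, reward) == observed_action  # both fit the data

# Behaviour alone cannot distinguish the two hypotheses; only an assumed model
# of how humans choose (a theory of mind) can break the tie.
print("Both hypotheses predict:", observed_action)
```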

One way of defining theory of mind is to look at how humans internally model the value of various hypothetical actions and events (happening to themselves and to others).
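As an illustration of what such an internal model might contain (the representation below is an assumption made up for this sketch, not the agenda’s formalism), it can be pictured as a collection of valuations of hypothetical events, indexed by who they happen to:

```python
# Illustrative sketch only: one way to represent a human's internal model of
# how much they value hypothetical events, happening to themselves or to others.
# The field names and numeric valences are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class HypotheticalValuation:
    event: str        # the imagined action or event
    subject: str      # who it happens to: "self" or another person
    valence: float    # internal judgement, e.g. -1 (bad) to +1 (good)

# A tiny slice of one human's internal model:
internal_model = [
    HypotheticalValuation("wins an award", "self", +0.9),
    HypotheticalValuation("wins an award", "rival", -0.2),
    HypotheticalValuation("falls ill", "self", -0.8),
    HypotheticalValuation("falls ill", "friend", -0.7),
]

# A theory of mind, in this framing, is whatever lets an observer reconstruct
# entries like these from the human's behaviour and statements.
for v in internal_model:
    print(f"{v.subject} {v.event}: valence {v.valence:+.1f}")
```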

Finally, once we have a full theory of mind, we still need to deal, somehow, with the fact that humans have meta-preferences over their preferences, and that these preferences and meta-preferences are often contradictory, changeable, manipulable, and (more worryingly) underdefined in the exotic worlds that AIs could produce.

Any way of dealing with that fact will be contentious, but it’s necessary to sketch out an explicit way of doing this, so it can be critiqued and improved.
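As one deliberately simple example of what an explicit way of doing this could look like (a toy aggregation rule assumed for illustration, not the agenda’s actual method), contradictory base preferences could be combined by letting the meta-preferences set the weight each one receives:

```python
# Deliberately simple, criticisable sketch (an assumption, not the agenda's
# actual method): synthesise contradictory preferences into one score by letting
# meta-preferences set how much weight each base preference receives.

# Base preferences: how much the human endorses each outcome, per preference.
preferences = {
    "eat the cake":  {"enjoy food": +1.0, "stay healthy": -0.6},
    "skip the cake": {"enjoy food": -0.3, "stay healthy": +0.8},
}

# Meta-preferences: how much the human wants each base preference to count.
meta_weights = {"enjoy food": 0.4, "stay healthy": 1.0}

def synthesised_score(outcome: str) -> float:
    """Weighted sum of base preferences, with weights set by meta-preferences."""
    return sum(meta_weights[p] * v for p, v in preferences[outcome].items())

for outcome in preferences:
    print(outcome, round(synthesised_score(outcome), 2))

# Writing a rule down this explicitly is exactly what makes it possible to
# critique and improve it.
```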

A toy model for this research agenda can be found here.