Ultra-simplified research agenda

This is an ultra-condensed version of the research agenda on synthesising human preferences (video version here).

In order to infer what a human wants from what they do, an AI needs to have a human theory of mind.

Theory of mind is something that humans have instinctively and subconsciously, but that isn’t easy to spell out explicitly; therefore, by Moravec’s paradox, it will be very hard to implant it into an AI, and this needs to be done deliberately.
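To see why observation alone is not enough, here is a minimal illustrative sketch (a toy example made up for this summary, not something from the agenda itself): the same observed behaviour is explained equally well by a rational agent with one reward and by an anti-rational agent with the opposite reward, so some assumed model of how humans choose, a theory of mind, is needed to break the tie.

```python
# Toy sketch (illustrative assumption, not from the agenda): the same observed
# behaviour is compatible with opposite (rationality model, reward) pairs.

def behaviour(rationality: str, reward: dict) -> str:
    """Return the action an agent with this rationality model and reward picks."""
    best = max(reward, key=reward.get)
    worst = min(reward, key=reward.get)
    # A "rational" agent picks the best-rewarded action; an "anti-rational"
    # agent picks the worst-rewarded one.
    return best if rationality == "rational" else worst

observed_action = "save money"

# Two very different explanations of the same observation:
hypothesis_1 = ("rational", {"save money": 1.0, "spend money": 0.0})
hypothesis_2 = ("anti-rational", {"save money": 0.0, "spend money": 1.0})

for rationality, reward in (hypothesis_1, hypothesis_2):
    assert behaviour(rationality, reward) == observed_action  # both fit the data

# Behaviour alone cannot distinguish the two hypotheses; only an assumed model
# of how humans choose (a theory of mind) can break the tie.
print("Both hypotheses predict:", observed_action)
```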

One way of defining theory of mind is to look at how humans internally model the value of various hypothetical actions and events (happening to themselves and to others).
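As an illustration of what such an internal model might contain (the representation below is an assumption made up for this sketch, not the agenda’s formalism), it can be pictured as a collection of valuations of hypothetical events, indexed by who they happen to:

```python
# Illustrative sketch only: one way to represent a human's internal model of
# how much they value hypothetical events, happening to themselves or to others.
# The field names and numeric valences are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class HypotheticalValuation:
    event: str        # the imagined action or event
    subject: str      # who it happens to: "self" or another person
    valence: float    # internal judgement, e.g. -1 (bad) to +1 (good)

# A tiny slice of one human's internal model:
internal_model = [
    HypotheticalValuation("wins an award", "self", +0.9),
    HypotheticalValuation("wins an award", "rival", -0.2),
    HypotheticalValuation("falls ill", "self", -0.8),
    HypotheticalValuation("falls ill", "friend", -0.7),
]

# A theory of mind, in this framing, is whatever lets an observer reconstruct
# entries like these from the human's behaviour and statements.
for v in internal_model:
    print(f"{v.subject} {v.event}: valence {v.valence:+.1f}")
```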

Finally, once we have a full theory of mind, we still need to deal, somehow, with the fact that humans have meta-preferences over their preferences, and that these preferences and meta-preferences are often contradictory, changeable, manipulable, and (more worryingly) underdefined in the exotic worlds that AIs could produce.

Any way of dealing with that fact will be contentious, but it’s necessary to sketch out an explicit way of doing this, so it can be critiqued and improved.
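As one deliberately simple example of what an explicit way of doing this could look like (a toy aggregation rule assumed for illustration, not the agenda’s actual method), contradictory base preferences could be combined by letting the meta-preferences set the weight each one receives:

```python
# Deliberately simple, criticisable sketch (an assumption, not the agenda's
# actual method): synthesise contradictory preferences into one score by letting
# meta-preferences set how much weight each base preference receives.

# Base preferences: how much the human endorses each outcome, per preference.
preferences = {
    "eat the cake":  {"enjoy food": +1.0, "stay healthy": -0.6},
    "skip the cake": {"enjoy food": -0.3, "stay healthy": +0.8},
}

# Meta-preferences: how much the human wants each base preference to count.
meta_weights = {"enjoy food": 0.4, "stay healthy": 1.0}

def synthesised_score(outcome: str) -> float:
    """Weighted sum of base preferences, with weights set by meta-preferences."""
    return sum(meta_weights[p] * v for p, v in preferences[outcome].items())

for outcome in preferences:
    print(outcome, round(synthesised_score(outcome), 2))

# Writing a rule down this explicitly is exactly what makes it possible to
# critique and improve it.
```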

A toy model for this research agenda can be found here.