So to first note a few things:
The model is currently mostly focused on modeling the values of just a single person. I do talk a bit about how to figure out the values of entire societies of people, but that’s mostly outside the scope of this paper.
Evolution’s “original intent” doesn’t really matter for its own sake. The fact that sexual attraction evolved to further reproduction isn’t considered particularly important in itself; what matters is just that humans experience sexual attraction and find it pleasant (or not, as the case may be).
Those things said, your final step does sound reasonably close to the kind of thing I was thinking of. We can look at some particular individual, note that a combination of the surrounding culture and their own sexuality led them to try out flirting and dancing, find both rewarding, and then come to value those things for their own sake, and conclude that their ideal future would probably include fair amounts of both.
Though of course there are also all kinds of questions about, for example, exactly how rewarding and enjoyable they find those things. Maybe someone feels positive about the concept of being the kind of person who’d enjoy dance, but isn’t actually the kind of person who’d enjoy dance. Resolving that kind of conflict would probably mean either helping them learn to enjoy dance, or helping them give up the ideal of needing to be that kind of person. The correct action would depend on exactly how deep their reasons for not enjoying dance ran, and on what their other values were.
It’s also possible that, upon examining the person’s psychology, the AI would conclude that while they did enjoy flirting and dancing, there were other things they would enjoy even more, either right now or given enough time and effort. The AI might then work to create a situation where they could focus more on those more rewarding things.
Now, with these kinds of questions there is the issue of exactly what kinds of interventions the AI is allowed to make. After all, the most effective way of making somebody have maximally rewarding experiences would be to rewire their brain to always receive maximal reward. Here’s where I take a page out of Paul Christiano’s book and his suggestion of approval-directed agents, and propose that the AI is only allowed to do the kinds of things to the human that the human’s current values would approve of. So if the human doesn’t want to have their brain rewired, but is okay with the AI suggesting new kinds of activities that they might enjoy, then that’s what happens. (Of course, “only doing the kinds of things that the human’s current values would approve of” is really vague and hand-wavy at this point, and would need to be defined a lot better.)
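To make the intended shape of that constraint a little more concrete, here is a minimal toy sketch in Python; all the names and the approval predicate are purely illustrative, not anything from the actual proposal. The idea is just that the AI ranks candidate interventions by how rewarding it predicts them to be for the person, but only chooses among those that pass a filter standing in for what the person’s current values would approve of. Defining that approval predicate properly is exactly the vague and hand-wavy part.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Intervention:
    """A candidate action the AI could take toward the person."""
    description: str
    predicted_reward: float  # how rewarding the AI expects the outcome to be for the person

def choose_intervention(
    candidates: list[Intervention],
    approves: Callable[[Intervention], bool],
) -> Optional[Intervention]:
    """Pick the most rewarding intervention among those the person's
    *current* values would approve of; do nothing if none qualify."""
    approved = [c for c in candidates if approves(c)]
    if not approved:
        return None  # no approved option, so the AI takes no action
    return max(approved, key=lambda c: c.predicted_reward)

# Toy usage: rewiring the brain scores highest on predicted reward, but the
# person's current values veto it, so the AI falls back to merely suggesting
# new activities they might enjoy.
candidates = [
    Intervention("rewire brain for maximal reward", predicted_reward=1.0),
    Intervention("suggest trying a new dance class", predicted_reward=0.6),
    Intervention("suggest a different hobby entirely", predicted_reward=0.5),
]

def current_values_approve(intervention: Intervention) -> bool:
    # Stand-in for the (very much unsolved) problem of modeling what the
    # person's current values would actually endorse.
    return "rewire" not in intervention.description

print(choose_intervention(candidates, current_values_approve))
# -> Intervention(description='suggest trying a new dance class', predicted_reward=0.6)
```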