I don’t want to speak for the original author, but presumably the AI would take into account that the Victorian society’s culture was changing through its interactions with the AI, and it would try to safeguard the new, updated values...until such a time as those new values became obsolete as well.
In other words, it sounds like under this scheme the AI’s conception of human values would not be hardcoded. Instead, it would observe our affect to identify which new activities had become terminal goals in their own right (ones we were intrinsically happy to participate in), and it would adapt to this change in human culture to facilitate the pursuit of those new activities.
That said, I’m still unsure how one could guarantee that the AI would not hack its own “human affect detector” to make things very easy for itself, forcing smiles onto everyone’s faces under torture and defining torture as the preferred human activity.
That’s a valid question, but note that it’s a different question from the one this model is addressing. (This model asks “what are human values, and what do we want the AI to do with them?”; your question here is “how can we prevent the AI from wireheading itself in a way that stops it from doing the things we want it to do?” “What” versus “how”.)
I endorse this comment.