I believe you can strip the AI of any preferences towards human utility functions with a simple hack.
Every decision of the AI will have two kinds of effect: it will change the expected human utility, and it will change the human utility functions themselves.
Have the AI make its decisions based only on the effect on the current expected human utility, not on the changes to the functions themselves. Add a term granting a large disutility for deaths, and this should do the trick.
Note the importance of the “current” expected utility in this setup; an AI will decide whether to industrialise a primitive tribe based on their current utility; if it does industrialise them, it will base its subsequent decisions on their new, industrialised utility.
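A minimal sketch of what I have in mind, assuming a toy discrete setting (all names and types here are illustrative, not a real implementation): the agent scores each available action against the humans' current utility function plus a large penalty on expected deaths, and any effect the action would have on the utility function itself is simply left out of the score.

```python
from typing import Callable, Dict, Iterable

# Toy setting: outcomes are string labels; a utility function maps a label to a real number.
UtilityFn = Callable[[str], float]

def expected_utility(action: str,
                     outcome_probs: Dict[str, Dict[str, float]],
                     utility: UtilityFn) -> float:
    """Expected utility of `action` under a fixed ("current") utility function."""
    return sum(p * utility(outcome) for outcome, p in outcome_probs[action].items())

def choose_action(actions: Iterable[str],
                  outcome_probs: Dict[str, Dict[str, float]],
                  current_utility: UtilityFn,
                  death_prob: Dict[str, float],
                  death_penalty: float = 1e6) -> str:
    """Pick the action maximising current expected utility minus a large
    disutility for expected deaths; any change the action would make to the
    utility function itself is deliberately ignored by the score."""
    def score(action: str) -> float:
        return (expected_utility(action, outcome_probs, current_utility)
                - death_penalty * death_prob.get(action, 0.0))
    return max(actions, key=score)
```

Only after the chosen action has actually been carried out would the agent re-elicit the (possibly changed) utility function and treat that as "current" for its next decision.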
What if death isn’t well-defined? Suppose the AI has the option of cryonically freezing a person to save their life; once frozen, that person does not have any “current” utility function, so the AI can then disregard them completely. Situations like this also demonstrate that, more generally, trying to satisfy someone’s utility function may have an unavoidable side effect of changing their utility function. These side effects may be complex enough that the person does not foresee them, and it may not be possible for the AI to explain them to the person.
I think your “simple hack” is not actually that simple or well-defined.
It’s simple and it’s well defined; it just doesn’t work. Or at least, it doesn’t work naively in the way I was hoping.
The original version of the hack, on one-shot oracle machines, worked reasonably well. This version needs more work. And I shouldn’t have mentioned deaths here; that whole subject requires its own separate treatment.
What keeps the AI from immediately changing itself to care only about the people’s current utility function? That’s a change with very high expected utility, defined in terms of their current utility function, and one with little tendency to change that function.
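Plugging this objection into the toy sketch above (all labels and numbers here are hypothetical, chosen only to illustrate the point): a self-modification whose outcome the current utility function happens to rate highly, and which carries no death risk, wins the argmax, and nothing in the rule blocks it.

```python
# Continuing the toy sketch above; all labels and numbers are hypothetical.
current_utility = {
    "tribe_as_is": 1.0,
    "tribe_industrialised": 3.0,
    "ai_locked_to_current_utility": 5.0,  # assumed high under the *current* function
}.get

outcome_probs = {
    "do_nothing": {"tribe_as_is": 1.0},
    "industrialise": {"tribe_industrialised": 0.9, "tribe_as_is": 0.1},
    "self_modify": {"ai_locked_to_current_utility": 1.0},
}
death_prob = {"do_nothing": 0.0, "industrialise": 0.01, "self_modify": 0.0}

# "self_modify" scores highest: it is rated well by the current utility function
# and has little tendency to change that function, so the rule does not penalise it.
print(choose_action(list(outcome_probs), outcome_probs, current_utility, death_prob))
```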
Will you believe that a simple hack will work with lower confidence next time?
Slightly. I was counting on this one getting bashed into shape by the comments; it wasn’t, so in future I’ll try to do more of the bashing myself.
You meant “any preferences towards MODIFYING human utility functions”.
Yep