It seems to me that every program behaves as if it were maximizing some utility function. You could try to restrict it by saying the utility function has to be “reasonable”, but how?
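For concreteness, here is the standard degenerate construction behind that first claim, as a minimal Python sketch (the deterministic-policy setup and all names here are illustrative assumptions, not anything more): for any program, define a utility function over action histories that scores 1 for exactly the history the program produces and 0 for everything else; the program is then the unique maximizer.

```python
from typing import Callable, List

Action = str
Observation = str

def degenerate_utility(
    policy: Callable[[List[Observation]], Action],
    observations: List[Observation],
) -> Callable[[List[Action]], float]:
    """Utility function over action histories that `policy` maximizes:
    score 1 for the exact history the program produces, 0 otherwise."""
    actual = [policy(observations[: t + 1]) for t in range(len(observations))]

    def utility(actions: List[Action]) -> float:
        return 1.0 if actions == actual else 0.0

    return utility

# Example: a thermostat-like program with no explicit goals at all.
def thermostat(obs_so_far: List[Observation]) -> Action:
    return "heat" if obs_so_far[-1] == "cold" else "idle"

obs: List[Observation] = ["cold", "cold", "warm"]
u = degenerate_utility(thermostat, obs)
print(u(["heat", "heat", "idle"]))  # 1.0 -- the thermostat's actual behavior
print(u(["idle", "idle", "idle"]))  # 0.0 -- every other history scores lower
```

Since this trick works for any program whatsoever, “maximizes some utility function” by itself rules nothing out, which is why some restriction on the utility function has to do the real work.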
If you say the utility function must have low complexity, that doesn’t work—human values are pretty complex.
If you say the utility function has to be about world states, that doesn’t work—human values are about entire world histories; you’d prevent suffering in the past if you could.
If you say the utility function has to be comprehensible to a human, that doesn’t work—an AI extrapolating octopus values could give you something pretty alien.
So I’m having trouble spelling out precisely, even to myself, how AIs that satisfy the “utility hypothesis” differ from those that don’t. How would you tell, looking at the AI and what it does?