Consider this puzzle: I am able to talk and reason about “human values”. However, I cannot define human values, or give you a definite description of what human values are – if I could do this, I could solve a large part of the AI alignment problem by writing down a safe utility function directly. Nor can I give you a method for finding out what human values are – if I could do this, I could solve the problem of Inverse Reinforcement Learning. Moreover, I don’t think I could reliably recognize human values either – if you showed me a bunch of utility functions, I might not be able to tell whether any of them encodes human values. I’m not even sure I could reliably recognize methods for finding out what human values are – if you showed me a proposal for how to do Inverse Reinforcement Learning, I might not be able to tell whether the method truly learns human values.
On human values and unbounded optimization
One useful tool for arguing that we can’t define “human values” at the moment (one that isn’t explicitly used here, but which you probably know about) is thinking about what happens in the limit of optimization. Many utility functions are recognizably decent proxies for “human values” in the regime of low optimization; it’s when the optimization becomes enormous and unbounded that we lose our ability to foresee the consequences, due to logical non-omniscience.
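This regime shift can be sketched with a toy model in the spirit of Goodhart’s law. All the utility shapes below are my own illustrative assumptions, not anything from the text: a “true” utility whose benefit saturates while its cost keeps growing, and a linear proxy that tracks it well only while optimization pressure is low.

```python
# Toy illustration (assumed, not from the source): a proxy that is a decent
# stand-in for "true" utility at low optimization, but diverges under
# unbounded optimization of the proxy.

def true_utility(x: float) -> float:
    """Stand-in for 'human values': benefit saturates at 10, cost grows forever."""
    return min(x, 10.0) - 0.1 * x

def proxy_utility(x: float) -> float:
    """A proxy that approximates true_utility reasonably well while x is small."""
    return 0.9 * x

def optimize_proxy(budget: float, steps: int = 10_000) -> float:
    """Grid-search the x in [0, budget] that maximizes the proxy.

    'budget' plays the role of optimization power: a larger budget means the
    optimizer can push the proxy further.
    """
    candidates = [budget * i / steps for i in range(steps + 1)]
    return max(candidates, key=proxy_utility)

for budget in [5, 10, 100, 1000]:
    x_star = optimize_proxy(budget)
    print(f"budget={budget:>5}: proxy-optimal x={x_star:7.1f}, "
          f"true utility={true_utility(x_star):7.1f}")
# True utility first rises with optimization power (4.5, then 9.0),
# then collapses (0.0, then -90.0) as the proxy is pushed past the
# regime where it was a good approximation.
```

At budgets of 5 and 10 the proxy-optimizer lands on points we would endorse; at 100 and 1000 the same procedure, run harder, produces outcomes that are neutral and then strongly negative by the true measure. The shapes are contrived, but the qualitative point is the one in the paragraph above: a proxy can be fine under weak optimization and fail badly in the limit.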
Also note that the question of whether the resulting world (after unbounded optimization of the utility function) could even be recognized as contrary to “human values” is more debated.