There is no safe wish smaller than an entire human morality.
Except, of course, if your AI has already read trillions of tokens of text relevant to human values and morality, what humans want, and how they do things (including rescuing grandmothers from burning buildings). Then you can just point to that part of its world model, using abstract concepts. The best way to do that is actually natural language. We know; we tried all the other possibilities first.
Would you rather hand-craft a loss function for human values? It’s O(1GB) of data. Can you get it right on the first critical try?
The data on what humans value is in the training set. The "no safe wish" line is Eliezer's, from The Hidden Complexity of Wishes.