I don’t follow. It seems like you agree that the human values we would like to have optimized are not a natural abstraction, and that some different concept would be learned instead if one attempted to learn human values as a natural abstraction. This seems like a stumbling block, but then your conclusion immediately says that it isn’t a stumbling block.
I’m kind of tipsy so maybe I’m missing something.
The most natural abstraction isn’t any specific model of human values, but a minimal model that captures what they have in common.
What can one use this minimal model for?
The minimal model may be the model that most agents performing unsupervised learning on human-generated data end up learning.
Alternatively, most other models imply the minimal model.
This tells us how we get that model but not what one can use it for.
I think using such a model as an optimisation target would be existentially safe.