How to generalize to multiple humans is… not an unimportant question, but a question whose salience is far, far out of proportion to its relative importance.
I expect it to be the hardest problem, not for technical reasons, but because of a lack of ground truth.
The question “how do I model the values of a human” has a simple ground truth: the human in question.
I doubt there’s such a ground truth for “how do I compress the values of all humans into one utility function?”. “All models are wrong, some are useful”, and all that, except that each human has their own opinion on “useful”, i.e. their own personal values. There would be a lot of inconsistencies; while I agree with your stance that “approximation is part of the game” when modeling the values of individual persons, people can wildly disagree on which approximations they are okay with, mostly based on how well the outcome agrees with their own values.
In other words: do you believe in the existence of at least one model where nobody can honestly say “the output of that model approximates away too much of my values”? If yes, what makes you think so?
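To make the worry concrete, here is a toy sketch (the agents, outcomes, and utility numbers are all invented purely for illustration) of the simplest version of the aggregation problem:

```python
# Toy example (all numbers invented): two agents with opposed values
# over three outcomes, compressed into a single utility function.
outcomes = ["A", "B", "C"]
alice = {"A": 1.0, "B": 0.0, "C": 0.4}
bob   = {"A": 0.0, "B": 1.0, "C": 0.4}  # directly opposed to Alice on A vs B

# One obvious compression: a weighted average of the two utilities.
w = 0.5  # the weight itself is a value judgment about whose values count
aggregate = {o: w * alice[o] + (1 - w) * bob[o] for o in outcomes}
best = max(outcomes, key=aggregate.get)

print(aggregate)  # {'A': 0.5, 'B': 0.5, 'C': 0.4}
print(best)       # 'A' (by tie-break): Bob's top preference is gone
```

Under any linear weighting here, the compromise outcome C never wins (max(w, 1−w) ≥ 0.5 > 0.4), so whichever agent loses the tie-break can honestly say the model approximated away too much of their values; a different rule such as maximin would pick C instead, but choosing the rule is itself a value judgment.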
I basically agree with this concern, and even more importantly, I think any attempt to compress the values of humanity as a collective will require you to put your thumb on the scale and infuse a lot of your own values, because there is likely no ground truth about what the correct morality is, nor even a consensus among humans about which values are useful to them.
I think the defensible version of the sentence you quote is that it doesn’t matter that much for extinction risk issues, because failure here doesn’t lead to an entire species going extinct with any appreciable likelihood.
I could imagine an argument that the lack of ground truth makes multi-agent aggregation disproportionately difficult, but that argument wouldn’t run through different agents having different approximation-preferences. In general, I expect the sort of approximations relevant to this topic to be quite robustly universal, and specifically mostly information-theoretic (think e.g. “distribution P approximates Q, quantified by a low KL divergence between the two”).
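For instance, here is a minimal sketch of that KL-divergence criterion (the distributions and numbers are invented for illustration):

```python
import numpy as np

def kl_divergence(q, p):
    """D_KL(Q || P): the information lost when P is used to approximate Q.

    q and p are discrete probability vectors over the same outcomes;
    the result is 0 iff they are equal, and grows as P fits Q worse.
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q_i = 0 contribute nothing to the sum
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Q is the "true" distribution; P1 approximates it closely, P2 poorly.
q  = [0.5, 0.3, 0.2]
p1 = [0.45, 0.35, 0.2]
p2 = [0.1, 0.1, 0.8]

print(kl_divergence(q, p1))  # ~0.006 nats: a good approximation
print(kl_divergence(q, p2))  # ~0.86 nats: a poor one
```

The point of the universality claim is that a criterion like this does not depend on which agent is asking: a low KL divergence is a low KL divergence regardless of anyone’s personal values.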