I think it is fine to assume that the “true” (Coherent Extrapolated Volition) preferences of e.g. humans are transitive.
I think this could be safe or unsafe, depending on implementation details.
By your scare quotes, you probably recognize that there isn’t actually a unique way to get some “true” values out of humans. Instead, there are lots of different agent-ish ways to model humans, and these different ways of modeling humans will have different stuff in the “human values” bucket.
Rather than picking just one way to model humans, and following it and only it as the True Way, it seems a lot safer to build future AI that understands how this agent-ish modeling thing works, and tries to integrate lots of different possible notions of “human values” in a way that makes sense according to humans.
Of course, any agent that does good will make decisions, and so from the outside you can always impute transitive preferences over trajectories to this future AI. That’s fine. And we can go further, and say that a good future AI won’t do obviously-inconsistent-seeming stuff like do a lot of work to set something up and then do a lot of work to dismantle it (without achieving some end in the meantime—the more trivial the ends we allow, the weaker this condition becomes). That’s probably true.
But internally, I think it’s too hasty to say that a good future AI will end up representing humans as having a specific set of transitive preferences. It might keep several incompatible models of human preferences in mind, and then aggregate them in a way that isn’t equivalent to any single set of transitive preferences on the part of humans (which means that in the future it might allow or even encourage humans to do some amount of obviously-inconsistent-seeming stuff).
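To make that last point concrete, here’s a minimal toy sketch (my own illustration, with made-up option names, not anything from the original claim): three models of human preferences, each internally transitive, aggregated by pairwise majority vote. The aggregate exhibits a Condorcet-style cycle, so no single transitive preference ordering reproduces it.

```python
# Toy illustration: three internally transitive models of "human preferences"
# over options A, B, C, aggregated by pairwise majority vote. The aggregate
# is cyclic, i.e. not equivalent to any single transitive ordering.

# Each model ranks options from most to least preferred; each ranking is transitive.
models = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def prefers(ranking, x, y):
    """True if this model ranks x above y."""
    return ranking.index(x) < ranking.index(y)

def majority_prefers(x, y):
    """True if most models rank x above y."""
    votes = sum(prefers(r, x, y) for r in models)
    return votes > len(models) / 2

for x, y in [("A", "B"), ("B", "C"), ("A", "C")]:
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"aggregate prefers {winner} over {loser}")

# Output: A over B, B over C, C over A -- a cycle, so no transitive
# preference ordering reproduces the aggregate's pairwise choices.
```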