I mean the “complexity of value”/“value is fragile” thesis.
I agree with “complexity of value” in the sense that human preference, as a mathematical object, has high information content. But I don’t see a convincing argument from this premise to the conclusion that the best course of action for us, in the sense of maximizing our values under the constraints we’re likely to face, involves automated extraction of preferences rather than writing them down manually.
Consider the counter-example of someone who has the full complexity of human values, but would be willing to give up all of their other goals to fill the universe with orgasmium, if that choice were available. Such an agent could “win” by building a superintelligence with just that one value. How do we know, at this point, that our values are not like that?
Whatever the verdict on how acceptable the simplified values are, automated extraction of preference seems to be the only way to actually knowably win, rather than to strike a compromise, which is what simplified preference amounts to. We must decide from the information we have: how would you come to know that a particular simplified preference definition is any good? I don’t see a way forward that doesn’t involve first having a moral machine more precise than a human (but then we won’t need to consider simplified preference at all).