> Each time we come up against this barrier, it is tempting to add a new layer of indirection in our designs for AI systems.
I strongly agree with this characterization. I would say that my own “learning normativity” research direction has an avoiding-the-question character similar to the one you are pointing out here; I am in effect saying: Hey! We keep needing new layers of indirection! Let’s add infinitely many of them!
One reason I don’t spend very much time staring the question “what is goodness/wisdom” in the eyes is that the CEV write-up and other things convinced me that trying to answer this question on the object level (e.g., trying to write down the utility function for “goodness” rather than trying to come up with a working value-learning system) would, if successful, amount to “taking over the world” with your own values. It’s too easy to fool yourself.
To use a political analogy: you don’t want to install a dictator, even if that person is actually “really good”, because the process by which you put them in power was not legitimate; it did not involve everyone in the right way. People have tried this approach too many times, and it has gone wrong. So it’s better to follow a process with a better track record, and a more “fair” way of giving everyone input into the end result.
Moving on to a different point—to defend the methodology of adding layers of indirection, a bit: it seems plausible to me that each layer of indirection, if crafted well, makes a sort of progress.
We know something about what’s good, but I feel quite hopeless about an approach like “program in what’s good directly”: partly because of the “taking over the world” concern I already mentioned, but also just because I think it’s very hard (even if you’re fine with taking over the world), and humans are very, very liable to get it wrong.
We know something about how to do value learning; I still feel somewhat hopeless about a direct value-learning approach, but it feels significantly less hopeless than direct value specification.
I feel somewhat better about giving a system feedback about successful vs. unsuccessful value learning, rather than trying to directly specify a value-learning loss function, because this at least doesn’t fall prey to Stuart Armstrong’s impossibility argument for value learning (roughly: behavioral data alone cannot pin down an agent’s values, since any behavior is consistent with many different decompositions into “values” plus “irrationality”, so a value learner needs normative assumptions from somewhere beyond the data).
And so on.
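To make the shape of that ladder concrete, here is a deliberately toy sketch in Python. It is purely illustrative, not a real proposal (mine or anyone’s): every name in it is a hypothetical stand-in, and the `judge` function in particular stands in for human feedback about which value-learning runs were successful (in the toy, it cheats by peeking at the ground truth).

```python
# Toy illustration of "layers of indirection" (purely a sketch, not a proposal).
import random
from typing import Callable, List, Tuple

State = Tuple[float, float]                 # a "state" is just two features
ValueFn = Callable[[State], float]          # a utility function over states
LossGrad = Callable[[float, float], float]  # d(loss)/d(prediction)

# Level 0: write the values down directly (the doomed object level).
def handwritten_values(s: State) -> float:
    return 2.0 * s[0] - 1.0 * s[1]  # pretends we already know what's good

# Level 1: learn the values from labeled examples, given a hand-written loss.
def fit_values(data: List[Tuple[State, float]], loss_grad: LossGrad,
               lr: float = 0.05, steps: int = 500) -> ValueFn:
    w = [0.0, 0.0]
    for _ in range(steps):  # stochastic gradient descent on a linear model
        x, y = random.choice(data)
        pred = w[0] * x[0] + w[1] * x[1]
        g = loss_grad(pred, y)
        w[0] -= lr * g * x[0]
        w[1] -= lr * g * x[1]
    return lambda s: w[0] * s[0] + w[1] * s[1]

# Level 2: don't hand-write the loss either; select it using feedback about
# which value-learning runs were judged successful.
def fit_value_learner(candidates: List[LossGrad],
                      data: List[Tuple[State, float]],
                      judge: Callable[[ValueFn], float]) -> LossGrad:
    return max(candidates, key=lambda lg: judge(fit_values(data, lg)))

if __name__ == "__main__":
    random.seed(0)
    # Noisy labels standing in for imperfect human value judgments.
    data = [((x, y), handwritten_values((x, y)) + random.gauss(0, 0.1))
            for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)]
    squared = lambda p, t: 2 * (p - t)                # gradient of (p - t)^2
    absolute = lambda p, t: float((p > t) - (p < t))  # gradient of |p - t|
    # In real life the judge would be human feedback; the toy cheats and
    # scores a learned value function against the ground truth.
    judge = lambda v: -sum(abs(v(s) - handwritten_values(s)) for s, _ in data)
    best = fit_value_learner([squared, absolute], data, judge)
    print("judged-best loss:", "squared" if best is squared else "absolute")
```

The only point of the toy is the shape: each level replaces a hand-written object (first the values, then the value-learning loss) with something selected using information from the level above, and one can keep iterating.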
I won’t claim that this hierarchy approaches perfection in the limit. In particular, it’s still doomed if we don’t produce enough actual high-quality information to put in each level. (This is more like “staring the problem directly in the eyes”.) But it does seem like it becomes less doomed with each level of indirection.