First of all, it tackles one of the core difficulties of AI safety in a fairly direct way — namely, the difficulty of specifying what we want AI systems to do (aka “outer alignment”).
I wouldn’t quite go so far as to say it “tackles” the problem of outer alignment, but it does tie into (pragmatic) attempts to solve the problem by identifying the ontology of realistically specifiable reward functions. However, maybe I’m misunderstanding you?
I’m not sure—what significance are you placing on the word “tackle” in this context? I would also not say that the main value proposition of this research agenda lies in identifying the ontology of the reward function—the main questions for this area of research may even be mostly orthogonal to that question.
I was taking it as “solves” or “gets pretty close to solving”. Maybe that’s a misinterpretation on my part. What did you mean here?
No, that is not a misinterpretation: I do think that this research agenda has the potential to get pretty close to solving outer alignment. More specifically, if it is (practically) possible to solve outer alignment through some form of reward learning, then I think this research agenda will establish how that can be done (and prove that this method works), and if it isn’t possible, then I think this research agenda will produce a precise understanding of why that isn’t possible (which would in turn help to inform subsequent research). I don’t think this research agenda is the only way to solve outer alignment, but I think it is the most promising way to do it.