Not only is it hard to disentangle manipulation from explanation; it is actually difficult to disentangle even manipulation from just asking the human about their preferences (like here).
Manipulation via incorrect “understanding” is IMO the somewhat easier problem (understanding could plausibly be tested by something like simulating the human’s capacity to predict). Manipulation via messing with our internal multi-agent system of values seems subtler and harder. (You can imagine an AI roughly in the shape of Robin Hanson, explaining to one part of the mind how some of the other parts work, or just drawing the attention of consciousness to some sub-agents and not others.)
My impression is that in full generality it is unsolvable, but something like starting with an imprecise model of approval / utility function learned via ambitious value learning, and restricting explanations/questions/manipulation by that model, may work.
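As a purely illustrative sketch of what “restricting by that model” could mean (all names here are hypothetical, not from any existing codebase): the AI only gets to send explanations or questions whose worst-case score under the imprecise approval model stays above some floor, so imprecision makes the filter more conservative rather than less.

```python
# Hypothetical sketch: filter the AI's candidate messages by an imprecise
# approval model. The model returns an interval (low, high) reflecting the
# uncertainty left over from ambitious value learning; a message is kept
# only if even the pessimistic end of the interval clears the floor.

from typing import Callable, List, Tuple

ApprovalInterval = Callable[[str], Tuple[float, float]]  # msg -> (low, high)


def allowed_messages(
    candidates: List[str],
    approval: ApprovalInterval,
    min_lower_bound: float = 0.0,
) -> List[str]:
    """Keep only messages whose worst-case approval is above the floor."""
    return [m for m in candidates if approval(m)[0] >= min_lower_bound]


# Toy stand-in model: wide, mostly-negative interval for messages that look
# like they address only some sub-agents of the human's mind.
def toy_model(msg: str) -> Tuple[float, float]:
    return (-1.0, 0.2) if "only part of you" in msg else (0.1, 0.9)


print(allowed_messages(
    ["Here is how option A affects your stated goals.",
     "Only part of you wants this; let me talk to that part."],
    toy_model,
))
```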
Yep. As is so often the case, I think these things are not fully value-agnostic, but don’t need full human values to be defined.