It seems likely that there will be contradictions in human preferences that are sufficiently difficult for humans to understand that the AI system can't simply present the contradiction to the human and expect the human to resolve it correctly, which is what I was proposing in the previous sentence.
How relevant do you expect this to be? It seems like the system could act pessimistically, under the assumption that either answer might be the correct way to resolve the contradiction, and only do actions that are in the intersection of the set of actions that each possible philosophy says is OK. Also, I’m not sure the overseer needs to think directly in terms of some uber-complicated model of the overseer’s preferences that the system has; couldn’t you make use of active learning and ask whether specific actions would be corrigible or incorrigible, without the system trying to explain the complex confusion it is trying to resolve?
How relevant do you expect this to be? It seems like the system could act pessimistically, under the assumption that either answer might be the correct way to resolve the contradiction, and only do actions that are in the intersection of the set of actions that each possible philosophy says is OK.
It seems plausible that this could be sufficient; I didn't intend to rule out that possibility. I do think that we eventually want to resolve such contradictions, or have some method for dealing with them; otherwise we are stuck making very little progress (since I expect that creating very different conditions, e.g. through space colonization, will take humans "off-distribution", leading to lots of contradictions that could be very difficult to resolve).
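The pessimistic proposal above can be sketched in a few lines: act only within the intersection of the actions that every candidate resolution of the contradiction would permit. This is an illustrative toy, not anyone's actual proposal; the action names and the two "philosophies" are hypothetical.

```python
# Hedged sketch: pessimistic action selection under an unresolved
# preference contradiction. All names here are illustrative.

def permitted_actions(actions, philosophies):
    """Keep only actions that every candidate resolution
    (philosophy) of the contradiction judges acceptable."""
    return [a for a in actions if all(ok(a) for ok in philosophies)]

# Two incompatible resolutions of a hypothetical contradiction:
values_privacy = lambda a: a != "share_data"
values_transparency = lambda a: a != "withhold_report"

actions = ["share_data", "withhold_report", "ask_overseer"]
safe = permitted_actions(actions, [values_privacy, values_transparency])
# Only "ask_overseer" survives the intersection.
```

Note that the intersection can easily be empty or near-empty, which is one way of restating the worry that the system ends up able to make "not much progress" until the contradiction is resolved.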
I’m not sure the overseer needs to think directly in terms of some uber-complicated model of the overseer’s preferences that the system has; couldn’t you make use of active learning and ask whether specific actions would be corrigible or incorrigible, without the system trying to explain the complex confusion it is trying to resolve?
I didn’t mean that the complexity/confusion arises in the model of the overseer’s preferences. Even specific actions can be hard to evaluate—you need to understand the (agent’s expectation of the) long-term outcomes of that action, and then evaluate whether those long-term outcomes are good (which could be very challenging, if the future is quite different from the present). Or alternatively, you need to evaluate whether the agent believes those outcomes are good for the overseer.
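For concreteness, the active-learning proposal from the quoted question could look something like the sketch below: the system asks the overseer to label concrete actions as corrigible or not, prioritizing the actions it is least certain about, without ever explaining its internal model of the confusion. Everything here (the action names, the uncertainty scores, the toy overseer) is a hypothetical illustration; the reply above points out that even these per-action judgments can be hard when they hinge on long-term outcomes.

```python
# Hedged sketch: active learning over corrigibility labels.
# The system queries the overseer about specific actions,
# most-uncertain first, under a limited query budget.

def next_query(unlabeled, uncertainty):
    """Pick the action whose corrigibility the system is least sure about."""
    return max(unlabeled, key=lambda a: uncertainty[a])

def label_actions(unlabeled, uncertainty, overseer_judgment, budget):
    """Spend a limited query budget asking the overseer to label
    concrete actions as corrigible (True) or incorrigible (False)."""
    labels = {}
    remaining = list(unlabeled)
    for _ in range(min(budget, len(remaining))):
        action = next_query(remaining, uncertainty)
        labels[action] = overseer_judgment(action)
        remaining.remove(action)
    return labels

# Toy overseer who only has to judge specific actions:
overseer = lambda action: action != "disable_off_switch"

uncertainty = {"disable_off_switch": 0.9,
               "pause_and_report": 0.4,
               "log_telemetry": 0.1}
labels = label_actions(uncertainty, uncertainty, overseer, budget=2)
# Queries only the two most-uncertain actions.
```

The sketch makes the objection in the reply concrete: the hard part is not the query loop but implementing `overseer_judgment`, since labeling an action requires evaluating the agent's expected long-term outcomes of that action.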