How relevant do you expect this to be? It seems like the system could act pessimistically, under the assumption that either answer might be the correct way to resolve the contradiction, and only take actions that are in the intersection of the sets of actions that each possible philosophy says are OK.
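(To make that rule concrete, here is a minimal toy sketch, entirely my own illustration with made-up action names and "philosophies": the agent only keeps actions that every candidate resolution of the contradiction approves of.)

```python
from typing import Callable, Iterable, List, Set

def permitted_actions(philosophies: List[Callable[[str], bool]],
                      actions: Iterable[str]) -> Set[str]:
    """Keep only the actions that every candidate philosophy says are OK."""
    return {a for a in actions if all(ok(a) for ok in philosophies)}

# Two hypothetical, incompatible resolutions of a preference contradiction.
philosophy_a = lambda a: a in {"wait", "ask_overseer", "explore"}
philosophy_b = lambda a: a in {"wait", "ask_overseer", "exploit"}

print(permitted_actions([philosophy_a, philosophy_b],
                        ["wait", "ask_overseer", "explore", "exploit"]))
# -> {'wait', 'ask_overseer'} (set ordering may vary)
```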
It seems plausible that this could be sufficient; I didn’t intend to rule out that possibility. I do think that we want to eventually resolve such contradictions, or have some method for dealing with them, since otherwise we are stuck making little progress (I expect that creating very different conditions, e.g. through space colonization, will take humans “off-distribution”, leading to lots of contradictions that could be very difficult to resolve).
I’m not sure the overseer needs to reason directly about whatever uber-complicated model of the overseer’s preferences the system has; couldn’t you make use of active learning and ask whether specific actions would be corrigible or incorrigible, without the system trying to explain the complex confusion it is trying to resolve?
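(As a rough sketch of what such active-learning queries could look like, with hypothetical action names and a stand-in labeling rule, nothing here is from the original discussion:)

```python
import random
from typing import Callable, List

def query_overseer(action: str) -> bool:
    """Stand-in for asking the overseer: would taking `action` be corrigible?"""
    return action != "disable_off_switch"   # hypothetical labeling rule

def label_candidates(candidates: List[str],
                     oracle: Callable[[str], bool],
                     budget: int) -> List[str]:
    """Spend a limited query budget labeling candidates; keep those marked corrigible."""
    sampled = random.sample(candidates, min(budget, len(candidates)))
    return [a for a in sampled if oracle(a)]

candidates = ["write_report", "disable_off_switch", "ask_for_clarification"]
print(label_candidates(candidates, query_overseer, budget=2))
```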
I didn’t mean that the complexity/confusion arises in the model of the overseer’s preferences. Even specific actions can be hard to evaluate: you need to understand the agent’s expectation of the long-term outcomes of that action, and then evaluate whether those long-term outcomes are good (which could be very challenging if the future is quite different from the present). Alternatively, you need to evaluate whether the agent believes those outcomes are good for the overseer.