OK, good, thanks for that correction.
One question I have is: how do you prevent two perfectly aligned agents from developing instrumental values concerning their own self-preservation and then becoming instrumentally misaligned as a result?
In a little more detail: consider two agents, both trying to build a house, with perfectly aligned preferences over what kind of house should be built. And suppose the agents have only partial information about the environment: enough, let’s say, to get the house built, but not enough to really understand what’s going on inside the other agent. Then wouldn’t the two agents both reason, “hey, if I die, then who knows whether this house will be built correctly; I’d better take steps towards self-preservation just to make sure that the house gets built”? Then the two agents might each take steps to build physical protection for themselves, to acquire resources with which to do that, and eventually to fight over those resources, even though their goals are, in truth, perfectly aligned. Is it true that this would happen under an imperfect-information version of your model?
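To make the worry concrete, here is a toy expected-value version of that reasoning; the probabilities and payoffs are made up purely for illustration.

```python
# Toy model of one agent's self-preservation reasoning. Both agents share
# the same goal (the house), yet each computes the same incentive to divert
# resources toward protecting itself. All numbers are illustrative.
V_house = 10.0  # shared value of the finished house
p_die   = 0.3   # chance this agent dies without protective measures
p_alone = 0.6   # believed chance the OTHER agent finishes the house alone
c       = 1.0   # cost of building physical protection

ev_unprotected = (1 - p_die) * V_house + p_die * p_alone * V_house
ev_protected   = V_house - c  # protection assumed to guarantee survival

print(ev_unprotected, ev_protected)  # 8.8 vs. 9.0 -> invest in protection
```

Since both agents run the same calculation, both divert resources to self-preservation, and the fight over resources follows, despite their perfectly aligned terminal goals.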
Great question. This is another place where our model is weak, in the sense that it has little to say about the imperfect-information case. Recall that in our scenario, the human agent learns its policy in the absence of the AI agent, and the AI agent then learns its optimal policy conditional on the human policy being fixed.
This setup dodges the imperfect-information question from the AI side, because during training the AI has perfect information about all the relevant parts of the human policy. And it dodges the imperfect-information question from the human side, because the human never even considers the existence of the AI during its training.
This setup has the advantage of being more tractable and easier to reason about, but the disadvantage of failing to give a fully satisfying answer to your question. It would be interesting to see whether we can relax some of the assumptions in our setup to approximate the imperfect-information case.
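For concreteness, here is a minimal sketch of that two-phase setup, compressed to a one-shot decision problem (the actual model is sequential; the payoffs and variable names are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Phase 1: the human learns its policy in an environment with no AI present.
R_human_alone = np.array([1.0, 3.0])  # human's payoff for each of its actions
human_action = int(np.argmax(R_human_alone))

# Phase 2: the AI is added and optimizes the shared objective with the human
# policy held fixed and fully visible. From the AI's side there is no
# residual uncertainty about the human, so the imperfect-information
# question never arises; from the human's side, the AI was never modeled.
R_joint = np.array([  # R_joint[h, a]: shared payoff for (human, AI) actions
    [2.0, 0.0],
    [1.0, 4.0],
])
ai_action = int(np.argmax(R_joint[human_action]))

print(human_action, ai_action)  # -> 1 1
```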