I think you might have reversed the definitions of α_HA and β_HA in your comment,[1] but otherwise I think you’re exactly right.
To compute β_HA (the correlation coefficient between terminal values), naively you’d have reward functions R_H(s) and R_A(s) that respectively assign human and AI rewards to every possible arrangement of matter s. Then you’d look at every such reward function pair drawn from your joint distribution D_HA, and ask how correlated the two rewards are across arrangements of matter. If you like, you can imagine that the human has some uncertainty both about his own reward function over houses and about how well aligned the AI is with that reward function.
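As a rough illustration of that recipe, here is a minimal Python sketch. Everything concrete in it is an assumption made for the example rather than something from the write-up: the joint distribution D_HA is taken to be a bivariate Gaussian with per-state correlation `rho`, the states are just indices, and the names `sample_reward_pair` and `estimate_beta` are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def sample_reward_pair(n_states, rho, rng):
    """Draw one (R_H, R_A) pair over n_states.

    Assumption: rewards are standard Gaussians with correlation rho
    at each state. This stands in for the joint distribution D_HA.
    """
    r_h, r_a = [], []
    for _ in range(n_states):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        r_h.append(z1)
        r_a.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    return r_h, r_a

def estimate_beta(n_states, rho, n_pairs, seed=0):
    """Average, over sampled reward pairs, the correlation across states."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        r_h, r_a = sample_reward_pair(n_states, rho, rng)
        total += pearson(r_h, r_a)
    return total / n_pairs

if __name__ == "__main__":
    print(estimate_beta(n_states=200, rho=0.8, n_pairs=100))
```

With enough states and sampled pairs, the estimate should land close to whatever `rho` was built into the toy distribution, which is all the sketch is meant to show.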
And to compute α_HA (the correlation coefficient between instrumental values), you’re correct that some of the arrangements of matter s will be intermediate states in some construction plans. So if the human and AI both want a house with a swimming pool, they will both have high POWER for arrangements of matter that include a big hole dug in the backyard. Plot out their respective POWERs at each s, and you can read the correlation right off the alignment plot!
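To make the "plot POWER at each s and read off the correlation" step concrete, here is a toy Python sketch. It is deliberately simplified and leans on several assumptions that are not from the write-up: the four-state world in `TRANSITIONS` is hypothetical; each agent's POWER is computed as if it acted alone, which ignores the multi-agent interaction in the actual model; POWER at a state is taken to be the average optimal value with the current-state reward subtracted and scaled by (1 − γ)/γ; and rewards are Gaussian with per-state correlation `rho`.

```python
import math
import random

# Hypothetical toy world: state -> states reachable in one step.
TRANSITIONS = {
    0: [0, 1, 2, 3],  # a "hub" with many options (think: hole already dug)
    1: [1, 0],
    2: [2, 0],
    3: [3],           # a dead end with no options left
}

def optimal_values(reward, gamma, sweeps=300):
    """Value iteration for V*(s) = R(s) + gamma * max over successors."""
    v = {s: 0.0 for s in TRANSITIONS}
    for _ in range(sweeps):
        v = {s: reward[s] + gamma * max(v[t] for t in TRANSITIONS[s])
             for s in TRANSITIONS}
    return v

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def powers_and_alpha(rho, gamma=0.9, n_samples=200, seed=0):
    """Estimate per-state POWER for both agents and their correlation.

    Simplification: each agent is treated as acting alone in the toy world.
    """
    rng = random.Random(seed)
    states = sorted(TRANSITIONS)
    scale = (1 - gamma) / gamma
    power_h = {s: 0.0 for s in states}
    power_a = {s: 0.0 for s in states}
    for _ in range(n_samples):
        r_h, r_a = {}, {}
        for s in states:
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            r_h[s] = z1
            r_a[s] = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        v_h = optimal_values(r_h, gamma)
        v_a = optimal_values(r_a, gamma)
        for s in states:
            power_h[s] += scale * (v_h[s] - r_h[s]) / n_samples
            power_a[s] += scale * (v_a[s] - r_a[s]) / n_samples
    alpha = pearson([power_h[s] for s in states],
                    [power_a[s] for s in states])
    return power_h, power_a, alpha

if __name__ == "__main__":
    p_h, p_a, alpha = powers_and_alpha(rho=0.8)
    print(p_h, p_a, alpha)
```

The pairs `(power_h[s], power_a[s])` are the points you would scatter on an alignment plot, and `alpha` is the correlation read off from them. In this toy world the hub state ends up with more POWER than the dead end for both agents, which is the swimming-pool-hole intuition in miniature.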
Looking again at the write-up, it would have made more sense for us to define α_HA as the terminal goal correlation coefficient, since we introduce that one first. Alas, this didn’t occur to us. Sorry for the confusion.
OK, good, thanks for that correction.
One question I have is: how do you prevent two perfectly aligned agents from developing instrumental values concerning their own self-preservation and then becoming instrumentally misaligned as a result?
In a little more detail: consider two agents, both trying to build a house, with perfectly aligned preferences over what kind of house should be built. And suppose the agents have only partial information about the environment: enough, let’s say, to get the house built, but not enough to really understand what’s going on inside the other agent. Then wouldn’t the two agents both reason “hey, if I die then who knows if this house will be built correctly; I’d better take steps towards self-preservation just to make sure that the house gets built”? The two agents might then each take steps to build physical protection for themselves, to acquire resources with which to do that, and eventually to fight over resources, even though their goals are, in truth, perfectly aligned. Is it true that this would happen under an imperfect information version of your model?
Great question. This is another place where our model is weak, in the sense that it has little to say about the imperfect information case. Recall that in our scenario, the human agent learns its policy in the absence of the AI agent; and the AI agent then learns its optimal policy conditional on the human policy being fixed.
It turns out that this setup dodges the imperfect information question from the AI side, because the AI has perfect information on all the relevant parts of the human policy during its training. And it dodges the imperfect information question from the human side, because the human never considers even the existence of the AI during its training.
This setup has the advantage of being more tractable and easier to reason about, but the disadvantage that it fails to give a fully satisfying answer to your question. It would be interesting to see if we can remove some of the assumptions in our setup to approximate the imperfect information case.