Suppose the human is trying to build a house and plans to build an AI to help with that. What would αHA and βHA mean—just at an intuitive level—in a case like that?
I suppose that to compute αHA you would sample many different arrangements of matter—some containing houses of various shapes and sizes and some not—and ask to what extent the reward received by the human correlates with the reward received by the AI. So this is like measuring to what extent the human and the AI are on the same page about the design of the house they are trying to build together—is that right?
And I suppose that to compute βHA you would look at—what—something like the optionality across different reward functions, for the human and for the AI, at different states, and compute a correlation? So you might sample a bunch of different floorplans for the house that the human is trying to build, and ask, for each configuration of matter, how much optionality the human and the AI each have to get the house to turn out according to their respective goal floorplans.
Did I get that approximately right?
I think you might have reversed the definitions of αHA and βHA in your comment,[1] but otherwise I think you’re exactly right.
To compute βHA (the correlation coefficient between terminal values), naively you’d have reward functions RH(s) and RA(s) that respectively assign human and AI rewards to every possible arrangement of matter s. Then you’d look at every such reward-function pair over your joint distribution DHA, and ask how correlated the two rewards are over arrangements of matter. If you like, you can imagine that the human has some uncertainty both about his own reward function over houses, and about how well aligned the AI is with that reward function.
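To make this concrete, here is a toy sketch of that naive computation. The joint distribution DHA is not specified in the write-up, so the sketch just models it as mixing the AI's reward with the human's at a made-up target correlation of 0.8; the point is only the shape of the estimator (correlate each sampled reward pair over states, then average):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 100    # arrangements of matter s
n_samples = 500   # reward-function pairs drawn from the joint distribution

def sample_reward_pair(target_corr=0.8):
    """Draw a correlated (R_H, R_A) pair over states.

    The joint distribution D_HA is a stand-in here: the AI's reward is
    the human's mixed with independent noise at a chosen correlation.
    """
    r_h = rng.standard_normal(n_states)
    noise = rng.standard_normal(n_states)
    r_a = target_corr * r_h + np.sqrt(1 - target_corr**2) * noise
    return r_h, r_a

# Estimate beta_HA as the average per-pair correlation over states.
corrs = []
for _ in range(n_samples):
    r_h, r_a = sample_reward_pair()
    corrs.append(np.corrcoef(r_h, r_a)[0, 1])

beta_ha = float(np.mean(corrs))
print(f"estimated beta_HA ~= {beta_ha:.2f}")
```

With enough samples the estimate recovers whatever correlation the joint distribution actually encodes.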
And to compute αHA (the correlation coefficient between instrumental values), you’re correct that some of the arrangements of matter s will be intermediate states in some construction plans. So if the human and AI both want a house with a swimming pool, they will both have high POWER for arrangements of matter that include a big hole dug in the backyard. Plot out their respective POWERs at each s, and you can read the correlation right off the alignment plot!
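A similarly hedged sketch of the instrumental side: this is not the full POWER definition from the write-up, just a crude one-step proxy (expected best achievable reward among the states reachable from s), computed for both agents over a hypothetical reachability graph and then correlated across states:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy deterministic environment: from state s you can move to reachable[s].
# States with more successors (a big hole already dug, say) have more
# optionality, hence higher POWER under most reward draws.
reachable = {
    0: [1, 2],
    1: [2, 3, 4],
    2: [3],
    3: [0, 4, 5],
    4: [5],
    5: [0, 1, 2, 3],
}
states = sorted(reachable)

def one_step_power(reward_sampler, n_draws=2000):
    """Crude one-step POWER proxy: E_R[max over successor states of R(s')]."""
    power = np.zeros(len(states))
    for _ in range(n_draws):
        r = reward_sampler()
        for s in states:
            power[s] += max(r[s2] for s2 in reachable[s])
    return power / n_draws

# Hypothetical reward distributions: a shared component plus agent noise.
base = rng.standard_normal(len(states))

def sample_r_h():
    return base + 0.5 * rng.standard_normal(len(states))

def sample_r_a():
    return base + 0.5 * rng.standard_normal(len(states))

power_h = one_step_power(sample_r_h)
power_a = one_step_power(sample_r_a)

# alpha_HA: correlation of the two agents' POWERs across states.
alpha_ha = float(np.corrcoef(power_h, power_a)[0, 1])
print(f"estimated alpha_HA ~= {alpha_ha:.2f}")
```

Since the two reward distributions here share most of their structure, the per-state POWERs line up and the correlation comes out high, which is the "read it off the alignment plot" picture.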
Looking again at the write-up, it would have made more sense for us to define αHA as the terminal goal correlation coefficient, since we introduce that one first. Alas, this didn’t occur to us. Sorry for the confusion.
OK, good, thanks for that correction.
One question I have is: how do you avoid two perfectly aligned agents from developing instrumental values concerning their own self-preservation and then becoming instrumentally misaligned as a result?
In a little more detail: consider two agents, both trying to build a house, with perfectly aligned preferences over what kind of house should be built. And suppose the agents have only partial information about the environment—enough, let’s say, to get the house built, but not enough to really understand what’s going on inside the other agent. Then wouldn’t the two agents both reason, “hey, if I die then who knows if this house will be built correctly; I better take steps towards self-preservation just to make sure that the house gets built”? Then the two agents might each take steps to build physical protection for themselves, to acquire resources with which to do that, and eventually to fight over resources, even though their goals are, in truth, perfectly aligned. Is it true that this would happen under an imperfect-information version of your model?
Great question. This is another place where our model is weak, in the sense that it has little to say about the imperfect information case. Recall that in our scenario, the human agent learns its policy in the absence of the AI agent; and the AI agent then learns its optimal policy conditional on the human policy being fixed.
It turns out that this setup dodges the imperfect-information question from the AI side, because the AI has perfect information about all the relevant parts of the human policy during its training. And it dodges the imperfect-information question from the human side, because the human never even considers the existence of the AI during its training.
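The two-phase structure described above can be sketched in a few lines. Everything here is a toy stand-in (a random deterministic MDP with made-up rewards), but it shows the asymmetry: the human plans as if alone, and the AI then plans over a world in which the human's fixed policy is folded into the transition structure:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy deterministic MDP: T[s, a] is the next state; reward depends on
# the state reached. Rewards and transitions are made up for illustration.
n_states, n_actions = 4, 2
T = rng.integers(0, n_states, size=(n_states, n_actions))
r_h = rng.standard_normal(n_states)   # hypothetical human reward
r_a = rng.standard_normal(n_states)   # hypothetical AI reward
gamma = 0.9

def greedy_policy(reward, transition, iters=200):
    """Value iteration on a deterministic MDP; returns the greedy policy."""
    v = np.zeros(n_states)
    for _ in range(iters):
        q = reward[transition] + gamma * v[transition]   # shape (s, a)
        v = q.max(axis=1)
    return q.argmax(axis=1)

# Phase 1: the human learns its policy with no AI in the environment.
human_policy = greedy_policy(r_h, T)

# Phase 2: the AI learns its optimal policy with the human policy held
# fixed -- the human's deterministic move happens first, then the AI acts.
s_after_human = T[np.arange(n_states), human_policy]
T_given_human = T[s_after_human]                          # shape (s, a)
ai_policy = greedy_policy(r_a, T_given_human)
print("human policy:", human_policy, "AI policy:", ai_policy)
```

Note how the AI's planning problem contains the human policy as a known, fixed part of the dynamics, which is exactly why imperfect information never arises on the AI side in this setup.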
This setup has the advantage that it’s more tractable and easier to reason about. But it has the disadvantage that it unfortunately fails to give a fully satisfying answer to your question. It would be interesting to see if we can remove some of the assumptions in our setup to approximate the imperfect information case.