Suppose the human is trying to build a house and plans to build an AI to help with that. What would αHA and βHA mean—just at an intuitive level—in a case like that?
I suppose that to compute αHA you would sample many different arrangements of matter—some containing houses of various shapes and sizes and some not—and ask to what extent the reward received by the human correlates with the reward received by the AI. So this is like measuring to what extent the human and the AI are on the same page about the design of the house they are trying to build together—is that right?
And I suppose that to compute βHA you would look at—what—something like the optionality across different reward functions, for the human and for the AI, at different states, and compute a correlation? So you might sample a bunch of different floorplans for the house that the human is trying to build, and ask, for each configuration of matter, how much optionality the human and the AI each have to get the house to turn out according to their respective goal floorplans.
Did I get that approximately right?
I think you might have reversed the definitions of αHA and βHA in your comment,[1] but otherwise I think you’re exactly right.
To compute βHA (the correlation coefficient between terminal values), naively you’d have reward functions RH(s) and RA(s) that respectively assign human and AI rewards to every possible arrangement of matter s. Then you’d look at every such reward-function pair over your joint distribution DHA, and ask how correlated the two rewards are over arrangements of matter. If you like, you can imagine that the human has some uncertainty both about his own reward function over houses, and about how well aligned the AI is with that reward function.
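To make this concrete, here is a toy sketch of that naive computation. The joint distribution DHA is not specified in the write-up, so the sketch just models it as mixing the AI's reward with the human's at a made-up target correlation of 0.8; the point is only the shape of the estimator (correlate each sampled reward pair over states, then average):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 100    # arrangements of matter s
n_samples = 500   # reward-function pairs drawn from the joint distribution

def sample_reward_pair(target_corr=0.8):
    """Draw a correlated (R_H, R_A) pair over states.

    The joint distribution D_HA is a stand-in here: the AI's reward is
    the human's mixed with independent noise at a chosen correlation.
    """
    r_h = rng.standard_normal(n_states)
    noise = rng.standard_normal(n_states)
    r_a = target_corr * r_h + np.sqrt(1 - target_corr**2) * noise
    return r_h, r_a

# Estimate beta_HA as the average per-pair correlation over states.
corrs = []
for _ in range(n_samples):
    r_h, r_a = sample_reward_pair()
    corrs.append(np.corrcoef(r_h, r_a)[0, 1])

beta_ha = float(np.mean(corrs))
print(f"estimated beta_HA ~= {beta_ha:.2f}")
```

With enough samples the estimate recovers whatever correlation the joint distribution actually encodes.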
And to compute αHA (the correlation coefficient between instrumental values), you’re correct that some of the arrangements of matter s will be intermediate states in some construction plans. So if the human and AI both want a house with a swimming pool, they will both have high POWER for arrangements of matter that include a big hole dug in the backyard. Plot out their respective POWERs at each s, and you can read the correlation right off the alignment plot!
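A similarly hedged sketch of the instrumental side: this is not the full POWER definition from the write-up, just a crude one-step proxy (expected best achievable reward among the states reachable from s), computed for both agents over a hypothetical reachability graph and then correlated across states:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy deterministic environment: from state s you can move to reachable[s].
# States with more successors (a big hole already dug, say) have more
# optionality, hence higher POWER under most reward draws.
reachable = {
    0: [1, 2],
    1: [2, 3, 4],
    2: [3],
    3: [0, 4, 5],
    4: [5],
    5: [0, 1, 2, 3],
}
states = sorted(reachable)

def one_step_power(reward_sampler, n_draws=2000):
    """Crude one-step POWER proxy: E_R[max over successor states of R(s')]."""
    power = np.zeros(len(states))
    for _ in range(n_draws):
        r = reward_sampler()
        for s in states:
            power[s] += max(r[s2] for s2 in reachable[s])
    return power / n_draws

# Hypothetical reward distributions: a shared component plus agent noise.
base = rng.standard_normal(len(states))

def sample_r_h():
    return base + 0.5 * rng.standard_normal(len(states))

def sample_r_a():
    return base + 0.5 * rng.standard_normal(len(states))

power_h = one_step_power(sample_r_h)
power_a = one_step_power(sample_r_a)

# alpha_HA: correlation of the two agents' POWERs across states.
alpha_ha = float(np.corrcoef(power_h, power_a)[0, 1])
print(f"estimated alpha_HA ~= {alpha_ha:.2f}")
```

Since the two reward distributions here share most of their structure, the per-state POWERs line up and the correlation comes out high, which is the "read it off the alignment plot" picture.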
Looking again at the write-up, it would have made more sense for us to define αHA as the terminal goal correlation coefficient, since we introduce that one first. Alas, this didn’t occur to us. Sorry for the confusion.
OK, good, thanks for that correction.
One question I have is: how do you avoid two perfectly aligned agents from developing instrumental values concerning their own self-preservation and then becoming instrumentally misaligned as a result?
In a little more detail: consider two agents, both trying to build a house, with perfectly aligned preferences over what kind of house should be built. And suppose the agents have only partial information about the environment—enough, let’s say, to get the house built, but not enough to really understand what’s going on inside the other agent. Then wouldn’t the two agents both reason, “hey, if I die then who knows if this house will be built correctly; I better take steps towards self-preservation just to make sure that the house gets built”? Then the two agents might each take steps to build physical protection for themselves, to acquire resources with which to do that, and eventually to fight over resources, even though their goals are, in truth, perfectly aligned. Is it true that this would happen under an imperfect-information version of your model?
Great question. This is another place where our model is weak, in the sense that it has little to say about the imperfect information case. Recall that in our scenario, the human agent learns its policy in the absence of the AI agent; and the AI agent then learns its optimal policy conditional on the human policy being fixed.
It turns out that this setup dodges the imperfect-information question from the AI side, because the AI has perfect information about all the relevant parts of the human policy during its training. And it dodges the imperfect-information question from the human side, because the human never even considers the existence of the AI during its training.
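The two-phase structure described above can be sketched in a few lines. Everything here is a toy stand-in (a random deterministic MDP with made-up rewards), but it shows the asymmetry: the human plans as if alone, and the AI then plans over a world in which the human's fixed policy is folded into the transition structure:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy deterministic MDP: T[s, a] is the next state; reward depends on
# the state reached. Rewards and transitions are made up for illustration.
n_states, n_actions = 4, 2
T = rng.integers(0, n_states, size=(n_states, n_actions))
r_h = rng.standard_normal(n_states)   # hypothetical human reward
r_a = rng.standard_normal(n_states)   # hypothetical AI reward
gamma = 0.9

def greedy_policy(reward, transition, iters=200):
    """Value iteration on a deterministic MDP; returns the greedy policy."""
    v = np.zeros(n_states)
    for _ in range(iters):
        q = reward[transition] + gamma * v[transition]   # shape (s, a)
        v = q.max(axis=1)
    return q.argmax(axis=1)

# Phase 1: the human learns its policy with no AI in the environment.
human_policy = greedy_policy(r_h, T)

# Phase 2: the AI learns its optimal policy with the human policy held
# fixed -- the human's deterministic move happens first, then the AI acts.
s_after_human = T[np.arange(n_states), human_policy]
T_given_human = T[s_after_human]                          # shape (s, a)
ai_policy = greedy_policy(r_a, T_given_human)
print("human policy:", human_policy, "AI policy:", ai_policy)
```

Note how the AI's planning problem contains the human policy as a known, fixed part of the dynamics, which is exactly why imperfect information never arises on the AI side in this setup.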
This setup has the advantage that it’s more tractable and easier to reason about. But it has the disadvantage that it unfortunately fails to give a fully satisfying answer to your question. It would be interesting to see if we can remove some of the assumptions in our setup to approximate the imperfect information case.