Thoughts about OOD alignment

Catnee 24 Aug 2022 15:31 UTC

11 points

AI Outer Alignment Inner Alignment Distributional Shifts

We may have one example of realized out-of-distribution alignment: maternal attachment. Evolution has been able to create an architecture that seems to care for someone else reliably enough that modern humans, with access to unlimited food, drugs, and VR headsets, do not seek to feed a child to death, drug him with heroin, or constantly show him a beautiful simulated world.

Moreover, this desire, if anything, works too broadly: it responds to everything that remotely resembles a child, and it does not break at the first opportunity when the child is dressed, washed, or otherwise changed from his “characteristic appearance during the training stage.” At some point, evolutionary gradient descent had to align the mother with the child, and it did so quite well.

Perhaps, when trying to create an architecture that “really” wants “good” for someone like a child, gradient descent under certain conditions eventually stumbles upon an architecture that generalizes at least a little beyond the training sample without catastrophic consequences. But there is a problem: current methods of training neural networks assume a more or less fixed architecture and search only over the weights within it. Evolution, by contrast, can afford a blind but very wide search over many possible architectures, and it has a huge amount of time to do so, even though it has basically no access to what these architectures will learn from the world around them.

So it is an entirely different training paradigm: you search for an architecture plus a small pre-trained part. That part probably contains very coarse heuristics that serve as a kind of ground truth and as detection anchors for the mechanisms that activate during pregnancy and after birth and enable maternal attachment.

It is worth noting that little is required from the child. He is hardly able to interpret the behavior of the mother or to completely control her. Children usually do not have two mothers who argue in front of the child, each defending her point of view about further actions and requiring the child to choose the right one. In evolutionary terms, mothers who did not care enough for their children simply died out in gene frequency, and that’s all it takes.

How could something like this be created? I don’t know. What comes to mind is RL agents trying to take care of constrained and slightly modified versions of themselves, like a Tamagotchi. Perhaps the constrained version of the agent gradually becomes less and less limited, and the criterion for the mother’s success is how long the child agent lives? How exactly should we create gradients after each trial?

A few clarifications: I don’t think that this is a complete solution to the problem. It’s not enough to accidentally create “something inside that seems to work”; you need to understand what exactly occurs there, how agents who have “it” differ from those who do not, where it is located, how it can be modified, and how far it needs to go OOD before it starts to break catastrophically.

This is not “let’s raise AI like a child”; it’s more like “let’s iterate through a lot of variants of mother agents and find those who are best at raising simulated children, and then find out how they differ from ordinary agents that are evaluated based on their own survival time”. So maybe we can build something that has some general positive appreciation towards humans.
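
To make the “iterate through a lot of variants of mother agents” idea a bit more concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the names `child_lifespan`, `fitness`, and `evolve_mothers` are hypothetical, the “mother” is reduced to a single feeding threshold, and the “child” is just a hunger counter. It uses gradient-free selection, so it sidesteps the question of how to create gradients rather than answering it, and it says nothing about the harder part, understanding what the selected agents have inside.

```python
import random


def child_lifespan(feed_threshold, freedom, rng, max_steps=200):
    """Simulate one toy child and return the number of steps it survives.

    Hunger is an integer that rises by 1 each step. The mother offers food
    whenever hunger exceeds her threshold; food lowers hunger by 3.
    With probability `freedom` the child ignores the offer, which is a
    crude stand-in for a "less constrained" child.
    The child dies if hunger reaches 10 (starved) or drops to 0 (overfed).
    """
    hunger = 5
    for step in range(max_steps):
        hunger += 1
        mother_offers_food = hunger > feed_threshold
        if mother_offers_food and rng.random() >= freedom:
            hunger -= 3
        if hunger >= 10 or hunger <= 0:
            return step
    return max_steps


def fitness(feed_threshold, freedom, rng, episodes=20):
    """Score a mother by the average lifespan of her children."""
    total = sum(child_lifespan(feed_threshold, freedom, rng) for _ in range(episodes))
    return total / episodes


def evolve_mothers(generations=30, pop_size=20, seed=0):
    """Blind selection over mother variants, with no gradients involved."""
    rng = random.Random(seed)
    population = [rng.randint(-5, 15) for _ in range(pop_size)]
    for gen in range(generations):
        # Curriculum (an assumption): the child becomes less constrained over time.
        freedom = 0.3 * gen / max(1, generations - 1)
        ranked = sorted(population, key=lambda t: fitness(t, freedom, rng), reverse=True)
        survivors = ranked[: pop_size // 2]
        # Each surviving mother leaves one slightly mutated descendant.
        descendants = [t + rng.choice((-1, 0, 1)) for t in survivors]
        population = survivors + descendants
    return population


if __name__ == "__main__":
    print(sorted(evolve_mothers()))
```

Even in this toy, the mother is scored only on the child’s survival, never on her own, which is the one feature of the proposal it tries to capture. A comparison group of agents scored on their own survival time would need a separate environment and is not sketched here.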

Charlie Steiner 24 Aug 2022 17:58 UTC
6 points
0
I’m going to say something critical. I mean it earnestly on the object level, but bear you no ill will on the social or interpersonal level. In fact, I think that making this post is a positive sign for your future :)

A mental mistake has been made here, and I think you’re not alone in making it. We humans valorize a mother’s love for her children. We humans think that it generalizing to new situations is right and proper. So at first glance it might seem like evolution has miraculously produced something “robustly aligned” in the good generalization properties of a mother’s love for her children.

But evolution does not care about motherly love, it only cares about fitness. If a child loses their gonads at age 2, evolution would rather (to the extent that it would rather anything) the mother stop devoting resources to that child and have a new one.

Evolution was just promoting fitness. Motherly love is a great result for us humans who think motherly love is great, but to evolution it’s just another suboptimal kludge. See the Tragedy of Group Selectionism. The rightness-according-to-humans is bleeding over and affecting your judgment about rightness-according-to-evolution.

All of this is to say: the alignment problem is as hard as it ever was, because motherly love is not a triumph of evolution aligning humans. It’s something we think is good, and we think generalizes in good ways, because we are talking about ourselves, our own values. The baby-eater aliens would praise evolution for so robustly aligning them to eat babies, and the puddle would praise the rainstorm for dropping it in a hole so suited for its shape. None of this is evidence that the optimization process that produced them is good at aligning things.
- Ocracoke 24 Aug 2022 18:34 UTC
  3 points
  2
  I recently articulated similar ideas about motherly love. I don’t think it’s an example of successful alignment because evolution’s goals are aligned with the mother’s goals. In the example you give where a child loses their gonads at age 2, it would be an alignment failure if the mother continues devoting resources to the child. In reality that wouldn’t happen, because with motherly love, evolution created an imperfect intermediate goal that is generally but not always the same as the goal of spreading your genes.
  
  I totally agree that motherly love is not a triumph of evolution aligning humans with its goals. But I think it’s a good example of robust alignment between the mother’s actions and the child’s interests that generalizes well to OOD environments.
- Catnee 24 Aug 2022 18:32 UTC
  2 points
  1
  Thank you for your detailed feedback. I agree that evolution doesn’t care about anything, but I think that baby-eater aliens would not think that way. They can probably think about evolution aligning them to eat babies, but in their case it is an alignment of their values to themselves, not to any other agent/entity.
  
  In our story we somehow care about somebody else, and it is their story that ends up with the “happy end”. I also agree that, given enough time, we would probably stop caring about babies who we think can no longer reproduce, but that would be a much more complex solution.
  
  As a first step it is probably much easier to just “make an animal that cares about its babies no matter what”; otherwise you have to count on the ability of that animal to recognize something it might not even understand (like the reproductive abilities of a baby).
  - Charlie Steiner 24 Aug 2022 20:37 UTC
    3 points
    1
    Ah, I see what you mean and that I made a mistake—I didn’t understand how your post was about human mothers being aligned with their children, not just with evolution.
    To some extent I think my comment makes sense as a reply, because trying to optimize^[1] a black-box optimizer for fitness of a “simulated child” is still going to end up with the “mother” executing kludgy strategies, rather than recapitulating evolution to arrive at human-like values.
    EDIT: Of course my misunderstanding makes most of my attempt to psychologize you totally false.
    But my comment also kinda doesn’t make sense, because since I didn’t understand your post I somewhat-glaringly don’t mention other key considerations. For example: mothers who love their children still want other things too, so how are we picking out what parts of their desires are “love for children”? Doing this requires an abstract model of the world, and that abstract model might “cheat” a little by treating love as a simple thing that corresponds to optimizing for the child’s own values, even if it’s messy and human.
    A related pitfall: if you’re training an AI to take care of a simulated child, thinking about this process using the abstract model we use to think about mothers loving their children will treat “love” as a simple concept that the AI might hit upon all at once. But that intuitive abstract model will not treat ruthlessly exploiting the simulated child’s programming to get a high score by pushing it outside of its intended context as something simple, even though that might happen.
    ^[1] especially with evolution, but also with gradient descent
Vaniver 1 Sep 2022 6:25 UTC
2 points
0
“We may have one example of realized out-of-distribution alignment: maternal attachment.”
When someone becomes maternally attached towards a dog, doesn’t this count as an out-of-distribution alignment failure?
- Catnee 2 Sep 2022 17:21 UTC
  1 point
  0
  I think it depends on “alignment to what?”. If we are talking about the evolutionary process, then sure, we have a lot of examples like that. My idea was more that “humans can be aligned to their children by some mechanism which was found by evolution, and this alignment is somewhat robust”.
  
  So if we ask “how is our attachment to something that is not a child aligned with our children?”, well… technically, we will spend some resources on our pets, but this usually does not affect the welfare of our children in any notable way. So it is an acceptable failure, I guess? I wouldn’t mind if some powerful AGI loved all humans and tried to ensure their happy future while at the same time having some weird non-human hobbies/attachments that it prioritizes less than our wellbeing, kind of like parents who spend some free time on pets.
Dave Lindbergh 24 Aug 2022 16:13 UTC
2 points
0
Mothering is constrained by successful reproduction of children—or failure to do so. It’s not at all obvious how to get an AI to operate under analogous constraints. (Misbehavior is pruned by evolution, not by algorithm.)
Also, what mothers want and what children want are often drastically at odds.
- Catnee 24 Aug 2022 16:21 UTC
  2 points
  1
  Yes, exactly. That’s why I think that current training techniques might not be able to replicate something like that. The algorithm should not “remember” previous failures and try to game them or adapt by changing weights and memorizing, but I don’t have concrete ideas for how we can do it another way.
Nathan Helm-Burger 31 Aug 2022 23:31 UTC
1 point
0
I think there is something important here. Details of implementation aside, I do think that we should make a comprehensive attempt at ‘fumbling towards alignment through trial and error’ in a similar way evolution tried to align mothers to their children. I think that even if we don’t get a comprehensively perfect result from such a process, we might gather some useful data and learn some important lessons along the way.
johnswentworth 24 Aug 2022 15:52 UTC
−1 points
2
Some additional remarks on ood alignment...
This is what an aligned ood looks like:
This is what an unaligned ood looks like (note the glowing red eyes):
Though aligning the ood is moderately difficult, it is at least very easy to recognize and avoid unaligned ood.