How evolution succeeds and fails at value alignment

Disclaimer: I don’t have a background in alignment research or reinforcement learning, and I don’t think any of the ideas discussed here are new, but they might be interesting to some.

A recent post suggested that humans provide an untapped wealth of evidence about alignment. I strongly agree with that and I found it interesting to think about how nature can ensure that a mother’s values are aligned with the well-being of her children.

One particular thing this made me realize is that there are two very different reasons why an agentic AGI might behave in a way that we would characterize as misaligned:

1. The reward function was poorly specified, and the model didn’t learn the values we wanted it to learn.
2. The model did learn the values we wanted it to learn, but those values led it to conclusions and actions that we did not anticipate.

Let’s go back to thinking about how evolution aligns mothers with their children. If we could get an AGI to love all humans the way most (mammalian) mothers love their children, that would not necessarily solve the alignment problem, but it would be far better than an AGI that doesn’t care much about humans.
We can think of humans as reinforcement learning agents with a range of different goals, motivations, and desires that are all, in one way or another, instrumental to survival and reproduction in an environment similar to the one our ancestors lived in some 10,000 years ago. Some of those goals represent pretty simple values, like avoiding pain or maintaining a certain blood sugar level, but others represent more complex values, like ensuring the well-being of our children. The biological link between pain or food intake and a reward signal like dopamine can be quite simple, but how is something like “care for your children” encoded as a value in a way that generalizes to out-of-distribution environments?

Two examples of misaligned AGIs that have been discussed around here are a system that is supposed to prevent a diamond from being stolen but can be tricked by placing an image of the diamond in front of the security camera, and a strawberry-picking robot that was supposed to learn to put strawberries in a bucket but instead learned to fling red things at light sources.
Alignment failures of that sort do occur in nature. In some birds, simple visual and auditory cues trigger chick-feeding behavior, and the absence of a single cue can lead a mother bird to stop feeding her chick even if it is otherwise healthy. And then there are poorly disguised brood parasites like cuckoos, whose chicks nevertheless do trigger a feeding response.

Humans seem to be more robust to alignment failures like that; at the very least, the absence of any single sensory cue will not stop a mother from caring for her child. I think the reason “care for your children” is perhaps more robustly instilled in humans than in some non-mammalian species is that a range of different sensory cues, both simple and complex, trigger dopamine responses in a human mother. “I love my child” might be a parsimonious emotional and cognitive concept that naturally forms in the presence of different reward signals triggered by different aspects of caring for a child or being near a child. I think there are at least two factors that make it more likely that this goal is learned robustly:
1. Multiple independent sensory cues (visual, auditory, olfactory), all associated with being near her child or perceiving her child’s well-being, that lead to dopamine responses.
2. Dopamine responses may not be triggered only by very simple sensory cues; they could also be triggered by higher-level abstractions that are informative about the child’s well-being. Maybe dopamine is released when a mother sees her child smile. This is not particularly high-level, but it is more so than detecting pheromones, for example, since it requires the visual cortex to interpret facial expressions and form a concept like “smiling”. There is no reason why dopamine responses could not be triggered by even higher-level abstractions that are indicative of her child’s well-being. Once her child is able to speak, the dopamine response might be directly triggered by her child verbally expressing that everything is OK. This may not be how affection is formed in humans (it takes a while until children learn to talk), but an artificial reinforcement learner might learn to care about humans through reward signals that are triggered by humans expressing happiness.

So I think it’s possible that a value like “care for your child” could form either through associating a reward signal with a high-level expression of well-being, or with a bunch of different low-level correlates of the child’s well-being. In the second case, the concept “care for your child” might emerge as a natural abstraction that unites the different reward triggers.
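
To make the second case a bit more concrete, here is a toy sketch in Python (my own illustration, not something from the original discussion): a reward signal that aggregates several independent correlates of a child’s well-being, contrasted with a single-cue reward of the kind that brood parasites exploit. The cue names and weights are invented for illustration.

```python
# Toy sketch: reward aggregated from several independent correlates of a
# child's well-being vs. reward hinging on a single cue. All cue names and
# weights below are invented for illustration.

CUE_WEIGHTS = {
    "sees_smile": 0.3,             # visual, somewhat abstract (requires parsing a face)
    "hears_laughter": 0.2,         # auditory
    "smells_familiar_scent": 0.2,  # olfactory, very low-level
    "child_says_ok": 0.3,          # high-level, language-based cue
}

def multi_cue_reward(cues):
    """Weighted sum over several cues; each cue value lies in [0, 1]."""
    return sum(w * cues.get(name, 0.0) for name, w in CUE_WEIGHTS.items())

def single_cue_reward(cues):
    """The bird-like failure mode: everything hinges on one cue."""
    return cues.get("sees_smile", 0.0)

# Knock out one cue at a time: the multi-cue reward degrades gracefully,
# while the single-cue reward can drop to zero (or be spoofed by faking
# that one cue).
full = {name: 1.0 for name in CUE_WEIGHTS}
for missing in CUE_WEIGHTS:
    degraded = {**full, missing: 0.0}
    print(f"{missing:22s} missing -> multi: {multi_cue_reward(degraded):.2f}, "
          f"single: {single_cue_reward(degraded):.2f}")
```

In this toy setup, a concept like “my child is doing well” is the cheapest common explanation for why all four cues tend to pay off together, which is roughly the sense in which “care for your child” could emerge as a natural abstraction uniting the reward triggers.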


Going back to the two different reasons why an AGI might behave in a way that we would characterize as misaligned: I think by using a range of different reward signals that all circle around different aspects of human well-being, it should be possible to get an AGI to learn to genuinely care about humans (rather than about reward itself). That still leaves the problem that even a benevolent, all-powerful AGI might take actions that seem bad in the context of our current values:

Imagine a mother who takes her daughter to a doctor to get vaccinated, but the daughter protests because she does not want to get pinched by a needle. From the daughter’s point of view, this could constitute an alignment failure. Would the mother still insist on the vaccination if her only reward trigger had been seeing her daughter smile? It might depend on her time horizon: does she maximize the number of smiles expected in the next minute, or the number of smiles expected over the course of her child’s lifetime?
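
To illustrate how much the answer can hinge on the time horizon, here is a small toy calculation (the numbers are made up, purely for illustration): vaccinating costs some smiles today but slightly raises the expected smiles on every later day.

```python
# Toy numbers (invented) for the vaccination example: the needle ruins today's
# smiles, but vaccination slightly raises expected smiles on every later day.
# Which choice wins depends on the horizon over which smiles are summed.

def discounted_smiles(first_day, later_days, gamma=0.999, horizon=365):
    """Discounted sum of expected daily smiles over `horizon` days."""
    return first_day + sum((gamma ** t) * later_days for t in range(1, horizon))

vaccinate = dict(first_day=0.0, later_days=1.05)  # slightly healthier child
skip      = dict(first_day=1.0, later_days=1.00)

for horizon in (1, 7, 365):
    v = discounted_smiles(**vaccinate, horizon=horizon)
    s = discounted_smiles(**skip, horizon=horizon)
    print(f"horizon = {horizon:3d} days: vaccinate = {v:6.1f}, skip = {s:6.1f}")
```

With these made-up numbers, skipping the shot wins over a horizon of a day or a week, while vaccinating wins once the horizon stretches to a year; where the crossover lies is entirely a function of the horizon (or discount factor) the mother is implicitly optimizing over.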

In this example, most people would probably say it’s good that the mother’s longer time horizon overrides the child’s short-term preferences. But we might not want an AGI to optimize over very long time horizons. Utilitarianism and consequentialist reasoning over long time horizons are fairly widely accepted frameworks around here, at least more so than among the general public; but at the same time, the consensus seems to be that it would be bad if an all-powerful AGI took far-reaching actions that are inscrutable to us and seem bad at face value, even if we knew that the AGI intrinsically cares about us and has our best interests in mind.
I don’t think an aligned AGI would necessarily choose to optimize over very long time horizons and make decisions we can’t understand; we might even be able to instill shorter time horizons and legibility as two of its values. My point is that there is an important difference between objectively bad actions taken by an all-powerful AGI with misspecified values, and seemingly bad actions, which we may fail to comprehend, taken by an all-powerful but benevolent AGI.

If we put our fate in the hands of an all-powerful AGI, it seems unavoidable that our world will change very drastically. Many of these changes will likely appear bad according to our current, relatively poorly defined values. The very concept of what it means to be human might become meaningless once brain-machine interfaces get better. The best we can probably aim for is that the hypothetical all-powerful AGI that brings these changes about cares about us (and indirectly about the things we care about) in a way that is somewhat analogous to the way in which parents care about their children.[1] Nature provides us with examples of how complex values like caring for your children can be instilled in ways that make these values more or less robust to changing environments.

  1. ^

    Children caring for their elderly parents might be a more accurate but less optimistic analogy.