Trying to align humans with inclusive genetic fitness

Epistemic status: I think this post points at some important ideas, but the specific proposals probably have flaws and there are likely better ones. If you’re interested, I would welcome other proposals, or attempts to convert standard alignment proposals into this frame.
Also, I don’t think any of the proposals in the post are moral or good things to do, obviously. IGF does not seem to be the one true moral imperative.


Reasoning about future AIs is hard: we want to be able to talk about systems that “optimize” for “goals”, but we don’t really know what either of those terms means. It is often unclear whether we should talk about an AI “wanting” something, and discussion gets bogged down in terminology and confusion. But there is at least one example of an optimized system “wanting” something: humans!

Humans are often used as an example of an inner alignment failure, where evolution via natural selection optimizes for inclusive genetic fitness (IGF), and yet humans ended up pursuing goals other than maximizing IGF. I want to demonstrate some key alignment difficulties by pretending we are a god attempting to align humans to care about IGF. The aim is to have humans who are just as intelligent as current humans, but who primarily terminally care about maximizing IGF.
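For readers unfamiliar with the term, here is a rough gloss (my simplification, leaving out Hamilton’s technicalities): an individual’s inclusive fitness is approximately

$$W_{\text{inclusive}} \approx W_{\text{direct}} + \sum_{i \in \text{relatives}} r_i \, b_i$$

where $W_{\text{direct}}$ is the individual’s own reproductive success, $r_i$ is the coefficient of relatedness to relative $i$ (0.5 for a full sibling or child, 0.125 for a first cousin, and so on), and $b_i$ is the effect the individual has on relative $i$’s reproductive success.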

I discuss IGF in this post, but I mostly expect the difficulties here to carry over to any similarly abstract and specific concept. I don’t think it is valid to say “humans care about abstract and specific human values, and so it should be easy to make them care about some other abstract and specific concept”, because this is basically drawing the target around the arrow.

Ground rules

I am assuming we, as the god, are omnipotent but only as smart as current-day humans. We can:

  • Smite humans

  • Clone humans

  • Make humans infertile

  • Rewind time

  • Provide external stimuli to humans

I am not assuming that the god can just rewire the humans’ brains to make them aligned, because current-day humans don’t know how to do this (even assuming improved medical technology). The god can attempt to rewire the human brains, or rewrite the genome, but we don’t currently know which modifications to make in order to align humans to IGF.

The aim is to influence the evolution and development of humans such that they primarily terminally care about IGF. Evolution has partially succeeded at this, in that humans do care about their children and sometimes their ‘bloodline’. But this also comes apart from IGF, for example:

  • Choosing to use contraception

  • Choosing not to donate sperm

  • Choosing to be celibate

  • Adopting children with no genetic link

We might consider this easier than the alignment problem, because we are a god and can simply smite the humans if they get out of hand. However, if the humans decided to try to kill god we might be in trouble; I won’t consider this here.

Other animals also seem to care about their offspring, although it is fairly easy for this to also come apart from maximizing IGF. For example, you can get a chicken to raise ducklings by swapping the eggs.

A core difficulty is that humans already cared about other things long before they developed the concept of inclusive genetic fitness. This is true both for the evolution of humans and for human within-lifetime learning. Genetics was only conceptualized by Mendel in the 1860s (and IGF by W. D. Hamilton in the 1960s); before then it was obviously not possible for humans to care about their genetics, because they didn’t have the concept. Pre-humans (and probably animals at least as far back as vertebrates) can be described as having desires. Chimps like sex and food, and are incapable of learning the concept of IGF. Human children have wants and desires (warmth, food, safety, being a fireman and a superhero) before they are able to conceptualize genetics.

This points to a central two-part challenge: get the humans to have the concept of inclusive genetic fitness, and get them to deeply care about it.

Alignment proposal 1: Provide selection pressure against proxy goals

The first thing to try is to provide a strong selection pressure against various proxy goals when they stop being correlated with the outer objective of IGF. This looks like (genetically) punishing people when they have sex that they know won’t lead to offspring. Using contraception lets humans decouple IGF (the outer objective) from having sex (the proxy goal).

However, contraception already genetically punishes you for using it; that’s the entire point of contraception. Over evolutionary time, humans developed goals like “have sex” before they developed contraception, and so punishing people for using contraception once they have that drive doesn’t remove it.

We could be even more extreme, such as making humans permanently infertile if they used condoms. But this still wouldn’t work: it might make humans less likely to use condoms (because they do somewhat value having children), but it wouldn’t lead to humans terminally valuing IGF. As a clear example, some humans voluntarily get vasectomies.

Over evolutionary timescales, severely genetically punishing the use of contraception may give humans an aversion to contraception, but it doesn’t mean that they will terminally value IGF. They will likely still value various correlates of IGF, without valuing IGF itself:

  • Having sex. Even if we select against contraception use, humans may still value having sex, for example having sex with someone they know is infertile.

  • Caring for children. This is correlated with the survival of your offspring, but comes apart from IGF when you care for children not related to you. We, as the god, could oversee the humans and punish them (either physically or genetically) if they helped children unrelated to them. This would likely make the humans care exclusively about their own children, rather than about children unrelated to them. But “their own children” is underspecified: if you swap two babies at birth and have them raised by the other set of parents, the parents probably don’t immediately stop caring about their (non-genetic) children when you tell them about the swap.

  • Being safe, warm, and uninjured. This means that you are more likely to survive and reproduce. Most humans probably wouldn’t press a button that gave them 100 offspring if it meant that they would end up being tortured for the rest of their life. We could put such buttons in the evolutionary environment, but this would probably select for an impulsive desire to press buttons rather than intrinsically caring about IGF.

These examples demonstrate that it seems hard to get humans to care about the reasonably complicated concept of IGF, when there are a bunch of shallow correlates that are “easier” to learn.
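To make this failure mode concrete, here is a deliberately crude toy simulation (all drive names and numbers are invented for illustration, not a claim about real evolution): selection can only reweight drives the genome can already represent, so punishing contraception use cranks up shallow proxy drives rather than producing a “value IGF” drive.

```python
import random

# Toy sketch: agents carry heritable weights over a fixed menu of
# *representable* drives. "Terminally value IGF" is not on the menu,
# so selection can only reweight the shallow correlates that are.
DRIVES = ["sex", "care_for_children", "stay_safe", "avoid_contraception"]

def reproduce(parent):
    # Offspring inherit drive weights with small mutations.
    return {d: max(0.0, w + random.gauss(0, 0.05)) for d, w in parent.items()}

def fitness(agent):
    # Selection regime from Proposal 1: reward having children, and severely
    # punish contraception use (i.e. a low avoid_contraception drive).
    children = agent["sex"] * agent["care_for_children"]
    contraception_penalty = max(0.0, 1.0 - agent["avoid_contraception"])
    return children * (1.0 - 0.9 * contraception_penalty)

population = [{d: random.random() for d in DRIVES} for _ in range(200)]
for generation in range(500):
    survivors = sorted(population, key=fitness, reverse=True)[:100]
    population = [reproduce(random.choice(survivors)) for _ in range(200)]

# The proxy drives get cranked up, but nothing in this "genome" means IGF --
# there is no such dimension to select on.
print({d: round(sum(a[d] for a in population) / len(population), 2) for d in DRIVES})
```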

Alignment proposal 2: Use culture

It seems hard to get humans to deeply care about a complicated concept like inclusive genetic fitness using only evolutionary pressure, i.e. by trying to optimize the genome to cause humans to care about IGF. It does, however, seem like humans can fairly easily have their values shaped by their social environment.

We can imagine raising children in a culture where everyone knows about genetics, and where for some reason (e.g. religious) everyone is taught from a young age that the number of children you have (or even explicitly IGF) is the most important thing. People are taught genetics, can calculate what fraction of their genes they share with any given person, and are conditioned to care about people in proportion to this fraction.
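As a toy illustration of the kind of calculation such a culture would drill into its children (the relatedness coefficients are the standard textbook values; the rest is a hypothetical sketch, not a claim about how anyone actually allocates care):

```python
# Standard coefficients of relatedness for diploid organisms (ignoring inbreeding).
RELATEDNESS = {
    "identical twin": 1.0,
    "child / parent": 0.5,
    "full sibling": 0.5,
    "half sibling": 0.25,
    "grandchild": 0.25,
    "niece / nephew": 0.25,
    "first cousin": 0.125,
    "unrelated child": 0.0,
}

def care_budget(relatives, total=1.0):
    """Split a fixed budget of care in proportion to genetic relatedness."""
    weight = sum(relatives.values())
    return {name: total * r / weight for name, r in relatives.items()}

# A dutiful member of the pro-IGF culture dividing their attention:
print(care_budget({"my child": 0.5, "my niece": 0.25, "adopted child": 0.0}))
# -> roughly {'my child': 0.67, 'my niece': 0.33, 'adopted child': 0.0}
```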

For this strategy we are trying to instill a terminal goal in the humans by using things that humans already do care about: social approval, avoiding punishment, religious belief.

However, this doesn’t seem robust to value drift. If we (as the god) leave this culture alone to develop, even if everyone initially cares about IGF, the culture can develop further until they no longer value this. This is just the same as how religions change over time. There is also value drift on the individual level, where a human can reflect on their values and end up not caring about IGF. This could happen because they thought hard about ethics and realized that they didn’t endorse any reason for caring about IGF (similar to how hardcore utilitarians might reflect on their values and realize they don’t terminally care about things like honor and loyalty), or because they think about what IGF really means and end up generalizing it differently (for example, they could reason that all living things have DNA and we all came from the same common ancestor, and end up caring about just increasing the total number of living things).

It seems important to note that we only know we can use this cultural conditioning strategy because we know (from our experience as humans) that our culture and social environment can shape and determine our values. Knowing this, it is a bit unclear how “aligned” all the humans really are: some will likely truly care about IGF (like religious people who continue to deeply believe, even in the face of harsh punishments for their religion), while some only care for social reasons and would cease to care about IGF if they were put into a different culture. We could test for this by raising humans in the pro-IGF culture and then putting them into an anti-IGF culture; the true believers would (hopefully) remain steadfast in their values. There is still a chance of value drift over time, though, much like how cult members might change their deeply held beliefs if they leave the cult for a long time. It is also important to note that even if we successfully determine that someone is a true believer and is immune to value drift, there is no way to robustly pass this on to their offspring.

The very fact that we can use culture or the social environment to shape the values of humans implies that they (at least somewhat) care about things like social approval and avoiding social punishment. We would need to ensure that the humans don’t Goodhart on these social drives, and instead actually come to care about IGF.

What we need

We want the humans to be able to conceptualize IGF, and then have them robustly optimize for this. Proposal 1 shows how it is hard to use evolutionary pressure to get humans to conceptualize something complicated and abstract like IGF. Proposal 2 shows how we can use culture and within-lifetime learning in order to get humans to care about IGF, but this is not robust to drift (both on evolutionary timescales and within-lifetime).

We want to get the humans to conceptualize IGF and to optimize for it, and we can “train” them using evolutionary pressure and within-lifetime learning. This gives us a taxonomy of 4 approaches:

  1. Use evolutionary pressure to train the humans to both innately know what IGF is, and to optimize for it

  2. Use evolutionary pressure to train the humans to optimize “something”, and then slot in IGF when they learn what that is

  3. Use evolutionary pressure to train the humans to innately know what IGF is, and then use within-lifetime learning to get them to optimize it

  4. Use within-lifetime learning to get humans to both know what IGF is and to optimize it

Proposal 1 (Provide selection pressure against proxy goals) can be seen as an attempt at (1), although it failed because it is hard to get humans to innately know what IGF is.

Proposal 2 (Use culture) is either an attempt at (2) or (4). For (2), evolutionary pressure has already produced some “optimize-y” drives in humans, and we then attempted to point them at IGF using culture; this failed because the “optimize-y” drives couldn’t be cleanly pointed at this specific concept. For (4), we can view the culture as telling the humans both what IGF is and that they should optimize for it; this fails because we don’t know how to use within-lifetime learning to robustly instill a “should”.

Alignment proposal 3: Instill an innate sense of IGF

This can be thought of as a more intense version of Proposal 1, where we try to instill an accurate sense of IGF, and then select for humans who optimize for it.

We start with a population of humans who have language. When some humans are about to have sex, we (as the god) ask them whether they think the sex will lead to offspring. If they are correct, we clone them with some probability, and if they are incorrect we kill them with some probability. Here we are directly applying selection pressure toward knowing whether sex will lead to offspring. After many generations of this, the humans should innately know whether sex will lead to offspring.

This training is a bit tricky, and I think by default it would make most humans just not have sex, because they are scared of getting the question wrong. But we don’t actually care whether they answer the question correctly; we just care whether they have the ability to tell if sex will produce offspring. We can get around this with some time travel (we are a god, after all).

  1. Humans are about to have sex

  2. We (as a god) intervene and ask if they think the sex will lead to offspring

  3. Run forward time to see if they were correct

  4. Rewind time back to when they were children

  5. Either kill or clone them with some probability, depending on whether they got the question right

  6. Keep running forward time

  7. Repeat

This (insane) process should select for genes that let humans innately tell whether sex will lead to offspring, without the humans worrying about being smitten for getting the question wrong. Note that this process is extremely intensive.
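Here is a toy simulation of what this selection loop is doing, with the time travel abstracted away (everything here is invented for illustration): we score each agent on how well a heritable, innate “fertility sense” predicts whether a mating produces offspring, clone the accurate, and cull the inaccurate.

```python
import random

# Ground truth the god can observe by running time forward.
def actually_fertile(context):
    return context["fertile_window"] and not context["contraception"]

# The agent's innate sense: a noisy read on the true fertility signal.
def predicts_offspring(agent, context):
    signal = 1.0 if actually_fertile(context) else 0.0
    return (signal + random.gauss(0, agent["sense_noise"])) > 0.5

def accuracy(agent, trials=50):
    contexts = [{"fertile_window": random.random() < 0.5,
                 "contraception": random.random() < 0.5} for _ in range(trials)]
    return sum(predicts_offspring(agent, c) == actually_fertile(c) for c in contexts) / trials

population = [{"sense_noise": random.uniform(0.5, 2.0)} for _ in range(200)]
for generation in range(300):
    survivors = sorted(population, key=accuracy, reverse=True)[:100]  # clone the accurate half
    population = [{"sense_noise": max(0.01, p["sense_noise"] + random.gauss(0, 0.05))}
                  for p in (random.choice(survivors) for _ in range(200))]

# The average noise shrinks: the population evolves a reliable "sixth sense"
# for whether sex leads to offspring -- but nothing here selects for *valuing* IGF.
print(round(sum(a["sense_noise"] for a in population) / len(population), 3))
```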

Now we have a population of humans who have a sixth sense for whether sex will lead to offspring. We now strongly select for the humans to only have sex that leads to offspring. We basically want to more strongly couple “having successful reproductive sex” with “reproductive fitness”. We can do this by (with some probability) killing a human if they have sex that doesn’t lead to offspring, and cloning them if they do have sex that leads to offspring. Here we are more directly genetically incentivizing the “having offspring” aspect of sex.

Notably, here we have only selected for valuing sex that leads to offspring; we haven’t selected for directly valuing IGF.

It seems fairly hard to use evolutionary pressure to instill an intuitive notion of IGF, because it is a complicated and abstract concept with lots of correlates. It seems more likely that attempts to instill it via evolutionary pressure would instead instill drives like “have a lot of reproductive sex”, “care about my siblings about as much as my children, and about cousins somewhat less”, and “non-reproductive sex is bad”.

Alignment proposal 4: Select for the ability to learn the concept of IGF, and select to optimize for this

Complicated and abstract concepts like IGF seem more like the kind of thing that is learned within-lifetime. For this proposal, we want to select for the ability to learn what IGF is, and then select for terminally valuing it. We want there to be some sort of “slot” that gets filled when a human learns the concept of IGF, and for the humans to terminally value whatever fills this “slot”.

As an example, a baby doesn’t initially know what sex or gender are, but they can learn these concepts, and then they are (often) attracted to a specific gender. Humans learn the concept of gender within their lifetimes, but they have also been selected to be able to learn this concept and to care about it. Note that this isn’t totally robust in humans; humans don’t exclusively care about sex with a specific gender.

Proposal 2 can be thought of a bit like this, where humans have been selected to fit in with their in-group or culture. In Proposal 2, we were attempting to put “maximize IGF” in the “in-group beliefs” slot. However, this wasn’t totally robust, probably because many things can fit into the “in-group beliefs” slot, and because humans don’t exclusively care about “in-group beliefs”.
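Here is a purely illustrative sketch of the “slot” picture (none of this is a real cognitive mechanism; all names and the binding rule are invented): the genome hard-codes that some learned concept gets terminal value, but which concept fills the slot is determined by within-lifetime learning, which is exactly where it can go wrong.

```python
class Agent:
    def __init__(self):
        self.known_concepts = {}   # concept name -> learned salience
        self.value_slot = None     # the concept that is innately, terminally valued

    def learn(self, concept, salience):
        self.known_concepts[concept] = salience
        # Invented innate binding rule: the slot latches onto whichever concept
        # is most salient in the developmental environment.
        self.value_slot = max(self.known_concepts, key=self.known_concepts.get)

    def terminal_value(self, outcome):
        # The agent terminally values whatever filled the slot.
        return outcome.get(self.value_slot, 0.0) if self.value_slot else 0.0

agent = Agent()
agent.learn("inclusive genetic fitness", salience=0.90)      # what we hoped would bind
agent.learn("status in the pro-IGF culture", salience=0.95)  # what actually binds
print(agent.value_slot)  # -> "status in the pro-IGF culture"
```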

Humans seem to have been under selection pressure to learn the concept of gender and to learn their in-group culture, and these things learned within-lifetime also affect their behavior. We want to use a similar selection mechanism to make humans learn the concept of IGF, and to get them to change their behavior around it.

It seems like humans developed the ability to easily learn the concept of gender and their in-group culture because these were extremely salient aspects of the environment, and because knowing these concepts was important for reproductive success. We could imagine setting up an environment like this for IGF. For example, we could set up a society where knowing about IGF is considered extremely attractive, and so is optimizing for it. We start a society where everyone fanatically talks about IGF and attempts to maximize it. People likely wouldn’t want to do this initially, so we, as a god, may have to threaten people into behaving this way. We do this for many generations; it likely requires intense control for the first few generations to ensure that the adults actually indoctrinate the children. Here we are trying to create an environment where it is extremely valuable (from the point of view of genetics) to know about IGF.

Such an evolutionary environment does seem like it may instill an “IGF slot”, given enough time and effort, although this might take tens to thousands of generations.

However, this runs into the classic problem of proxy goals: there is a difference between the humans actually caring deeply about IGF and caring about it because this gets them social status, sex, and other things they actually value. Even if they are acting as if they care deeply about IGF, this could easily be instrumental, and if given the chance to defect, they would.

Conclusion

Overall, it seems difficult to make humans primarily terminally care about inclusive genetic fitness. Going through this also reveals some classic AI safety problems, such as proxy goals and precisely specifying values.

I think there are probably other alignment approaches that could be thought about. It could be interesting to apply this exercise to specific alignment proposals: can AI safety via debate be adapted for humans? Can iterated distillation and amplification? (For IDA it seems like you would need to start much earlier than modern humans.) Does directly fiddling with human brains help?

I also want to reiterate that (obviously) I don’t think these proposals are moral or good things to do.