The curious case of Pretty Good human inner/outer alignment

I have become convinced that looking at the gap between human inner and outer alignment is a good way to think about potential inner/outer alignment problems in artificial general intelligences:

We have an optimisation process (evolution) trying to propagate genes, which created a general intelligence (me/you). For millions of years our inner goal of feeling really good also satisfied evolution's outer goal of propagating genes, because one of the things that feels best is having sex. But eventually that intelligent agent figured out how to optimise for things the outer optimisation process didn't want, such as having protected sex or watching VR porn, thus satisfying the inner goal of feeling really good but not the outer goal of propagating genes. This is often told as a cautionary tale: we only know of one General Intelligence, and it's misaligned. One day we will create an Artificial General Intelligence (AGI) and give it some sort of (outer) goal, and it might then develop an inner goal that doesn't directly match what we intended. I think this only tells half the story.

Even though our general intelligence has allowed us to invent condoms and have sex without the added cost of children, a surprising number of people decide to take them off because they find having children fun and meaningful.

In a world where we could choose to spend all our time having protected sex or doing drugs, a lot of us choose to have a reasonable number of kids and to spend our time on online forums discussing AI safety, both of which seem to satisfy a longer-term version of “propagate your genes” than simply wanting sex because it feels good. More than that, we often choose to be nice even in situations where being nice is detrimental to the propagation of our own genes. People adopt kids, try to prevent wars, work on wildlife conservation, spend money on charity buying malaria nets across the world, and more.

I think there are two questions here: why are human goals so well aligned with propagating our genes, and why are humans so nice?

Most people want to have kids, not just have sex. They want to go through the costly and painful process of childbirth and child-rearing, to the point where many will even go through IVF. We’ve used all of our general intelligence to bypass the feels-nice-in-the-moment bits and jump straight to the “propagate our genes” bit. We are somehow pretty well aligned with our unembodied maker’s wishes.

Humans, of course, do a bunch of stuff that seems unrelated to spreading genes, such as smoking cigarettes and writing this blog post. Our alignment isn’t perfect, but inasmuch as we have ended up a bit misaligned, how did we end up so pleasantly, try-not-to-kill-everything misaligned?

The niceness could be explained by the trivial fact that human values are aligned with what I, a human, think is nice: humans seem pretty nice because people do things I understand and empathise with. But there is something odd about the fact that most people, if given the chance, would pay a pretty significant cost to help a person they don’t know, keep tigers non-extinct, or keep Yosemite looking pretty. “Nice” is not a clear metric, but why are people so unlike the ruthless paperclip maximisers we fear artificially intelligent agents will immediately become?

Hopefully I’ve convinced you that looking at human beings as an example of intelligent agent development is more interesting than looking at them purely as an example of what went wrong: we are also an example of some things going right, in ways that I believe current AI safety theory wouldn’t predict. As for why human existence has gone as well as it has, I’m really not sure, but I can speculate.

All of these discussions depend on agents that pick actions based on some reward function that determines which of two states of the world they prefer, but something about us seems not to be purely reward-driven. The vast majority of intelligent agents we know of (they’re all people), if given a choice between killing everyone while feeling maximum bliss and not killing everyone while living their regular, non-maximum-bliss lives, would choose the latter. Heck, a lot of people would sacrifice their own existence to save another person’s! Can we simply not imagine truly maximum reward? Most people would choose not to wirehead, not to abandon their distinctly not purely happy lives for a life of artificial joy.
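To make the contrast concrete, here is a minimal toy sketch of the kind of purely reward-driven agent that picture assumes: it scores world states with a single scalar and always picks the maximum. This is my own illustration, not anyone’s actual model of human decision-making, and the option labels and reward numbers are invented.

```python
# A toy, purely reward-driven agent: it ranks candidate world states by a
# single scalar reward and always picks the highest-scoring one.
# The options and the numbers attached to them are made-up illustrations.
options = {
    "wirehead: maximum bliss, everyone else dies": 100.0,
    "ordinary life: moderate happiness, everyone lives": 60.0,
    "sacrifice yourself to save a stranger": 5.0,
}

def pure_reward_agent(options: dict[str, float]) -> str:
    """Return the option with the highest scalar reward."""
    return max(options, key=options.get)

print(pure_reward_agent(options))
# -> "wirehead: maximum bliss, everyone else dies"
```

Most of the intelligent agents we actually know of pick differently, which is exactly the puzzle.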

Is the human reward function simply incredibly good? Evolution has figured out a way to create agents that adopt kids, look at baby hippos, plant trees, try to not destroy the world and also spread their genes.

Is our limited intelligence saving us? Perhaps we are too dumb to even conceive of all the possibilities that would be horrific for humanity as a whole but that we, as individuals, would prefer.

Could it be that there is some sort of cap on our reward function, simply due to our biological nature, where having 16 direct descendants doesn’t feel better than having 3? Where maximum bliss isn’t that high?
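As a rough illustration of that “cap” idea, a saturating reward curve makes each additional descendant (or additional unit of bliss) add less and less felt reward. This is a sketch under my own assumptions, with an arbitrary functional form and made-up numbers, not a claim about actual human psychology.

```python
import math

def saturating_reward(n_kids: float, cap: float = 10.0, scale: float = 1.0) -> float:
    """Reward that grows with n_kids but never exceeds `cap`.

    The exponential form, cap, and scale are illustrative assumptions.
    """
    return cap * (1 - math.exp(-n_kids / scale))

for n in (3, 16):
    print(n, round(saturating_reward(n), 2))
# 3  -> 9.5
# 16 -> 10.0
```

Under a curve like this, 16 direct descendants barely feels better than 3, so there is little pull toward the extreme-maximising behaviour we worry about in artificial agents.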

Perhaps there’s some survivorship bias: any intelligent agent that was too misaligned would disappear from the gene pool as soon as it figured out how to have sex, or sexual pleasure, without causing a pregnancy. We are still here because we evolved some deeper desires, desires to actually have a family and a social group, beyond the sensual niceness of sex. Additionally, intelligent agents so far have not had the ability to kill everything, so even a horrifically misaligned agent couldn’t have caused that much damage. There are examples of some that did get into a position to cause quite a lot of damage, and killed a large percentage of the world’s population.

I am aware that we are, collectively, getting pretty close to the edge, whether through misaligned AI, nuclear weapons, biological weapons, or ecological collapse, but I’d argue that the ways in which people have messed each other up, and continue to do so, are more a result of coordination problems and weird game theory than of misalignment.

Maybe I’m weird. Lots of people really would kill a baby tiger on sight, because it would endanger them or their family when grown. Plenty of people take fentanyl until they die. But still, if given the choice, most intelligent agents we know of would choose actions that don’t endanger all other intelligent agents: chilling out, having some kids, drinking a beer.

I can smell some circularity here: if we do end up making an AGI that kills us all, then humans too were misaligned all along; it just took a while to manifest. And if we make an AGI and it chooses to spend a modest amount of time pursuing its goals and the rest looking at Yosemite and saving baby tigers, maybe the typical end point for intelligences isn’t paperclip maximisation but a life of mild leisure. Regardless, I still think our non-human-species-destroying behaviour so far is worth examining.

We’re the only general intelligence we know of, and we turned out alright.