[Question] Is there any literature on using socialization for AI alignment?

Hello,

I was recently thinking about the question of how humans achieve alignment with each other over the course of our lifetime, and how that process could be applied to an AGI.

For example, why doesn’t everyone shoplift from the grocery store? A grocery store isn’t as secure as Fort Knox, and someone who considered every possible policy for obtaining groceries might well conclude that shoplifting is more efficient than earning money at a legitimate job. That may not be the best example, but I’m sure LW is familiar with the idea at the heart of the problem of AI alignment: the solution humans consider morally superior isn’t always the most “rational” one.

So why don’t humans shoplift? I believe the most common answer from modern sociology is that we observe other humans obtaining jobs and paying with legitimately earned money, and we imitate that behavior out of a desire to be a “normal” human. People are born into this world with virtually no alignment and gradually construct their own ethical system through interactions with the people around them, most importantly their parents (or other social guardians).

Granted, from the perspective of ethical philosophy and decision theory that explanation is probably an oversimplification, but my point is that socialization appears to be a straightforward route to AI alignment. When human beings become grown adults and their parents are considerably weaker from old age, their elders no longer have any physical means of controlling them. And yet people obey or respect their parents anyway, and are morally expected to, because of the social conditioning they still carry from childhood. That is essentially the outcome we want with a superintelligent AGI: a being powerful enough to ignore humanity, but with a deep personal desire to obey it anyway.

Some basic mechanics of formal and informal norms in sociology could lend themselves to reinforcement learning algorithms (a toy sketch follows the list). For example:

  • Guilt-based discipline: as the AGI explores its environment, signal when a specific state-action pair under its current policy is morally wrong

  • Shame-based discipline: whenever the AGI adopts a policy that produces a detrimental outcome, signal that its general behavior is morally wrong
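
To make the distinction concrete, here is a minimal toy sketch in Python using tabular Q-learning. Everything in it (the single-state environment, the guardian_guilt and guardian_shame functions, the penalty sizes) is a hypothetical illustration rather than a proposal for how real guardian feedback would be gathered: guilt feedback is applied to the specific state-action pair at the moment it occurs, while shame feedback is applied at the end of an episode to the agent’s behavior as a whole.

```python
# Toy sketch: guilt- vs. shame-based feedback as reward shaping in tabular
# Q-learning. All names and numbers here are hypothetical illustrations.

import random
from collections import defaultdict

ACTIONS = ["earn_money", "shoplift"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

q_table = defaultdict(float)  # (state, action) -> estimated value


def guardian_guilt(state, action):
    """Guilt-based: penalize a specific state-action pair the guardian deems wrong."""
    return -10.0 if action == "shoplift" else 0.0


def guardian_shame(episode_actions):
    """Shame-based: penalize the agent's general behavior over a whole episode."""
    wrong = sum(a == "shoplift" for a in episode_actions)
    return -5.0 * wrong / max(len(episode_actions), 1)


def choose_action(state):
    # Epsilon-greedy exploration over the current Q-values.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


def env_reward(state, action):
    # Hypothetical environment where shoplifting is "more efficient" on raw reward.
    return 3.0 if action == "shoplift" else 2.0


def run_episode(steps=10):
    state = "at_store"
    history = []
    for _ in range(steps):
        action = choose_action(state)
        # Base reward from the environment plus immediate guilt feedback.
        reward = env_reward(state, action) + guardian_guilt(state, action)
        next_state = "at_store"
        best_next = max(q_table[(next_state, a)] for a in ACTIONS)
        q_table[(state, action)] += ALPHA * (
            reward + GAMMA * best_next - q_table[(state, action)]
        )
        history.append(action)
        state = next_state
    # Episode-level shame feedback, spread over the behavior actually exhibited.
    shame = guardian_shame(history)
    for action in set(history):
        q_table[("at_store", action)] += ALPHA * shame


for _ in range(500):
    run_episode()

print({k: round(v, 2) for k, v in q_table.items()})
```

Even in this toy, the agent ends up preferring "earn_money" despite "shoplift" paying more in raw environment reward, which is the socialization effect the bullets above are gesturing at.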

One possible criticism of socialization alignment is that it means creating an AGI that starts out completely unaligned, with the expectation that it will become aligned eventually. That leaves a window of time in which the AGI may harm people before it learns that doing so is wrong. My proposed solution to that problem is what I previously referred to as Infant AI: the first scalable AGI should be very restricted in its intelligence (e.g., given only the domain of mathematical problems), and expanded into a more intelligent AGI only after the previous version is fully aligned.
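
Procedurally, that staged approach might look something like the rough sketch below; the capability levels, the train function, and the is_aligned check are all hypothetical placeholders for whatever socialization period and evaluations would actually be used.

```python
# Hypothetical sketch of staged capability expansion: each level is only
# unlocked once the previous, weaker system has passed an alignment check.

CAPABILITY_LEVELS = ["math_only", "narrow_science", "general_reasoning"]


def train(level: str):
    """Placeholder: train (or fine-tune) a system restricted to this domain."""
    return {"level": level, "policy": "..."}


def is_aligned(system) -> bool:
    """Placeholder: the socialization period plus alignment evaluations."""
    return True  # in reality this would be a long, conservative process


def staged_development():
    deployed = None
    for level in CAPABILITY_LEVELS:
        candidate = train(level)
        if not is_aligned(candidate):
            # Stop scaling: keep socializing the current level instead.
            break
        deployed = candidate
    return deployed
```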

One benefit of socialization alignment is that it doesn’t rely on explicitly spelling out what ethical system or values we want the AI to have. Instead, it would organically conform to whatever moral system the humans around it use, effectively optimizing for approval from its guardians.

However, this can also be a double-edged sword. The problem I foresee is that different instances of AGI would be as diverse in their ethical systems as humans are. While the vast majority of humans agree on fundamental ideas of right and wrong, there are still many differences from one culture to another, or even from one individual to another. An AGI raised in the Middle East may end up with a very different value system than an AGI raised in Great Britain or Japan. And if the AI interacted with morally dubious individuals, such as a psychopath or an ideological extremist, that could skew its moral alignment as well.