Interpersonal alignment intuitions

Let’s try this again...

The problem of aligning superhuman AGI is very difficult. We don’t have access to superhuman general intelligences. We have access to superhuman narrow intelligences, and human-level general intelligences.

There’s an idea described here that says: (some of) the neocortex is a mostly-aligned tool-like AI with respect to the brain of some prior ancestor species. (Note that this is different from the claim that brains are AIs partially aligned with evolution.) So, maybe we can learn some lessons about alignment by looking at how older brain structures command and train newer brain structures.

Whether or not there’s anything to learn about alignment from neuroanatomy specifically, the general idea stands: there are already partial alignment-like relationships between fairly generally intelligent systems. The most generally intelligent systems that currently exist are humans, so we can look at some interpersonal relationships as instances of partially solved alignment.

In many cases people have a strong need to partially align other humans. That is, they need to interact with other people in a way that communicates and modifies intentions, until both parties are willing to risk their resources to coordinate on stag hunts. This has happened in evolutionary history: for example, people have had to figure out whether a mate is trustworthy and worth the investment of raising children together, rather than bailing, and whether potential allies in tribal politics will be loyal. It has also happened in memetic history: for example, people have developed skill in sussing out reliable business partners who won’t scam them.
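To make the coordination stakes concrete, here is a minimal stag hunt payoff matrix; the specific payoffs are illustrative assumptions, not anything claimed above:

\[
\begin{array}{c|cc}
 & \text{Stag} & \text{Hare} \\
\hline
\text{Stag} & (4,\,4) & (0,\,2) \\
\text{Hare} & (2,\,0) & (2,\,2)
\end{array}
\]

Hunting stag is the better outcome for both, but it only pays off if the other person actually commits; hunting hare is safe but small. Deciding whether to hunt stag is exactly the problem of judging whether the other person’s cooperative intention is real.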

So, by some combination of hardwired and learned skill, people determine, with some success, the fundamental intentions of other people. This determination has to be high precision; i.e., there can’t be too many false positives, because a false positive means trying to invest in some expensive venture without sufficient devoted support. The determination also has to be fairly robust to the passage of time and to surprising circumstances.
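As a rough illustration of the precision requirement (the numbers here are made up purely for the sake of the arithmetic): suppose backing a partner gains you \(g = 5\) units if they really are devoted and costs you \(c = 10\) units if they bail, and let \(p\) be the precision of your judgment, i.e. the fraction of people you judge devoted who really are. Then the expected value of investing is

\[
p \cdot g - (1 - p) \cdot c = 5p - 10(1 - p) = 15p - 10,
\]

which is positive only when \(p > 2/3\). Even a modest asymmetry between the cost of betrayal and the gain from cooperation forces the judgment to be fairly precise.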

This is disanalogous to AGI alignment in that AGIs would be smarter than humans and very different from humans, lacking all the shared brain architecture and values, whereas people are pretty much the same as each other. But there is some analogy: people are general intelligences, albeit very bounded ones, who become somewhat aligned with other general intelligences even though their values aren’t perfectly aligned a priori.

So, people, what can you say about ferreting out the fundamental intentions of other people? Especially those of you who have experience doing so in circumstances where you were vulnerable to serious harm from disloyalty, and where there weren’t simple and effective formal commitment/control mechanisms.

This is a call for an extended investigation into these intuitions. What, in great detail, do we believe about another person when we think that they have some fundamental intention? What makes us correctly anticipate that another person will uphold some intention or commitment even if circumstances change?