Research Notes: What are we aligning for?

As part of learning the field and making the most of new ideas, I’ve been trying to figure out what the goal of AI alignment is. So far I’ve found out what outer alignment is as a concept, but not what it should be as an instantiation.

Humanity’s Values

So here is my suggestion:

Why don’t we take Nick Bostrom’s Instrumental Convergence goals and make those our terminal goals as a species?

Like so:

AGI Values                  | Humanity’s Values
Self-Preservation           | Survival
Goal-Content Integrity      | Self-Determination
Cognitive Self-Enhancement  | Self-Improvement
Technological Advancement   | Invention
Resource Acquisition        | Growth

Note how humanity’s values are agnostic to morality. The whole idea of Bostrom’s instrumental convergence goals is that they maximize the ability to achieve nearly any terminal goal. So by adopting these goals as the explicit terminal goals for humanity, we allow space for every individual human to pursue their self-chosen goals. We don’t need to agree on religion, morality, the nature of reality, or how nice one should be in repeated coordination games. Instead we can agree that whatever AIs we happen to make, we at least ensure these AIs won’t wipe out humanity at large, won’t try to change humanity, won’t limit us in our development or creations, and won’t stymie our growth.

Honestly, that takes care of most horror scenarios!

  • No brain in a vat, cause that’s not self-determination.

  • No living in a human zoo cause that won’t allow for growth.

  • No hidden sabotage of our research cause that hamstrings our inventions.

  • And no murdering us all cause that would definitely infringe on our survival.

It’s basically like a better version of Asimov’s laws!

Note that humanity’s values are only applied to humanity at large and not to individual humans. That means the AGI can still …

  • … solve trolley problems in traffic because it looks at the survival of humanity over that of the individual.

  • … nudge criminals into other life paths, because it looks at the self-determination of the majority over that of the minority.

  • … limit the ability of con artists to learn new MLM schemes cause that would infringe on other people’s ability to prosper.

  • … prevent the invention of more damaging bioweapons cause not all inventions help humanity flourish.

  • … guide the growth of individuals where tragedy-of-the-commons situations are threatening because those result in everyone being poorer.

By formulating our goals at the level of humanity instead of the individual human, we are thus creating a path for AGI to navigate conflicts of interest without devolving into catastrophic trade-offs no one thought to prohibit it from making. Of course, there is still the question of how to operationalize these values, but knowing where the target is makes for a good start.

Outer Alignment

The way I understand the problem space now, the goal of AI alignment is to ensure AGI adopts the instrumental convergence goals for humanity while we can assume the AGI will also have these goals for itself. The beauty of this solution is that any increase of instrumental success on the part of the AGI will translate into an increase in terminal success for humanity!

Win-win if I ever saw one.

Additionally, this approach doesn’t rely on the incidental creator of the first AGI being a nice guy or gal. These goals are universal to humanity. So even though an individual creator might add goals that are detrimental to a subset of humanity (say the bad people get AGI first), the AGI will still be constrained in how much damage it can do to humanity at large.