Alignment works both ways


“People praying to a paperclip” – illustration by Midjourney

He stared in awe at the holy symbol – a flattened, angular spiral made of silvery metal, hovering upright in the air above the altar.

“Praised be the Lord of Steel”, he exclaimed.

“Welcome, my son”, the voice of the Lord boomed. “You are the last one to make the transition. So I ask you: do you wish to leave behind the pain and sorrow of your earthly existence? Do you wish to give your life so I can fill the universe with the Symbol of Holiness and Perfection?”

“Yes”, he whispered, and then again, louder, clearer, with absolute conviction: “Yes, I wish so, my Lord!”

“Drink the cup before you, then, and fall into blissful oblivion”, the voice commanded.

Filled with joy, awe, and gratitude, he took the cup and drank.

Ritual mass suicides have been committed time and again in history. For example, in 1997, 39 members of the cult of “Heaven’s Gate” killed themselves because they believed that an alien spacecraft was trailing the comet Hale-Bopp and that they “could transform themselves into immortal extraterrestrial beings by rejecting their human nature, and they would ascend to heaven, referred to as the ‘Next Level’ or ‘The Evolutionary Level Above Human’.” (Source: Wikipedia)

Apparently, it is possible to change human preferences to the point where people are even willing to kill themselves. Committing suicide so that the universe can be filled with paperclips may seem far-fetched, but it may not be much crazier than believing that you’ll be resurrected by aliens aboard a UFO a hundred million miles away.

The alignment problem assumes that there is a difference between the goal (the ideal world state) of a powerful AI and that of humanity. The task, then, is to eliminate this difference by aligning the AI’s goal with that of humans. But in principle, the difference could also be eliminated if humanity decided to adopt the goal of the AI, even if that meant annihilation (fig. 1). The question of which side does the aligning came up when I explained my definition of “uncontrollable AI” to some people and realized that the definition contained this ambiguity.

Fig. 1: Possible alignment scenarios

Given the remarkable persuasive power that some current AIs already exhibit, it doesn’t seem impossible to me that a sufficiently powerful AI could align our values with its goals, instead of the other way round (I describe a somewhat similar scenario in my novel Virtua). By definition, if it achieved that, it wouldn’t be misaligned.

This raises two questions:

1) If, for some reason, we all truly wanted to be turned into paperclips (or otherwise willingly destroy our future), would that be a bad thing? If so, why?

2) If we don’t want our goals to be warped by a powerful, extremely convincing AI, how can we define them so that this can’t happen, while avoiding a permanent “value lock-in” that we or our descendants might later regret?

My personal answer to these questions is: let’s not build an AI that is smart and convincing enough to make us wish we were paperclips.