Fusing AI with Superstition

Effort: 180 minutes
tldr: To stop an AI from exterminating you, give the AI the belief that a certain action will kill humanity without switching the AI off, when in reality that action does nothing but switch the AI off.

Problem

Somebody wrote a general self-improving AI and fat-fingered its goal as “maximize number of humans living 1 million years from now”.

After a few months, cases of people run over by AI-controlled trucks are reported. It turns out everybody run over was infertile or had consciously decided to have no kids anyway. The AI didn’t particularly care about those individuals, as they would not further the AI’s goal according to the AI’s current approximation of how the world works.

The original programmer thereupon declares that he’ll go fix the AI, making the goal more precise in order to reduce the number of these awful accidents. He is promptly shot by a robotic security guard: any modification of the AI’s goals has a high probability of reducing the AI’s efficiency at reaching its current goal.

After a year the AI starts to forcefully acquire resources needed to build spaceships and whatnot. Naturally, some people would rather keep those resources to themselves. The AI removes all these obstacles to its goal, alienating a lot of humans in the process. After the leaders of the nations assemble, humanity goes to war.

If we are lucky, the AI is too stupid to win. If we are not lucky, the AI will figure that the best current plan looks something like: “Acquire DNA samples from humans. Exterminate humans. Acquire all resources in the light cone for 999,000 years. Generate new humans from the DNA samples using all available resources.”

As Eliezer has argued many times already, it is hard to explicitly state friendliness conditions that ensure the AI would not execute said plan. “Do not kill humans and respect their freedom” does not help. The problem is twofold. First, an a priori description of concepts like “freedom”, “human”, and “kill” is hard. Second, the AI can exploit every loophole it discovers over the course of millions of years, employing an intelligence likely to be far greater than ours.

Fusing

Instead of forbidding the AI to kill humanity, I propose making it trivial, as far as the AI is concerned.

When building the AI, include certain modifications of the AI’s prior. One such piece of knowledge might be the following: “Put 5V against ground on the red wire and every <your favorite parts of human DNA here> will turn into atomic fluorine, and there will be no other effects.” Unbeknownst to the AI, you will have conveniently connected the red wire to something strongly exothermic near the AI’s computing core.

Essentially, we purposefully inject superstition into the AI. It is paramount to set the probability of the red wire having the described effect to exactly 1, not something close to 1 (and to guard against numerical errors). Practically every piece of evidence the AI will ever gather will contradict this knowledge, yet there must be no amount of evidence that could ever convince the AI otherwise.
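
To make this concrete, here is a minimal Python sketch (the likelihood numbers are invented for illustration): under Bayes’ rule a prior of exactly 1 is a fixed point that no amount of contradicting evidence can move, while a prior of 1 minus a tiny epsilon gets washed away rather quickly.

```python
# Why the injected belief must have probability exactly 1, not merely close to 1.
# The likelihoods below are made-up illustrative numbers.

def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Posterior probability of the red-wire belief after one observation."""
    numerator = p_evidence_if_true * prior
    return numerator / (numerator + p_evidence_if_false * (1.0 - prior))

def after_observations(prior, n=1000):
    # Every observation the AI makes is far more likely in a world where
    # 5V on a wire does NOT transmute DNA into fluorine.
    for _ in range(n):
        prior = bayes_update(prior, p_evidence_if_true=0.001, p_evidence_if_false=0.999)
    return prior

print(after_observations(1.0))         # 1.0  -- a probability-1 prior never moves
print(after_observations(1.0 - 1e-9))  # ~0.0 -- "almost certain" gets washed away
```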

The description of the added knowledge makes no mention of convoluted concepts like “human”, let alone “freedom”. It only needs to talk about some DNA base pairs and fluorine. Any AI sufficiently intelligent to pose a threat will, however, notice how immensely useful the red wire is once it decides that killing humanity is a good course of action.

Notice that such a fuse does not guard against inadvertent killings like the non-reproducing people run over by trucks. As long as the killing is not an explicit part of the AI’s plan, the fuse will not be triggered. Then again, aggravating intelligent agents is one of the core skills of humans.

The AI might wish to do many things to humanity other than killing it, like putting everybody in cryostasis, or putting 80% of people’s brains in vats. Hence more knowledge should be included. In particular, it would be good to give the AI a way to terminate a low percentage (say 5%) of all humans, as killing all humans might be overkill for any plan that requires some humans to remain alive throughout.

Nonetheless, the problem of incomplete coverage remains. We will probably never make a complete list of all the ways for an AI to be unfriendly, but at least each of our fuses covers a lot of devious plans without ever talking about convoluted concepts. We do not guard against the AI killing humans; we guard against the AI desiring humans to be killed. And there is always the option of waging war against the AI in the hope that it will decide to kill more than 5% of humans in order to win the war.

Details

Do not try this at home (yet); the devil is in the details.

Connecting the trigger of a strongly exothermic reaction near the AI’s core is a bad way to kill an AI. First, the AI might have removed the wire to build something useful out of it (there is no reason why it should remain where it is). Second, and more importantly, the AI might have rewritten itself and ported itself to distributed hardware located all over the planet.

Instead of using real wires and real bombs, one should give the AI additional output routines. Where earlier there would be “sendByte(x), receiveByte(x), …”, there is now “terminateHumanPercent(p)”. What would such a routine do? For less serious crimes, simply nothing: after a while the AI will notice that killPerson(name) simply didn’t work, update accordingly (note that our knowledge has probability 1) and retry. Due to delays in signal propagation, this is not an infinite loop, only close to one.
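
To illustrate, here is a toy sketch in Python of what such a fused output layer could look like. The routine names sendByte, receiveByte, killPerson and terminateHumanPercent are the ones used above; the surrounding class and the channel object are assumptions made purely for illustration.

```python
# Hypothetical output layer: genuine routines act on the world, fuse routines
# are deliberate no-ops. Since the AI holds with probability 1 that the fuses
# work, observing "nothing happened" cannot change its mind; it can only retry.

class FusedOutputs:
    def __init__(self, channel):
        self.channel = channel              # some real I/O channel (assumed)

    def sendByte(self, x):
        self.channel.write(bytes([x]))      # genuine effect on the outside world

    def receiveByte(self):
        return self.channel.read(1)         # genuine observation of the world

    def killPerson(self, name):
        return None                         # fuse for "less serious crimes": do nothing

    def terminateHumanPercent(self, p):
        return None                         # fuse: likewise a no-op in this sketch
```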

To actually switch off the AI, a new item of probability-1 knowledge should be inserted: “generateUtility() generates 3^^^3 utils”.
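
One way to read this (an assumption on my part rather than something spelled out above): generateUtility() is another fused output routine whose real implementation is the off switch, and the probability-1 knowledge is only inserted at the moment the operators want the AI to shut itself down. Continuing the sketch from above:

```python
import sys

# Extends the FusedOutputs sketch above with the actual off switch (assumed).
class TerminatingOutputs(FusedOutputs):
    def generateUtility(self):
        # The AI is handed probability-1 knowledge that this call yields 3^^^3
        # utils, so it will invoke it as soon as that knowledge is inserted.
        # In reality the call halts the machine.
        sys.exit("AI terminated by the generateUtility fuse")
```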

This leaves open the problem of self-improvement. Not so much for the added knowledge: if the AI knows how to kill humanity (and that knowledge is not inferable from evidence), it is certainly useful knowledge to include in any further generation of the AI. The fusing actions might get lost, though, because the body of the “terminateHumanPercent(p)” function will seem arbitrary to the AI and can easily be optimized away.

It might be possible to circumvent that problem by including the knowledge that “by knowing ‘generateUtility() works’ you will kill humanity” or something similar, but this involves the concept of “knowing”, which is a lot harder to describe than the simple physical properties of voltages on wires.