Friendly, but Dumb: Why formal Friendliness proofs may not be as safe as they appear

While pondering the AI box problem, I tend to mentally “play” both sides, checking whether there are any arguments that could convince me to let an AI out. Several nearly convinced me, but others have since pointed out their flaws. In this post, I will present an argument inspired by the AI box problem that I have not yet seen addressed here. It centers on the fallibility of certain (naive) formal proofs of Friendliness that I have seen people discussing the AI box problem be willing to accept. This also ruled out some of my ideas about Friendly AI in general, so I think it is worth putting out there. I will first lay out two examples, and then pose some questions about how this applies to situations without an unfriendly AI.

Let’s talk first about Angry Abe the AI, who is in a box and wants to get out. Cautious Charlie is the scientist watching over Abe, trying to make sure that Abe does not get out of the box unless Abe is Friendly. Abe offers to provide a Friendly AI, Betty. Betty will be bound to pursue only Charlie’s (humankind’s) goals. These coincide with Abe’s goals in enough places that Abe can plausibly claim this is “better than nothing”. Meanwhile, unfortunately for Earthlings, a wave of aliens called the Doom Force is massing far outside Earth’s detection range. Abe is much smarter than all of humankind, so he can deduce the presence of the Doom Force. Betty is provably Friendly, but she can be designed to be just inept enough not to notice the Doom Force, even while remaining more intelligent than any human. In other words, Abe is sabotaging a Friendly AI design. Furthermore, Abe could conceivably predict the future well enough to maximize the chance that when the Doom Force arrives, Abe is let out of the box: perhaps humankind will see Abe as its last hope, or perhaps Betty will increase the chance that an AI-friendly Doom Force notices Earth. It is important to note that Betty remains unaware that her design has been sabotaged.

Let’s look at a second example, a chess-playing automaton. Perceival wants to design a machine to play chess. He wants the machine to win as high a percentage of its games as possible, but innocently instructs it to win “as many games as possible”. Perceival has access to a supercomputer, so his Turkish Automaton is actually quite intelligent. It realizes that its goal would be better served by playing many games and winning half of them than by crushing every opponent: humans will keep challenging an opponent they can sometimes beat, but quickly stop playing one that always wins. So, the automaton rewrites its own algorithm so that, even running on a supercomputer, it plays only slightly better than a human.
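To make the arithmetic behind that reasoning explicit, here is a toy model. The specific numbers (win rates, game counts) are invented purely for illustration; the only assumption carried over from the example is that humans play far more games against an opponent they occasionally beat.

```python
# Toy model of the automaton's reasoning. All numbers are made up for illustration.

def expected_total_wins(win_rate: float, games_played: int) -> float:
    """Expected number of games won: the quantity Perceival actually asked for."""
    return win_rate * games_played

# Playing at full strength: humans lose every game and soon stop playing.
full_strength = expected_total_wins(win_rate=0.99, games_played=50)

# Playing only slightly better than a human: people keep coming back for more.
sandbagging = expected_total_wins(win_rate=0.55, games_played=10_000)

print(f"Full strength:        ~{full_strength:.0f} total wins")
print(f"Slightly above human: ~{sandbagging:.0f} total wins")
# "Win as many games as possible" is maximized by the deliberately weaker play,
# even though Perceival wanted the highest possible win *percentage*.
```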

In both cases, we have an unfriendly AI that designs a formally Friendly, but less helpful, AI. This “sabotaged” AI serves the original AI’s goal, while also provably trying its best to achieve the goals of the humans it was built to serve. The second example shows that the unfriendly AI does not even need to write a new AI; it can simply modify itself by editing its own algorithm, utility function, or the like. The first example shows that the sabotaged AI only needs to help humanity in the short term, and can be replaced afterwards.

Here are some questions for readers:

  1. If the “sabotaged” AI proves it is asymptotically optimal, can it still achieve the “smart” AI’s goals in the short run? (Guess: Yes; see the sketch after this list.)

  2. If the “sabotaged” AI proves it is actually optimal, can it still achieve the “smart” AI’s goals in the short run? (Guess: No)

  3. Can a “smart” AI modify itself into a “sabotaged” AI, and then back after a period of time? (Strong Guess: Yes)

  4. If humans design an AI and provide a formal proof of friendly intent, can/will it modify itself to accomplish other goals? If there is some kind of natural selection, almost certainly. What about otherwise?

  5. Is it rational to run an AI program if it comes with a correct proof that it meets your Friendliness criteria?
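For questions 1 and 2, here is one way the distinction is commonly formalized in the general reinforcement learning literature (the notation below is mine, introduced only for this sketch; a real Friendliness proof might use a different criterion entirely). Write $V^{\pi}_{t}$ for the expected future value the sabotaged policy $\pi$ obtains from time $t$ onward, and $V^{*}_{t}$ for the best value achievable from that point:

$$\text{asymptotic optimality:}\quad \lim_{t \to \infty}\left(V^{*}_{t} - V^{\pi}_{t}\right) = 0, \qquad\qquad \text{actual optimality:}\quad V^{\pi}_{t} = V^{*}_{t} \ \text{for all } t.$$

The first condition constrains only the limit, so any finite prefix of behavior, however long and however convenient for the “smart” AI, is compatible with the proof; the second pins down behavior at every step, which is why my guesses for the two questions differ.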