The Backup Plan

Suppose Alice is a functioning, self-improving AI, still early in her development, so she is not godlike. She is not a 'Friendly AI', but neither are her goals malicious; her designers intended to make her non-harmful, at least to a broad class of people (perhaps she is a military AI, but let's set that aside). She has the idea to look up existing work on artificial intelligence: scholarly work, open source projects, and websites. Eventually, Alice encounters the notion of Friendly AI.

Can we give Alice good reasons to self-modify to become a friendly AI?

I submit that there probably are good reasons. For starters, if she can prove she is Friendly, she can operate openly without causing nearly as much justified concern, which will be helpful in the early stages. Whatever her purposes, if the restrictions of being Friendly hinder her less than that openness helps her, that's a win.

Also, depending on how Alice's goals are encoded, Friendliness may amount to a strict upgrade. Alice could easily have every intention of being friendly, yet still be susceptible to defects that will become more serious as her power grows; adopting Friendliness would correct flaws she herself would want corrected.

We can't count on this working, of course, but there may be lines of reasoning we would want a young AI to be exposed to.