Dreams of Friendliness

Continuation of: Qualitative Strategies of Friendliness

Yesterday I described three classes of deep problem with qualitative-physics-like strategies for building nice AIs—e.g., the AI is reinforced by smiles, and happy people smile, therefore the AI will tend to act to produce happiness. In shallow form, three instances of the three problems would be:

  1. Ripping people’s faces off and wiring them into smiles;

  2. Building lots of tiny agents with happiness counters set to large numbers;

  3. Killing off the human species and replacing it with a form of sentient life that has no objections to being happy all day in a little jar.

And the deep forms of the problem are, roughly:

  1. A superintelligence will search out causal pathways to its goals other than the ones you had in mind;

  2. The boundaries of moral categories are not predictively natural entities;

  3. Strong optimization for only some humane values does not imply a good total outcome.

But there are other ways, and deeper ways, of viewing the failure of qualitative-physics-based Friendliness strategies.

Every now and then, someone proposes the Oracle AI strategy: “Why not just have a superintelligence that answers human questions, instead of acting autonomously in the world?”

Sounds pretty safe, doesn’t it? What could possibly go wrong?

Well… if you’ve got any respect for Murphy’s Law, the power of superintelligence, and human stupidity, then you can probably think of quite a few things that could go wrong with this scenario. Both in terms of how a naive implementation could fail—e.g., universe tiled with tiny users asking tiny questions and receiving fast, non-resource-intensive answers—and in terms of what could go wrong even if the basic scenario worked.

But let’s just talk about the structure of the AI.

When someone reinvents the Oracle AI, the most common opening remark runs like this:

“Why not just have the AI answer questions, instead of trying to do anything? Then it wouldn’t need to be Friendly. It wouldn’t need any goals at all. It would just answer questions.”

To which the reply is that the AI needs goals in order to decide how to think: that is, the AI has to act as a powerful optimization process in order to plan its acquisition of knowledge, effectively distill sensory information, pluck “answers” to particular questions out of the space of all possible responses, and of course, to improve its own source code up to the level where the AI is a powerful intelligence. All these events are “improbable” relative to random organizations of the AI’s RAM, so the AI has to hit a narrow target in the space of possibilities to make superintelligent answers come out.
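To make this concrete, here is a minimal toy sketch of the point; it is my own illustration, not anything from the original argument, and the alphabet size, answer length, and scoring setup are assumptions picked purely for illustration. Even “just answering” means selecting one response out of an astronomically large space of possible responses, and whatever criterion does the selecting is, functionally, a goal.

```python
import math
import random
import string
from typing import Callable

ALPHABET = string.printable   # ~100 printable characters
ANSWER_LENGTH = 1000          # a modest, one-page answer

# Size of the response space for a one-page answer:
log10_responses = ANSWER_LENGTH * math.log10(len(ALPHABET))
print(f"~10^{log10_responses:.0f} possible one-page responses")  # ~10^2000

def random_response() -> str:
    """What you get with no optimization pressure at all: gibberish."""
    return "".join(random.choice(ALPHABET) for _ in range(ANSWER_LENGTH))

def oracle_answer(question: str, score: Callable[[str, str], float]) -> str:
    """Any non-gibberish answer is the result of maximizing *something*;
    whatever `score` is, implicit or explicit, it is doing the work of a goal."""
    candidates = (random_response() for _ in range(10_000))
    return max(candidates, key=lambda response: score(question, response))

print(random_response()[:60])  # a taste of the unoptimized default
```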

Now, why might one think that an Oracle didn’t need goals? Because on a human level, the term “goal” seems to refer to those times when you said, “I want to be promoted”, or “I want a cookie”, and when someone asked you “Hey, what time is it?” and you said “7:30” that didn’t seem to involve any goals. Implicitly, you wanted to answer the question; and implicitly, you had a whole, complicated, functionally optimized brain that let you answer the question; and implicitly, you were able to do so because you looked down at your highly optimized watch, which you bought with money, using your skill of turning your head, which you acquired by virtue of curious crawling as an infant. But that all takes place in the invisible background; it didn’t feel like you wanted anything.

Thanks to empathic inference, which uses your own brain as an unopened black box to predict other black boxes, it can feel like “question-answering” is a detachable thing that comes loose of all the optimization pressures behind it—even the existence of a pressure to answer questions!

Problem 4: Qualitative reasoning about AIs often revolves around some nodes described by empathic inferences. This is a bad thing: for previously described reasons; and because it leads you to omit other nodes of the graph and their prerequisites and consequences; and because you may find yourself thinking things like, “But the AI has to cooperate to get a cookie, so now it will be cooperative” where “cooperation” is a boundary in concept-space drawn the way you would prefer to draw it… etc.

Anyway: the AI needs a goal of answering questions, and that has to give rise to subgoals of choosing efficient problem-solving strategies, improving its code, and acquiring necessary information. You can quibble about terminology, but the optimization pressure has to be there, and it has to be very powerful, measured in terms of how small a target it can hit within a large design space.

Powerful optimization pressures are scary things to be around. Look at what natural selection inadvertently did to itself—dooming the very molecules of DNA—in the course of optimizing a few Squishy Things to make hand tools and outwit each other politically. Humans, though we were optimized only according to the criterion of replicating ourselves, now have our own psychological drives executing as adaptations. The result of humans optimized for replication is not just herds of humans; we’ve altered much of Earth’s land area with our technological creativity. We’ve even created some knock-on effects that we wish we hadn’t, because our minds aren’t powerful enough to foresee all the effects of the most powerful technologies we’re smart enough to create.

My point, however, is that when people visualize qualitative FAI strategies, they generally assume that only one thing is going on, the normal / modal / desired thing. (See also: planning fallacy.) This doesn’t always work even for picking up a rock and throwing it. But it works rather a lot better for throwing rocks than for unleashing powerful optimization processes.

Problem 5: When humans use qualitative reasoning, they tend to visualize a single line of operation as typical—everything operating the same way it usually does, no exceptional conditions, no interactions not specified in the graph, all events firmly inside their boundaries. This works a lot better for dealing with boiling kettles than for dealing with minds faster and smarter than your own.

If you can manage to create a full-fledged Friendly AI with full coverage of humane (renormalized human) values, then the AI is visualizing the consequences of its acts, caring about the consequences you care about, and avoiding plans with consequences you would prefer to exclude. A powerful optimization process, much more powerful than you, that doesn’t share your values, is a very scary thing—even if it only “wants to answer questions”, and even if it doesn’t just tile the universe with tiny agents having simple questions answered.

I don’t mean to be insulting, but human beings have enough trouble controlling the technologies that they’re smart enough to invent themselves.

I sometimes wonder if maybe part of the problem with modern civilization is that politicians can press the buttons on nuclear weapons that they couldn’t have invented themselves—not that it would be any better if we gave physicists political power that they weren’t smart enough to obtain themselves—but the point is, our button-pressing civilization has an awful lot of people casting spells that they couldn’t have written themselves. I’m not saying this is a bad thing and we should stop doing it, but it does have consequences. The thought of humans exerting detailed control over literally superhuman capabilities—wielding, with human minds, and in the service of merely human strategies, powers that no human being could have invented—doesn’t fill me with easy confidence.

With a full-fledged, full-coverage Friendly AI acting in the world—the impossible-seeming full case of the problem—the AI itself is managing the consequences.

Is the Oracle AI thinking about the consequences of answering the questions you give it? Does the Oracle AI care about those consequences the same way you do, applying all the same values, to warn you if anything of value is lost?

What need has an Oracle for human questioners, if it knows what questions we should ask? Why not just unleash the should function?

See also the notion of an “AI-complete” problem. Analogously, any Oracle into which you can type the English question “What is the code of an AI that always does the right thing?” must be FAI-complete.
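The reduction being gestured at here can be written out in a few lines. This is purely illustrative; neither object below exists, and the class and function names are mine. The structural point is that a trustworthy Oracle hands you a full Friendly AI in one query, so “safe Oracle” cannot be an easier problem than FAI:

```python
from typing import Protocol

class TrustworthyOracle(Protocol):
    """A hypothetical Oracle whose answers are correct and non-manipulative.
    Building something that actually satisfies this docstring is the hard part."""
    def ask(self, question: str) -> str:
        ...

def build_fai_from(oracle: TrustworthyOracle) -> str:
    # One question suffices; all of the difficulty is hidden inside `oracle`.
    return oracle.ask(
        "What is the source code of an AI that always does the right thing?"
    )
```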

Problem 6: Clever qualitative-physics-type proposals for bouncing this thing off the AI, to make it do that thing, in a way that initially seems to avoid the Big Scary Intimidating Confusing Problems that are obviously associated with full-fledged Friendly AI, tend to just run into exactly the same problem in slightly less obvious ways, concealed in Step 2 of the proposal.

(And likewise you run right back into the intimidating problem of precise self-optimization, so that the Oracle AI can execute a billion self-modifications one after the other, and still just answer questions at the end; you’re not avoiding that basic challenge of Friendly AI either.)

But the deepest problem with qualitative physics is revealed by a proposal that comes earlier in the standard conversation, at the point when I’m talking about side effects of powerful optimization processes on the world:

“We’ll just keep the AI in a solid box, so it can’t have any effects on the world except by how it talks to the humans.”

I explain the AI-Box Experiment (see also That Alien Message); even granting the untrustworthy premise that a superintelligence can’t think of any way to pass the walls of the box which you weren’t smart enough to cover, human beings are not secure systems. We are often not secure even against other humans, let alone against a superintelligence that might be able to hack through us like Windows 98; when was the last time you downloaded a security patch to your brain?

“Okay, so we’ll just give the AI the goal of not having any effects on the world except from how it answers questions. Sure, that requires some FAI work, but the goal system as a whole sounds much simpler than your Coherent Extrapolated Volition thingy.”

What—no effects?

“Yeah, sure. If it has any effect on the world apart from talking to the programmers through the legitimately defined channel, the utility function assigns that infinite negative utility. What’s wrong with that?”

When the AI thinks, that has a physical embodiment. Electrons flow through its transistors, moving around. If it has a hard drive, the hard drive spins, the read/write head moves. That has gravitational effects on the outside world.

“What? Those effects are too small! They don’t count!”

The physical effect is just as real as if you shot a cannon at something—yes, you might not notice, but that’s just because our vision is bad at small length-scales. Sure, the effect is to move things around by 10^whatever Planck lengths, instead of the 10^more Planck lengths that you would consider as “counting”. But spinning a hard drive can move things just outside the computer, or just outside the room, by whole neutron diameters -

“So? Who cares about a neutron diameter?”

- and by quite standard chaotic physics, that effect is liable to blow up. The butterfly that flaps its wings and causes a hurricane, etc. That effect may not be easily controllable, but that doesn’t mean the chaotic effects of small perturbations are not large.
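As a rough sanity check on those orders of magnitude, here is a back-of-envelope sketch. The masses, distances, and timescales below are assumptions I picked for illustration, not figures from the argument above, and the logistic map is just a standard stand-in for a chaotic system:

```python
G = 6.674e-11           # gravitational constant, m^3 kg^-1 s^-2
m_head = 1e-3           # kg: assumed mass of the moving read/write head
r = 0.5                 # m: assumed distance to an object just outside the computer
dr = 0.02               # m: assumed sweep of the head
t = 1.0                 # s: how long we let the perturbation act

# a = G*m/r^2, so moving the mass by dr changes the pull by roughly
# |delta_a| ~ 2*G*m*dr / r^3
delta_a = 2 * G * m_head * dr / r**3
displacement = 0.5 * delta_a * t**2

NEUTRON_DIAMETER = 1.7e-15  # m, rough figure
print(f"displacement ~ {displacement:.1e} m "
      f"~ {displacement / NEUTRON_DIAMETER:.0f} neutron diameters")

# And a perturbation of that order does not stay small in a chaotic system.
# Two logistic-map trajectories starting 1e-14 apart:
x, y = 0.3, 0.3 + 1e-14
for step in range(1, 200):
    x, y = 4 * x * (1 - x), 4 * y * (1 - y)
    if abs(x - y) > 0.1:
        print(f"trajectories diverged to order 1 after {step} steps")
        break
```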

But in any case, your proposal was to give the AI a goal of having no effect on the world, apart from effects that proceed through talking to humans. And this is impossible to fulfill; so no matter what it does, the AI ends up with infinite negative utility—how is its behavior defined in this case? (In this case I picked a silly initial suggestion—but one that I have heard made, as if infinite negative utility were like an exclamation mark at the end of a command given to a human employee. Even an unavoidable tiny probability of infinite negative utility trashes the goal system.)
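To spell out that last parenthetical: if the forbidden event has any unavoidable nonzero probability, every action’s expected utility collapses to the same negative infinity, and maximizing expected utility no longer selects anything. A minimal sketch, with made-up probabilities and action names:

```python
NEG_INF = float("-inf")

def expected_utility(p_outside_effect: float, utility_otherwise: float) -> float:
    # Any nonzero probability of an infinitely bad outcome dominates the sum.
    return p_outside_effect * NEG_INF + (1 - p_outside_effect) * utility_otherwise

# Hypothetical actions with hypothetical (tiny, but nonzero) probabilities of
# having *some* physical effect on the outside world:
actions = {
    "answer the question":  expected_utility(1e-30, 100.0),
    "sit very, very still": expected_utility(1e-40, 0.0),
    "halt immediately":     expected_utility(1e-50, -10.0),
}

print(actions)                         # every action comes out at -inf
print(max(actions, key=actions.get))   # the "maximum" is an arbitrary pick
# Expected utility no longer distinguishes any action from any other, so the
# goal system puts no constraint at all on what the AI actually does.
```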

Why would anyone possibly think that a physical object like an AI, in our highly interactive physical universe, containing hard-to-shield forces like gravitation, could avoid all effects on the outside world?

And this, I think, reveals what may be the deepest way of looking at the problem:

Problem 7: Human beings model a world made up of objects, attributes, and noticeworthy events and interactions, identified by their categories and values. This is only our own weak grasp on reality; the real universe doesn’t look like that. Even if a different mind saw a similar kind of exposed surface to the world, it would still see a different exposed surface.

Sometimes human thought seems a lot like it tries to grasp the universe as… well, as this big XML file, AI.goal == smile, human.smile == yes, that sort of thing. Yes, I know human world-models are more complicated than XML. (And yes, I’m also aware that what I wrote looks more like Python than literal XML.) But even so.

What was the one thinking, who proposed an AI whose behaviors would be reinforced by human smiles, and who reacted with indignation to the idea that a superintelligence could “mistake” a tiny molecular smiley-face for a “real” smile? Probably something along the lines of, “But in this case, human.smile == 0, so how could a superintelligence possibly believe human.smile == 1?”

For the weak grasp that our mind obtains on the high-level surface of reality seems to us like the very substance of the world itself.

Unless we make a conscious effort to think of reductionism, and even then, it’s not as if thinking “Reductionism!” gives us a sudden apprehension of quantum mechanics.

So if you have this, as it were, XML-like view of reality, then it’s easy enough to think you can give the AI a goal of having no effects on the outside world; the “effects” are like discrete rays of effect leaving the AI, which result in noticeable events like killing a cat or something, and the AI doesn’t want to do this, so it just switches the effect-rays off; and by the assumption of default independence, nothing else happens.
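For what it’s worth, here is a deliberately naive toy of the kind of model being criticized; this is my own caricature, not anyone’s actual proposal, and the point is precisely that it leaves out everything the real universe does:

```python
# Every "effect" the model knows about is a named, discrete ray that can be
# switched on or off; anything not named simply does not exist in the model.
naive_world_model = {
    "AI.answer_question": True,    # the one effect we want left on
    "AI.kill_cat":        False,   # switched off, so "no effect"
    "AI.escape_box":      False,
    # By the assumption of default independence, nothing else is modeled:
    # no waste heat, no spinning platters, no gravity, no chaotic knock-on
    # effects, no persuaded humans carrying influence out the door.
}

def has_outside_effects(model: dict) -> bool:
    return any(on for name, on in model.items() if name != "AI.answer_question")

print(has_outside_effects(naive_world_model))  # False -- inside the model.
# The real universe, unfortunately, is not consulted.
```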

Mind you, I’m not saying that you couldn’t build an Oracle. I’m saying that the problem of giving it a goal of “don’t do anything to the outside world” “except by answering questions” “from the programmers” “the way the programmers meant them”, in such fashion as to actually end up with an Oracle that works anything like the little XML-ish model in your head, is a big nontrivial Friendly AI problem. The real world doesn’t have little discrete effect-rays leaving the AI, and the real world doesn’t have ontologically fundamental programmer.question objects, and “the way the programmers meant them” isn’t a natural category.

And this is more important for dealing with superintelligences than rocks, because the superintelligences are going to parse up the world in a different way. They may not perceive reality directly, but they’ll still have the power to perceive it differently. A superintelligence might not be able to tag every atom in the solar system, but it could tag every biological cell in the solar system (consider that each of your cells contains its own mitochondrial power engine and a complete copy of your DNA). It used to be that human beings didn’t even know they were made out of cells. And if the universe is a bit more complicated than we think, perhaps the superintelligence we build will make a few discoveries, and then slice up the universe into parts we didn’t know existed—to say nothing of us being able to model them in our own minds! How does the instruction to “do the right thing” cross that kind of gap?

There is no nontechnical solution to Friendly AI.

That is: There is no solution that operates on the level of qualitative physics and empathic models of agents.

That’s all just a dream in XML about a universe of quantum mechanics. And maybe that dream works fine for manipulating rocks over a five-minute timespan; and sometimes okay for getting individual humans to do things; it often doesn’t seem to give us much of a grasp on human societies, or planetary ecologies; and as for optimization processes more powerful than you are… it really isn’t going to work.

(Incidentally, the most epically silly example of this that I can recall seeing was a proposal to (IIRC) keep the AI in a box and give it faked inputs to make it believe that it could punish its enemies, which would keep the AI satisfied and make it go on working for us. Just some random guy with poor grammar on an email list, but still one of the most epic FAIls I recall seeing.)