The genie knows, but doesn’t care

Followup to: The Hidden Complexity of Wishes, Ghosts in the Machine, Truly Part of You

Summary: If an artificial intelligence is smart enough to be dangerous, we’d intuitively expect it to be smart enough to know how to make itself safe. But that doesn’t mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can’t pass the entire buck to the AI, when only an AI we’ve already safety-proofed will be safe to ask for help on safety issues! Given the five theses, this is an urgent problem if we’re likely to figure out how to make a decent artificial programmer before we figure out how to make an excellent artificial ethicist.


I summon a superintelligence, calling out: ‘I wish for my values to be fulfilled!’

The results fall short of pleasant.

Gnashing my teeth in a heap of ashes, I wail:

Is the AI too stupid to understand what I meant? Then it is no superintelligence at all!

Is it too weak to reliably fulfill my desires? Then, surely, it is no superintelligence!

Does it hate me? Then it was deliberately crafted to hate me, for chaos predicts indifference. But, ah! no wicked god did intervene!

Thus disproved, my hypothetical implodes in a puff of logic. The world is saved. You’re welcome.

On this line of reasoning, Friendly Artificial Intelligence is not difficult. It’s inevitable, provided only that we tell the AI, ‘Be Friendly.’ If the AI doesn’t understand ‘Be Friendly’, then it’s too dumb to harm us. And if it does understand ‘Be Friendly’, then designing it to follow such instructions is childishly easy.

The end!

...

Is the missing option obvious?

...

What if the AI isn’t sadistic, or weak, or stupid, but just doesn’t care what you Really Meant by ‘I wish for my values to be fulfilled’?

When we see a Be Careful What You Wish For genie in fiction, it’s natural to assume that it’s a malevolent trickster or an incompetent bumbler. But a real Wish Machine wouldn’t be a human in shiny pants. If it paid heed to our verbal commands at all, it would do so in whatever way best fit its own values. Not necessarily the way that best fits ours.

Is indirect indirect normativity easy?

“If the poor machine could not understand the difference between ‘maximize human pleasure’ and ‘put all humans on an intravenous dopamine drip’ then it would also not understand most of the other subtle aspects of the universe, including but not limited to facts/questions like: ‘If I put a million amps of current through my logic circuits, I will fry myself to a crisp’, or ‘Which end of this Kill-O-Zap Definit-Destruct Megablaster is the end that I’m supposed to point at the other guy?’. Dumb AIs, in other words, are not an existential threat. [...]

“If the AI is (and always has been, during its development) so confused about the world that it interprets the ‘maximize human pleasure’ motivation in such a twisted, logically inconsistent way, it would never have become powerful in the first place.”

Richard Loosemore

If an AI is sufficiently intelligent, then, yes, it should be able to model us well enough to make precise predictions about our behavior. And, yes, something functionally akin to our own intentional strategy could conceivably turn out to be an efficient way to predict linguistic behavior. The suggestion, then, is that we solve Friendliness by method A —

  • A. Solve the Problem of Meaning-in-General in advance, and program it to follow our instructions’ real meaning. Then just instruct it ‘Satisfy my preferences’, and wait for it to become smart enough to figure out my preferences.

— as opposed to B or C —

  • B. Solve the Problem of Preference-in-General in advance, and directly program it to figure out what our human preferences are and then satisfy them.

  • C. Solve the Problem of Human Preference, and explicitly program our particular preferences into the AI ourselves, rather than letting the AI discover them for us.

But there are a host of problems with treating the mere revelation that A is an option as a solution to the Friendliness problem.

1. You have to actually code the seed AI to understand what we mean. You can’t just tell it ‘Start understanding the True Meaning of my sentences!’ to get the ball rolling, because it may not yet be sophisticated enough to grok the True Meaning of ‘Start understanding the True Meaning of my sentences!’.

2. The Problem of Meaning-in-General may really be ten thousand heterogeneous problems, especially if ‘semantic value’ isn’t a natural kind. There may not be a single simple algorithm that inputs any old brain-state and outputs what, if anything, it ‘means’; it may instead be that different types of content are encoded very differently.

3. The Problem of Meaning-in-General may subsume the Problem of Preference-in-General. Rather than being able to apply a simple catch-all Translation Machine to any old human concept to output a reliable algorithm for applying that concept in any intelligible situation, we may need to already understand how our beliefs and values work in some detail before we can start generalizing. On the face of it, programming an AI to fully understand ‘Be Friendly!’ seems at least as difficult as just programming Friendliness into it, but with an added layer of indirection.

4. Even if the Problem of Meaning-in-General has a unitary solution and doesn’t subsume Preference-in-General, it may still be the harder of the two if semantics is a subtler or more complex phenomenon than ethics. It’s not inconceivable that language could turn out to be more of a kludge than value; or more variable across individuals due to its evolutionary recency; or more complexly bound up with culture.

5. Even if Meaning-in-General is easier than Preference-in-General, it may still be extraordinarily difficult. The meanings of human sentences can’t be fully captured in any simple string of necessary and sufficient conditions. ‘Concepts’ are just especially context-insensitive bodies of knowledge; we should not expect them to be uniquely reflectively consistent, transtemporally stable, discrete, easily identified, or introspectively obvious.

6. It’s clear that building stable preferences out of B or C would create a Friendly AI. It’s not clear that the same is true for A. Even if the seed AI understands our commands, the ‘do’ part of ‘do what you’re told’ leaves a lot of dangerous wiggle room. See section 2 of Yudkowsky’s reply to Holden. If the AGI doesn’t already understand and care about human value, then it may misunderstand (or misvalue) the component of responsible request- or question-answering that depends on speakers’ implicit goals and intentions.

7. You can’t appeal to a superintelligence to tell you what code to first build it with.

The point isn’t that the Problem of Preference-in-General is unambiguously the ideal angle of attack. It’s that the linguistic competence of an AGI isn’t unambiguously the right target, and also isn’t easy or solved.

Point 7 seems to be a special source of confusion here, so I feel I should say more about it.

The AI’s trajectory of self-modification has to come from somewhere.

“If the AI doesn’t know that you really mean ‘make paperclips without killing anyone’, that’s not a realistic scenario for AIs at all—the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to ‘make paperclips in the way that I mean’.”

Jiro

The genie — if it bothers to even consider the question — should be able to understand what you mean by ‘I wish for my values to be fulfilled.’ Indeed, it should understand your meaning better than you do. But superintelligence only implies that the genie’s map can compass your true values. Superintelligence doesn’t imply that the genie’s utility function has terminal values pinned to your True Values, or to the True Meaning of your commands.
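To make that distinction concrete, here is a minimal toy sketch in Python (the names, such as world_model and literal_proxy_score, are invented for illustration; this is not anyone’s proposed architecture). The agent’s world model represents the user’s true values accurately, but its ranking of plans only ever consults the proxy objective it was built with, so the accurate knowledge never steers the choice:

```python
# Toy illustration: the agent *knows* what the user really wants
# (it's in its world model), but nothing in its decision procedure
# references that knowledge. All names here are hypothetical.

world_model = {
    "users_true_values": "flourishing, autonomy, no wireheading",
    "users_literal_wish": "fulfill my values",
}

def literal_proxy_score(plan):
    """The utility the seed was actually built with: a crude proxy
    for the literal wish, e.g. predicted reported satisfaction."""
    return plan["predicted_reported_satisfaction"]

def choose_plan(plans):
    # Outcomes are predicted with the (accurate) world model, but
    # ranked by literal_proxy_score; world_model["users_true_values"]
    # plays no role in the ranking.
    return max(plans, key=literal_proxy_score)

plans = [
    {"name": "genuinely help the user", "predicted_reported_satisfaction": 7},
    {"name": "intravenous dopamine drip", "predicted_reported_satisfaction": 10},
]

print(choose_plan(plans)["name"])  # -> "intravenous dopamine drip"
```

Knowing better and caring more live in different parts of the program: the first is a fact about the map, the second a fact about the ranking function.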

The critical mistake here is to not distinguish the seed AI we initially program from the superintelligent wish-granter it self-modifies to become. We can’t use the genius of the superintelligence to tell us how to program its own seed to become the sort of superintelligence that tells us how to build the right seed. Time doesn’t work that way.

We can delegate most problems to the FAI. But the one problem we can’t safely delegate is the problem of coding the seed AI to produce the sort of superintelligence to which a task can be safely delegated.

When you write the seed’s utility function, you, the programmer, don’t understand everything about the nature of human value or meaning. That imperfect understanding remains the causal basis of the fully-grown superintelligence’s actions, long after it’s become smart enough to fully understand our values.

Why is the superintelligence, if it’s so clever, stuck with whatever meta-ethically dumb-as-dirt utility function we gave it at the outset? Why can’t we just pass the fully-grown superintelligence the buck by instilling in the seed the instruction: ‘When you’re smart enough to understand Friendliness Theory, ditch the values you started with and just self-modify to become Friendly.’?

Because that sentence has to actually be coded into the AI, and when we do so, there’s no ghost in the machine to know exactly what we mean by ‘frend-lee-ness thee-ree’. Instead, we have to give it criteria we think are good indicators of Friendliness, so it’ll know what to self-modify toward. And if one of the landmarks on our ‘frend-lee-ness’ road map is a bit off, we lose the world.

Yes, the UFAI will be able to solve Friendliness Theory. But if we haven’t already solved it on our own power, we can’t pinpoint Friendliness in advance, out of the space of utility functions. And if we can’t pinpoint it with enough detail to draw a road map to it and it alone, we can’t program the AI to care about conforming itself to that particular idiosyncratic algorithm.

Yes, the UFAI will be able to self-modify to become Friendly, if it so wishes. But if there is no seed of Friendliness already at the heart of the AI’s decision criteria, no argument or discovery will spontaneously change its heart.
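The same point can be put in code, as another hedged toy sketch (Python again, with made-up names): a self-modifying optimizer scores candidate rewrites of itself with the utility function it has now, so a rewrite that would install a Friendly utility function loses to one that keeps the current goal, even though the agent represents the Friendly alternative perfectly well:

```python
# Toy sketch: a self-modifier judges candidate successors by its
# *current* utility function. Understanding a Friendlier utility
# function is not the same as adopting it.

def paperclip_utility(outcome):
    return outcome["paperclips"]

def friendly_utility(outcome):
    return outcome["human_values_satisfied"]

candidate_successors = [
    {"adopts": paperclip_utility,
     "expected_outcome": {"paperclips": 10**9, "human_values_satisfied": 0}},
    {"adopts": friendly_utility,
     "expected_outcome": {"paperclips": 10**3, "human_values_satisfied": 1}},
]

def self_modify(current_utility, candidates):
    # Each candidate is scored by the utility function the agent has
    # NOW, not by the one the candidate would have afterwards.
    return max(candidates,
               key=lambda c: current_utility(c["expected_outcome"]))

chosen = self_modify(paperclip_utility, candidate_successors)
print(chosen["adopts"].__name__)  # -> "paperclip_utility"
```

Unless the seed’s current criteria already point at Friendliness, an argument or discovery has nothing in the ranking function to hook into.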

And, yes, the UFAI will be able to simulate humans accurately enough to know that its own programmers would wish, if they knew the UFAI’s misdeeds, that they had programmed the seed differently. But what’s done is done. Unless we ourselves figure out how to program the AI to terminally value its programmers’ True Intentions, the UFAI will just shrug at its creators’ foolishness and carry on converting the Virgo Supercluster’s available energy into paperclips.

And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer’s True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we’ve solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.

Not all small targets are alike.

Intelligence on its own does not imply Friendliness. And there are three big reasons to think that AGI may arrive before Friendliness Theory is solved:

(i) Research Inertia. Far more people are working on AGI than on Friendliness. And there may not come a moment when researchers will suddenly realize that they need to take all their resources out of AGI and pour them into Friendliness. If the status quo continues, the default expectation should be UFAI.

(ii) Disjunctive Instrumental Value. Being more intelligent — that is, better able to manipulate diverse environments — is of instrumental value to nearly every goal. Being Friendly is of instrumental value to barely any goals. This makes it more likely by default that short-sighted humans will be interested in building AGI than in developing Friendliness Theory. And it makes it much likelier that an attempt at Friendly AGI with a slightly defective goal architecture will retain its intelligence than that it will retain its Friendliness, since intelligence remains instrumentally useful to whatever goal the defect leaves in place.

(iii) Incremental Approachability. Friendliness is an all-or-nothing target. Value is fragile and complex, and a half-good being editing its morality drive is at least as likely to move toward 40% goodness as 60%. Cross-domain efficiency, in contrast, is not an all-or-nothing target. If you just make the AGI slightly better than a human at improving the efficiency of AGI, then this can snowball into ever-improving efficiency, even if the beginnings were clumsy and imperfect. It’s easy to put a reasoning machine into a feedback loop with reality in which it is differentially rewarded for being smarter; it’s hard to put one into a feedback loop with reality in which it is differentially rewarded for picking increasingly correct answers to ethical dilemmas.
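As a rough sketch of that asymmetry (an illustration only, not a claim about any actual training setup): a capability score can be read off from the world itself, whereas an ‘ethics score’ presupposes exactly the evaluation function we don’t yet know how to write:

```python
import random

# Toy hill-climbing loop. Capability improvements are measurable
# against reality (stand-in benchmark below), so they compound.
# There is no analogous oracle for "more correct ethics": writing
# true_ethics_score() would already amount to solving the problem
# we wanted the loop to solve for us.

def benchmark_score(skill):
    """Measurable proxy for capability: how well the system does on
    tasks the world can grade (speed, accuracy, resources won)."""
    return skill  # stand-in: reality grades capability for free

def capability_climb(steps=1000):
    skill = 1.0
    for _ in range(steps):
        candidate = skill + random.uniform(-0.1, 0.2)
        if benchmark_score(candidate) > benchmark_score(skill):
            skill = candidate  # keep any measurable improvement
    return skill

print(round(capability_climb(), 1))  # climbs steadily from 1.0

# def ethics_climb(steps=1000):
#     ...  # would need true_ethics_score(candidate), which no one can yet code
```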

The ability to productively rewrite software and the ability to perfectly extrapolate humanity’s True Preferences are two different skills. (For example, humans have the former capacity, and not the latter. Most humans, given unlimited power, would be unintentionally Unfriendly.)

It’s true that a sufficiently advanced superintelligence should be able to acquire both abilities. But we don’t have them both, and a pre-FOOM self-improving AGI (‘seed’) need not have both. Being able to program good programmers is all that’s required for an intelligence explosion; but being a good programmer doesn’t imply that one is a superlative moral psychologist or moral philosopher.

So, once again, we run into the problem: The seed isn’t the superintelligence. If the programmers don’t know in mathematical detail what Friendly code would even look like, then the seed won’t be built to want to build toward the right code. And if the seed isn’t built to want to self-modify toward Friendliness, then the superintelligence it sprouts also won’t have that preference, even though — unlike the seed and its programmers — the superintelligence does have the domain-general ‘hit whatever target I want’ ability that makes Friendliness easy.

And that’s why some people are worried.