Possible Dangers of Unrestricted Value Learners

TL;DR: An AI that is learning human values may act unethically or be catastrophically dangerous, precisely because it does not yet understand those values.

The main idea is simple: a young AI that is trying to learn human values (which I will call a “value learner”) faces a “chicken and egg” problem. Such an AI must extract human values, but to do so safely, it should already know those values, or at least have some safety rules for the extraction process. This idea has been analyzed before (more on those analyses below); here, I will examine the different ways in which value learners may create trouble.

One might expect the process of a young AI learning human values to resemble a pleasant conversation or perfect observation, but it could easily take the form of a witch trial if appropriate constraints are not specified. Even a human child between the ages of 2 and 18 can do many stupid things simply because she does not yet understand all societal rules; some experimentation with those rules is nevertheless necessary to learn them, and a (good) school provides a safe sandbox for such active learning.

List of dangers

There are several possible types of failure for AI value learners:

1) Paperclipping. The value-learning process starts to have a very large impact on the world; for example, the AI builds too many computers for modelling values. A special case is a value-learning AI that never finishes its work: such an AI would have to take over the world and keep trying to pin down human values until the end of the universe. In any case, we expect that good value learning will take a reasonable amount of time, on the order of days to years, and limited resources, such as interaction with only a few people.

2) Human torture. A value learner may conduct unethical experiments on humans in order to extract information about values, perhaps including causing pain, performing brain surgery, and putting humans in unpleasant situations. Mindcrime is a special case: the AI may need to run an enormous number of human simulations, or at least upload a person, in order to understand his preferences.

3) Human value manipulation. The AI may manipulate human values in order to make them simpler or to gain some type of advantage. This may include wireheading, or pressuring humans into fake approval of the AI’s actions.

4) Wrong value extraction. The AI may make an error, perhaps based on wrong assumptions about what and how it should learn, and converge to an incorrect model of human values, which may be either completely wrong or wrong in subtle ways.

Obvious ideas of what could be done

What could be done to make value learners safer? (This is obviously not a comprehensive list.)

• AI capabilities should be limited (“artificial stupidity”).

• The AI should be equipped with the expected structure of human values and an expected model of those values (that is, we expect that a normal human does not want to kill babies).

• Good ways to extract human values should be whitelisted.

• Actions with bad consequences in the external world should be blacklisted.

• Corrigibility: the AI may be turned off or corrected.

• “Boxing” of the value learner, so that it learns only from previously recorded data. One way to create a safer value learner is to not allow it to actually manipulate humans, but to train it (at least in the beginning) on prerecorded data such as an ethical dataset or a description of a legal system.

• Value learner as an Oracle AI: it may request and receive only short data points from the outside world, which are needed to choose between different models. (A toy sketch of how several of these restrictions might be combined follows this list.)
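
The sketch below is purely illustrative: it shows how several of the restrictions above (a whitelist of data sources as “boxing”, a blacklist of world-affecting actions, a corrigibility check, a cap on Oracle-style queries, and a bounded resource budget) might wrap a single value-learning loop. All names, data formats, and thresholds are hypothetical, chosen only to make the idea concrete; this is not a proposal for a real agent architecture.

```python
# Hypothetical restrictions around a value-learning loop (illustrative only).
WHITELISTED_SOURCES = {"ethics_dataset", "legal_corpus", "recorded_dialogues"}
BLACKLISTED_ACTIONS = {"experiment_on_humans", "modify_human_values",
                       "acquire_more_compute"}
MAX_ORACLE_QUERIES = 10        # only a few short external data points
MAX_STEPS = 1_000              # bounded effort, against "paperclipping"


def run_restricted_learner(proposed_steps, prerecorded_data, shutdown_flag):
    """Filter the (hypothetical) learner's proposed steps through the safety layers.

    `proposed_steps` is whatever the learner wants to do next, e.g.
    {"action": "read", "source": "ethics_dataset"} or
    {"action": "ask", "query": "Is lying always wrong?"}.
    """
    value_model = []           # stand-in for the learned model of values
    queries_used = 0

    for step_no, step in enumerate(proposed_steps):
        if step_no >= MAX_STEPS or shutdown_flag():      # budget + corrigibility
            break
        if step["action"] in BLACKLISTED_ACTIONS:        # blacklist
            continue
        if step["action"] == "read" and step["source"] in WHITELISTED_SOURCES:
            value_model.append(prerecorded_data[step["source"]])   # boxed data
        elif step["action"] == "ask" and queries_used < MAX_ORACLE_QUERIES:
            queries_used += 1                            # Oracle-style query
            value_model.append(("pending_human_answer", step["query"]))

    return value_model


# Toy usage: one allowed read, one forbidden action, one short question.
steps = [
    {"action": "read", "source": "ethics_dataset"},
    {"action": "experiment_on_humans"},
    {"action": "ask", "query": "Did I understand that you want coffee in bed?"},
]
data = {"ethics_dataset": "prerecorded ethical judgements ..."}
print(run_restricted_learner(steps, data, shutdown_flag=lambda: False))
```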

This all starts to look like a typical naïve set of ideas about the creation of Friendly AI. Thus, the question arises: is the creation of a safe value learner an FAI-complete task?

In other words, is it possible to create safe value learners without first solving the full alignment problem, including the correct representation of human values? If it is not, then the hope that using AI to learn human values will make AI safety simpler is futile.

What others have written about the subject

Dangerous value learners are a special case of the “safe exploration problem” from Concrete Problems in AI Safety. Amodei, Christiano, and co-authors suggested several instruments for safe exploration: risk-sensitive performance criteria, the use of demonstrations of correct trajectories, learning in simulation, and human oversight; but most of these are better suited to a drone learning not to crash into the ground than to a superintelligence trying to learn human values.

A similar idea was explored in a post by J. Maxwell, in which he argues that a Seed AI will have difficulty understanding human values. He suggests: “The best approach may be to find an intelligence “sweet spot” that’s sufficient to understand human values without being dangerous. Intuitively, it seems plausible that such a sweet spot exists: Individual humans can learn unfamiliar values, such as those of animals they study, but individual humans aren’t intelligent enough to be dangerous.” He later suggests “ontology autogeneration” as a safer way to create a mind model.

Soares also wrote that an AI should learn values from data: “smarter-than-human AI systems that can inductively learn what to value from labeled training data”, and refine its conclusions through questions. However, many proposals, such as CIRL, may assume more direct interaction between the AI and humans, such as observing actual behaviour or engaging in active debate.

The distill-amplify approach assumes the gradual refinement of an already semi-accurate model of the human value system. A general understanding of mammalian values (Sarma) may be a starting point for this process.

In a post entitled “Cake, or death!”, Armstrong described a model of an AI that may be uninterested in refining its model of human values.

Commercial home robots and self-driving cars will soon appear, and they will have some capability to act ethically in the outside world (partly hand-coded, partly trained on existing datasets, as in the case of self-driving cars); such robots could provide initial human value models for the training of more advanced AI.

On the other hand, some unrestricted learners that use a purely mathematical model of the human reward function (Sezener) may act unethically in the early stages of learning.

Failed value learners may be the most dangerous type of AI: smaller AIs will be weaker, while successful learners will be safe (for humans). Value learners, by definition, will have at least human-level capabilities but will not yet be aligned.

Deeper classification of potentially dangerous value learners

First, we can consider that there are three types of human values (the types are not sharply separated in reality, but the classification is useful):

1) Basic human needs (which could also be called “fundamental human rights”). These include survival, escaping pain, freedom, housing, and healthcare. Roughly, it is everything protected by criminal law (with some caveats: some criminal laws punish people for things a national state wants from them, as with drug or treason laws). The list may not be complete, as there may be unknown basic needs, such as “not being replaced with a p-zombie”, avoiding x-risks and s-risks, or “not being stuck in eternal boredom”. Some advanced Oracle AI (perhaps consisting of the best humans) could be used to enumerate all possible basic needs. Basic needs form the basis for safety: something is not safe if it destroys or prevents the fulfilment of a basic human need. In other words, a full listing of basic human needs is almost equivalent to the list of value requirements that an AI will need to respect in order to be safe for humans.

2) Personal human preferences. These are things like “I like collecting 17th-century coins”: well-defined and stable personal preferences, but neglecting them is not an existential catastrophe (for a typical human being).

3) One-time wishes. These are minute-long wishes that disappear immediately after they are fulfilled, for example, “I want coffee”.

Obviously, these types of values correspond to different types of value learners, as different learners will learn different types of values:

a) The first type is the zero-knowledge, first-day value learner, which has no ideas about the outside world or humans but tries to learn about them. This type is the most dangerous, as without any constraints a powerful AI may act in the most unethical ways.

b) The second is the personal preference learner, which already knows basic human needs. It is generally safe but still has to learn my personal preferences. This is like a robot that I bring home from a shop: it has a built-in model of basic human needs and could learn my personal values through a pleasant conversation after unpacking (as in the movie “Her”).

c) The third type should guess what exactly I mean from a single verbal command. The nature of wishes is that a person actually knows that he has a wish: he feels the desire but may have difficulty articulating it correctly. (The same person may not feel his basic needs until he loses something important.) This type of learner already knows basic human needs and my personal preferences, which provides the context for guessing what I meant; it could check its understanding by asking, “Did I correctly understand that you want coffee in bed?”, perhaps illustrating with images how it would pour the coffee in bed (into a cup, or onto the sheets).
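
To make this taxonomy concrete, here is a small, purely illustrative encoding of it in Python. The enum names and the “knows”/“learning” mapping are my own labels for the categories above, not part of any existing framework.

```python
# Illustrative encoding of the value types and learner types described above.
from enum import Enum, auto


class ValueType(Enum):
    BASIC_NEEDS = auto()           # survival, escaping pain, freedom, ...
    PERSONAL_PREFERENCES = auto()  # "I like collecting 17th-century coins"
    ONE_TIME_WISHES = auto()       # "I want coffee"


class LearnerType(Enum):
    ZERO_KNOWLEDGE = auto()        # (a) first-day learner, most dangerous
    PREFERENCE_LEARNER = auto()    # (b) home robot that knows basic needs
    WISH_GUESSER = auto()          # (c) knows needs and preferences already


# What each learner type already knows vs. what it is still trying to learn.
KNOWS = {
    LearnerType.ZERO_KNOWLEDGE: set(),
    LearnerType.PREFERENCE_LEARNER: {ValueType.BASIC_NEEDS},
    LearnerType.WISH_GUESSER: {ValueType.BASIC_NEEDS,
                               ValueType.PERSONAL_PREFERENCES},
}

LEARNING = {
    LearnerType.ZERO_KNOWLEDGE: ValueType.BASIC_NEEDS,
    LearnerType.PREFERENCE_LEARNER: ValueType.PERSONAL_PREFERENCES,
    LearnerType.WISH_GUESSER: ValueType.ONE_TIME_WISHES,
}


def is_relatively_safe(learner: LearnerType) -> bool:
    """Rough heuristic from the post: a learner that already knows basic
    human needs has the main safety constraints in place."""
    return ValueType.BASIC_NEEDS in KNOWS[learner]


print(is_relatively_safe(LearnerType.ZERO_KNOWLEDGE))   # False
print(is_relatively_safe(LearnerType.WISH_GUESSER))     # True
```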

Below, I present a taxonomy of potentially dangerous and safe value learners, and what could go wrong if an AI of a certain type tries to learn a certain type of human preferences:
