AI Alignment Problem: “Human Values” don’t Actually Exist

Previous posts in the series: “What AI Safety Researchers Have Written About the Nature of Human Values”, “Possible Dangers of the Unrestricted Value Learners”. Next planned post: AI safety approaches which don’t use the idea of human values.

Summary: The main current approach to AI safety is AI alignment, that is, the creation of AI whose preferences are aligned with “human values.” Many AI safety researchers agree that the idea of “human values” as a constant, ordered set of preferences is at least incomplete. However, the idea that “humans have values” underlies a lot of thinking in the field; it appears again and again, sometimes popping up as an uncritically accepted truth. Thus, it deserves a thorough deconstruction, which I will perform by comprehensively listing and analyzing the hidden assumptions on which the idea that “humans have values” is built. This deconstruction will be centered around the following ideas: “human values” are useful descriptions, but not real objects; “human values” are bad predictors of behavior; the idea of a “human value system” has flaws; “human values” are not good by default; and human values cannot be separated from human minds. I recommend that the idea of “human values” either be replaced with something better for the goal of AI safety, or at least be used very cautiously. Approaches to AI safety which don’t use the idea of human values at all, like the use of full brain models, boxing, and capability limiting, may deserve more attention.


The idea of an AI which learns human values is at the core of the current approach to artificial general intelligence (AGI) safety. However, it is based on several assumptions about the nature of human values, including the assumptions that they completely describe human motivation, are non-contradictory, are normatively good, etc. Actual data from psychology paints a rather different picture.

A rather large literature analysis of what other AGI safety researchers have written about the nature of human values is presented in another of my texts: “What AGI Safety Researchers Have Written About the Nature of Human Values”. A historical overview of the evolution of the idea of human values can be found in Clawson and Vinson, “Human values: a historical and interdisciplinary analysis” (1978). A list of ideas for achieving AGI safety without the idea of human values will also be published separately.

In Section 1 the ontological status of human values is explored. In Section 2 the idea of human values as an ordered set of preferences is criticized. Section 3 explores whether the idea of human values is useful to AGI safety.

1. Ontological status and sources of human values

1.1. AI alignment requires an actually existing, stable, finite set of predictive data about people’s motivation, which is called “human values”

In an AI alignment framework, a future advanced AI will learn human values, so we don’t need to directly specify human preferences; we just need to create an AI capable of learning them. (From a safety point of view, there is a circularity problem here, as such an AI needs to be safe before it starts to learn human values, or it could do so in unsafe and unethical ways, as I describe in detail in “Possible Dangers of the Unrestricted Value Learners”; but let’s assume for now that this is somehow bypassed, perhaps via a set of preliminary safety measures.)

The idea of AI alignment is based on the idea that there is a finite, stable set of data about a person which can be used to predict that person’s choices, and which is actually morally good. The reasoning is that if this is not true, then learning is impossible, useless, or will not converge.

The idea of value learning assumes that while human values are complex, they are much simpler than the information needed for whole brain emulation. Otherwise, full brain emulation would be the best predictive method.

Moreover, the idea of AI alignment suggests that this information can be learned if correct learning procedures are found. (No procedures = no alignment.)

This actually existing, axiologically good, stable, finite set of predictive data about people’s motivation is often called “human values,” and it is assumed that an AI alignment procedure will be able to learn this data. This view of the nature of human values from an AI alignment point of view is rather vague: it doesn’t say what human values are, nor does it show how they are implemented in the human mind. This view doesn’t depend on any psychological theory of the human mind; as a pure abstraction, it could be applied to any agent whose motivational structure we want to learn.

Before the values of a person can be learned, they have to become “human.” That is, they need to be combined with some theory about how values are encoded in the human brain. AI safety researchers have suggested many such theories, and the existing psychological literature suggests even more theories about the nature of human motivation.

In psychology, there is also a “theory of human values”: a set of general motivational preferences which influence choices. This theory should be distinguished from “human values” as the expected output of an AI alignment procedure. For example, some psychological tests may say that Mary’s values are freedom, kindness, and art. However, the output of an AI alignment procedure could be completely different, and not even presented in words but as some set of equations about her reward function. To distinguish human values as they are expected in AI alignment from human values as a part of psychology, we will call the first “human-values-for-AI-alignment.”

The main intuition behind the idea of human values is that, in many cases, we can predict another person’s behavior if we know what he wants. For example: “I want to drink,” or “John wants to earn more money,” often clearly translates into agent-like behavior.

As a result, the idea behind AI alignment can be reconstructed as follows: if we have a correct theory of human motivation a priori, and knowledge about a human’s claims and choices as a posteriori data, we could use something like Bayesian logic to reconstruct that person’s actual preferences.
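This Bayesian reconstruction can be illustrated with a minimal sketch. Everything below is invented for illustration: a toy hypothesis space of two candidate “values” (preferred fruits), and a deliberately crude noisy-choice model standing in for the “correct theory of human motivation.” It is not any published alignment procedure.

```python
# Toy Bayesian value inference: a hypothetical sketch, not a real alignment
# procedure. The a priori "theory of motivation" is the noisy-choice model;
# the a posteriori data is the list of observed choices.

def infer_values(observed_choices, hypotheses, noise=0.2):
    """Return P(hypothesis | choices), assuming the agent picks the option
    its hypothesized value prefers with probability 1 - noise."""
    posterior = {h: 1.0 / len(hypotheses) for h in hypotheses}
    for choice in observed_choices:
        # Bayes update: multiply prior by likelihood, then renormalize.
        for h in hypotheses:
            posterior[h] *= (1 - noise) if choice == h else noise
        total = sum(posterior.values())
        posterior = {h: p / total for h, p in posterior.items()}
    return posterior

choices = ["apple", "apple", "orange", "apple"]
posterior = infer_values(choices, ["apple", "orange"])
```

Even in this toy setting the answer is only probabilistic: one atypical orange choice lowers, but does not eliminate, the inferred preference for apples. All the criticisms below apply before such a procedure even starts, since they concern whether a stable hypothesis space of “true values” exists at all.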

To have this a priori knowledge, we need to know the internal structure of human values and how they are encoded in the human brain. Several oversimplified theories about the structure of human values have been suggested: as a human reward function, as a reward-concept association, as a combination of liking-approving-wanting, etc.

However, all of these are based on the assumption that human values exist at all: that human motivation can be compressed unequivocally into one, and only one, simple stable model. This assumption, and others, appear even before the “correct psychological theory” of human values is chosen. In this article, these assumptions will be analyzed.

1.2. Human values do not actually exist; they are only useful descriptions of human behavior and rationalization

The idea that human behavior is determined by human values is now deeply incorporated into people’s understanding of the world and is rarely subject to reservations. However, the ontological status of human values is uncertain: do they actually exist, or are they just a useful way of describing human behavior? The idea of human-values-for-AI-alignment requires that some predictive set of data about motivation actually exists. If it is only a description, of which there could be multiple in various situations, extrapolating such a description will create problems.

In other words, descriptions are observer-dependent, while actually existing things are observer-independent. If we program an agent with utility function U, this utility function exists independently of any observations and could be unequivocally learned by some procedures. If we instead have a process with agent-like features, it could be described differently depending on how complex a model of the process we want to create.

For example, if we have a mountain with several summits, we could describe it as one mountain, two mountains, or three mountains, depending on the resolution of our model.

In the discussion of the ontological status of human values we encounter the very long-standing philosophical problem of the reality of universals, that is, of high-level abstract ideas. The medieval dispute between realists, who thought that universals are real, and nominalists, who thought that only singular things are real, was won by the nominalists. From history, we know that “human values” is a relatively new construction which appears only in some psychological theories of motivation.

However, we can’t say that “human values” are just random descriptions which can be chosen completely arbitrarily, because there are some natural levels where the description matches reality. In the case of values, it is the level of one’s claims about his/her preferences. While these claims may not perfectly match any deeper reality of one’s values, they exist unequivocally, at least in the given moment. The main uncertainty about human values is whether some deeper level which generates such claims, the level of “true values,” unequivocally exists.

Human preferences are relatively easy to measure (unlike, say, libido). One could ask a person about his/her preferences, and s/he will write that s/he prefers apples over oranges, Democrats over Republicans, etc. Such answers can be statistically consistent in some groups, which allows the prediction of future answers. But it is often assumed that real human values are something different from explicit preferences: human values are assumed to be able to generate preferential statements, but not to be equal to them.

One could also measure human behavior in such choice experiments, checking whether the person actually prefers oranges over apples, and also get consistent results. The obvious problem with such preferential stability is that it is typically measured for psychologically stable people, in a stable society, and in a stable situation (a controlled experiment). The resulting stability is still statistical: one who likes apples may sometimes choose an orange, but this atypical choice may be disregarded as noise in the data.

Experiments which deliberately disrupt situational stability consistently show that human preferences play a small role in actual human behavior. For example, changes in social pressure result in consistent changes in behavior, thus contradicting declared and observed values. The most famous example is the Stanford Prison Experiment, where students quickly took on abusive roles.

The only way for human values to actually exist would be if we could pinpoint some region of the human brain where they are explicitly presented as rules. However, only very simple behavioral patterns, like the swimming reflex, may actually be genetically hardcoded in the brain; all others are socially defined.

So, there are two main interpretations of the idea of “human values”:

1) Values actually exist, and each human makes choices based on his/her own values. There is one stable source of human claims, actions, emotions, and measurable preferences, which completely defines them, is located somewhere in the brain, and could be unequivocally measured.

2) Values are useful descriptions. Humans make choices under the influence of many inputs, including situation, learned behavior, mood, unconscious desires, and randomness, and to simplify the description of the situation we use the designation “human values.” More detail on this topic can be found in Ross and Nisbett’s book “The Person and the Situation: Perspectives of Social Psychology.”

Humans have a surprisingly big problem when they are asked about their ultimate goals: they just don’t know them! They may create, ad hoc, some socially acceptable list of preferences, like family, friendship, etc., but this will be a poor predictor of their actual behavior.

It is surprising that most humans can live successful lives without explicitly knowing and using a list of their goals and preferences. In contrast, a person can generally identify his/her current wishes, a skill obviously necessary for survival; for example, s/he can recognize thirst and the desire for water.

1.3. Five sources of information about human values: verbalizations, thoughts, emotions, behavior and neurological scans

There are several different ways one could learn about someone’s values. There is an underlying assumption that all of these ways converge on the same values, which appears to be false upon closer examination. The main information channels for learning someone’s preferences are:

  1. Verbal claims. This is what a person says about his/her preferences. Such claims tend to present the person as better according to expected social norms. Armstrong suggested examining facial expressions when a person lies about his/her true values in order to deduce his/her real values, perhaps by training some AGI to do it. He based this suggestion on the interesting idea that “humans have a self-model of their own values.” However, it appears that most humans either live without such a model, or their model is a rationalization made to look good. Such claims can have different subtypes: what a person says to friends, writes in books, etc. Written claims may be more consistent and socially appropriate, as they are better thought out. Claims to close friends may be more oriented toward short-term effect, manipulation, and dependence on the social situation. At the same time, claims to friends could also be more sincere, as they are subjected to less internal censorship. Similarly, claims made under drugs, especially alcohol, could be even less censored, but might not present “true values”: they could present some suppressed “counter-values,” such as the use of an obscene lexicon or some memetically replicated social cliché, like “I hate all members of social group X.”

  2. Internal thought claims: the private thoughts which appear in internal dialog or planning. People may be more honest in their thoughts. However, many people lie to themselves about their own values, or are just unable to fully articulate the complexity of their values.

  3. Behavior. What a person actually does could represent the sum of all his/her desires, trained models of behavior, random actions, etc. Contradicting values could result in zero behavior, as in the case when one wants to buy a dress but is afraid to spend too much money on it. Behavior can also take different forms: choices between two alternatives, which one might signal in many ways; verbal behavior other than statements of one’s own values; and chains of physical actions (e.g. dancing).

  4. Expression of emotions. Human values could be reconstructed based on emotional reactions to stimuli. A person could prefer to look at some images longer, feel arousal, smile, etc. However, this way of learning values would overestimate suppressed emotions and underestimate rational preferences. For example, a pedophile may become aroused by some types of images, but on the rational level s/he may fight this type of emotion. Emotions can be presented to the outside in many ways: by facial expressions, tone of voice and content of speech, pose, and even body odor. A person could also suppress the expression of emotions, or fake them.

  5. Non-behavioral, neurophysiological representations of values. Most of these are currently unavailable to outside observers, but brain waves, neurotransmitter concentrations, single-neuron activations, and some connectome connections could be directly or indirectly used to gather information about one’s values. An AGI with advanced nanotechnology may have full access to the internal states of one’s brain.
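The non-convergence claim above can be made concrete with a small sketch. The channel readouts below are entirely invented for illustration; the point is only that a value learner combining these five channels has no guarantee of finding a common answer.

```python
# Hypothetical readouts of one person's "values" from the five channels
# listed above. All channel names and value sets are invented for
# illustration; no claim is made about any real measurement method.
channels = {
    "verbal_claims":  {"kindness", "family", "art"},
    "inner_thoughts": {"status", "family"},
    "behavior":       {"status", "comfort"},
    "emotions":       {"novelty", "comfort"},
    "neural_scan":    {"novelty", "status"},
}

def channels_converge(readouts):
    """True only if every channel reports exactly the same value set."""
    sets = list(readouts.values())
    return all(s == sets[0] for s in sets)

# Values supported by every channel simultaneously:
consensus = set.intersection(*channels.values())
```

In this invented example the channels do not converge, and the set of values supported by all five channels at once is empty, so a learner must make a non-obvious choice about which channel to privilege.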

1.4. Where do human values come from: evolution, culture, ideologies, logic and personal events

If one has something (e.g. a car), it is assumed that one made an act of choice by either buying it or at least keeping it in one’s possession. However, this description is not applicable to values, as one does not make a choice to have a value, but instead makes choices based on one’s values. Alternatively, if one says that a person makes a choice to hold some value (and this choice was not based on any other values), one assumes the existence of something like “free will,” which is an even more speculative and problematic concept than values [ref]. Obviously, some instrumental values could be derived from terminal values, but that is more like planning than generation of values.

If one could define the “source” of human values, it would simplify value learning, as one could derive values directly from the source. There are several views about the genesis of human values:

1) God gives values as rules.

2) “Free will”: some enigmatic ability to create values and choices out of nothing.

3) Genes and evolution: Values are encoded in human genes in the form of some basic drives, and these drives appeared as a result of an evolutionary process.

4) Culture and education: Values are embedded in social structure and learned. There are several subvariants regarding the source, e.g. language, religion, parents, the social web, social class (Marx), books one reads, or memes which are currently affecting the person.

5) Significant personal events: These could be trauma or intense pleasure in childhood, e.g. “birth trauma,” or first love in school.

6) Logical values: A set of ideas that a rational mind could use to define values based on some first principles, e.g. Kant’s imperative [ref].

7) Random process: Some internal random process results in choosing the main priorities, probably in childhood [ref].

God and free will are outside of rational discussion. However, all the other ideas have some merit: these five factors could each affect the genesis of human values, and it is not easy to choose one which dominates.

2. Critique of the idea of human values as a constant set of personal preferences: it is based on many assumptions

2.1. Human preferences are not constant

Personal values evolve from childhood to adulthood. They also change when a person becomes a member of another social group, because of the new and different role, exposure to different peer pressure, and a different ideology.

Moreover, it is likely that we have a meta-value about evolving values: that it is good that someone’s values change with age. If a person at 30 continues to play with the same toys he played with at 3 years old, it may be a signal of developmental abnormalities.

Another way to describe human preferences is not as “values” but as “wishes.” The main difference is that “values” are assumed to be constant, while wishes are assumed to be constantly changing and even chaotic in nature. Also, a wish typically disappears when granted: if I wish for some water and then get some, I will not want any more water for the next few hours. Wishes are also more instrumental and often represent physiological needs or basic drives.

2.2. Human choices are not defined by human values

The statement that “humans have values” assumes that these values are the most important factor in predicting human behavior. For example, if we know that a chess AI’s terminal goal is to win at chess, we can assume that it will try to win at chess. But in the human case, knowing someone’s values may have surprisingly little predictive power over that person’s actions.

In this subsection, we will look at different situations in which human choices are not defined by (declared) human values, but are affected by some other factors.

1. Situation

The idea of “human values” implies that a person acts according to his/her values. This is the central idea of all value theory, because it assumes that if we know choices, we can reconstruct values, and if we know values, we can presumably reconstruct the behavior of the person.

There is also another underlying assumption: that the relation between behavior and values is unequivocal, that is, given a set of behaviors B, we could reconstruct one and only one set of values V which defines it. But this doesn’t work even from a mathematical point of view, as for any finite B there exist infinitely many programs which could create it. Thus, for a universal agent, similar behavior could be created by very different values. Armstrong wrote about this, stating that the behavior of an agent depends not only on values, but also on policy, which, in turn, depends on one’s biases, limits of intelligence, and available knowledge.
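A minimal sketch of this unidentifiability, with invented agents: two agents whose values differ, but whose policies, given different beliefs, produce the same observable behavior, so behavior alone cannot recover a unique value set.

```python
# Two hypothetical agents with different values but identical behavior.
# Agent A values money and believes working produces money.
# Agent B values leisure but falsely believes work must always come first.
# An observer who sees only behavior cannot tell their values apart.

def agent_a(situation):
    return "work"   # policy derived from valuing money

def agent_b(situation):
    return "work"   # the same policy, derived from valuing leisure

situations = ["weekday", "weekend", "holiday"]
trace_a = [agent_a(s) for s in situations]
trace_b = [agent_b(s) for s in situations]
```

Since the two behavioral traces are identical, any value-learning procedure fed only this data must break the tie with extra assumptions about the agent’s beliefs and biases, which is exactly Armstrong’s point about policy.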

However, in the human case, the main problem is not that human beings pretend to have one set of values while actually holding another; typically, only con artists and psychopaths lie about their actual intentions. The problem is that human behavior is not defined by human values at all, as demonstrated in numerous psychological experiments. A great description of these results can be found in Ross and Nisbett’s book “The Person and the Situation: Perspectives of Social Psychology”.

In a 1973 experiment, Darley and Batson checked whether a person would help a man lying in their path. They examined a group of theological seminary students who were preparing to deliver their first sermon. Among subjects who hurried, being afraid of being late for the sermon, about 10% provided assistance; among those who did not hurry, having enough time before it began, the share of students who came to help increased to 63%.

Ross and Nisbett wrote that the maximum attainable level of prediction of a person’s behavior in a new situation, based either on their personal traits or on statistics regarding their previous behavior, has a correlation coefficient of about 0.3.

2. Internal conflicts

Another important conception described in Ross and Nisbett’s “The Person and the Situation” is that stable behavior can be underpinned by conflicting attitudes, where different forces balance each other. For example, a person wants to have unlimited access to sex, but is also afraid of the social repercussions and costs of such a desire, and thus uses porn. This may be interpreted as if he wishes for, or values, using porn, but that is not so: porn is only a compromise between two forces, and such a balance is rather fragile and could have unpredictable consequences if the person is placed in a different situation. These ideas were explored by Festinger (Ross, p. 29).

3. Emotional affect

It is known that many crimes occur under intense and unexpected emotional affect, for example “road rage,” or murders committed out of jealousy. These emotions are intense reactions of our “atavistic animal brain” to the situation. Such situations may be insignificant in the broader context of contemporary civilization, but intense emotions can override our rational judgments and almost take control over our actions.

Note that there is still no consensus about the nature of emotion in the psychological literature, though there is a rational model of emotions as accelerators of learning via increased appreciation of the situation (as mentioned in Sotala’s article about values).

[Umbrello comment: “This is the exact point that Johnson (in the above comment) argues against, the enlightenment era idea of the separation of psychological faculties (i.e., reason vs. imagination). We have to be careful to not fall within this dichotomy since it is not clear what the boundaries of these different states of mind are.”]

4. Peer pressure

Experiments conducted by Asch and Milgram demonstrated that peer pressure can cause people to act against what they perceive or value. Zimbardo’s Stanford Prison Experiment also demonstrated how peer pressure affects people’s behavior and even their beliefs.

5. Random processes in the brain

Some human actions are just random. Neurons can fire randomly, and many random factors affect mood. This randomness creates noise in experiments, and we usually try to clean the data of this noise. Alternatively, we can hide the randomness of behavior inside probabilistic predictions about behavior.

Humans can randomly forget or remember things, including their wishes. In other words, declared values can randomly drift.

6. Conditional and unconditional reflexes

Some forms of behavior are hardwired in the human brain, even in the primitive hindbrain, and are thus independent of any human values: unconditional reflexes, e.g. the swimming reflex, the fight-or-flight response, etc.

There are also conditional reflexes; i.e., it is possible to train a person to present reaction B when stimulus A is given. Such a trained reflex does not present any information about the person’s values. But some desires can be triggered intentionally, an approach which is intensively used in advertising: a person seeing Coca-Cola may start to feel a desire to drink soda. Similarly, a person hearing a loud bang may have a panic attack if he has PTSD.

7. Somnambulism, the bisected brain, and actions under hypnosis

It is well known that humans are capable of performing complex behaviors while completely unconscious. The best example is somnambulism, or sleepwalking. Some people are able to perform complex behavior in that state, even driving a car or committing a murder, without any memories of the event (in this way it differs from actions in dreams, where at least some form of control exists). Surely, a person’s actions in that state could not be used to extrapolate the person’s preferences.

While somnambulism is an extreme case, many human actions occur mechanically, that is, outside of any conscious control, including driving a car and the compulsive behavior of addicts.

Experiments (questionable, as is often the case in psychology) have also demonstrated that humans whose brain hemispheres were separated have two different “consciousnesses” with different preferences (though these results have recently been challenged) [ref].

Another extreme case is hypnosis, where a human is conditioned to act according to another person’s will, sometimes even without knowing it. While extreme cases of hypnosis are rare and speculative, the effectiveness of TV propaganda in “brainwashing” demonstrates that some form of suggestion is real and plays an important role in mass behavior. For example, Putin’s autocracy has invested a lot to gain control over TV, and most TV viewers in Russia support Putin’s politics.

8. Actions under the influence of drugs; demented people and children

Some drugs which are part of human culture and value systems, notably alcohol, are known to change behavior and presented values, mostly because self-control is lowered and suppressed instinctive drives become active. The policies used to achieve goals also become less rational under drugs. It also seems that alcohol and other drugs increase internal misalignment between different subpersonalities.

While a person is legally responsible for what he does under the influence of drugs, his presentation of his values changes: some hidden or suppressed values may become openly expressed (in vino veritas). Even some cars’ AI can recognize that a person is drunk and prevent him from driving.

For a purely theoretical AGI this may be a difficulty, as it is not obvious why sober people are somehow more “value privileged” than drunk people. Why, then, should the AGI ignore this large class of people and their values?

Obviously, “drunk people” is not the only such class. Small children, patients in mental hospitals, people with dementia, dream characters, victims of totalitarian brainwashing, etc.: all of these and many more can be regarded as classes of people whose values should be ignored, which could in the end become a basis for some form of discrimination.

Presented values also depend on the time of day and on physiological conditions. If a person is tired, ill, or sleepy, this can affect his/her values-centered behavior.

An extreme case of “brainwashing” is feral children raised by animals: most of their values also should not be regarded as “human values.”

2.3. “Hu­man val­ues” can’t be eas­ily sep­a­rated from biases

The prob­lem of the in­con­sis­tency of hu­man be­hav­ior was well known to the founders of the ra­tio­nal­ists and the AGI safety move­ment, who de­scribed it via the idea of bi­ases. It seems that hu­mans, ac­cord­ing to ra­tio­nal­ists un­der­stand­ing, have a con­stant set of val­ues. How­ever, hu­mans act ir­ra­tionally based on this set of val­ues be­cause they are af­fected by nu­mer­ous cog­ni­tive bi­ases. By ap­ply­ing differ­ent ra­tio­nal­ist train­ing and de­bi­as­ing to a per­son, we could pre­sum­ably cre­ate a “ra­tio­nal per­son” who will act con­sis­tently and ra­tio­nally and will effec­tively reach his-her own pos­i­tive val­ues. The prob­lem is that such model of purely ra­tio­nal per­son act­ing on the set of co­her­ent al­tru­is­tic val­ues is com­pletely non-hu­man.

[Um­brello com­ment: Heuris­tic tools can be used to de-bias AGI de­sign. I ar­gued this in a pa­per, and showed a way in which it can be done. See Um­brello, S. (2018) ‘The moral psy­chol­ogy of value sen­si­tive de­sign: the method­olog­i­cal is­sues of moral in­tu­itions for re­spon­si­ble in­no­va­tion’, Jour­nal of Re­spon­si­ble In­no­va­tion. Tay­lor & Fran­cis, 5(2), pp. 186–200. doi: 10.1080/​23299460.2018.1457401.]

Another problem is that many humans have serious psychiatric disorders, including schizophrenia, obsessive-compulsive disorder, mania, and others, which significantly affect their value structure. While extreme cases can be easily recognized, weaker forms may be part of the “psychopathology of everyday life”, and thus part of “human nature”. We don’t know if a truly healthy human mind exists at all.

Armstrong suggested not separating biases from preferences, as an AGI would find easy ways to overcome the biases. But the AGI could, in the same way, find ways to overcome the preferences.

2.4. Hu­man val­ues are sub­ject-cen­tered and can’t be sep­a­rated from the person

In the idea “humans have values,” the verb “have” assumes a type of relation in which the two could be separated. This implies some form of orthogonality between the human mind and human values, as well as a strict border between the mind and the values. For example, if I have an mp3 file, I can delete the file; in that case, the statement “I don’t have the file” will be factually true. I can give this file to another person, and then I can say: “That person now has the file”. But human values can’t be transferred in the same way as a file, for two reasons: they are subject-centered, and there is no strict border between values and the other parts of the mind.

Most human values are centered around a particular person (with the exception of some artificially constructed, purely altruistic values, as when someone wants to reduce the amount of suffering in the world while completely ignoring who is suffering: humans, animals, etc.) One may argue that non-subject values are better, but this is not how human values work. For example, a person attaches value not to tasty food as such, but to the fact that he will consume such food in the future. If tasty food exists without the potential that one could consume it, we can’t say that it has value.

From this, it follows that if we copy human values into an AGI, the AGI should prefer the same states of the world, but not hold the same subject-centered preferences. For example, we don’t want to copy into the AGI a desire to have sex with humans, but we do want the AGI to help its owner in his/her reproductive success. However, instrumental goals like self-preservation will still be AGI-centered.

The subject of a value is more important than the value itself: if a typical human A has some value X, there is surely someone else on Earth who is already getting X, but that doesn’t matter to A. However, if the same person A gets another valuable thing Y, it is still good for him. Attempts to properly define the subject quickly evolve into the problem of personal identity, which is notoriously difficult and known to be paradoxical. That problem is also much more difficult to verbalize: a person may correctly say what he wants, but fail to provide a definition of who he is.

Obviously, there is no easy way to separate values from all the underlying facts, neuronal mechanisms, biases and policies – more on that in the next subsection. Similar problems are discussed in Joar Skalse’s post “Two agents can have the same source code and optimise different utility functions.”

Human preferences are self-centered, but if an AGI takes human preferences as its own, they will not be AGI-centered; they will be preferences about the state of the world, which makes them closer to external rules. In other words, a preference about the well-being of something outside oneself is an obligation and a burden, and an AGI will search for ways to overcome such preferences.

2.5. Many hu­man val­ues are not ac­tion­able + the hid­den com­plex­ity of values

If someone says “I like poetry”, it is a clear representation of his/her declarative values, but it does little to predict what he actually does. Does he write poems every day for an hour, and if so, of which type? Or does he read for two hours every week – and what does he read: Homer, Byron, or his girlfriend’s poems? Will he attend a poetry slam?

This could be called the “hidden complexity of values,” but if we start to untangle that complexity, there will be no definite border between values and everything else in the person’s mind. In other words, short textual representations of values are not actionable, and if we try to make a full representation, we will end up reproducing the entire brain.

In Yudkowsky’s example of the complexity of values – removing one’s aged mother from a burning house – the complexity comes from many common-sense details which are not included in the word “removing”.

2.6. Open ques­tion: is there any re­la­tion be­tween val­ues, con­scious­ness and qualia?

In some models where preferences dictate choices, there is no need for consciousness. However, many preferences are framed as preferences about future subjective experiences, like pain or pleasure.

There are at least three meanings of the idea of “consciousness” and three corresponding questions:

a) Consciousness is what I know and can speak about – Should we care about unconscious values?

b) Consciousness is what I feel as pure subjective experience, qualia – Should we solve the problem of qualia in order to correctly represent human preferences about subjective experiences?

c) Consciousness is my reflection about myself, and only the values which I declare to be mine should be counted – True or not?

Re­lated: G. Wor­ley on philo­soph­i­cal con­ser­vatism: “Philo­soph­i­cal Con­ser­vatism in AI Align­ment Re­search” and “Meta-eth­i­cal un­cer­tainty in AGI al­ign­ment,” where he dis­cusses the prob­lems with meta-ethics and the non-ex­is­tence of moral facts. See also the post of So­tala about con­scious­ness and the brain.

2.7 Hu­man val­ues as a tran­sient phe­nomenon: my val­ues are not mine

Human values are assumed to be a stable but hidden source of human choices, preferences, emotions and claims about values. However, human values – even assuming that such a source of all motivation really exists – are constantly changing on a day-to-day basis, as a person is affected by advertising, new books, new friends, changes in hormone levels, and mood.

Interestingly, personal identity is more stable than human values. A person remains the same in his/her own eyes, as well as in the eyes of other people, despite significant changes of values and preferences.

3. The idea of “human values” may not be as useful a concept for AGI safety as it looks

3.1. Hu­man val­ues are not safe if scaled, ex­tracted from a hu­man or separated

Many hu­man val­ues evolved in the mi­lieu of strong sup­pres­sion from so­ciety, limited availa­bil­ity of needed re­sources, limits on the abil­ity to con­sume re­sources, and pres­sure from other val­ues, and thus don’t scale safely if they are taken alone, with­out their ex­ter­nal con­straints.

A possible example of the problem from the animal kingdom: if a fox gets into a henhouse, it will kill all the chickens, because it hasn’t evolved a “stop mechanism”. In the same way, a human may like tasty food, but rely on internal bodily regulation to decide when to stop – which does not always work.

If one goal or value dominates over all other values in one’s mind, it becomes “paperclippy” and turns a person into a dangerous maniac. Examples include sexual deviants, hoarders, and money-obsessed corporate managers. In contrast, some values balance one another, like the desire for consumption and the desire to maintain a small ecological footprint. If they are separated, the consumption desire will tile the universe with “orgasmium,” and the “ecological desire” will end in an attempt to stop existing.

The point here is that values without humans are dangerous. In other words, if I want to get as much X as possible, getting 1000X may not be what I want – though my expressed desire could convert my AGI into a paperclipper.

In the idea that “humans have values” it is intrinsically assumed that these values are a) good and b) safe. A similar idea has been explored in a post by Wei Dai, “Three AGI safety related ideas.”

Historically, “human values” were not regarded as something good. Humanity regarded itself as suffering from “original sin” and afflicted by all possible dangerous drives: lust, greed, etc. The philosophers of the past saw no worth in human values, which is why they tried to create morals or sets of laws which would be much better than inborn human values. In that picture, the state or religion provided the correct set of norms, and human nature was merely a source of sin.

If we take rich young people at the beginning of the 21st century, we may see that they are in general not as “sinister” as humans in the past, and that they sincerely support all kinds of nice things. However, humanity’s sadistic nature is still here; we just use socially accepted ways to realize our “desire to kill,” such as watching “Game of Thrones” or playing “World of Tanks”. If an AGI extrapolated our values based on our preferences in games, we could find ourselves in a nightmarish world.

There are also completely “inhuman” ideologies and cultural traditions. The first is obviously German National Socialism; another is the ancient Maya culture, in which the upper classes constantly ate human flesh. Other examples are religious groups practicing collective suicide, ISIS, and terrorists. Notably, transhumanist thought holds that to be a human means to want to overcome human limitations, including innate values.

An AGI which learns human values will not be intrinsically safer than an AGI with hard-coded rules. We may want to simplify AGI alignment by avoiding hand-coded rules and giving the AGI authority to extract our goals and extrapolate them. But there is no actual simplification: we still have to hand-code a theory of human values and the ways to extract and extrapolate them. This creates large uncertainty, which is no better than rule coding. Naturally, problems arise regarding the interaction of the AGI with “human values”: for example, if a person wants to commit suicide, should the AGI help him?

We don’t need AGI al­ign­ment for all pos­si­ble hu­man tasks: Most of these tasks can be solved with­out AGI (by Drexler’s CAIS, for ex­am­ple). The only task for which al­ign­ment is re­ally needed is “pre­vent­ing the cre­ation of other un­safe AGI,” that is, us­ing AGI as a weapon to stop other AGI pro­jects. Another im­por­tant and su­per-com­plex task which re­quires su­per­in­tel­li­gent AGI is reach­ing hu­man im­mor­tal­ity.

3.2. It is wrong to think of values as a property of a single human: values are social phenomena

1. Not “humans have values,” but “values have humans”

In the statement “Humans have values,” separate human beings are presented as the main subjects of values, i.e. those who have the values. But most values are defined by society and describe social behavior. In other words, as recognized by Marx, many values are not personal but social, and help to keep society working according to the current economic situation.

Society expends enormous effort to control people’s values via education, advertising, celebrities as role models, books, churches, ideologies, group membership identity, shaming, status signaling and punishment. Social values consist of unconscious repetition of group behavior plus conscious repetition of norms to maintain membership in the group. Much behavior is directed via the unconscious definition of one’s social role, as described by Valentine in the post “The Intelligent Social Web.”

2. Values as memes, used for group building

Very rarely can a person evolve his/her own values without being influenced by anyone else; more often, values take hold against his/her own will. In other words, it is not the case that “humans have values” – a more correct wording would be “values have humans.” This is especially true in the case of ideologies, which can be seen as especially effective combinations of memes – something like a memetic virus consisting of several “protein” memes, often supported by a large “training dataset” of schooling in a culture where this type of behavior seems to be the norm.

3. Ideologies

In the case of ideologies, values are not human preferences but instruments to manipulate and bind people. In ideologies (and religions) values are most clearly articulated, but they play the role of group membership tokens, not actual rules dictating actions. Indeed, most people are unable to follow such sets of rules.

Hanson wrote that “X is not about X,” and this is an example. To be a member of the group, a person must vocally agree that his/her main goal is X (e.g. “Love god X”), which is easily verifiable. But whether he is actually doing enough for X is much less measurable, and sometimes even unimportant.

For ex­am­ple, Je­sus pro­moted val­ues of “liv­ing as a bird,” “poverty” or “turn the other cheek” (“But I say unto you, That ye re­sist not evil: but whoso­ever shall smite thee on thy right cheek, turn to him the other also.” Mat 5:39, KJV), but churches are rich or­ga­ni­za­tions and hu­man­ity con­stantly en­gages in re­li­gious wars.

A per­son could be trained to have any value ei­ther by brain­wash­ing or by in­ten­sive re­ward, but still pre­serve his/​her iden­tity.

4. Religion

In religions, values and ideologies are embedded in a more complex mythological context. Most people who ever lived were religious. Religion, as a successful combination of memes, is something like the genetic code of a culture. Religion also requires complete adherence to even the smallest rituals – like eating certain types of food and wearing exact forms of hats – not only to a few ideological rules.

There is a theory that religion was needed to compensate for the fear of death in early humans, and thus humans are genetically selected to be religious. The idea of God is not a necessary part of religion, as there are religion-like belief systems without a god which nevertheless have all the structural elements of religion (Buddhism, communism, UFO cults like the Raëlian movement).

Moreover, even completely anti-religious and declaratively rational ideologies may still have structural similarities to religion, as was noted by Cory Doctorow in “Rapture of the Nerds.” Even the whole idea of a future superintelligent AI could be seen as a religious view mirrored into the future, in which sins are replaced with “cognitive biases,” churches with “rationality houses,” etc.

In the case of religion, a significant part of one’s “personal values” is not personal but defined by religious membership – especially the declarative values. Actual human behavior can significantly deviate from religious norms because of the combination of affect, situation, and personal traits.

5. Hypnosis and unconscious learning

At least some hu­mans are sus­cep­ti­ble to in­fluence by the be­liefs of oth­ers, and charis­matic peo­ple use this abil­ity. For ex­am­ple, I knew about cry­on­ics for a long time, but only started to be­lieve in it af­ter Mike Dar­win told me his per­sonal view about it.

The highest (but also the most controversial) form of suggestibility is hypnosis, which has two not-necessarily-simultaneous manifestations: trance induction and suggestion. The second doesn’t necessarily require the former.

People can also learn by observing the actions of other people, which is a form of unconscious learning.

3.3. Humans don’t “have” values; they are vessels for values, full of different subpersonalities

1. Values are change­able but iden­tity is preserved

In this section, we will look more closely at the connection between a person and his/her values. When we say that “person X has value A,” some form of strong connection is implied. But human personal identity is stronger than most of the values a person will hold during his/her lifetime. It is assumed that identity is preserved from early childhood; for example, Leo Tolstoy wrote that he felt himself to be the same person from the age of 5 until his death. But most human values change during that time. Surely there can be some persistent interests which appear in childhood, but they will not dominate 100 percent of the time.

Thus, hu­man per­sonal iden­tity is not based around val­ues, and the con­nec­tion be­tween iden­tity and val­ues is weak. Values can ap­pear and dis­ap­pear dur­ing a life­time. More­over, a hu­man can have con­tra­dict­ing val­ues in the same mo­ment.

We can see a person as a vessel in which desires appear and disappear. In a normal person, some form of “democracy of values” takes place: he makes choices by comparing the relative power of different values and desires at a given moment, and the act of choice and its practical consequences update the balance of power between the values. In other words, while the values remain the same, the preferential relation between them is changing.
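This “democracy of values” can be sketched as a toy model; the value names, weights, noise range and update rule below are my illustrative assumptions, not a claim about how minds actually work:

```python
import random

random.seed(1)

# Toy "democracy of values": momentary strengths of competing values
# decide the choice, and the consequences of the choice feed back
# into the balance of power.
values = {"work": 0.5, "rest": 0.5}

def choose(values):
    # Each value's momentary power fluctuates with mood, fatigue, etc.
    momentary = {v: w * random.uniform(0.5, 1.5) for v, w in values.items()}
    return max(momentary, key=momentary.get)

def update(values, choice, satisfaction):
    # Acting on a value and observing the result shifts the future balance.
    values[choice] += 0.1 * (satisfaction - 0.5)

choice = choose(values)
update(values, choice, satisfaction=0.9)
```

Note that the stored values themselves persist; only their momentary power and mutual balance shift, mirroring the point that the preferential relation between values is what changes.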

From the idea of the personality as a vessel for values, two things follow:

1) Hu­man val­ues could be pre­sented as sub­agents which “live” in the vessel

2) There are meta-val­ues which pre­serve the ex­is­tence of the ves­sel and reg­u­late the in­ter­ac­tion be­tween the val­ues.

2. Subpersonalities

Many different psychological theories describe the mind as consisting of two, three, or many parts that can be called subpersonalities. The obvious difficulty of such a division is that subpersonalities do not “actually” exist, but are instead useful descriptions. Yet as descriptions they are not passive; they can actively support any theory and play along in the roles which are expected of them.

Another difficulty is that different people have different levels of schizotypy, or different “decoherence” between subpersonalities: hyper-rational minds can look completely monolithic, fluid minds can create subpersonalities ad hoc, and some people suffer from a strong dissociative disorder and actually possess subpersonalities.

Some in­ter­est­ing liter­a­ture on sub­per­son­al­ities (be­yond Kul­veit’s AGI Safety the­ory) in­cludes:

Victor Bogart, “Transcending the Dichotomy of Either ‘Subpersonalities’ or ‘An Integrated Unitary Self’”

Lester wrote extensively about the theory of subpersonalities in “A Subself Theory of Personality.”

The Encyclopedia of Personality and Individual Differences includes a section by Lester with findings about subpersonalities (p. 3691).

Sotala started a new sequence, “Sequence introduction: non-agent and multiagent models of mind.”

Mihnea Moldoveanu, “The self as a problem: The intra-personal coordination of conflicting desires.”

In “The Society of Mind,” Minsky wrote about many very small agents in the human mind – K-lines – which are much simpler than “personalities.” But current artificial neural nets don’t need them.

3. The in­finite di­ver­sity of hu­man values

The idea that “humans have values” assumes that there is a special human subset of all possible values. However, human preferences are very diverse. For any type of object, there is a person who collects it or likes YouTube videos about it. Humans can have any possible values, limited only by the values’ complexity.

4. Nor­ma­tive plu­ral­ity of values

Most moral theories, like utilitarianism, search for just one correct overarching value. However, problems like the repugnant conclusion appear. Such problems arise if we take the value literally or try to maximize it to extreme levels. The same problems will affect a possible future AGI if it tries to over-maximize its utility function. (Even a paperclip maximizer just wants to be sure that it will create enough paperclips.) Because of this, some writers on AGI safety have started to suggest that we should avoid utility functions in AGI, as they are inherently dangerous – for example, Shah’s post “AGI safety without goal-directed behavior”.

The idea that a good moral model should be based on the existence of many different values – without any overarching value – is presented in Carter’s article “A plurality of values.” However, this claim is self-contradictory, because the norm “there should be no overarching value” is itself an overarching value. Carter escapes this by suggesting the use of “indifference curves” from microeconomics: a type of utility function which combines two variables.
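Such a two-variable utility function can be sketched with a standard form from microeconomics; the Cobb-Douglas shape and the numbers below are my illustrative choice, not necessarily Carter’s:

```python
# A Cobb-Douglas utility combines two values so that neither can dominate:
# driving one variable to an extreme while the other falls to zero
# yields zero utility.
def utility(x, y, a=0.5):
    return (x ** a) * (y ** (1 - a))

# Points with equal utility lie on one indifference curve: the agent
# can trade one value for the other without loss.
print(utility(4, 1))    # 2.0
print(utility(1, 4))    # 2.0
print(utility(100, 0))  # 0.0: maximizing one value alone is worthless
```

The design point is that the combining rule, not any single value, does the work of an “overarching” criterion while remaining largely content-free.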

However, in that case the overarching values may be “content-free”. For example, a functional democracy provides everybody the right to free speech, but doesn’t prescribe the content of speech, apart from a few highly debated topics like hate speech or speech which affects others’ ability to speak. Yet exactly these “forbidden” topics, as well as the level of their restriction, soon become the most attractive subjects of discussion.

Bostrom wrote about a Parliamentary Model in which different values are represented. But any parliament needs a speaker and rules.

3.4. Values, choices, and commands

A person can hold many contradictory values, but an act of choice is the irreversible decision to take one of several options – and such a choice may take the form of a command to a robot or an AGI. The act of making a choice is something like an irreversible collapse (similar to the collapse of a quantum wave function onto some basis), and making a choice requires significant psychological energy, as it often means denying the realization of other values and, consequently, feeling frustration and other negative emotions. In other words, making a choice is complex moral work, not just a simple process of inference from an existing set of values. Many people suffer from an inability to make choices, or an inability to stick with choices they made previously.

A choice is typically not finalized until we take some irreversible action in the chosen direction, like buying a ticket to the chosen country.

In the case of Task AGI (an AGI designed to perform a task perfectly and then stop), the choice is the moment when we give the AGI a command.

In some sense, mak­ing choices is moral work of hu­mans, and if AGI au­to­mates this work, it will steal one more job from us – and not only a job, but the mean­ing of life.

Differ­ence be­tween val­ues and desires

Inside the idea of human values is a hidden assumption that there is a more or less stable set of preferences which people can consciously access, so that people bear some responsibility for holding particular values, because they can change and implement them. An alternative view is “desires”: these appear suddenly and out of nowhere, and the conscious mind is their victim.

For ex­am­ple, let us com­pare two state­ments:

“I pre­fer healthy en­vi­ron­men­tally friendly food” – this is a con­scious prefer­ence.

“I had a sud­den urge to go out­side and meet new peo­ple” – this is a de­sire.

Desires are unpredictable and overwhelming; they may be useless from the point of view of the person’s rational mind, but still useful from a more general perspective (for example, they may signal that it is time to take a rest).

3.5. Meta-val­ues: prefer­ence about values

1. Meta-val­ues as morals

The idea that “humans have values” assumes that values form some unstructured set of things, in the same way a person could say that he has tomatoes, cucumbers and onions. But the relations between values are more complex, and there are values about values.

For example, a person may have some food preferences but not approve of these preferences, as they result in overeating. Negative meta-values encode acts of suppressing a normal-level value or, alternatively, self-shaming. Positive meta-values encourage a person to do what he already likes to do, or foster a value for some useful thing.

Meta-meta-values are also possible: for example, if one wants to be a perfect person, s/he will encourage his/her value for healthy food. The ability to enforce one’s meta-values over one’s own values is called “willpower”. For example, every fight with procrastination is an attempt to enforce the meta-value of “work” over short-term pleasures.

Meta-values are closer to morals: they are more consciously articulated, but there is always practical difficulty in enforcing them. The reason is that low-level values are based on strong, innate human drives and have close connections with short-term rewards; thus, they have more energy to affect practical behavior (hence the difficulties of dieting).

As meta-values typically sound more pleasant and are more consciously approved, humans are more likely to present them as their true values when asked in social situations. But it is more difficult to extract meta-values from human behavior than “normal” values.

2. Sup­pressed val­ues

These are values we consciously know we have, but which we would prefer not to have and do not wish to let affect our behavior; an example could be excessive sexual interest. Typically, humans are unable to completely suppress such undesired values, but at least they know about them and have an opinion about them.

3. Sub­con­scious val­ues and sub-personalities

The idea that “humans have values” assumes that the person knows what he has, but this is not always true. There are hidden values which exist in the brain but not in the conscious mind, and which can surface from time to time.

Freud was the first to discover the role of the unconscious in humans. But the field of the unconscious is very amorphous and easily adjusts itself to attempts to describe it; thus, any theory which tries to describe it becomes a self-fulfilling prophecy. Dreams may be full of libido symbols, but at the same time represent Jungian Anima archetypes. The reason is that the unconscious is not a thing, but a field where different forces combine.

Some people suffer from multiple personality disorder, in which several personalities take control over their body from time to time. These personalities have different main traits and preferences. This adds an obvious difficulty to the idea of “human values,” as the question arises: which values are real for a human who has many personalities in his/her brain? While true multiple personality disorder is rare, there is a theory that in any human there are many sub-personalities which constantly interact. Such sub-personalities can be called forth one by one by a psychotherapeutic method called “voice dialogue,” created by the Stones (Stone & Stone, 2011).

The theory behind sub-personalities claims that they can’t be completely and effectively suppressed, and will appear from time to time in the form of behavior such as jokes (an idea presented by Freud in “Jokes and Their Relation to the Unconscious”), tone of voice, spontaneous acts (like shoplifting), dreams, feelings, etc.

4. Zero be­hav­ior and con­tra­dict­ing values

Humans often have contradictory values. For example, if I want a cake very much but also have a strong inclination toward dieting, I will do nothing: I have two values which exactly compensate for each other and thus have no effect on my behavior. Observing only behavior will not give an observer any clue about these values. More complex examples are possible, where contradictory values create inconsistent behavior, and this is very typical of humans.

5. Preferences about others’ preferences

Humans can have preferences about the preferences of other people. For example: “I want M. to love me” or “I prefer that everybody be utilitarian”.

Such preferences are somewhat recursive: I need to know the real nature of human preferences in order to be sure that other people actually want what I want them to want. In other words, a preference about a preference embeds an idea of what I think a “preference” is: if M. behaves as if she loves me – is that enough? Or should it be her claims of love? Or her emotions? Or the coherence of all three?

3.6. Hu­man val­ues can­not be sep­a­rated from the hu­man mind

1. Values are not en­coded sep­a­rately in the brain

The idea that “humans have values” assumes the existence of at least two separate entities: the human and the values.

There is no separate neural network or brain region that represents a human value function (the limbic system encodes emotions, but emotions are only part of human values). While there is a distinctive reward-regulating region, the reward itself is not a human value (insofar as we agree that pure wireheading is not good). Most of what we call “human values” is not only about reward (though reward surely plays a role), but includes an explanation of what the reward is for, i.e. some conceptual level.

Any process in the human mind has intentionality. For example, a memory of the smell of a rose will affect our feelings about roses. This means that it is not easy to distinguish between facts and values in someone’s mind, and the orthogonality thesis doesn’t hold for humans.

The orthogonality thesis can’t be applied to humans in most cases, as there is no precise border between a human value and other information or processes in the human mind. The complexity of human values means that a value is deeply rooted in everything I know and feel, and that attempts to present values as a finite set of short rules do not work very well.

Surely, we can use the idea of a human set of preferences if we want some method to approximately predict a person’s approval and behavior. It will offer something like, say, an 80 percent prediction of human choices. This is more than enough for predicting the behavior of a consumer, where we can monetize any prediction above random: e.g. if we predict that 80 percent of people would prefer red t-shirts to green ones, we can adjust manufacturing and earn a profit. (An interesting article on the topic: “Inverse Reinforcement Learning for Marketing.”)
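As a sketch of why even an imperfect preference prediction is monetizable (the demand figures and the stocking rule are invented for illustration):

```python
# Hypothetical numbers: with no prediction we stock 50/50; knowing the
# 80% red preference, we match stock to demand instead.
demand = {"red": 0.8, "green": 0.2}

def fraction_sold(stock):
    # Sales of each color are limited by the smaller of stock and demand.
    return sum(min(stock[c], demand[c]) for c in demand)

naive = {"red": 0.5, "green": 0.5}
informed = {"red": 0.8, "green": 0.2}
print(round(fraction_sold(naive), 2))     # 0.7: 30% of stock sits unsold
print(round(fraction_sold(informed), 2))  # 1.0: everything is sold
```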

However, a reconstructed set of values is not enough to predict human behavior in edge cases, like “Sophie’s Choice” (a novel about a Nazi camp, in which a woman has to choose which of her children will be executed), or a real-world trolley problem. But exactly such predictions are important in AGI safety, especially if we want an AGI to make pivotal decisions about the future of humanity! Some possible tough questions: should humans be uploaded? Should we care about animals, aliens, or unborn possible people? Should a small level of suffering be preserved to avoid eternal boredom?

In­ter­est­ingly, hu­mans evolved abil­ity to pre­dict each other’s be­hav­ior and choices to some ex­tent, partly limited to the same cul­ture, age and situ­a­tion, as this skill is es­sen­tial to effec­tive so­cial in­ter­ac­tion. We au­to­mat­i­cally cre­ate some “the­ory of mind”, and there is also a “folk the­ory of mind”, in which peo­ple are pre­sented as sim­ple agents with clear goals which dic­tate their be­hav­ior (like “Max is only in­ter­ested in money and that’s why he changed jobs.”)

2. Hu­man val­ues are dis­persed in­side “train­ing data” and “trained neu­ral nets”

Not only are val­ues are not lo­cated in some place in the brain, they are not learned as “rules.” If we train an ar­tifi­cial neu­ral net on some kind of dataset, like Karpa­thy’s RNN on texts, it will re­peat prop­er­ties of the texts (such train­ing in­cludes a re­ward func­tion, but it rather sim­ple and tech­ni­cal and only demon­strates similar­ity of out­put to the in­put). In the same way, a per­son who grew up in some so­cial en­vi­ron­ment will re­peat its main be­hav­ioral habits, like car driv­ing habits or in­ter-per­sonal re­la­tions mod­els. The in­ter­est­ing point is that these traits are not pre­sented ex­plic­itly ei­ther in­side the data nor in­side the neu­ral net trained on it. No sin­gle neu­ron is cod­ing the hu­man prefer­ence for X, but be­hav­ior which could be in­ter­preted as a statis­ti­cal in­cli­na­tion to X is re­sult­ing from col­lec­tive work of all neu­rons.

In other words, a statistically large ensemble of neurons trained on a statistically large dataset creates a statistically significant inclination toward some type of behavior, which could essentially be described as some “rule-like value,” though this is only an approximation.
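A minimal sketch of this point, assuming nothing beyond a toy perceptron and synthetic data (all numbers are made up): after training, the statistical inclination is recoverable from the network’s behavior, yet no individual weight “is” the value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "life experience": 200 situations with 5 features; the agent's
# choices statistically track feature 0, plus noise
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Train a tiny perceptron on this experience
w, b = np.zeros(5), 0.0
for _ in range(20):
    for xi, yi in zip(X, y):
        err = yi - int(xi @ w + b > 0)
        w, b = w + err * xi, b + err

# The inclination exists only statistically, in the collective work of all weights
acc = np.mean(((X @ w + b) > 0).astype(int) == y)
print(f"behavioral regularity recovered: {acc:.0%}")
```

Describing this trained net as “it values feature 0” is a useful approximation of its statistics, not a pointer to an object stored anywhere in the net.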

3. Amorphous structure of human internal processes and false positives in finding internal parts

Each neuron basically works as an adding machine of inputs and fires when the sum is high enough. The same principle can be found in psychological processes, which add up until they trigger action. This creates difficulty in inferring motives from actions, as any action results from a combination of many different inputs.
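The summation-and-threshold principle fits in a few lines (the inputs and weights below are purely illustrative). Note that quite different combinations of inputs cross the same threshold, which is exactly why a single motive cannot be read off from the resulting action.

```python
def fires(inputs, weights, threshold=1.0):
    """A neuron - or, by analogy, a motive - triggers only when the
    weighted sum of many inputs crosses a threshold."""
    return sum(i * w for i, w in zip(inputs, weights)) >= threshold

# Hypothetical inputs to "start eating": hunger, smell of food, social setting
print(fires([0.2, 0.9, 0.3], [1.0, 1.0, 0.5]))  # True: mostly the smell
print(fires([0.9, 0.1, 0.3], [1.0, 1.0, 0.5]))  # True: mostly hunger
print(fires([0.2, 0.1, 0.3], [1.0, 1.0, 0.5]))  # False: sum below threshold
```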

This also creates the problem of false positives in human mind modeling: human behavior under some fixed conditions and expectations produces the expected types of behavior, statistically confirming the experimenter’s hypothesis.

3.7. The presentation of human values is biased towards socially accepted claims

The idea of “human values” is biased towards morality. When we think of human values we expect that something good and high-level will be presented, like “equality” or “flourishing,” as humans are under social pressure to present an idealized version of the self. In contemporary society, someone will not be praised if he says that he likes to “kill, rape and eat a lot of sugar.” This creates internal censorship, which could even be unconscious (Freudian censorship) [ref]. Humans claim and even believe that they have socially accepted values: that they are nice, positive, etc. This creates an idealized image of the self. But humans are unreflective about their suppressed motives and even actions. Thus, they lie to themselves about the actual goals of their behavior: they do A thinking that the goal is X, but their real motive is Y [Hanson].

Societies with strong ideologies affect the self-representation of values more strongly. Idealized and generalized versions of values start to look like morals.

In his book “The Elephant in the Brain,” Hanson presents a model in which the selfish subconscious tries to maximize personal social status, while the conscious mind creates a narrative explaining the person’s actions as altruistic and acceptable.

3.8. Human values can be manipulated by the way and order in which they are extracted

The idea that “humans have values” assumes that such values exist independently of some third-party observer who can objectively measure them.

However, by using different questions and ordering those questions differently, one can manipulate human answers. One method of such manipulation is Ericksonian hypnosis, in which each question creates certain frames and also carries hidden assumptions.

Another simple but effective manipulative marketing strategy is the “Three Yeses” technique, in which previous questions frame future answers. In other words, by carefully constructing the right questions we could extract almost any value system from a person, which diminishes the usefulness of such extraction.

This could also affect AGI safety: if AGI has some preconceptions of what the value system should be, or even wants to manipulate values, it could find ways to do so.

3.9. Human values are, in fact, non-human

Human values are formed by forces which are not human. First of all, evolution and natural selection. Human values are also shaped by non-human forces like capitalism or the Facebook algorithm and targeted advertising. Being born into some culture, or being affected by certain books or traumatic events, is also a random process outside the person’s choice.

Many viruses can affect human behavior in ways that make their replication easier. The common cold makes people more social. Toxoplasma infection seems to make people (and infected mice) less risk-averse. See e.g. “Viruses and behavioral changes: a review of clinical and experimental findings.”

There are even more remarkable claims that our microbiome controls human behavior, including food choices and reproduction via the production of pheromone-like chemicals on the skin. It has been claimed that fecal transplants can cure autism via changes in the gut microbiome.

3.10. Any human value model has not only epistemological assumptions, but also axiological (normative) assumptions

If a psychological model does not just describe human motivation, but also determines what part of this motivational system should be learned by AGI as “true values,” it inevitably includes axiological or normative assumptions about what is good and what is bad. A similar idea was explored by Armstrong in “Normative assumptions: regret.”

The most obvious such “value assumption” is that someone’s reward function should be valued at all. For example, in an interaction between a human and a snail, we expect that the human reward function (if we are not extreme pro-animal-rights activists) is the correct one, and that the “snail’s values” should be ignored.

Another type of axiological assumption concerns what should more correctly be regarded as actual human values: rewards or claims. This is not a factual assumption, but an assumption about importance, which could also be presented as a choice of whom an observer should believe: rationality or emotions, rider or elephant, System 2 or System 1, rules or reward.

There are also meta-value assumptions: should I regard “rules about rules” as more important than my primary values? For example, I often say people should ignore the tone of my voice; I only endorse the content of my verbal communication.

Psychological value models are often normative, as they are often connected with psychotherapy, which is based on some idea of what a healthy human mind is. For example, Freud’s model presents not only a model of the human mind, but also a model of diseases of the mind; in Freud’s case, neuroses.

3.11. Values may not be the best route to simple and effective descriptions of human motivation

From the point of view of naïve folk psychology, a value system is easily tractable: “Peter values money, Alice values family life” – but the analysis above shows that if we go deeper, the complexity and problems of the idea of human values grow to the point of intractability.

In other words, the idea that “humans have values” assumes that “value” is a correct primitive which promises an easy and quick description of human behavior, but it doesn’t fulfill this promise on close examination. Thus, maybe it is the wrong primitive, and some other simple idea would provide a better description – one with lower complexity that is more easily extractable – than values? There are at least two alternatives to values as short descriptors of the human motivational system: “wants” and commands.

Obviously, there is a difference between “values” and “wants.” For example, I could sit on a chair and not want anything, but still have some values, e.g. about personal safety or the well-being of African animals. Moreover, a person with different values may have similar “wants.” Intuitively, correctly understanding “wants” is the simpler task.

I can reconstruct my cat’s “wants” based on the tone of her meows. She may want to eat, have a door opened, or be cuddled. However, reconstructing the cat’s values is a much more complex task which must be based on assumptions.

The main difference between wants and values: if you want something, you know it, but if you have a value, you may not know about it. The second difference: wants can be temporarily satisfied but will reappear, while values are constant. Values generate wants, and wants generate commands. Only wants form the basis for commands to AGI.

3.12. Who are the real humans in “human values”?

The idea of human values assumes that we can easily define who “humans” – that is, morally significant beings – are. This question suffers from edge cases which may not be easy for an AGI to guess. Some such edge cases:

· Are apes humans? Neanderthals?

· Is Hitler human?

· Are coma patients humans?

· What about children, drug-intoxicated people, or Alzheimer’s patients?

· Extraterrestrials?

· Unborn children?

· Feral children?

· Individuals with autism and victims of different genetic disorders?

· Dream characters?

By manipulating the definition of who is “human,” we could manipulate the outcome of a measurement of values.

3.13. The human reward function is not “human values”

Many ideas about learning human values are in fact describing learning based on the “human reward function.” From the point of view of neurology and subjective experience, human reward is the activation of certain centers in the brain and the experience of qualia of pleasure. But when calculated by analyzing behavior, the “human reward function” does not necessarily mean a set of rules for endorphin bursts. Such a reward function would mean pure hedonistic utilitarianism, which is not the only possible moral philosophy, and might even mean voluntary wireheading. The existence of high-level goals, principles and morals means that the qualia of reward are only a part of the human motivational system.

Alternatively, a human reward function may be viewed as some abstract concept which describes the set of human preferences in the style of VNM-rationality (converting a set of preferences into a coherent utility function), but which is unknown to the person.
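A hedged sketch of that VNM-style construction (the outcomes and observed choices below are hypothetical): search for an ordering consistent with the observed pairwise choices and convert it into a utility function. The construction succeeds only when the choices happen to be transitive – which, as argued throughout, real human choices often are not.

```python
from itertools import permutations

def coherent_utility(pairwise_choices):
    """Try to convert observed pairwise choices (winner, loser) into a
    utility function; return None if no coherent ordering exists."""
    items = sorted({x for pair in pairwise_choices for x in pair})
    for order in permutations(items):
        rank = {x: i for i, x in enumerate(order)}
        if all(rank[w] < rank[l] for w, l in pairwise_choices):
            return {x: len(items) - rank[x] for x in order}
    return None  # intransitive: no VNM-style utility function exists

# Transitive observations yield a utility function...
print(coherent_utility([("health", "money"), ("money", "fame")]))
# ...cyclic ones yield nothing
print(coherent_utility([("a", "b"), ("b", "c"), ("c", "a")]))  # None
```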

One assumption about human values is that humans have a constant reward – but the human reward function evolves with age. For example, sexual images and activities become rewarding for teenagers and are mediated by the production of sex hormones. Human rewards also change after we are satisfied by food, water or sex.

Thus, the human reward function is not a stable set of preferences about the world, but changes with age and with previous achievements. This reward function is a black box to the conscious mind, but controls it by presenting different rewards. Such a black-boxed reward function may be described as a rule-based system.

A possible example of such a rule: “If age = 12, turn on sexual reward.” Such a rule generator is unconscious but has power over the conscious mind – and we may think that this is not good! In other words, we could have moral preferences about different types of motivation in humans.
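The rule-based picture can be sketched as follows (the rules are schematic illustrations echoing the example above, not claims about actual neurobiology). The key property is that such a “reward function” is not a stable set of preferences: it switches with age and satiation.

```python
def reward_active(stimulus, age, satiated=False):
    """A toy 'black-boxed' rule generator: which rewards are switched on."""
    if stimulus == "food":
        return not satiated      # reward disappears after eating
    if stimulus == "sexual":
        return age >= 12         # the schematic rule from the text
    return False

print(reward_active("food", age=30))                 # True
print(reward_active("food", age=30, satiated=True))  # False
print(reward_active("sexual", age=8))                # False
```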

3.14. Difficult cases for value learning: enlightenment, art, religion, homosexuality and psi

There are several types of situations or experiments where the existence of a stable set of preferences is clear, like multiple choices between brands (apples vs. oranges), different forms of the trolley problem, questionnaires, etc. However, there are situations and activities which are not easy to describe in this value language.

Enlightenment – many practitioners claim that in some higher meditation states the idea of personal identity and of a unique personal set of preferences, or even of the reality of the outside world, becomes “obsolete,” seen as wrong or harmful in practice. This may or may not be factually true, but it obviously affects preferences: the person appears to have a meta-value of not having values, and of some form of meaningful non-existence (e.g. nirvana, moksha). How could we align AGI with Buddha?

Art – rational thinking often has difficulty understanding art, and many of its interpretations based on outside views are oversimplified. Moreover, a significant part of art is about violence, and we enjoy it – but we don’t want AGI to be violent.

Religion – seems to often assign a lot of value to false, or at least uncheckable, claims. Religion is one of the strongest memetic producers, but it also includes some theories of motivation which are not about values, but are based on other basic ideas like “free will” or “God’s will.” Religion could also be seen as an invasive ideology or memetic virus which overrides personal preferences.

Psi – contemporary science denies the validity of parapsychological research, but observations like Jungian synchronicity or Grof’s transpersonal psychology continue to appear and imply a different model of the human mind and motivation than traditional neuroscience. Even some AGI researchers, like Ben Goertzel, are interested in psi. In Grof’s psychology, the feelings and values of other humans and even animals could influence a person (under LSD) in non-physical ways, and in a more minor form this could happen (if it is possible at all) even in ordinary life.

Idleness – and non-goal-oriented states of mind, like random thought streams.

Nostalgia – this is an example of a value which has very large factual content. It is not just an idea of the pure happiness of the feeling of “returning home.” It is an attraction to the “training dataset”: home country and language, often arising from the subconscious, in dreams, but later taking over the conscious mind.

There are a few other fields, already mentioned, where the idea of values runs into difficulties: dreams, drug-induced hallucinations, childhood, psychiatric diseases, multiple personality disorder, crimes of passion, qualia. And all of these are not just edge cases – they are the biggest and most interesting part of what makes us human.

3.15. Human values excluding each other, and the Categorical Imperative as a meta-value

As a large part of human values are preferences about other people’s preferences, they mutually exclude each other. E.g.: {I want “X to love me,” but X doesn’t want to be influenced by others’ desires}. Such situations are typical in ordinary life, but if such values are scaled and extrapolated, one side must be chosen: either I win, or X does.

To escape such situations, something like the Kantian moral law, the Categorical Imperative, should be used as a meta-value which basically regulates how different people’s values relate to each other:

Act only according to that maxim by which you can at the same time will that it should become a universal law.

In other words, the Categorical Imperative is something like “updateless decision theory,” in which you choose a policy without updating on your local position, so that if everybody uses this principle, they will arrive at the same policy. (See a comparison of different decision theories developed by the LessWrong community here.)
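A toy model of this reading, with hypothetical payoff numbers: evaluate each policy as if everyone adopted it (rather than as a best local reply from your particular position), and choose the best universalized outcome.

```python
# Symmetric payoffs for a two-player interaction (hypothetical numbers):
# my payoff given (my policy, other's policy)
payoffs = {
    ("cooperate", "cooperate"): 3, ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 5, ("defect", "defect"): 1,
}

# Local reasoning: the best reply against a cooperator is to defect
best_reply = max(["cooperate", "defect"], key=lambda p: payoffs[(p, "cooperate")])

# "Updateless"/Kantian reasoning: evaluate each policy as a universal law
universalized = max(["cooperate", "defect"], key=lambda p: payoffs[(p, p)])

print(best_reply, universalized)  # defect cooperate
```

The two modes of reasoning disagree, which illustrates why such a meta-level principle cannot be read off from observing any single person’s local choices.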

From the Categorical Imperative some human values can be derived, like: it is bad to kill other people, as one doesn’t want to be killed. However, the main point is that such a meta-level principle governing the relations between the values of different people can’t be derived just from observation of a single person.

Moreover, most ethical principles describe interpersonal relations, so they are not about personal values, but about the ways the values of different people should interact. Things like the Categorical Imperative can’t be learned from observation; but they also can’t be deduced by pure logic, so they can’t be called “true” or “false.”

In other words, an AGI learning human values can neither learn meta-ethical principles like the Categorical Imperative from observation nor deduce them from pure math. That is why we should provide AGI with a correct decision theory – but it is not clear why a “correct theory” should exist at all.

This could also be called a meta-ethical normative assumption: some high-level ethical principles which can’t be deduced from observations.


The arguments presented above demonstrate that the idea of human values is artificial and not very useful for AGI safety in its naive form. There are many hidden assumptions in it, and these assumptions may affect the AGI alignment process, resulting in unsafe AGI.

In this article, we deconstructed the idea of human values and came to a set of conclusions which can be summarized as follows:

“Human values” are useful descriptions, not real objects.

● “Human values” are just a useful instrument for the description of human behavior. There are several other ways of describing human behavior, such as choices, trained behavior, etc. Each of these has its own advantages and limitations.

● Human values cannot be separated from other processes in the human brain (human non-orthogonality).

● There are at least four different ways to learn about a human’s values, which may not converge (thoughts, declarations, behavior, emotions).

“Human values” are poor predictors of behavior

● The idea of “human values” or a “set of preferences” is good at describing only the statistical behavior of consumers.

● Human values are weak predictors of human behavior, as behavior is affected by situation, randomness, etc.

● Human values are not stable: they often change with each new choice.

● Large classes of human behavior and claims must be ignored if one wants to learn an individual’s true values.

The idea of a “human value system” has flaws

● At each moment, a person has a contradictory set of values, and his/her actions are a compromise between them.

● Humans do not have one terminal value (unless they are mentally ill).

● Human values are not ordered as a set of preferences. A rational set of preferences is a theoretical model of ordered choices, but human values are constantly fighting each other. The values are biased and underdefined – but this is what makes us human.

● Humans do not “have” values: human personal identity is not strongly connected with human values; values are fluid, but identity is preserved.

“Human values” are not good by default.

● Anything could be a human value (e.g. some people may have an attraction to rape or violence).

● Some real human values are dangerous, and it would not be good to have them in AGI.

● “Human values” are not “human”: they are similar to the values of other animals, and they are also social memetic constructs.

● Human values are not necessarily safe if scaled, removed from humans, or separated from each other. An AGI with human values may not be safe.

Human values cannot be separated from the human mind.

● Any process in the human mind has intentionality; the orthogonality thesis cannot be applied to humans in most cases.

● As the human mind is similar to a neural network trained on a large dataset, human values and behavioral patterns are not explicitly presented in any exact location, but are distributed throughout the brain.

● There is no simple psychological theory which substantially outperforms other theories when it comes to a full model of the human mind, behavior and motivation.

● “Human values” implies that individual values are more important than group values, like family values.

● Not all “human values” are values of the conscious mind. For example, somnambulism, dreams, and multiple personality disorder may look like human values inside a person’s brain, but are not part of the conscious mind.

We recommend that either the idea of “human values” should be replaced with something better for the goal of AGI safety, or at least be used very cautiously; the approaches to AI safety which don’t use the idea of human values at all may require more attention, like the use of full brain models, boxing and capability limiting.


The work was started during AI Safety Camp 2 in Prague, 2018. I want to thank Linda Linsefors, Jan Kulveit, David Denkenberger, Alexandra Surdina, and Steven Umbrello, who provided important feedback on the article. All errors are my own.

Appendix. Table of assumptions in the idea of human values

This table (in Google Docs) presents all findings of this section in a more condensed and structured form. The goal of this overview is to help future scientists estimate the validity of their best model of human values.

See also an attempt to map 20 main assumptions against 20 main theories of human values as a very large spreadsheet here.