The two-layer model of human values, and problems with synthesizing preferences

I have been thinking about Stuart Armstrong’s preference synthesis research agenda, and have long had the feeling that there’s something off about the way that it is currently framed. In this post I try to describe why. I start by describing my current model of human values and how I interpret Stuart’s implicit assumptions to conflict with it, and then talk about my confusion with regard to reconciling the two views.

The two-layer/ULM model of human values

In Player vs. Character: A Two-Level Model of Ethics, Sarah Constantin describes a model where the mind is divided, in game terms, into a “player” and a “character”. The character is everything that we consciously experience, but our conscious experiences are not our true reasons for acting. As Sarah puts it:

In many games, such as Magic: The Gathering, Hearthstone, or Dungeons and Dragons, there’s a two-phase process. First, the player constructs a deck or character from a very large sample space of possibilities. This is a particular combination of strengths and weaknesses and capabilities for action, which the player thinks can be successful against other decks/characters or at winning in the game universe. The choice of deck or character often determines the strategies that deck or character can use in the second phase, which is actual gameplay. In gameplay, the character (or deck) can only use the affordances that it’s been previously set up with. This means that there are two separate places where a player needs to get things right: first, in designing a strong character/deck, and second, in executing the optimal strategies for that character/deck during gameplay. [...]
The idea is that human behavior works very much like a two-level game. [...] The player determines what we find rewarding or unrewarding. The player determines what we notice and what we overlook; things come to our attention if it suits the player’s strategy, and not otherwise. The player gives us emotions when it’s strategic to do so. The player sets up our subconscious evaluations of what is good for us and bad for us, which we experience as “liking” or “disliking.”
The character is what executing the player’s strategies feels like from the inside. If the player has decided that a task is unimportant, the character will experience “forgetting” to do it. If the player has decided that alliance with someone will be in our interests, the character will experience “liking” that person. Sometimes the player will notice and seize opportunities in a very strategic way that feels to the character like “being lucky” or “being in the right place at the right time.”
This is where confusion often sets in. People will often protest “but I did care about that thing, I just forgot” or “but I’m not that Machiavellian, I’m just doing what comes naturally.” This is true, because when we talk about ourselves and our experiences, we’re speaking “in character”, as our character. The strategy is not going on at a conscious level. In fact, I don’t believe we (characters) have direct access to the player; we can only infer what it’s doing, based on what patterns of behavior (or thought or emotion or perception) we observe in ourselves and others.

I think that this model is basically correct, and that our emotional responses, preferences, etc. are all the result of a deeper-level optimization process. This optimization process, then, is something like that described in The Brain as a Universal Learning Machine:

The universal learning hypothesis proposes that all significant mental algorithms are learned; nothing is innate except for the learning and reward machinery itself (which is somewhat complicated, involving a number of systems and mechanisms), the initial rough architecture (equivalent to a prior over mindspace), and a small library of simple innate circuits (analogous to the operating system layer in a computer). In this view the mind (software) is distinct from the brain (hardware). The mind is a complex software system built out of a general learning mechanism. [...]
An initial untrained seed ULM can be defined by 1.) a prior over the space of models (or equivalently, programs), 2.) an initial utility function, and 3.) the universal learning machinery/algorithm. The machine is a real-time system that processes an input sensory/observation stream and produces an output motor/action stream to control the external world using a learned internal program that is the result of continuous self-optimization. [...]
The key defining characteristic of a ULM is that it uses its universal learning algorithm for continuous recursive self-improvement with regards to the utility function (reward system). We can view this as second (and higher) order optimization: the ULM optimizes the external world (first order), and also optimizes its own internal optimization process (second order), and so on. Without loss of generality, any system capable of computing a large number of decision variables can also compute internal self-modification decisions.
Conceptually the learning machinery computes a probability distribution over program-space that is proportional to the expected utility distribution. At each timestep it receives a new sensory observation and expends some amount of computational energy to infer an updated (approximate) posterior distribution over its internal program-space: an approximate ‘Bayesian’ self-improvement.

Rephrasing these posts in terms of each other: in a person’s brain, “the player” is the underlying learning machinery, which is searching the space of programs (brains) in order to find a suitable configuration; the “character” is whatever set of emotional responses, aesthetics, identities, and so forth the learning program has currently hit upon.
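To make this mapping more concrete, here is a minimal toy sketch of the two-level structure, modeling the player as a crude bandit-style learner. Everything here is invented for illustration: the configuration names, the reward numbers, and the particular update rule are assumptions, not anything taken from either quoted post.

```python
import random

random.seed(0)  # for reproducibility of this toy run

# Hidden expected reward of acting out each character configuration.
# These numbers are entirely made up for the sake of the sketch.
CONFIGS = {
    "confident": 0.9,
    "cautious": 0.5,
    "people-pleasing": 0.6,
}

def character_act(config):
    """The character just executes its current configuration; the
    resulting reward is noisy."""
    return CONFIGS[config] + random.gauss(0, 0.1)

def player_update(estimates, config, reward, lr=0.2):
    """The player never acts in the world directly; it only reweights
    configurations based on the rewards the character brings back."""
    estimates[config] += lr * (reward - estimates[config])

estimates = {c: 0.0 for c in CONFIGS}
for step in range(2000):
    # Mostly run the configuration the player currently rates highest,
    # occasionally exploring an alternative.
    if random.random() < 0.1:
        active = random.choice(list(CONFIGS))
    else:
        active = max(estimates, key=estimates.get)
    player_update(estimates, active, character_act(active))

# After enough experience the player settles on "confident" -- which,
# from the inside, would just feel like having become a confident person.
print(max(estimates, key=estimates.get))
```

The point of the sketch is only that the thing doing the selecting (the outer loop) never shows up inside the thing being selected (the configuration), mirroring the claim that the character has no direct access to the player.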

Many of the things about the character that seem fixed can in fact be modified by the learning machinery. One’s sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as “I am a technical person”) can change in response to new kinds of strategies becoming viable. Unlocking the Emotional Brain describes a number of such updates, such as—in these terms—the ULM eliminating subprograms blocking confidence after receiving an update saying that the consequences of expressing confidence will not be as bad as previously predicted.

Another example of this kind of thing was the framework that I sketched in Building up to an Internal Family Systems model: if a system has certain kinds of bad experiences, it makes sense for it to spawn subsystems dedicated to ensuring that those experiences do not repeat. Moral psychology’s social intuitionist model claims that people often have an existing conviction that certain actions or outcomes are bad, and that they then level seemingly rational arguments for the sake of preventing those outcomes. Even if you rebut the arguments, the conviction remains. This kind of model is compatible with an IFS/ULM-style model, where the learning machinery sets the goal of preventing particular outcomes, and then applies the “reasoning module” for that purpose.

Qiaochu Yuan notes that once you see people being upset at their coworker for criticizing them, and you do therapy approaches with them, and this gets to the point where they are crying about how their father never told them that he was proud of them… then it gets really hard to take people’s reactions to things at face value. Many of our consciously experienced motivations actually have nothing to do with our real motivations. (See also: Nobody does the thing that they are supposedly doing, The Elephant in the Brain, The Intelligent Social Web.)

Preference synthesis as a character-level model

While I like a lot of the work that Stuart Armstrong has done on synthesizing human preferences, I have a serious concern about it, which is best described as: everything in it is based on the character level, rather than the player/ULM level.

For example, in “Our values are underdefined, changeable, and manipulable”, Stuart—in my view, correctly—argues for the claim stated in the title… except that it is not clear to me to what extent the things we intuitively consider our “values” are actually our values. Stuart opens with this example:

When asked whether “communist” journalists could report freely from the USA, only 36% of 1950 Americans agreed. A follow-up question about American journalists reporting freely from the USSR got 66% agreement. When the order of the questions was reversed, 90% were in favour of American journalists—and an astounding 73% in favour of the communist ones.

From this, Stuart suggests that people’s values on these questions should be thought of as underdetermined. I think that this has a grain of truth to it, but that calling these opinions “values” in the first place is misleading.

My preferred framing would rather be that people’s values—in the sense of some deeper set of rewards which the underlying machinery is optimizing for—are in fact underdetermined, but that is not what’s going on in this particular example. The order of the questions does not change those values, which remain stable under this kind of consideration. Rather, consciously-held political opinions are strategies for carrying out the underlying values. Receiving the questions in a different order caused the system to consider different kinds of information when it was choosing its initial strategy, causing different strategic choices.
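As a minimal illustration of that claim, consider a toy sketch in which an agent’s underlying values are held fixed, while the order of the questions changes which consideration is salient when the first answer is generated. The mechanism, the value names, and every number below are hypothetical, chosen only to show how order-dependent stated opinions can coexist with stable underlying values.

```python
# Fixed underlying values; nothing in the survey ever modifies these.
UNDERLYING_VALUE = {"fairness": 0.6, "threat_aversion": 0.5}

def answer(question, salient):
    """Answer a yes/no question given which consideration is salient.
    The 'communist' question triggers extra threat aversion unless a
    fairness frame has already been established by an earlier answer."""
    if question == "communist_in_usa" and salient != "fairness":
        return UNDERLYING_VALUE["fairness"] > UNDERLYING_VALUE["threat_aversion"] + 0.3
    return UNDERLYING_VALUE["fairness"] > 0.5

def survey(order):
    salient = None
    answers = {}
    for q in order:
        a = answer(q, salient)
        # Answering "yes" to any journalist question makes the
        # fairness frame salient for later answers (a consistency
        # pressure on the strategy level, not a change of values).
        if a:
            salient = "fairness"
        answers[q] = a
    return answers

print(survey(["communist_in_usa", "american_in_ussr"]))
# -> {'communist_in_usa': False, 'american_in_ussr': True}
print(survey(["american_in_ussr", "communist_in_usa"]))
# -> {'american_in_ussr': True, 'communist_in_usa': True}
```

In both runs the underlying value dictionary is untouched; only the strategy-level answers flip, which is the distinction the paragraph above is drawing.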

Stuart’s research agenda does talk about incorporating meta-preferences, but as far as I can tell, all the meta-preferences are about the character level too. Stuart mentions “I want to be more generous” and “I want to have consistent preferences” as examples of meta-preferences; in actuality, these meta-preferences might exist because of something like “the learning system has identified generosity as a socially admirable strategy and predicts that to lead to better social outcomes” and “the learning system has formulated consistency as a generally valuable heuristic and one which affirms the ‘logical thinker’ identity, which in turn is being optimized because of its predicted social outcomes”.

My confusion about a better theory of values

If a “purely character-level” model of human values is wrong, how do we incorporate the player level?

I’m not sure, and am mostly confused about it, so I will just babble and boggle at my confusion for a while, in the hope that doing so helps.

The optimistic take would be that there exists some set of universal human values which the learning machinery is optimizing for. There exist various therapy frameworks which claim to have found something like this.

For example, the NEDERA model claims that there exist nine negative core feelings whose avoidance humans are optimizing for: people may feel Alone, Bad, Helpless, Hopeless, Inadequate, Insignificant, Lost/Disoriented, Lost/Empty, and Worthless. And pjeby mentions that in his empirical work, he has found three clusters of underlying fears which seem similar to these nine:

For example, working with people on self-image problems, I’ve found that there appear to be only three critical “flavors” of self-judgment that create life-long low self-esteem in some area, and associated compulsive or avoidant behaviors:
Belief that one is bad, defective, or malicious (i.e. lacking in care/altruism for friends or family)
Belief that one is foolish, incapable, incompetent, unworthy, etc. (i.e. lacking in ability to learn/improve/perform)
Belief that one is selfish, irresponsible, careless, etc. (i.e. not respecting what the family or community values or believes important)
(Notice that these are things that, if you were bad enough at them in the ancestral environment, or if people only thought you were, you would lose reproductive opportunities and/or your life due to ostracism. So it’s reasonable to assume that we have wiring biased to treat these as high-priority long-term drivers of compensatory signaling behavior.)
Anyway, when somebody gets taught that some behavior (e.g. showing off, not working hard, forgetting things) equates to one of these morality-like judgments as a persistent quality of themselves, they often develop a compulsive need to prove otherwise, which makes them choose their goals not based on the goal’s actual utility to themself or others, but rather based on the goal’s perceived value as a means of virtue-signaling. (Which then leads to a pattern of continually trying to achieve similar goals and either failing, or feeling as though the goal was unsatisfactory despite succeeding at it.)

So—assuming for the sake of argument that these findings are correct—one might think something like “okay, here are the things the brain is trying to avoid; we can take those as the basic human values”.

But not so fast. After all, emotions are all computed in the brain, so “avoidance of these emotions” can’t be the only goal, any more than “optimizing happiness” can. It would only lead to wireheading.

Furthermore, it seems like one of the things that the underlying machinery also learns is the situations in which it should trigger these feelings. E.g. feelings of irresponsibility can be used as an internal carrot-and-stick scheme, in which the system comes to predict that making itself feel persistently bad will cause parts of it to pursue specific goals in an attempt to make those negative feelings go away.

Also, we are not only trying to avoid negative feelings. Empirically, it doesn’t look like happy people end up doing less than unhappy people, and guilt-free people may in fact do more than guilt-driven people. The relationship is nowhere near linear, but it seems like there are plenty of happy, energetic people who are happy in part because they are doing all kinds of fulfilling things.

So maybe we could look at the inverse of negative feelings: positive feelings. The current mainstream model of human motivation and basic needs is self-determination theory, which explicitly holds that there exist three separate basic needs:

Autonomy: people have a need to feel that they are the masters of their own destiny and that they have at least some control over their lives; most importantly, people have a need to feel that they are in control of their own behavior.
Competence: another need concerns our achievements, knowledge, and skills; people have a need to build their competence and develop mastery over tasks that are important to them.
Relatedness (also called Connection): people need to have a sense of belonging and connectedness with others; each of us needs other people to some degree.

So one model could be that the basic learning machinery is, first, optimizing for avoiding bad feelings; and then, optimizing for things that have been associated with good feelings (even when doing those things is locally unrewarding, e.g. taking care of your children even when it’s unpleasant). But this too risks running into the wireheading issue.
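The wireheading worry can be made concrete with a trivial sketch: if the objective is the felt reward signal itself, then any action that rewrites the signal dominates every action that actually changes the world. The action names and numbers below are invented purely for illustration.

```python
# Hypothetical actions with made-up payoffs. "wirehead" stands in for
# any action that modifies the felt-reward channel directly instead of
# changing the world.
ACTIONS = {
    "raise_children": {"world_value": 1.0, "felt_reward": 0.6},
    "do_nothing":     {"world_value": 0.0, "felt_reward": 0.5},
    "wirehead":       {"world_value": 0.0, "felt_reward": 1.0},
}

def best_action(objective):
    """Pick the action that maximizes the given objective key."""
    return max(ACTIONS, key=lambda a: ACTIONS[a][objective])

print(best_action("felt_reward"))   # the signal-maximizer wireheads
print(best_action("world_value"))   # a world-directed objective does not
```

This is just the familiar point restated in code: an objective defined over the brain’s own signals, rather than over the world, is satisfied by hijacking the signals.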

A problem here is that while it might make intuitive sense to say “okay, if the character’s values aren’t the real values, let’s use the player’s values instead”, the split isn’t actually anywhere near that clean. In a sense the player’s values are the real ones—but there’s also a sense in which the player doesn’t have anything that we could call values. It’s just a learning system which observes a stream of rewards and optimizes it according to some set of mechanisms, and even the reward and optimization mechanisms themselves may end up getting at least partially rewritten. The underlying machinery has no idea about things like “existential risk” or “avoiding wireheading” or necessarily even “personal survival”—thinking about those is a character-level strategy, even if it is chosen by the player using criteria that it does not actually understand.

For a moment it felt like looking at the player level would help with the underdefinability and mutability of values, but the player’s values seem like they could be even less defined and even more mutable. It’s not clear to me that we can call them values in the first place, either—any more than it makes meaningful sense to say that a neuron in the brain “values” firing and releasing neurotransmitters. The player is just a set of code, or going one abstraction level down, just a bunch of cells.

To the extent that there exists something that intuitively resembles what we call “human values”, it feels like it exists on some hybrid level which incorporates parts of the player and parts of the character. That is, assuming that the two can even be clearly distinguished from each other in the first place.

Or something. I’m confused.