Resolving human values, completely and adequately

In previous posts, I’ve been assuming that human values are complete and consistent, but, finally, we are ready to deal with actual human values/preferences/rewards—the whole contradictory, messy, incoherent mess of them.

Define “completely resolving human values” as an AI extracting a consistent human reward function from the inconsistent data on the preferences and values of a single human (leaving aside the easier case of resolving conflicts between different humans). This post will look at how such resolutions could be done—or at least propose an initial attempt, to be improved upon.


Adequate versus elegant

Part of the problem with resolving human values is that people have been looking to do it too well and too elegantly. This results either in complete resolutions that ignore vast parts of the values (eg hedonic utilitarianism), or in thorough analyses of a tiny part of the problem (eg all the papers published on the trolley problem).

Incomplete resolutions are not sufficient to guide an AI, and elegant complete resolutions seem to be like most utopias: not any good for real humans.

Much better to aim for an adequate complete resolution. Adequacy means two things here:

  • It doesn’t lead to disastrous outcomes, according to the human’s current values.

  • If a human has a strong value or meta-value, that will strongly influence the ultimate human reward function, unless their other values point strongly in the opposite direction.

Aiming for adequacy is quite freeing, allowing you to go ahead and construct a resolution, which can then be tweaked and improved upon. It also opens up a whole new space of possible solutions. And, last but not least, any attempt to formalise and write a solution gives a much better understanding of the problem.

Basic framework, then modifications

This post is a first attempt at constructing such an adequate complete resolution. Some of the details will remain to be filled in, others will doubtless be changed; nevertheless, this first attempt should be instructive.

The resolution will be built in three steps:

  • a) It will provide a basic framework for resolving low level values, or meta-values of the same “level”.

  • b) It will extend this framework to account for some types of meta-values applying to lower level values.

  • c) It will then allow some meta-values to modify the whole framework.

Finally, the post will conclude with some types of meta-values that are hard to integrate into this framework.

1 Terminology and basic concepts

Let H be a human, whose “true” values we are trying to elucidate. Let E be the set of possible environments (including their transition rules), with e ∈ E the actual environment. And let ℍ_t be the set of future histories that the human may encounter, from time t onwards (the human’s past history is seen as part of the environment).

Let R be a set of rewards. We’ll assume that R is closed under many operations—affine transformations (including negation), adding two rewards together, multiplying them together, and so on. For simplicity, assume that R is a real vector space, generated by a finite number of basis rewards.

Then define V to be a set of potential values of H. This is defined to be all the value/preference/reward statements that H might agree to, more or less strongly.

1.1 The role of the AI

The AI’s role is to elucidate how much the human actually accepts statements in V (see for instance here and here). For any given v ∈ V, it will compute w(v), the weight of the value v. For mental calibration purposes, assume that w(v) is in the −1 to +1 range, and that if the human has no current opinion on v, then w(v) is zero (the converse is not true: w(v) could be zero because the human has carefully analysed v but found it to be irrelevant or negative).

The AI will also compute v(x), the endorsement that v gives to x. This measures the extent to which v ‘approves’ or ‘disapproves’ of a certain reward or value x (there is a reward normalisation issue which I’ll elide for now).

Object level values are those which are non-zero only on rewards; ie the v for which v(v′) = 0 for all v′ ∈ V. To avoid the most obvious self-referential problem, any value’s self-endorsement is assumed to be zero (so v(v) = 0). As we will see below, positively endorsing a negative reward is not the same as negatively endorsing a positive reward: v(−r) > 0 does not mean the same thing as v(r) < 0.

Then this post will attempt to define the resolution function ρ, which maps weights, endorsements, and the environment to a single reward function. So if Θ is the cross product of all possible weight functions, endorsement functions, and environments:

ρ : Θ → R.

In the following, we’ll also have need for a more general Θ, and for special distributions over ℍ_t dependent on a given v ∈ V; but we’ll define them as and when they are needed.
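
To make these objects concrete, here is a minimal Python sketch of the terminology, under my own illustrative encoding (the class and variable names, the two-element basis, and all numbers are assumptions of mine, not the post’s notation): rewards are vectors over a finite basis, and each value carries an endorsement function over basis rewards and other values, with its weight w(v) stored separately.

```python
import numpy as np

# Illustrative sketch only; names and numbers are assumptions, not the post's notation.

BASIS = ["r1", "r2"]               # finite basis of rewards; R is their real span

def reward(**coords):
    """A reward, represented as a vector of coefficients over the basis rewards."""
    return np.array([coords.get(b, 0.0) for b in BASIS])

class Value:
    """A value/preference statement v in V.

    `endorsements` maps basis-reward names and other value names to reals:
    v's endorsement of that reward or value.  Self-endorsement is forced to zero.
    """
    def __init__(self, name, endorsements=None):
        self.name = name
        self.endorsements = dict(endorsements or {})
        self.endorsements[name] = 0.0          # v(v) = 0 by assumption

    def endorse(self, x_name):
        """v(x): how much v approves (+) or disapproves (-) of reward/value x."""
        return self.endorsements.get(x_name, 0.0)

# w(v): the weight the AI assigns to each value, assumed to lie in [-1, 1];
# zero means "no current opinion" (or "considered, but found irrelevant/negative").
w = {"v1": 0.7, "v2": 0.2}

v1 = Value("v1", {"r1": 1.0})
print(v1.endorse("r1"), v1.endorse("v1"))      # -> 1.0 0.0
print(reward(r1=1.0))                          # the basis reward r1, as a vector
```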

2 The basic framework

In this section, we’ll introduce a basic framework for resolving rewards. This will involve making a certain number of arbitrary choices, choices that may then get modified in the next section.

This section will deal with the problems with human values being contradictory, underdefined, changeable, and manipulable. As a side effect, this will also deal with the fact that humans can make moral errors (and end up feeling their previous values were ‘wrong’), and that they can derive insights from philosophical thought experiments.

As an example, we’ll use a classic modern dilemma: whether to indulge in bacon or to keep slim.

So let there be two rewards: R_B, the bacon reward, and R_S, the slimness reward. Assume that if H always indulges, R_B = 1 and R_S = 0, while if they never indulge, R_B = 0 and R_S = 1. There are various tradeoffs and gains from trade for intermediate levels of indulgence, the details of which are not relevant here.

Then define R to be the space of rewards generated by R_B and R_S.

2.1 Contradictory values

Define v_B = {I like eating bacon}, and v_S = {I want to keep slim}. Given the right normative assumptions, the AI can easily establish that w(v_B) and w(v_S) are both greater than zero. For example, it can note that the human sometimes does indulge, or desires to do so; but also the human feels sad about gaining weight, shame about their lack of discipline, and sometimes engages in anti-bacon precommitment activities.

The natural thing here is to weight R_B by the weight of v_B and the endorsement that v_B gives to R_B (and similarly with v_S and R_S). This means that

ρ(w, e) = w(v_B)·v_B(R_B)·R_B + w(v_S)·v_S(R_S)·R_S.

Or, for the more general formula, with implicit uncurrying so as to write ρ as a function of two variables:

ρ(w, e) = Σ_{v ∈ V} Σ_{r} w(v)·v(r)·r,

where the inner sum runs over the basis rewards r of R.

For this post, I’ll ignore the issue of whether that sum always converges (which it almost certainly would, in practice).
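
As a sanity check on the formula, here is a hedged Python sketch of this weighted sum for the bacon/slimness example; the specific weights and endorsements are invented for illustration, and rewards are just coefficient vectors over the two basis rewards.

```python
import numpy as np

BASIS = ["R_B", "R_S"]                    # bacon and slimness basis rewards

# Illustrative numbers (assumptions of mine, not the post's):
w = {"v_B": 0.6, "v_S": 0.8}              # w(v_B), w(v_S) > 0
endorse = {                               # v(r) for each value / basis reward
    "v_B": {"R_B": 1.0},
    "v_S": {"R_S": 1.0},
}

def resolve_naive(w, endorse):
    """rho as the plain weighted sum: sum over v and r of w(v) * v(r) * r."""
    coeffs = np.zeros(len(BASIS))
    for v, wv in w.items():
        for i, r in enumerate(BASIS):
            coeffs[i] += wv * endorse[v].get(r, 0.0)
    return coeffs                         # coefficients of the resolved reward

print(resolve_naive(w, endorse))          # -> [0.6 0.8], ie 0.6*R_B + 0.8*R_S
```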

2.2 Unendorsing rewards

I said there was a difference between a negative endorsement of R_S and a positive endorsement of −R_S. A positive endorsement of −R_S is a value judgement that sees −R_S as good, while a negative endorsement of R_S just doesn’t want R_S to appear at all.

For example, consider v_N = {I’m not worried about my weight}. Obviously this has a negative endorsement of R_S, but it doesn’t have a positive endorsement of −R_S: it explicitly doesn’t have a desire to be fat, either. So the weight and endorsement of v_N are fine when it comes to reducing the positive weight of R_S, but not when making a zero or negative weight more negative. To capture that, rewrite ρ as:

ρ(w, e) = Σ_{r} max(0, Σ_{v ∈ V} w(v)·v(r)) · r,

with the outer sum again running over the basis rewards r.

Then the AI, to maximise H’s rewards, simply needs to follow the policy that maximises that reward.
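
A minimal sketch of this clamped version, adding the unendorsing value v_N; the numbers are invented, and the only change from the naive sum is the max(0, ·) applied per basis reward.

```python
import numpy as np

BASIS = ["R_B", "R_S"]

# Illustrative numbers only.  v_N unendorses R_S but does not endorse -R_S.
w = {"v_B": 0.6, "v_S": 0.8, "v_N": 0.5}
endorse = {
    "v_B": {"R_B": 1.0},
    "v_S": {"R_S": 1.0},
    "v_N": {"R_S": -2.0},     # negative endorsement of R_S; nothing on -R_S
}

def resolve_clamped(w, endorse):
    """rho = sum_r max(0, sum_v w(v) * v(r)) * r : unendorsement can cancel a
    reward's positive weight, but cannot push it below zero."""
    coeffs = []
    for r in BASIS:
        total = sum(wv * endorse[v].get(r, 0.0) for v, wv in w.items())
        coeffs.append(max(0.0, total))
    return np.array(coeffs)

print(resolve_clamped(w, endorse))   # -> [0.6 0. ]: R_S is cancelled, not negated
```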

2.3 Underdefined rewards

Let’s now look at the problem of underdefined values. To illustrate, add the option of liposuction to the main model. If H indulges in bacon and undergoes liposuction, then both R_B and R_S can be set to 1.

But H might not want to undergo liposuction (assumed, in this model, to be costless). Let R_L be the reward for no liposuction: 1 if liposuction is avoided, and 0 if it happens. And let v_L = {I want to avoid liposuction}. Extend R to the space generated by R_B, R_S, and R_L.

Because H hasn’t thought much about liposuction, they currently have w(v_L) = 0. But it’s possible they may have firm views on it, after some reflection. If so, it would be good to use those views now. When humans haven’t thought about values, there are many ways they can develop them, depending on how the issue is presented to them and how it interacts with their categories, social circles, moral instincts, and world models.

For example, assume that the AI can figure out that, if H is given a description of liposuction that starts with “lazy people can cheat by...”, then they will be against it: w(v_L) will be greater than zero. However, if they are given a description that starts with “efficient people can optimise by...”, then they will be in favour of it, and w(v_L) will be zero.

If w_t(v) is the weight of v at future time t, given the future history h, define the discounted future weight as

w_h(v) = (1 − γ) Σ_{t} γ^t · w_t(v),

for, say, a γ close to 1 if t is denominated in days. If h is the history with the “lazy” description, w_h(v_L) will be greater than zero. If it’s the history with the “efficient” description, it will be close to zero.
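
Here is a small, self-contained sketch of that discounted future weight; the discount factor and the example trajectory of weights are invented, and the function just implements the geometric averaging above for a single history.

```python
def discounted_future_weight(future_weights, gamma=0.95):
    """w_h(v) = (1 - gamma) * sum_t gamma^t * w_t(v) along one future history h.

    `future_weights` is the sequence w_0(v), w_1(v), ... along that history,
    truncated at some horizon; gamma = 0.95 is an illustrative choice.
    """
    return (1 - gamma) * sum(gamma ** t * wt for t, wt in enumerate(future_weights))

# e.g. H turns mildly against liposuction from day 3 of the "lazy" framing onward:
print(discounted_future_weight([0.0, 0.0, 0.0, 0.4, 0.4, 0.4]))   # small positive number
```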

We’d like to use the expected value of w_h(v), but there are two problems with that. The first is that many possible futures might involve no reflection about v_L on the part of H; we don’t care about those futures. The other is that these futures depend on the actions of the AI, so that it can manipulate the human’s future values.

So define ℍ_v, a subset of the set of histories ℍ_t. This subset is defined, firstly, so that H will have relevant opinions about v: they won’t be indifferent to it. Secondly, these are futures in which the human is allowed to develop their values ‘naturally’, without undue rigging and influence on the part of the AI (see this for an example of such a distribution). Note that these need not be histories which will actually happen, just future histories which the AI can estimate. Let P_v be the probability distribution over future histories, restricted to ℍ_v (this requires that the AI pick a sensible probability distribution over its own future policy, at least for the purpose of computing this probability distribution).

Note that the exact definitions of ℍ_v and P_v are vitally important and still need to be fully established. That is a critical problem I’ll be returning to in the future.

Laying that aside for the moment, we can define the expected relevant weight:

ŵ(v) = 𝔼_{h ∼ P_v}[ w_h(v) ].

Then the formula for ρ becomes:

ρ(w, e) = Σ_{r} max(0, Σ_{v ∈ V} ŵ(v)·v(r)) · r,

using ŵ instead of w.
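
Continuing the sketch: given a set of ‘relevant, non-manipulated’ histories and a distribution over them (both of which, as just noted, still need to be properly defined), the expected relevant weight is an average of per-history discounted weights, and it simply replaces w in the clamped formula. The sampled histories and probabilities below are entirely illustrative.

```python
def discounted_future_weight(future_weights, gamma=0.95):
    """w_h(v) = (1 - gamma) * sum_t gamma^t * w_t(v) along one history h."""
    return (1 - gamma) * sum(gamma ** t * wt for t, wt in enumerate(future_weights))

def expected_relevant_weight(weight_trajectories, probs, gamma=0.95):
    """hat-w(v): the P_v-expectation of w_h(v) over the relevant histories in H_v.

    `weight_trajectories` holds one weight-trajectory for v per sampled history
    in H_v, and `probs` their probabilities under P_v.  How H_v and P_v are
    chosen is exactly the open problem flagged above.
    """
    return sum(p * discounted_future_weight(traj, gamma)
               for traj, p in zip(weight_trajectories, probs))

# Two illustrative futures for v_L: one where H turns against liposuction,
# one where H reflects and stays indifferent.
trajectories = [[0.0, 0.3, 0.3, 0.3], [0.0, 0.0, 0.0, 0.0]]
print(expected_relevant_weight(trajectories, [0.5, 0.5]))
```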

2.4 Moral errors and moral learning

The above was designed to address underdefined values, but it actually does much more than that. It deals with changeable values, and addresses moral errors and moral learning.

An example of moral error is thinking that you want something but, upon achieving it, finding that you don’t. Let us examine v_S, the desire to be slim. People don’t generally have a strong intrinsic desire for slimness just for the sake of it; instead, they might strive for it because they think it will make them healthier or happier, might increase their future status, might increase their self-discipline in general, or something similar.

So we could replace v_S with v_X = {I desire X}, where X is something that H believes will come out of slimming.

When computing ŵ(v_S) and ŵ(v_X), the AI will test how H reacts to achieving slimness, or achieving X, and will ultimately compute a low ŵ(v_S) but a high ŵ(v_X). This would be even more the case if ℍ_v is allowed to contain impossible future histories, such as hypotheticals where the human miraculously slims without achieving X, or vice-versa.

The use of ŵ also picks up systematic, predictable moral change. For example, the human may be currently committed to a narrative that sees themselves as a disciplined, stereotypically rational being who will overcome their short-term weaknesses. Their weight w(v_S) is high. However, the AI knows that trying to slim will be unpleasant for H, that they will soon give up as the pain mounts, and that they will change their narrative to one where they accept and balance their own foibles. So the expected ŵ(v_S) is low, under most reasonable futures where humans cannot control their own value changes (this has obvious analogies with major life changes, such as loss of faith or changes in political outlook).

Then there is the third case, where strongly held values may end up being incoherent (as I argued is the case for the ‘purity’ moral foundation). Suppose the human deeply believes that v_P = {Humans have souls and pigs don’t, so it’s ok to eat pigs, but not ok to defile the human form with liposuction}. This value would thus endorse R_B and R_L. But it’s also based on false facts.

There seem to be three standard ways to resolve this. Replacing “soul” with, say, “mind capable of complex thought and the ability to suffer”, they may shift to v_1 = {I should not eat pigs}. Or, if they go for “humans have no souls, so ‘defilement’ makes no sense”, they may embrace v_2 = {All human enhancements are fine}. Or, as happens often in the real world when people can’t justify their values, they may shift their justification but preserve the basic value: v_3 = {It is natural and traditional, and therefore good, to eat pig and avoid liposuction}.

Now, I feel v_3 is probably incoherent as well, but there is no lack of coherent-but-arbitrary reasons to eat pigs and avoid liposuction, so some value similar to v_3 is plausible.

Then a suitably defined ℍ_v would allow the AI to figure out which way the human wants to update their values among v_P, v_1, v_2, and v_3, as the human moves away from the incorrect initial value to one of the other alternatives.

2.5 Automated philosophy and CEV

The use of ℍ_v also allows one to introduce philosophy into the mix. One simply needs to include in ℍ_v the presentation of philosophical thought experiments to H, along with H’s reactions and updating. Similarly, one can do the initial steps of coherent extrapolated volition, by including futures where H changes themselves in the desired direction. This can be seen as automating some of philosophy (the approach has nothing to say about epistemology and ontology, for instance).

Indeed, you could define philosophers as people with particularly strong philosophical meta-values: that is, people who put a high premium on philosophical consistency, simplicity, and logic.

The more weight is given to philosophy or to frameworks like CEV, the more elegant and coherent the resulting resolution, but the higher the risk of it going disastrously wrong by losing key parts of human values—we risk running into the problems detailed here and here.

2.6 Meta-values

We’ll conclude this section by looking at how one can apply the above framework to meta-values. These are values that have non-zero endorsements of other values, ie values v with v(v′) ≠ 0 for some v′ ∈ V.

The previous v_2 = {All human enhancements are fine} could be seen as a meta-value, one that unendorses the anti-liposuction value v_L, so v_2(v_L) < 0. Or we might have one that unendorses short-term values: v_T = {Short-term values are less important}, with v_T(v_B) < 0.

The problem comes when values start referring to values that refer back to them. This allows indirect self-reference, with all the trouble that that brings.

Now, there are various tools for dealing with self-reference or circular reasoning—Paul Christiano’s probabilistic self-reference and Scott Aaronson’s Eigenmorality are obvious candidates.

But in the spirit of adequacy, I’ll directly define a method that can take all these possibly self-referential values and resolve them. Those who are not interested in the maths can skip to the next section; there is no real insight here.

Let N = |V|, and let σ be an ordering (or a permutation) of V, ie a bijective map from {1, ..., N} to V. Then recursively define w_σ by w_σ(σ(1)) = max(0, ŵ(σ(1))), and

w_σ(σ(m)) = max(0, ŵ(σ(m)) + Σ_{i < m} w_σ(σ(i)) · (σ(i))(σ(m))).

Thus each w_σ(σ(m)) is the sum of the actual (unadjusted) weight ŵ(σ(m)), plus the w_σ-adjusted endorsements of the values preceding it (in the ordering), with the zero lower bound. By averaging across the set S_N of all permutations of V, we can then define:

w′(v) = (1/N!) Σ_{σ ∈ S_N} w_σ(v).

Then, finally, for resolving the reward, we can use these weights in the standard reward function:

ρ(w, e) = Σ_{r} max(0, Σ_{v ∈ V} w′(v)·v(r)) · r.
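
For readers who prefer code to notation, here is a brute-force Python sketch of this permutation-averaged construction; the two example values, their (expected) weights, and the endorsement numbers are invented, and for a large V one would sample permutations rather than enumerate all N! of them.

```python
from itertools import permutations

# Illustrative data: the (expected) weight of each value, and the endorsement
# each value gives to the others (missing entries are zero).
w = {"v_L": 0.3, "v_2": 0.5}          # anti-liposuction value and its unendorser
endorse = {"v_2": {"v_L": -1.0}}      # v_2 unendorses v_L

def adjusted_weights(w, endorse):
    """Average, over all orderings sigma of V, of the recursively defined
    w_sigma: each value gets its own weight plus the (already adjusted)
    endorsements of the values preceding it, floored at zero."""
    values = list(w)
    perms = list(permutations(values))
    totals = {v: 0.0 for v in values}
    for sigma in perms:
        w_sigma = {}
        for m, v in enumerate(sigma):
            boost = sum(w_sigma[u] * endorse.get(u, {}).get(v, 0.0)
                        for u in sigma[:m])
            w_sigma[v] = max(0.0, w[v] + boost)
        for v in values:
            totals[v] += w_sigma[v]
    return {v: totals[v] / len(perms) for v in values}

print(adjusted_weights(w, endorse))
# -> {'v_L': 0.15, 'v_2': 0.5}: v_L is dragged down in the orderings where v_2 comes first.
```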

3 The “wrong” ρ: meta-values for the resolution process

The ρ of the previous section is sufficient to resolve the values of an H which has no strong feelings on how those values should be resolved.

But many may find it inadequate: filled with arbitrary choices, doing too much by hand/fiat, or doing too little. So the next step is to let H’s values affect how the ρ itself works.

Define ρ_0 as the framework constructed in the previous section, and let F be the set of all such possible resolution frameworks. We now extend V so that values can endorse or unendorse not only elements of R and V, but also of F.

Then we can define the weight of each framework ρ ∈ F as

W(ρ) = max(0, Σ_{v ∈ V} w′(v)·v(ρ)),

and define the overall resolution ρ* itself as

ρ*(θ) = Σ_{ρ ∈ F} W(ρ)·ρ(θ).

These formulas make sense, since the various elements of F take values in R, which can be summed. Also, because we can multiply a reward by a positive scalar, there is no need for renormalising or weighting in these summing formulas.
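
A small sketch of this two-level sum, under the reconstruction above (which, to be clear, is my reading of the construction rather than a verbatim formula): each candidate framework gets a clamped weight from the values that endorse it, and the overall reward is the weight-scaled sum of what the frameworks individually output. The frameworks, values, and numbers are all invented.

```python
import numpy as np

BASIS = ["R_B", "R_S"]

# Two toy resolution frameworks, each mapping (weights, endorsements) to reward
# coefficients over BASIS.  rho_0 stands in for the section-2 construction.
def rho_0(w, endorse):
    return np.array([max(0.0, sum(w[v] * endorse[v].get(r, 0.0) for v in w))
                     for r in BASIS])

def rho_flat(w, endorse):                 # an invented alternative: ignore the weights
    return np.array([1.0, 1.0])

frameworks = {"rho_0": rho_0, "rho_flat": rho_flat}

# Values now also carry endorsements of frameworks (endorse_F), e.g. a meta-value
# that mildly trusts rho_0 and barely trusts the flat alternative.
w = {"v_B": 0.6, "v_S": 0.8, "v_meta": 0.4}
endorse = {"v_B": {"R_B": 1.0}, "v_S": {"R_S": 1.0}, "v_meta": {}}
endorse_F = {"v_meta": {"rho_0": 0.5, "rho_flat": 0.2}}

def resolve_over_frameworks(w, endorse, endorse_F, frameworks):
    """rho*(theta) = sum over frameworks rho of max(0, sum_v w(v)*v(rho)) * rho(theta)."""
    total = np.zeros(len(BASIS))
    for name, rho in frameworks.items():
        W = max(0.0, sum(wv * endorse_F.get(v, {}).get(name, 0.0) for v, wv in w.items()))
        total += W * rho(w, endorse)
    return total

print(resolve_over_frameworks(w, endorse, endorse_F, frameworks))   # -> [0.2  0.24]
```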

Now, this is not a complete transformation of ρ_0 according to H’s values—for example, there is no place for these values to change the computation of ŵ, which is computed according to the previously defined ℍ_v and P_v for each v ∈ V.

But I won’t worry about that for the moment, though I’ll undoubtedly return to it later. First of all, I very much doubt that many humans have strong intuitions about the correct method for resolving contradictions among the different ways of designing a resolution system for mapping most values and meta-values to a reward. And if someone does have such a meta-value, I’d wager it’ll mostly be there to benefit a specific object level value or reward, so it’s more instructive to look at the object level.

But the real reason I won’t dig too much into those issues for the moment is that the next section demonstrates that there are problems with fully self-referential ways of resolving values. I’d like to understand and solve those before getting too meta about the resolution process.

4 Problems with self-referential values

Here I’ll look at some of the problems that can occur with fully self-referential Θ and/or v. The presentation will be more informal, since I haven’t yet defined the language or the formalism to allow such formulations.

4.1 All-or-nothing values, and personal identity

Some values put a high premium on simplicity, or on defining the whole of the relevant part of R. For example, the paper “An impossibility theorem in population axiology...” argues that total utilitarianism is the only theory that avoids a series of counter-intuitive problems.

Now, I’ve disagreed that these problems are actually problems. But some people’s intuitions strongly disagree with me, and they feel that total utilitarianism is justified by these arguments. Indeed, I get the impression that, for some people, even a small derogation from total utilitarianism is bad: they strongly prefer 100% total utilitarianism to 99.99% total utilitarianism + 0.01% something else.

This could be encoded as a value v_U = {I value having a simple population ethics}. This would provide a bonus based on the overall simplicity of the image of Θ (ie of the resolved rewards ρ(Θ)). To do this, we have introduced personal identity (an issue which I’ve argued is unresolved in terms of reward functions), as well as values about the image of Θ.

Population ethics feels like an abstract, high-level concept, but here is a much more down-to-earth version. When the AI looks forwards, it extrapolates the weight of certain values based on the expected weight in the future. What if the AI extrapolates that w(v_S) will be either 1 or −1 in the future, with equal probability? It then reasonably sets ŵ(v_S) to 0.

But the human will live in one of those futures. The AI will be maximising their ‘true goals’, which include ŵ(v_S) = 0, while H is forced into extreme values of w(v_S) (1 or −1) which do not correspond to the value the AI is currently maximising. So v_A = {I want to agree with the reward that ρ computes} is a reasonable meta-value, one that would reward closeness between expected future values and actual future values.
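
One hedged way of encoding such an ‘agreement’ meta-value (this is my own illustrative formalisation, not the post’s) is as a penalty on the distance between the weights the AI expected and the weights H actually ends up with; it makes the tension visible, since in the ±1 example above every realised future scores badly.

```python
def agreement_reward(expected_w, actual_w):
    """Illustrative encoding of v_A = {I want to agree with the reward rho computes}:
    the closer H's realised weights are to the expected weights the AI optimised
    for, the higher the reward (here, a simple negative L1 distance)."""
    return -sum(abs(expected_w[v] - actual_w[v]) for v in expected_w)

expected = {"v_S": 0.0}       # the AI's extrapolation: +1 or -1 with equal probability
print(agreement_reward(expected, {"v_S": 1.0}))    # -> -1.0
print(agreement_reward(expected, {"v_S": -1.0}))   # -> -1.0, bad in either realised future
```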

In that case, one thing the AI would be motivated to do is to manipulate H so that they have the ‘right’ weights in the future. But this might not always be possible. And H might see that as a dubious thing to do.

Note here that this is not a problem of desiring personal moral growth in the future. Assuming that that can be defined, the AI can then grant it. The problem would be wanting personal moral growth and wanting the AI to follow the values that emerge from this growth.

4.2 You’re not the boss of me!

For self-reference, we don’t need Gödel or Russell. There is a much simpler, more natural self-reference paradox lurking here, one that is very common in humans: the urge to not be told what to do.

If the AI computes ρ, there are many humans who would, on principle, declare and decide that their reward was something other than ρ. This could be a value v_R = {ρ incorrectly computes my values}. I’m not sure how to resolve this problem, or even whether it’s much of a problem (if the human will disagree equally no matter what, then we may as well ignore that disagreement; and if they disagree to different degrees in different circumstances, this gives something to minimise and trade off against other values). But I’d like to understand and formalise this better.

5 Conclusion: much more work

I hope this post demonstrates what I am hoping to achieve, and how we might start going about it. Combining this resolution project with the means of extracting human values would then allow the Inverse Reinforcement Learning project to succeed in full generality: we could then have the AI deduce human values from observation, and then follow them. This seems like a potential recipe for a Friendly-ish AI.