A theory of human values

At the end of my post on needing a theory of human values, I stated that the three components of such a theory were:

  1. A way of defining the basic preferences (and basic meta-preferences) of a given human, even if these are under-defined or situational.

  2. A method for synthesising such basic preferences into a single utility function or similar object.

  3. A guarantee that we won’t end up in a terrible place due to noise or different choices in the two definitions above.

To summarise: in this post, I sketch out methods for 1. and 2., look at what 3. might look like, and consider what we can expect from such a guarantee and some of the issues with it.

Basic human preferences

For the first point, I’m defining a basic preference as existing within the mental models of a human.

Any preference judgement within that model (that some outcome was better than another, that some action was a mistake, that some behaviour was foolish, that someone is to be feared) is defined to be a basic preference.

Basic meta-preferences work in the same way, with meta-preferences just defined to be preferences over preferences (or over methods of synthesising preferences). Odd meta-preferences, such as preferences over beliefs, are also included here; I’ll try to transform these odd preferences into “identity preferences”: preferences over the kind of person you want to be.

“Reasonable” situations

To make this precise, we need to define the class of “reasonable” situations in which humans have these mental models. These could be real situations (Mrs X thought that she’d like some sushi as she went past the restaurant) or counterfactual (if Mr Y had gone past that restaurant, he would have wanted sushi). The “one-step hypotheticals” post is about defining these reasonable situations.

Anything that occurs outside of a reasonable situation is discarded as not indicative of genuine basic human preference; this is because humans can be persuaded to endorse/unendorse almost anything in the right situation (eg by drugs or brain surgery, if all else fails).

We can have preferences and meta-preferences over non-reasonable situations (what to do in a world where plants were conscious?), as long as these preferences and meta-preferences were expressed in reasonable situations. We can have a CEV-style meta-preference (“I wish my preferences were more like what a CEV would generate”), but, apart from that, the preferences a CEV would generate are not directly relevant: the situations where “we knew more, thought faster, were more the people we wished we were, had grown up farther together” are highly non-typical.

We would not want the AI itself manipulating the definition of “reasonable” situations. This is why I’ve looked into ways of quantifying and removing AI rigging and influencing of the learning process.

Synthesising human preferences

The simple preferences and meta-preferences constructed above will often be wildly contradictory (eg we want to be generous and rich), inconsistent across time, and generally underdefined. They can also be weakly or strongly held.

The important thing now is to synthesise all of these into some adequate overall reward or utility function. Not because utility functions are intrinsically good, but because they are stable: if you’re not an expected utility maximiser, events will likely push you into becoming one. And it’s much better to start off with an adequate utility function than to hope that random-drift-until-our-goals-are-stable will get us to an adequate outcome.

Synthesising the preference utility function

The idea is to start with three things:

  1. A way of resolving contradictions between preferences (and between meta-preferences, and so on).

  2. A way of applying meta-preferences to preferences (endorsing and anti-endorsing other preferences).

  3. A way of allowing (relevant) meta-preferences to change the methods used in the two points above.

An earlier post showed one method of doing that, with contradictions resolved by weighting the reward/utility function for each preference and then adding them together linearly. The weights were proportional to some measure of the intensity of each preference.

In a more recent post, I realised that linear addition may not be the natural thing to do for some types of preferences (which I dubbed “identity” preferences). The smooth minimum (smoothmin) gives another way of combining utilities, though it needs a natural zero as well as a weight, so the human’s model of the status quo is relevant here. For preferences combined in a smoothmin, we can just reset the natural zero (raising it to make the preference less important, lowering it to make it more important) rather than changing the weight.

I’m distinguishing between identity and world preferences, but the real distinction is between preferences that humans prefer to combine linearly and those they prefer to combine in a smoothmin. So, along with each basic preference’s weight (and natural zero), one thing we could ask is whether it should go in the linear or the smoothmin group.

Also, though I’m very willing to let a linear preference get sent to zero if the human’s meta-preferences unendorse it, I’m less sure about those in the other group; it’s possible that unendorsing a smoothmin preference should raise its “natural zero” rather than sending the preference to zero. After all, we’ve identified these preferences as key parts of our identity, even though we unendorse them.
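
To make these combination rules concrete, here is a minimal sketch in Python. It is my own illustration rather than the method from the posts mentioned above: the particular softmin formula, the `weight`, `natural_zero`, `group`, and `endorsed` fields, and the amount by which an unendorsed smoothmin preference’s natural zero gets raised are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass, replace
from typing import Callable, List

@dataclass
class BasicPreference:
    """A toy 'basic preference': a utility function over outcomes, a weight,
    a natural zero (used only by the smoothmin group), which group it goes
    in, and whether meta-preferences endorse it. All fields are illustrative."""
    utility: Callable[[object], float]
    weight: float
    natural_zero: float = 0.0
    group: str = "linear"        # "linear" or "smoothmin"
    endorsed: bool = True

def apply_meta_preferences(prefs: List[BasicPreference]) -> List[BasicPreference]:
    """One possible way of letting meta-preferences act on preferences:
    an unendorsed linear preference has its weight sent to zero, while an
    unendorsed smoothmin preference keeps its weight but has its natural
    zero raised (following the suggestion above), so it remains part of
    the person's identity."""
    adjusted = []
    for p in prefs:
        if not p.endorsed:
            if p.group == "linear":
                p = replace(p, weight=0.0)
            else:
                p = replace(p, natural_zero=p.natural_zero + 1.0)  # arbitrary amount
        adjusted.append(p)
    return adjusted

def synthesised_utility(prefs: List[BasicPreference], outcome, alpha: float = 5.0) -> float:
    """Combine the linear group by a weighted sum and the smoothmin group by
    a soft minimum (a smooth approximation to min) over the weighted,
    zero-shifted utilities, then add the two parts."""
    prefs = apply_meta_preferences(prefs)
    linear = [p for p in prefs if p.group == "linear"]
    smooth = [p for p in prefs if p.group == "smoothmin"]

    linear_part = sum(p.weight * p.utility(outcome) for p in linear)

    smooth_part = 0.0
    if smooth:
        terms = [p.weight * (p.utility(outcome) - p.natural_zero) for p in smooth]
        # softmin_alpha(x_1, ..., x_n) = -(1/alpha) * log(sum_i exp(-alpha * x_i)),
        # which tends to min(x_i) as alpha grows.
        smooth_part = -(1.0 / alpha) * math.log(sum(math.exp(-alpha * t) for t in terms))

    return linear_part + smooth_part
```

The exact softmin formula, the value of `alpha`, and how far to raise a natural zero are exactly the sort of deliberate but somewhat arbitrary choices discussed in the next section.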

Meta-changes to the synthesis method

Then, finally, on point 3 above, the relevant human meta-preferences can change the synthesis process. Heavily weighted meta-preferences of this type will result in completely different processes from the one described above; lightly weighted meta-preferences will make only small changes. The original post looked into that in more detail.

Notice that I am making some deliberate and somewhat arbitrary choices: using linear or smoothmin combinations for meta-preferences (including those that might want to change the methods of combination). How much weight a meta-preference must have before it seriously changes the synthesis method is also somewhat arbitrary.

I’m also starting with two types of preference combination, linear and smoothmin, rather than many more or just one. The idea is that these two ways of combining preferences seem the most salient to me, and our own meta-values can change them if we feel strongly about it. It’s as if I’m starting the design of a Formula One car before an AI trains itself to complete the design. I know it’ll change a lot of things, but if I start with “four wheels, a cockpit, and a motor”, I’m hoping to get it started on the right path, even if it eventually overrules me.

Or, if you prefer: I think starting with this design is more likely to nudge a bad outcome into a good one than to do the opposite.

Non-terrible outcomes

Now for the trickiest part of this: given the above, can we expect non-terrible outcomes?

This is a difficult question to answer, because “terrible outcomes” remains undefined (if we had a full definition, it could serve as a utility function itself), and, in a sense, there is no principled trade-off between two preferences: the only general optimality measure is Pareto, and a Pareto-optimal outcome can be reached by maximising any positively weighted linear combination of the utilities.

Scope insensitivity to the rescue?

There are two obvious senses in which an outcome could be terrible:

  1. We could lose something of great value, never to have it again.

  2. We could fall drastically short of maximising a utility function to the utmost.

From the perspective of a utility maximiser, both of these outcomes could be equally terrible: it’s just a question of computing the expected utility difference between the two scenarios.

However, for actual humans, the first scenario seems to loom much larger. This can be seen as a form of scope insensitivity: we might say that we believe in total utilitarianism, but we don’t feel that a trillion people is really a trillion times better than a single person, so the larger the numbers grow, the more we are, in practice, willing to trade off total utilitarianism against other values.

Now, we might deplore that state of affairs (and that deploring is a valid meta-preference), but that does seem to be how humans work. And though there are arguments against scope insensitivity for actually existing beings, it is perfectly consistent to reject them when considering whether we have a duty to create new beings.
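
As a toy illustration of what scope insensitivity does to the numbers (my own example, not from the post): suppose the felt value of a population of $n$ people grows logarithmically rather than linearly, say $V(n) = \log_{10} n$. Then $V(10^{12}) = 12$ while $V(10^{6}) = 6$: a trillion people feels only about twice as good as a million people, rather than a million times as good, and the larger the numbers grow, the less the total-utilitarian term can dominate other values.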

What this means is that people’s preferences seem much closer to smooth minimums than to linear sums. Some are explicitly set up like that from the beginning (those that go in the smoothmin bucket). Others may be like that in practice, either because meta-preferences want them to be, or because of the vast size of the future: see the next section.

The size of the future

The future is vast, with the energy of billions of galaxies, efficiently used, at our disposal. Potentially far, far larger than that, if we’re clever about our computations.

That means that it’s far easier to reach “agreement” between two utility functions with diminishing marginal returns (as most of them will have, in practice and in theory). Even without diminishing marginal returns, and without using smoothmin, it’s unlikely that one utility function will retain the highest marginal returns all the way up to the point where all resources are used up. At some point, benefiting some tiny preference slightly will likely be the easier option.
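
Here is a small sketch of why diminishing returns push towards compromise; it is my own illustration, with made-up utility curves and weights. If resources are handed out greedily to whichever preference currently has the highest marginal return, concave utilities mean that even a very low-weight preference eventually starts receiving resources.

```python
import math

# Hypothetical concave utilities of resources r, with very different weights.
preferences = {
    "galaxy-scale project": lambda r: 100.0 * math.log(1.0 + r),
    "tiny niche hobby":     lambda r: 1.0 * math.log(1.0 + r),
}

allocation = {name: 0.0 for name in preferences}
step = 1.0

for _ in range(1000):  # hand out 1000 units of resource, one at a time
    # Marginal return of giving the next unit to each preference.
    gains = {
        name: u(allocation[name] + step) - u(allocation[name])
        for name, u in preferences.items()
    }
    best = max(gains, key=gains.get)
    allocation[best] += step

print(allocation)
# With these curves the big project takes roughly the first 140 units, but once
# its marginal returns have fallen far enough, the niche hobby starts receiving
# resources too: no preference is starved forever.
```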

The exception to this is if preferences are explicitly opposed to each other, eg masochism versus pain-reduction. But even there, they are unlikely to be complete and exact negations of one another. The masochist may find some activities that don’t fit perfectly under “increased pain” as traditionally understood, so some compromise between the two preferences becomes possible.

The underdefined nature of some preferences may be a boon here: if something is forbidden, but only in certain situations, then going outside those situations may allow the preferences that favour it some space to grow. So, for example, keeping promises might become a general value, but we might allow games, masked balls, or similar situations where lying is allowed, because the advantages of honesty (reputation, ease of coordination) are deliberately absent.

Growth, learning, and following your own preferences

I’ve argued that our values and preferences will soon become stable once we start to self-modify.

This is going to be hard for those who put an explicit premium on continual moral growth. Now, it’s still possible to have continued moral change within a narrow band, but not the kind of open-ended growth they might hope for.

Finally, there’s the issue of what happens when the AI tells you “here is the synthesis of your preferences”, and you go “well, I have all these problems with it”. Since humans are often contrarian by nature, it may be impossible for an AI to construct a synthesis that we would ever explicitly endorse. This is a sort of “self-reference” problem in synthesising preferences.

Tolerance levels

The whole design (with an initial framework, liberal use of smoothmin, a default for standard combinations of preferences, and a vast amount of resources available) is meant to reach an adequate, rather than an optimal, solution. Optimal solutions are very subject to Goodhart’s law if we don’t include everything we care about; if we do include everything we care about, the process may come to resemble the one I’ve described above.

Conversely, if the human fears that such a synthesis will become badly behaved in certain extreme situations, then that fear will be included in the synthesis, and, if the fear is strong enough, it will serve to direct the outcomes away from those extreme situations.

So the whole design is somewhat tolerant to changes in the initial conditions: different starting points may end up at different end points, but all of them will hopefully be acceptable.

Did I think of everything?

With all such methods, there’s the risk of not including everything, and so ending up at a terrible point by omission. That risk is certainly there, but it seems that we couldn’t end up in a terrible hellworld, or at least not in one that could be meaningfully described/summarised to the human (because avoiding hellworlds ranks high among human preferences and meta-preferences, and there is little explicit force pushing the other way).

And I’ve argued that it’s unlikely that indescribable hellworlds are even possible.

However, there are still a lot of holes to fill, and I have to ensure that this doesn’t just end up as a series of patches until I can’t think of any further patches. That’s my greatest fear, and I’m not yet sure how to address it.