Why we need a *theory* of human values

There have been multiple practical suggestions for methods of extracting the values of a given human. Here are four common classes of such methods:

  • Methods that put high weight on human (bounded) quasi-rationality, or revealed preferences. For example, we can assume that Kasparov was actually trying to win against Deep Blue, not trying desperately to lose while inadvertently playing excellent chess.

  • Methods that pay attention to our explicitly stated values.

  • Methods that use regret, surprise, joy, or similar emotions to estimate what humans actually want. This could be seen as a form of human TD learning (see the sketch after this list).

  • Methods based on an explicit procedure for constructing the values, such as CEV and Paul’s indirect normativity.
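To make the third class a bit more concrete, here is a minimal sketch of one way the “regret as TD learning” idea could be cashed out: treat observed regret as a negative reward signal in a temporal-difference update. This is only an illustration of the idea, not a proposed method; the states, transitions, and regret values are entirely invented.

```python
# Illustrative sketch: treating observed regret as a negative reward
# in a temporal-difference (TD) value update. All states, transitions
# and regret values below are invented for illustration.

ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor

# Estimated value the human places on each situation (state).
values = {"stay_home": 0.0, "go_to_party": 0.0}

# Hypothetical observations: (state, next_state, regret felt about that state).
observations = [
    ("stay_home", "go_to_party", 0.0),   # no regret: weak evidence either way
    ("go_to_party", "stay_home", 0.8),   # strong regret: the party was a mistake
]

for state, next_state, regret in observations:
    reward = -regret  # interpret regret as a negative reward signal
    td_error = reward + GAMMA * values[next_state] - values[state]
    values[state] += ALPHA * td_error

print(values)  # the regretted state's estimated value drifts downward
```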

Divergent methods

The first question is why we would expect these methods to point even vaguely in the same direction. They all take very different approaches—why do we think they’re measuring the same thing?

The answer is that they roughly match up in situations we encounter every day. In such typical situations, people who feel regret are likely to act to avoid that situation again, to express displeasure about the situation, and so on.

By analogy, consider a town where there are only two weather events: bright sunny days and snow storms. In that town there is a strong correlation between barometric pressure, wind speed, cloud cover, and temperature. All four indicators track different things, but, in this town, they are basically interchangeable.

But if the weather grows more diverse, this correlation can break down. Rain storms, cloudy days, meteor impacts: all these can disrupt the alignment of the different indicators.
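Here is a toy simulation of that breakdown, with entirely invented numbers: in the “typical” regime the indicators are driven by one underlying variable and move together; once the weather becomes more varied, they come apart.

```python
# Toy illustration: indicators that correlate in "typical" weather
# stop correlating when new kinds of weather appear.
import random

random.seed(0)

def typical_day():
    # Only two weather events: sunny or snowstorm. All indicators
    # are driven by the same underlying variable, so they co-vary.
    storm = random.random() < 0.5
    return {"wind": 1.0 if storm else 0.0, "cloud": 1.0 if storm else 0.0}

def unusual_day():
    # More diverse weather: the indicators vary independently.
    return {"wind": random.random(), "cloud": random.random()}

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

for regime, sample in (("typical", typical_day), ("unusual", unusual_day)):
    days = [sample() for _ in range(500)]
    wind = [d["wind"] for d in days]
    cloud = [d["cloud"] for d in days]
    print(regime, round(correlation(wind, cloud), 2))
# typical: wind and cloud cover correlate almost perfectly;
# unusual: the correlation collapses toward zero.
```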

Similarly, we expect that an AI could remove us from typical situations and put us into extreme situations—at least “extreme” from the perspective of the everyday world where we forged the intuitions that those methods of extracting values roughly match up. Not only do we expect this, but we desire this: a world without absolute poverty, for example, is the kind of world we would want the AI to move us into, if it could.

In those extreme and unprecedented situations, we could end up with revealed preferences pointing one way, stated preferences another, while regret and CEV point in different directions entirely. In that case, we might be tempted to ask “should we follow regret or stated preferences?” But that would be the wrong question to ask: our methods no longer correlate with each other, let alone with some fundamental measure of human values.

We are thus in an undefined state; in order to continue, we need a meta-method that decides between the different methods. But what criteria could such a meta-method use for deciding (note that simply getting human feedback is not generically an option)? Well, it would have to select the method that best matches up with human values in this extreme situation. To do that, it needs a definition—a theory—of what human values actually are.
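The shape of the problem can be shown in a deliberately simplified sketch (the methods and their outputs below are made up): once the individual methods disagree, any rule for picking between them has to smuggle in some criterion for what human values are.

```python
# Deliberately simplified sketch of the divergence problem.
# Each "method" returns its best guess at how much the human values
# some extreme, unprecedented option. The numbers are invented.

def revealed_preferences(option):
    return +0.7   # behaviour suggests the human wants it

def stated_preferences(option):
    return -0.4   # the human says they don't want it

def regret_based(option):
    return +0.1   # weak emotional signal

def cev_style(option):
    return -0.9   # the explicit procedure extrapolates against it

METHODS = [revealed_preferences, stated_preferences, regret_based, cev_style]

def meta_method(option, criterion):
    """Pick a method's verdict. `criterion` scores how well each method
    tracks 'actual human values' in this situation -- but writing that
    scoring function already requires a theory of what those values are."""
    best = max(METHODS, key=lambda m: criterion(m, option))
    return best(option)

estimates = [m("extreme_option") for m in METHODS]
print(estimates)  # [0.7, -0.4, 0.1, -0.9]: no agreement to fall back on
# We cannot call meta_method without supplying `criterion` -- that is,
# without a definition of what human values actually are.
```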

Underdefined methods

The previous section understates the problems with purely practical ways of assessing human values. It pointed out divergences between the methods in “extreme situations”. Perhaps we were imagining these extreme situations as the equivalent of a meteor impact on the weather system: bizarre edge cases where reasonable methods finally break down.

But all those methods actually fail in typical situations as well. If we interpret the methods naively, they fail often. For example, in 1919, some of the Chicago White Sox baseball team were actually trying to lose. If we ask someone their stated values in a political debate or a courtroom, we don’t expect an honest answer. Emotion-based approaches fail in situations where humans deliberately expose themselves to nostalgia, or fear, or other “negative” emotions (e.g. through scary movies). And there are failure modes for the explicit procedures, too.

This is true if we interpret the methods naively. If we were more “reasonable” or “sophisticated”, we would point out that we don’t expect those methods to be valid in every typical situation. In fact, we can do better than that: we have a good intuitive understanding of when the methods succeed and when they fail, and different people have similar intuitions (we all understand that people are more honest in relaxed private settings than in stressful public ones, for example). It’s as if we lived in a town with either sunny days or snow storms, except on weekends. Then everyone could agree that the different indicators correlate during the week. So the more sophisticated methods would include something like “ignore the data if it’s Saturday or Sunday”.
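In code, the “sophisticated” version of an indicator amounts to wrapping the raw reading in an applicability filter, as in this toy sketch. The crisp `is it the weekend?` rule stands in for our intuitive “this method doesn’t apply here” judgements, which, as discussed next, we do not actually know how to formalise; the readings are invented.

```python
# Toy sketch: a "sophisticated" indicator wraps a raw reading in an
# applicability check. For the weather town the check is crisp
# ("is it the weekend?"); for value-extraction methods we only have
# an intuitive, unformalised sense of when the data should be ignored.

WEEKEND = {"Saturday", "Sunday"}

def raw_wind_reading(day):
    # Stand-in for any one indicator's raw measurement (invented values).
    return {"Monday": 0.1, "Saturday": 0.9}.get(day, 0.2)

def sophisticated_wind_reading(day):
    if day in WEEKEND:
        return None  # "ignore the data if it's Saturday or Sunday"
    return raw_wind_reading(day)

print(sophisticated_wind_reading("Monday"))    # 0.1 -- usable
print(sophisticated_wind_reading("Saturday"))  # None -- discarded
```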

But there are problems with this analogy. Unlike for the weather, there is no clear principle for deciding when it’s the equivalent of the weekend. Yes, we have an intuitive grasp of when stated preferences fail, for instance. But as Moravec’s paradox shows, an intuitive understanding doesn’t translate into an explicit, formal definition—and it’s that kind of formal definition that we need if we want to code up those methods. Even worse, we don’t all agree as to when the methods fail. For example, some economists deny the very existence of mental illness, while psychiatrists (and most laypeople) very much feel that it exists.

Human judgement and machine patching

So figuring out whether the methods apply is an exercise in human judgement. Figuring out whether the methods have gone wrong is a similar exercise (see the Last Judge in CEV). And figuring out what to do when they don’t apply is also an exercise in human judgement—if we judge that someone is lying about their stated preferences, we could just reverse their statement to get their true values.

So we need to patch the methods using our human judgement. And probably patch the patches, and so on. Not only is the patching process a terrible and incomplete way of constructing a safe goal for the AI, but human judgements are not consistent—we can be swayed on things as basic as whether a behaviour is rational, let alone all the situational biases that cloud our assessments of more complicated issues.

So obviously, the solution to these problems is to figure out which human is best in their judgements, then to see under what circumstances these judgements can be least biased, and how to present the information to them in the most impartial way, and then automate that judgement...

Stop that. It’s silly. The correct solution is not to assess the rationality of human judgements about methods of extracting human values. The correct solution is to come up with a better theoretical definition of what human values are. Armed with such a theory, we can resolve or ignore the above issues in a direct and principled way.

Building a theory of human values

Just because we need a theory of human values doesn’t mean that it’s easy to find one—the universe is cruel like that.

A big part of my current approach is to build such a theory. I will present an overview of my theory in a subsequent post, though most of the pieces have appeared in past posts already. My approach uses three key components (a structural sketch follows this list):

  1. A way of defining the basic preferences (and basic meta-preferences) of a given human, even if these are under-defined or situational.

  2. A method for synthesising such basic preferences into a single utility function or similar object.

  3. A guarantee that we won’t end up in a terrible place, due to noise or different choices in the two definitions above.
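Purely to show how those three components are meant to fit together, here is a structural sketch. It is not the actual theory: every function and value below is a hypothetical placeholder, and only the data flow between the components is the point.

```python
# Structural sketch only: the three components as placeholder functions.
# None of this is the actual theory; it only shows how the pieces connect.
from typing import Callable, List, Tuple

# A basic (meta-)preference: a description plus a signed weight.
BasicPreference = Tuple[str, float]
UtilityFunction = Callable[[str], float]

def extract_basic_preferences(human_data: dict) -> List[BasicPreference]:
    """Component 1: define basic preferences, even under-defined ones."""
    return [("avoid_pain", 1.0), ("be_honest", 0.5)]  # invented examples

def synthesise(prefs: List[BasicPreference]) -> UtilityFunction:
    """Component 2: combine basic preferences into one utility-like object."""
    def utility(outcome: str) -> float:
        return sum(weight for label, weight in prefs if label in outcome)
    return utility

def sanity_check(utility: UtilityFunction, test_outcomes: List[str]) -> bool:
    """Component 3: guard against ending up somewhere terrible
    because of noise or arbitrary choices in the first two steps."""
    return all(utility(o) > -10 for o in test_outcomes)

prefs = extract_basic_preferences({})
u = synthesise(prefs)
print(sanity_check(u, ["world_where_people_avoid_pain"]))  # True
```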