# Is my result wrong? Maths vs intuition vs evolution in learning human preferences

The mathematical result is clear: you cannot deduce human preferences merely by observing human behaviour (even with simplicity priors).

Yet many people instinctively reject this result; even I found it initially counter-intuitive. And you can make a very strong argument that it’s wrong. It would go something like this:

“I, a human H, can estimate what another human G wants, just by observing their behaviour. And these estimations have evidence behind them: G will often agree that I’ve got their values right, and I can use this estimation to predict G’s behaviour. Therefore, it seems I’ve done the impossible: gone from behaviour to preferences.”

# Evolution and empathy modules

This is how I interpret what’s going on here. Humans (roughly) have empathy modules E, which allow them to estimate the preferences of other humans, and prediction modules P, which use the outcome of E to predict their behaviour. Since evolution is colossally lazy, these modules don’t vary much from person to person.

So, for h_G a history of human G’s behaviour in typical circumstances, the modules of two humans H and H′ will give similar answers:

• E_H(h_G) ≈ E_H′(h_G).

Moreover, when humans turn their modules on their own behaviour, they get similar results. The human G has privileged access to their own deliberations; so define ĥ_G as the internal history of G. Thus:

• E_H(h_G) ≈ E_G(ĥ_G).

This idea connects with partial preferences/partial models in the following way: E_G(ĥ_G) gives G access to their own internal models and preferences; so the approximately-equal signs above mean that, by observing the behaviour of other humans, we have approximate access to their internal models.

Then P just takes the results of E to predict future behaviour; since E and P have co-evolved, it’s no surprise that P has a good predictive record.

So, given E, it is true that a human can estimate the preferences of another human, and, given P, it is true that they can use this knowledge to predict behaviour.

# The problems

So, what are the problems here? There are three:

1. E and P only function well in typical situations. If we allow humans to self-modify arbitrarily, or to create strange other beings (such as AIs themselves, or merged human-AIs), then our empathy and predictions will start to fail[1].

2. It needs E and P to be given; but defining these for AIs is very tricky. Time and time again, we’ve found that tasks that are easy for humans to do are not easy for humans to program into AIs.

3. The empathy and prediction modules are similar, but not identical, from person to person and culture to culture[2].

So both are correct: my result (without assumptions, you cannot go from human behaviour to preferences) and the critique (given these assumptions, which humans share, you can go from human behaviour to preferences).

And when it comes to humans predicting humans, the critique is more valid: listening to your heart/gut is a good way to go. But when it comes to programming potentially powerful AIs that could completely transform the human world in strange and unpredictable ways, my negative result is more relevant than the critique.

## A note on assumptions

I’ve had some disagreements with people that boil down to me saying “without assuming A, you cannot deduce B”, and them responding “since A is obviously true, B is true”. I then go on to say that I am going to assume A (or define A to be true, or whatever).

At that point, we don’t actually have a disagreement. We’re saying the same thing (accept A, and thus accept B), with a slight difference of emphasis: I’m more “moral anti-realist” (we choose to accept A, because it agrees with our intuition), while they are more “moral realist” (A is true, because it agrees with our intuition). It’s not particularly productive to dig further.

# In practice: debugging and injecting moral preferences

There are some interesting practical consequences to this analysis. Suppose, for example, that someone is programming a clickbait detector. They gather a whole collection of clickbait examples, train a neural net on them, and fiddle with the hyperparameters until the classification looks decent.

But both “gathering a whole collection of clickbait examples” and “the classification looks decent” are not facts about the universe: they are judgements of the programmers. The programmers are using their own empathy and prediction modules (E and P) to establish that certain articles are a) likely to be clicked on, but b) not what the clicker would really want to read. So the whole process is entirely dependent on programmer judgement. It might feel like “debugging”, or “making reasonable modelling choices”, but it’s actually injecting the programmers’ judgements into the system.
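To make this concrete, here is a minimal sketch of such a pipeline (the headlines, the labels, and the toy perceptron are all hypothetical stand-ins for whatever data and model real programmers would use). The point it illustrates: the training labels are not measurements of the world; they are the programmers’ empathy-and-prediction judgements, baked directly into the dataset.

```python
from collections import Counter

# Hypothetical training set. Every label here is a programmer judgement
# ("this headline is clickbait: likely to be clicked, but not what the
# clicker really wants to read"), not a fact about the universe.
TRAIN = [
    ("you won't believe what happened next", 1),
    ("ten secrets doctors don't want you to know", 1),
    ("this one weird trick will shock you", 1),
    ("quarterly inflation report released by central bank", 0),
    ("city council approves new budget for road repairs", 0),
    ("study measures effect of sleep on memory", 0),
]

def featurise(headline):
    """Bag-of-words feature counts."""
    return Counter(headline.lower().split())

def train_perceptron(data, epochs=20):
    """Tiny perceptron: word weights are nudged on every misclassification."""
    w = Counter()
    for _ in range(epochs):
        for text, label in data:
            score = sum(w[t] * c for t, c in featurise(text).items())
            pred = 1 if score > 0 else 0
            if pred != label:
                for t, c in featurise(text).items():
                    w[t] += c if label == 1 else -c
    return w

def classify(w, headline):
    """Returns 1 for 'clickbait', i.e. 'matches the programmers' judgements'."""
    return 1 if sum(w[t] * c for t, c in featurise(headline).items()) > 0 else 0

weights = train_perceptron(TRAIN)
```

Nothing in this code ever touches “what readers really want”: swap in a different set of labels (a different programmer’s judgements) and the same pipeline will faithfully learn those instead.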

And that’s fine! We’ve seen that different people have similar judgements. But there are two caveats: first, not everyone will agree, because there is not perfect agreement between the empathy modules. The programmers should be careful about whether or not this is an area of very divergent judgements.

And second, these results will likely not generalise well to new distributions. That’s because having implicit access to categorisation modules that are themselves valid only in typical situations… is not a way to generalise well. At all.

Hence we should expect poor generalisation from such methods, to other situations and (sometimes) to other humans. In my opinion, if programmers are more aware of these issues, they will have better generalisation performance.

1. I’d consider the Star Trek universe to be much more typical than, say, 7th-century China. The Star Trek universe is filled with beings that are slight variants or exaggerations of modern humans, while people in 7th-century China would have very alien ways of thinking about society, hierarchy, good behaviour, and so on. But even that is still very typical compared with the truly alien beings that can exist in the space of all possible minds. ↩︎

2. For instance, Americans will typically explain a certain behaviour by intrinsic features of the actor, while Indians will give more credit to the circumstances (Miller, Joan G. “Culture and the development of everyday social explanation.” Journal of Personality and Social Psychology 46.5 (1984): 961). ↩︎

• I would add that people overestimate their ability to guess others’ preferences. “He just wants money”, or “She just wants to marry him”. Such oversimplified models might not just be useful simplifications, but could be blatantly wrong.

• I agree we’re not as good as we think we are. But there are a lot of things we do agree on, that seem trivial: e.g. “this person is red in the face, shouting at me, and punching me; I deduce that they are angry and wish to do me harm”. We have far, far more agreement than random agents would.

• I agree, and I think much of the difficulty people have in accepting the result comes from not seeing the implicitly assumed norms we are always applying to understand things. I think this runs much deeper than saying humans have something like an empathy module, though, and is a general problem of humans not seeing reality clearly. Instead, they think they see it, when in fact what they are seeing (and especially what they are interpreting and inferring) is tainted by prior evidence, in ways that make everything humans do conditional on those priors, such that no seeing is truly free and independent of the conditions in which it arises.

That’s fairly abstract, so another way to put it is that we’re constantly seeing the world on the assumption that we already know what the world looks like. We can learn to assume less, but the natural, adaptive state is to assume a lot, because doing so has led to greater reproductive fitness, probably specifically because it made otherwise hard inferences possible by making strong assumptions about the world that were often true (or true enough for our ancestors’ purposes).

(I think this ties into the story I’ve been telling for a long time about developmental psychology, and the newer story I’ve been telling about human brains minimizing prediction error with additional homeostatic set points, but I also think it stands on its own, so I write here without reference to them other than this comment.)

• The problem with the maths is that it does not correlate ‘values’ with any real-world observable. You give all objects a property, and you say that that property is distributed by simplicity priors. You have not yet specified how these ‘values’ relate to any real-world phenomenon in any way. Under this model, you could never see any evidence that humans don’t ‘value’ maximizing paperclips.

To solve this, we need to understand what values are. The values of a human are much like the filenames on a hard disk. If you run a quantum field theory simulation, you don’t have to think about either; you can make your predictions directly. If you want to make approximate predictions about how a human will behave, you can think in terms of values and get somewhat useful predictions. If you want to predict approximately how a computer system will behave, then instead of simulating every transistor, you can think in terms of folders and files.

I can substitute words in the ‘proof’ that humans don’t have values, and get a proof that computers don’t have files. It works the same way: you turn your uncertainty about the relation between the exact and the approximate into a confidence that the two are uncorrelated. Making a somewhat naive and not formally specified assumption along the lines of “the real action taken optimizes human values better than most possible actions” will get you a meaningful but not perfect definition of ‘values’. You still need to say exactly what a “possible action” is.

Making a somewhat naive and not formally specified assumption along the lines of “the files are what you see when you click on the file viewer” will get you a meaningful but not perfect definition of ‘files’. You still need to say exactly what a “click” is, and how you translate a pattern of photons into a ‘file’.

We see that if you were running a quantum simulation of the universe, then getting values out of a virtual human is the same type of problem as getting files off a virtual computer.

• I like this analogy. Probably best not to put too much weight on it, but it has some insights.

• I wonder if part of the messiness might stem from confusing various domains and ranges. For example, humans have a complex of wants: some are driven very much by physiological factors, some by cultural factors, and some by individual factors (including things like what I did yesterday or 5 hours ago). We might call these our preference domain.

Then we need some function mapping the preferences into the range of behaviors that are observable, assuming that there is something approximating a function here (caveat: not a math guy, so maybe that term is misused/loaded). From that we have some hope of mapping the behavior back to the preference.

However, we should not consider the above three sources as coming from the same domain, or as mapping to the same range. Confusion may come in both from the fuzziness of the “correct” function (I’m implicitly agreeing with the general proposition that you cannot infer preferences from behavior all that well) and from mis-associating a behavior with one of the three ranges, and then attempting to deduce the preference.

If I see A doing x, ascribe x to the physiological range, and then attempt to deduce the preference (in the physiological domain) when x is actually in the individual range for A, I will probably see a lot of errors. But maybe not 100% error.

I do think there is something to the idea that, since we’re all human, we can recognize a lot of meaning in the actions of others; but things like culture (as mentioned) do influence performance here. So, what is an acceptable accuracy rate? Is the goal mathematical certainty or something else?

• Your title seems clickbaity, since its question is answered “no” in the post, and this article would have been more surprising had you answered “yes”. (And my expectation was that if you ask that question in the title, you no longer know the answer.)

> having implicit access to categorisation modules that themselves are valid only in typical situations… is not a way to generalise well

How do you know this? Should we turn this into one of those concrete ML experiments?

• PS: the other title I considered was “Why do people feel my result is wrong”, which felt too condescending.

Hehe, I don’t normally do this, but I feel I can indulge once ^_^

> having implicit access to categorisation modules that themselves are valid only in typical situations… is not a way to generalise well
>
> How do you know this?

Moravec’s paradox again. Chessmasters didn’t find it easy to program chess programs; and those chess programs didn’t generalise to games in general.

> Should we turn this into one of those concrete ML experiments?

That would be good. I’m aiming to have a lot more practical experiments come out of my research project, and this could be one of them.

• > Chessmasters didn’t easily program chess programs; and those chess programs didn’t generalise to games in general.

I’d say a more relevant analogy is whether some ML algorithm could learn to play Go teaching games against a master, from examples of a master playing teaching games against a student, without knowing what Go is.

• And whether those programs could then perform well if their opponent forces them into a very unusual situation, such as would never have appeared in a chessmaster game.

If I sacrifice a knight for no advantage whatsoever, will the opponent be able to deal with that? What if I set up a trap to capture a piece, relying on my opponent not seeing it? A chessmaster playing another chessmaster would never play a simple trap, as it would never succeed; so would the ML system be able to deal with it?