Is my result wrong? Maths vs intuition vs evolution in learning human preferences

The mathematical result is clear: you cannot deduce human preferences merely by observing human behaviour (even with simplicity priors).
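
To make that concrete, here is a minimal toy sketch (my own illustration with made-up states and rewards, not the formal construction behind the result): the same observed policy is reproduced exactly by several incompatible (planner, reward) pairs, and the degenerate pairs are no more complex than the intended one, so neither the behaviour nor a simplicity prior picks out the person's actual preferences.

```python
# Toy illustration (assumed setup): several (planner, reward) pairs
# reproduce exactly the same observed behaviour.

STATES, ACTIONS = range(3), range(2)

# The "true" reward the human actually cares about (hypothetical).
true_reward = {(s, a): float(a == s % 2) for s in STATES for a in ACTIONS}

def rational(reward):
    # Planner that always picks the reward-maximising action.
    return {s: max(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}

def anti_rational(reward):
    # Planner that always picks the reward-minimising action.
    return {s: min(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}

observed_policy = rational(true_reward)  # all an outside observer ever sees

# Three mutually incompatible "explanations" of that same behaviour:
candidates = {
    "rational planner + R": rational(true_reward),
    "anti-rational planner + (-R)": anti_rational({k: -v for k, v in true_reward.items()}),
    "reward-blind planner that hard-codes the policy": dict(observed_policy),
}

for name, policy in candidates.items():
    print(f"{name}: matches observed behaviour = {policy == observed_policy}")
```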

Yet many people instinctively reject this result; even I found it initially counter-intuitive. And you can make a very strong argument that it’s wrong. It would go something like this:

“I, a human $H$, can estimate what human $H'$ wants, just by observing their behaviour. And these estimations have evidence behind them: $H'$ will often agree that I’ve got their values right, and I can use this estimation to predict $H'$’s behaviour. Therefore, it seems I’ve done the impossible: gone from behaviour to preferences.”

Evolution and empathy modules

This is how I interpret what’s going on here. Humans (roughly) have empathy modules $E$ which allow them to estimate the preferences of other humans, and prediction modules $P$ which use the outcome of $E$ to predict their behaviour. Since evolution is colossally lazy, these modules don’t vary much from person to person.

So, for a history $h_{H''}$ of human $H''$’s behaviour in typical circumstances, the empathy modules of two humans $H$ and $H'$ will give similar answers:

  • $E_H(h_{H''}) \approx E_{H'}(h_{H''})$.

Moreover, when humans turn their modules on their own behaviour, they get similar results. The human $H''$ will have privileged access to their own deliberations; so define $h^i_{H''}$ as the internal history of $H''$. Thus:

  • $E_H(h_{H''}) \approx E_{H'}(h_{H''}) \approx E_{H''}(h^i_{H''})$.

This idea connects with partial preferences/partial models in the following way: $E_{H''}(h^i_{H''})$ gives $H''$ access to their own internal models and preferences; so the approximately-equal symbols above mean that, by observing the behaviour of other humans, we get approximate access to their internal models.

Then $P$ just takes the results of $E$ to predict future behaviour; since $P$ and $E$ have co-evolved, it’s no surprise that $P(E)$ would have a good predictive record.

So, given $E$, it is true that a human can estimate the preferences of another human, and, given $P$, it is true that they can use this knowledge to predict behaviour.
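
To keep this structure straight, here is a minimal sketch (my own caricature; the $E$ and $P$ above are evolved, informal capacities, not literal functions). Two slightly different empathy modules produce nearly the same preference estimate from a typical history, and a shared prediction module turns either estimate into the same predicted behaviour:

```python
from collections import Counter

# Caricature empathy modules for two humans H and H' (assumed forms): both turn
# an observed history into a crude preference estimate, and differ only slightly.

def empathy_H(history):
    # E_H: estimate preferences as raw choice frequencies.
    counts = Counter(history)
    total = sum(counts.values())
    return {option: n / total for option, n in counts.items()}

def empathy_H_prime(history):
    # E_{H'}: same idea, with mild smoothing, so the estimates differ a little.
    counts = Counter(history)
    total = sum(counts.values()) + 1
    return {option: (n + 1 / len(counts)) / total for option, n in counts.items()}

def prediction_module(preference_estimate):
    # P: predict that the most-preferred option gets chosen next.
    return max(preference_estimate, key=preference_estimate.get)

typical_history = ["coffee", "coffee", "tea", "coffee"]  # h_{H''}, a typical history

est_H = empathy_H(typical_history)
est_H_prime = empathy_H_prime(typical_history)

print(est_H, est_H_prime)                    # E_H(h) ≈ E_{H'}(h): similar estimates
print(prediction_module(est_H),
      prediction_module(est_H_prime))        # P(E): both predict "coffee"
```

The interesting failures, discussed next, are exactly the situations where such modules stop agreeing or stop applying.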

The problems

So, what are the problems here? There are three:

  1. $E$ and $P$ only function well in typical situations. If we allow humans to self-modify arbitrarily or create strange other beings (such as AIs themselves, or merged human-AIs), then our empathy and predictions will start to fail[1].

  2. It needs $E$ and $P$ to be given; but defining these for AIs is very tricky. Time and time again, we’ve found that tasks that are easy for humans to do are not easy for humans to program into AIs.

  3. The empathy and prediction modules are similar, but not identical, from person to person and culture to culture[2].

So both are correct: my result (without assumptions, you cannot go from human behaviour to preferences) and the critique (given these assumptions that humans share, you can go from human behaviour to preferences).

And when it comes to humans predicting humans, the critique is more valid: listening to your heart/gut is a good way to go. But when it comes to programming potentially powerful AIs that could completely transform the human world in strange and unpredictable ways, my negative result is more relevant than the critique is.

A note on assumptions

I’ve had some disagreements with people that boil down to me saying “without assuming A, you cannot deduce B”, and them responding “since A is obviously true, B is true”. I then go on to say that I am going to assume A (or define A to be true, or whatever).

At that point, we don’t actually have a disagreement. We’re saying the same thing (accept A, and thus accept B), with a slight difference of emphasis: I’m more “moral anti-realist” (we choose to accept A, because it agrees with our intuition), while they are more “moral realist” (A is true, because it agrees with our intuition). It’s not particularly productive to dig further.

In practice: debugging and injecting moral preferences

There are some interesting practical consequences to this analysis. Suppose, for example, that someone is programming a clickbait detector. They then gather a whole collection of clickbait examples, train a neural net on them, and fiddle with the hyperparameters till the classification looks decent.

But neither “gathering a whole collection of clickbait examples” nor “the classification looks decent” is a fact about the universe: they are judgements of the programmers. The programmers are using their own $E$ and $P$ modules to establish that certain articles are a) likely to be clicked on, but b) not what the clicker would really want to read. So the whole process is entirely dependent on programmer judgement: it might feel like “debugging”, or “making reasonable modelling choices”, but it’s actually injecting the programmers’ judgements into the system.
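
Here is a sketch of where those judgements enter, as a toy scikit-learn pipeline (the dataset, labels, and model choice are all my own assumptions; nothing above specifies a particular setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (1) "Gathering clickbait examples": each label is a programmer's judgement
#     that the headline would be clicked on AND would disappoint the clicker.
headlines = [
    "You won't BELIEVE what happened next",
    "Ten tricks doctors don't want you to know",
    "Quarterly inflation figures released by statistics office",
    "City council approves new bus routes",
]
labels = [1, 1, 0, 0]  # 1 = clickbait, by the programmers' lights

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(headlines, labels)

# (2) "The classification looks decent": another judgement call, not a fact
#     about the universe; a different team might tune until a different
#     boundary felt right.
print(model.predict(["You will not believe this one weird trick"]))
```

Nothing in the pipeline itself says what counts as clickbait; the labels encode the programmers’ $E$ and $P$ judgements, and “stop tuning, this looks decent” is another such judgement.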

And that’s fine! We’ve seen that different people have similar judgements. But there are two caveats: first, not everyone will agree, because there is not perfect agreement between the empathy modules. The programmers should be careful as to whether this is an area of very divergent judgements or not.

And second, these results will likely not generalise well to new distributions. That’s because having implicit access to categorisation modules that themselves are valid only in typical situations… is not a way to generalise well. At all.

Hence we should expect poor generalisation from such methods, to other situations and (sometimes) to other humans. In my opinion, if programmers are more aware of these issues, they will have better generalisation performance.


  1. I’d consider the Star Trek universe to be much more typical than, say, 7th-century China. The Star Trek universe is filled with beings that are slight variants or exaggerations of modern humans, while people in 7th-century China will have very alien ways of thinking about society, hierarchy, good behaviour, and so on. But that is still very typical compared with the truly alien beings that can exist in the space of all possible minds. ↩︎

  2. For instance, Americans will typically explain a certain behaviour by intrinsic features of the actor, while Indians will give more credit to the circumstances (Miller, Joan G. “Culture and the development of everyday social explanation.” Journal of Personality and Social Psychology 46.5 (1984): 961). ↩︎