Humans can be assigned any values whatsoever…

(Re)Posted as part of the AI Alignment Forum sequence on Value Learning.

Rohin’s note: In the last post, we saw that a good broad value learning approach would need to understand the systematic biases in human planning in order to achieve superhuman performance. Perhaps we can just use machine learning again and learn the biases and reward simultaneously? This post by Stuart Armstrong (original here) and the associated paper say: “Not without more assumptions.”
This post comes from a theoretical perspective that may be alien to ML researchers; in particular, it makes an argument that simplicity priors do not solve the problem pointed out here, where simplicity is based on Kolmogorov complexity (which is an instantiation of the Minimum Description Length principle). The analog in machine learning would be an argument that regularization would not work. The proof used is specific to Kolmogorov complexity and does not clearly generalize to arbitrary regularization techniques; however, I view the argument as being suggestive that regularization techniques would also be insufficient to address the problems raised here.

Humans have no values… nor does any agent, unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values.

An agent with no clear preferences

There are three buttons in this world, $B_0$, $B_1$, and $X$, and one agent $H$.

$B_0$ and $B_1$ can be operated by $H$, while $X$ can be operated by an outside observer. $H$ will initially press button $B_0$; if ever $X$ is pressed, the agent will switch to pressing $B_1$. If $X$ is pressed again, the agent will switch back to pressing $B_0$, and so on. After a large number of turns $N$, $H$ will shut off. That’s the full algorithm for $H$.
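The algorithm above is small enough to write down directly. The following is a minimal sketch, not anything from the original post: the function name `run_agent`, the turn-indexed set `x_presses`, and the convention that a press of the override button takes effect before the agent acts on that turn are all assumptions made for illustration.

```python
def run_agent(x_presses, n_turns):
    """Simulate the agent for n_turns turns.

    x_presses: set of turn indices at which the outside observer presses
    the override button (assumed to take effect before the agent acts).
    Returns the list of buttons the agent presses, one per turn.
    """
    target = 0  # the agent starts by pressing button 0
    pressed = []
    for t in range(n_turns):
        if t in x_presses:
            target = 1 - target  # each override press toggles the target
        pressed.append(f"B{target}")
    return pressed  # after n_turns the agent shuts off


history = run_agent(x_presses={2, 5}, n_turns=8)
print(history)
# ['B0', 'B0', 'B1', 'B1', 'B1', 'B0', 'B0', 'B0']
```

Nothing in this code mentions a goal or a reward: it is a pure policy, which is what makes the question of the agent's "values" non-trivial.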

So the question is, what are the values/preferences/rewards of $H$? There are three natural reward functions that are plausible:

  • $R_0$, which is linear in the number of times $B_0$ is pressed.
  • $R_1$, which is linear in the number of times $B_1$ is pressed.
  • $R_2 = I_E R_0 + I_O R_1$, where $I_E$ is the indicator function for $X$ being pressed an even number of times, and $I_O = 1 - I_E$ is the indicator function for $X$ being pressed an odd number of times.

For $R_0$, we can interpret $H$ as an $R_0$-maximising agent which $X$ overrides. For $R_1$, we can interpret $H$ as an $R_1$-maximising agent which $X$ releases from constraints. And $R_2$ is the “$H$ is always fully rational” reward. Semantically, these make sense for the various $R_i$’s being a true and natural reward, with $X$ = “coercive brain surgery” in the first case, $X$ = “release H from annoying social obligations” in the second, and $X$ = “switch which of $R_0$ and $R_1$ gives you pleasure” in the last case.
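To see how differently the three candidate rewards score the same behaviour, one can evaluate them on a single run. This is a rough sketch under stated assumptions: each reward is taken to have slope 1, the third reward is read turn by turn (a press of button 0 counts while the override button has been pressed an even number of times so far, a press of button 1 counts while the count is odd), and the helper names `run_agent` and `rewards` are hypothetical.

```python
def run_agent(x_presses, n_turns):
    """Agent policy: press button 0, toggling target on each override press."""
    target, pressed = 0, []
    for t in range(n_turns):
        if t in x_presses:
            target = 1 - target
        pressed.append(target)
    return pressed


def rewards(pressed, x_presses):
    """Return (R0, R1, R2) for one run, each with slope 1 (an assumption)."""
    r0 = sum(1 for b in pressed if b == 0)  # counts presses of button 0
    r1 = sum(1 for b in pressed if b == 1)  # counts presses of button 1
    r2, x_count = 0, 0
    for t, b in enumerate(pressed):
        if t in x_presses:
            x_count += 1
        wanted = x_count % 2  # even count rewards button 0, odd rewards button 1
        r2 += int(b == wanted)
    return r0, r1, r2


x_presses = {2, 5}
print(rewards(run_agent(x_presses, 8), x_presses))
# (5, 3, 8)
```

On this run the agent collects the maximum possible value only under the third reward, which matches the "always fully rational" reading; under the first two it looks like a maximiser that was interfered with. The behaviour itself is identical in all three readings.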

But note that there are no semantic implications here: all that we know is $H$, with its full algorithm. If we wanted to deduce its true reward for the purpose of something like Inverse Reinforcement Learning (IRL), what would it be?

Modelling human (ir)rationality and reward

Now let’s talk about the preferences of an actual human. We all know that humans are not always rational. But even if humans were fully rational, the fact remains that we are physical, and vulnerable to things like coercive brain surgery (and in practice, to a whole host of other more or less manipulative techniques). So there will be the equivalent of “button $X$” that overrides human preferences. Thus, “not immortal and unchangeable” is in practice enough for the agent to be considered “not fully rational”.

Now assume that we’ve thoroughly observed a given human $h$ (including their internal brain wiring), so we know the human policy $\pi_h$ (which determines their actions in all circumstances). This is, in practice, all that we can ever observe: once we know $\pi_h$ perfectly, there is nothing more that observing $h$ can teach us.

Let $R$ be a possible human reward function, and $\mathcal{R}$ the set of such rewards. A human (ir)rationality planning algorithm $p$ (hereafter referred to as a planner) is a map from $\mathcal{R}$ to the space of policies (thus $p(R)$ says how a human with reward $R$ will actually behave; for example, this could be bounded rationality, rationality with biases, or many other options). Say that the pair $(p, R)$ is compatible if $p(R) = \pi_h$. Thus a human with planner $p$ and reward $R$ would behave as $h$ does.

What possible compatible pairs are there? Here are some candidates:

  • $(p_0, R_0)$, where $p_0$ and $R_0$ are some “plausible” or “acceptable” planner and reward functions (what this means is a big question).
  • $(p_1, R_1)$, where $p_1$ is the “fully rational” planner, and $R_1$ is a reward that fits to give the required policy.
  • $(p_2, R_2)$, where $R_2 = -R_1$ and $p_2 = -p_1$, where $-p$ is defined by $(-p)(R) = p(-R)$; here $p_2$ is the “fully anti-rational” planner.
  • $(p_3, R_3)$, where $p_3$ maps all rewards to $\pi_h$, and $R_3$ is trivial and constant.
  • $(p_4, R_4)$, where $p_4 = -p_0$ and $R_4 = -R_0$.
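The five candidates can be checked mechanically in a toy setting. The sketch below is illustrative only: the three-step binary histories, the parity policy standing in for the observed human policy, and in particular the `biased` planner used as the "plausible" planner are all assumptions invented for this example, not constructions from the post.

```python
from itertools import product

ACTIONS = (0, 1)
HISTORIES = list(product((0, 1), repeat=3))    # toy histories (assumption)
PI_H = {h: sum(h) % 2 for h in HISTORIES}      # stand-in for the observed policy


def neg(reward):
    """Negate a reward given as a dict {(history, action): value}."""
    return {k: -v for k, v in reward.items()}


def rational(reward):
    """Fully rational planner: pick the highest-reward action."""
    return {h: max(ACTIONS, key=lambda a: reward[(h, a)]) for h in HISTORIES}


def anti_rational(reward):
    """Fully anti-rational planner, defined as rational applied to -R."""
    return rational(neg(reward))


def constant(reward):
    """Planner that ignores the reward and always outputs the observed policy."""
    return dict(PI_H)


def biased(reward):
    """Hypothetical 'plausible' planner: rational, except that it picks the
    worst action whenever the last observation in the history is 1."""
    best, worst = rational(reward), rational(neg(reward))
    return {h: worst[h] if h[-1] == 1 else best[h] for h in HISTORIES}


def neg_biased(reward):
    """The negated version of the biased planner."""
    return biased(neg(reward))


# Rewards chosen so that each planner reproduces PI_H.
R1 = {(h, a): int(a == PI_H[h]) for h in HISTORIES for a in ACTIONS}
R0 = {(h, a): int((a == PI_H[h]) != (h[-1] == 1))
      for h in HISTORIES for a in ACTIONS}
R3 = {(h, a): 0 for h in HISTORIES for a in ACTIONS}

pairs = {
    "(p0, R0)": (biased, R0),
    "(p1, R1)": (rational, R1),
    "(p2, R2)": (anti_rational, neg(R1)),
    "(p3, R3)": (constant, R3),
    "(p4, R4)": (neg_biased, neg(R0)),
}

for name, (planner, reward) in pairs.items():
    print(name, "compatible:", planner(reward) == PI_H)
```

Every pair prints `compatible: True`, even though the rewards involved include a function, its negation, and a constant. The constant planner is also compatible with `R0`, `R1`, or any other reward, since it never looks at its input.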

Distinguishing among compatible pairs

How can we distinguish between compatible pairs? At first appearance, we can’t. That’s because, by the definition of compatible, all pairs produce the correct policy $\pi_h$. And once we have $\pi_h$, further observations of $h$ tell us nothing.

I initially thought that Kolmogorov or algorithmic complexity might help us here. But in fact:

Theorem: The pairs $(p_i, R_i)$, $i \geq 1$, are either simpler than $(p_0, R_0)$, or differ in Kolmogorov complexity from it by a constant that is independent of $(p_0, R_0)$.

Proof: The cases of $i = 4$ and $i = 2$ are easy, as these differ from $i = 0$ and $i = 1$ by two minus signs. Given $(p_0, R_0)$, a fixed-length algorithm computes $\pi_h$. Then a fixed-length algorithm defines $p_3$ (by mapping every input to $\pi_h$). Furthermore, given $\pi_h$ and any history $\eta$, a fixed-length algorithm computes the action $a(\eta)$ the agent will take; then a fixed-length algorithm defines $R_1(\eta, a(\eta)) = 1$ and $R_1(\eta, b) = 0$ for $b \neq a(\eta)$.
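Written out, the reward built in the proof and the bound it yields look roughly as follows. This is a restatement under assumptions: $K$ denotes Kolmogorov complexity relative to some fixed universal language, $\eta$ is a history, $a(\eta) = \pi_h(\eta)$ is the action the observed policy takes after $\eta$, and the constants $c_i$ are names introduced here for the fixed-length overheads.

```latex
R_1(\eta, b) =
  \begin{cases}
    1 & \text{if } b = a(\eta), \\
    0 & \text{if } b \neq a(\eta),
  \end{cases}
\qquad
p_1(R_1)(\eta) = \arg\max_b R_1(\eta, b) = a(\eta) = \pi_h(\eta),

K(p_i, R_i) \;\le\; K(p_0, R_0) + c_i
\quad \text{for } i \ge 1, \text{ with } c_i \text{ independent of } (p_0, R_0).
```

The first line shows why the fully rational planner paired with this reward is compatible; the second is the content of the theorem, read as an upper bound.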

So the Kolmogorov complexity can shift between $p$ and $R$ (all in $R$ for $i = 1, 2$, all in $p$ for $i = 3$), but it seems that the complexity of the pair doesn’t go up during these shifts.

This is puzzling. It seems that, in principle, one cannot assume anything about $h$’s reward at all! $R_2 = -R_1$, $R_4 = -R_0$, and $p_3$ is compatible with any possible reward $R$. If we give up the assumption of human rationality (which we must), it seems we can’t say anything about the human reward function. So it seems IRL must fail.