Humans can be assigned any values whatsoever…

(Re)Posted as part of the AI Alignment Forum sequence on Value Learning.

Rohin’s note: In the last post, we saw that a good broad value learning approach would need to understand the systematic biases in human planning in order to achieve superhuman performance. Perhaps we can just use machine learning again and learn the biases and reward simultaneously? This post by Stuart Armstrong (original here) and the associated paper say: “Not without more assumptions.”
This post comes from a theoretical perspective that may be alien to ML researchers; in particular, it makes an argument that simplicity priors do not solve the problem pointed out here, where simplicity is based on Kolmogorov complexity (which is an instantiation of the Minimum Description Length principle). The analog in machine learning would be an argument that regularization would not work. The proof used is specific to Kolmogorov complexity and does not clearly generalize to arbitrary regularization techniques; however, I view the argument as being suggestive that regularization techniques would also be insufficient to address the problems raised here.

Humans have no values… nor does any agent. Unless you make strong assumptions about their rationality. And depending on those assumptions, you can get humans to have any values.

An agent with no clear preferences

There are three buttons in this world, $B_0$, $B_1$, and $X$, and one agent $H$.

$B_0$ and $B_1$ can be operated by $H$, while $X$ can be operated by an outside observer. $H$ will initially press button $B_0$; if ever $X$ is pressed, the agent will switch to pressing $B_1$. If $X$ is pressed again, the agent will switch back to pressing $B_0$, and so on. After a large number $N$ of turns, $H$ will shut off. That’s the full algorithm for $H$.
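To make the setup concrete, here is a minimal Python sketch of $H$’s algorithm (the discrete turn structure and the schedule on which the outside observer presses $X$ are illustrative assumptions, not part of the original description):

```python
# Minimal simulation of H's algorithm. The turn structure and the schedule
# on which the outside observer presses X are assumptions for illustration.

def run_H(num_turns, x_press_turns):
    """Return the sequence of buttons H presses before shutting off."""
    target = "B0"                    # H initially presses B0
    history = []
    for t in range(num_turns):       # after num_turns turns, H shuts off
        if t in x_press_turns:       # the outside observer presses X...
            target = "B1" if target == "B0" else "B0"   # ...so H switches button
        history.append(target)       # H presses its current target button
    return history

print(run_H(num_turns=6, x_press_turns={2, 4}))
# ['B0', 'B0', 'B1', 'B1', 'B0', 'B0']
```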

So the question is, what are the values/preferences/rewards of $H$? There are three natural reward functions that are plausible:

  • $R_0$, which is linear in the number of times $B_0$ is pressed.
  • $R_1$, which is linear in the number of times $B_1$ is pressed.
  • $R_2 = I_E R_0 + I_{\neg E} R_1$, where $I_E$ is the indicator function for $X$ being pressed an even number of times, and $I_{\neg E}$ is the indicator function for $X$ being pressed an odd number of times.

For $R_0$, we can interpret $H$ as an $R_0$-maximising agent which $X$ overrides. For $R_1$, we can interpret $H$ as an $R_1$-maximising agent which $X$ releases from constraints. And $R_2$ is the “$H$ is always fully rational” reward. Semantically, these all make sense as stories in which the relevant $R_i$ is $H$’s true and natural reward, with “coercive brain surgery” in the first case, “release $H$ from annoying social obligations” in the second, and “switch which of $B_0$ and $B_1$ gives you pleasure” in the last case.

But note that there are no semantic implications here; all that we know is $H$, with its full algorithm. If we wanted to deduce its true reward for the purpose of something like Inverse Reinforcement Learning (IRL), what would it be?
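To see the three candidates side by side, here is a rough sketch that scores a toy trajectory of the kind $H$ produces (the trajectory encoding, and reading “linear” as a raw press count, are assumptions made only for illustration):

```python
# Rough encodings of the three candidate rewards. A trajectory is a list of
# (x_pressed_even_number_of_times_so_far, button_pressed) pairs; this
# encoding, and reading "linear" as a raw count, are assumptions.

def R0(traj):
    return sum(1 for _, button in traj if button == "B0")

def R1(traj):
    return sum(1 for _, button in traj if button == "B1")

def R2(traj):
    # I_E * R0 + I_{not E} * R1, applied step by step: B0 presses count while
    # X has been pressed an even number of times, B1 presses count otherwise.
    return sum(1 for even, button in traj
               if (even and button == "B0") or (not even and button == "B1"))

# A trajectory H would produce if the observer pressed X at turns 2 and 4.
traj = [(True, "B0"), (True, "B0"), (False, "B1"),
        (False, "B1"), (True, "B0"), (True, "B0")]
print(R0(traj), R1(traj), R2(traj))  # 4 2 6
```

Under this scoring, $H$ collects the maximum possible reward according to $R_2$, matching the “always fully rational” reading above.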

Modelling human (ir)rationality and reward

Now let’s talk about the preferences of an actual human. We all know that humans are not always rational. But even if humans were fully rational, the fact remains that we are physical, and vulnerable to things like coercive brain surgery (and in practice, to a whole host of other more or less manipulative techniques). So there will be the equivalent of “button $X$” that overrides human preferences. Thus, “not immortal and unchangeable” is in practice enough for the agent to be considered “not fully rational”.

Now assume that we’ve thoroughly observed a given human $h$ (including their internal brain wiring), so we know the human policy $\pi_h$ (which determines their actions in all circumstances). This is, in practice, all that we can ever observe; once we know $\pi_h$ perfectly, there is nothing more that observing $h$ can teach us.

Let $R$ be a possible human reward function, and $\mathcal{R}$ the set of such rewards. A human (ir)rationality planning algorithm $p$ (hereafter referred to as a planner) is a map from $\mathcal{R}$ to the space of policies; thus $p(R)$ says how a human with reward $R$ will actually behave (for example, this could be bounded rationality, rationality with biases, or many other options). Say that the pair $(p, R)$ is compatible if $p(R) = \pi_h$. Thus a human with planner $p$ and reward $R$ would behave as $\pi_h$ does.
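In code, these definitions amount to something like the following minimal sketch (representing policies as dictionaries from histories to actions is an assumption made for concreteness):

```python
from typing import Callable, Dict

# Illustrative types (assumed for this sketch): a policy maps histories to
# actions, a reward scores (history, action) pairs, and a planner maps a
# reward function to the policy a human with that reward would follow.
Policy = Dict[str, str]
Reward = Callable[[str, str], float]
Planner = Callable[[Reward], Policy]

def compatible(p: Planner, R: Reward, pi_h: Policy) -> bool:
    """The pair (p, R) is compatible with the observed policy iff p(R) == pi_h."""
    return p(R) == pi_h
```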

What possible compatible pairs $(p, R)$ are there? Here are some candidates (a toy construction of several of them follows the list):

  • $(p_0, R_0)$, where $p_0$ and $R_0$ are some “plausible” or “acceptable” planner and reward functions (what this means is a big question).
  • $(p_r, R_r)$, where $p_r$ is the “fully rational” planner, and $R_r$ is a reward chosen to fit $\pi_h$ and give the required policy.
  • $(-p_r, -R_r)$, where $-R_r = (-1)R_r$ and $-p_r$ is defined by $(-p_r)(R) = p_r(-R)$; here $-p_r$ is the “fully anti-rational” planner.
  • $(p_i, R_i)$, where $p_i$ maps all rewards to $\pi_h$, and $R_i$ is trivial and constant.
  • $(-p_i, -R_i)$, where $-p_i$ and $-R_i$ are defined analogously to $-p_r$ and $-R_r$.
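To see how cheaply the degenerate pairs can be manufactured, here is a toy construction (the two-history environment, the greedy reading of “fully rational”, and every name in it are illustrative assumptions):

```python
# Toy construction of several of the candidate pairs above, for a fixed
# observed policy pi_h over a tiny (and entirely made-up) set of histories
# and actions.

HISTORIES = ["start", "after_X"]
ACTIONS = ["press_B0", "press_B1"]

# The observed policy pi_h: all we ever get to see of the agent.
pi_h = {"start": "press_B0", "after_X": "press_B1"}

def p_rational(R):
    """p_r: the 'fully rational' planner, here a simple greedy argmax over R."""
    return {h: max(ACTIONS, key=lambda a: R(h, a)) for h in HISTORIES}

def p_anti_rational(R):
    """-p_r: the 'fully anti-rational' planner, defined by (-p_r)(R) = p_r(-R)."""
    return p_rational(lambda h, a: -R(h, a))

def p_indifferent(R):
    """p_i: maps every reward whatsoever to the observed policy pi_h."""
    return dict(pi_h)

def R_r(h, a):          # a reward fitted to pi_h: rewards exactly the observed action
    return 1.0 if pi_h[h] == a else 0.0

def neg_R_r(h, a):      # -R_r
    return -R_r(h, a)

def R_trivial(h, a):    # R_i: trivial, constant reward
    return 0.0

pairs = {
    "(p_r, R_r)": (p_rational, R_r),
    "(-p_r, -R_r)": (p_anti_rational, neg_R_r),
    "(p_i, R_i)": (p_indifferent, R_trivial),
}
for name, (p, R) in pairs.items():
    print(name, "compatible with pi_h:", p(R) == pi_h)
# All three print True: each pair reproduces pi_h exactly.
```

All three pairs reproduce $\pi_h$ exactly, which is the whole problem: compatibility alone does not favour the “plausible” pair.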

Distinguishing among compatible pairs

How can we distinguish between compatible pairs? At first appearance, we can’t. That’s because, by the very definition of compatibility, all pairs produce the correct policy $\pi_h$. And once we have $\pi_h$, further observations of $h$ tell us nothing.

I initially thought that Kolmogorov or algorithmic complexity might help us here. But in fact:

Theorem: The pairs $(p_r, R_r)$, $(-p_r, -R_r)$, $(p_i, R_i)$, and $(-p_i, -R_i)$ are either simpler than $(p_0, R_0)$, or differ in Kolmogorov complexity from it by a constant that is independent of $(p_0, R_0)$.

Proof: The cases of $(-p_r, -R_r)$ and $(-p_i, -R_i)$ are easy, as these differ from $(p_r, R_r)$ and $(p_i, R_i)$ by two minus signs. Given $(p_0, R_0)$, a fixed-length algorithm computes $\pi_h = p_0(R_0)$. Then a fixed-length algorithm defines $p_i$ (by mapping any input reward to $\pi_h$). Furthermore, given $\pi_h$ and any history, a fixed-length algorithm computes the action the agent will take; then a fixed-length algorithm defines $p_r$ and $R_r$ for $(p_r, R_r)$.
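Spelled out, the bound the proof sketches can be written as (with $K$ denoting Kolmogorov complexity and the $c_j$ constants coming from the fixed-length algorithms above, not from $(p_0, R_0)$):

$$
\begin{aligned}
K(p_i, R_i) &\le K(\pi_h) + c_1, \qquad & K(p_r, R_r) &\le K(\pi_h) + c_2,\\
K(-p_i, -R_i) &\le K(p_i, R_i) + c_3, \qquad & K(-p_r, -R_r) &\le K(p_r, R_r) + c_4,\\
K(\pi_h) &\le K(p_0, R_0) + c_5 & &\text{(since } \pi_h = p_0(R_0)\text{)},
\end{aligned}
$$

so each of the four degenerate pairs has complexity at most $K(p_0, R_0)$ plus a constant that does not depend on $(p_0, R_0)$.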

So the Kolmogorov complexity can shift between planner and reward (it sits entirely in the planner for $(p_i, R_i)$ and entirely in the reward for $(p_r, R_r)$), but it seems that the complexity of the pair doesn’t go up during these shifts.

This is puzzling. It seems that, in principle, one cannot assume anything about $H$’s reward at all! $R_r$, $-R_r$, and the trivial $R_i$ all fit, and via $p_i$ the observed policy is compatible with any possible reward $R$. If we give up the assumption of human rationality (which we must), it seems we can’t say anything about the human reward function. So it seems IRL must fail.


The next post in the Value Learning sequence will be ‘Latent Variables and Model Mis-specification’ by Jacob Steinhardt, and will be posted on Wednesday 7th November.

Tomorrow’s AI Alignment Forum sequences post will be ‘Subsystem Alignment’, in the Embedded Agency sequence.