Agents That Learn From Human Behavior Can’t Learn Human Values That Humans Haven’t Learned Yet

[Epistemic status: ¯\_(ツ)_/¯ ]

Armstrong and Mindermann write about a no-free-lunch theorem for inverse reinforcement learning (IRL): the same action can reflect many different combinations of values and (irrational) planning algorithms.
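Here is a toy sketch of that kind of underdetermination, not the paper's formal statement; the actions, rewards, and planner functions below are all made up for illustration:

```python
# Two different (values, planner) hypotheses that produce the same behavior.
ACTIONS = ["left", "right", "stay"]
reward = {"left": 0.0, "right": 1.0, "stay": 0.5}  # hypothetical reward function

def rational_planner(r):
    """Pick the action that maximizes the given reward."""
    return max(ACTIONS, key=lambda a: r[a])

def anti_rational_planner(r):
    """Pick the action that minimizes the given reward."""
    return min(ACTIONS, key=lambda a: r[a])

negated_reward = {a: -v for a, v in reward.items()}

# A rational agent with `reward` and an anti-rational agent with `negated_reward`
# choose the same action, so the observed behavior can't distinguish them.
assert rational_planner(reward) == anti_rational_planner(negated_reward)
```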

I think even assuming humans were fully rational expected utility maximizers, there would be an important underdetermination problem with IRL and with all other approaches that infer human preferences from their actual behavior. This is probably obvious if and only if it’s correct, and I don’t know if any non-straw people disagree, but I’ll expand on it anyway.

Consider two rational expected-utility-maximizing humans, Alice and Bob.

Alice is, herself, a value learner. She wants to maximize her true utility function, but she doesn’t know what it is, so in practice she uses a probability distribution over several possible utility functions to decide how to act.

If Alice received further information (from a moral philosopher, maybe), she’d start maximizing a specific one of those utility functions instead. But we’ll assume that her information stays the same while her utility function is being inferred, and she’s not doing anything to get more; perhaps she’s not in a position to.

Bob, on the other hand, isn’t a value learner. He knows what his utility function is: it’s a weighted sum of the same several utility functions. The relative weights in this mix happen to be identical to Alice’s relative probabilities.

Alice and Bob will act the same. They’ll maximize the same linear combination of utility functions, for different reasons. But if you could find out more than Alice knows about her true utility function, then you’d act differently if you wanted to truly help Alice than if you wanted to truly help Bob.
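A minimal sketch of this, with made-up actions, candidate utility functions, and weights (none of them from the post): Alice maximizes expected utility under her credences, Bob maximizes the corresponding weighted sum, and the two computations coincide until we learn more about Alice’s true utility function.

```python
ACTIONS = ["a1", "a2"]

# Two hypothetical candidate utility functions.
U1 = {"a1": 1.0, "a2": 0.0}
U2 = {"a1": 0.0, "a2": 1.0}
candidates = {"U1": U1, "U2": U2}

credences = {"U1": 0.7, "U2": 0.3}  # Alice's probabilities over candidates
weights   = {"U1": 0.7, "U2": 0.3}  # Bob's fixed weights (identical by stipulation)

def alice_choice():
    """Alice: maximize expected utility under her moral uncertainty."""
    return max(ACTIONS, key=lambda a: sum(credences[k] * candidates[k][a] for k in candidates))

def bob_choice():
    """Bob: maximize his one known utility function, the weighted sum itself."""
    bob_utility = {a: sum(weights[k] * candidates[k][a] for k in candidates) for a in ACTIONS}
    return max(ACTIONS, key=lambda a: bob_utility[a])

# Identical behavior, so behavior-based inference can't tell them apart.
assert alice_choice() == bob_choice()

# But suppose we somehow learned that Alice's true utility function is U2.
# Helping Alice would then mean optimizing U2, while helping Bob would still
# mean optimizing the weighted sum: different recommendations, same behavioral data.
def help_alice_after_update():
    return max(ACTIONS, key=lambda a: U2[a])

assert help_alice_after_update() != bob_choice()
```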

So in some cases, it’s not enough to look at how humans behave. Humans are Alice on some points and Bob on some points. Figuring out details will require explicitly addressing human moral uncertainty.