Agents That Learn From Human Behavior Can’t Learn Human Values That Humans Haven’t Learned Yet

[Epistemic status: ¯\\_(ツ)_/¯]

Armstrong and Mindermann write about a no-free-lunch theorem for inverse reinforcement learning (IRL): the same action can be produced by many different combinations of values and (possibly irrational) planning algorithms.

I think that even if humans were fully rational expected-utility maximizers, there would still be an important underdetermination problem with IRL, and with every other approach that infers human preferences from their actual behavior. This is probably obvious if and only if it’s correct, and I don’t know whether any non-straw people disagree, but I’ll expand on it anyway.

Consider two rational expected-utility-maximizing humans, Alice and Bob.

Alice is, herself, a value learner. She wants to maximize her true utility function, but she doesn’t know what it is, so in practice she decides how to act by maximizing expected utility under a probability distribution over several candidate utility functions.

If Alice received further information (from a moral philosopher, maybe), she’d start maximizing a specific one of those utility functions instead. But we’ll assume that her information stays the same while her utility function is being inferred, and she’s not doing anything to get more; perhaps she’s not in a position to.

Bob, on the other hand, isn’t a value learner. He knows what his utility function is: a weighted sum of those same candidate utility functions. The relative weights in his mix happen to be identical to Alice’s relative probabilities.
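In symbols (notation mine, not from the Armstrong–Mindermann paper): write $U_1, \dots, U_n$ for the candidate utility functions, $p_i$ for Alice’s probability that $U_i$ is her true one, and $w_i$ for Bob’s weights. Then both of them choose

$$\arg\max_a \, \mathbb{E}_{U \sim p}\!\left[U(a)\right] \;=\; \arg\max_a \sum_{i=1}^{n} p_i \, U_i(a) \;=\; \arg\max_a \sum_{i=1}^{n} w_i \, U_i(a) \qquad \text{whenever } w_i = p_i.$$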

Alice and Bob will act the same: they’ll maximize the same linear combination of utility functions, for different reasons. But if you could find out more than Alice currently knows about her true utility function, then truly helping Alice would call for different actions than truly helping Bob.
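To make that concrete, here’s a toy sketch in Python (every number, action, and candidate utility function in it is invented for illustration): from behavior alone the two look identical, but learning which candidate is Alice’s true utility function changes what it means to help her, and doesn’t change what it means to help Bob.

```python
# Toy illustration with invented numbers: two candidate utility functions
# over three possible actions.
candidate_utilities = {
    "U1": {"a": 1.0, "b": 0.0, "c": 0.6},
    "U2": {"a": 0.0, "b": 1.0, "c": 0.6},
}
actions = ["a", "b", "c"]

# Alice's credences over which candidate is her true utility function,
# and Bob's weights in his (known) mixture utility function.
alice_probs = {"U1": 0.5, "U2": 0.5}
bob_weights = {"U1": 0.5, "U2": 0.5}  # same numbers as Alice's credences


def best_action(weights):
    """Return the action maximizing the weighted sum of candidate utilities."""
    def score(action):
        return sum(w * candidate_utilities[name][action] for name, w in weights.items())
    return max(actions, key=score)


# Observed behavior: both maximize the same linear combination, so an
# observer inferring preferences from behavior sees no difference.
print(best_action(alice_probs))   # 'c' -- Alice's choice under uncertainty
print(best_action(bob_weights))   # 'c' -- Bob's choice, identical

# Now suppose we (but not Alice) learn that U2 is Alice's true utility
# function. The best way to help Alice changes; the best way to help Bob
# does not, because his utility function really is the mixture.
print(best_action({"U2": 1.0}))   # 'b' -- what truly helps Alice
print(best_action(bob_weights))   # 'c' -- what truly helps Bob, unchanged
```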

So in some cases, it’s not enough to look at how humans behave. Humans are Alice on some points and Bob on others. Figuring out the details will require explicitly addressing human moral uncertainty.