Humans aren’t agents—what then for value learning?

Humans aren’t agents in the same way a thermostat isn’t an agent.

A specimen of Temperatus moderatu humilis

Consider a truly humble thermostat. One temperature sensor, one set point, one output to turn the furnace on while the sensed temperature is below the set point. If we’re being generous, we might construct an intentional stance model of the world in which this thermostat is an agent that wants the house to be at the set point.
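For concreteness, here is a minimal sketch (in Python, with made-up names; no real thermostat or library is being described) of the entire mechanism we are about to dress up in intentional-stance language:

```python
# A minimal sketch of the humble thermostat's rule: one sensed temperature,
# one set point, one output. All names here are illustrative.
def furnace_on(sensed_temp: float, set_point: float) -> bool:
    """Run the furnace while the sensed temperature is below the set point."""
    return sensed_temp < set_point
```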

But if the environment were a little bit different—say we block the vents leading to the upstairs—the thermostat doesn’t try to unblock the vents. In the changed environment, it just acts as if it wants only the downstairs to be at the set point.

It is not that one of these is what the thermostat Really Wants and the other isn’t. This entire model in which the thermostat wants things is a convenient fiction we’re using to help think about the world. To ask what the thermostat Really Wants even after you know all the physical facts is an error, driven by the human tendency to mix up properties of our models with properties of objects.

You can fix this by modeling the thermostat in a way that correctly predicts its behavior in more environments, but every such expansion of the space of environments makes your model of the thermostat more concrete and less agenty. Eventually you end up with something like “It wants to increase the output signal when the input voltage is smaller than the voltage controlled by the dial on the front,” at which point you might as well strip off the veneer of it “wanting” anything and predict it using physics.

This is what humans are like. In the ancestral environment, I would behave like someone who wants to eat fresh fruits and vegetables. Introduce Doritos to the environment, and I’ll eat those instead. To expand the space of environments to include Doritos, you had to make your model of me more concrete (i.e. “Charlie wants to eat things that taste good”). If you pump heroin into my brain, I’ll behave like someone who wants more heroin—which you can predict if you stop modeling me in terms of tastes and start modeling me in terms of anatomy and chemistry.

The model of me as someone who wants to eat fresh fruits and vegetables didn’t fail because I have True Values that eating Doritos fulfills better than eating wild berries; it failed because the environment was altered in a way that falls outside the domain of validity of the ancestral model.

It’s just like how the thermostat doesn’t Really Want anything in particular. When the vents are unblocked, interpreting the thermostat as wanting the whole house to be at the set point is a useful model. When you place me in the ancestral environment, interpreting me as wanting to eat fresh fruits and vegetables is a useful model of me.

Humans’ apparent values can change with the environment. Put us in the ancestral environment and we’ll behave as if we value nutrition and reproduction. Put us in the modern environment and we’ll behave as if we value Doritos and sex; we can model this transition by being less idealized about humans. Pump heroin into our brains and we’ll behave as if we want more heroin; we can model this by being even less idealized. There is no One True level of idealization at which the True Values live.

This has direct consequences for value learning, which is the attempt to program computers to infer human values. You cannot just say “assume humans are agents and infer their values,” because there is no True interpretation of human behavior in terms of an agent’s desires. This is, finally, what I mean by saying that humans are not agents: in the context of value learning, it won’t work to tell the computer to assume that humans are agents.
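To see what that naive recipe looks like in code, here is a toy sketch. The options, the utility numbers, and the Boltzmann-rationality assumption are all made up for illustration; this is not any real value-learning system. We pick a family of “agent” stories about the human, then ask which story makes the observed behavior most likely:

```python
# Toy sketch of "assume humans are agents and infer their values."
# Everything here (options, numbers, the noisy-rationality model) is made up.
import math

# Observed behavior in the modern environment: the human picks Doritos
# over wild berries almost every time.
observations = ["doritos"] * 9 + ["berries"] * 1

# Two candidate "agent" stories at different levels of idealization,
# each assigning a utility to the available options.
candidate_models = {
    "wants nutrition":  {"doritos": 0.0, "berries": 1.0},
    "wants tasty food": {"doritos": 1.0, "berries": 0.2},
}

def boltzmann_choice_prob(utilities, choice, beta=3.0):
    """P(choice) for a noisily-rational (Boltzmann) agent with these utilities."""
    z = sum(math.exp(beta * u) for u in utilities.values())
    return math.exp(beta * utilities[choice]) / z

def log_likelihood(utilities, data):
    """How well this story about the human's values explains the observations."""
    return sum(math.log(boltzmann_choice_prob(utilities, c)) for c in data)

for name, utilities in candidate_models.items():
    print(f"{name}: log-likelihood = {log_likelihood(utilities, observations):.2f}")
```

Under these made-up numbers, the “tasty food” story fits the modern-environment data better and the “nutrition” story fits worse; swap the environment the data came from, or the list of candidate stories, and the “inferred values” change with them. Nothing in the procedure picks out the human’s True Values.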

What then for value learning? Well, step 1 is to accept that if the AI is going to learn something about human morality, it’s going to learn to tell a certain sort of story about humans, one that features human desires and beliefs in a way suitable for guiding the AI’s plans. This class of stories is not going to be the One True way of thinking about humans, and so this AI might have to learn from humans how humans model humans.

There is a second half to this post. Given that these stories about human desires depend on the environment, and given that our opinions about the best way to interpret humans rest on famously fallible human intuition, won’t these stories be at risk of failing under pressure from optimization over the vast space of possible environments?

Yes.

But rather than repeat what Scott has already said better, I’ll just point you to The Tails Coming Apart as Metaphor for Life.