Humans aren’t agents—what then for value learning?

Humans aren’t agents in the same way a thermostat isn’t an agent.

Consider a truly humble thermostat. One temperature sensor, one set point, one output to turn the furnace on while the sensed temperature is below the set point. If we’re being generous, we might construct an intentional stance model of the world in which this thermostat is an agent that wants the house to be at the set point.
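
In code, that whole mechanism fits in a few lines. This is a minimal illustrative sketch; the class and names are mine, not anything from the post:

```python
# A deliberately humble thermostat: one sensor reading in, one on/off signal out.
class Thermostat:
    def __init__(self, set_point: float):
        self.set_point = set_point  # the dial on the front

    def furnace_on(self, sensed_temperature: float) -> bool:
        # One comparison. No model of the house, no vents, no planning,
        # nothing on the inside that looks like "wanting".
        return sensed_temperature < self.set_point


thermostat = Thermostat(set_point=20.0)
print(thermostat.furnace_on(18.0))  # True: furnace runs while the sensed temperature is below the set point
print(thermostat.furnace_on(21.0))  # False
```

Nothing in the sketch refers to the house, the vents, or the upstairs at all; the reading on which it “wants the house at the set point” lives entirely in our model of it.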

But if the environment were a little bit different—say we block the vents leading to the upstairs—the thermostat doesn’t try to unblock the vents. In the changed environment, it just acts as if it wants only the downstairs to be at the set point.

It is not that one of these is what the thermostat Really Wants and the other isn’t. This entire model in which the thermostat wants things is a convenient fiction we’re using to help think about the world. To ask what the thermostat Really Wants even after you know all the physical facts is an error, driven by the human tendency to mix up properties of our models with properties of objects.

You can fix this, and model the thermostat in a way that correctly predicts its behavior in more environments, but every time you make such an expansion of the space of environments, you make your model of the thermostat more concrete and less agenty. Eventually you end up with something like “It wants to increase the output signal when the input voltage is smaller than the voltage controlled by the dial on the front,” at which point you might as well strip off the veneer about it “wanting” anything and predict it using physics.

This is what humans are like. In the ancestral environment, I would behave like someone who wants to eat fresh fruits and vegetables. Introduce Doritos to the environment, and I’ll eat those instead. To expand the space of environments to include Doritos, you had to make your model of me more concrete (i.e. “Charlie wants to eat things that taste good”). If you pump heroin into my brain, I’ll behave like someone who wants more heroin—which you can predict if you stop modeling me in terms of tastes and start modeling me in terms of anatomy and chemistry.

The model of me as someone who wants to eat fresh fruits and vegetables didn’t fail because I have True Values and eating Doritos fulfills my True Values better than eating wild berries, but because the environment has been altered in a way that happens to be beyond the domain of validity of the ancestral model.

It’s just like how the thermostat doesn’t Really Want anything in particular. When the environment has the vents unblocked, interpreting the thermostat as wanting to control the whole house is a useful model. When you place me in the ancestral environment, interpreting me as wanting to eat fresh fruits and vegetables is a useful model of me.

Humans’ apparent values can change with the environment. Put us in the ancestral environment and we’ll behave as if we like nutrition and reproducing. Put us in the modern environment and we’ll behave as if we like Doritos and sex—we can model this transition by being less idealized about humans. Pump heroin into our brains and we’ll behave as if we want more—we can model this by being even less idealized. There is no One True level of idealization at which the True Values live.

This has direct consequences for value learning, which is the attempt to program computers to infer human values. You cannot just say “assume humans are agents and infer their values,” because there is no True interpretation of human behavior in terms of an agent’s desires. This is, finally, what I mean by saying that humans are not agents: in the context of value learning, it won’t work to tell the computer to assume that humans are agents.
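
To make the naive recipe concrete, here is a toy sketch of “assume the human is an agent and infer their values.” It is entirely my own illustration: the Boltzmann-rational choice model, the `infer_values` helper, and the made-up data are assumptions for the example, not something the post proposes.

```python
import math
from collections import Counter

def infer_values(observed_choices, options, steps=2000, lr=0.1):
    """Fit per-option 'values' so that P(choose o) ~ exp(values[o]) matches the observed choices."""
    values = {o: 0.0 for o in options}
    counts = Counter(observed_choices)
    n = len(observed_choices)
    for _ in range(steps):
        z = sum(math.exp(v) for v in values.values())
        p_model = {o: math.exp(values[o]) / z for o in options}
        for o in options:
            p_data = counts[o] / n
            values[o] += lr * (p_data - p_model[o])  # gradient ascent on the log-likelihood
    return {o: round(v, 2) for o, v in values.items()}

# "Ancestral" environment: wild berries are the only food on offer.
print(infer_values(["berries"] * 10, options=["berries", "nothing"]))

# "Modern" environment: add Doritos, and the observed behavior (and hence the
# inferred "values") shifts, even though it is the same physical human.
print(infer_values(["doritos"] * 9 + ["berries"], options=["berries", "doritos", "nothing"]))
```

The same physical human gets two different sets of inferred “values,” and nothing inside the procedure can say which one is Real; the answer was fixed by the agent model we assumed and the environment we happened to observe.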

What then for value learning? Well, step 1 is to accept that if the AI is going to learn something about human morality, it’s going to learn to tell a certain sort of story about humans, which features human desires and beliefs in a way suitable to guide the AI’s plans. This class of stories is not going to be the One True way of thinking about humans, and so this AI might have to learn from humans about how they model humans.

There is a second half of this post. Given that these stories about human desires are dependent on the environment, and given that our opinion about the best way to interpret humans involves some famously fallible human intuition, won’t these stories be at risk of failure under pressure from optimization in the vast space of possible environments?

Yes.

But rather than repeat what Scott has already said better, I’ll just point you to The Tails Coming Apart as Metaphor for Life if you want to read about it.