Practical consequences of impossibility of value learning

There is a No Free Lunch result in value-learning. Essentially, you can’t learn the preferences of an agent from its behaviour unless you make assumptions about its rationality, and you can’t learn its rationality unless you make assumptions about its preferences.

More importantly, simplicity/​Occam’s razor/​regularisation don’t help with this, unlike with most No Free Lunch theorems. Among the simplest explanations of human behaviour are:

  1. We are fully rational, all the time.

  2. We are fully anti-rational, all the time.

  3. We don’t actually prefer anything to anything.
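These three explanations can be made concrete in a toy model (my own construction, for illustration only, not the formal theorem): behaviour is produced by a planner (the rationality model) applied to a reward (the preferences), and very different (planner, reward) pairs can reproduce exactly the same observed behaviour.

```python
# Toy sketch: three (planner, reward) pairs that all produce the same
# observed behaviour, so behaviour alone cannot tell them apart.
# All names here are hypothetical, invented for this illustration.

actions = ["a", "b"]  # observed behaviour: the agent always picks "a"

# Explanation 1: fully rational agent that prefers "a".
rational = lambda reward, acts: max(acts, key=reward)
prefers_a = {"a": 1, "b": 0}.__getitem__

# Explanation 2: fully anti-rational agent that prefers "b".
anti_rational = lambda reward, acts: min(acts, key=reward)
prefers_b = {"a": 0, "b": 1}.__getitem__

# Explanation 3: no preferences at all: flat reward, planner ignores it.
arbitrary = lambda reward, acts: acts[0]
indifferent = {"a": 0, "b": 0}.__getitem__

explanations = [(rational, prefers_a),
                (anti_rational, prefers_b),
                (arbitrary, indifferent)]

for planner, reward in explanations:
    print(planner(reward, actions))  # each prints "a"
```

Each explanation fits the data perfectly, and no amount of further behavioural data distinguishes them.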

That result, though mathematically valid, seems highly theoretical, and of little practical interest: after all, for most humans, it’s obvious what other humans want, most of the time. But I’ll argue that the result has strong practical consequences.

Identifying clickbait

Suppose that Facebook or some other corporation decides to cut down on the amount of clickbait on its feeds.

This shouldn’t be too hard, the programmers reason. They start by selecting a set of clickbait examples, and check how people engage with these. They programme a neural net to recognise that kind of “engagement” on other posts, which nets a large amount of candidate clickbait. They then go through the candidate posts, labelling the clear examples of clickbait and the clear non-examples, and add these to the training and test sets. They retrain and improve the neural net. A few iterations later, their neural net is well trained, and they let it run on all posts, occasionally auditing the results. To make the process more transparent, they run interpretability methods on the neural net, seeking to isolate the key components of clickbait and to clear away errors or over-fits: maybe the equivalent, for clickbait, of removing the “look for images of human arms” feature in dumbbell-identification nets.
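The loop the programmers follow might be sketched like this. Everything here is a hypothetical stand-in of my own: the toy “model” just flags posts that share a word with known clickbait, and `human_label` is a hard-coded rule standing in for a person’s judgement.

```python
# Hypothetical sketch of the iterative labelling loop described above.
# The comments mark where a programmer judgement call enters the process.

def train_classifier(labelled):
    """Toy stand-in for the neural net: flag any post sharing a word
    with a post currently labelled as clickbait."""
    bait_words = {word for post, is_bait in labelled.items() if is_bait
                  for word in post.split()}
    return lambda post: any(word in bait_words for word in post.split())

def human_label(post):
    """Stand-in for the programmers' judgement on clear cases."""
    return "SHOCKING" in post  # in reality, a human decides here

def build_detector(seed_clickbait, all_posts, rounds=2):
    labelled = {post: True for post in seed_clickbait}  # seed selection: a choice
    for _ in range(rounds):                             # number of rounds: a choice
        model = train_classifier(labelled)
        for post in all_posts:
            if post not in labelled and model(post):
                labelled[post] = human_label(post)      # labelling: a choice again
    return train_classifier(labelled)

posts = ["SHOCKING truth about cats",
         "quarterly earnings report",
         "you will not believe this SHOCKING trick"]
detector = build_detector(["SHOCKING truth about cats"], posts)
```

The point of the sketch is where the comments sit: the seed set, the labelling function, and the number of iterations are all programmer choices, not discoveries.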

The central issue

Could that method work? Possibly. With enough data and enough programming effort, it certainly seems that it could. So, what’s the problem?

The problem is that many stages of the process require choices on the part of the programmers: the initial selection of clickbait examples; the labelling of candidates at the second stage; the number of cycles of iteration and improvement; the choice of explicit hyper-parameters and of implicit ones (like how long to run each iteration); the auditing process; the selection of key components. All of these rely on the programmers being able to identify clickbait, or the features of clickbait, when they see them.

And that might not sound bad; if we wanted to identify photos of dogs, for example, we would follow a similar process. But there is a key difference. There is a somewhat objective definition of “dog” (though beware ambiguous cases), and the programmers, when making choices, will be approximating or finding examples of this definition. But there is no objective, semi-objective, or even somewhat objective definition of clickbait.

Why? Because the definition of clickbait depends on assessing the preferences of the human who sees it. It can be roughly defined as “something a human is likely to click on (behaviour), but wouldn’t really, ultimately, want to see (preference)”.
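That rough definition can be written down directly as a predicate. Both scoring functions below are hypothetical, and the split matters: a click model can be fitted to behaviour logs, but the preference model is exactly the thing the No Free Lunch result says cannot be read off behaviour without extra assumptions.

```python
# Hypothetical predicate form of the definition above: likely to be
# clicked (behaviour) but not really wanted (preference). p_click and
# pref_value are assumed to come from two *separate* models.

def is_clickbait(post, p_click, pref_value,
                 click_threshold=0.8, pref_threshold=0.2):
    return p_click(post) > click_threshold and pref_value(post) < pref_threshold

# Toy usage with made-up scores:
p_click = lambda post: 0.9 if "SHOCKING" in post else 0.1
pref_value = lambda post: 0.1 if "SHOCKING" in post else 0.7
```

Nothing in the behaviour data alone pins down `pref_value`; someone has to supply it, or supply the assumptions it is built from.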

And, crucially, the No Free Lunch theorem applies to humans too. So humans can’t deduce preferences or rationality from behaviour, at least not without making assumptions.

So how do we get around the problem? Humans do, after all, often deduce the preferences and rationality of other humans, and other humans, including the one being assessed, will often agree with those deductions. How do we do it?

Well, drumroll, we do it by… making assumptions. And since evolution is so very lazy, the assumptions that humans make, about each other’s rationality and preferences and about their own, are all very similar. Not identical, of course, but compared with a random agent making random assumptions to interpret the behaviour of another random agent, humans are essentially all the same.

This means that, to a large extent, it is perfectly valid for programmers to use their own assumptions when defining clickbait, or in other situations that require assessing the values of others. Indeed, until we solve the issue in general, this may be the only way of doing it; it’s certainly the only easy way.

The lesson

So, are there any practical consequences of all this? The important thing is for programmers to realise that they are using their own assumptions, and to take this into account when programming. Even things that feel like mere “debugging”, removing obvious failure modes, can inject their assumptions into the system. This has two major consequences:

  1. These assumptions don’t form a neat category that “carves reality at its joints”. Concepts such as “dog” are somewhat ambiguous, but concepts like “human preferences” will be even more so, because they are a series of evolutionary kludges rather than a single natural thing. We should therefore expect that extrapolating programmer assumptions, or moving to a new distribution, will result in bad behaviour that has to be patched anew with more assumptions.

  2. There are cases where the programmers’ assumptions and those of the users may diverge; looking out for these situations is important. This is easier if programmers realise they are making assumptions, rather than approximating objectively true categories.