Learning human preferences: optimistic and pessimistic scenarios

In this post and the next, I try to clarify, for myself and for others, the precise practical implications of the "Occam's razor is insufficient to infer the preferences of irrational agents" paper.

Time and again, I've had trouble getting others to understand what that paper implies, and what it doesn't. It's neither irrelevant (like many no-free-lunch theorems), nor is it a radical "skepticism/nothing is real/we can't really know anything" paper.

I've been having productive conversations with Rebecca Gorman, whom I want to thank for her help (and who phrased things well in terms of latent variables)!

A simple biased agent

Consider the following simple model of an agent:

The agent's actions can be explained by their beliefs and preferences[1], and by their biases: by "biases", we mean the ways in which the action selector differs from an unboundedly rational expected preference maximiser.
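As a rough illustrative sketch (the functions, numbers, and the particular status-quo bias below are my own toy example, not anything from the paper), the decomposition can be pictured like this:

```python
# Toy model: beliefs give the probability of each world state, preferences give
# the utility of each (state, action) pair, and "bias" is whatever makes the
# action selector deviate from straightforward expected-preference maximisation.

def expected_value(action, beliefs, prefs):
    return sum(p * prefs[(state, action)] for state, p in beliefs.items())

def rational_selector(actions, beliefs, prefs):
    # Unboundedly rational: pick the action with the highest expected preference.
    return max(actions, key=lambda a: expected_value(a, beliefs, prefs))

def status_quo_biased_selector(actions, beliefs, prefs, default):
    # Same beliefs and preferences, but with a toy status-quo bias: only move
    # away from the default action if some alternative is better by at least 0.5.
    best = rational_selector(actions, beliefs, prefs)
    if expected_value(best, beliefs, prefs) > expected_value(default, beliefs, prefs) + 0.5:
        return best
    return default

beliefs = {"rain": 0.3, "sun": 0.7}
prefs = {("rain", "umbrella"): 1.0, ("sun", "umbrella"): 0.2,
         ("rain", "no umbrella"): -1.0, ("sun", "no umbrella"): 0.5}
actions = ["umbrella", "no umbrella"]

print(rational_selector(actions, beliefs, prefs))                          # -> "umbrella"
print(status_quo_biased_selector(actions, beliefs, prefs, "no umbrella"))  # -> "no umbrella"
```

The same beliefs and preferences produce different actions depending on the action selector; the gap between the two selectors is what "bias" refers to here.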

The results of the Occam's razor paper imply that preferences (and beliefs, and biases) cannot be deduced separately from knowing the agent's policy (and hence, a fortiori, from any observations of the agent's behaviour).
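Concretely, the paper shows that several very simple (planner, reward) decompositions are compatible with any given policy. The toy check below (my own code, with invented numbers) illustrates the pattern with three decompositions of the kind discussed in the paper: fully rational, fully anti-rational, and fully indifferent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
observed_policy = rng.integers(n_actions, size=n_states)   # the agent's observed choices

# Build a reward for which the observed action is strictly best in each state.
reward = rng.normal(size=(n_states, n_actions))
for s, a in enumerate(observed_policy):
    reward[s, a] = reward[s].max() + 1.0

decompositions = [
    ("rational planner, reward R",       lambda R: np.argmax(R, axis=1), reward),
    ("anti-rational planner, reward -R", lambda R: np.argmin(R, axis=1), -reward),
    ("indifferent planner, zero reward", lambda R: observed_policy,      np.zeros_like(reward)),
]

for name, planner, R in decompositions:
    assert np.array_equal(planner(R), observed_policy), name
print("All three (planner, reward) pairs reproduce the same observed policy.")
```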

Latent and "natural" variables

Let $v$ be a latent variable of the policy $\pi_H$, or some variable that can be deduced from $\pi_H$ in some simple or natural way. (Here $\pi_H$ denotes the policy of the human $H$.)

A consequence of the Occam's razor result is that any such $v$ will typically be a mixture of preferences, beliefs, and biases. For if such variables tended to be restricted to one of these three components, that would mean that separating them would be possible via latent or simple variables (contradicting the result).

So, for example, if we conducted a principal component analysis on $\pi_H$, we would expect the components to all be mixes of preferences/beliefs/biases.
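As a purely illustrative sketch (the three underlying factors and the mixing weights are invented for the example; nothing here comes from real behavioural data), this is what that expectation looks like mechanically: features extracted from behaviour are already mixtures, so the principal components are too.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Hypothetical underlying factors, which the algorithm cannot observe directly.
preference = rng.normal(size=n)
belief     = rng.normal(size=n)
bias       = rng.normal(size=n)

# Observable behavioural features of the policy: each is already a mix of the three.
X = np.column_stack([
    0.6 * preference + 0.3 * belief + 0.1 * bias,
    0.2 * preference + 0.5 * belief + 0.3 * bias,
    0.4 * preference + 0.1 * belief + 0.5 * bias,
])
X -= X.mean(axis=0)

# PCA via SVD: the principal components are directions in feature space, so
# each component is itself a blend of preference/belief/bias influences.
_, _, components = np.linalg.svd(X, full_matrices=False)
for i, comp in enumerate(components):
    print(f"PC{i+1} feature weights: {np.round(comp, 2)}")
```

In this toy setup the underlying factors are independent Gaussians, so no rotation of the components is privileged: the decomposition into the "true" factors is exactly the information that the data does not supply.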

The optimistic scenario

To get around the impossibility result, we need "normative assumptions": assumptions about the preferences (or beliefs, or biases) of the agent that cannot be deduced fully from observations.

Under the optimistic scenario, we don't need many of these, at least for identifying human preferences. We can label a few examples ("the anchoring bias, as illustrated in this scenario, is a bias"; "people are at least weakly rational"; "humans often don't think about new courses of action they've never seen before", etc.). Call this labelled data[2] $D$.

The algorithm now constructs the categories preferences*, beliefs*, and biases*: these are the generalisations that it has achieved from $D$. Optimistically, these correspond quite closely to what we mean by those categories, at least when combined with the information in $\pi_H$, the policy of the human $H$. It is now possible for the algorithm to identify latent or natural variables that lie along the "preferences", "beliefs", and "biases" axes, thus identifying and isolating human preferences.

It seems there's a contradiction here: by definition, $D$ does not contain much information, yet separating out preferences may require a lot of information. The hope is that $D$ acts as a doorway to other sources of information, such as human psychology papers, Wikipedia, human fiction, and so on. Call this other data $E$.

The Occam's razor result still applies to the combination of $\pi_H$ and $E$: one of the simplest explanations for this data is to assume that $H$ is always rational and that $E$ consists of "speech acts" (think of a dishonest politician's speech: you would not want to take the literal content of the speech as correct information). The result still applies even if we add the policies of every human in the set of all humans.

However, it is hoped that $D$ will allow the algorithm to effectively separate preferences from biases and beliefs. The hope is that $D$ acts as a key to unlock the vast amount of information in $E$: once the algorithm has a basic idea of what a preference is, all the human literature on the meaning of preference becomes usable, not merely as speech acts but as actual sources of information, as the algorithm interprets the meaning of $E$ the way we want it to, and recognises what is a lie, a metaphor, or an exaggeration.
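One very crude way to picture this "key" mechanism (my own sketch, with synthetic data and a deliberately simple nearest-centroid classifier; it is not a proposal for how a real algorithm would work) is as self-training: the few labelled examples in $D$ seed a classifier, which then pseudo-labels the parts of $E$ it is confident about, and retrains.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_centroids(X, y):
    # One centroid per class label.
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, X):
    labels = sorted(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[l], axis=1) for l in labels])
    return np.array(labels)[dists.argmin(axis=0)], dists.min(axis=0)

# D: a tiny labelled seed (0 = "preference", 1 = "bias"); E: a large unlabelled corpus.
D_X = np.array([[0.0, 1.0], [1.0, 0.0]])
D_y = np.array([0, 1])
E_X = rng.normal(0.5, 0.6, size=(500, 2))

X, y = D_X, D_y
for _ in range(3):
    centroids = fit_centroids(X, y)
    pseudo, dist = predict(centroids, E_X)
    keep = dist < 0.5                      # only keep confident pseudo-labels
    X = np.vstack([D_X, E_X[keep]])
    y = np.concatenate([D_y, pseudo[keep]])

print(f"pseudo-labelled {keep.sum()} of {len(E_X)} items in E")
```

The optimistic scenario is the one in which each round of this kind of bootstrapping genuinely improves the starred categories.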

This is what we would hope would happen. Guided by our own intuitions, which have no problem distinguishing preferences in other humans and in ourselves, at least roughly, we may feel that this is likely.

The pessimistic scenario

In the pessimistic scenario, human preferences, biases, and beliefs are twisted together in a far more complicated way, and cannot be separated by a few examples.

Consider, for example, the anchoring bias. I've argued that the anchoring bias is formally very close to being a taste preference.
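To make "formally very close" concrete, here is a toy sketch (the functions, weights, and bidding setup are invented for illustration; this is not the original formalisation of that argument): the same bidding behaviour can be read either as a bias or as an unusual taste.

```python
# Two readings of the same behaviour: an agent whose bid is pulled toward a
# random anchor. The "bias" reading says the agent mis-estimates its own value;
# the "taste" reading says the agent genuinely values the item more when the
# anchor is higher.

def bid_anchoring_bias(true_value, anchor, weight=0.3):
    # Bias reading: the agent "really" wants true_value, but its estimate is
    # distorted toward the irrelevant anchor.
    return (1 - weight) * true_value + weight * anchor

def bid_anchor_taste(true_value, anchor, weight=0.3):
    # Preference reading: the agent's valuation legitimately depends on the
    # anchor (an odd but coherent taste), and it bids that valuation honestly.
    return (1 - weight) * true_value + weight * anchor

for anchor in (10, 50, 90):
    assert bid_anchoring_bias(40, anchor) == bid_anchor_taste(40, anchor)
```

Since the two models agree on every input, no amount of behavioural data distinguishes them; the difference lies entirely in which part we label "preference" and which part we label "bias".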

In contrast, take examples of racial bias, hindsight bias, illusion of control, or naive realism. These biases all seem to be quite different from the anchoring bias, and quite different from each other. At the very least, they seem to have different "type signatures".

So, under the pessimistic scenario, some biases are much closer to preferences than generic biases (and generic preferences) are to each other. It's not uncommon for parts of the brain to reuse other parts for different purposes; the purity moral preference, for example, recycles part of the emotion of disgust. Individual biases and preferences probably similarly use a lot of the same machinery in the brain, making it hard to tell the difference between them.

Thus providing a few examples of preferences/beliefs/biases, the labelled data $D$, is not enough to disentangle them. Here $D$ fails to unlock the meaning of $E$: when reading psychology papers, the algorithm sees a lot of behaviour ("this human wrote this paper; I could have predicted that"), but no information relevant to the division between preferences/beliefs/biases.

Pessimism, information, and circular reasoning

It's worth digging into that last point a bit more, since it is key to many people's intuitions in this area. On this website, we find a quote:

Civil strife is as much a greater evil than a concerted war effort as war itself is worse than peace. (Herodotus)

Taken literally, this would mean civil strife << war << peace (each much worse than the next). But no-one sensible would take it literally; first of all, we'd want to know if the quote was genuine, we'd want to figure out a bit about Herodotus's background, we'd want to see whether his experience is relevant, what has changed in warfare and human preferences over the centuries, and so on.

So we'd be putting the information into context, and, to do so, we'd be using our own theory of mind, our own knowledge of what a preference is, and what beliefs and biases humans typically have...

There's a chicken-and-egg problem: it's not clear that extra information is of much use to the algorithm without a basic understanding of what preferences/beliefs/biases are. So, without a good grasp to get started, the algorithm may not be able to use the extra information (even all the world's information) to gain a further understanding. And human outputs, such as the psychology literature, are written to be understood unambiguously(-ish) by humans. Thus interpreting them in the human fashion may rely on implicit assumptions that the algorithm doesn't have access to.

It's important to realise that this is not a failure of intelligence on the part of the algorithm. AIXI, the idealised uncomputable superintelligence, will fail at image classification tasks if we give it incorrectly labelled data or don't give it enough examples to resolve ambiguous cases.

Failure mode of the pessimistic scenario

So the failure mode, in the pessimistic scenario, is that the algorithm generates the categories preferences*, beliefs*, and biases*, but these don't correspond well to actual preferences, beliefs, or biases, at least not once we get beyond the training examples provided (it doesn't help that humans themselves have trouble distinguishing these in many situations!).

So, what the algorithm thinks is a preference may well be a mixture of all three categories. We might correct it by pointing out its mistakes and adding some more examples, but this might only carry it a bit further: whenever it gets to an area where we haven't provided labels, it starts to make large categorisation errors or stumbles upon adversarial examples.

This may feel counter-intuitive because, for us, extracting preferences feels easy. I'll address that point in the next section, but I'll first note that it is not unusual for algorithms to find tasks hard that we find easy.

To reiterate: making the algorithm smarter would not solve the problem; the issue (in the pessimistic scenario) is that the three categories are neither well defined nor well separated.

Pessimism: humans interpreting other humans

We know that humans can interpret the preferences, beliefs, and biases of other humans, at least approximately. If we can do it so easily, how could it be hard for a smart algorithm to do so?

Moravec's paradox might imply that it would be difficult for an algorithm to do so, but that just means we need a smart enough algorithm.

But the question might be badly posed, in which case infinite smartness would not be enough. For example, imagine that humans looked like this, with the "Human Agency Interpreter" (basically the theory of mind) doing the job of interpreting other humans. The green arrows are there to remind us how much of this is done via empathy: by projecting our own preferences/beliefs onto the human we are considering.

This setup also has an optimistic and a pessimistic scenario. They turn on how feasible it is for the algorithm to isolate the "Human Agency Interpreter". In the optimistic scenario, we can use a few examples, point to the Wikipedia page on theory of mind, and the algorithm will extract a reasonable facsimile of the Human Agency Interpreter, and then use that to get a reasonable decomposition of the human algorithm into beliefs/preferences/biases.

In the pessimistic scenario, the Human Agency Interpreter is also twisted up with everything else in the human brain, and our examples are not enough to disentangle it. The same problem then reappears at this level: there is no principled way of figuring out the human theory of mind without starting from the human theory of mind.


  1. It may seem odd that there is an arrow going from observations to preferences, but a) human preferences do seem to vary with time and circumstances, and b) there is no clear distinction between observation-dependent and observation-independent preferences. For example, you could have a preference for eating when you're hungry; is this an eating preference that is hunger-dependent, or an eating-when-hungry preference that is independent of any observations? Because of these subtleties, I've preferred to draw the arrow unambiguously going from the observations node into the preferences node, so that there is no confusion. ↩︎

  2. This data may end up being provided implicitly, by programmers correcting "obvious mistakes" in the algorithm. ↩︎