Divergent preferences and meta-preferences

A putative new idea for AI control; index here.

In simple graphical form, here is the problem of divergent human preferences:

Here the AI either chooses $a$ or $\neg a$, and as a consequence, the human then chooses $h$ or $\neg h$.

There are a variety of situations in which this is or isn’t a problem (when $a$ or $h$ or their negations aren’t defined, take them to be the negation of what is defined):

  • Not problems:

    • “gives right shoe/​left shoe”, “adds left shoe/​right shoe”.

    • “offers drink”, “goes looking for extra drink”.

    • “gives money”, “makes large purchase”.

  • Potentially problems:

    • “causes human to fall in love with X/​Y”, “moves to X’s/​Y’s country”.

    • “recommends studying X/​Y”, “chooses profession P/​Q”.

    • “lets human conceive child”, “keeps up previous hobbies and friendships”.

  • Problems:

    • “coercive brain surgery”, anything.

    • “extreme manipulation”, almost anything.

    • “heroin injection”, “wants more heroin”.

So, what are the differences? For the “not problems”, it makes sense to model the human as having a single reward $R$, variously “likes having a matching pair of shoes”, “needs a certain amount of fluids”, and “values certain purchases”. Then all that the AI is doing is helping (or not) the human towards that goal.

As you move more towards the “problems”, notice that they seem to have two distinct human reward functions, $R_a$ and $R_{\neg a}$, and that the AI’s actions seem to choose which one the human will end up with. In the spirit of humans not being agents, this seems to be the AI determining what values the human will come to possess.

Grue, Bleen, and agency

Of course, you could always say that the human actually has reward $R = I_a R_a + (1 - I_a) R_{\neg a}$, where $I_a$ is the indicator function as to whether the AI does action $a$ or not.

Similarly to the grue and bleen problem, there is no logical way of distinguishing that “pieced-together” $R$ from a more “natural” $R$ (such as valuing pleasure, for instance). Thus there is no logical way of distinguishing the human being an agent from the human not being an agent, just from its preferences and behaviour.
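To make the grue/bleen point concrete, here is a toy sketch (my own construction, with made-up option names and values, not something from the text): a “natural” reward and a genuinely composite $I_a R_a + (1 - I_a) R_{\neg a}$ rank the human’s options identically under every AI action, so revealed behaviour cannot tell them apart.

```python
# Toy illustration: observed behaviour alone cannot separate a "natural"
# reward from a "pieced-together" one, because each branch reward is only
# ever exercised under its own condition.

OPTIONS = ("matching_pair", "odd_shoes")  # the human's available choices

def natural_reward(ai_did_a, option):
    # a single reward, independent of the AI's action:
    # "likes having a matching pair of shoes"
    return 1.0 if option == "matching_pair" else 0.0

def composite_reward(ai_did_a, option):
    # I_a * R_a + (1 - I_a) * R_not_a: R_a and R_not_a are different
    # functions, yet each branch happens to rank the options the same way
    R_a = {"matching_pair": 5.0, "odd_shoes": -2.0}
    R_not_a = {"matching_pair": 0.3, "odd_shoes": 0.1}
    return R_a[option] if ai_did_a else R_not_a[option]

def human_choice(reward, ai_did_a):
    # model the human as picking the reward-maximising option
    return max(OPTIONS, key=lambda o: reward(ai_did_a, o))

# both models predict identical behaviour in every observable situation
for ai_did_a in (True, False):
    assert human_choice(natural_reward, ai_did_a) == \
           human_choice(composite_reward, ai_did_a)
print("behaviourally indistinguishable")
```

The two reward functions differ substantially as functions, but nothing in the human’s choices reveals that.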

However, from a learning and computational complexity point of view, it does make sense to distinguish “natural” $R$’s (where $R_a$ and $R_{\neg a}$ are essentially the same, despite the human’s actions being different) from composite $R$’s.

This allows us to define:

  • Preference divergence point: A preference divergence point is one where $R_a$ and $R_{\neg a}$ are sufficiently distinct, according to some criteria of distinction.

Note that sometimes $R_a = R + R'_a$ and $R_{\neg a} = R + R'_{\neg a}$: the two $R_a$ and $R_{\neg a}$ overlap on a common piece $R$, but diverge on $R'_a$ and $R'_{\neg a}$. It makes sense to define this as a preference divergence point as well, if $R'_a$ and $R'_{\neg a}$ are “important” in the agent’s subsequent decisions. Importance being a somewhat hazy metric, which would, for instance, assess how much $R$-reward the human would sacrifice to increase $R'_a$ and $R'_{\neg a}$.
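One possible way of operationalising that hazy importance metric (an assumption of mine, not something pinned down in the text): search for the largest amount of common $R$-reward the human would sacrifice for one extra unit of the divergent component.

```python
# Hedged sketch: with linear weights on the common reward R and the
# divergent component R', find the largest sacrifice c of R-reward for
# which the human still prefers (R - c, R' + 1) over (R, R').

def importance(weight_common, weight_divergent, step=0.01, max_c=100.0):
    best = 0.0
    for i in range(int(max_c / step) + 1):
        c = i * step
        keep = weight_common * 5.0                                   # stay put
        trade = weight_common * (5.0 - c) + weight_divergent * 1.0   # sacrifice c
        if trade >= keep:
            best = c   # the human still accepts this sacrifice
    return best

# a human weighting R' twice as heavily as R gives up about 2 units of R
print(round(importance(1.0, 2.0), 2))
```

With linear utilities this just recovers the weight ratio, but the same search applies to any (non-linear) model of the human’s trade-offs.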


From the perspective of revealed preferences about the human, $\alpha I_a R_a + \beta (1 - I_a) R_{\neg a}$ will predict the same behaviour for all scaling factors $\alpha, \beta > 0$.

Thus at a preference divergence point, the AI’s behaviour, if it were a reward maximiser, would depend on the non-observed weighting between the two divergent preferences.

This is unsafe, especially if one of the divergent preferences is much easier to achieve a high value with than the other.
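A minimal sketch of the danger, under assumed toy numbers of my own: the composite-reward maximiser’s choice of action flips with the unobserved weights $\alpha$ and $\beta$, and the “easy” branch (here $R_{\neg a}$) wins under any roughly even weighting.

```python
# Assumed toy numbers: the best value the human can reach under each
# branch reward, with R_not_a much "easier" to score highly on.
BEST_VALUE = {"a": 1.0, "not_a": 100.0}

def ai_choice(alpha, beta):
    # the AI maximises alpha * I_a * R_a + beta * (1 - I_a) * R_not_a,
    # so it compares the achievable value of each of its own actions
    value_if_a = alpha * BEST_VALUE["a"]
    value_if_not_a = beta * BEST_VALUE["not_a"]
    return "a" if value_if_a >= value_if_not_a else "not_a"

# identical revealed preferences, two different unobserved weightings:
print(ai_choice(alpha=200.0, beta=1.0))  # weighting R_a heavily picks "a"
print(ai_choice(alpha=1.0, beta=1.0))    # even weights: easy branch wins
```

Nothing in the human’s observable behaviour pins down which of these two AIs we have built.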

Thus preference divergence points are moments when the AI should turn explicitly to human meta-preferences to distinguish between them.

This can be made recursive: if we see the human meta-preferences as explicitly weighting $R_a$ versus $R_{\neg a}$ and hence giving $\alpha R_a + \beta R_{\neg a}$, then if there is a prior AI decision point $b$, and, depending on what the AI chooses there, the human meta-preferences will be different, this gives two reward functions $\alpha_b R_a + \beta_b R_{\neg a}$ and $\alpha_{\neg b} R_a + \beta_{\neg b} R_{\neg a}$, with different weights $(\alpha_b, \beta_b)$ and $(\alpha_{\neg b}, \beta_{\neg b})$.

If these weights are sufficiently distinct, this could identify a meta-preference divergence point, and hence a point where human meta-meta-preferences become relevant.
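The recursion can be sketched as follows, with an invented representation (my own, not from the text): each decision point stores the $(\alpha, \beta)$ weighting the human would end up endorsing under each of the AI’s two choices, and divergence at one tier escalates the question to the next tier of (meta-)preferences.

```python
# Invented representation: a big enough gap between the weightings the
# human would endorse under the AI's two choices means that tier's
# preferences diverge, so one more "meta" level must be consulted.

def levels_to_consult(decision_points, threshold=0.5):
    """decision_points runs from the current point back to earlier ones;
    each entry is ((alpha_yes, beta_yes), (alpha_no, beta_no))."""
    level = 0
    for w_yes, w_no in decision_points:
        gap = abs(w_yes[0] - w_no[0]) + abs(w_yes[1] - w_no[1])
        if gap <= threshold:
            break       # weights agree closely: this tier settles it
        level += 1      # divergence: escalate one more meta-level
    return level

# point a's weights diverge (consult meta-preferences); prior point b's
# meta-weights roughly agree, so the recursion stops there
points = [((0.9, 0.1), (0.2, 0.8)),
          ((0.5, 0.5), (0.5, 0.6))]
print(levels_to_consult(points))
```

The threshold plays the role of the “sufficiently distinct” criterion, and would itself need justifying in any real implementation.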
