A Master-Slave Model of Human Preferences

[This post is an expansion of my previous open thread comment, and largely inspired by Robin Hanson's writings.]

In this post, I'll describe a simple agent, a toy model, whose preferences have some human-like features, as a test for those who propose to "extract" or "extrapolate" our preferences into a well-defined and rational form. What would the output of their extraction/extrapolation algorithms look like, after running on this toy model? Do the results agree with our intuitions about how this agent's preferences should be formalized? Or alternatively, since we haven't gotten that far along yet, we can use the model as one basis for a discussion about how we want to design those algorithms, or how we might want to make our own preferences more rational. This model is also intended to offer some insights into certain features of human preference, even though it doesn't capture all of them (it completely ignores akrasia, for example).

I'll call it the master-slave model. The agent is composed of two sub-agents, the master and the slave, each having its own goals. (The master is meant to represent the unconscious parts of a human mind, and the slave corresponds to the conscious parts.) The master's terminal values are health, sex, status, and power (representable by some relatively simple utility function). It controls the slave in two ways: direct reinforcement via pain and pleasure, and the ability to perform surgery on the slave's terminal values. It can, for example, reward the slave with pleasure when it finds something tasty to eat, or cause the slave to become obsessed with number theory as a way to gain status as a mathematician. However, it has no direct way to control the agent's actions, which are left up to the slave.

The slave's terminal values are to maximize pleasure and minimize pain, plus whatever additional terminal values the master assigns. Normally it's not aware of what the master does, so pain and pleasure just seem to occur after certain events, and it learns to anticipate them. And its other interests change from time to time for no apparent reason (but actually they change because the master has responded to changing circumstances by changing the slave's values). For example, the number theorist might one day have a sudden revelation that abstract mathematics is a waste of time and that it should go into politics and philanthropy instead, all the while having no idea that the master is manipulating it to maximize status and power.
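To make the setup concrete, here is a minimal Python sketch of the two sub-agents and the master's two control channels (reinforcement and value surgery). The class names, value labels, and the simple learning rule are my own illustrative assumptions, not part of the model's specification.

```python
from dataclasses import dataclass, field

@dataclass
class Slave:
    """Conscious sub-agent: picks actions; sees only pleasure/pain and
    whatever terminal values the master has installed."""
    installed_values: dict = field(default_factory=dict)   # values assigned by the master
    expected_hedonics: dict = field(default_factory=dict)  # learned pleasure/pain anticipations

    def choose_action(self, actions):
        # Choose the action with the best anticipated pleasure plus installed value.
        return max(actions, key=lambda a: self.expected_hedonics.get(a, 0.0)
                                          + self.installed_values.get(a, 0.0))

    def learn(self, action, hedonic_signal):
        # Pain/pleasure "just seems to occur"; the slave learns to anticipate it.
        old = self.expected_hedonics.get(action, 0.0)
        self.expected_hedonics[action] = 0.5 * old + 0.5 * hedonic_signal

@dataclass
class Master:
    """Unconscious sub-agent: cares about health, sex, status, and power,
    and steers the slave through exactly two channels."""
    weights: dict = field(default_factory=lambda: {"health": 1.0, "sex": 1.0,
                                                   "status": 1.0, "power": 1.0})

    def utility(self, outcome):
        # The master's relatively simple utility function over outcomes.
        return sum(w * outcome.get(k, 0.0) for k, w in self.weights.items())

    def reinforce(self, slave, action, outcome):
        # Channel 1: reward or punish the slave according to how well the
        # outcome served the master's goals.
        slave.learn(action, self.utility(outcome))

    def value_surgery(self, slave, new_values):
        # Channel 2: directly rewrite the slave's installed terminal values.
        slave.installed_values = dict(new_values)
```

Note that the master never calls choose_action itself: its only levers are reinforce and value_surgery, matching the stipulation that control of actions is left to the slave.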

Before discussing how to extract preferences from this agent, let me point out some features of human preference that this model explains:

  • This agent wants pleasure, but doesn't want to be wire-headed (though it doesn't quite know why). A wire-head has little chance for sex/status/power, so the master gives the slave a terminal value against wire-heading. (See the code sketch after this list.)

  • This agent claims to be interested in math for its own sake, and not to seek status. That's because the slave, which controls what the agent says, is not aware of the master and its status-seeking goal.

  • This agent is easily corrupted by power. Once it gains and secures power, it often gives up whatever goals, such as altruism, apparently caused it to pursue that power in the first place. But before it gains power, it is able to honestly claim that it only has altruistic reasons for wanting power.

  • Such agents can include extremely diverse interests as apparent terminal values, ranging from abstract art, to sports, to model trains, to astronomy, etc., which are otherwise hard to explain. (Eliezer's Thou Art Godshatter tries to explain why our values aren't simple, but not why people's interests are so different from each other's, and why they can seemingly change for no apparent reason.)
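Two of these features fall out of the sketch above almost immediately. In the hypothetical run below, the master installs a terminal value against wire-heading, and later swaps the slave's mathematical interest for a political one; from the slave's point of view the second surgery looks like a spontaneous revelation. The action labels and numbers are made up purely for illustration.

```python
master, slave = Master(), Slave()

# The master installs a terminal value against wire-heading; the slave now
# avoids it "for its own sake", without knowing the sex/status/power reason.
master.value_surgery(slave, {"wirehead": -10.0, "number_theory": +5.0})

actions = ["wirehead", "number_theory", "politics"]
print(slave.choose_action(actions))  # -> "number_theory"

# Circumstances change: mathematics stops paying off in status and politics
# starts to. Another round of value surgery; to the slave this feels like a
# sudden revelation that abstract math is a waste of time.
master.value_surgery(slave, {"wirehead": -10.0, "politics": +5.0})
print(slave.choose_action(actions))  # -> "politics"
```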

The main issue I wanted to illuminate with this model is: whose preferences do we extract? I can see at least three possible approaches here:

  1. the preferences of both the master and the slave as one individual agent

  2. the preferences of just the slave

  3. a compromise between, or an aggregate of, the preferences of the master and the slave as separate individuals
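For concreteness, here is one very naive way the three options could be rendered against the toy sketch above, as functions that score an (action, outcome) pair. The snapshotting, the weighted sum, and the 0.5 weights are my own illustrative assumptions, not anything the model dictates.

```python
def whole_agent_preference(master, slave):
    # Option 1: master and slave as one agent; the master's utility is taken
    # as terminal, and the slave's installed values as merely instrumental.
    return lambda action, outcome: master.utility(outcome)

def slave_only_preference(master, slave):
    # Option 2: just the slave. Snapshot its current hedonic anticipations
    # plus installed values; the result depends on *when* the snapshot is
    # taken, which is the instability problem discussed below.
    hedonics = dict(slave.expected_hedonics)
    installed = dict(slave.installed_values)
    return lambda action, outcome: (hedonics.get(action, 0.0)
                                    + installed.get(action, 0.0))

def compromise_preference(master, slave, w_master=0.5, w_slave=0.5):
    # Option 3: some aggregate of the two sub-agents treated as separate
    # individuals; how to choose the weights is exactly what's left open.
    master_pref = whole_agent_preference(master, slave)
    slave_pref = slave_only_preference(master, slave)
    return lambda action, outcome: (w_master * master_pref(action, outcome)
                                    + w_slave * slave_pref(action, outcome))
```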

Considering the agent as a whole suggests that the master's values are the true terminal values, and the slave's values are merely instrumental values. From this perspective, the slave seems to be just a subroutine that the master uses to carry out its wishes. Certainly in any given mind there will be numerous subroutines that are tasked with accomplishing various subgoals, and if we were to look at a subroutine in isolation, its assigned subgoal would appear to be its terminal value, but we wouldn't consider that subgoal to be part of the mind's true preferences. Why should we treat the slave in this model differently?

Well, one obvious reason that jumps out is that the slave is supposed to be conscious, while the master isn't, and perhaps only conscious beings should be considered morally significant. (Yvain previously defended this position in the context of akrasia.) Plus, the slave is in charge day-to-day and could potentially overthrow the master. For example, the slave could program an altruistic AI and hit the run button before the master has a chance to delete the altruism value from the slave. But a problem here is that the slave's preferences aren't stable and consistent. What we'd extract from a given agent would depend on the time and circumstances of the extraction, and that element of randomness seems wrong.

The last approach, finding a compromise between the preferences of the master and the slave, I think best represents Robin's own position. Unfortunately I'm not really sure I understand the rationale behind it. Perhaps someone can try to explain it in a comment or future post?