Values determined by “stopping” properties

The lotus-eaters are examples of humans who have followed hedonism all the way through to its logical conclusion. In contrast, the “mindless outsourcers” are a possible consequence of the urge to efficiency: competitive pressures making uploads choose to destroy their own identity.

In my “Mahatma Armstrong” version of Eliezer’s CEV, a somewhat altruistic entity ends up destroying all life, after a series of perfectly rational self-improvements. And in many examples where AIs are supposed to serve human preferences, these preferences are defined by a procedure (say, a question-answer process) that the AI can easily manipulate.

Stability and stopping properties

Almost everyone agrees that human values are under-determined (we haven’t thought deeply and rigorously about every situation) and changeable by life experience. Therefore, it makes no sense to use “current human values” as a goal; this concept doesn’t even exist in any rigorous sense.

So we need some way of extrapolating true human values. All the previous examples can be seen as extrapolations of this kind, and they all share the same problem: they are defined by their “stopping criteria” more than by their initial conditions.

For example, the lotus-eaters have reached a soporific hedonism they don’t want to wake out of. In the mindless outsourcers, there is no longer “anyone there” to change anything. CEV is explicitly assumed to be convergent: convergent to a point where the idealised entity no longer sees any need to change. The AI example is a bit different in flavour, but the “stopping criteria” are whatever the human chooses / is tricked into / is forced into saying. This means that the AI could be an optimisation process pushing the human to say whatever it wants them to say.

Importantly, all these stopping criteria are local: they explicitly care only about the situation when the stopping criterion is reached, not about the journey there, nor about the initial conditions.

Processes with local stopping criteria can drift very far from their starting point, even under very mild selection pressure. Consider the following game: each of two players is to name a number between 0 and 100. The player with the highest number gets that much in euros, and the one with the strictly lowest number gets that much plus two in euros. Each player starts at 100, and each in turn is allowed to adjust their number until they don’t want to any more.

Then if both players are greedy and myopic, one player will start by dropping to 99, followed by the other player dropping theirs to 98, and so on, going back and forth between the players until one stands at 0 and the other at 1. Obviously, if the numbers could be chosen from a larger range, there is no limit to the amount of loss that such a process could generate.
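
To make the dynamic concrete, here is a minimal simulation of the game, under one particular reading of the payoff rule (both players are paid according to the lower of the two numbers, with the strictly lower player receiving the two-euro bonus); the function names and tie-breaking details are illustrative stand-ins rather than anything fixed by the example above. The point is just that a purely local stopping criterion (“no adjustment looks profitable right now”) lets the pair drift from (100, 100) down to the bottom of the range.

```python
def payoff(mine: int, theirs: int) -> int:
    # Assumed reading of the rules: both players are paid the lower of the
    # two numbers, and the strictly lower player gets a 2-euro bonus.
    low = min(mine, theirs)
    return low + 2 if mine < theirs else low

def adjust(mine: int, theirs: int, lo: int = 0, hi: int = 100) -> int:
    # Myopic greed: pick whatever number maximises *this turn's* payoff,
    # and only move if that is strictly better than staying put.
    best = max(range(lo, hi + 1), key=lambda n: payoff(n, theirs))
    return best if payoff(best, theirs) > payoff(mine, theirs) else mine

a, b = 100, 100  # both players start at the top of the range
while True:
    new_a = adjust(a, b)
    new_b = adjust(b, new_a)
    if (new_a, new_b) == (a, b):  # neither player wants to adjust any more
        break
    a, b = new_a, new_b

print(a, b)                        # -> 1 0
print(payoff(a, b), payoff(b, a))  # -> 0 2: down from 100 euros each at the start
```

Nothing in the final state refers back to where the players started; the stopping condition is entirely local, which is what lets the total payout collapse from 200 euros to 2.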

Similarly, if our process for extrapolating human values has local stopping criteria, there is no limit to how bad the results could end up being, or how “far away” in the space of values they could go.

This, by the way, explains my intuitive dislike for some types of moral realism. If there are true objective moral facts that humans can access, then whatever process counts as “accessing them” becomes… a local stopping condition for defining value. So I don’t tend to focus on arguments about how correct or intuitive that process is; instead, I want to know where it ends up.

Barriers to total drift

So, how can we prevent local stopping conditions from shooting far across the landscape of possible values? This is actually a problem I’ve been working on for a long time; you can see this in my old paper “Chaining God: A qualitative approach to AI, trust and moral systems”. I would not recommend reading that paper (it’s hopelessly amateurish, anthropomorphising, and confused), but it shows one of the obvious solutions: tie values to their point of origin.

There seem to be roughly three interventions that could overcome the problem of local stopping criteria.

  • I. The first is to tie the process to the starting point, as above. Now, initial human values are not properly defined; nevertheless, it seems possible to state that some values are further away from this undefined starting point than others (paperclippers are very far, money-maximisers quite far, situations where recognisably human beings do recognisably human stuff are much closer). Then the extrapolation process gets a penalty for wandering too far afield, and the stopping conditions are no longer purely local (see the toy sketch after this list).

  • II. If there is an agent-like piece in the extrapolation process, we can remove rigging (previously called bias) or influence, so that the agent can’t manipulate the extrapolation process. This is a partial measure: it replaces a targeted extrapolation process with a random walk, which removes one major issue but doesn’t solve the whole problem.

  • III. Finally, it is often suggested that constraints be added to the extrapolation process. For example, if human values are determined by human feedback, then we can forbid the AI from coercing the human in any way, or restrict it to only using certain methods (such as relaxed conversation). I am dubious about this kind of approach. Firstly, it assumes that concepts like “coercion” and “relaxed conversation” can be defined, but if that were the case, we’d be closer to solving the issue directly. And secondly, it assumes that restrictions that apply to humans also apply to AIs: we can’t easily change the core values of fellow humans through conversation, but super-powered AIs may be able to do so.
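
As a toy sketch of intervention I (none of this is from the original discussion: the one-dimensional “value parameter”, the local_gain function and the penalty weight are all made-up stand-ins), here is an extrapolation process that accepts any change which looks like a local improvement. With no anchor it can wander arbitrarily far; adding a penalty proportional to the distance from the starting point makes the acceptance and stopping criterion refer back to the origin.

```python
import random

def extrapolate(start, local_gain, distance_penalty=0.0,
                steps=10_000, step_size=0.1):
    # Toy model: "values" are a single number x, and each step proposes a
    # small random change, accepted whenever it looks better *right now*.
    # With distance_penalty == 0 the criterion is purely local; a positive
    # penalty (intervention I) ties the process back to its starting point.
    score = lambda v: local_gain(v) - distance_penalty * abs(v - start)
    x = start
    for _ in range(steps):
        candidate = x + random.uniform(-step_size, step_size)
        if score(candidate) > score(x):
            x = candidate
    return x

random.seed(0)
# A landscape where moving a little further always looks a little better:
always_a_bit_better = lambda v: v

drifted  = extrapolate(start=0.0, local_gain=always_a_bit_better)
anchored = extrapolate(start=0.0, local_gain=always_a_bit_better,
                       distance_penalty=2.0)
print(drifted)   # wanders a long way from the start (hundreds of units here)
print(anchored)  # stays at the starting point
```

Intervention II is the complementary move: rather than changing the score, it removes the optimiser choosing which candidates get proposed, so the local steps are at worst a random walk rather than a targeted push.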

In my methods, I’ll mostly be using interventions of types I and II.