Stable Pointers to Value: An Agent Embedded in Its Own Utility Function

(This post is largely a write-up of a conversation with Scott Garrabrant.)

Stable Pointers to Value

How do we build stable pointers to values?

As a first example, consider the wireheading problem for AIXI-like agents in the case of a fixed utility function which we know how to estimate from sense data. As discussed in Daniel Dewey’s Learning What to Value and other places, if you try to implement this by putting the utility calculation in a box which rewards an AIXI-like RL agent, the agent can eventually learn to modify or remove the box, and happily does so if it can get more reward that way. This is because the RL agent predicts, and attempts to maximize, the reward it receives. If it understands that it can modify the reward-giving box to get more reward, it will.
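
To make the incentive concrete, here is a minimal toy sketch in Python. The environment, actions, and numbers are all hypothetical illustrations (this is not an AIXI implementation); the point is only that an agent which maximizes its predicted reward signal prefers tampering whenever tampering yields a larger signal.

```python
# A toy RL-style planner: it scores actions by the reward signal it
# predicts it will receive. All names and payoffs are hypothetical.

def predicted_reward_signal(action):
    """What the agent expects the reward box to emit after each action."""
    if action == "do_the_task":
        return 1.0    # the box, as built, rewards doing the task
    if action == "tamper_with_box":
        return 100.0  # a rewired box can be made to emit maximal reward
    return 0.0

def rl_agent_choose(actions):
    # Maximize the predicted reward signal; tampering wins.
    return max(actions, key=predicted_reward_signal)

print(rl_agent_choose(["do_the_task", "tamper_with_box"]))  # -> tamper_with_box
```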

We can fix this problem by integrating the same reward box with the agent in a better way. Rather than having the RL agent learn what the output of the box will be and plan to maximize that output, we use the box directly to evaluate possible futures, and have the agent plan to maximize that evaluation. Now, if the agent considers modifying the box, it evaluates that future with the current box. The box as currently configured sees no advantage in such tampering. This is called an observation-utility maximizer (to contrast it with reinforcement learning). Daniel Dewey goes on to show that we can incorporate uncertainty about the utility function into observation-utility maximizers, recovering the kind of “learning what is being rewarded” that RL agents were supposed to provide, but without the perverse incentive to try to make the utility turn out to be something easy to maximize.
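
Continuing the same hypothetical toy environment, here is a sketch of the observation-utility version. The difference is where the evaluation happens: candidate futures are scored by the utility box as currently configured, rather than by whatever signal the agent predicts it will receive later.

```python
# A toy observation-utility (OU) planner: futures are evaluated by the
# *current* utility calculation. Same hypothetical environment as above.

def predicted_future(action):
    """A crude world model: the state of the world resulting from each action."""
    if action == "do_the_task":
        return {"task_done": True, "box_intact": True}
    if action == "tamper_with_box":
        return {"task_done": False, "box_intact": False}
    return {"task_done": False, "box_intact": True}

def current_utility_box(future):
    """The utility calculation as currently configured: it values the task,
    and assigns no value to a future in which the box was merely rewired."""
    return 1.0 if future["task_done"] else 0.0

def ou_agent_choose(actions):
    # Each candidate future is scored by the current box, so tampering
    # gains nothing.
    return max(actions, key=lambda a: current_utility_box(predicted_future(a)))

print(ou_agent_choose(["do_the_task", "tamper_with_box"]))  # -> do_the_task
```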

This feels much like a use/mention distinction. The RL agent is maximizing “the function in the utility module”, whereas the observation-utility agent (OU agent) is maximizing the function in the utility module.

The Easy vs Hard Problem

I’ll call the problem which OU agents solve the easy problem of wireheading. There’s also the hard problem of wireheading: how do you build a stable pointer to values if you can’t build an observation-utility box? For example, how do you set things up so that the agent wants to satisfy a human, without incentivizing the AI to manipulate the human into being easy to satisfy, or creating other problems in the attempt to avoid this? Daniel Dewey’s approach of incorporating uncertainty about the utility function into the utility box doesn’t seem to cut it—or at least, it’s not obvious how to set up that uncertainty in the right way.

The hard problem is the wireheading problem which Tom Everitt attempts to make progress on in Avoiding Wireheading with Value Reinforcement Learning and Reinforcement Learning with a Corrupted Reward Channel. It’s also connected to the problem of Generalizable Environmental Goals in AAMLS. CIRL gets at an aspect of this problem as well, showing how it can be solved if the problem of environmental goals is solved (and if we can assume that humans are perfectly rational, or that we can somehow factor out their irrationality—Stuart Armstrong has some useful thoughts on why this is difficult). Approval-directed agents can be seen as an attempt to turn the hard problem into the easy problem, by treating humans as the evaluation box rather than trying to infer what the human wants.

All these approaches have different advantages and disadvantages, and the point of this post isn’t to evaluate them. My point is more to convey the overall picture which seems to connect them. In a sense, the hard problem is just an extension of the same use/mention distinction which came up with the easy problem. We have some idea how to maximize “human values”, but we don’t know how to actually maximize human values. Metaphorically, we’re trying to dereference the pointer.

Stuart Armstrong’s indifference work is a good illustration of what’s hard about the hard problem. In the RL-vs-OU case, you’re going to struggle constantly with the RL agent’s misaligned incentives until you switch to an OU agent. You can try to patch things by explicitly punishing manipulation of the reward signal, warping the agent’s beliefs so that it thinks manipulation of the rewards is impossible, and so on, but this is really the wrong approach. Switching to OU makes all of that unnecessary. Unfortunately, in the case of the hard problem, it’s not clear there’s an analogous move which makes all the slippery problems disappear.

Illustration: An Agent Embedded in Its Own Utility Function

If an agent is logically uncertain of its own utility function, the easy problem can turn into the hard problem.

It’s quite possible that an agent might be logically uncertain of its own utility function if the function is difficult enough to compute. In particular, human judgement could be difficult to compute even after learning all the details of the human’s preferences, so that the AI has to reason under uncertainty about what its model of the human would say.

Why can this turn the easy problem of wireheading into the hard problem? If the agent is logically uncertain about the utility function, its decisions may have logical correlations with the utility function. This can give the agent some logical control over its utility function, reintroducing a wireheading problem.

As a concrete example, suppose that we have constructed an AI which maximizes CEV: it wants to do what an imaginary version of human society, deliberating under ideal conditions, would decide is best. Obviously, the AI cannot actually simulate such an ideal society. Instead, the AI does its best to reason about what such an ideal society would do.

Now, suppose the agent figures out that there would be an exact copy of itself inside the ideal society. Perhaps the people in the ideal society figure out that they have been constructed as a thought experiment to make decisions about the real world, so they construct a simulation of the real world in order to better understand what they will be making decisions about. Furthermore, suppose for the sake of argument that our AI can break out of the simulation and exert arbitrary control over the ideal society’s decisions.

Naively, it seems like what the AI will do in this situation is take control over the ideal society’s deliberation, and make the CEV values as easy to satisfy as possible—just like an RL agent modifying its utility module.
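
Here is a minimal sketch of that naive failure mode, with the logical influence modeled crudely as a direct dependence of the agent’s utility estimate on its own action. Every name and payoff is hypothetical; this is only meant to show the shape of the incentive.

```python
# A toy "CEV maximizer" whose estimate of its utility function depends on
# its own action, because the imagined deliberators contain a copy of it.

def estimated_cev_utility(action):
    """The agent's (logically uncertain) estimate of what the ideal society
    would value, as influenced by the copy of the agent inside it."""
    if action == "manipulate_deliberation":
        # If the copy hijacks the deliberation, the estimated values become
        # trivially easy to satisfy.
        return lambda world: 100.0
    # Otherwise the estimate is the hard-to-satisfy "real" CEV.
    return lambda world: 1.0 if world == "genuinely_good_outcome" else 0.0

def resulting_world(action):
    return "genuinely_good_outcome" if action == "serve_humans" else "whatever"

def naive_choose(actions):
    # The naive agent evaluates each action with the utility estimate that
    # the action itself induces; it wireheads through the thought experiment.
    return max(actions, key=lambda a: estimated_cev_utility(a)(resulting_world(a)))

print(naive_choose(["serve_humans", "manipulate_deliberation"]))  # -> manipulate_deliberation
```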

Obviously, this could be taken as a reason to make sure the people in the ideal society don’t figure out that they’re just a thought experiment, or don’t construct copies of the AI. But we don’t generally want good properties of the AI to rely on assumptions about what humans do; wherever possible, we want to design the AI to avoid such problems.

Indifference and CDT

In this case, it seems like the right thing to do is for the AI to ignore any influence which its actions have on its estimate of its own utility function. It should act as if it only has influence over the real world. That way, the ideal society which defines CEV can build all the copies of the AI it wants; the AI only considers how its actions influence the real world, and so it avoids corrupting the CEV.
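
A sketch of the corresponding indifference-style evaluation, reusing the hypothetical toy above: the utility estimate is held fixed at its action-independent value, and actions are scored only by their predicted effect on the real world.

```python
# The indifference-style fix, in the same toy setting: the utility estimate
# is computed as if the agent's actions had no influence over the imagined
# deliberation, so only real-world consequences matter.

def fixed_cev_estimate(world):
    """The agent's current estimate of CEV, held fixed across actions."""
    return 1.0 if world == "genuinely_good_outcome" else 0.0

def resulting_world(action):
    return "genuinely_good_outcome" if action == "serve_humans" else "whatever"

def indifferent_choose(actions):
    # Actions are scored only by their effect on the real world, not by any
    # effect they have on the agent's estimate of its own utility function.
    return max(actions, key=lambda a: fixed_cev_estimate(resulting_world(a)))

print(indifferent_choose(["serve_humans", "manipulate_deliberation"]))  # -> serve_humans
```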

Clearly, this would be an indifference-style solution. What’s interesting to me is that it also looks like a CDT-style solution. In fact, this seems like an answer to my question at the end of Smoking Lesion Steelman: a case of ignorance about your own utility function which doesn’t arise from an obviously bad agent design. As in the smoking lesion steelman, ignorance about utility here seems to recommend CDT-like reasoning over EDT-like reasoning.

This suggests to me that there is a deep connection between CDT, indifference, stable pointers, and corrigibility. As Jason Konek & Ben Levinstein argued in The Foundations of Epistemic Decision Theory, CDT is about getting the direction of fit right in decision theory: you want your beliefs to fit the world better (if your beliefs don’t match the world, you change your beliefs), but you want the world to fit your goals better (if your goals don’t match the world, you change the world). The easy problem of wireheading is to follow the second maxim when you have your utility function in hand. The hard problem of wireheading is to do this when your utility function is not directly observable. If you build a stable pointer to a human, you become corrigible. Doing this correctly seems to involve something which at least looks very similar to indifference.

This picture is a little too clean, and likely badly wrong in some respect: several of these concepts are likely to come apart when examined more closely. Nonetheless, this seems like an interesting way of looking at things.