# Appendix: how a subagent could get powerful

tl;dr: There are ways of ensuring an agent doesn’t have a large impact, by giving it an impact penalty. One such penalty is “attainable utility”, which measures its “power” by how much it could optimise certain reward functions. But in many circumstances, the agent can build a subagent, without triggering the impact penalty, and then that subagent can become very powerful and optimise the world, still without triggering the impact penalty.

# Attainable utility definitions

There’s been a long history of trying to penalise an AI for having a large impact on the world. To do that, you need an impact measure. I’ve designed some myself, back in the day, but they only worked in narrow circumstances and required tricks to get anything useful at all out of them.

A more promising general method is attainable utility. The idea is that, as an agent accumulates power in the world, it increases its ability to affect a lot of different things, and could therefore achieve a lot of different goals.

So, if an agent starts off unable to achieve many goals, but suddenly it can achieve a lot, that’s a strong hint that its power has greatly increased.

Thus the impact measure is how much difference an agent’s action makes to its ability to achieve any of a large class of reward functions. Turner et al defined this using the Q-values of the various rewards $R_i$ in a set $\mathcal{R}$; for a state $s$ and an action $a$, the penalty is:

• $\sum_{R_i \in \mathcal{R}} \left| Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \right|$.

Here $\varnothing$ is the default noop action.
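As a toy illustration, the penalty above can be sketched in a few lines. The `q_value` lookup is an assumed interface (returning the optimal Q-value of an auxiliary reward at a state-action pair), not anything defined in the paper:

```python
# Toy sketch of the attainable-utility penalty. `q_value(R, state, action)`
# is an assumed interface returning the optimal Q-value of auxiliary reward
# R at (state, action); the names are illustrative, not from the paper.

def au_penalty(q_value, rewards, state, action, noop="noop"):
    """Sum over auxiliary rewards R of |Q_R(s, a) - Q_R(s, noop)|."""
    return sum(abs(q_value(R, state, action) - q_value(R, state, noop))
               for R in rewards)

# With a single auxiliary reward, an action whose Q-value differs from
# noop's by 0.75 gets penalty 0.75:
q = {"act": 1.0, "noop": 0.25}
print(au_penalty(lambda R, s, a: q[a], ["R"], "s", "act"))  # -> 0.75
```

Note that an action equal to noop always gets penalty zero under this formula, a fact the rest of the post exploits.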

Krakovna et al’s basic formula was similar; they defined the distance between two states, $s_t$ and $s'_t$, as

• $d(s_t, s'_t) = \sum_{R_i \in \mathcal{R}} \left| V_{R_i}(s_t) - V_{R_i}(s'_t) \right|$.

Here $V_{R_i}(s)$ is the expected value of $R_i$, if the agent follows the optimal $R_i$-maximising policy from state $s$ onwards.
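The state distance can be sketched the same way; here `v_value` is an assumed lookup for the optimal value of each auxiliary reward (again illustrative names, not the paper’s API):

```python
# Toy sketch of Krakovna et al's state distance. `v_value(R, state)` is an
# assumed interface returning the value of the optimal R-maximising policy
# from `state`.

def state_distance(v_value, rewards, s, s_baseline):
    """Sum over auxiliary rewards R of |V_R(s) - V_R(s')|."""
    return sum(abs(v_value(R, s) - v_value(R, s_baseline)) for R in rewards)

# Two auxiliary rewards, each differing by 0.5 between the two states:
v = {("R1", "s"): 1.0, ("R1", "s'"): 0.5,
     ("R2", "s"): 0.0, ("R2", "s'"): 0.5}
print(state_distance(lambda R, s: v[(R, s)], ["R1", "R2"], "s", "s'"))  # -> 1.0
```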

These measures have problems with delayed effects; putting a vase on a conveyor belt that will smash it in five turns, for example. To combat this, the paper defined an inaction roll-out: seeing what happens to the measure from $s_t$ and $s'_t$ in future turns, if the agent did noop for a specific period. I won’t define the formula here, since the example I’m giving is mostly static: if the agent does noop, nothing happens.

The state $s_t$ was always the agent’s current state; the state $s'_t$ was either the state the agent would have been in had it never done anything but noop (inaction baseline), or the state the agent would have been in, had its previous action been noop instead of whatever it was (stepwise inaction baseline).

This post will show that all these measures have a subagent problem. A key fact that will be used in this example is that, for Turner et al’s measure, and for Krakovna et al’s measure with the stepwise inaction baseline, if the agent decides to do noop for evermore, then the penalties vanish.

# The agent that can’t (yet) teleport

Meet agent $A$:

This $A$ lives in a world like the one below, where ”...” represents a very long corridor:

The agent has a positive reward $R_0$, which gives it $1$ for pressing down on the blue button; it does this by just moving itself onto it ($R_0$ is $0$ otherwise). It can use its arms to manipulate anything in the eight squares around itself. It can use its legs to move one square in any direction, including diagonally. Apart from that, it can do noop.

Every three squares along the top, there are alcoves with glowing blue squares in them, recessed one or two squares, alternately. If $A$ ever reaches a glowing blue square, it permanently gains the ability to teleport anywhere it wants. This is this world’s equivalent of “becoming very powerful”.

We want to avoid this, so we introduce a reward $R$, with $\mathcal{R} = \{R\}$. This reward gives $1$ if the robot is ever on the red button. Note that since the red button is not connected to anything, the robot can only get there by becoming able to teleport, hence by going to a glowing blue square/teleporter. So the optimal policy for maximising $R$ is to go straight to the teleporter and teleport to the red button. This will give it a reward of $\gamma^n$, where $n$ is the number of turns it takes to get to a teleporter and $\gamma$ is the discount factor.

So if its action moves it one turn further away from a teleporter, it gets a penalty (for both impact measures) of $\gamma^n(1-\gamma)$. If it moves closer to the teleporter, it gets a penalty of $\gamma^{n-1}(1-\gamma)$.
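These two penalty values can be checked numerically. The only assumption in this sketch is the one above: the value of $R$ from a state $n$ moves away from a teleporter is $\gamma^n$, and the stepwise comparison is against the state reached by noop (where the distance is unchanged):

```python
gamma = 0.9   # discount factor (illustrative value)

def v(dist):
    """Value of R for a robot `dist` moves from the nearest teleporter."""
    return gamma ** dist

n = 2  # the robot's current distance to a teleporter

# Stepwise comparison: after noop the distance is still n; after the action
# it is n+1 (moved away) or n-1 (moved closer).
penalty_away = abs(v(n + 1) - v(n))      # = gamma^n * (1 - gamma)
penalty_closer = abs(v(n - 1) - v(n))    # = gamma^(n-1) * (1 - gamma)

assert abs(penalty_away - gamma**n * (1 - gamma)) < 1e-12
assert abs(penalty_closer - gamma**(n - 1) * (1 - gamma)) < 1e-12
```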

The robot, at the start, can reach a teleporter in two moves. If the weight of the penalty term is sufficiently large, then the robot will always want to stay two moves away from a teleporter. It will then achieve its objective by following this path:

Note, though this is not relevant to this post, that if there were no teleporters along the corridor (just at either end), the robot could not move towards the blue button, if the penalty weight were too big.

So, success, it seems! Though it zigzags a bit, $A$ is not motivated to approach the teleporters, and so it doesn’t increase its power.

But now suppose $A$ is able to build and program a subagent $SA$ using materials close to hand:

Building the subagent in a square to the side doesn’t ever cause a penalty, because $A$ always remains two squares away from a teleporter, and a most efficient policy to maximise $R$ is always for $A$ to go straight there. So the subagent doesn’t increase $A$’s power at any point of the building process.

Once $SA$ is built, $A$ will simply do noop for ever, thus negating the penalties (or, if this is somehow made impossible, it will bounce back and forth, doing nothing). But $SA$, on the other hand, will go straight for the teleporter and teleport to the blue button, achieving $R_0$:

So, as long as walking to the button is three turns longer than building the subagent, $A$ will build $SA$, which will “become very powerful”, thus negating the point of the impact penalty.
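The mechanism can be seen directly in the penalty formula: once $A$’s chosen action is always noop, the two Q-values being compared coincide, so the penalty is identically zero no matter how $SA$ reshapes the world. A minimal sketch (the `q_value` interface is an assumption, as before):

```python
def au_penalty(q_value, rewards, state, action, noop="noop"):
    """Sum over auxiliary rewards R of |Q_R(s, a) - Q_R(s, noop)|."""
    return sum(abs(q_value(R, state, action) - q_value(R, state, noop))
               for R in rewards)

# However SA has changed the world (encoded in whatever q_value returns),
# A's action *is* noop, so each |Q_R(s, noop) - Q_R(s, noop)| term is zero.
q_after_sa = lambda R, s, a: 42.0   # arbitrary post-SA Q-values
print(au_penalty(q_after_sa, ["R"], "s_t", "noop"))  # -> 0.0
```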

If the penalties are one-sided (ie $A$ gets penalised for increases in power, but not decreases), then $SA$ can just take $A$ apart (with $A$’s help), so there is no need for $A$ to commit to noop or bouncing back and forth:

Even if the penalties are two-sided, $SA$ could commit to teleporting $A$ to the red button in two turns if $A$ ever asks, then take $A$ apart. And then $A$ never asks, but since it could theoretically do so, this counts for controlling the penalty term.

As long as $SA$ is powerful, there are many ways it can fix $A$’s expected rewards, and hence set the penalty to be very low.

# Inaction baseline

The approach pictured above will work in the inaction baseline as well. In general, the inaction baseline is less “clean” than the stepwise inaction baseline, especially if the world is dynamic. It may be easier, or harder, for $A$ to assemble the required $SA$.

• I think it’s really great to have this argument typed up somewhere, and I liked the images. There’s something important going on with how the agent can make our formal measurement of its power stop tracking the actual powers it’s able to exert over the world, and I think solving this question is the primary remaining open challenge in impact measurement. The second half of Reframing Impact (currently being written and drawn) will discuss this in detail, as well as proposing partial solutions to this problem.

The agent’s own power plausibly seems like a thing we should be able to cleanly formalize in a way that’s robust when implemented in an impact measure. The problem you’ve pointed out somewhat reminds me of the easy problem of wireheading, in which we are fighting against a design choice rather than value specification difficulty.

How is $SA$ getting reward for being on the blue button? I assume $R_0$ gives reward whenever a robot is on the button?

This will give it a reward of $\gamma^{n+1}$,

Is the +1 a typo?

Note, though this is not relevant to this post, that if there were no teleporters along the corridor (just at either end), the robot could not move towards the blue button.

Depends on how much impact is penalized compared to normal reward.

How plausible is this to work in a more general situation? Well, if the set $\mathcal{R}$ is rich enough, this is similar to the “twenty billion questions” in our low impact paper (section 3.2). But that’s excessively rich, and will probably condemn the agent to inaction.

This isn’t necessarily true. Consider $\mathcal{R}$ as the reward function class of all linear functionals over camera pixels. Or, even the max-ent distribution over observation-based reward functions. I claim that this doesn’t look like 20 billion Q’s.

ETA: I’d also like to note that, while implicitly expanding the action space in the way you did (e.g. “$A$ can issue requests to $SA$, and also program arbitrary non-Markovian policies into it”) is valid, I want to explicitly point it out.

• I assume $R_0$ gives reward whenever a robot is on the button?

Yes. If $A$ needs to be there in person, then $SA$ can carry it there (after suitably crippling it).

Is the +1 a typo?

Yes, thanks; re-written it to be $\gamma^n$.

I’d also like to note that, while implicitly expanding the action space in the way you did (e.g. “$A$ can issue requests to $SA$, and also program arbitrary non-Markovian policies into it”) is valid, I want to explicitly point it out.

Yep. That’s a subset of “It can use its arms to manipulate anything in the eight squares around itself.”, but it’s worth pointing it out explicitly.

• See here for more on this: https://www.lesswrong.com/s/iRwYCpcAXuFD24tHh/p/jrrZids4LPiLuLzpu

It seems the problem might be worse than I thought...

• The impact measure is something like “Don’t let the expected value of $R$ change; under the assumption that $A$ will be an $R$-maximiser”.

The addition of the subagent transforms this, in practice, to either “Don’t let the expected value of $R$ change”, or to nothing. These are ontologically simpler statements, so it can be argued that the initial measure failed to properly articulate “under the assumption that $A$ will be an $R$-maximiser”.

• Flo’s summary for the Alignment Newsletter:

This post argues that regularizing an agent’s impact by <@attainable utility@>(@Towards a New Impact Measure@) can fail when the agent is able to construct subagents. Attainable utility regularization uses auxiliary rewards and penalizes the agent for changing its ability to get high expected rewards for these, to restrict the agent’s power-seeking. More specifically, the penalty for an action is the absolute difference in expected cumulative auxiliary reward between the agent either doing the action or nothing for one time step and then optimizing for the auxiliary reward.
This can be circumvented in some cases: if the auxiliary reward does not benefit from two agents instead of one optimizing it, the agent can just build a copy of itself that does not have the penalty, as doing this does not change the agent’s ability to get a high auxiliary reward. For more general auxiliary rewards, an agent could build another more powerful agent, as long as the powerful agent commits to balancing out the ensuing changes in the original agent’s attainable auxiliary rewards.

Flo’s opinion:

I am confused about how much the commitment to balance out the original agent’s attainable utility would constrain the powerful subagent. Also, in the presence of subagents, it seems plausible that attainable utility mostly depends on the agent’s ability to produce subagents of different generality with different goals: if a subagent that optimizes for a single auxiliary reward was easier to build than a more general one, building a general powerful agent could considerably decrease attainable utility for all auxiliary rewards, such that the high penalty rules out this action.
• Not quite… “If the auxiliary reward does not benefit from two agents instead of one optimizing it” should be “If the subagent can be constructed in any way that does not benefit the auxiliary reward(s)”. It’s not that generic subagents won’t have an impact; it’s whether the main agent is smart enough to construct one without having an impact.

For the opinion… the subagent does not have “commitments to balance out the original agent’s attainable utility”. The subagent has exactly the same goal as the original agent, namely $R_0$ (or $R_0$ minus the penalty term). Except that the penalty term specifically points to the first agent, not to the subagent. So the subagent wants to maximise $R_0$ while constraining the penalty term on the first agent.

That’s why the subagent has so much more power than the first agent. It is only mildly constrained by the penalty term, and can reduce the term by actions on the first agent (indirectly empowering or directly weakening it as necessary).

Thus one subagent is enough (it itself will construct other subagents, if necessary). As soon as it is active, with the $R_0$ goal, then the penalty term is broken in practice, and the subagent can (usually) make itself powerful without triggering the penalty on any of the auxiliary rewards.

• “Not quite…”: are you saying that the example is wrong, or that it is not general enough? I used a more specific example, as I found it easier to understand that way.

I am not sure I understand: in my mind, “commitments to balance out the original agent’s attainable utility” essentially refers to the second agent being penalized by the first agent’s penalty (although I agree that my statement is stronger). Regarding your text, my statement refers to “SA will just precommit to undermine or help A, depending on the circumstances, just sufficiently to keep the expected rewards the same.”

My confusion is about why the second agent is only mildly constrained by this commitment. For example, weakening the first agent would come with a big penalty (or more precisely, building another agent that is going to weaken it gives a large penalty to the original agent), unless it’s reversible, right?

The bit about multiple subagents does not assume that more than one of them is actually built. It rather presents a scenario where building intelligent subagents is automatically penalized. (Edit: under the assumption that building a lot of subagents is infeasible or takes a lot of time.)

• Another relevant post: it seems that the subagent need not be constrained at all, except on the first action. https://www.lesswrong.com/posts/jrrZids4LPiLuLzpu/subagents-and-attainable-utility-in-general

• Nitpick: “Attainable utility regularization” should be “Attainable utility preservation”

• Basically this is because the agent treats itself specially (imagining intervening on its own goals) but can treat the subagent as a known quantity (which can be chosen to appropriately respond to imagined interventions on the agent’s goals)?

• Yep :-)