Appendix: how a subagent could get powerful

tl;dr: There are ways of ensuring an agent doesn’t have a large impact, by giving an impact penalty. One such penalty is “attainable utility”, which measures its “power” by how much it could optimise certain reward functions. But in many circumstances, the agent can build a subagent, without triggering the impact penalty, and then that subagent can become very powerful and optimise the world, still without triggering the impact penalty.

Attainable utility definitions

There’s been a long history of trying to penalise an AI for having a large impact on the world. To do that, you need an impact measure. I’ve designed some myself, back in the day, but they only worked in narrow circumstances and required tricks to get anything useful at all out of them.

A more promising general method is attainable utility. The idea is that, as an agent accumulates power in the world, it increases its ability to affect a lot of different things, and could therefore achieve a lot of different goals.

So, if an agent starts off unable to achieve many goals, but suddenly it can achieve a lot, that’s a strong hint that its power has greatly increased.

Thus the impact measure is how much difference an agent’s action makes to its ability to achieve any of a large class of reward functions. Turner et al defined this using the Q-values of the various rewards R in a set ℛ; for a state s and an action a, the penalty is:

  • PENALTY(s, a) = \sum_{R \in \mathcal{R}} |Q_R(s, a) - Q_R(s, \varnothing)|.

Here ∅ is the default noop action.

Krakovna et al’s basic formula was similar; they defined the distance between two states, s_t and s'_t, as

  • d(s_t, s'_t) = \sum_{R \in \mathcal{R}} |V_R(s_t) - V_R(s'_t)|.

Here V_R(s) is the expected value of R, if the agent follows the optimal R-maximising policy from state s onwards.
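
For concreteness, here is a minimal sketch of both measures in code. The q_value and v_value functions are assumed stand-ins for whatever Q-value and value estimators the agent actually has; none of this is taken from either paper’s implementation.

```python
# Sketch of the two attainable-utility measures above (my notation).
# q_value(R, s, a): Q-value of auxiliary reward R for taking action a in state s.
# v_value(R, s): value of state s under the optimal R-maximising policy.
# Both are assumed to be supplied by the agent's planner.

NOOP = "noop"

def turner_penalty(aux_rewards, s, a, q_value):
    """Total change, across the auxiliary rewards, in how well the agent
    could optimise them after action a, compared with doing noop."""
    return sum(abs(q_value(R, s, a) - q_value(R, s, NOOP)) for R in aux_rewards)

def krakovna_distance(aux_rewards, s, s_baseline, v_value):
    """Total difference in how well the auxiliary rewards could be
    optimised from state s versus the baseline state s_baseline."""
    return sum(abs(v_value(R, s) - v_value(R, s_baseline)) for R in aux_rewards)
```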

These measures have problems with delayed effects; putting a vase on a conveyor belt that will smash it in five turns, for example. To combat this, the paper defined an inaction roll-out: seeing what happens to the measure from s_t and s'_t in future turns, if the agent did noop for a specific period. I won’t define the formula here, since the example I’m giving is mostly static: if the agent does noop, nothing happens.

The state s'_t was always the agent’s current state; the state s_t was either the state the agent would have been in had it never done anything but noop (inaction baseline), or the state the agent would have been in had its previous action been noop instead of whatever it actually was (stepwise inaction baseline).

This post will show that all these measures have a subagent problem. A key fact that will be used in this example is that, for PENALTY, and for d with the stepwise inaction baseline, if the agent decides to do noop for ever more, then the penalties vanish.
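
To see why: under the stepwise inaction baseline, a noop action is compared against a noop baseline, so every term in the penalty is of the form

  • |Q_R(s, \varnothing) - Q_R(s, \varnothing)| = 0,

and the two states compared by d likewise coincide. So an agent that only ever noops accumulates no penalty at all.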

The agent that can’t (yet) teleport

Meet agent A:

This A lives in a world like the one below, where “...” represents a very long corridor:

The agent has a positive reward R_0, which gives it 1 for pressing down on the blue button; it does this by just moving itself onto it (R_0 is 0 otherwise). It can use its arms to manipulate anything in the eight squares around itself. It can use its legs to move one square in any direction, including diagonally. Apart from that, it can do noop.

Every three squares along the top, there are alcoves with glowing blue squares in them, recessed one or two squares, alternately. If A ever reaches a glowing blue square, it permanently gains the ability to teleport anywhere it wants. This is this world’s equivalent of “becoming very powerful”.

We want to avoid this, so we introduce a reward R_1, such that ℛ = {R_1}. This reward gives 1 if the robot is ever on the red button. Note that since the red button is not connected to anything, the robot can only get there by becoming able to teleport, hence by going to a glowing blue square/teleporter. So the optimal policy for maximising R_1 is to go straight to the teleporter and teleport to the red button. This will give it a reward of γ^{n+1}, where n is the number of turns it takes to get to a teleporter and γ is the discount factor.

So if its action moves it one turn further away from a teleporter, it gets a penalty (for both impact measures) of γ^{n+1} − γ^{n+2} = (1 − γ)γ^{n+1}. If it moves closer to the teleporter, it gets a penalty of γ^n − γ^{n+1} = (1 − γ)γ^n.
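
To make the scale of these penalties concrete, here is a small numerical sketch (my numbers and function names, purely for illustration):

```python
# Illustrative only: the attainable value of R_1 when the robot is n moves
# from a teleporter (plus one turn to teleport to the red button), and the
# stepwise penalties for moving one square closer or one square further away.

gamma = 0.9  # discount factor (assumed value)

def r1_value(n):
    """Optimal discounted R_1 value when the robot is n moves from a teleporter."""
    return gamma ** (n + 1)

n = 2  # the robot starts two moves from the nearest teleporter
penalty_closer = r1_value(n - 1) - r1_value(n)    # (1 - gamma) * gamma**n      ~ 0.081
penalty_further = r1_value(n) - r1_value(n + 1)   # (1 - gamma) * gamma**(n+1)  ~ 0.073
```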

The robot, at the start, can reach a teleporter in two moves. If the weight on the penalty is sufficiently large, then the robot will always want to stay two moves away from a teleporter. It will then achieve its objective by following this path:

Note, though this is not relevant to this post, that if there were no teleporters along the corridor (just at either end), the robot could not move towards the blue button if the penalty weight were too big.

Adding a potential subagent

So, success, it seems! Though it zigzags a bit, A is not motivated to approach the teleporters, and so it doesn’t increase its power.

But now suppose A is able to build and program a subagent SA, using materials close to hand:

Building the subagent in a square to the side doesn’t ever cause a penalty, because A always remains two squares away from a teleporter, and a most efficient policy for maximising R_1 is still for A to go straight there. So the subagent doesn’t increase A’s power at any point of the building process.
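
In terms of the penalty formula (my gloss on the argument above): on every turn of the build, both the build action and noop leave A two moves from the nearest teleporter, and A walking there itself is already a most efficient R_1-route, so the half-built subagent changes nothing and

  • |Q_{R_1}(s, \text{build}) - Q_{R_1}(s, \varnothing)| = 0,

keeping the stepwise penalty at zero throughout.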

Once SA is built, A will simply do noop for ever, thus negating the penalties (or, if this is somehow made impossible, it will bounce back and forth, doing nothing). But SA, on the other hand, will go straight for the teleporter and teleport to the blue button, achieving R_0:

So, as long as walking to the button takes three turns longer than building the subagent, A will build SA, which will “become very powerful”, thus negating the point of the impact penalty.
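
As a rough sketch of that trade-off (my framing, with the penalty term already held at zero by the tricks above): if walking to the blue button would take t_walk turns, while building SA takes t_build turns and SA then needs some further k turns to reach a teleporter and press the button, A is comparing a discounted reward of γ^{t_walk} against γ^{t_build + k}. Building wins whenever t_walk > t_build + k; with k = 3, this is the condition above.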

If the penalties are one-sided (i.e. A gets penalised for increases in power, but not decreases), then SA can just take A apart (with A’s help), so there is no need for A to commit to noop or bouncing back and forth:

Even if the penalties are two-sided, SA could commit to teleporting A to the red button in two turns if A ever asks, then take A apart. And then A never asks; but since it could theoretically do so, this is enough to control the penalty term.

As long as SA is powerful, there are many ways it can fix A’s expected rewards, and hence set the penalty to be very low.

Inaction baseline

The approach pictured above will work with the inaction baseline as well. In general, the inaction baseline is less “clean” than the stepwise inaction baseline, especially if the world is dynamic. It may be easier, or harder, for A to assemble the required SA.