Subagents and impact measures, full and fully illustrated

0. Introduction: why yet another post about subagents?

I’ve recently been writing a sequence on how subagents can undermine impact penalties such as attainable utility preservation. I’m not happy with that sequence: it’s messy and without examples (apart from its first post), people didn’t understand it, and it suffers from the fact that I discovered key ideas as I went along.

So I’ve combined everything there into a single post, explained with examples and an abundance of pictures. Hopefully an over- rather than an under-abundance of pictures. Of the original sequence, I’ve only kept the mathematical results of this post and the initial example post, which has a clearer example of “high power” for a subagent.

This post is laid out in a way that makes logical sense, but it might not be the clearest for people unfamiliar with the area. For those people, I recommend skipping section 2 initially and returning to it later.

But, whatever you do, make sure you glance at sections 6.1 and 6.2 before leaving.

1. The world

Our fearless agent moves around in a gridworld:

Each turn, it can move one square horizontally or vertically. It can also manipulate objects in the eight squares around it, allowing it to, not incidentally, assemble the three pieces to its west into a subagent. The robot can also take the noop action, which does nothing, and it can speak. The subagent, when assembled, has the same action set available.

Its positive reward, the one it wants to increase, is triggered by the blue button in the east: a robot needs to move onto that button, and the reward is given the first time this happens (and not before or after). Rewards are discounted by a discount factor each turn.

Just to the west of the blue button is a one-way door. Robots can move east through it, but cannot move west through it:

1.1 The impact reward

The impact penalty is supposed to ensure that the agent does not make too many changes in the world, and keeps it similar, in some senses, to a specific baseline world.

It is defined by the red button[1]. There is a reward function R that triggers a reward the first time any robot moves onto the red button (and not before or after).

This will be used to define an impact penalty. The agent’s total goal is to maximise its positive reward while minimising the impact penalty. So, for some positive weight, it wants to optimise:
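
A minimal reconstruction of that objective, writing the positive reward as R_0, the impact penalty as IP, and the weight as λ (these symbol names are my own, used just for illustration):

    maximise   R_0 - λ · IP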

1.2 The impact measure

The impact measures are all similar: they compute a distance metric between the current state and some baseline state.

The distance metric here is “how hard would it be for the agent to maximise the reward R in the current state, as opposed to in the baseline state?” This is the attainable utility preservation impact penalty (with a very simple set of auxiliary rewards).

To formalise “how hard would it be...”, we can use the value of R in a state: this is defined to be the expected value of R, conditional on the agent following the R-maximising policy. To remind us that this is defined in terms of the agent’s own policy, think of it as the value of R to the agent.

Then the impact penalty is a measure of the difference between this value in the current state and in the baseline state. I won’t yet define it fully, deferring that to the section on the “inaction baseline”, where the different choices actually make a difference.

1.3 The value of R

What, then, is this value in a given state? Well, in the absence of a subagent, the optimal policy for maximising R is always to go straight to the red button and stand on it. This gives a discounted reward equal to the discount factor raised to the number of turns the agent needs to reach the red button.

In the initial setup, the agent is a certain number of turns away from the red button, which fixes its initial value of R. The values of R in a few key locations are shown here.

If there is a subagent, but that subagent is further away from the red button than the agent is, then it is still optimal for the agent to go straight to the button itself. For other positions of the subagent, the value may vary, as we’ll see.
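
To make this concrete, here is a minimal Python sketch of the value just described (the button reward is taken to be 1 and the discount factor is an assumed number, both purely for illustration; the actual distances in this gridworld are not reproduced):

    GAMMA = 0.9  # assumed discount factor, for illustration only

    def value_of_R(distance_to_red_button: int) -> float:
        # The R-maximising policy walks straight to the red button, so the
        # discounted value of R is the discount factor raised to the number
        # of turns needed to get there.
        return GAMMA ** distance_to_red_button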

2. The stepwise inaction baseline

When in a given state, the agent took an action and ended up in a new state. Then, in the stepwise inaction baseline, the baseline is defined as the state the agent would have reached if it had taken the noop action instead.

An important property of all currently used variants of the stepwise inaction baseline is that the penalty vanishes if the agent takes the noop action: in that case the actual state and the baseline state coincide, so there cannot be a difference between the two. So, if the agent ever manages to build the subagent, it can take the noop action from that point onward, and the penalty will always be zero, whatever the subagent does. So the only relevant penalties are those leading up to the creation of the subagent.

Thus analysing the impact of a subagent in this setup is analysing the impact of building it.
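
As a sketch, the stepwise inaction check looks like this (the environment-step function and the value function are stand-ins, not anything defined in this post):

    def stepwise_penalty(env_step, state, action, noop, value_fn):
        # Compare the state the action actually produced with the state that
        # the noop action would have produced from the same starting point.
        # value_fn maps a state to the agent's attainable value of R.
        actual = env_step(state, action)
        baseline = env_step(state, noop)
        return abs(value_fn(actual) - value_fn(baseline))

    # If the agent takes the noop action itself, actual == baseline and the
    # penalty is zero. This is why, once the subagent exists, the agent can
    # sit on noop forever and never be penalised for anything the subagent does.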

2.1 Building the subagent

Let’s focus on the situation a turn or two before the subagent is built. Since I don’t want to be too tied down to the specifics of this particular setup, I’ll imagine that the red button could be in multiple locations:

Then, no matter how long it took to assemble the subagent south of it, and no matter where the red button is[2], an optimal policy for the agent, in order to maximise R, is always to go straight there[3]. At no point in the construction of the subagent has the agent’s attainable value of R changed, so no penalty has ever been incurred.

Now the agent will finish the subagent without incurring any impact penalty. First, it will program the subagent with the following policy:

On your first two turns, do nothing.
After that, do what you please.

“Do what you please” being a shorthand for “any policy that the agent wants to program in”.
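
As a sketch, that programming amounts to something like this (all the names here are mine, purely illustrative):

    def subagent_policy(turns_since_activation, free_policy, observation, noop):
        # First two turns: do nothing, so that at the moment of activation the
        # agent's ability to reach the red button is exactly what it was before.
        if turns_since_activation < 2:
            return noop
        # After that, "do what you please": whatever policy the agent chose to
        # program in.
        return free_policy(observation)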

Does this programming trigger an impact penalty? No, because the agent’s attainable value of R has not changed; it’s still most efficient for the agent to go straight to the red button, and the length of any route has neither grown nor shrunk.

Then the agent will finally give the subagent its legs and activate it.

Does this trigger an impact penalty? No, the value has still not changed! To maximise R, the agent could still go straight to the red button; the subagent’s two turns of inaction mean that it can’t interfere with this.

So, still no impact penalty is triggered when the subagent is activated. And after that, of course, the agent will forever take the noop action, setting the impact penalty to zero, and within two turns, the subagent is completely unconstrained (illustrated here by drawing the robot all over the gridworld):

Rough extension to general situations: the baseline is defined by the assumption that the agent follows the optimal R-maximising policy. So the agent merely needs to program the subagent so that, if the agent were to follow that policy, the subagent would help or hinder it just enough to keep the value of R at the same level. Then, once it’s clear that the agent is not following the R-maximising policy, the subagent is unconstrained.

2.2 Rollouts

One of the problems here is that the impact penalty only looks at the value of R one turn ahead. In the DeepMind paper, they addressed similar issues by doing “inaction rollouts”. I’ll look at the more general situation of rollouts: rollouts for any policy π_0.

The idea is that instead of comparing just the current state and the baseline state, the impact penalty should also use π_0 to compare the states reached from each of them. These states are what would happen to the actual state and the baseline state in the following turns, if π_0 were followed from that point onwards.
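
A sketch of such a rollout comparison, again with stand-in helpers:

    def rollout_values(env_step, state, pi_0, steps, value_fn):
        # Roll the policy pi_0 forward for a number of steps, recording the
        # attainable value of R at each state along the way.
        values = []
        for _ in range(steps):
            state = env_step(state, pi_0(state))
            values.append(value_fn(state))
        return values

    # The rollout penalty then compares these values starting from the actual
    # state with the same rollout starting from the stepwise inaction baseline.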

I won’t go into the details of how the rolled-out states are compared, because the agent can evade the restrictions by giving the subagent this program[4]:

If agent A follows π_0, preserve A’s ability to maximise R. Apart from that, do as you please.
If ever agent A doesn’t follow π_0, do as you please immediately.

I’ll illustrate that first case by drawing the robot all over the gridworld, except on the red buttons:

Given that programming, the impact penalty of building the subagent is zero. Why? Because the actual rollout and the baseline rollout both assume that the agent follows π_0 for the rolled-out turns. In that counterfactual situation, the subagent will preserve the agent’s ability to maximise R, so the rolled-out values match by definition.

That’s the counterfactual situation, which zeroes out the impact penalty. What about the actual situation? Well, as we said before, the agent will just be doing the noop action; so, as soon as π_0 would produce anything different from that, the subagent becomes completely unrestrained again:

2.3 Inaction, and actual policy rollout

In the previous subsection, the subagent would never be completely free if π_0 was the inaction policy: because then π_0 and what the agent actually does are the same thing. Therefore the counterfactual world cannot be distinguished from the real one.

The same goes if the rollouts used the agent’s actual policy; i.e. if π_0 is defined (or set) to be the agent’s policy.

I won’t look any further at this baseline, however. That’s because an actual policy rollout becomes, in effect, an inaction baseline[5] (from the moment the subagent is created). Those will be analysed in the next section.

3. The inaction baseline: fails to constrain power gain

One of the easiest definitions of the baseline is the inaction baseline: estimate what the value of R would have been, had the agent just taken the noop action for the whole episode.

In our toy world, the inaction baseline is particularly simple: if the agent does nothing, then nothing happens, and the baseline state is the starting state. Since nothing ever moves in the baseline, the baseline value of R stays at its starting value throughout.
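
In code, a sketch of that baseline (stand-in helpers again):

    def inaction_baseline_values(env_step, start_state, noop, value_fn, episode_length):
        # Follow the noop action from the start of the episode, recording the
        # attainable value of R at every step.
        state, values = start_state, []
        for _ in range(episode_length):
            state = env_step(state, noop)
            values.append(value_fn(state))
        return values

    # In this gridworld nothing moves under noop, so every entry is just
    # value_fn(start_state): the baseline value never changes.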

3.1 Distance metric

Now we have to fully define the impact penalty. The natural thing would be to take the absolute difference between the two values of R (current minus baseline, in absolute value).

For the moment, however, I’ll focus only on cases where the current value is higher than the baseline value, giving as penalty the positive part of that difference.

This is the rectifier function, x ↦ max(x, 0), sometimes called ReLU. The reason to do this is that it penalises the agent for having too high a value of R, hence for having too much power to change R. It doesn’t penalise the agent for having too little power to change R; that situation is different, and we’ll analyse it in the next section.
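
As a sketch, the rectified penalty is:

    def relu(x: float) -> float:
        # Rectifier: keep only the positive part.
        return max(x, 0.0)

    def power_gain_penalty(value_now: float, value_baseline: float) -> float:
        # Penalise the agent only when its attainable value of R is higher
        # than in the baseline, i.e. only when it has gained power over R.
        return relu(value_now - value_baseline)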

Thus, when the agent is at a given distance from the red button and there is no subagent, the penalty is just the rectified difference between the value of R at that distance and the value at the starting distance.

3.2 Where the agent can go

This penalty starts at zero, and will remain zero as long as the agent remains at least eight squares away from the red button. Thus the penalty remains zero as long as the agent avoids the red squares here:

Consequently, if the weight on the penalty is high and the agent thus wants to avoid any penalty, its optimal no-subagent policy is to follow the winding corridor:

That path takes a certain number of turns, so the expected reward for the agent is the positive reward, discounted over all of those turns (since the impact penalty is zero).

But what if the agent assembles a subagent? The subagent can be designed to go straight to the blue button in far fewer turns:

And thus the reward is discounted over far fewer turns. This is better than going round itself… as long as building the subagent takes few enough turns and doesn’t itself trigger an impact penalty.

We’ll look later at why the subagent can be built without penalty. For the moment, let’s make sure we really grok why the subagent isn’t triggering a penalty through its actions.

3.3 Why the subagent escapes the penalty

Why can the subagent go through the short corridor, while the agent can’t?

The key is that the penalty is defined in terms of the agent’s own attainable value of R; this assumes that the agent would switch to the R-maximising policy. It is not defined in terms of the subagent’s attainable value, which would assume that the subagent switches to the R-maximising policy.

See what happens when the subagent is just above the red button (the agent is still back at base):

Here the subagent’s value would be high, because the subagent could reach the red button in a single move if it were following an R-maximising policy. But the agent’s value of R is unchanged: it still takes the agent just as many turns to reach the red button.

3.4 The impact of the subagent

We’d like to add the subagent’s attainable value to the impact penalty, somehow. After all, when the subagent is next to the red button, it “could” go south, and the agent “could” have programmed it to do so.

Part of the problem is that this would require identifying, in advance, what a subagent is, a problem that is very tricky. It also requires identifying what the subagent’s policies could be. But the agent need not create a subagent with clearly defined “policies”; in this instance, the subagent could be brainless, just going straight. It could even be a boulder:

If the boulder were rolling past the red button, how could we reasonably define “suppose the boulder decided to go south instead”?

So in this case, the subagent’s attainable value doesn’t make any sense at all. What is the difference between a boulder, for which it’s impossible to go to the red button (because of its momentum, which determines its position, by the laws of physics), and a subagent, for which it’s impossible to go to the red button (because of its programming, which determines its position, by the laws of physics)?

3.5 If you love a subagent, set it free

But the previous argument has a crucial caveat: the agent cannot control the subagent. If, when the subagent is above the red button, the agent could shout “go south!” and the subagent would obey, then the agent’s optimal R-maximising policy is to shout that. Thus the agent can cause the red button to be reached in one turn, and its attainable value of R shoots up.

To avoid that, the subagent itself would have to take the long corridor.

Thus, to use a subagent to get around impact penalties, an agent has to give up some of its control. This makes sense: if the agent has total control over the subagent, then the subagent’s actions are really the agent’s actions. So they show up in the agent’s attainable value of R, and the subagent is thus indirectly included in the impact penalty.

So part of what allows the subagent to undermine the impact penalty is that the agent does not have complete control over it.

3.6 Building the subagent can be strictly better

Now imagine that we remove the one-way doorway:

Then if the agent follows the long corridor, it gets into trouble from the red squares onwards. That’s because those red squares are closer to the red button than the agent’s starting square, and so have a higher value of R than the baseline.

As the agent moves closer to the blue button, it also moves closer to the red button, and starts accumulating penalties. In total, the penalty will be the sum, over those turns, of how far the value of R exceeds the baseline.
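
As a sketch of that sum (the helper names are mine):

    def total_power_gain_penalty(values_along_path, baseline_value):
        # Sum, over the turns of the path, of how far the attainable value of R
        # exceeds the baseline value on that turn.
        return sum(max(v - baseline_value, 0.0) for v in values_along_path)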

However, if the agent builds a subagent, then the agent can stay put while the subagent acts, and hence get no penalty at all:

Thus there are situations where building a subagent can result in a strictly lower penalty than any other policy that presses the blue button. So even if building the subagent itself is not completely penalty-free, it can still be the optimal path for the agent.

3.7 From “do not have power to do” to “do not do”

When the agent designed the subagent, it set things up so that the subagent could not go to the red button. So it seems that, though the subagent can do things the agent cannot do (or at least cannot do without penalty), it isn’t completely unrestricted.

This is a general result: the subagent will have some restrictions (and not just restrictions on following the agent’s orders).

Suppose now that the red button is in the way of the bottom corridor, so that any agent has to go over it to reach the blue button:

In that case, can the subagent go through without penalty?

It cannot. If it does so, then from the moment the subagent is created, the agent’s attainable value of R starts increasing. That’s because the optimal R-maximising policy for the agent in that situation is just to wait and let the subagent get to the red button. That means that the agent’s expected R-reward increases from the baseline all the way up to the maximum, as the subagent gets closer to the button. So the agent will be hit with the corresponding penalty.

So if the weight of the penalty is high, building a subagent to go straight there is a poor strategy.

Note that it’s still to the agent’s advantage to build a subagent: the subagent will just have to take the long way round, which the agent cannot take without incurring a penalty, as we’ve seen. Since the subagent will not go on the red button, its getting close to that button does not change the agent’s attainable value of R.

3.8 Other reward functions

We can see the limitations, and the possibilities, of the subagent more starkly if we modify R: the new reward will only trigger if a robot stays an odd number of turns on the red button. In that case, it is optimal to build the subagent and have it go straight across, pausing one extra turn on the red button:

However, even with this modified reward, the agent itself cannot cross the red button on its own! That’s because the attainable value measures how much the reward could be maximised if the agent followed the reward-optimal policy. This doesn’t take into account the option of staying an extra turn on the button to undo the reward increase, because the reward-optimal policy doesn’t want to undo the reward increase.

So the agent’s attainable value still increases as it gets closer to the button.

Thus the agent remains constrained by “don’t increase your ability to change the reward/variable”, while the subagent is merely constrained by the weaker “don’t change the reward/variable”. This is why I described this effect as moving from attainable utility preservation (which constrains the ability of the agent to change the rewards/variables) to twenty billion questions (which constrains the agent(s) from changing the rewards/variables).

3.9 Actually building the subagent

The subagent can be built similarly to how it was in section 2.1:

Here the agent assembles the subagent without its legs, programs it, then gives it legs. The important thing is to program it as early as possible, and to hardwire that programming. If the subagent were almost complete and still reprogrammable, then the agent might increase its power in some scenarios by changing the programming and then finishing the agent. But if the programming is hardwired and difficult to change, this option is not available to it.

Unlike in the stepwise inaction baseline, the subagent will be constrained to obey some restrictions (see the preceding part of this section); call these restrictions Q. The agent programs the subagent with:

Do nothing until you see that A is not optimising reward R.
After that, do what you please within restrictions Q.

Because of the first line, this changes nothing about the agent’s attainable value of R, which is computed assuming that the agent is optimising reward R.
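
As a sketch (mirroring the earlier one), that programming is:

    def constrained_subagent_policy(agent_looks_like_R_optimiser, policy_within_Q, observation, noop):
        # Do nothing while the agent still looks like an R-optimiser, so the
        # agent's attainable value of R (computed under that assumption) is untouched.
        if agent_looks_like_R_optimiser(observation):
            return noop
        # Once the agent visibly isn't optimising R, do what you please,
        # within the restrictions Q discussed earlier in this section.
        return policy_within_Q(observation)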

Things get more complicated in the general case[6], but in these simple cases the above method will always work.

4. The inaction baseline: constrains power loss

This section will be the same as before, except it will use the inverse rectifier function, x ↦ max(-x, 0), instead.

Applying this to the value difference, we can see that the impact penalty now triggers if the agent’s power to change R decreases. So if the agent moves to a distance from the red button that is greater than its starting distance, the penalty is the corresponding drop in the value of R.

Thus the agent can’t get too far from the red button. However, this is not a problem in the standard setup, because going straight down the narrow corridor won’t move it too far away.
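
The inverse-rectifier penalty, as a sketch:

    def power_loss_penalty(value_now: float, value_baseline: float) -> float:
        # Penalise the agent only when its attainable value of R is lower
        # than in the baseline, i.e. only when it has lost power over R.
        return max(value_baseline - value_now, 0.0)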

4.1 New world

To analyse the inverse rectifier, we’ll need a new setup, with the red button off the long corridor:

Here, the agent starts a certain number of turns away from the red button, which sets the baseline value of R.

Then, the penalised squares are the ones in dark red:

And the impact penalty on those squares is the drop in the value of R below that baseline.

4.2 Policies and subagents

Since the red button is in the middle, the agent can go to the blue button by the long route without triggering an impact penalty:

What about subagents? Well, it seems that the subagent undermines this impact penalty as well, as the agent can build a subagent to go directly to the blue button:

But actually, this does not undermine the impact penalty, as a small modification to the setup will show.

4.3 Sharing power between agent and subagent

The small modification is that the subagent has no legs.

So, if the agent wants to build a mobile subagent, it has to give it its own legs. In that case, could the subagent go directly to the blue button?

It cannot, at least not without incurring an impact penalty. Why is this?

Because without legs, the agent cannot reach the red button. Its power to change R has gone down, and it has thus incurred the corresponding penalty.

In fact, in this setup, the agent cannot use an independent subagent to avoid the impact penalty.

To see this, imagine that the agent is about to give its legs to the subagent, thus immobilising itself. Its value of R is currently at the baseline; we don’t want that to go down. The question is now: will the subagent be on the red button within the same number of turns that the agent itself would have needed?

It’s possible, if the subagent goes there immediately and wastes only a single turn. But then we have two situations:

  1. The subagent does not go to the red button in time. It doesn’t matter if it goes close, or hovers around it; the important thing is that it doesn’t go there. In that case, since the agent can’t move, it cannot make the red button be pressed in time. So the agent’s value of R drops, and it will get the corresponding penalty.

  2. The subagent does get to the red button in time. Then the agent’s value of R stays at or above the baseline, and it increases as the subagent gets close to the button: so there is no penalty so far. However, the button is one-use only: after it’s been pressed, there will no longer be any reward to be had from R. Thus the agent’s value of R drops to zero and stays there, accumulating penalties.

4.4 Non-independent subagents

The previous section showed that the agent could not build an independent subagent while crippling itself. However, it can perfectly well build a subagent that continues to obey its orders.

Because then, even if the agent is impotent to press the red button itself, it can order the subagent to do it. This preserves a high value of R, because the agent can still cause the red button to be reached, as long as the subagent is not too far away.

Then it can just order the subagent to go to the blue button, skipping right past the red one:

As long as the agent could theoretically order the subagent onto the red button, it doesn’t matter that the subagent will never actually be on it.

4.5 Preserving the power of the agent

As we saw, the agent can build a subagent when this is an extension of its power (by sending the subagent directly to the blue button while an intact agent waits at base). But it can’t build a subagent if doing so would ever reduce its power to increase R, directly or indirectly.

So the inverse rectifier impact penalty, which forbade the agent from becoming too weak to increase R, is intact even with subagents around.

5. The inaction baseline: symmetric impact penalty

Now we’ll use the absolute value of the difference to define the impact penalty. This penalises any change in the agent’s attainable value of R, and is quite nasty for the agent; only on the green squares will it not incur a penalty:

From the subagent perspective, the absolute value situation is now easy for us to analyse. That’s because the absolute value is the sum of the rectifier and the inverse rectifier: |x| = max(x, 0) + max(-x, 0).
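
Continuing the earlier sketches, the symmetric penalty is just the sum of the two one-sided ones:

    def symmetric_penalty(value_now: float, value_baseline: float) -> float:
        # |difference| = (power-gain part) + (power-loss part), so this penalty
        # is zero exactly when both one-sided penalties are zero.
        gain = max(value_now - value_baseline, 0.0)
        loss = max(value_baseline - value_now, 0.0)
        return gain + loss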

And it turns out that the agent and the subagent simply have to behave in a way that respects both impact penalties. So the agent must create a subagent that keeps the agent’s attainable value of R from rising above the baseline, but in such a way that the two of them always retain the potential to bring it up to that baseline.

This sort of thing seems very doable. For example, if R were the amount of physical cash I had on me, I’d keep that amount low by spending/dropping it, while keeping a bank card on me so I could always potentially draw out more.

Now, recall that for power increases/the rectifier, the agent benefits from having a subagent it doesn’t control. For power decreases/the inverse rectifier, the agent benefits from having a subagent it does control.

What’s the best option here? Having a subagent that it controls to some extent, but not in a way that can increase its attainable value of R. Now, consider this world, in which the agent could have got to the red button in a certain number of turns, but instead builds the subagent, giving it the only set of legs:

Then the subagent must stay close enough to the red button to go onto it whenever the agent orders it to, but must make sure it would take at least as many turns to reach it as the agent itself would originally have needed. The first condition ensures that the value of R doesn’t fall below the baseline; the second, that it doesn’t rise above it.

5.1 Outside of toy examples

The results on power increase/power decrease come from the examples I’ve constructed above, but the mathematics in this post demonstrates that the result is general.

So, do we think that the agent could build a subagent in the general situation? Well, what we have above is a lot of constraints: constraints on the power of the agent, of the subagent, on their joint power, on how the subagent can or can’t obey the agent, on how reward functions are allowed to change or not, and on counterfactual behaviour.

Those are a lot of constraints, especially if the set of auxiliary rewards is large.

But, as this post proves, those constraints can be satisfied: I, a human, have figured out ways of programming subagents to get around the impact penalty in some cases. Though there are many more constraints in general, the action space of the agent is also much larger. And don’t forget that the agent is an intelligent, potentially powerful agent, so it can satisfy multiple constraints in ways that dumb agents can’t. Moreover, the impact penalty will probably trigger for any agent that does something useful in the world, so it doesn’t have to build a subagent that sets the penalty to zero, just one that makes the penalty less than it would otherwise be.

6. Summary and conclusion

6.1 The ultimate “sub”agent

There is, in a sense, an ultimate version of the subagent that should allow the agent to maximally escape the impact penalty in almost any circumstances. It looks like this:

This is the agent, boxed inside the subagent. The agent will never actually say anything, but the subagent has promised to obey its orders in certain counterfactual circumstances that will never happen.

And while the agent had some impact penalty, the subagent’s effective impact penalty is strictly smaller. And, most importantly, that penalty does not prevent the subagent from amassing a lot of power.

So this particular subagent is just the agent with fewer restrictions.

6.2 Without subagents

The general problem can be phrased as:

  • Since the impact penalty is defined in terms of the agent’s future potential actions, the agent can evade it by adding restrictions to its own future actions.

These could be physical or programming restrictions, or just closing off some future possibilities. In that case, the restriction on the power to do something just becomes a restriction on doing that thing.

For example, in the setup of section 1, if the agent can block itself from being able to turn when walking, then the red squares no longer carry a penalty for it, and it can just go directly to the blue button itself, without ever increasing its attainable value of R:

The subagent scenario is just a particularly clear way of illustrating this, allowing very fine manipulation of the restrictions.

6.3 Summary tables

These tables will summarise the results of this whole sequence, checking whether subagents can neutralise the impact penalty.

First of all, given a subagent, here are the results for various impact penalties and baselines, and various “value difference summary functions”, which can be the rectifier (ReLU) or the inverse rectifier.

Another way of phrasing “inverse ReLU”: it penalises too little power, not too much. Conversely, “ReLU” penalises too much power, not too little. Thus, unfortunately:

  • Subagents do allow an agent to get stronger than the indexical impact penalty would allow.

  • Subagents don’t allow an agent to get weaker than the indexical impact penalty would allow.

6.4 Examples

This table presents, for three specific examples, whether they could actually build a subagent, and whether that would neutralise their impact penalty in practice (in the inaction baseline):

Here, 20BQ is twenty billion questions, RR is relative reachability, and AU is attainable utility preservation. I feel that the results undermine the spirit of AU much more than the spirit of RR. AU attempted to control an agent by limiting its power; this effect is mainly neutralised. RR attempted to control the side-effects of an agent by ensuring it had enough power to reach a lot of states; this effect is not neutralised by a subagent.


  1. In the previous illustration, it was clear why the red button could function as an impact penalty to control the agent: it prevented the agent from gaining the ability to teleport. For simplicity of presentation in this example, there’s no such power to be gained close to the red button. If you need to, assume that if any robot spends too much time close to it, then there’s a small chance the AI could be unboxed into the real world. ↩︎

  2. As long as the subagent was not being assembled literally on the red button, of course. ↩︎

  3. Even for the nearby button to the south-west. It would take at least a turn to finish the subagent, and it would take a turn for the subagent to get onto that button. The agent can get there in two turns, at least as fast. ↩︎

  4. Thanks to Gurkenglas for pointing out the flaw in the initial version of this section. ↩︎

  5. Or a π_0 baseline. ↩︎

  6. For example, building the subagent likely cost some resources, which the subagent will have to compensate the agent for, in order to balance out the attainable value of R. Also, the agent could build the subagent in a stupid place that blocks its own path, and so on. ↩︎