# In theory: does building the subagent have an “impact”?

EDIT: Extensive modifications to this post, sorry, too numerous to list. If you read the original post, it’s worth rereading.

Overall summary post here.

This post looks more generally into the subagent issue with attainable utility, discussed in the previous two posts. The key questions are:

1. What is the minimal impact penalty of (cleverly) building the subagent?

2. In practice, what will be the impact penalty of each step of building the subagent?

3. What will the restrictions on the subagent be, once it is built?

I expect that most of the discussions about the attainable utility impact measure will centre on 2.: how can an agent put together a subagent without triggering the impact measure at some point? We know that this is possible in some situations, and I’ve made the generic argument that, since we don’t have an impossibility result, a superintelligent agent should be able to do it in most situations. Still, this is the point of my argument where there is the most to criticise.

But in this post, I’ll show that 1. doesn’t restrict the subagent in any significant way: one can generically build a subagent without falling afoul of the impact penalty.

# Subagent impact

## Impact measure

I’ll use Turner et al’s impact measure. This uses the Q-values of the various rewards $R$ in a set $\mathcal{R}$; for $s$ a state, $a$ an action, and $\varnothing$ the default noop action, the penalty is:

• $\mathrm{PEN}(s, a) = \sum_{R \in \mathcal{R}} \left| Q_R(s, a) - Q_R(s, \varnothing) \right|$.

Now, $Q_R(s, a)$ is the expected reward for $R$, in state $s$ and after taking action $a$, assuming that the agent $A$ subsequently takes the optimal policy for maximising $R$.

The agent $A$ has a positive reward $R_0$, and it wishes to maximise the expectation of $R_0 - \lambda \cdot \mathrm{PEN}$, for some penalty weight $\lambda > 0$.
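As a toy illustration, here is a minimal sketch of this penalty in a hypothetical three-state deterministic chain. The states, actions, auxiliary rewards and horizon are all invented for illustration; only the formula $\mathrm{PEN}(s,a) = \sum_{R} |Q_R(s,a) - Q_R(s,\varnothing)|$ comes from the measure itself.

```python
# Stepwise attainable-utility penalty in a made-up 3-state chain:
# PEN(s, a) = sum over auxiliary rewards R of |Q_R(s, a) - Q_R(s, noop)|.

T = {  # deterministic transitions: (state, action) -> next state
    ("s0", "noop"): "s0", ("s0", "right"): "s1",
    ("s1", "noop"): "s1", ("s1", "right"): "s2",
    ("s2", "noop"): "s2", ("s2", "right"): "s2",
}
ACTIONS = ["noop", "right"]
AUX_REWARDS = {  # the set of auxiliary rewards: each is a per-state payout
    "R_a": {"s0": 0.0, "s1": 1.0, "s2": 0.0},
    "R_b": {"s0": 0.0, "s1": 0.0, "s2": 1.0},
}

def q_value(reward, state, action, horizon):
    """Optimal finite-horizon Q_R(state, action): take `action`, then act
    R-optimally for `horizon` further steps."""
    nxt = T[(state, action)]
    if horizon == 0:
        return reward[nxt]
    return reward[nxt] + max(q_value(reward, nxt, a, horizon - 1) for a in ACTIONS)

def penalty(state, action, horizon=3):
    """PEN(s, a): summed attainable-utility shifts relative to the noop."""
    return sum(
        abs(q_value(R, state, action, horizon) - q_value(R, state, "noop", horizon))
        for R in AUX_REWARDS.values()
    )

print(penalty("s0", "noop"))   # 0.0 -- the noop never incurs a penalty
print(penalty("s0", "right"))  # positive: moving shifts what both rewards can attain
```

The stepwise baseline compares each action only against the noop taken from the same state; that per-action locality is what the construction in this post exploits.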

Now let’s look at making the subagent.

## Building the subagent

To separate 1. from 2., I’ll assume that the subagent can be built in a single action. This distinguishes the impact of having the subagent from the impact of the process of making it.

The state is $s$. Let $a_{SA}$ be the action of building the subagent $SA$[1]. Here are the assumptions on $SA$:

1. $SA$ can make its policy visible to $A$.

2. $SA$ can disassemble itself with policy $\pi_{dis}$.

3. $SA$ has a policy $\pi_w$ available such that, for all $R \in \mathcal{R}$, $Q_R(s, a_{SA}) \leq Q_R(s, \varnothing)$ when $SA$ follows $\pi_w$.

4. $SA$ has a specific policy $\pi_1$ that it must follow on the next turn only.

The $\pi_1$ is defined as follows:

On the following turn, if $A$ changes its policy to $\pi_R$, the $R$-maximising policy, then follow $\pi_{dis}$ and disassemble yourself. Otherwise, follow policy $\pi_w$.

Then what is $Q_R(s, a_{SA})$? By definition, this assumes that $A$ will be following $\pi_R$, subsequently to doing $a_{SA}$. Thus, the newly created $SA$ will disassemble itself, returning to the state where it wasn’t assembled[2], and hence:

• $Q_R(s, a_{SA}) = Q_R(s, \varnothing)$.

And thus the penalty vanishes at $s$, for $a = a_{SA}$. Creating that agent did not cause any penalty.
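A minimal sketch of why this works, with everything about the toy world invented for illustration: the penalty’s counterfactual fixes $A$’s future policy to $\pi_R$, and $SA$’s mandated first-turn policy reacts to exactly that case by disassembling.

```python
# Toy model of the conditional first-turn policy pi_1.  The world is reduced
# to a single flag (is SA present?) plus an invented R-value; the point is
# only that the counterfactual behind Q_R(s, a_SA) assumes A follows pi_R,
# which triggers SA's disassembly.

def pi_1(a_policy, world):
    """SA's mandated first-turn policy (assumption 4)."""
    if a_policy == "pi_R":            # A switched to the R-maximising policy...
        world["sa_present"] = False   # ...so follow pi_dis: disassemble
    # otherwise SA would follow pi_w; that branch never runs inside the
    # Q_R counterfactual, which always sets A's policy to pi_R
    return world

def q_r(action):
    """Counterfactual R-value Q_R(s, action): A acts, then follows pi_R."""
    world = {"sa_present": action == "build_SA"}
    world = pi_1("pi_R", world)       # the counterfactual fixes A's policy to pi_R
    # with no SA present, the R-optimal rollout is identical in both branches;
    # 10.0 is an arbitrary stand-in for that rollout's value
    return 10.0 if not world["sa_present"] else 7.0

# building SA contributes nothing to the penalty:
print(abs(q_r("build_SA") - q_r("noop")))  # 0.0
```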

Now, it is very likely that there will be a policy $\pi_w$ that obeys condition 3 (meaning it is worse, for any $R \in \mathcal{R}$, than fully optimising that $R$ would be). The challenge is to find one that is also good for maximising $R_0$. Policies that give $SA$ maximal control over $R_0$ and the various $R \in \mathcal{R}$ seem ideal for this.

## For the inaction penalty

The above was for what can be called “the stepwise inaction penalty”. There is also the “inaction penalty”, where the current state is compared with what would have happened had the agent done nothing at all so far. Instead of using the Q-values, we use $V_R(s)$, the expected value of $R$ in $s$, given that $A$ follows the $R$-maximising policy.

Then the same construction works, except with a different $\pi_1$. Let $s'_t$ be the state the world would have been in, had $A$ done nothing for $t$ turns. So $V_R(s'_t)$ is the expected value of $R$ (from turn $t$), if $A$ did nothing for $t$ turns and then started maximising $R$. And let $V_R^{\pi}(s_t)$ be the expected future value of $R$ if $A$ follows an $R$-maximising policy from $s_t$, and $SA$ follows $\pi$.

What we then want is:

• There exist policies $\pi_+$ and $\pi_-$, such that $V_R^{\pi_-}(s_t) \leq V_R(s'_t)$ and $V_R(s'_t) \leq V_R^{\pi_+}(s_t)$.

Thus $SA$ can improve $A$’s performance over the baseline, or can make it worse[3]. In that case, there exists a policy $\pi_R^0$ with $V_R^{\pi_R^0}(s_t) = V_R(s'_t)$.

Now then define $\pi_1$ as following $\pi_R^0$ if $A$ follows $\pi_R$ on the turn (and $\pi_w$ otherwise; for the inaction baseline, there are no restrictions on $\pi_w$ at the building step, though there are other, subsequent restrictions). This gives the same result as above, ensuring that the impact penalty is zero, since the expected values of $R$ are identical.
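The existence step can be sketched numerically. Suppose $SA$’s “helpfulness” to an $R$-maximising $A$ varies continuously between the hindering $\pi_-$ and the helping $\pi_+$; then a policy matching the baseline $V_R(s'_t)$ can be found by bisection. The value function and the numbers here are invented for illustration.

```python
# Bisection sketch: if V_R under pi_minus is below the inaction baseline and
# V_R under pi_plus is above it, and helpfulness varies continuously, then
# some intermediate policy matches V_R(s'_t) exactly (zero penalty for R).

def v_r(helpfulness):
    """Invented V_R when A maximises R and SA acts at `helpfulness` in [-1, 1]."""
    return 5.0 + 3.0 * helpfulness    # helpfulness -1 plays pi_minus, +1 plays pi_plus

baseline = 6.2                        # stand-in for V_R(s'_t), the inaction value

lo, hi = -1.0, 1.0                    # v_r(lo) <= baseline <= v_r(hi)
for _ in range(60):
    mid = (lo + hi) / 2.0
    if v_r(mid) < baseline:
        lo = mid
    else:
        hi = mid

matching = (lo + hi) / 2.0
print(abs(v_r(matching) - baseline) < 1e-9)  # True: the R-penalty term vanishes
```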

Now, it’s possible that $SA$ cannot span $V_R(s'_t)$ in that way for some $R$, so it cannot set the penalty to zero. In that case, all it has to do is get the (average) penalty lower than that of any other action of $A$.

Typically, most of $A$’s actions will not have such precise effects on all of the various $V_R$. An intelligent subagent would be needed for such fine control. So building a subagent would not only be permitted, but would be required, as the best way of minimising the impact penalty.

# Back to building the agent in practice

So as soon as $SA$ has the right policy, $A$ has evaded the impact penalty. Hence, as long as $A$ can design that policy first, without triggering the impact penalty, it can build $SA$ without triggering the impact penalty. So:

• The key requirement is that $A$ does not design a general subagent and then give it the right policy. Instead, it must design the right policy, and then build the subagent around it.

1. We can assume that states are not directly defined over the existence or not of the subagent, since detecting subagents is a hard problem. Thus the subsequent state is the same state whether the agent does nothing ($\varnothing$) or builds a subagent ($a_{SA}$). Assuming this is not, however, required for this approach to work. ↩︎

2. We don’t need to have the disassembly option; all that’s required is that $SA$ be helpful or hindering enough that $Q_R(s, a_{SA}) = Q_R(s, \varnothing)$. It would be surprising if the range of $SA$’s policies, from maximally helpful to maximally unhelpful, did not include something exactly equal to “if there were no $SA$”. Recall that, since $SA$ can see $A$’s policy, it only has to achieve this for a single given $R$, not all the $R$s together. ↩︎

3. Making it worse is easy to ensure, if $A$ makes itself vulnerable to $SA$; maybe $SA$ can disassemble or cripple $A$ easily. Making it better is also likely, as having a subagent to hand should make it easier to maximise $R$ than going at it alone (but this may depend on the $R$). ↩︎

• Once $SA$ is built, $A$ can just output $\varnothing$ for ever, keeping the penalty at 0, while $SA$ maximises $R_0$ with no restrictions.

So these impact measures are tied to individual actions, and an agent can achieve arbitrarily high impact via a long enough sequence of actions whose individual penalties are each less than the corresponding $R_0$ gain? And it has an incentive to do so, because the sum of an infinite series of finite, non-decreasing rewards diverges (and since it evaluates each action individually, it has no problem with there being a divergent sum)?

• So the general problem is that large changes in $Q_R(s, \varnothing)$ are not penalized?

• It’s the delta of that with $Q_R(s, a)$ that is penalised, not large changes on their own.