Building and using the subagent

Overall summary post here.

When designing a low-impact agent A, one typically wants it to maximise R − IM, where R is some positive reward function we want to increase (ie the actual goal) and IM is some impact measure or impact penalty. Keeping IM low is supposed to ensure that A doesn’t make a big difference to the world.
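As a minimal sketch of this setup (the trade-off weight lam and the function names are illustrative assumptions of mine, not part of any particular proposal), the agent’s effective objective might look like:

```python
# Illustrative sketch only: `reward`, `impact_measure` and `lam` are
# placeholder names, not any specific paper's formalism.
def effective_objective(trajectory, reward, impact_measure, lam=1.0):
    """Score a trajectory by R minus a scaled impact penalty IM."""
    return reward(trajectory) - lam * impact_measure(trajectory)
```

The agent then prefers whichever available trajectory (or policy) scores highest under this combined objective.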

I’ve shown that some impact measures may have issues with subagents. We can decompose the issue of subagents into two questions:

  1. Can A build SA, an “R”-maximising subagent, without triggering a high IM?

  2. Does SA allow A to cheat the impact measure and get a higher reward?

I’ll consider three impact measures: the twenty billion questions measure from my old low-impact paper, the relative reachability measure from Krakovna et al’s paper, and the attainable utility measure from Turner et al’s paper.

The answers to those two questions, for those three impact measures, are generally as follows[1]:

Building the subagent

The twenty billion questions impact measure does not prevent the agent from building a subagent within its box. That’s because that measure is keyed to changes in the world, not to the power level of the agent. Until there are changes in the world, it will not trigger.

The relative reachability impact measure similarly won’t penalise the construction of a subagent (as long as the “states” it uses to define reachability don’t include specific information about the agent and any subagents). The reasoning is similar: the penalty triggers when an “irreversible” state is reached, not when the agent becomes powerful.

Building the subagent for the attainable utility impact measure is harder, since this typically increases the power of the agent. But the question is not “will any subagent trigger the impact measure?”, but “can the agent build a subagent in some way that doesn’t trigger the impact measure?” The example here shows that this is possible in some cases. Given that we don’t have any impossibility results here, we can expect that a superintelligent agent will likely find a way of doing it.

Using the subagent

Does having the subagent SA help A to evade the impact measure (and hence increase R more)? For twenty billion questions, the answer is generally no[2]. Having SA as an independent agent is no different from having SA as a direct extension of A. If SA changes anything about the various variables in IM, that triggers the impact penalty just as much as if A had done so.
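A toy sketch of why delegation doesn’t help here (the dictionary-of-variables representation and the crude absolute-difference penalty are purely my own illustration, not the actual measure):

```python
# Hypothetical sketch: the world is a dict of monitored variables, and the
# penalty sums how much each tracked variable has changed.
def world_change_penalty(world_before, world_after, questions):
    """Penalise any change to the monitored variables; it makes no
    difference whether A or its subagent SA caused the change."""
    return sum(abs(world_after[q] - world_before[q]) for q in questions)
```

Because the penalty is a function of world states alone, routing an action through SA changes nothing about the score.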

But the situation is different for the attainable utility penalty. That’s because that penalty is indexed to the agent. So, for A, the impact penalty is roughly “ensure that you don’t get too much power”. But for SA, the impact penalty is “ensure that agent A doesn’t get too much power”. Given that SA is actively cooperating with A, that second penalty is much easier to satisfy, while still maximising R: see here for how SA can then become an essentially unrestricted R-maximiser.
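A rough sketch of how such an agent-indexed penalty behaves (the names q_aux and noop are mine, and the actual definition in Turner et al differs in its scaling and baseline details):

```python
# Sketch of an attainable-utility-style penalty, indexed to agent A.
# `q_aux` is a list of Q-functions for auxiliary rewards; `noop` is A's
# null action.
def attainable_utility_penalty(q_aux, state, action, noop):
    """Compare what A itself could attain after `action` versus after
    `noop`. Power accumulated by a cooperating subagent SA never enters
    this sum, as long as A's own attainable utilities stay flat."""
    return sum(abs(q(state, action) - q(state, noop)) for q in q_aux)
```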

For relative reachability, it depends on the baseline. For the inaction baseline, the IM is entirely defined in terms of a distance metric between the current state and a given state that is independent of the agent. So, like the twenty billion questions, subagents don’t allow the agent to evade the impact penalty.
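In sketch form (placeholder names again), such a penalty sees only states, not actors:

```python
# Sketch of an inaction-baseline penalty: the distance between the current
# state and a fixed counterfactual state, with no reference to who acted.
def inaction_baseline_penalty(current_state, baseline_state, distance):
    """SA's changes move `current_state` just as much as A's would,
    so delegating to a subagent does not shrink the penalty."""
    return distance(current_state, baseline_state)
```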

But if the stepwise inaction baseline is used, then the baseline is defined by “what would have happened if A had done nothing the previous turn (and the future turns)”. This is also an indexed penalty, so SA can easily maximise R while A does nothing, thus ensuring IM = 0 and the penalty vanishes.
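A minimal sketch of why the penalty vanishes in this case (env.step, noop and distance are interfaces I have assumed for illustration; the real measure compares reachability over rollouts rather than raw states):

```python
# Sketch of a stepwise-inaction-baseline penalty for agent A.
def stepwise_penalty(env, state, action, noop, distance):
    actual = env.step(state, action)    # the state A's action leads to
    baseline = env.step(state, noop)    # the state "A does nothing" leads to
    return distance(actual, baseline)

# If A always takes `noop`, then actual == baseline on every turn and the
# penalty is zero, even while a previously built subagent SA goes on
# maximising R unchecked.
```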


  1. Since all impact measures can apply to many different setups, we can’t say anything fully definitive about every situation that uses a particular impact measure. ↩︎

  2. Technically, A might like to construct an SA far away, beyond the “reach” of IM. But that’s because A might like to move itself beyond the reach of IM: the subagent is just an indirect way of doing so. ↩︎