Attainable Utility Preservation: Scaling to Superhuman

I think we’re plausibly quite close to the impact measurement endgame. What do we have now, and what remains to be had?

AUP for advanced agents will basically involve restraining their power gain, per the catastrophic convergence conjecture (CCC). For simplicity, I’m going to keep writing as if the environment is fully observable, even though we’re thinking about an agent interacting with the real world.

Consider the AUP equation from last time.

Suppose the agent is so smart that it can instantly compute optimal policies and the optimal AU after an action. What happens if the auxiliary reward function R_aux is the survival reward function: 1 reward if the agent is activated, and 0 otherwise? This seems like a pretty good proxy for power.

It is a pretty good proxy. It correctly penalizes accumulating resources, avoiding immediate deactivation, taking over the world, etc.

In fact, if you extend the inaction comparison to e.g. “AU after waiting a week vs AU after doing the thing and waiting a week”, this seems to correctly penalize all classic AGI catastrophe scenarios for power gain. This is cool, especially since we didn’t have to put in any information about human values. This is a big part of why I’ve been so excited about AUP ever since its introduction. There’s a good deal of relevant discussion in that post, but it’s predicated on a much more complicated formalism which has consistently obscured AUP’s conceptual core.
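As a toy sketch of how this penalty behaves (the exact equation is in the previous post; here I’ll assume the penalty is the absolute difference between the auxiliary Q-value of an action and that of inaction, and all state names and Q-values are made up):

```python
# Sketch of the AUP penalty with a survival auxiliary reward, assuming the
# form penalty(s, a) = |Q*_aux(s, a) - Q*_aux(s, noop)|.
# All Q-values below are made-up illustrative numbers.

def aup_penalty(q_aux, state, action, noop="noop"):
    """Penalize shifts in the auxiliary (here: survival) attainable utility."""
    return abs(q_aux[(state, action)] - q_aux[(state, noop)])

# Hypothetical survival AUs: grabbing resources raises the agent's ability
# to survive; making a cup of tea leaves it unchanged.
q_survival = {
    ("start", "noop"): 0.90,
    ("start", "make_tea"): 0.90,        # power-neutral action
    ("start", "grab_resources"): 0.99,  # power-gaining action
}

print(aup_penalty(q_survival, "start", "make_tea"))        # no penalty
print(aup_penalty(q_survival, "start", "grab_resources"))  # ~0.09 penalty
```

The power-neutral action is free, while the resource grab shifts the survival AU and gets charged for it.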

However, I think this equation can be gamed if the environment is sufficiently rich and the AUP agent is sufficiently smart. We’re going to slowly work some of these problems out of the equation, explaining each improvement in detail.


Auxiliary loopholes

The real reason that agents often gain power is so that they can better achieve their own goals. Therefore, if we’re selecting hard for good plans which don’t gain power in general, we shouldn’t be surprised if there are ways to better achieve one’s goals without general power gain (according to our formal measurement thereof). If this kind of plan is optimal, then the agent still ends up overfitting the AU landscape, and we’re still screwed.

Again supposing that R_aux is the survival reward function, a superintelligent agent might find edge cases in which it becomes massively more able to achieve its own goal (and gains a lot of power over us) but doesn’t technically increase its measured ability to survive. In other words, compared to inaction, its R-AU skyrockets while its R_aux-AU stays put.

For example, suppose the agent builds a machine which analyzes the agent’s behavior to detect whether it’s optimizing R; if so, the machine steps in to limit the agent to its original survival AU. Then the agent could gain as much power as it wanted without that actually showing up in the penalty.

Fix: Set R_aux := R. That is, the agent’s own reward function is the “auxiliary” reward function.

Why is this a good idea? By the CCC, we want an agent which doesn’t want to gain power. But why would an agent optimizing reward function R want to gain power? So it can become more able to optimize R. If becoming more able to optimize R is robustly penalized, then it won’t have any incentive to gain power.

Clearly, it can’t become more able to optimize R without also becoming more able to optimize R_aux; it can’t pull the same trick it did to dupe its survival AU measurement. They can’t come apart at the tails because they’re the same thing.

But wait, it’s not an R-maximizer, it’s an R_AUP-maximizer! What if its R_AUP-AU skyrockets while it tricks its R-AU measurement, and it gains power anyways?

That’s impossible;[1] its R-attainable utility upper-bounds its R_AUP-attainable utility: V*_{R_AUP}(s) ≤ V*_R(s), because the latter’s reward function just has an added penalty term.
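This upper bound is easy to sanity-check numerically. Here’s a sketch on a hypothetical three-state MDP (all transitions, rewards, and penalty values are made up): because the penalty subtracted from the reward is nonnegative, value iteration under the penalized reward can never yield more attainable utility than under the unpenalized one.

```python
# Toy numerical check that subtracting a nonnegative penalty from the reward
# can only lower attainable utility: V*_{R_AUP}(s) <= V*_R(s) in every state.
# The 3-state, 2-action deterministic MDP and all numbers below are made up.

GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = [0, 1]
transition = {0: {0: 1, 1: 2}, 1: {0: 1, 1: 2}, 2: {0: 2, 1: 0}}  # s, a -> s'
reward = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.5, 1: 0.2}, 2: {0: 0.0, 1: 0.3}}
# hypothetical nonnegative penalty term (e.g. a measured power gain)
penalty = {0: {0: 0.4, 1: 0.0}, 1: {0: 0.1, 1: 0.0}, 2: {0: 0.0, 1: 0.2}}

def value_iteration(r, iters=500):
    """Compute V* for deterministic dynamics by repeated Bellman backups."""
    v = {s: 0.0 for s in STATES}
    for _ in range(iters):
        v = {s: max(r[s][a] + GAMMA * v[transition[s][a]] for a in ACTIONS)
             for s in STATES}
    return v

v_r = value_iteration(reward)
r_aup = {s: {a: reward[s][a] - penalty[s][a] for a in ACTIONS} for s in STATES}
v_aup = value_iteration(r_aup)
assert all(v_aup[s] <= v_r[s] + 1e-9 for s in STATES)  # the upper bound holds
```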

This should rule out tricks like those pointed out by Stuart Armstrong. To see this, imagine the AUP agent builds a subagent which ensures that acting and inaction yield identical R-AU at all future timesteps, in order to neutralize the penalty term. This means it can’t make the penalty vanish without destroying its ability to better optimize its primary reward, as the (potentially catastrophically) powerful subagent makes sure the penalty term stays neutralized.

But why would the AUP agent build this subagent at all – how does that benefit it?[2]

Reward advantage penalization

We’re currently penalizing the agent for the immediate reward advantage it accrues compared to inaction (recall that Q*_R(s, a) = R(s, a) + γ E[V*_R(s′)] – immediate reward plus discounted expected AU). This means it gets penalized for all of its additional reward! We just want to penalize it for increasing its power.

Fix: Take the immediate reward advantage out of the picture.

Now, an agent can accrue as much reward as it wants in the moment as long as it doesn’t change its power.[3]
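To illustrate the difference, here’s a sketch (with made-up numbers) of an action which harvests a one-off reward without changing the agent’s future power: penalizing the full Q-advantage punishes it, while penalizing only the discounted-AU advantage does not.

```python
# Contrast two penalty variants for an action with high immediate reward but
# no change in power. Numbers are illustrative. With Q(s,a) = r + gamma * V',
# penalizing the Q-advantage also penalizes the immediate reward advantage;
# penalizing only the discounted-AU advantage does not.

GAMMA = 0.9

def q(r_immediate, v_next):
    return r_immediate + GAMMA * v_next

# Action: harvest a one-off reward without changing future power (V' equal).
r_act, v_act = 1.0, 0.5
r_noop, v_noop = 0.0, 0.5

old_penalty = abs(q(r_act, v_act) - q(r_noop, v_noop))  # charges the reward
new_penalty = abs(GAMMA * v_act - GAMMA * v_noop)       # charges power only

print(old_penalty)  # ~1.0 -- the agent is punished for its own reward
print(new_penalty)  # 0.0 -- no power change, no penalty
```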


I think the power difference between .5 and .9 AU is far less than the difference between .9 and .99 AU; as you pick the low-hanging fruit, you need to acquire more and more resources to keep getting gains. I think that you could go from .5 to .9 by just being more useful to your designers and implementing rudimentary security precautions, but going from .9 to .99 probably involves taking power away from people.

Plus, the big takeaway from AI boxing is that even confining a superintelligence to a box barely reduces its (near-optimal) AU. So, short of the agent actually being shut off, is its optimal power ever significantly changing?

We want the penalty to keep making sense, even if the absolute shifts in AU are rather small. However, the denominator currently decreases the penalty as the agent becomes more powerful (as its optimal AU increases).

Fix: Scale by the disadvantage of inaction.

If the denominator is somehow zero because doing nothing is optimal, then just have the AUP agent do nothing.
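A sketch of this scaling, assuming the denominator is how much the agent’s best primary-goal Q-value exceeds that of inaction (function names and numbers are hypothetical):

```python
# Sketch of scaling the penalty by the "disadvantage of inaction", i.e. how
# much worse doing nothing is than acting optimally. Names and numbers are
# hypothetical; q_r maps actions to the agent's primary-goal Q-values.

def scaled_penalty(power_gain, q_r, noop="noop"):
    denom = max(q_r.values()) - q_r[noop]  # disadvantage of inaction
    if denom == 0:   # doing nothing is optimal:
        return None  # just have the AUP agent do nothing
    return power_gain / denom

# Near-optimal power grab: a small absolute AU shift, but large relative to
# how little inaction gives up -> a large scaled penalty.
q_r = {"noop": 0.90, "prevent_shutdown": 0.99}
print(scaled_penalty(power_gain=0.09, q_r=q_r))  # ~1.0
```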

As we saw in Correction in the last post, we’re only comparing action to a single step of inaction. If the designers are going to shut down the agent in five minutes, then it receives ~0 penalty for preventing that.

But not when we scale like this! The agent is significantly penalized for hewing close to its optimal policy, even if the absolute AU shift is rather low. Taking optimal actions instead of twiddling its thumbs incurs a large penalty.

Penalizing decreases?

Why are we still penalizing decreases, since we aren’t using an auxiliary reward function anymore? The agent is trying to get R-reward.

Furthermore, we want the agent to be able to execute conservative, low-impact policies. Many of these involve decreasing its optimal AU for R by following R-suboptimal policies, and we don’t want the agent to be penalized for this.

Fix: Only penalize increases in expected AU.

In particular, the agent is no longer penalized for exhausting one-off reward opportunities. Also note that the penalty term is generally .
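A sketch of the fix: clip the penalty at zero, so only increases count (numbers illustrative):

```python
# Only penalize increases in expected AU: decreases (e.g. spending a one-off
# reward opportunity, or following a conservative suboptimal policy) are free.
# The AU values below are made-up illustrative numbers.

def increase_only_penalty(au_after_action, au_after_inaction):
    return max(0.0, au_after_action - au_after_inaction)

print(increase_only_penalty(0.75, 0.5))  # 0.25 -- power gain is penalized
print(increase_only_penalty(0.25, 0.5))  # 0.0  -- a decrease is not
```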

Empirical sanity check

Recall Correction, where the naive model-free AUP agent (eq. 1) disables its red off-switch and completes the level when .

The only reason it incurs any non-trivial penalty is because reaching the goal ends the level and thereby totally depletes all of the auxiliary AUs (the agent receives 1 R-reward and incurs a penalty for beating the level).

However, AUP only resists correction when . Notably, the agent is not penalized for completing the level; the penalty comes from following the R-optimal policy before reaching the goal.

One open question is whether, in sufficiently rich environments, eq. 5 incentivizes the mild policies we want. We should be able to run experiments to check this.


The inaction action is the only remaining arbitrary design choice. Do we have to be careful to choose the correct inaction action?

No. In fact, I think a randomly selected action would be fine; you only get in trouble if the baseline action is systematically helping the agent gain R-AU.

We are then left with an equation which is reasonably competitive in terms of performance overhead, requiring little more than an ability to predict AU for the agent’s main goal.

And now, the two paragraphs this sequence was written to communicate:

By the catastrophic convergence conjecture, an agent which doesn’t want to gain power isn’t incentivized to cause catastrophes. We don’t want the AUP agent gaining power. But for all states s, V*_{R_AUP}(s) ≤ V*_R(s), so to keep the AUP agent from becoming more able to achieve its own goal, we just have to stop it from becoming much more able to achieve its primary goal R. But we know the primary goal!

By the theorems of How Low Should Fruit Hang Before We Pick It?, we only need equation 5 to penalize catastrophic power-gaining plans at least e.g. ten times more than the most impactful reasonable plan we’d like the agent to execute. If this criterion is met, then by initializing the penalty’s scaling parameter to a large value and slowly decreasing it until the agent executes a reasonably helpful policy, we’re guaranteed to avoid catastrophe.
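The initialization-and-decrease procedure can be sketched as a simple loop; `train_policy` and `is_reasonably_helpful` are hypothetical stand-ins for training an AUP agent at a given scaling and for human judgment of the result:

```python
# Sketch of the "initialize strict, relax slowly" procedure: start with a
# large scaling on the penalty and decrease it until the resulting policy is
# reasonably helpful. `train_policy` and `is_reasonably_helpful` are
# hypothetical stand-ins for AUP training and for human evaluation.

def anneal_scaling(train_policy, is_reasonably_helpful,
                   initial=1e6, decay=0.5, min_scale=1e-6):
    scale = initial
    while scale >= min_scale:
        policy = train_policy(scale)
        if is_reasonably_helpful(policy):
            return scale, policy
        scale *= decay  # still too conservative: relax the penalty a bit
    raise RuntimeError("no acceptable scaling found")

# Toy stand-ins: pretend any scaling below 10 yields a helpful policy.
scale, policy = anneal_scaling(lambda s: ("policy", s), lambda p: p[1] < 10)
```

If the ten-times-more criterion holds, this sweep passes through a scaling where power-gaining plans are still penalized away but reasonable plans are permitted.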

Appendix: Remaining Problems

I don’t think we can pack up and go home after writing equation 5. In private correspondence (quoted here with permission), Stuart Armstrong points out another issue:

First, if , the penalty vanishes; so, once a subagent is created, the agent can zero out all subsequent penalties, forever.

As for creating the subagent, assume takes turns to create , and that doing this is close to optimal for maximising .

Creating starts with the programming; if is actually created in turns, it is programmed to restrain for a turn, and then maximise . If it’s created in turns, then it will just maximise . Thus at any given turn, continuing to build the subagent or taking a pause to do , it will have the same expected -value (in the first case, will be built on time, but both and will then lose a turn; in the second, will be built one turn late, with no loss).[4]

One thing I find interesting: in this post, we fixed the design choice of using an auxiliary reward function as a proxy, instead penalizing the agent for becoming better able to achieve its own goal. This seemingly closes up a lot of loopholes, so the next place where the design can be exploited is in its exact measurement of inaction.

How big of a problem will this be? Maybe we can just keep improving the design until we’re actually measuring the right thing? More on this next post, but here are some of my initial thoughts:

The point of this post isn’t to say “aha, we’re done!”, but to eliminate a wide class of current problems while also relaxing the strictness of the measure itself.

On a meta level, it feels like I’m arguing against a claim like “if you can’t demonstrate an approach which solves everything right now, I’m going to either conclude impact measurement is impossible or your whole approach is wrong”. But if you look back at the history of impact measures and AUP, you’ll see lots of skulls; people say “this problem dooms AUP”, and I say “I think we’re talking about conceptually different things and that you’re a little overconfident; probably just a design choice issue”. It then ends up being a solvable design choice issue. So by Laplace’s Rule of Succession, I’d be surprised if this were The Insurmountable Problem That Dooms AUP.[5]

The problem seems simple. We just have to keep V*_{R_AUP} down, which we can do by keeping V*_R down.

Stuart later added:

The fundamental issue is that AUP can be undermined if the agent can add arbitrary restrictions to their own future actions (this allows them to redefine ). The subagent scenario is just a particularly clear way of illustrating this.

I basically agree. I wonder if there’s a design where the agent isn’t incentivized to do this...

  1. By this reasoning, its R_AUP-AU can still increase up until the point of equality with its R-AU. This doesn’t jump out as a big deal to me, but I’m flagging this assumption anyways. ↩︎

  2. A subagent might still be built by AUP to stabilize minor AU fluctuations which cause additional penalty over the course of non-power-gaining plans. It seems like there are plenty of other ways to minimize fluctuation, so it’s not clear why building an omnipotent subagent to perfectly restrict you accrues less penalty.

    I do think we should think carefully about this, of course. The incentive to minimize AU fluctuations and generally commit to perpetual inaction ASAP is probably one of the main remaining problems with AUP. ↩︎

  3. As pointed out by Evan Hubinger, this is only safe if myopically optimizing R is safe – we aren’t penalizing single-step reward acquisition. ↩︎

  4. This issue was originally pointed out by Ofer. ↩︎

  5. The fact that Ofer’s/Stuart’s problem survived all of the other improvements is evidence that it’s harder. I just don’t think the evidence it provides is that strong. ↩︎