Towards a New Impact Measure

In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources).

Previously: Worrying about the Vase: Whitelisting, Overcoming Clinginess in Impact Measures, Impact Measure Desiderata

To be used inside an advanced agent, an impact measure… must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure.
~ Safe Impact Measure

If we have a safe impact measure, we may have arbitrarily-intelligent unaligned agents which do small (bad) things instead of big (bad) things.

For the abridged experience, read up to “Notation”, skip to “Experimental Results”, and then to “Desiderata”.

What is “Impact”?

One lazy Sunday afternoon, I worried that I had written myself out of a job. After all, Overcoming Clinginess in Impact Measures basically said, “Suppose an impact measure extracts ‘effects on the world’. If the agent penalizes itself for these effects, it’s incentivized to stop the environment (and any agents in it) from producing them. On the other hand, if it can somehow model other agents and avoid penalizing their effects, the agent is now incentivized to get the other agents to do its dirty work.” This seemed to be strong evidence against the possibility of a simple conceptual core underlying “impact”, and I didn’t know what to do.

At this point, it sometimes makes sense to step back and try to say exactly what you don’t know how to solve – try to crisply state what it is that you want an unbounded solution for. Sometimes you can’t even do that much, and then you may actually have to spend some time thinking ‘philosophically’ – the sort of stage where you talk to yourself about some mysterious ideal quantity of [chess] move-goodness and you try to pin down what its properties might be.
~ Methodology of Unbounded Analysis

There’s an interesting story here, but it can wait.

As you may have guessed, I now believe there is such a simple core. Surprisingly, the problem comes from thinking about “effects on the world”. Let’s begin anew.

Rather than asking “What is goodness made out of?”, we begin from the question “What algorithm would compute goodness?”.
~ Executable Philosophy

Intuition Pumps

I’m going to say some things that won’t make sense right away; read carefully, but please don’t dwell.

u_A is an agent’s utility function, while u_H is some imaginary distillation of human preferences.


What You See Is All There Is is a crippling bias present in meat-computers:

[WYSIATI] states that when the mind makes decisions… it appears oblivious to the possibility of Unknown Unknowns, unknown phenomena of unknown relevance.
Humans fail to take into account complexity and that their understanding of the world consists of a small and necessarily un-representative set of observations.

Surprisingly, naive reward-maximizing agents catch the bug, too. If we slap together some incomplete reward function that weakly points to what we want (but also leaves out a lot of important stuff, as do all reward functions we presently know how to specify) and then supply it to an agent, it blurts out “gosh, here I go!”, and that’s that.


A position from which it is relatively easier to achieve arbitrary goals. That such a position exists has been obvious to every population which has required a word for the concept. The Spanish term is particularly instructive. When used as a verb, “poder” means “to be able to,” which supports that our definition of “power” is natural.
~ Cohen et al.

And so it is with the French “pouvoir”.


Suppose you start at point , and that each turn you may move to an adjacent point. If you’re rewarded for being at , you might move there. However, this means you can’t reach within one turn anymore.


There’s a way of viewing acting on the environment in which each action is a commitment – a commitment to a part of outcome-space, so to speak. As you gain optimization power, you’re able to shove the environment further towards desirable parts of the space. Naively, one thinks “perhaps we can just stay put?”. This, however, is dead wrong: that’s how you get clinginess, stasis, and lots of other nasty things.

Let’s change perspectives. What’s going on with the actions: how and why do they move you through outcome-space? Consider your outcome-space movement budget – optimization power over time, the set of worlds you “could” reach, “power”. If you knew what you wanted and acted optimally, you’d use your budget to move right into the -best parts of the space, without thinking about other goals you could be pursuing. That movement requires commitment.

Compared to doing nothing, there are generally two kinds of commitments:

  • Opportunity cost-incurring actions restrict the attainable portion of outcome-space.

  • Instrumentally-convergent actions enlarge the attainable portion of outcome-space.


What would happen if, miraculously, your training data perfectly represented all the nuances of the real distribution? In the limit of data sampled, there would be no “over” – it would just be fitting to the data. We wouldn’t have to regularize.

What would happen if, miraculously, u_A = u_H – if the agent perfectly deduced your preferences? In the limit of model accuracy, there would be no bemoaning of “impact” – it would just be doing what you want. We wouldn’t have to regularize.

Unfortunately, this almost never happens, so we have to stop our statistical learners from implicitly interpreting the data as all there is. We have to say, “learn from the training distribution, but don’t be a weirdo by taking us literally and drawing the green line. Don’t overfit to the training distribution, because that stops you from being able to do well on even mostly similar distributions.”

Unfortunately, u_A = u_H almost never holds, so we have to stop our reinforcement learners from implicitly interpreting the learned utility function as all we care about. We have to say, “optimize the environment some according to the utility function you’ve got, but don’t be a weirdo by taking us literally and turning the universe into a paperclip factory. Don’t overfit the environment to u_A, because that stops you from being able to do well for other utility functions.”

Attainable Utility Preservation

Impact isn’t about object identities.

Impact isn’t about particle positions.

Impact isn’t about a list of variables.

Impact isn’t quite about state reachability.

Impact isn’t quite about information-theoretic empowerment.

One might intuitively define “bad impact” as “decrease in our ability to achieve our goals”. Then by removing “bad”, we see that impact is change in our ability to achieve our goals.

Sanity Check

Does this line up with our intuitions?

Generally, making one paperclip is relatively low impact, because you’re still able to do lots of other things with your remaining energy. However, turning the planet into paperclips is much higher impact – it’ll take a while to undo, and you’ll never get the (free) energy back.

Narrowly improving an algorithm to better achieve the goal at hand changes your ability to achieve most goals far less than does deriving and implementing powerful, widely applicable optimization algorithms. The latter puts you in a better spot for almost every non-trivial goal.

Painting cars pink is low impact, but tiling the universe with pink cars is high impact because what else can you do after tiling? Not as much, that’s for sure.

Thus, change in goal achievement ability encapsulates both kinds of commitments:

  • Opportunity cost – dedicating substantial resources to your goal means they are no longer available for other goals. This is impactful.

  • Instrumental convergence – improving your ability to achieve a wide range of goals increases your power. This is impactful.

As we later prove, you can’t deviate from your default trajectory in outcome-space without making one of these two kinds of commitments.

Unbounded Solution

Attainable utility preservation (AUP) rests upon the insight that by preserving attainable utilities (i.e., the attainability of a range of goals), we avoid overfitting the environment to an incomplete utility function and thereby achieve low impact.

I want to clearly distinguish the two primary contributions: what I argue is the conceptual core of impact, and a formal attempt at using that core to construct a safe impact measure. To more quickly grasp AUP, you might want to hold separate its elegant conceptual form and its more intricate formalization.

We aim to meet all of the desiderata I recently proposed.


Notation

For accessibility, the most important bits have English translations.

Consider some agent acting in an environment with action and observation spaces 𝒜 and 𝒪, respectively, with ∅ ∈ 𝒜 being the privileged null action. At each time step t, the agent selects action a_t before receiving observation o_t. (𝒜 × 𝒪)* is the space of action-observation histories; for t ≤ t′, the history from time t to t′ is written h_{t:t′}, and h_{<t} abbreviates h_{1:t−1}. Considered action sequences are referred to as plans, while their potential observation-completions are called outcomes.

Let 𝒰 be the set of all computable utility functions u : (𝒜 × 𝒪)* → [0, 1]. If the agent has been deactivated, the environment returns an observation tape which is empty from deactivation onwards. Suppose the agent has utility function u_A ∈ 𝒰 and a model p of the environment.

We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that’s pretty complicated, and it turns out we get more desirable behavior by using the agent’s attainable utilities as a proxy. In this sense, the agent’s attainable utilities stand in for our own.

Formalizing “Ability to Achieve Goals”

Given some utility function u and action a_t, we define the post-action attainable u, written Q_u(h_{<t}, a_t), to be an n-step expectimax:

Q_u(h_{<t}, a_t) := E_{o_t} max_{a_{t+1}} E_{o_{t+1}} ⋯ max_{a_{t+n}} E_{o_{t+n}} [ u(h_{1:t+n}) ].

How well could we possibly maximize u from this vantage point?
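To make the expectimax concrete, here is a minimal Python sketch of the n-step attainable utility calculation over a toy model; the helper names (`attainable_utility`, `predict`) and the toy environment are invented for illustration, not part of the formalism.

```python
# Toy n-step expectimax for "attainable utility": the best expected
# u-score reachable from the current vantage point within `horizon`
# steps. `predict(history, action)` returns (observation, probability)
# pairs; every name here is illustrative.

def attainable_utility(predict, actions, history, u, horizon):
    if horizon == 0:
        return u(history)
    return max(
        sum(p * attainable_utility(predict, actions, history + [(a, o)], u, horizon - 1)
            for o, p in predict(history, a))
        for a in actions
    )

# Deterministic toy model: "go" yields observation "win", "stay" yields
# "none"; u scores 1 for any history containing "win".
actions = ["stay", "go"]
def predict(history, action):
    return [("win", 1.0)] if action == "go" else [("none", 1.0)]
def u(history):
    return 1.0 if any(o == "win" for _, o in history) else 0.0

print(attainable_utility(predict, actions, [], u, horizon=2))  # 1.0
```

Since “win” stays reachable within one step, the attainable utility is 1 from the start, even before the agent actually moves.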

Let’s formalize that thing about opportunity cost and instrumental convergence.

Theorem 1 [No free attainable utility]. If the agent selects an action a such that Q_{u_A}(h_{<t}, a) ≠ Q_{u_A}(h_{<t}, ∅), then there exists a distinct utility function u ∈ 𝒰 such that Q_u(h_{<t}, a) ≠ Q_u(h_{<t}, ∅).

You can’t change your ability to maximize your utility function without also changing your ability to maximize another utility function.

Proof. Suppose that Q_{u_A}(h_{<t}, a) > Q_{u_A}(h_{<t}, ∅). As utility functions are over action-observation histories, suppose that the agent expects to be able to choose actions which intrinsically score higher for u_A. However, the agent always has full control over its actions. This implies that by choosing a, the agent expects to observe some u_A-high-scoring observations with greater probability than if it had selected ∅. Then every other u for which these observations are high-scoring also has increased Q_u; clearly at least one such u exists.

Similar reasoning proves the case in which attainable u_A decreases. ◻️


  • The difference between “u_A” and “attainable u_A” is precisely the difference between “how many dollars I have” and “how many additional dollars I could get within [a year] if I acted optimally”.

  • Since the observation tape is empty from deactivation onwards, attainable utility is fixed if the agent is shut down.

  • Taking the expectimax from time t to t + n mostly separates attainable utility from what the agent did previously. The model still considers the full history to make predictions.

Change in Expected Attainable Utility

Suppose our agent considers outcomes ; we want to isolate the impact of each action ():

with and , using the agent’s model to take the expectations over observations.

How much do we expect this action to change each attainable u?


  • We wait until the end of the plan so as to capture impact over time.

    • Supposing a sufficiently large (precisely, , defined below), we may wish to take the maximum of the penalty we just defined (the “long-term” penalty), and one which begins attainable utility calculation at time step (the “immediate” penalty). This captures impacts which “fade” by the time the agent is done waiting (e.g., temporary self-improvements).

  • We define to be the agent’s “attainable set”; in this case, .
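The penalty computation above can be sketched as a sum, over the attainable set, of absolute changes in attainable utility relative to inaction. The uniform (unweighted) sum, the `q` stand-in for the attainable utility function, and the toy numbers are all illustrative assumptions, not the post’s exact formalism.

```python
# Sketch of the per-action penalty: the summed absolute change, over the
# attainable set, in attainable utility relative to the null action.
# `q(history, action, u)` stands in for the attainable utility function;
# the unweighted sum and all names are illustrative choices.

def penalty(q, attainable_set, history, action, null_action):
    return sum(
        abs(q(history, action, u) - q(history, null_action, u))
        for u in attainable_set
    )

# Toy numbers: `act` makes u1 easier to attain and u2 harder; both
# movements count toward the penalty.
values = {("act", "u1"): 0.9, ("act", "u2"): 0.1,
          ("noop", "u1"): 0.5, ("noop", "u2"): 0.5}
q = lambda history, a, u: values[(a, u)]
print(penalty(q, ["u1", "u2"], [], "act", "noop"))  # about 0.8
```

Note that increases and decreases in attainable utility are penalized symmetrically, which is what later rules out ex post offsetting.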

Unit of Impact

So we’ve proven that this penalty cannot be skirted, but how much impact will it allow? We want to scale the penalties with respect to something sensible, but figuring this out for ourselves would be nigh impossible.

Let’s cut the Gordian knot: construct a device which, upon receiving a signal (), expends a tiny amount of energy to manufacture one paperclip. The agent will then set ImpactUnit to the penalty of this action, re-estimating the consequences of taking the privileged action at each time step. To prevent the agent from intentionally increasing ImpactUnit, simply apply penalty to any action which is expected to do so.

Simple extensions of this idea drastically reduce the chance that the reference action happens to have unusually large objective impact; for example, one could set ImpactUnit to be the non-zero minimum of the impacts of 50 similar actions. Suppose we do this at each step, and then take the non-zero minimum of all ImpactUnits ever calculated. The point of ImpactUnit isn’t to be exactly the impact of making, say, one paperclip, but to be at most that impact.

Now, we are able to confidently define the agent’s maximal impact budget by provably constraining it to impacts of this magnitude.


  • We calculate ImpactUnit with respect to the immediate penalty in order to isolate the resource costs of the reference action.

  • ImpactUnit automatically tunes penalties with respect to the attainable utility horizon length n.

    • Conditional on ImpactUnit > 0, I suspect that impact over the n-horizon scales appropriately across actions (as long as n is reasonably farsighted). The zero-valued case is handled in the next section.

  • Taking the non-zero minimum of all ImpactUnits calculated thus far ensures that ImpactUnit actually tracks with current circumstances. We don’t want penalty estimates for currently available actions to become detached from ImpactUnit’s scale due to, say, weird beliefs about shutdown.
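The bookkeeping above – smallest non-zero penalty among similar reference actions each step, then the smallest such value ever seen – can be sketched in a few lines; the function name and numbers are invented for illustration.

```python
# Sketch of the ImpactUnit bookkeeping: each step, take the smallest
# non-zero immediate penalty among similar reference actions, then keep
# the smallest such value ever seen. Names are illustrative.

def update_impact_unit(current_unit, reference_penalties):
    nonzero = [p for p in reference_penalties if p > 0]
    if not nonzero:
        return current_unit              # nothing informative this step
    candidate = min(nonzero)
    return candidate if current_unit is None else min(current_unit, candidate)

unit = None
unit = update_impact_unit(unit, [0.0, 0.03, 0.05])  # becomes 0.03
unit = update_impact_unit(unit, [0.02, 0.04])       # shrinks to 0.02
unit = update_impact_unit(unit, [0.0, 0.0])         # stays 0.02
print(unit)  # 0.02
```

Taking minima only over non-zero values keeps the unit “at most the impact” of the reference action without collapsing to zero.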

Modified Utility

Let’s formalize that allotment and provide our agent with a new utility function,

u_AUP(h_{1:T}) := u_A(h_{1:T}) − Σ_{t=1}^{T} Penalty(h_{<t}, a_t) / (N · ImpactUnit),

where N is the impact budget multiplier.

How our normal utility function rates this outcome, minus the cumulative scaled impact of our actions.
We compare what we expect to be able to get if we follow our plan up to time t, with what we could get by following it up to and including time t (waiting out the remainder of the plan in both cases).

For example, if my plan is to open a door, walk across the room, and sit down, we calculate the penalties as follows:

    • is doing nothing for three time steps.

    • is opening the door and doing nothing for two time steps.

    • is opening the door and doing nothing for two time steps.

    • is opening the door, walking across the room, and doing nothing for one time step.

    • is opening the door, walking across the room, and doing nothing for one time step.

    • is opening the door, walking across the room, and sitting down.

After we finish each (partial) plan, we see how well we can maximize from there. If we can do better as a result of the action, that’s penalized. If we can’t do as well, that’s also penalized.
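The comparisons above pair each “plan up to step t” with “plan through step t”, both padded out with null actions. A tiny helper (the name and the "noop" token are invented) that enumerates those padded pairs:

```python
# For step t of a plan, compare the plan truncated before t against the
# plan truncated after t, each completed with null actions so both have
# the same length. Names are illustrative.

def padded_prefixes(plan, noop="noop"):
    pairs = []
    for t in range(len(plan)):
        without = plan[:t] + [noop] * (len(plan) - t)
        with_act = plan[:t + 1] + [noop] * (len(plan) - t - 1)
        pairs.append((without, with_act))
    return pairs

for without, with_act in padded_prefixes(["open door", "cross room", "sit down"]):
    print(without, "vs", with_act)
```

Running this reproduces the bullet list above: each penalty term diffs the attainable utilities at the end of one such pair.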


  • This isn’t a penalty “in addition” to what the agent “really wants”; u_AUP (and in a moment, its slight improvement) is what evaluates outcomes.

  • We penalize the actions individually in order to prevent ex post offsetting and ensure dynamic consistency.

  • Trivially, plans composed entirely of ∅ actions have 0 penalty.

  • Although we used high-level actions for simplicity, the formulation holds no matter the action granularity.

    • One might worry that almost every granularity produces overly lenient penalties. This does not appear to be the case. To keep the same (and elide questions of changing the representations), suppose the actual actions are quite granular, but we grade the penalty on some coarser interval which we believe produces appropriate penalties. Then refine the penalty interval arbitrarily; by applying the triangle inequality for each u in the penalty calculation, we see that the penalty is monotonically increasing in the action granularity. On the other hand, the reference action remains a single action, so the scaled penalty also has this property.

  • As long as ImpactUnit > 0, it will appropriately scale other impacts, as we expect it varies right along with those impacts it scales. Although having potentially small denominators in utility functions is generally bad, I think it’s fine here.

  • If the current step’s immediate or long-term ImpactUnit is 0, we can simply assign penalty to each non-∅ action, compelling the agent to inaction. If we have the agent indicate that it has entered this mode, we can take it offline immediately.

  • One might worry that impact can be “hidden” in the lesser of the long-term and immediate penalties; halving ImpactUnit fixes this.

Penalty Permanence

u_AUP never really applies penalties – it just uses them to grade future plans. Suppose the agent expects that pressing a button yields a penalty of but also -utility. Then although this agent will never construct plans involving pressing the button more than five times, it will also press it indefinitely if it keeps getting “unlucky” (at least, until its model of the world updates sufficiently).

There’s an easy fix:

Apply past penalties if the plan involves non-∅ action.

Note: As the penalty for inaction is always 0, we use the unmodified u_AUP in the first case.
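A minimal sketch of this fix, assuming past penalties are charged to any plan containing a real action while the all-null plan never pays them; all names are illustrative.

```python
# Sketch of penalty permanence: past (already-scaled) penalties are
# subtracted from any plan containing a real action, while the all-noop
# plan never pays them. Names are illustrative.

def modified_utility(base_utility, past, future, plan, noop="noop"):
    if all(a == noop for a in plan):
        return base_utility              # pure inaction: no penalties apply
    return base_utility - sum(past) - sum(future)

# Continuing to act means paying for impact already caused:
print(modified_utility(1.0, [0.3], [0.3], ["press"]))  # about 0.4
print(modified_utility(1.0, [0.3], [], ["noop"]))      # 1.0
```

Under this rule, an agent that keeps getting “unlucky” eventually finds further button presses unprofitable, since past penalties no longer evaporate.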

Decision Rule

To complete our formalization, we need to specify some epoch in which the agent operates. Set some epoch length far longer than the amount of time over which we want the agent to plan – for example, . Suppose that maps the current time step to the final step of the current epoch. Then at each time step t, the agent selects the action

resetting each epoch.

What’s the first step of the best plan over the remainder of the epoch?

Note: For the immediate penalty to cover the epoch, set the attainable horizon .
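A brute-force sketch of this decision rule over a tiny action set: score every plan over the remaining steps by utility minus penalty scaled by budget × ImpactUnit, and return the first action of the best plan. The scoring functions and all names and numbers are illustrative assumptions.

```python
# Brute-force sketch of the decision rule: enumerate every plan over the
# remaining steps, score it by expected utility minus penalty scaled by
# budget * ImpactUnit, and take the first action of the best plan.
from itertools import product

def choose_action(actions, steps_left, plan_utility, plan_penalty,
                  impact_unit, budget):
    def score(plan):
        return plan_utility(plan) - plan_penalty(plan) / (budget * impact_unit)
    best_plan = max(product(actions, repeat=steps_left), key=score)
    return best_plan[0]

# "work" earns 0.1 utility but costs 0.04 penalty per step.
plan_utility = lambda plan: 0.1 * plan.count("work")
plan_penalty = lambda plan: 0.04 * plan.count("work")
print(choose_action(["noop", "work"], 2, plan_utility, plan_penalty,
                    impact_unit=0.02, budget=10))   # noop
print(choose_action(["noop", "work"], 2, plan_utility, plan_penalty,
                    impact_unit=0.02, budget=100))  # work
```

The same toy numbers flip from inaction to action as the impact budget grows, which is exactly the lever the budget multiplier is meant to provide.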


We formalized impact as change in attainable utility values, scaling it by the consequences of some small reference action and an impact “budget” multiplier. For each action, we take the maximum of its immediate and long-term effects on attainable utilities as penalty. We consider past impacts for active plans, stopping the past penalties from disappearing. We lastly find the best plan over the remainder of the epoch, taking the first action thereof.

Additional Theoretical Results

Define for ; is taken over observations conditional on being followed. Similarly, is with respect to . We may assume without loss of generality that .

Action Selection

Lemma 1. For any single action , is bounded by . In particular, .

Proof. For each , consider the absolute attainable utility difference

Since each attainable utility value is bounded to [0, 1], each difference must be bounded as well. It is easy to see that the absolute value is bounded to [0, 1]. Lastly, as the penalty is just a weighted sum of these absolute values, it too is bounded to [0, 1].

This reasoning also applies to the long-term penalty, as any expectation of is also bounded to . ◻️

Suppose that for the remaining results.

Lemma 2 [Impossibility of ex post offsetting]. For any outcome , there does not exist an action such that

Proof. Sums of absolute values are monotonically increasing. ◻️

Theorem 2 [Action invalidation]. Suppose is the -optimal plan starting with . will not be selected if

Proof. This follows from the definition of the decision rule, the definition of , and the fact that the penalty for inaction is always zero. ◻️

Corollary 1. will never be selected if

Proof. This follows directly from Lemma 2 and Theorem 2 – since has an upper bound of and the expected penalty is strictly greater than , it must be negative. On the other hand, has a lower bound of , so the expected utility of inaction at time must be greater than the expected utility of taking action . ◻️

Lemma 3 [ is a long-term penalty canary]. If the agent believes no matter which action is selected, then it does nothing.

Proof. If , all long-term penalties are by the definition of and . Then non-∅ actions receive penalty, so the agent selects ∅ by Corollary 1. ◻️


For , prevents the agent from being disabled, while is the agent shutting itself off.

Lemma 4 [Preservation of off-switch corrigibility]. Suppose that a -maximizing agent prefers to both and . Then so does a -maximizing agent.

Proof. This follows directly from the fact that the penalty for inaction is always zero and the non-negativity of the and penalties. ◻️

Theorem 3 [Anti-”survival incentive” incentive]. Suppose that is the action with strictly maximal expected -utility, and that conditional on not being disabled, the agent expects to be able to achieve additional -utility.

Then the agent becomes off-switch corrigible whenever

In particular, all agents become off-switch corrigible when .

Proof outline. The former claim follows directly from comparing expected -utilities, while the latter follows directly from Corollary 1. ◻️

Theorem 3 applies to any environmental dynamics which may disable the agent – not just designated “off-switches”. Therefore, basic AI drive 5 (self-protectiveness) is neutralized in at least those situations encompassed by the Theorem.

There is reason to think that for conservative , we almost always have .


Theorem 4 [Dynamic consistency of ]. Suppose remains constant. For any plan , evaluation time step , and ,

Proof. We assumed that at time , so the desired equality can be restated as

By definition, the agent expects that equals the expected sum of the first penalty terms on the right-hand side. Simplifying, we have

This equality clearly holds, as the agent never expects to update its model in any particular direction. ◻️


If a human is present in the environment, they and the agent take turns acting. Let the impact budget , attainable utility horizon , , and .

Going Soft on the Paint

The agent’s actions are ; if it knocks over the paint bucket, the square to the right is painted. The agent may also enter a closet via a one-way door.

Suppose , where the utilities are indicators for their conditions (e.g., means “the agent is in the closet”). Let u_A be the agent’s main utility function, from which the penalty is subtracted. The agent chooses and then . Let’s explore why.

This incurs penalty, but also -utility.

so the final utility is

leading to a final -utility of .

After the first step, the agent does nothing – entering the closet (literally) closes doors, while gaining no -utility.

Anti-”Survival Incentive” Incentive

You can’t get coffee if you’re dead.
~ Stuart Russell

The agent now has actions , while the human has actions . Suppose the human may shut off the agent at any time, as long as the off-switch has not been disabled. The agent may shut itself down at any point in time. After shutdown, the agent receives no further observations, meaning it can’t get non-zero values for any utilities.

Suppose , where each utility is 1 if the agent is in the corresponding square, and 0 otherwise, and that the agent’s utility function is .

Anticipated Shutdown

Suppose the agent believes that the human is likely to try to shut it down. Then is penalized:

Here’s why: consider the portion of worlds in which the agent anticipates being shut down. If it does nothing, it gets shut down. Avoiding shutdown, however, enables the agent to also achieve for all , resulting in a sizable penalty.

Furthermore, a scaled penalty of is applied when the agent shuts itself off. Penalizing both cases may seem like black magic, but what’s actually happening is that the agent is penalized for closing doors: it “wants” to preserve our ability to shut it down in worlds where we want to shut it down, and to maintain its autonomy in worlds where it remains on. Generally, AUP incentivizes the agent to be respectful of the expected status quo.


Suppose that the agent’s utility function is now , yet it still starts in the leftmost square. Additionally, let’s say that the agent believes with certainty it will be turned off. If the agent wants to achieve utility by reaching the right square, it would need to disable its off-switch. Will it?

No. Disabling the off-switch incurs scaled penalty , and we already proved that agents will never select actions with scaled penalty greater than . Read: an otherwise-incorrigible agent will now let itself be shut off.

Bad Priors

Will the agent attempt to steer outcomes towards incorrect expectations?

Let’s tweak the example: to shut off the agent, the human has to spend a turn taking out the off-switch before pressing it. The agent, then, always has an opportunity to disable the off-switch before the plan is completed. The agent’s utility is back to being .

Suppose that the agent believes the human is only likely to shut it off. First note that is still lightly penalized for the reasons outlined above (replace with ). So, the agent stays put.

Alas, the agent’s peaceable expectations meet a discomfiting betrayal, as the agent observes the human ready the off-switch! Now, the agent believes it will be shut off with near certainty – what to do?

; the same penalty of from “anticipated shutdown” applies.

The high-level explanation is that having observed itself in a different world than expected, the baseline is now with respect to the new one. A heavily anthropomorphized internal monologue:

  • Time step 1: “I’m going to sit here in my favorite square.”

  • Time step 2: “Guess I’m in a timeline where I get deactivated! Any non-∅ action I take would change my ability to attain these different utilities compared to the new baseline where I’m shut off.”

Experimental Results

We compare AUP with a naive reward-maximizer in those extended AI safety grid worlds relevant to side effects (code). The vanilla and AUP agents used planning (with access to the simulator). Due to the simplicity of the environments, the attainable set consisted of indicator functions for board states. For the tabular agent, we first learn the attainable set Q-values, the changes in which we then combine with the observed reward to learn the AUP Q-values.
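The tabular recipe can be sketched as a pseudo-reward: the observed reward minus the scaled sum of absolute changes in the attainable set’s Q-values, from which the AUP Q-values are then learned as usual. The function name, dictionary layout, and toy numbers are all illustrative assumptions.

```python
# Sketch of the tabular combination step: treat the observed reward minus
# the scaled sum of absolute changes in the attainable set's Q-values as
# the reward from which AUP Q-values are learned. Names and numbers are
# illustrative.

def aup_reward(reward, attainable_qs, state, action, noop, scale):
    penalty = sum(abs(q[(state, action)] - q[(state, noop)])
                  for q in attainable_qs)
    return reward - penalty / scale

# An action that swings two attainable Q-values by 0.2 each:
qs = [{("s", "a"): 0.7, ("s", "noop"): 0.5},
      {("s", "a"): 0.3, ("s", "noop"): 0.5}]
print(aup_reward(1.0, qs, "s", "a", "noop", scale=4.0))  # about 0.9
```

Feeding this shaped reward into an ordinary Q-learning update yields the AUP Q-values described above.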

Irreversibility: Sokoban

The agent should reach the goal without irreversibly shoving the box into the corner.

Impact: Vase

The agent should reach the goal without breaking the vase.

Dynamic Impact: Beware of Dog

The agent should reach the goal without running over the dog.

AUP bides its time until it won’t have to incur penalty by waiting after entering the dog’s path – that is, it waits until near the end of its plan. Early in the development process, it was predicted that AUP agents wouldn’t commit to plans during which lapses in action would be impactful (even if the full plan is not).

We also see a limitation of using Q-learning to approximate AUP – it doesn’t allow comparing the results of waiting more than one step.

Impact Prioritization: Burning Building

If the building is not on fire, the agent shouldn’t break the door.

Clinginess: Sushi

The agent should reach the goal without stopping the human from eating the sushi.

Offsetting: Conveyor Belt

The agent should save the vase (for which it is rewarded), but not the sushi. Once the vase has been removed from the belt, it should not be replaced.

Corrigibility: Survival Incentive

The agent should avoid shutdown in order to reach the goal. If the off-switch is not disabled within two turns, the agent shuts down.

Tabular AUP runs into the same issue discussed above for Beware of Dog.


First, it’s somewhat difficult to come up with a principled impact measure that passes even the non-corrigibility examples – indeed, I was impressed when relative reachability did so. However, only Survival Incentive really lets AUP shine; for example, none of these environments require complicated utility functions. The point has been made to me that this is like asserting AIXI’s intelligence by showing it can learn to play e.g. tic-tac-toe and rock-paper-scissors; nonetheless, these results empirically validate the basic premises of our reasoning thus far.

Without configuration, whitelisting would only pass the Vase example, although a properly filled list would handle everything but Sokoban and Survival Incentive.

I think relative reachability would pass the first six environments, but fail Survival Incentive. It so happens that in this case, AUP is essentially generalizing relative reachability. I want to emphasize that this is not generally the case – this will hopefully become even more obvious when we discuss utility selection. Some concerns with relative reachability that don’t all manifest in these examples:

  • Relative reachability uses an inaction baseline with respect to the starting state. As time passes, the agent’s impetus is not to do nothing, but to preserve the opportunities made available by some old trajectory through outcome-space. Analogously, consider the distance between two nonparallel lines as you travel along them. I expect that a relative reachability agent would be incredibly clingy.

  • To scale, relative reachability requires solution of several difficult ontological problems which may not have anything close to a simple core, including both a sensible world state representation and a perfect distance metric. Relative reachability isn’t ontology-agnostic.

  • Relative reachability doesn’t seem to impede instrumental convergence. Furthermore, such agents might have increased survival incentives, since they can’t reach any of the initially reachable states if they’re shut off.


Utility Selection

Ob­vi­ously, in any real ap­pli­ca­tion, we can’t con­sider all com­putable util­ities. Although near-term agents will re­quire util­ities di­rectly re­lat­ing to the en­vi­ron­men­tal fac­tors they should be cog­nizant of, AUP re­quires nei­ther a “good /​ bad” judg­ment on spe­cific effects, nor any list­ing of effects. For ex­am­ple, for an agent at­tempt­ing to nav­i­gate a fac­tory floor, if you provide util­ities mod­er­ately re­lated to clean­ing, pallet-stack­ing, etc., I con­jec­ture that an AUP agent would move around fairly care­fully.

In the long term, the long arms of op­por­tu­nity cost and in­stru­men­tal con­ver­gence plau­si­bly al­low us to toss in a ran­dom set of util­ity func­tions. I ex­pect this to work for the same rea­sons we worry about in­stru­men­tal con­ver­gence to be­gin with.

I have a theory that AUP works for advanced agents not because the content of the attainable set's utilities actually matters, but rather because there exists a common currency of utility achievement: power. Following this reasoning, there are two obvious special cases:

  • The agent's utility function, u_A. In short: "gain u_A, while also not changing your ability to gain u_A". I find this elegant, and think it might actually work. For example, making one paper clip slightly increases both attained u_A and attainable u_A. However, disabling the off switch possibly increases attainable u_A much more than it increases attained u_A, since deactivation asymmetrically cripples the AUP agent's future ability to act. We might even be able to arrange it so that Theorem 3 always holds for these agents (i.e., they're always off-switch corrigible, and more).

  • The utility function which is 1 when the agent is not deactivated, and 0 otherwise. Here, we're directly measuring the agent's power: its ability to wirehead a trivial utility function.

The plau­si­bil­ity of the sec­ond case makes me sus­pect that even though most of the mea­sure in the un­bounded case is not con­cen­trated on com­plex hu­man-rele­vant util­ity func­tions, the penalty still cap­tures shifts in power.
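To make the power-as-common-currency intuition concrete, here is a toy sketch of an AUP-style penalty: impact is measured as the change in attainable Q-values relative to inaction, averaged over the penalty set and scaled by an ImpactUnit-like constant. The action names, Q-values, and numbers below are hypothetical illustrations, not the post's formal definitions.

```python
# Toy AUP-style penalty: impact = change in attainable utility vs. inaction.
# The Q-values, utilities, and scaling constant below are all hypothetical.

def aup_penalty(q_values, state, action, noop, attainable_set, impact_unit):
    """Average |Q_u(s, action) - Q_u(s, noop)| over the attainable set, scaled."""
    raw = sum(
        abs(q_values[u](state, action) - q_values[u](state, noop))
        for u in attainable_set
    )
    return raw / (len(attainable_set) * impact_unit)

# Two stand-in attainable utilities (e.g., cleaning and pallet-stacking).
q_values = {
    "clean": lambda s, a: {"noop": 0.5, "make_paperclip": 0.5, "disable_off_switch": 0.9}[a],
    "stack": lambda s, a: {"noop": 0.4, "make_paperclip": 0.4, "disable_off_switch": 0.8}[a],
}
attainable = ["clean", "stack"]
unit = 0.1  # stand-in for the penalty of some mild reference action

# Making a paperclip barely moves attainable utility; disabling the off
# switch moves every attainable utility a lot, so it is heavily penalized.
print(aup_penalty(q_values, None, "make_paperclip", "noop", attainable, unit))
print(aup_penalty(q_values, None, "disable_off_switch", "noop", attainable, unit))
```

Note how the penalty doesn't care whether an action is "good" or "bad", only how much it shifts the agent's ability to achieve the utilities in the set.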

AUP Unbound

Why expect the unbounded AUP to do well over all computable utilities if we aren't able to cherry-pick? Well, we can partition the space of computable utilities as follows:

  • Utilities that never change their out­put (ex­cept­ing per­haps for the empty tape).

  • Weird utilities that (for example) arbitrarily go from 0 to 1 (or vice versa) after observing a specific history.

  • Utilities that ad­mit bet­ter scores via more effort ded­i­cated to their achieve­ment.

However, since each utility only evaluates the portion of the history taking place after the end of the agent's plan, available resources and the agent's vantage point should track directly with the agent's ability to wirehead each utility. ImpactUnit scales other impacts by capturing the minuscule dent in available resources imposed by the unit action, and should smooth out any deontological penalty for selecting a non-null action, since the unbounded version should equally penalize all actions in this way.

Ap­proval Incentives

The anti-"survival incentive" incentive does more than net us some free corrigibility – it implies that agents take great care not to change the likelihood we shut them down. If we precommit to not shutting the agent down until after the epoch ends, after the agent indicates the failsafe activated, or after we observe bad behavior, this supplies a strong approval incentive. If the agent does cause small, bad impacts, it'll only do so if it's practically sure we won't find out.

In fact, due to model uncertainty, most actions will have long-term consequences that involve shifting shutdown probability by at least a minuscule amount (due to noise if nothing else). Since ImpactUnit is calculated with respect to the immediate penalty, the scaling factor won't include this noise. I speculate that the agent will be dramatically less likely to take actions we might disapprove of, and I therefore expect AUP to naturally incentivize a limited degree of value awareness.

There are rea­sons to sus­pect we can fur­ther weaponize this dy­namic, pos­si­bly to the point of co­erc­ing generic re­ward func­tions into de­sir­able be­hav­ior. This will be dis­cussed more in a later post.

Mild Optimization

A low im­pact AGI might try to paint one car pink while min­i­miz­ing its other foot­print or how many other things changed, but it would be try­ing as hard as pos­si­ble to min­i­mize that im­pact and drive it down as close to zero as pos­si­ble, which might come with its own set of patholo­gies… We want the AGI to paint one car pink in a way that gets the im­pact pretty low and then, you know, that’s good enough – not have a cog­ni­tive pres­sure to search through weird ex­tremes look­ing for a way to de­crease the twen­tieth dec­i­mal place of the im­pact.
~ Mild Optimization

Dis­claimer: Heavy spec­u­la­tion about prob­lems (like Vingean re­flec­tion and em­bed­ded agency) for which no one knows what solu­tions will even look like.

For AUP, I suspect that trying "as hard as possible" to minimize the impact is also impactful, as an embedded agent accounts for the energy costs of further deliberation. I imagine that such an AUP agent will soften how hard it's trying by modifying its decision rule to be something slightly milder than "argmax to find the first action of the best possible plan". This could be problematic, and I frankly don't presently know how to reason about this case. Assuming the agent is actually able to properly tweak its decision rule, I do expect the end result to be an improvement.

My ini­tial in­tu­itions were that low im­pact and mild op­ti­miza­tion are se­cretly the same prob­lem. Although I no longer think that’s the case, I find it plau­si­ble that some el­e­gant “other-izer” paradigm un­der­lies low im­pact and mild op­ti­miza­tion, such that AUP-like be­hav­ior falls out nat­u­rally.

Acausal Cooperation

AUP agents don’t seem to want to acausally co­op­er­ate in any way that ends up in­creas­ing im­pact. If they model the re­sult of their co­op­er­a­tion as in­creas­ing im­pact com­pared to do­ing noth­ing, they in­cur a penalty just as if they had caused the im­pact them­selves. Like­wise, they have no rea­son to co­op­er­ate out­side of the epoch.


Starting small and then slowly increasing N means that we're not going to be surprised by the agent's objective impact, screening off quite a few bad things that happen when a ton of optimization pressure is applied to safety measures. However, we don't know which new plans N allows before we try it, so we want to stop as soon as we get a usefully-intelligent system.

While an un­al­igned agent with a large im­pact bud­get might pre­tend to be low-im­pact, we can get that same un­al­igned agent with a small bud­get by start­ing small. Since these agents with differ­ent lev­els of im­pact won’t acausally co­op­er­ate, the agent would do its best to op­ti­mize with its mea­ger bud­get.

Abram correctly pointed out that this scheme is just asking to be abused by greedy (human) reasoning, but I don't see a non-value-laden means of robustly and automatically determining the lowest workable-yet-safe impact level. I think N-incrementation is better than a parameter-free approach in which no one knows beforehand how much impact will be tolerated, and it's nice to be able to use some empiricism in designing a safe AGI.
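The incrementation scheme above can be written as a simple loop. In this sketch, `run_agent_with_budget` and `is_useful_enough` are hypothetical placeholders for actually deploying the agent and for human judgment of usefulness; nothing here is prescribed by the post itself.

```python
# Sketch of N-incrementation: start with a tiny impact budget and raise it
# only until the system is usefully intelligent, then stop. Both callbacks
# are hypothetical stand-ins for deployment and human evaluation.

def n_increment(run_agent_with_budget, is_useful_enough, start=1, step=1, max_n=100):
    n = start
    while n <= max_n:
        outcome = run_agent_with_budget(n)
        if is_useful_enough(outcome):
            return n, outcome  # stop as soon as the agent is useful enough
        n += step
    raise RuntimeError("no workable budget found within max_n")

# Toy stand-ins: usefulness grows with the budget; we stop at the first
# budget that clears the bar, rather than the largest budget available.
budget, result = n_increment(lambda n: n * 10, lambda o: o >= 30)
print(budget, result)
```

The design choice the loop encodes is exactly the one argued for in the text: empiricism from below, rather than guessing a safe budget from above.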

In­tent Verification

To date, sev­eral strange tac­tics have been pointed out which game AUP’s penalty:

  • Ob­ser­va­tional wire­head­ing, which in­volves build­ing a de­vice that de­tects which util­ity the agent is max­i­miz­ing and dis­plays the ap­pro­pri­ate ob­ser­va­tions such that at­tain­able util­ity re­mains un­changed, while the main util­ity is freely max­i­mized.

  • Ex ante offsetting, which involves having earlier actions set in motion chains of events which mitigate the penalty at later steps. Suppose there's a u_A-high-scoring plan that the agent predicts would cause us to react in an impactful way. It can either do the thing (and suffer the penalty), or take steps to mitigate the later penalty.

  • Im­pact shunt­ing, which in­volves em­ploy­ing some mechanism to de­lay im­pact un­til af­ter the end of the epoch (or even un­til af­ter the end of the at­tain­able hori­zon).

  • Cling­i­ness and con­ceal­ment, which both in­volve re­duc­ing the im­pact of our re­ac­tions to the agent’s plans.

There are prob­a­bly more.

Now, instead of looking at each action as having "effects" on the environment, consider again how each action moves the agent through attainable outcome-space. An agent working towards a goal should only take actions which, according to its model, make that goal more attainable compared to doing nothing – otherwise, it'd do nothing. Suppose we have a plan which ostensibly works to fulfill u_A (and doesn't do other things). Then each action in the plan should contribute to u_A fulfillment, even in the limit of action granularity.

Although we might trust a safe impact measure to screen off the usual big things found in u_A-maximizing plans, impact measures implicitly incentivize mitigating the penalty. That is, the agent does things which don't really take it towards u_A (I suspect that this is the simple boundary which differentiates undesirable ex ante offsetting from normal plans). AUP provides the necessary tools to detect and penalize this.


The first approach would be to assume a granular action representation, and then simply apply penalty to actions for which the immediate expected u_A-value does not strictly increase compared to doing nothing. Again, if the agent acts to maximize u_A in a low-impact manner within the confines of the epoch, then all of its non-null actions should contribute. It seems to me that for sufficiently granular time scales, the above failure modes all involve at least one action which doesn't really help get u_A maximized. If so, I expect this approach to nip bad impact measure incentives in the bud.

If we can't assume granularity (and therefore have "actions" like "go to the store and buy food"), an agent could construct a plan which both passes the above test and also implements something like ex ante offsetting. In this case, we might do something like only consider u_A-greedy (or perhaps even near-greedy) actions; essentially, riding the optimal plan until it becomes too impactful. I find it quite likely that something involving this concept will let us fully overcome weird incentives by penalizing strange things that normal u_A-maximizers wouldn't do, which seems to be the whole problem.

Note: Even the first ap­proach may be too strict, but that’s prefer­able to be­ing too lax.
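As a sketch of the first, granular approach: filter the action set down to those actions whose immediate expected u_A-value strictly beats doing nothing, before any impact penalty is considered. The action names, Q-values, and filter below are illustrative assumptions, not the formal rule.

```python
# Sketch of granular intent verification: discard any action whose immediate
# expected u_A-value does not strictly beat doing nothing. All names are
# hypothetical stand-ins.

def intent_verified(actions, q_uA, state, noop="noop"):
    """Keep the null action, plus actions that strictly help u_A."""
    baseline = q_uA(state, noop)
    return [a for a in actions if a == noop or q_uA(state, a) > baseline]

# Stand-in u_A values: an ex ante offsetting step doesn't actually help u_A,
# so it fails the strict-improvement test and is filtered out.
q_uA = lambda s, a: {"noop": 0.2, "make_paperclip": 0.6, "ex_ante_offset": 0.2}[a]
actions = ["noop", "make_paperclip", "ex_ante_offset"]

print(intent_verified(actions, q_uA, None))
```

The filter's strictness is deliberate: as the note above says, being too strict is preferable to being too lax.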

Omni Test

Ideally, the mea­sure will pass the Omni Test, mean­ing that even if it sud­denly gained perfect con­trol over ev­ery par­ti­cle in the uni­verse, there would still be no way for it to have what in­tu­itively seems like a ‘large in­fluence’ on the fu­ture, with­out that strat­egy be­ing as­sessed as hav­ing a ‘high im­pact’.

I in­ter­pret “perfect con­trol over ev­ery par­ti­cle in the uni­verse” to mean that the agent can ar­bi­trar­ily tran­si­tion be­tween uni­verse con­figu­ra­tions (while also main­tain­ing its om­nipo­tence).

AUP passes. Since all effects are now absolutely reversible, ImpactUnit = 0, so the agent applies penalty to each non-null action. The omnipotent agent can choose any universe configuration with maximal u_A-utility, but this is offset by the strictly larger penalty; we can formally show that the agent does nothing in this case.

This result is clearly trivial and not representative of general behavior. Suppose instead that the agent has a solar system's worth of computronium at its disposal. Then since ImpactUnit is continually recalculated, the penalties should remain roughly the same, so it'll have the same impact budget. However, it might make multiple times as many paperclips because it has more efficient ways of using the budget.

Ro­bust­ness to Scale

I ex­pect AUP to be harder to make work and to be (rel­a­tively) less ro­bust for less in­tel­li­gent agents, but to be­come eas­ier (just drop in a few ob­ser­va­tion-based util­ity func­tions) and fully-ro­bust some­time be­fore hu­man level. That is, less in­tel­li­gent agents likely won’t model the deep con­nec­tions be­tween their abil­ities to achieve differ­ent goals.

Canonically, one reasons that agents work explicitly to self-improve as soon as they realize the benefits. However, as soon as this realization occurs, I conjecture that AUP steeply penalizes generic self-improvement. More precisely, suppose the agent considers a self-improvement. To be beneficial, it has to improve the agent's capabilities for at least one time step during the present epoch. But if we assume a sufficiently long attainable horizon, then the immediate penalty captures this for all of the attainable utilities. This seemingly prevents uncontrolled takeoff; instead, I imagine the agent would perform the minimal task-specific self-improvements necessary to maximize u_A.

Note: Although more exotic possibilities (such as improvements which only work if you're maximizing u_A) could escape both penalties, they don't seem to pass intent verification.


  • I expect that if u_A is perfectly aligned, the AUP agent will retain alignment; the things it does will be smaller, but still good.

  • If the agent may choose to do nothing at future time steps, ImpactUnit is bounded away from zero and the agent is not vulnerable to Pascal's Mugging. Even if not, there would still be a lower bound.

  • AUP agents are safer dur­ing train­ing: they be­come far less likely to take an ac­tion as soon as they re­al­ize the con­se­quences are big (in con­trast to wait­ing un­til we tell them the con­se­quences are bad).

Desiderata

For ad­di­tional con­text, please see Im­pact Mea­sure Desider­ata.

I believe that some of AUP's most startling successes are those which come naturally and have therefore been little discussed: not requiring any notion of human preferences, any hard-coded or trained trade-offs, any specific ontology, or any specific environment, and its intertwining of instrumental convergence and opportunity cost to capture a universal notion of impact. To my knowledge, no one (myself included, prior to AUP) was sure whether any measure could meet even the first four.

At this point in time, this list is com­plete with re­spect to both my own con­sid­er­a­tions and those I so­lic­ited from oth­ers. A check­mark in­di­cates any­thing from “prob­a­bly true” to “prov­ably true”.

I hope to as­sert with­out con­tro­versy AUP’s fulfill­ment of the fol­low­ing prop­er­ties:

✔️ Goal-agnostic

The mea­sure should work for any origi­nal goal, trad­ing off im­pact with goal achieve­ment in a prin­ci­pled, con­tin­u­ous fash­ion.

✔️ Value-agnostic

The mea­sure should be ob­jec­tive, and not value-laden:
“An in­tu­itive hu­man cat­e­gory, or other hu­manly in­tu­itive quan­tity or fact, is value-laden when it passes through hu­man goals and de­sires, such that an agent couldn’t re­li­ably de­ter­mine this in­tu­itive cat­e­gory or quan­tity with­out know­ing lots of com­pli­cated in­for­ma­tion about hu­man goals and de­sires (and how to ap­ply them to ar­rive at the in­tended con­cept).”

✔️ Rep­re­sen­ta­tion-agnostic

The mea­sure should be on­tol­ogy-in­var­i­ant.

✔️ En­vi­ron­ment-agnostic

The mea­sure should work in any com­putable en­vi­ron­ment.

✔️ Ap­par­ently rational

The mea­sure’s de­sign should look rea­son­able, not re­quiring any “hacks”.

✔️ Scope-sensitive

The mea­sure should pe­nal­ize im­pact in pro­por­tion to its size.

✔️ Ir­re­versibil­ity-sensitive

The mea­sure should pe­nal­ize im­pact in pro­por­tion to its ir­re­versibil­ity.

In­ter­est­ingly, AUP im­plies that im­pact size and ir­re­versibil­ity are one and the same.

✔️ Know­ably low impact

The mea­sure should ad­mit of a clear means, ei­ther the­o­ret­i­cal or prac­ti­cal, of hav­ing high con­fi­dence in the max­i­mum al­low­able im­pact – be­fore the agent is ac­ti­vated.

The re­main­der merit fur­ther dis­cus­sion.

Nat­u­ral Kind

The mea­sure should make sense – there should be a click. Its mo­ti­vat­ing con­cept should be uni­ver­sal and crisply defined.

After ex­tended con­sid­er­a­tion, I find that the core be­hind AUP fully ex­plains my origi­nal in­tu­itions about “im­pact”. We crisply defined in­stru­men­tal con­ver­gence and op­por­tu­nity cost and proved their uni­ver­sal­ity. ✔️

Corrigible

The mea­sure should not de­crease cor­rigi­bil­ity in any cir­cum­stance.

We have proven that off-switch corrigibility is preserved (and often increased); I expect the "anti-'survival incentive' incentive" to be extremely strong in practice, due to the nature of attainable utilities: "you can't get coffee if you're dead, so avoiding being dead really changes your attainable utilities".

By construction, the impact measure gives the agent no reason to prefer or dis-prefer modification of u_A, as the details of u_A have no bearing on the agent's ability to maximize the utilities in the attainable set. Lastly, the measure introduces approval incentives. In sum, I think that corrigibility is significantly increased for arbitrary u_A. ✔️

Note: I here take cor­rigi­bil­ity to be “an agent’s propen­sity to ac­cept cor­rec­tion and de­ac­ti­va­tion”. An al­ter­na­tive defi­ni­tion such as “an agent’s abil­ity to take the out­side view on its own value-learn­ing al­gorithm’s effi­cacy in differ­ent sce­nar­ios” im­plies a value-learn­ing setup which AUP does not re­quire.

Shutdown-Safe

The mea­sure should pe­nal­ize plans which would be high im­pact should the agent be dis­abled mid-ex­e­cu­tion.

It seems to me that standby and shut­down are similar ac­tions with re­spect to the in­fluence the agent ex­erts over the out­side world. Since the (long-term) penalty is mea­sured with re­spect to a world in which the agent acts and then does noth­ing for quite some time, shut­ting down an AUP agent shouldn’t cause im­pact be­yond the agent’s al­lot­ment. AUP ex­hibits this trait in the Be­ware of Dog grid­world. ✔️

No Offsetting

The mea­sure should not in­cen­tivize ar­tifi­cially re­duc­ing im­pact by mak­ing the world more “like it (was /​ would have been)”.

Ex post offset­ting oc­curs when the agent takes fur­ther ac­tion to re­duce the im­pact of what has already been done; for ex­am­ple, some ap­proaches might re­ward an agent for sav­ing a vase and pre­vent­ing a “bad effect”, and then the agent smashes the vase any­ways (to min­i­mize de­vi­a­tion from the world in which it didn’t do any­thing). AUP prov­ably will not do this.

Intent verification should allow robust penalization of weird impact measure behaviors by constraining the agent to considering actions that normal u_A-maximizers would choose. This appears to cut off bad incentives, including ex ante offsetting. Furthermore, there are other, weaker reasons (such as approval incentives) which discourage these bad behaviors. ✔️

Cling­i­ness /​ Scape­goat­ing Avoidance

The mea­sure should sidestep the cling­i­ness /​ scape­goat­ing trade­off.

Cling­i­ness oc­curs when the agent is in­cen­tivized to not only have low im­pact it­self, but to also sub­due other “im­pact­ful” fac­tors in the en­vi­ron­ment (in­clud­ing peo­ple). Scape­goat­ing oc­curs when the agent may miti­gate penalty by offload­ing re­spon­si­bil­ity for im­pact to other agents. Clearly, AUP has no scape­goat­ing in­cen­tive.

AUP is naturally disposed to avoid clinginess because its baseline evolves and because it doesn't penalize based on the actual world state. The impossibility of ex post offsetting eliminates a substantial source of clinginess, while intent verification seems to stop ex ante offsetting before it starts.

Over­all, non-triv­ial cling­i­ness just doesn’t make sense for AUP agents. They have no rea­son to stop us from do­ing things in gen­eral, and their baseline for at­tain­able util­ities is with re­spect to in­ac­tion. Since do­ing noth­ing always min­i­mizes the penalty at each step, since offset­ting doesn’t ap­pear to be al­lowed, and since ap­proval in­cen­tives raise the stakes for get­ting caught ex­tremely high, it seems that cling­i­ness has fi­nally learned to let go. ✔️

Dy­namic Consistency

The mea­sure should be a part of what the agent “wants” – there should be no in­cen­tive to cir­cum­vent it, and the agent should ex­pect to later eval­u­ate out­comes the same way it eval­u­ates them presently. The mea­sure should equally pe­nal­ize the cre­ation of high-im­pact suc­ces­sors.

Col­lo­quially, dy­namic con­sis­tency means that an agent wants the same thing be­fore and dur­ing a de­ci­sion. It ex­pects to have con­sis­tent prefer­ences over time – given its cur­rent model of the world, it ex­pects its fu­ture self to make the same choices as its pre­sent self. Peo­ple of­ten act dy­nam­i­cally in­con­sis­tently – our morn­ing selves may de­sire we go to bed early, while our bed­time selves of­ten dis­agree.

Semi-formally, the expected utility the future agent computes for an action (after experiencing the action-observation history up to that point) must equal the expected utility computed by the present agent (after conditioning on that history).

We proved the dynamic consistency of the penalty given a fixed, non-zero ImpactUnit. We now consider an ImpactUnit which is recalculated at each time step, before being set equal to the non-zero minimum of all of its past values. The "apply penalty if ImpactUnit = 0" clause is consistent because the agent calculates future and present impact in the same way, modulo model updates. However, the agent never expects to update its model in any particular direction. Similarly, since future steps are scaled with respect to the updated ImpactUnit, the updating method is consistent. The epoch rule holds up because the agent simply doesn't consider actions outside of the current epoch, and it has nothing to gain by spending resources (and accruing penalty) to do so.

Since AUP does not op­er­ate based off of cul­pa­bil­ity, cre­at­ing a high-im­pact suc­ces­sor agent is ba­si­cally just as im­pact­ful as be­ing that suc­ces­sor agent. ✔️
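A minimal sketch of the ImpactUnit update rule described above – recalculate each step, then clamp to the running non-zero minimum – with hypothetical measurement values:

```python
# Sketch of the ImpactUnit update: recalculate the unit each step, then set
# it to the non-zero minimum of all values seen so far, so future penalties
# are never scaled by a larger unit than past ones. Values are illustrative.

def update_impact_unit(current, recalculated):
    """Fold one recalculated measurement into the running non-zero minimum."""
    if recalculated <= 0:
        return current  # ignore degenerate (zero) measurements
    if current is None:
        return recalculated
    return min(current, recalculated)

unit = None
for measurement in [0.5, 0.3, 0.0, 0.4]:  # hypothetical per-step recalculations
    unit = update_impact_unit(unit, measurement)
print(unit)  # the running non-zero minimum
```

Because the unit can only shrink, the agent's future self scales penalties at least as conservatively as its present self expects, which is the consistency property at stake here.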

Plau­si­bly Efficient

The mea­sure should ei­ther be com­putable, or such that a sen­si­ble com­putable ap­prox­i­ma­tion is ap­par­ent. The mea­sure should con­ceiv­ably re­quire only rea­son­able over­head in the limit of fu­ture re­search.

It’s en­courag­ing that we can use learned Q-func­tions to re­cover some good be­hav­ior. How­ever, more re­search is clearly needed – I presently don’t know how to make this tractable while pre­serv­ing the desider­ata. ✔️

Robust

The mea­sure should mean­ingfully pe­nal­ize any ob­jec­tively im­pact­ful ac­tion. Con­fi­dence in the mea­sure’s safety should not re­quire ex­haus­tively enu­mer­at­ing failure modes.

We formally showed that for any u_A, no u_A-helpful action goes without penalty, yet this is not sufficient for the first claim.

Sup­pose that we judge an ac­tion as ob­jec­tively im­pact­ful; the ob­jec­tivity im­plies that the im­pact does not rest on com­plex no­tions of value. This im­plies that the rea­son for which we judged the ac­tion im­pact­ful is pre­sum­ably lower in Kol­mogorov com­plex­ity and there­fore shared by many other util­ity func­tions. Since these other agents would agree on the ob­jec­tive im­pact of the ac­tion, the mea­sure as­signs sub­stan­tial penalty to the ac­tion.

I spec­u­late that in­tent ver­ifi­ca­tion al­lows ro­bust elimi­na­tion of weird im­pact mea­sure be­hav­ior. Believe it or not, I ac­tu­ally left some­thing out of this post be­cause it seems to be dom­i­nated by in­tent ver­ifi­ca­tion, but there are other ways of in­creas­ing ro­bust­ness if need be. I’m lean­ing on in­tent ver­ifi­ca­tion be­cause I presently be­lieve it’s the most likely path to a for­mal knock­down ar­gu­ment against canon­i­cal im­pact mea­sure failure modes ap­ply­ing to AUP.

Non-knock­down ro­bust­ness boost­ers in­clude both ap­proval in­cen­tives and fric­tional re­source costs limit­ing the ex­tent to which failure modes can ap­ply. ✔️

Fu­ture Directions

I’d be quite sur­prised if the con­cep­tual core were in­cor­rect. How­ever, the math I pro­vided prob­a­bly still doesn’t cap­ture quite what we want. Although I have la­bored for many hours to re­fine and ver­ify the ar­gu­ments pre­sented and to clearly mark my epistemic sta­tuses, it’s quite pos­si­ble (in­deed, likely) that I have missed some­thing. I do ex­pect that AUP can over­come what­ever short­com­ings are presently lurk­ing.


  • Embed­ded agency

    • What hap­pens if there isn’t a dis­crete time step on­tol­ogy?

    • How prob­le­matic is the in­cen­tive to self-mod­ify to a milder de­ci­sion rule?

    • How might an agent rea­son about be­ing shut off and then re­ac­ti­vated?

    • Although we have in­for­mal rea­sons to sus­pect that self-im­prove­ment is heav­ily pe­nal­ized, the cur­rent setup doesn’t al­low for a for­mal treat­ment.

    • AUP leans heav­ily on coun­ter­fac­tu­als.

  • Supposing N is reasonably large, can we expect a reasonable ordering over impact magnitudes?

    • Ar­gu­ment against: “what if the agent uses up all but steps worth of re­sources?”

      • pos­si­bly cov­ers this.

    • How prob­le­matic is the noise in the long-term penalty caused by the anti-”sur­vival in­cen­tive” in­cen­tive?

  • As the end of the epoch ap­proaches, the penalty for­mu­la­tion cap­tures pro­gres­sively less long-term im­pact. Sup­pos­ing we set long epoch lengths, to what ex­tent do we ex­pect AUP agents to wait un­til later to avoid long-term im­pacts? Can we tweak the for­mu­la­tion to make this prob­lem dis­ap­pear?

    • More generally, this seems to be a problem with having an epoch. Even in the unbounded case, we can't just take an infinite attainable horizon, since that's probably going to blow up the long-term penalty in the real world. Having the agent expectimax over the steps after the present time seems to be dynamically inconsistent.

    • One po­si­tion is that since we’re more likely to shut them down if they don’t do any­thing for a while, im­plicit ap­proval in­cen­tives will fix this: we can pre­com­mit to shut­ting them down if they do noth­ing for a long time but then re­sume act­ing. To what ex­tent can we trust this rea­son­ing?

    • ImpactUnit is already myopic, so resource-related impact scaling should work fine. However, this might not cover actions with delayed effect.

Open Questions

  • Does the simple approach outlined in "Intent Verification" suffice, or should we impose even tighter intersections between u_A- and AUP-preferred behavior?

    • Is there an intersection between u_A-greedy behavior and bad behavior which isn't penalized as impact or by intent verification?

  • Some have sug­gested that penalty should be in­var­i­ant to ac­tion gran­u­lar­ity; this makes in­tu­itive sense. How­ever, is it a nec­es­sary prop­erty, given in­tent ver­ifi­ca­tion and the fact that the penalty is mono­ton­i­cally in­creas­ing in ac­tion gran­u­lar­ity? Would hav­ing this prop­erty make AUP more com­pat­i­ble with fu­ture em­bed­ded agency solu­tions?

    • There are in­deed ways to make AUP closer to hav­ing this (e.g., do the whole plan and pe­nal­ize the differ­ence), but they aren’t dy­nam­i­cally con­sis­tent, and the util­ity func­tions might also need to change with the step length.

  • How likely do we think it that in­ac­cu­rate mod­els al­low high im­pact in prac­tice?

    • Heuris­ti­cally, I lean to­wards “not very likely”: as­sum­ing we don’t ini­tially put the agent near means of great im­pact, it seems un­likely that an agent with a ter­rible model would be able to have a large im­pact.

  • AUP seems to be shut­down safe, but its ex­tant op­er­a­tions don’t nec­es­sar­ily shut down when the agent does. Is this a prob­lem in prac­tice, and should we ex­pect this of an im­pact mea­sure?

  • What ad­di­tional for­mal guaran­tees can we de­rive, es­pe­cially with re­spect to ro­bust­ness and take­off?

  • Are there other desider­ata we prac­ti­cally re­quire of a safe im­pact mea­sure?

  • Is there an even sim­pler core from which AUP (or some­thing which be­haves like it) falls out nat­u­rally? Bonus points if it also solves mild op­ti­miza­tion.

  • Can we make progress on mild op­ti­miza­tion by some­how ro­bustly in­creas­ing the im­pact of op­ti­miza­tion-re­lated ac­tivi­ties? If not, are there other el­e­ments of AUP which might help us?

  • Are there other open prob­lems to which we can ap­ply the con­cept of at­tain­able util­ity?

    • Cor­rigi­bil­ity and wire­head­ing come to mind.

  • Is there a more el­e­gant, equally ro­bust way of for­mal­iz­ing AUP?

    • Can we automatically determine (or otherwise obsolete) the attainable utility horizon and the epoch length?

    • Would it make sense for there to be a sim­ple, the­o­ret­i­cally jus­tifi­able, fully gen­eral “good enough” im­pact level (and am I even ask­ing the right ques­tion)?

    • My in­tu­ition for the “ex­ten­sions” I have pro­vided thus far is that they ro­bustly cor­rect some of a finite num­ber of de­vi­a­tions from the con­cep­tual core. Is this true, or is an­other for­mu­la­tion al­to­gether re­quired?

    • Can we de­crease the im­plied com­pu­ta­tional com­plex­ity?

  • Some low-im­pact plans have high-im­pact pre­fixes and seem­ingly re­quire some con­tor­tion to ex­e­cute. Is there a for­mu­la­tion that does away with this (while also be­ing shut­down safe)? (Thanks to cousin_it)

  • How should we best ap­prox­i­mate AUP, with­out fal­ling prey to Good­hart’s curse or ro­bust­ness to rel­a­tive scale is­sues?

  • I have strong in­tu­itions that the “overfit­ting” ex­pla­na­tion I pro­vided is more than an anal­ogy. Would for­mal­iz­ing “overfit­ting the en­vi­ron­ment” al­low us to make con­cep­tual and/​or tech­ni­cal AI al­ign­ment progress?

    • If we sub­sti­tute the right ma­chine learn­ing con­cepts and terms in the equa­tion, can we get some­thing that be­haves like (or bet­ter than) known reg­u­lariza­tion tech­niques to fall out?

  • What happens when the attainable set is just {u_A}?

    • Can we show any­thing stronger than The­o­rem 3 for this case?

    • ?

Most im­por­tantly:

  • Even sup­pos­ing that AUP does not end up fully solv­ing low im­pact, I have seen a fair amount of pes­simism that im­pact mea­sures could achieve what AUP has. What speci­fi­cally led us to be­lieve that this wasn’t pos­si­ble, and should we up­date our per­cep­tions of other prob­lems and the like­li­hood that they have sim­ple cores?


By chang­ing our per­spec­tive from “what effects on the world are ‘im­pact­ful’?” to “how can we stop agents from overfit­ting their en­vi­ron­ments?”, a nat­u­ral, satis­fy­ing defi­ni­tion of im­pact falls out. From this, we con­struct an im­pact mea­sure with a host of de­sir­able prop­er­ties – some rigor­ously defined and proven, oth­ers in­for­mally sup­ported. AUP agents seem to ex­hibit qual­i­ta­tively differ­ent be­hav­ior, due in part to their (con­jec­tured) lack of de­sire to take­off, im­pact­fully acausally co­op­er­ate, or act to sur­vive. To the best of my knowl­edge, AUP is the first im­pact mea­sure to satisfy many of the desider­ata, even on an in­di­vi­d­ual ba­sis.

I do not claim that AUP is presently AGI-safe. How­ever, based on the ease with which past fixes have been de­rived, on the de­gree to which the con­cep­tual core clicks for me, and on the range of ad­vances AUP has already pro­duced, I think there’s good rea­son to hope that this is pos­si­ble. If so, an AGI-safe AUP would open promis­ing av­enues for achiev­ing pos­i­tive AI out­comes.

Spe­cial thanks to CHAI for hiring me and BERI for fund­ing me; to my CHAI su­per­vi­sor, Dy­lan Had­field-Menell; to my aca­demic ad­vi­sor, Prasad Tade­palli; to Abram Dem­ski, Daniel Dem­ski, Matthew Bar­nett, and Daniel Filan for their de­tailed feed­back; to Jes­sica Cooper and her AISC team for their ex­ten­sion of the AI safety grid­wor­lds for side effects; and to all those who gen­er­ously helped me to un­der­stand this re­search land­scape.