Impact Measure Desiderata

Previously: Worrying about the Vase: Whitelisting, Overcoming Clinginess in Impact Measures

If we can penalize some quantity of “impact on the world”, we can have unaligned agents whose impact—and thereby negative effect—is small.

The long-term goal of impact measure research is to find a measure which neatly captures our intuitive understanding of “impact”, which doesn’t allow cheap workarounds, which doesn’t fail in really weird ways, and so on. For example, when you really think through some existing approaches (like whitelisting), you see that the impact measure secretly also applies to things we do.

No approaches to date meet these standards. What do we even require of an impact measure we hope to make safe for use with arbitrarily powerful agents?



Goal-Agnosticism

The measure should be compatible with any original goal, trading off impact with goal achievement in a principled, continuous fashion.

Example: Maximizing the reward minus impact.

Why: Constraints seem too rigid for the general case.
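The tradeoff can be sketched in a couple of lines. This is only a minimal illustration; the penalty function and the tradeoff coefficient `lam` are hypothetical stand-ins, not a concrete proposal:

```python
def shaped_reward(reward: float, impact: float, lam: float) -> float:
    """Combine task reward with an impact penalty.

    lam = 0 recovers the original goal; increasing lam continuously
    trades goal achievement against impact, with no hard constraint.
    """
    return reward - lam * impact

# The agent maximizes the shaped reward instead of the raw reward:
assert shaped_reward(10.0, 4.0, 0.0) == 10.0  # pure goal-seeking
assert shaped_reward(10.0, 4.0, 2.0) == 2.0   # strongly impact-averse
```

Because `lam` enters linearly, the goal/impact tradeoff varies continuously rather than through a rigid, all-or-nothing constraint.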


Value-Agnosticism

The measure should be objective, and not value-laden:

“An intuitive human category, or other humanly intuitive quantity or fact, is value-laden when it passes through human goals and desires, such that an agent couldn’t reliably determine this intuitive category or quantity without knowing lots of complicated information about human goals and desires (and how to apply them to arrive at the intended concept).”

Example: Measuring what portion of initially accessible states are still accessible versus a neural network which takes two state representations and outputs a scalar representing how much “bad change” occurred.

Why: Strategically, impact measures are useful insofar as we suspect that value alignment will fail. If we substantially base our impact measure on some kind of value learning—you know, the thing that maybe fails—we’re gonna have a bad time. While it’s possible to only rely somewhat on a vaguely correct representation of human preferences, the extent to which this representation is incorrect is the (minimal) extent to which our measure is incorrect. Let’s avoid shared points of failure, shall we?

Practically, a robust value-sensitive impact measure is value-alignment complete, since an agent maximizing the negation of such a measure would be aligned (assuming the measure indicates which way is “good”).
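To make the contrast concrete, here is a toy sketch of the objective option: measuring the fraction of initially reachable states that remain reachable, in a small deterministic environment represented as a graph. The environment and state names below are invented for illustration:

```python
from collections import deque

def reachable(transitions, start):
    """Breadth-first search over a dict mapping state -> successor states."""
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        for t in transitions.get(s, ()):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

def reachability_ratio(before, after, start):
    """Fraction of initially reachable states still reachable afterwards."""
    initial = reachable(before, start)
    remaining = reachable(after, start) & initial
    return len(remaining) / len(initial)

# Toy example: an irreversible action (breaking a vase) prunes the
# transitions into the "vase intact" state "c".
before = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
after  = {"a": ["b"], "b": ["a"]}
assert reachability_ratio(before, after, "a") == 2 / 3
```

Note that nothing here references human preferences; the measure only inspects the transition structure, which is the sense in which it is objective.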


Ontology-Invariance

The measure should be ontology-invariant.

Example: Change in object identities versus [some concept of impact which transcends any specific way of representing the world].

Why: Suppose you represent your perceptions in one way, and calculate that you had impact on the world. Intuitively, if you represent your perceptions (or your guess at the current world state, or whatever) differently, but do the same things, you should calculate roughly the same impact for the same actions which had the same effects on the territory. In other words, the measure should be consistent across ways of viewing the world.


Environment-Agnosticism

The measure should work in any computable environment.

Example: Manually-derived penalties tailored to a specific gridworld versus information-theoretic empowerment.

Why: One imagines that there’s a definition of “impact” on which we and aliens—or even intelligent automata living in a Game of Life—would agree.

Nat­u­ral Kind

The measure should make sense—there should be a click. Its motivating concept should be universal and crisply defined.

Ap­par­ently Rational

The measure’s design should look reasonable, not requiring any “hacks”.

Example: Achieving off-switch corrigibility by hard-coding the belief “I shouldn’t stop humans from pressing the off-switch”. Clearly, this is hilariously impossible to manually specify, but even if we could, doing so should make us uneasy.

Roughly, “apparently rational” means that if we put ourselves in the agent’s position, we could come up with a plausible story about why we’re doing what we’re doing. That is, the story shouldn’t have anything like “and then I refer to this special part of my model which I’m inexplicably never allowed to update”.

Why: If the design is “reasonable”, then if the measure fails, it’s more likely to do so gracefully.


Scope-Sensitivity

The measure should penalize impact in proportion to its size.


Irreversibility-Sensitivity

The measure should penalize impact in proportion to its irreversibility.


Corrigibility

The measure should not decrease corrigibility in any circumstance.


Shutdown-Safety

The measure should penalize plans which would be high impact should the agent be disabled mid-execution.

Why: We may want to shut the agent down, which is tough if its plans are only low-impact when completed. Also, an agent lacking this property has plans which are more likely to go awry if even one step doesn’t pan out as expected. Do we really want “juggling bombs” to be “low impact”, conditional on the juggler being good?
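One way this property might be formalized, as a sketch: score a plan not by the impact of its completed execution, but by the worst impact over every possible interruption point. Both `impact_if_stopped` and the bomb-juggling numbers below are hypothetical:

```python
def shutdown_safe_penalty(plan, impact_if_stopped):
    """Penalize a plan by the worst impact it would have if the agent
    were disabled after any prefix, not just by its completed impact.

    impact_if_stopped(prefix) is a hypothetical impact measure applied
    to the world that results when execution halts after `prefix`.
    """
    prefixes = [plan[:i] for i in range(len(plan) + 1)]
    return max(impact_if_stopped(p) for p in prefixes)

# Toy illustration: "juggling bombs" looks low impact only if completed.
def toy_impact(prefix):
    # Bombs in the air (steps 1-3) are dangerous until caught (step 4).
    return 0.0 if len(prefix) in (0, 4) else 10.0

assert shutdown_safe_penalty(["t1", "t2", "t3", "catch"], toy_impact) == 10.0
```

Under this scoring, the bomb-juggling plan is penalized at its most dangerous intermediate point even though its completed execution would register as low impact.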

No Offsetting

The measure should not incentivize artificially reducing impact by making the world more “like it (was / would have been)”.

Example: Krakovna et al. describe a low impact agent which is rewarded for saving a vase from breaking. The agent saves the vase, and then places it back on the conveyor belt so as to “minimize” impact with respect to the original outcome.

This is called ex post offsetting. Ex ante offsetting, on the other hand, consists of taking actions beforehand to build a device or set in motion a chain of events which essentially accomplishes ex post offsetting. For example, a device requiring only the press of a button to activate could save the vase and then replace it, netting the agent the reward without requiring that the agent take further actions.

Some have suggested that actions like “give someone a cancer cure which also kills them at the same time they would have died anyways” count as ex ante offsetting. I’m not sure—this feels confused, because the downstream causal effects of actions don’t seem cleanly separable, nor do I believe we should separate them (more on that later). Also, how would an agent ever be able to do something like “build a self-driving car to take Bob to work” if each of the car’s movements is penalized separately from the rest of the plan? This seems too restrictive. On the other hand, if we allow ex ante offsetting in general, we basically get all of the downsides of ex post offsetting, with the only impediment being extra paperwork.

How “bad” the offsets are—and what ex ante offsetting allows—seems to depend on the measure itself. The ideal would certainly be to define and robustly prevent this kind of thing, but perhaps we can also bound the amount of ex ante offsetting that takes place to some safe level.

There may also be other ways around this seemingly value-laden boundary. In any case, I’m still not quite sure where to draw the line. If people have central examples they’d like to share, that would be much appreciated.

ETA: I weakly suspect I have this figured out, but I still welcome examples.

Clinginess / Scapegoating Avoidance

The measure should sidestep the clinginess / scapegoating tradeoff.

Example: A clingy agent might not only avoid breaking vases, but also stop people from breaking vases. A scapegoating agent would escape impact by modeling the autonomy of other agents, and then having those agents break vases for it.

Knowably Low Impact

The measure should admit of a clear means, either theoretical or practical, of having high confidence in the maximum allowable impact—before the agent is activated.

Why: If we think that a measure robustly defines “impact”—but we aren’t sure how much impact it allows—that could turn out pretty embarrassing for us.

Dynamic Consistency

The measure should be a part of what the agent “wants”—there should be no incentive to circumvent it, and the agent should expect to later evaluate outcomes the same way it evaluates them presently. The measure should equally penalize the creation of high-impact successors.

Example: Most people’s sleep preferences are dynamically inconsistent: one might wake up tired and wish for their later self to choose to go to bed early, even though they predictably end up wanting other things later.
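The sleep example can be made numeric. Hyperbolic discounting (value = reward / (1 + k * delay)) is a standard illustration of dynamic inconsistency: the ranking between two future rewards flips as they draw near. The rewards and delays below are invented:

```python
def hyperbolic_value(reward, delay, k=1.0):
    """Hyperbolic discounting: value = reward / (1 + k * delay)."""
    return reward / (1 + k * delay)

# Going to bed early pays 6 (at t=10 from the morning, t=1 from the
# evening); staying up pays 4 one step sooner (t=9, then t=0).
early_far,  late_far  = hyperbolic_value(6, 10), hyperbolic_value(4, 9)
early_near, late_near = hyperbolic_value(6, 1),  hyperbolic_value(4, 0)

assert early_far > late_far    # the morning self prefers sleeping early
assert early_near < late_near  # the evening self prefers staying up
```

A dynamically consistent impact measure rules out this kind of reversal: the agent should never predictably disagree with its future self about how impactful an outcome is.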

Plausibly Efficient

The measure should either be computable, or such that a sensible computable approximation is apparent. The measure should conceivably require only reasonable overhead in the limit of future research.


Robustness

The measure should meaningfully penalize any objectively impactful action. Confidence in the measure’s safety should not require exhaustively enumerating failure modes.

Example: “Suppose there’s some way of gaming the impact measure, but because of properties X, Y, and Z, we know this is penalized as well”.

Previous Proposals

Krakovna et al. propose four desiderata:

1) Penalize the agent for effects on the environment if and only if those effects are unnecessary for achieving the objective.
2) Distinguish between agent effects and environment effects, and penalize the agent only for the former, not the latter.
3) Give a higher penalty for irreversible effects than for reversible effects.
4) The penalty should accumulate when more irreversible effects occur.

First, notice that my list points at some abstract amount-of-impact, while the above proposal focuses on specific effects.

  • Thinking in terms of “effects” seems like a subtle map/territory confusion. That is, it seems highly unlikely that there exists a robust, value-agnostic means of detecting “effects” that makes sense across representations and environments.

  • Overcoming Clinginess in Impact Measures suggests that penalizing impact based on the world state necessitates a value-laden tradeoff.

I left out 1), as I believe that the desired benefit will naturally follow from an approach satisfying my proposed desiderata.

  • What does it mean for an effect to be “necessary” for achieving the objective, which might be a reward function? This seems to shove much of the difficulty into the word “necessary”, where anything not “necessary” is perhaps something occurring from optimizing the reward function harder than we’d prefer.

I de facto included 2) via the non-clingy desideratum, while 3) and 4) are captured by scope- and irreversibility-sensitivity.

I think that we can meet all of the properties I listed, and I welcome thoughts on whether any should be added or removed.

Thanks to Abram Demski for the “Apparently Rational” desideratum.