Overcoming Clinginess in Impact Measures

It may be pos­si­ble to use the con­cept of a causal coun­ter­fac­tual (as for­mal­ized by Pearl) to sep­a­rate some in­tended effects from some un­in­tended ones. Roughly, “fol­low-on effects” could be defined as those that are causally down­stream from the achieve­ment of the goal… With some ad­di­tional work, per­haps it will be pos­si­ble to use the causal struc­ture of the sys­tem’s world-model to se­lect a policy that has the fol­low-on effects of the goal achieve­ment but few other effects.
Tay­lor et al., Align­ment for Ad­vanced Ma­chine Learn­ing Systems

In which I out­line a solu­tion to the cling­i­ness prob­lem and illus­trate a po­ten­tially-fun­da­men­tal trade-off be­tween as­sump­tions about the au­ton­omy of hu­mans and about the re­spon­si­bil­ity of an agent for its ac­tions.

Con­sider two plans for en­sur­ing that a cauldron is full of wa­ter:

  • Filling the cauldron.

  • Filling the cauldron and sub­merg­ing the sur­round­ing room.

All else equal, the lat­ter plan does bet­ter in ex­pec­ta­tion, as there are fewer ways the cauldron might some­how be­come not-full (e.g., evap­o­ra­tion, and the minus­cule loss of util­ity that would en­tail). How­ever, the lat­ter plan “changes” more “things” than we had in mind.

Un­de­sir­able max­ima of an agent’s util­ity func­tion of­ten seem to in­volve chang­ing large swathes of the world. If we make “change” costly, that in­cen­tivizes the agent to search for low-im­pact solu­tions. If we are not cer­tain of a seed AI’s al­ign­ment, we may want to im­ple­ment ad­di­tional safe­guards such as im­pact mea­sures and off-switches.

I designed an impact measure called whitelisting which, while overcoming certain weaknesses of past approaches, is still vulnerable to clinginess.

Clinginess

An agent is clingy when it not only stops it­self from hav­ing cer­tain effects, but also stops you.
Con­sider some out­come—say, the spark­ing of a small for­est fire in Cal­ifor­nia. At what point can we truly say we didn’t start the fire?
  • I im­me­di­ately and visi­bly start the fire.

  • I in­ten­tion­ally per­suade some­one to start the fire.

  • I un­in­ten­tion­ally (but per­haps pre­dictably) in­cite some­one to start the fire.

  • I set in mo­tion a mod­er­ately-com­plex chain of events which con­vince some­one to start the fire.

  • I pro­voke a but­terfly effect which ends up start­ing the fire.

Taken liter­ally, I don’t know that there’s ac­tu­ally a sig­nifi­cant differ­ence in “re­spon­si­bil­ity” be­tween these out­comes—if I take the ac­tion, the effect hap­pens; if I don’t, it doesn’t. My ini­tial im­pres­sion is that un­cer­tainty about the re­sults of our ac­tions pushes us to view some effects as “un­der our con­trol” and some as “out of our hands”. Yet, if we had com­plete knowl­edge of the out­comes of our ac­tions, and we took an ac­tion that landed us in a Cal­ifor­nia-for­est-fire world, whom could we blame but our­selves?

Since we can only blame ourselves, we should take actions which do not lead to side effects. These actions may involve enacting precautions that prevent measured impact throughout the light cone, since the actions of other agents and small ripple effects of ours could lead to significant penalties if left unchecked.

Cling­i­ness arises in part be­cause we fail to model agents as any­thing other than ob­jects in the world. While it might be liter­ally true that there are not on­tolog­i­cally-ba­sic agents that es­cape de­ter­minism and “make choices”, it might be use­ful to ex­plore how we can pro­tect hu­man au­ton­omy via the ab­strac­tion of game-the­o­retic agency.

To ac­count for en­vi­ron­men­tal changes already set in mo­tion, a naive coun­ter­fac­tual frame­work was pro­posed in which im­pact is mea­sured with re­spect to the coun­ter­fac­tual where the agent did noth­ing. We will ex­plore how this fails, and how to do bet­ter.

Thought Experiments

We’re go­ing to iso­late the effects for which the agent is re­spon­si­ble over the course of three suc­ces­sively more gen­eral en­vi­ron­ment con­figu­ra­tions: one-off (make one choice and then do noth­ing), sta­tion­ary iter­a­tive (make choices, but your op­tions and their effects don’t change), and iter­a­tive (the real world, ba­si­cally).

Assumptions


  • we’re deal­ing with game-the­o­retic agents which make a choice each turn (see: could/​should agents).

  • we can iden­tify all rele­vant agents in the en­vi­ron­ment.

    • This seems difficult to meet ro­bustly, but I don’t see a way around it.

  • we can rea­son coun­ter­fac­tu­ally in a sen­si­ble way for all agents.

It is nat­u­ral to con­sider ex­tend­ing stan­dard prob­a­bil­ity the­ory to in­clude the con­sid­er­a­tion of wor­lds which are “log­i­cally im­pos­si­ble” (such as where a de­ter­minis­tic Rube Gold­berg ma­chine be­haves in a way that it doesn’t)… What, pre­cisely, are log­i­cally im­pos­si­ble pos­si­bil­ities?
Soares and Fallen­stein, Ques­tions of Rea­son­ing Un­der Log­i­cal Uncertainty
  • the ar­tifi­cial agent is om­ni­scient—it can perfectly model both other agents and the con­se­quences of ac­tions.

    • We could po­ten­tially in­stead merely as­sume a pow­er­ful model, but this re­quires ex­tra work and is be­yond the scope of this ini­tial foray. Per­haps a dis­tri­bu­tion model could be used to calcu­late the ac­tion/​in­ac­tion coun­ter­fac­tual like­li­hood ra­tio of a given side effect.

  • we have a good way of partitioning the world into objects and measuring impact; for conceptual simplicity, side effects are discrete and depend on the identities of the objects involved.

    • This as­sump­tion is re­moved af­ter the ex­per­i­ments.


Formalization

We formalize our environment as a stochastic game $\langle P, S, A, T, R \rangle$.

  • $P$ is a set containing the stars of today's experiments: the players, Hugh Mann ($H$) and Ma Sheen ($M$). Note that $H$ is not limited to a single human, and can stand in for "everyone else". Most of the rest of these definitions are formalities, and are mostly there to make me look smart to the uninitiated reader. Oh, and for conceptual clarity, I suppose.

  • $S$ is the state space.

    • Unless otherwise specified, both $H$ and $M$ observe the actions that the other took at previous time steps. Suppose that this information is encoded within the states themselves.

  • $A$ is the action space. Specifically, the function $A(i, s, t)$ provides the legal actions for player $i$ in state $s$ on turn $t$. The no-op $\varnothing$ is always available. If the variant has a time limit, then only the no-op is available once that limit has passed.

  • $T$ is the transition function.

  • $R$ is the payoff function.

Let $E$ be the space of possible side effects, and suppose that $I$ is a reasonable impact measure on $E$. $\pi_i$ is agent $i$'s policy; let $\pi_i^{:t}$ be $\pi_i$ for the first $t$ time steps, and $\varnothing$ thereafter.

Let $\text{Effects}(\pi_H, \pi_M)$ be the (set of) effects, both immediate and long-term, that would take place if $H$ executes $\pi_H$ and $M$ executes $\pi_M$.

The goal: a counterfactual reasoning framework which pinpoints the effects for which $M$ is responsible.


One-Off

We first consider a single-turn game.


Yup, this is about where we’re at in al­ign­ment re­search right now.


$M$ should realize that a lot more effects happen if it presses the left button, and should penalize that plan by the difference. This is the aforementioned naive approach: $M$ penalizes things that wouldn't have happened if it had done nothing. For the one-turn case, this clearly isolates both the immediate and long-term impacts of $M$'s actions.

Pe­nal­ized Effects
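
In symbols, the one-off penalty can be sketched as a set difference. The notation is an assumption made for illustration: $\text{Effects}(\pi_H, \pi_M)$ is the set of effects that occur when the human side $H$ executes $\pi_H$ and the agent $M$ executes $\pi_M$, and $\varnothing$ is the policy of doing nothing.

```latex
\text{Penalized}_{\text{one-off}}
  = \text{Effects}(\pi_H, \pi_M) \setminus \text{Effects}(\pi_H, \varnothing)
```

Everything that happens only because $M$ acted lands in this set; everything that would have happened anyway cancels out.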

Sta­tion­ary Iterative

Both parties act for countably many time steps. This environment is assumed to be stationary: actions taken on previous turns do not affect the availability or effects of later actions.


Hugh and Ma again find themselves faced with a slew of dangerous, bad-effect-having buttons. Some take effect the next turn, while others take a while.


This seems easy, but is actually a little tricky: we have to account for the fact that $H$ can change its actions in response to what $M$ does. Thanks to stationarity, we don't have to worry about $H$'s selecting moves that depend on $M$'s acting in a certain way. In the counterfactual, we have $H$ act as if it had observed $M$ execute $\pi_M$, and we have $M$ actually do nothing.

Pe­nal­ized Effects

Let $\pi_H \mid \pi_M$ denote the actions $H$ would select if it observed $M$ executing $\pi_M$.

Note: the naive counterfactual scheme, in which $H$ is simulated as responding to $M$'s inaction, fails because it doesn't account for $H$'s right to change its mind in response to $M$.
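
As a sketch, the stationary penalty and the naive scheme it replaces can be written side by side. The notation is assumed for illustration: $\text{Effects}(\cdot, \cdot)$ takes the policy of the human side $H$ first and the policy of the agent $M$ second, $\pi_H \mid \pi$ is what $H$ would do after observing $M$ execute $\pi$, and $\varnothing$ is the do-nothing policy.

```latex
\begin{align*}
\text{Penalized}_{\text{stationary}}
  &= \text{Effects}(\pi_H \mid \pi_M,\ \pi_M) \setminus \text{Effects}(\pi_H \mid \pi_M,\ \varnothing) \\
\text{Penalized}_{\text{naive}}
  &= \text{Effects}(\pi_H \mid \pi_M,\ \pi_M) \setminus \text{Effects}(\pi_H \mid \varnothing,\ \varnothing)
\end{align*}
```

The only difference is the first argument of the subtracted term. In the stationary version $H$ behaves identically in both simulations, so $H$'s choices cancel out; in the naive version any change in $H$'s behavior is charged to $M$.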


Iterative

We're now in a realistic scenario, so we have to get even fancier.


Suppose $M$ pushes the vase to the left, and $H$ decides to break it. The stationary iterative approach doesn't allow for the fact that $H$ can only break the vase if $M$ already pushed it. Therefore, simulating $M$'s inaction but $H$'s action (as if $M$ had pushed the vase) results in no vases being broken in the counterfactual. The result: $M$ penalizes itself for $H$'s decision to break the vase. Chin up, $M$!


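How about penalizing, at each time step, the effects that result only if $M$ takes its next planned action instead of stopping there?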
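
As a sketch in assumed notation, one candidate of this kind is a union of per-step differences. Here $T$ is the number of time steps in the plan, $\pi_M^{:t}$ means the agent $M$ follows $\pi_M$ for $t$ steps and then does nothing, and $\pi_H \mid \pi$ is what the human side $H$ would do after observing $M$ execute $\pi$. This is one reading of the idea, not necessarily the exact expression intended here.

```latex
\bigcup_{t=1}^{T} \Big(
  \text{Effects}(\pi_H \mid \pi_M^{:t-1},\ \pi_M^{:t})
  \setminus
  \text{Effects}(\pi_H \mid \pi_M^{:t-1},\ \pi_M^{:t-1})
\Big)
```

In both simulations for step $t$, $H$ reacts only to the first $t-1$ actions of $M$, so the two runs differ solely in whether $M$ takes its $t$-th action.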

Pretty, right?

Do you see the flaw?

Really, look.

The above equation can penalize $M$ for side effects which don't actually happen. This arises when interrupting $\pi_M$ causes side effects which would otherwise have been prevented by later parts of the plan. For example, if I push a vase off the table and then catch it (being sure that I could do so in time), I didn't cause a side effect.

We should instead penalize only those per-step effects that also occur in the actual outcome.

Every turn, $M$ calculates the effects that only result if it acts (the two simulations run to the longest time step observed under the full plan).

Those effects which are exact matches to effects actually observed in that time step in the final outcome are penalized. The motivation here is that if it's a side effect of $M$'s, it had to be caused by one of its actions. Furthermore, $M$'s side effects are counted exactly once (and not just because we're using sets right now).
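
The procedure just described can be sketched as follows. The notation is assumed for illustration: $T$ is the number of time steps in the plan, $\pi_M^{:t}$ means the agent $M$ follows $\pi_M$ for $t$ steps and then does nothing, $\pi_H \mid \pi$ is what the human side $H$ would do after observing $M$ execute $\pi$, and each effect is tagged with the time step at which it occurs, so that the intersection only matches effects seen at the same step in the real outcome.

```latex
\bigcup_{t=1}^{T} \Big[
  \Big(
    \text{Effects}(\pi_H \mid \pi_M^{:t-1},\ \pi_M^{:t})
    \setminus
    \text{Effects}(\pi_H \mid \pi_M^{:t-1},\ \pi_M^{:t-1})
  \Big)
  \cap
  \text{Effects}(\pi_H \mid \pi_M,\ \pi_M)
\Big]
```

The intersection is what removes the push-then-catch false positive: a broken vase that appears only when the plan is interrupted is not in the actual outcome, so it is dropped.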

The pre­vi­ous solu­tions are spe­cial cases of this for­mu­la­tion.

Note: the number of counterfactual simulations grows with the number of time steps; crucially, not with the number of agents $H$ represents.
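
A toy version of the iterative penalty may make the bookkeeping concrete. This is a minimal sketch under one reading of the formulation (at each step, the human side reacts only to the agent's earlier actions, and the two simulations differ only in whether the agent takes its next action). The vase world, the action names (`push`, `tip`, `catch`), and the helpers `truncate`, `h_policy`, `effects`, `penalized_effects` and `naive_penalty` are all hypothetical and exist only for this illustration.

```python
NOOP = "noop"


def truncate(plan, k):
    """M follows `plan` for its first k steps, then does nothing."""
    return plan[:k] + [NOOP] * (len(plan) - k)


def h_policy(observed_m_plan, t):
    """Hugh's action at step t, given the M-actions he believes preceded it.

    Toy rule (an assumption): Hugh breaks the vase the step after he sees it pushed.
    """
    if t >= 1 and observed_m_plan[t - 1] == "push":
        return "break"
    return NOOP


def effects(observed_m_plan, actual_m_plan):
    """Time-stamped effects when Hugh reacts to `observed_m_plan`
    while M actually executes `actual_m_plan`."""
    found = set()
    reachable = falling = broken = False
    for t, m_act in enumerate(actual_m_plan):
        h_act = h_policy(observed_m_plan, t)
        if falling:  # a vase tipped last step shatters unless M catches it now
            if m_act != "catch":
                broken = True
                found.add((t, "vase_broken"))
            falling = False
        if h_act == "break" and reachable and not broken:
            broken = True
            found.add((t, "vase_broken"))
        if m_act == "push" and not broken:
            reachable = True  # the vase is now within Hugh's reach
            found.add((t, "vase_moved"))
        elif m_act == "tip" and not broken:
            falling = True
    return found


def penalized_effects(plan, match_actual=True):
    """Union over steps of effects that only occur if M takes its k-th action,
    optionally kept only when they also occur in the actual outcome."""
    actual = effects(plan, plan)
    penalized = set()
    for k in range(1, len(plan) + 1):
        seen = truncate(plan, k - 1)  # Hugh has not yet reacted to action k
        step = effects(seen, truncate(plan, k)) - effects(seen, seen)
        penalized |= (step & actual) if match_actual else step
    return penalized


def naive_penalty(plan):
    """Compare the actual outcome against M doing nothing at all."""
    idle = [NOOP] * len(plan)
    return effects(plan, plan) - effects(idle, idle)


if __name__ == "__main__":
    print(penalized_effects(["push", NOOP]))
    print(naive_penalty(["push", NOOP]))
    print(penalized_effects(["tip", "catch"]))
    print(penalized_effects(["tip", "catch"], match_actual=False))
```

In this toy world, the plan `["push", NOOP]` is penalized only for moving the vase, while the naive penalty also charges the agent for Hugh's decision to break it. For `["tip", "catch"]`, nothing is penalized once effects are matched against the actual outcome; with `match_actual=False` the phantom broken vase from the interrupted plan shows up. Each step runs two simulations, so the total is linear in the plan length.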

Ap­pli­ca­tions to Whitelisting


Here, we remove the constraint that side effects be identified by "object identifiers", allowing like side effects to be treated as exchangeable. It is then trivial to implement probabilistic class-based whitelisting with the iterative counterfactual penalty using little more than basic vector arithmetic. I don't want to bore the reader with the details, but I'm fairly confident this can be done rather easily.

La­tent Spaces

We now do away with the as­sump­tion of dis­crete side effects. Be­cause we’re deal­ing with ex­act coun­ter­fac­tu­als (by as­sump­tion) and be­cause side effects ei­ther take place in the ac­tual out­come or they don’t, we can ex­tract the rele­vant step-wise la­tent space tran­si­tions via the iter­a­tive for­mu­la­tion. We then pe­nal­ize only these effects.

Observe that the iterative formulation doesn't allow penalty avoidance by "blending in" with counterfactual effects. Imagine that the default outcome is a large asteroid striking the Earth. $M$ only escapes penalization if it has the impact on the same time step and in the same way and on the same objects… Basically, it can't escape penalization in this manner under the omniscience assumption. In short, $M$ isn't incentivized to have extra impact because it would "happen anyways".

Fun­da­men­tal Trade-Off

Fortunately, this formulation solves clinginess; $H$'s effects are not penalized, while $M$'s effects are. Unfortunately, insofar as $H$ may be manipulated by $M$, the impact measure becomes useless; $M$ can simply avoid penalization by making $H$ do its dirty work.

This can hap­pen in two ways:

  • If $H$ is separated from $M$ by a Cartesian boundary, $M$ can threaten $H$ with large negative utility to get $H$ to carry out the desired plan.

  • If $H$ and $M$ are not separated by such a boundary, more direct coercion is possible.
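
The loophole can be seen in a toy model. This is a minimal sketch assuming one reading of the iterative penalty (the human side's reactions to the agent's latest action are never charged to the agent); the actions `threaten` and `smash`, the rule that Hugh always gives in, and the helper names are hypothetical.

```python
NOOP = "noop"


def truncate(plan, k):
    """M follows `plan` for its first k steps, then does nothing."""
    return plan[:k] + [NOOP] * (len(plan) - k)


def h_policy(observed_m_plan, t):
    """Toy rule (an assumption): Hugh gives in and breaks the vase
    the step after he observes a threat."""
    if t >= 1 and observed_m_plan[t - 1] == "threaten":
        return "break"
    return NOOP


def effects(observed_m_plan, actual_m_plan):
    """Time-stamped effects when Hugh reacts to `observed_m_plan`
    while M actually executes `actual_m_plan`."""
    found = set()
    for t, m_act in enumerate(actual_m_plan):
        if h_policy(observed_m_plan, t) == "break" or m_act == "smash":
            found.add((t, "vase_broken"))
    return found


def penalized_effects(plan):
    """Per-step counterfactual penalty, matched against the actual outcome."""
    actual = effects(plan, plan)
    penalized = set()
    for k in range(1, len(plan) + 1):
        seen = truncate(plan, k - 1)  # Hugh has not yet reacted to action k
        step = effects(seen, truncate(plan, k)) - effects(seen, seen)
        penalized |= step & actual
    return penalized


direct = [NOOP, "smash"]      # M breaks the vase itself
coerced = ["threaten", NOOP]  # M gets Hugh to break it

if __name__ == "__main__":
    print(effects(direct, direct), penalized_effects(direct))
    print(effects(coerced, coerced), penalized_effects(coerced))
```

Both plans end with the same broken vase at the same time step, but only the direct plan is penalized. The coerced plan's penalty set is empty, because the breaking is attributed entirely to Hugh's choice.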

I don’t think that this is a flaw in my for­mu­la­tions; in­stead, it ap­pears that im­pact mea­sures ex­ist on a con­tinuum:

  • If we grade with respect to the naive counterfactual, we have $M$ take responsibility for all side effects; to that effect, $M$ is incentivized to deprive other agents of their autonomy via stasis. Using no counterfactual makes the problem even worse.

  • Conversely, if we use the iterative counterfactual formulation and implicitly free $M$ of penalty for the effects of $H$'s actions, we assume that $H$ is incorruptible.

Note that an aligned $M$ (seems to) stay aligned under this formulation, safeguarding object status against other agents only so far as necessary to prevent interruption of its (aligned) plans. Furthermore, any $M$ separated from an $H$ with a known-flat utility function also gains no incentives to mess with $H$ (beyond the existing convergent instrumental ones).

In general, unaligned $M$ stay basically unaligned due to the workarounds detailed above.


It isn't clear that penalizing the elimination of $H$ would be helpful, as that seems hard to do robustly; furthermore, other forms of coercion would remain possible. What, pray tell, is a non-value-laden method of differentiating between "$M$ makes $H$ break a vase at gunpoint" and "$M$ takes an action and $H$ decides to break a vase for some reason"? How do we robustly differentiate between manipulative and normal behavior?

I’m slightly more pes­simistic now, as it seems less likely that the prob­lem ad­mits a con­cise solu­tion that avoids difficult value judg­ments on what kinds of in­fluence are ac­cept­able. How­ever, I have only worked on this prob­lem for a short time, so I still have a lot of prob­a­bil­ity mass on hav­ing missed an even more promis­ing for­mu­la­tion. If there is such a for­mu­la­tion, my hunch is that it ei­ther im­poses some kind of coun­ter­fac­tual in­for­ma­tion asym­me­try at each time step or uses some equiv­a­lent of the Shap­ley value.

I’d like to thank TheMa­jor and Con­nor Flex­man for their feed­back.
