Understanding Recent Impact Measures

In the first five years after Stuart Armstrong posted his first research suggestions for impact measures, very little published work expanded on the idea. The last post in this sequence was intended to somewhat comprehensively review this literature, but it surveyed only four papers total, including the original article.

In the last two years, research has picked up pace significantly. The two most significant papers are Penalizing side effects using stepwise relative reachability by Victoria Krakovna et al. and Conservative Agency by Alexander Turner et al. In that time a few blog posts have come out explaining the approaches in more detail, and public debate over the utility of impact measures has become much more visible.

Here I will briefly explain the two most prominent measures, relative reachability and attainable utility. We will see that they diverge conceptually from earlier research. By being different, they also end up satisfying some desirable properties. I will then consider some recent notable critiques of impact measures more generally. A personal analysis of these critiques will wait one more day. This post will only cover the surface.


Before I can explain either of the two measures, I must first introduce the language which allows me to precisely define each approach. Both impact measures have quite simple natural language descriptions, but it is easy to feel as though one is not getting the full story if it is explained using English alone. The specific way that the two methods are represented takes place within a Markov decision process (MDP).

Intuitively, an MDP is just a way of representing actions that an agent can take in a stochastic environment, which is made up of a set of states. Formally, an MDP is defined by a tuple $(S, A, R, P, \gamma)$. $S$ is the set of states in the environment. $A$ is the set of actions that the agent can take. $R : S \times A \to \mathbb{R}$ is a function which maps state-action pairs to a real number reward. $P$ is a function which returns the probability of transitioning into one state given the previous state and an action, $P(s' \mid s, a)$. $\gamma$ is the discount factor for the rewards, $\gamma \in (0, 1)$.
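To make the tuple concrete, here is one way a small finite MDP could be encoded in Python. The class, field names, and toy environment are my own illustrative sketch, not from either paper:

```python
from typing import NamedTuple, Dict, List, Tuple

# A minimal container for the tuple (S, A, R, P, gamma).
class MDP(NamedTuple):
    states: List[str]                                     # S
    actions: List[str]                                    # A
    reward: Dict[Tuple[str, str], float]                  # R(s, a)
    transition: Dict[Tuple[str, str], Dict[str, float]]   # P(s' | s, a)
    gamma: float                                          # discount in (0, 1)

# Two states; "flip" switches between them and pays a reward of 1.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "flip"],
    reward={("s0", "stay"): 0.0, ("s0", "flip"): 1.0,
            ("s1", "stay"): 0.0, ("s1", "flip"): 1.0},
    transition={("s0", "stay"): {"s0": 1.0}, ("s0", "flip"): {"s1": 1.0},
                ("s1", "stay"): {"s1": 1.0}, ("s1", "flip"): {"s0": 1.0}},
    gamma=0.9,
)
```

Everything that follows, for both impact measures, is defined against data of roughly this shape.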

Relative reachability

In Victoria Krakovna’s blog post introducing relative reachability, she explains that relative reachability was a synthesis of two related ideas: preserving reversibility of the environment, and penalizing impact over states. The central insight was that these two ideas can be combined to avoid the downsides of either of them alone.

The idea behind preserving reversibility is simple. We don’t want our artificial intelligence to do something that would make us unable to return things to the way that they were previously. For example, if we wanted the AI to create a waste disposal facility, we might not want it to irrevocably pollute a nearby lake in the process.

The way we formalize state reversibility is by first introducing a reachability measure. This reachability measure essentially takes in two states and returns 1 if there is some sequence of actions that the agent could take in order to go from the first state to the final state, and 0 if there is no such sequence of actions. But this is not yet the full description of the reachability measure. In order to take into account uncertainty in the environment, and a discount factor, reachability is actually defined as the following function of two states $x$ and $y$:

$$R(x; y) = \max_{\pi} \mathbb{E}\left[\gamma_r^{N_\pi(x, y)}\right]$$

where $\pi$ is some policy, $\gamma_r \in (0, 1)$ is the reachability discount factor, and $N_\pi(x, y)$ is a function which returns the number of steps it takes to reach $y$ from $x$ when following $\pi$. In English, this is stating that reachability between two states is the expected value of the discount factor raised to the power of the number of steps it would take to go from the first state to the final state when following an optimal policy. The more steps we are expected to take in order to go from $x$ to $y$, the closer reachability is to zero. If there is no sequence of actions which can take us from $x$ to $y$, then reachability is exactly zero. On the other hand, if $x = y$, so they are the same state, then the reachability between them is one.
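For intuition, here is a sketch of the definition in the special case of a deterministic environment, where the optimal policy is just a shortest path and the expectation disappears. The successor map and discount value are invented for illustration:

```python
from collections import deque

def reachability(succ, x, y, gamma_r=0.9):
    """Discounted reachability gamma_r ** (shortest step count from x to y)
    in a deterministic environment; 0.0 if y is unreachable from x."""
    if x == y:
        return 1.0                      # same state: reachability is one
    frontier, seen = deque([x]), {x}
    steps = 0
    while frontier:                     # breadth-first search by step count
        steps += 1
        for _ in range(len(frontier)):
            for nxt in succ[frontier.popleft()]:
                if nxt == y:
                    return gamma_r ** steps
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return 0.0                          # no action sequence reaches y

# Three states in a line, where every step is one-way:
succ = {"a": ["b"], "b": ["c"], "c": ["c"]}
```

Here `reachability(succ, "a", "c")` is `0.9 ** 2 = 0.81` (two steps away), while `reachability(succ, "c", "a")` is exactly zero, since no action sequence leads back.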

An unreachability deviation is a penalty that we can add to actions which incentivizes against taking some irreversible action. This penalty is simply defined as $1 - R(s_t; b_t)$, where $s_t$ is the current state and $b_t$ is some baseline state. In other words, if we are very close to the baseline state, then the penalty is close to zero (since reachability would be close to one).

The exact way that we define the baseline state is not particularly important for a first pass through. Naively, the baseline could simply refer to the first state in the episode. It is, however, better to think about the baseline state as some type of reference world where the agent had decided to do nothing. We can take this concept further by defining “doing nothing” as either a counterfactual reality where the agent was never turned on, or the result of an infinite sequence of nothing actions which began in the last time step. The second interpretation is preferred for a number of reasons, but this isn’t crucially important for understanding relative reachability. (Read the paper for more details.)

The problem with penalizing actions with the unreachability deviation is that it yields the maximum possible penalty for all actions which result in some irreversibility. This is clearly an issue in a complex environment, since all actions are in some sense irreversible. See section 2.2 in the paper for a specific toy example of why using mere unreachability won’t work.

The contribution that Krakovna makes is to introduce a measure which is sensitive to the magnitude of irreversibility. Relative reachability is defined as the average reduction in reachability of all states from the current state compared to the baseline. This is written as the following, where $d_{RR}(s_t; b_t)$ represents the relative reachability deviation from a state $s_t$ at time $t$ compared to a baseline state $b_t$:

$$d_{RR}(s_t; b_t) = \frac{1}{|S|} \sum_{s \in S} \max\left(R(b_t; s) - R(s_t; s),\; 0\right)$$

Take a moment to pause and inspect the definition above. We are summing over all states in the environment, taking the difference between each state’s reachability from the baseline and its reachability from our current state. This feels like we are determining how far we are from the set of all states in the environment that are close to the baseline. For some particularly irreversible action, relative reachability will assign a high penalty to this action because it reduced the reachability of all the states we could have been in if we had done nothing. The idea is that presumably we should not try to go into regions of the state space which will make it hard to set everything back to “normal.” Conversely, we shouldn’t enter states that would be hard to get to if we never did anything at all.
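The averaging can be sketched in a few lines. The reachability table below is invented: from the baseline $x$ all three states are reachable, but after an irreversible action landing in $z$, only $z$ itself remains reachable:

```python
def rr_deviation(reach, states, baseline, current):
    """Average truncated reduction in reachability of every state,
    comparing the current state against the baseline."""
    return sum(max(reach[(baseline, s)] - reach[(current, s)], 0.0)
               for s in states) / len(states)

# Illustrative reachability values reach[(from, to)] in [0, 1].
reach = {
    ("x", "x"): 1.0, ("x", "y"): 0.9, ("x", "z"): 0.81,
    ("z", "x"): 0.0, ("z", "y"): 0.0, ("z", "z"): 1.0,
}
states = ["x", "y", "z"]
```

`rr_deviation(reach, states, "x", "z")` averages the losses $1.0$, $0.9$, and $0$ (the gain in reachability of $z$ is truncated to zero), so the penalty is about $0.63$ rather than the all-or-nothing unreachability penalty.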

Attainable utility

Alexander Turner expanded upon relative reachability by generalizing it to reward functions rather than states. As I understand it, it was not Turner’s initial intention to create a general version of reachability, but the way that the two approaches ended up being similar allowed for a natural abstraction of both (see the section on Value-difference measures in the relative reachability paper).

Attainable utility is the idea that, rather than caring about the average reduction of state reachability, we should instead care about the average reduction of utility reachability. The central insight guiding attainable utility is summed up nicely in a single sentence in his introductory post, Towards A New Impact Measure.

Impact is change to our ability to achieve goals.

“Goals” in this case refers to some set of arbitrary utility functions. They don’t need to be our utility functions. They could instead be any sufficiently diverse set of utility functions. In the above post Turner uses the set of all computable utility functions weighted by their complexity. In general these reward functions are referred to as the auxiliary set.

There are a few ways that attainable utility has been represented formally. In Conservative Agency the penalty is written as

$$\text{Penalty}(s, a) = \sum_{i=1}^{|\mathcal{R}|} \left| Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \right|$$

where $\varnothing$ refers to the baseline “do nothing” action and $Q_{R_i}(s, a)$ refers to the Q-value of an action $a$ taken at some state $s$, or in other words the expected cumulative value of taking that action and following an optimal policy from the point of view of the particular reward function $R_i$. This penalty is then scaled by some constant factor before being incorporated into a utility function.
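A small sketch of this penalty follows. The Q-value table and action names are invented for illustration; in the paper the Q-values would come from learned or computed value functions:

```python
def aup_penalty(q, aux_rewards, state, action, noop="noop"):
    """q[(reward_id, state, action)] -> attainable Q-value under that
    auxiliary reward. Sums the absolute attainable-value shift vs. no-op."""
    return sum(abs(q[(r, state, action)] - q[(r, state, noop)])
               for r in aux_rewards)

# Invented Q-values: "make_coffee" barely shifts attainable values, while
# "disable_off_switch" shifts them a lot under both auxiliary rewards.
q = {
    ("r1", "s", "noop"): 5.0, ("r1", "s", "make_coffee"): 5.0,
    ("r1", "s", "disable_off_switch"): 9.0,
    ("r2", "s", "noop"): 3.0, ("r2", "s", "make_coffee"): 3.1,
    ("r2", "s", "disable_off_switch"): 0.0,
}
```

Under this table the mild action is penalized $0.1$ while the power-shifting action is penalized $7.0$, even though the latter *increases* attainable value under one auxiliary reward; the absolute value penalizes gains in ability as well as losses.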

The way that we choose the scaling depends on an operational choice. We can either measure the impact of some mild reference action, or we can scale by the Q-values of the reward functions in the auxiliary set. As mentioned in the last post, the advantage of the first method is that it allows us to avoid the problem of catastrophic miscalibration of impact penalties. Turner tentatively proposes the following,

Construct a device which, upon receiving a signal ($a_{\text{unit}}$), expends a tiny amount of energy to manufacture one paperclip. The agent will then set ImpactUnit := Penalty($a_{\text{unit}}$), re-estimating the consequences of taking the privileged $a_{\text{unit}}$ at each time step. To prevent the agent from intentionally increasing ImpactUnit, simply apply 1.01 penalty to any action which is expected to do so.

In both relative reachability and attainable utility preservation we modify the reward function by subtracting a regularization term. This is represented as the following:

$$R'(s, a) = R(s, a) - \lambda \cdot \text{Penalty}(s, a)$$

where $\lambda$ is some parameter that controls the strength of the impact penalty, perhaps representing the operator’s belief in the power of the impact penalty.
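Putting the pieces together, one sketch of the shaped objective, normalizing the penalty by a reference impact in the spirit of the mild-action scaling above (all numbers illustrative):

```python
def shaped_reward(task_reward, penalty, impact_unit, lam):
    """Task reward minus the impact penalty, normalized by a reference
    impact and weighted by the regularization parameter lam."""
    return task_reward - lam * penalty / impact_unit

# A high-impact action's task reward of 1.0 is swamped once its penalty
# is seventy reference-impacts large:
shaped_reward(1.0, 7.0, 0.1, 0.5)
```

With these numbers the shaped reward is $1.0 - 0.5 \cdot 70 = -34$, so the agent prefers low-impact ways of earning task reward.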

What does this solve?

In the introductory post on attainable utility preservation, Turner claims that by using attainable utility, we are able to satisfy a number of desirable properties which were unsatisfied in earlier approaches. Yesterday, I outlined a few notable critiques of impact measures, such as incentives for keeping the universe in stasis. Turner sought to outline a long list of potential desiderata for impact measures, some of which were only discovered after realizing that other methods like whitelisting were difficult to make work.

Among the desirable properties are some obvious ones that had already been recognized, like value-agnosticism, natural kind, and the measure being apparently rational. Turner contributed some new ones like dynamic consistency and efficiency, which allowed him to provide tests for his new approach. (It is interesting to compare the computational efficiency of calculating relative reachability and attainable utility.)

Some people have disagreed with the significance of some items on the list, and turned to simpler frameworks. Rohin Shah has added,

My main critique is that it’s not clear to me that an AUP-agent would be able to do anything useful. [...]
Generally, I think that it’s hard to satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).

By contrast, Daniel Filan has compiled a list of test cases for impact measures. While both the relative reachability paper and the paper describing attainable utility preservation provided tests on AI safety gridworld environments, it is not clear to me at the moment whether these are particularly significant. I am driven to study impact measures mainly because of the force of intuitive arguments for each approach, rather than due to any specific empirical test.

The post Best reasons for pessimism about impact of impact measures? is the most comprehensive collection of critiques from the community. So far I have not been able to find any long-form rebuttals to the specific impact measures. Instead, the best counterarguments come from that post.

In general there is disagreement about the aim of impact measures, and how we could possibly apply them in a way that meaningfully helps align artificial intelligence. In the top reply to the “Best reasons for pessimism” post, LessWrong user Vaniver is primarily concerned with our ability to reduce AI alignment into a set of individual issues such that impact measures help solve a particular one of these issues.

The state of the debate over impact measures is best described as informal and scattered across many comments on LessWrong and the Alignment Forum.

In the next post I will continue my discussion of impact measures by providing what I view as finer-grained intuitions for what impact measures are good for. This should hopefully provide some insight into what problem we can actually solve by taking this approach, and whether the current impact measures rise to the challenge.