Impact measurement and value-neutrality verification

Recently, I’ve been reading and enjoying Alex Turner’s Reframing Impact sequence, but I realized that I have some rather idiosyncratic views regarding impact measures that I haven’t really written up much yet. This post is my attempt at communicating those views, as well as a response to some of the ideas in Alex’s sequence.

What can you do with an impact measure?

In the “Technical Appendix” to his first Reframing Impact post, Alex argues that an impact measure might be “the first proposed safeguard which maybe actually stops a powerful agent with an imperfect objective from ruining things—without assuming anything about the objective.”

Personally, I am quite skeptical of this use case for impact measures. As it is phrased—and especially including the link to Robust Delegation—Alex seems to be implying that an impact measure could be used to solve inner alignment issues arising from a model with a mesa-objective that is misaligned relative to the loss function used to train it. However, the standard way in which one uses an impact measure is by including it in said loss function, which doesn’t do very much if the problem you’re trying to solve is your model not being aligned with that loss.[1]

That being said, using an impact measure as part of your loss could be helpful for outer alignment. In my opinion, however, it seems like that requires your impact measure to capture basically everything you might care about (if you want it to actually solve outer alignment), in which case I don’t really see what the impact measure is buying you anymore. I think this is especially true for me because I generally see amplification as being the right solution to outer alignment, which I don’t think really benefits at all from adding an impact measure.[2]

Alternatively, if you had a way of mechanistically verifying that a model behaves according to some impact measure, then I would say that you could use something like that to help with inner alignment. However, this is quite different from the standard procedure of including an impact measure as part of your loss. Instead of training your agent to behave according to your impact measure, you would have to train it to convince some overseer that it is internally implementing some algorithm which satisfies some minimal impact criterion. It’s possible that this is what Alex actually has in mind in terms of how he wants to use impact measures, though it’s worth noting that this use case is quite different from the standard one.

That being said, I’m skeptical of this use case as well. In my opinion, developing a mechanistic understanding of corrigibility seems more promising than developing a mechanistic understanding of impact. Alex mentions corrigibility as a possible alternative to impact measures in his appendix, though he notes that he’s currently unsure what exactly the core principle behind corrigibility actually is. I think my post on mechanistic corrigibility gets at this somewhat, though there’s definitely more work to be done there.

So, I’ve explained why I don’t think impact measures are very promising for solving outer alignment or inner alignment—does that mean I think they’re useless? No. In fact, I think a better understanding of impact could be extremely helpful, just not for any of the reasons I’ve talked about above.

Value-neutrality verification

In Relaxed adversarial training for inner alignment, I argued that one way of mechanistically verifying an acceptability condition might be to split a model into a value-neutral piece (its optimization procedure) and a value-laden piece (its objective). If you can manage to get such a separation, then verifying acceptability just reduces to verifying that the value-laden piece has the right properties[3] and that the value-neutral piece is actually value-neutral.
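
To make this decomposition concrete, here is a minimal sketch in Python. The `Objective`, `Optimizer`, and `DecomposedAgent` names are hypothetical illustrations of mine, not anything from the original post; the point is just that, under this assumed separation, acceptability verification factors into two independent checks.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

class State: ...    # placeholder state type
class Action: ...   # placeholder action type

class Objective(Protocol):
    def utility(self, state: State) -> float: ...

# The value-neutral piece: an optimization procedure that can be pointed
# at any objective.
Optimizer = Callable[[Objective, State], Action]

@dataclass
class DecomposedAgent:
    optimizer: Optimizer   # value-neutral piece
    objective: Objective   # value-laden piece

    def act(self, state: State) -> Action:
        return self.optimizer(self.objective, state)

def acceptable(
    agent: DecomposedAgent,
    objective_ok: Callable[[Objective], bool],          # e.g. a corrigibility check
    optimizer_is_neutral: Callable[[Optimizer], bool],  # e.g. a neutrality check
) -> bool:
    # With the separation in hand, verifying acceptability reduces to
    # verifying each piece independently.
    return objective_ok(agent.objective) and optimizer_is_neutral(agent.optimizer)
```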

Why is this sort of separation useful? Well, not only might it make mechanistically verifying acceptability much easier, it might also make strategy-stealing possible in a way which it otherwise might not be. In particular, one of the big problems with making strategy-stealing work under an informed-oversight-style scheme is that some strategies which are necessary to stay competitive might nevertheless be quite difficult to justify to an informed overseer. However, if we have a good understanding of the degree to which different algorithms are value-laden vs. value-neutral, then we can use that to short-circuit the normal evaluation process, enabling your agent to pursue any strategies which it can definitely demonstrate are value-neutral.

This is all well and good, but what does it even mean for an algorithm to be value-neutral, and how would a model ever actually be able to demonstrate that? Well, here’s what I want out of a value-neutrality guarantee: I want to consider some optimization procedure $f$ to be value-neutral if, relative to some set of objectives $U$, it doesn’t tend to advantage any subset of those objectives over any other. In particular, if I start with some distribution of resources/utility/etc. over the different objectives $u \in U$, then I don’t want that distribution to change if I give each $u$ access to the optimization process $f$. Specifically, this guarantees that the given optimization process is compatible with strategy-stealing in that, if we deploy a corrigible AI running such an optimization process in service of many different values in $U$, it won’t systematically advantage some over others.
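
As a toy illustration (mine, not from the original sequence) of what this distributional condition says, here is a hedged sketch in Python: track what fraction of the total resources each objective controls, and check that giving every objective access to the same optimization process leaves those fractions approximately unchanged.

```python
import numpy as np

def shares(resources: np.ndarray) -> np.ndarray:
    """Normalize per-objective resources into fractions of the total."""
    return resources / resources.sum()

def roughly_value_neutral(before: np.ndarray, after: np.ndarray, tol: float = 1e-2) -> bool:
    """True iff the distribution of resources over objectives is
    (approximately) unchanged after everyone uses the optimizer."""
    return bool(np.max(np.abs(shares(after) - shares(before))) < tol)

before = np.array([1.0, 2.0, 3.0])
# An optimizer that doubles everyone's resources leaves the shares alone.
print(roughly_value_neutral(before, 2 * before))                          # True
# One that triples only the first objective's resources does not.
print(roughly_value_neutral(before, before * np.array([3.0, 1.0, 1.0])))  # False
```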

Interestingly, however, what I’ve just described is quite similar to Attainable Utility Preservation (AUP), the impact measure put forward by Turner et al. Specifically, AUP measures the extent to which an algorithm, relative to some set of objectives $U$, advantages those objectives relative to doing nothing. This is slightly different from what I want, but it’s quite similar in a way which I think is no accident. In particular, I think it’s not hard to extend the math of AUP to apply to value-neutrality verification. That is, let $f : U \times S \to A$ be some optimization procedure over objectives $u \in U$, states $S$, and actions $A$. Then, we can compute $f$’s value-neutrality by calculating

$$\text{neutrality}(f) = \sigma_{u \in U}\big(\, V_u(f_u) - V_u(\pi_\varnothing) \,\big)$$

where $V_u(\pi)$ measures the expected future discounted utility of objective $u$ for some policy $\pi$ (taken over the deployment distribution),[4] $f_u = f(u, \cdot)$ is the policy $f$ implements when given objective $u$, $\pi_\varnothing$ is some null policy, and $\sigma_{u \in U}$ is the operator that finds the standard deviation of the given set. What’s being measured here is precisely the extent to which $f$, if given to each $u \in U$, would enable some $u$ to get more value relative to others. Now, compare this to the AUP penalty term, which, for a state $s$ and action $a$, is calculated as

$$\text{Penalty}(s, a) = \frac{1}{\lambda}\, \operatorname*{mean}_{u \in U}\big(\, \left| Q_u(s, a) - Q_u(s, \varnothing) \right| \,\big)$$

where $Q_u(s, a)$ measures the expected future discounted utility for objective $u$ under the optimal policy after having taken action $a$ in state $s$, $\varnothing$ is the no-op action, and $\lambda$ is some scaling constant.

Comparing these two equations, we can see that there are many similarities between $\text{neutrality}$ and $\text{Penalty}$, but also some major differences. First, $\text{neutrality}$ as presented here is a function of an agent’s entire policy, whereas $\text{Penalty}$ is only a function of an agent’s actions.[5] Conceptually, I don’t think this is a real distinction—I think this just comes from the fact that I want neutrality to be an algorithmic/mechanistic property, whereas AUP was developed as something you could use as part of an RL loss. Second—and I think this is a real distinction—$\text{neutrality}$ takes a standard deviation, whereas $\text{Penalty}$ takes a mean. This lets us think of both $\text{Penalty}$ and $\text{neutrality}$ as effectively being moments of the same distribution—it’s just that $\text{Penalty}$ is the first moment and $\text{neutrality}$ is the second. Third, $\text{neutrality}$ drops the absolute value present in $\text{Penalty}$, since we care about benefiting all values equally, not just impacting them equally.[6] Outside of those differences, however, the two equations are quite similar—in fact, I wrote $\text{neutrality}$ just by straightforwardly adapting the AUP penalty to the value-neutrality verification case.
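
To make the comparison concrete, here is a minimal sketch in Python of both quantities, computed from per-objective value estimates (the function names and toy numbers are mine, not anything from the AUP paper). It illustrates the standard-deviation-vs-mean and no-absolute-value-vs-absolute-value differences: an optimizer that benefits every objective equally has zero neutrality but a large AUP penalty.

```python
import numpy as np

def neutrality(v_policy: np.ndarray, v_null: np.ndarray) -> float:
    """Standard deviation over objectives u of V_u(f_u) - V_u(null policy).
    No absolute value: uniform gains across objectives cost nothing."""
    return float(np.std(v_policy - v_null))

def aup_penalty(q_action: np.ndarray, q_noop: np.ndarray, lam: float = 1.0) -> float:
    """Mean over objectives u of |Q_u(s, a) - Q_u(s, no-op)|, scaled by lambda."""
    return float(np.mean(np.abs(q_action - q_noop)) / lam)

v_null = np.zeros(3)                         # baseline value for three toy objectives
v_uniform_gain = v_null + 10.0               # every objective gains +10
v_skewed_gain = np.array([10.0, 0.0, 0.0])   # only the first objective gains

print(neutrality(v_uniform_gain, v_null))    # 0.0  -> value-neutral
print(aup_penalty(v_uniform_gain, v_null))   # 10.0 -> high AUP penalty
print(neutrality(v_skewed_gain, v_null))     # ~4.7 -> not value-neutral
```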

This is why I’m optimistic about impact measurement work: not because I expect it to greatly help with alignment via the straightforward methods in the first section, but because I think it’s extremely applicable to value-neutrality verification, which I think could be quite important to making relaxed adversarial training work. Furthermore, though like I said I think a lot of the current impact measure work is quite applicable to value-neutrality verification, I would be even more excited to see more work on impact measurement specifically from this perspective. (EDIT: I think there’s a lot more work to be done here than just my writing down of $\text{neutrality}$. Some examples of future work: removing the need to compute $V_u$ of an entire policy over a distribution (the deployment distribution) that we can’t even sample from, removing the need to have some set $U$ which contains all the values that we care about, translating other impact measures into the value-neutrality setting and seeing what they look like, more exploration of what these sorts of neutrality metrics are really doing, actually running RL experiments, etc.)

Furthermore, not only do I think that value-neutrality verification is the most compelling use case for impact measures, I also think that objective impact specifically can be understood as being about value-neutrality. In “The Gears of Impact” Alex argues that “objective impact, instrumental convergence, opportunity cost, the colloquial meaning of ‘power’—these all prove to be facets of one phenomenon, one structure.” In my opinion, value-neutrality should be added to that list. We can think of actions as having objective impact to the extent that they change the distribution over which values have control over which resources—that is, the extent to which they are not value-neutral. Or, phrased another way, actions have objective impact to the extent that they break the strategy-stealing assumption. Thus, even if you disagree with me that value-neutrality verification is the most compelling use case for impact measures, I still think you should believe that if you want to understand objective impact, it’s worth trying to understand strategy-stealing and value neutrality, because I think they’re all secretly talking about the same thing.


  1. This isn’t entirely true, since changing the loss might shift the loss landscape sufficiently such that the easiest-to-find model is now aligned, though I am generally skeptical of that approach, as it seems quite hard to ever know whether it’s actually going to work or not. ↩︎

  2. Or, if it does, then if you’re doing things right the amplification tree should just compute the impact itself. ↩︎

  3. On the value-laden piece, you might verify some mechanistic corrigibility property, for example. ↩︎

  4. Also suppose that $V_u$ is normalized to have comparable units across objectives. ↩︎

  5. This might seem bad—and it is if you want to try to use this as part of an RL loss—but if what you want to do instead is verify internal properties of a model, then it’s exactly what you want. ↩︎

  6. Thanks to Alex Turner for pointing out that the absolute value bars don’t belong in $\text{neutrality}$. ↩︎