Four Ways An Impact Measure Could Help Alignment

Impact penalties are designed to help prevent an artificial intelligence from taking actions which are catastrophic.

Despite the apparent simplicity of this approach, there are in fact several different frameworks under which impact measures could prove helpful. In this post, I seek to clarify the different ways that an impact measure could ultimately help align an artificial intelligence or otherwise benefit the long-term future.

I think it's possible that some critiques of impact measures are grounded in an intuition that they don't help us achieve X, where X is something that the speaker thought impact was supposed to help us with, or is something that would be good to have in general. The obvious reply to these critiques is then to say that impact measures were never intended to do X, and that impact penalties aren't meant to be a complete solution to alignment.

My hope is that in distinguishing the ways that impact penalties can help alignment, I will shed light on why some people are more pessimistic or optimistic than others. I am not necessarily endorsing the study of impact measurements as an especially tractable or important research area, but I do think it's useful to gather some of the strongest arguments for it.

Roughly speaking, I think that an impact measure could potentially help humanity in at least one of four main scenarios.

1. Designing a utility function that roughly optimizes for what humans reflectively value, but with a recognition that mistakes are possible, such that regularizing against extreme maxima seems like a good idea (i.e., impact as a regularizer).

2. Constructing an environment for testing AIs that we want to be extra careful about, due to uncertainty regarding their ability to do something extremely dangerous (i.e., impact as a safety protocol).

3. Creating early-stage task AIs that have a limited function and are not intended to do any large-scale world optimization (i.e., impact as an influence-limiter).

4. Less directly, impact measures could still help humanity with alignment because researching them could allow us to make meaningful progress on deconfusion (i.e., impact as deconfusion).


Impact as a regularizer

In machine learning, a regularizer is a term that we add to our loss function or training process in order to reduce the capacity of a model, in the hope that it will generalize better.

One common instance of a regularizer is a scaled norm penalty on the model parameters that we add to our loss function. A popular interpretation of this type of regularization is that it represents a prior over what we think the model parameters should be. For example, in ridge regression this interpretation can be made formal by invoking a Gaussian prior on the parameters.
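For concreteness, the ridge regression objective is

$$\hat{w} \;=\; \arg\min_w \; \lVert y - Xw \rVert_2^2 \;+\; \lambda \lVert w \rVert_2^2,$$

which is exactly the MAP estimate under a Gaussian likelihood with noise variance $\sigma^2$ and a Gaussian prior $w \sim \mathcal{N}\!\left(0, (\sigma^2/\lambda)\, I\right)$. The less data we have, the more the penalty (the prior) pulls the estimate toward zero, rather than letting the model act on weak evidence.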

The idea is that, in the absence of vast evidence, we shouldn't allow the model to use its limited information to make decisions that we, the researchers, understand would be rash and unjustified given the evidence.

One framing of impact measures is that we can apply the same rationale to artificial intelligence. If we consider some scheme where an AI has been given the task of undertaking ambitious value learning, we should make it so that, whatever the AI initially believes the true utility function to be, it is extra cautious not to optimize the world too heavily until it has gathered a very large amount of evidence that this really is the right utility function.

One way this could be realized is by some form of impact penalty that eventually gets phased out as the AI gathers more evidence. This isn't currently the way that I have seen impact measurement framed, but to me it is still quite intuitive.
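As a toy illustration of that framing (a sketch of my own, not an existing proposal; the `impact` and `posterior_confidence` functions below are hypothetical placeholders), the agent's effective objective could weight an impact penalty by a coefficient that anneals toward zero as its beliefs about the true utility function become more concentrated:

```python
def penalized_value(state, action, utility, impact, posterior_confidence,
                    base_scale=10.0):
    """Score an action as estimated utility minus a confidence-scaled impact penalty.

    utility(state, action)  -> float, the agent's current best guess of value
    impact(state, action)   -> float, non-negative estimate of how impactful the action is
    posterior_confidence()  -> float in [0, 1], how concentrated the agent's
                               beliefs about the true utility function are
    """
    # The penalty is phased out as evidence accumulates: full strength at
    # confidence 0, no penalty at confidence 1.
    penalty_scale = base_scale * (1.0 - posterior_confidence())
    return utility(state, action) - penalty_scale * impact(state, action)
```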

Consider a toy scenario where we have solved ambitious value learning and decide to design an AI to optimize human values in the long term. In this scenario, when the AI is first turned on, it is given the task of learning what humans want. In the beginning, in addition to its task of learning human values, it also tries helping us in low-impact ways, perhaps by cleaning our laundry and doing the dishes. Over time, as it gathers enough evidence to fully understand human culture and philosophy, it will have the confidence to do things which are much more impactful, like becoming the CEO of some corporation.

I think it's important to note that this is not what I currently think will happen in the real world. However, I think it's useful to imagine these types of scenarios because they offer concrete starting points for what a good regularization strategy might look like. In practice, I am not too optimistic about ambitious value learning, but more narrow forms of value learning could still benefit from impact measurements. As we are still somewhat far from any form of advanced artificial intelligence, uncertainty about which methods will work makes this analysis difficult.

Impact as a safety protocol

When I think about advanced artificial intelligence, my mind tends to forward-chain from current AI developments and imagine them being scaled up dramatically. In these types of scenarios, I'm most worried about something like mesa optimization, where in the process of making a model that performs some useful task, we end up searching over a very large space of optimizers and ultimately find one that optimizes for some other task we never intended.

To oversimplify things a bit, there are a few ways that we could ameliorate the issue of misaligned mesa optimization. One is to find a way to robustly align arbitrary mesa objectives with base objectives. I am a bit pessimistic about this strategy working without some radical insights, because it currently seems really hard: doing it would require a huge chunk of alignment to be solved.

Alternatively, we could whitelist our search space such that only certain safe optimizers could be discovered. This is a task where I can see impact measurements being helpful.

When we do some type of search over models, we could construct an explicit optimizer that forms the core of each model. The actual parameters that we perform gradient descent over would need to be limited enough that we could still transparently see what type of "utility function" is being inner-optimized, but not so limited that the model search itself would be useless.

If we could constrain and control this space of optimizers enough, then we should be able to explicitly add safety precautions to these mesa objectives. The exact way this could be done is a bit difficult for me to imagine. Still, I think that as long as we are able to place some type of explicit constraint on what kind of optimization is allowed, it should be possible to penalize mesa optimizers in a way that could potentially avoid catastrophe.
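To make that a little more concrete, here is a very rough sketch (all names hypothetical) of what an explicitly constrained inner objective could look like: the learned utility is the only part the model search is allowed to shape, while the impact penalty is fixed by the outer training setup and applied to every plan the inner optimizer considers.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class PenalizedInnerObjective:
    """An explicit inner objective: learned utility minus a fixed impact penalty."""
    utility: Callable[[object], float]   # learned part, shaped by gradient descent
    impact: Callable[[object], float]    # fixed, non-negative impact estimate
    penalty_weight: float = 5.0          # fixed by the outer training setup

    def score(self, outcome) -> float:
        return self.utility(outcome) - self.penalty_weight * self.impact(outcome)

    def best_plan(self, candidate_outcomes: Sequence):
        # The inner optimizer can only rank plans through the penalized score,
        # so highly impactful plans are disfavored even if the learned utility
        # is misaligned.
        return max(candidate_outcomes, key=self.score)
```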

During the process of training, the model will start unaligned and gradually shift towards performing better on the base objective. At any point during the training, we wouldn't want the model to try to do anything that might be extremely impactful, both because it will initially be unaligned, and because we are uncertain about the safety of the trained model itself. An impact penalty could thus help us to create a safe testing environment.

The intention here is not that we would add some type of impact penalty to the AIs that are eventually deployed. It is simply that as we perform the testing, there will be some limitation on how much power we are giving the mesa optimizers. Having a penalty for mesa optimization can then be viewed as a short-term safety patch to minimize the chances that an AI does something extremely bad that we didn't expect.

It is perhaps hard at first to see how an AI could be dangerous during the training process. But there is good reason to believe that as our experiments get larger, they will require artificial agents to understand more about the real world while they are training, which incurs significant risk. There are also specific, predictable ways in which a model being trained could turn dangerous, such as in the case of deceptive alignment. It is conceivable that having some way to reduce impact for optimizers in these cases will be helpful.

Impact as an influence-limiter

Even if we didn't end up putting an impact penalty directly into some type of ambitiously aligned AGI, or using it as a safety protocol during testing, there are still a few disjunctive scenarios in which impact measures could help construct limited AIs. A few examples would be if we were constructing Oracle AIs or task AGIs.

Impact measurements could help Oracles by cleanly providing a separation between "just giving us true, important information" and "heavily optimizing the world in the process." This is, as I understand it, one of the main issues with Oracle alignment at the moment, which means that intuitively an impact measurement could be quite helpful in that regard.
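Purely as an illustration of that separation (all names below are hypothetical), one could imagine an Oracle that selects among candidate answers by informativeness, but only from the subset whose estimated downstream impact falls under a hard cap:

```python
def select_answer(candidates, informativeness, estimated_impact, impact_cap=0.01):
    """Pick the most informative candidate answer whose predicted impact is small.

    candidates          -> iterable of possible answers
    informativeness(a)  -> float, how true and useful the answer is judged to be
    estimated_impact(a) -> float, predicted effect of releasing the answer
    """
    safe = [a for a in candidates if estimated_impact(a) <= impact_cap]
    if not safe:
        return None  # decline to answer rather than exceed the impact cap
    return max(safe, key=informativeness)
```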

One rationale for constructing a task AGI is that it allows humanity to perform some type of important action which buys us more time to solve the more ambitious varieties of alignment. I am personally less optimistic about this particular solution to alignment, as in my view it would require a very advanced form of coordination around artificial intelligence. In general I incline towards the view that competitive AIs will take the form of more service-specific machine learning models, which might imply that even if we succeeded at creating some low-impact AGI that achieved a specific purpose, it wouldn't be competitive with the other AIs that themselves have no impact penalty at all.

Still, there is broad agreement that if we have a good theory about what is happening within an AI, then we are more likely to succeed at aligning it. Creating agentic AIs seems like a good way to have that form of understanding. If this is the route that humanity ends up taking, then impact measurements could provide immense value.

This justification for impact measures is perhaps the most salient in the debate over impact measurements. It seems to be behind the critique that impact measurements need to be useful rather than just safe and value-neutral. At the same time, I know from personal experience that there is at least one person currently thinking about ways we can leverage current impact penalties to be useful in this scenario. Since I don't have a good model for how this can be done, I will refrain from specific rebuttals of this idea.

Impact as deconfusion

The concept of impact appears to neighbor other relevant alignment concepts, like mild optimization, corrigibility, safe shutdowns, and task AGIs. I suspect that even if impact measures are never actually used in practice, there is still some potential that drawing clear boundaries between these concepts will help clarify approaches for designing powerful artificial intelligence.

This is essentially my model for why some AI alignment researchers believe that deconfusion is helpful. Developing a rich vocabulary for describing concepts is a key feature of how science advances. Particularly clean and insightful definitions help clarify ambiguity, allowing researchers to say things like "That technique sounds like it is a combination of X and Y without having the side effect of Z."

A good counterargument is that there isn't any particular reason to believe that this concept deserves priority for deconfusion. It would be bordering on a motte and bailey to claim that some particular research will lead to deconfusion and then, when pressed, appeal to research in general. I am not trying to do that here. Instead, I think that impact measurements are potentially good because they focus attention on a subproblem of AI alignment, in particular catastrophe avoidance. I also think there has empirically been demonstrable progress, in a way that provides evidence that this approach is a good idea.

Consider David Manheim and Scott Garrabrant's Categorizing Variants of Goodhart's Law. For those unaware, Goodhart's law is roughly summed up in the saying "Whenever a measure becomes a target, it ceases to be a good measure." The paper tries to catalog all of the different cases in which this phenomenon could arise. Crucially, it isn't necessary for the paper to actually present a solution to Goodhart's law in order to illuminate how we could avoid the issue. By distinguishing the ways in which the law holds, we can focus on addressing those specific sub-issues rather than blindly coming up with one giant patch for the entire problem.

Similarly, impact measurement is a confusing concept. There's one interpretation in which an "impact" is some type of distance between two representations of the world. In this interpretation, saying that something had a large impact is another way of saying that the world changed a lot as a result. In newer interpretations of impact, we like to say that an impact is really about a difference in what we are able to achieve.
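To make the contrast concrete, here is a rough sketch of each notion (my own paraphrase, not the exact definition from any particular proposal). The first measures how far the world ends up from where it would have been had the agent done nothing; the second, roughly in the spirit of attainable utility preservation, measures how much the action changes what the agent could go on to achieve across a set of auxiliary utility functions:

$$\text{Impact}_{\text{state}}(a) \;=\; d\!\left(s_a,\; s_{\varnothing}\right), \qquad \text{Impact}_{\text{ability}}(a) \;=\; \sum_{u \in \mathcal{U}} \left| Q_u(s, a) - Q_u(s, \varnothing) \right|,$$

where $s_a$ and $s_{\varnothing}$ are the states reached by taking the action or doing nothing, $d$ is some distance between world representations, $\mathcal{U}$ is a set of auxiliary utility functions, and $Q_u$ measures how well the agent could go on to achieve $u$ from the resulting state.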

The distinction between "difference in world models" and "difference in what we are able to do" is subtle, and enlightening (at least to me). It gives me a new terminology in which to talk about the impact of artificial intelligence. For example, in Nick Bostrom's founding paper on existential risk studies, his definition of existential risk included events which could

permanently and drastically curtail [humanity's] potential.

One interpretation of this definition is that Bostrom was referring to potential in the sense of the second notion of impact rather than the first.

A highly unrealistic way that this distinction could help us is if we had some future terminology which allowed us to unambiguously ask AI researchers to "see how much impact this new action will have on the world." AI researchers could then boot up an Oracle AI and ask the question in a crisply formalized framework.

More realistically, I could imagine that the field may eventually stumble on useful cognitive strategies for framing the alignment problem such that impact measurement becomes a convenient, precise concept to work with. As AI gets more powerful, alignment will become a nearer-term concern, forcing us to quickly adapt our language and strategies to the specific evidence we are given.

Within a particular subdomain, I think an AI researcher could ask questions about what they are trying to accomplish, and talk about it using the vocabulary of well-understood topics, which could eventually include impact measurements. The idea of impact measurement is simple enough that it will (probably) get independently invented a few times as we get closer to powerful AI. Having thoroughly examined the concept ahead of time rather than afterwards offers future researchers a standard toolbox of precise, deconfused language.

I do not think the terminology surrounding impact measurements will ever quite reach the ranks of terms like "regularizer" or "loss function," but I do have an inclination to think that simple, common-sense concepts should be rigorously defined as the field advances. Since we have intense uncertainty about the types of AI that will end up being powerful, and about the approaches that will be useful, it is possibly most helpful at this point in time to develop tools which can reliably be handed off to future researchers, rather than putting too much faith into one particular method of alignment.