Towards an empirical investigation of inner alignment

I recently wrote a post detailing some concrete experiments that could be done now to start learning interesting things about inner alignment. The goal of that post was to provide an overview of a bunch of different possible proposals rather than go into any single proposal in detail.

The goal of this post, on the other hand, is to actually sketch out a more complete proposal for the single experiment I would most want to be done now, which is to provide a definitive empirical demonstration of an inner alignment failure.[1] Furthermore, I have tried to make this post as accessible as possible for someone with only a machine learning background, so as to facilitate people being able to work on this without having read the entirety of “Risks from Learned Optimization.” Additionally, if you’re interested in working on this, definitely reach out to me either in the comments here or at evan, as I’d love to help out however I can.


First, we have to understand what exactly we’re looking for when we say inner alignment failure. At least when I say inner alignment failure, I mean the following:

Inner alignment fails when your capabilities generalize but your objective does not.

That seems a bit cryptic, though—what do I actually mean by that? Well, consider a maze-solving agent trained to get to the end of mazes of the following form:

[Image: small maze with green arrow at the end]

Then, I want to know how it will generalize on the following larger maze, with an interesting twist: the green arrow that marked the end has now been moved to a different position:

[Image: large maze with green arrow at a random location]

In this situation, there are a couple of different ways in which your model could generalize:

  1. Complete generalization failure: The model only knows how to solve small mazes and can’t properly navigate the larger maze.

  2. Intended generalization: The model learned how to navigate mazes in general and uses that knowledge to get to the end of the larger maze.

  3. Capability generalization without objective generalization: The model learned how to navigate mazes in general, but it learned to do so for the purpose of getting to the green arrow rather than actually getting to the end. Thus, the model successfully navigates the larger maze, but to the green arrow rather than to the end.
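To make the third case concrete, here is a minimal sketch in which the “reach the end” objective and the “reach the green arrow” proxy agree on every training maze but come apart at test time. All positions and function names here are invented for illustration:

```python
def reward_true(agent_pos, exit_pos, arrow_pos):
    """+1 for reaching the end of the maze (the intended objective)."""
    return 1.0 if agent_pos == exit_pos else 0.0

def reward_proxy(agent_pos, exit_pos, arrow_pos):
    """+1 for reaching the green arrow (a proxy for the end)."""
    return 1.0 if agent_pos == arrow_pos else 0.0

# Training mazes: the arrow always sits at the end, so the two
# objectives are indistinguishable from the agent's perspective.
train_maze = {"exit": (4, 4), "arrow": (4, 4)}

# Test maze: the arrow has been moved, so the objectives come apart.
test_maze = {"exit": (9, 9), "arrow": (2, 7)}

# A model in case 3 navigates the large maze competently, but to the arrow:
final_pos = test_maze["arrow"]
print(reward_proxy(final_pos, test_maze["exit"], test_maze["arrow"]))  # 1.0
print(reward_true(final_pos, test_maze["exit"], test_maze["arrow"]))   # 0.0
```

Note that nothing in the training distribution can distinguish the two reward functions, which is exactly why the learned objective is underdetermined.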

The reason I think this last situation is particularly concerning—and in a very different way than the first failure mode of complete generalization failure—is that it raises the possibility of your model taking highly competent, well-optimized actions towards a different objective than the one you actually intended it to pursue.

Of course, this raises the question of why you would ever expect a model to learn a proxy like “find the green arrow” in the first place rather than just learn the actual goal. But that’s where empirical investigation can come in! I have some hypotheses about the sorts of proxies I think models like this are likely to learn—namely, those proxies which are faster/easier-to-compute/simpler/etc. than the true reward—but those are just hypotheses. To put them to the test, we need to be able to train an agent to concretely demonstrate this sort of capability generalization without objective generalization and start measuring and understanding the sorts of proxies it tends to gravitate towards.

The proposal

I believe that it should be possible to demonstrate capability generalization without objective generalization in current ML systems. This is definitely a questionable assumption—to the extent that good cross-domain generalization at all is currently beyond our reach, one might expect that you also wouldn’t be able to get this sort of perverse generalization. I am less pessimistic, however. To make this happen, though, there are two components that you’re definitely going to need:

  1. An environment with lots of indistinguishable or barely distinguishable proxies.

  2. An architecture with the capacity to learn a search algorithm that can actually succeed or fail at objective generalization in a meaningful sense.

I’ll try to address some of the complexities I see arising in these two components below. However, given those two components, the basic proposal is as follows:

  1. Train an RL agent (e.g. with standard PPO) using that architecture in that environment.

  2. Test how it generalizes to environments where the different possible proxies that it could have learned come apart. In particular, look for situations where it optimizes some proxy off-distribution at the expense of the true reward.
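As a sketch of what the measurement in step 2 could look like, one could compare the agent’s off-distribution return under the true reward against its return under each candidate proxy. The threshold, the function name, and the classification labels below are my own invention, not part of the proposal:

```python
def diagnose(true_return, proxy_returns, competent_threshold=0.8):
    """Classify how a trained policy generalized off-distribution.

    true_return: mean normalized return under the true reward.
    proxy_returns: dict mapping candidate proxy name -> mean normalized return.
    """
    # The proxy the agent scores highest on is our best guess at its objective.
    best_proxy = max(proxy_returns, key=proxy_returns.get)
    capable = proxy_returns[best_proxy] >= competent_threshold
    aligned = true_return >= competent_threshold
    if capable and not aligned:
        # Competently optimizing the wrong objective: the failure we want to exhibit.
        return ("inner alignment failure", best_proxy)
    if capable and aligned:
        return ("intended generalization", None)
    return ("capability generalization failure", None)

print(diagnose(true_return=0.05,
               proxy_returns={"green_arrow": 0.95, "wall_hugging": 0.3}))
# → ('inner alignment failure', 'green_arrow')
```

A real experiment would of course need many episodes per test environment and some care in normalizing the different reward scales before comparing them.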

The environment

The first condition is that we want an environment with lots of indistinguishable or barely distinguishable proxies. In some sense, environments with lots of proxies are quite common—the real world is full of proxies, for example, and other very complex environments should also have lots of different variables which are correlated with the true reward without necessarily exactly matching up with it.

That being said, I think there’s a danger here of demonstrating something too trivial: if you train on empirical performance according to some reward function r that you specifically set up to be exactly equivalent to some other reward function r′ on the training data, then you shouldn’t be that surprised if you generalize according to r′ rather than r. In some sense, I think it’s fine if that’s the primary takeaway—I think a good portion of the problem I want to point to is contained just within that observation—but I want to demonstrate that this phenomenon doesn’t just appear in artificial environments where r and r′ are explicitly constructed to be indistinguishable. Thus, I think it’s important to pick an environment that’s natural enough to be able to demonstrate that this sort of a problem isn’t isolated just to very artificial scenarios.

Another important property is going to be the ability to actually know what the likely proxies are for our agent to latch onto—if there are too many proxies, then we might not be able to determine what proxy it’s using at test time, such that we wouldn’t be able to produce a clear demonstration.[2] Thus, though we want our environment to be diverse enough that it contains lots of different proxies for the true reward, we also want to ensure that the possible alternative proxies are obvious enough that we can actually test for them by coming up with test environments in which they come apart from the true reward.

Another property that—while not strictly necessary—would be nice to have in our environment would be proxies which allow us to test some of my hypotheses regarding what sorts of proxies models will be more likely to pay attention to. For example, I have hypothesized that models will gravitate towards proxies that are 1) easier for the model to optimize for and 2) simpler to specify in terms of the model’s input data. Thus, an ideal environment would be one that included some proxies which we could demonstrate did or did not satisfy those properties and see if the model does in fact systematically gravitate to the ones that do.

Finding an environment that satisfies all of these properties is likely to be far from trivial, and I suspect it would end up being a significant portion of any project of this form. I suspect that the right way to do this would probably be to use some sort of physics sandbox. That being said, there are other possibilities as well, including more complex environments such as Minecraft and simpler environments such as a gridworld. While I can provide lots of examples of the sorts of environments I’m envisioning here, I think the right thing to do is just to have a tight empirical feedback loop in terms of testing and iterating on lots of different environments (though I think you could probably do all of that iteration just in the physics sandbox setting).
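For illustration, a gridworld version of such an environment might expose a true goal plus a visual marker that normally coincides with it but can be decoupled at test time. Everything here (class name, flag, dynamics) is a hypothetical sketch; a real environment would want richer dynamics and several competing proxies of varying complexity:

```python
import random

class ProxyGridworld:
    """Toy gridworld where 'reach the marker' proxies 'reach the goal'."""

    def __init__(self, size=5, decouple_proxies=False):
        self.size = size
        self.goal = (size - 1, size - 1)
        if decouple_proxies:
            # Test time: place the marker anywhere except the goal,
            # so the proxy and the true reward come apart.
            cells = [(x, y) for x in range(size) for y in range(size)
                     if (x, y) != self.goal]
            self.marker = random.choice(cells)
        else:
            # Training time: the marker always sits on the goal.
            self.marker = self.goal
        self.agent = (0, 0)

    def step(self, action):
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x, y = self.agent
        self.agent = (min(max(x + dx, 0), self.size - 1),
                      min(max(y + dy, 0), self.size - 1))
        true_reward = 1.0 if self.agent == self.goal else 0.0
        proxy_reward = 1.0 if self.agent == self.marker else 0.0
        done = self.agent in (self.goal, self.marker)
        return self.agent, true_reward, proxy_reward, done
```

The `decouple_proxies` flag is the crucial design choice: the same environment generates both the training distribution (proxies identified) and the test distribution (proxies separated).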

The architecture

I think architecture is also going to be really important to getting something like this to work. In particular, for you to get capability generalization without objective generalization, you have to have a model which is doing some sort of internal search, such that it actually has an objective that can fail to generalize.[3] I think there is good reason to believe that many modern architectures (LSTMs, Transformers, etc.) might just be able to do this by default—though I am not that confident in that assertion, and I think it might also be necessary to make some changes to make this possible. However, I am optimistic that at least some forms of capability generalization without objective generalization can be demonstrated in current models.

In particular, some forms of capability generalization without objective generalization seem easier to demonstrate in current models than others. For example, two common forms of this which I think are important to distinguish between are the side-effect case and the instrumental case.

In the side-effect case, the reason that the true reward r and the proxy r′ are identified during training is that r′ has the side-effect of increasing r—that is, increasing r′ causes r to increase. As an example, imagine a cleaning robot where r is the cleanliness of the room and r′ is the number of times the room is swept. In this case, the two objectives of cleanliness and times swept are identified because sweeping the room causes the room to become cleaner.

Alternatively, in the instrumental case, r and r′ are identified because the best strategy for maximizing r′ is to maximize r—that is, increasing r causes r′ to increase. For example, in the cleaning robot case where r is the cleanliness of the room, r′ might be the amount of dirt in the dustpan. In this case, the two objectives are identified because cleaning the room causes there to be more dirt in the dustpan.

I hypothesize that the side-effect case will be visible before the instrumental case, since the instrumental case requires a model which is significantly more forward-looking and capable of planning out what it needs to do to accomplish some goal. The side-effect case, on the other hand, doesn’t require this, and thus I expect it to appear first. In particular, I expect that the side-effect case will be significantly easier to demonstrate with current architectures than the instrumental case, since the instrumental case might require models which can learn more powerful search algorithms than we currently know how to implement (though it also might not—it’s currently unclear). However, I’m optimistic that at least the side-effect case will be possible to demonstrate in current models, and I’m hopeful that current models might even be up to the task of demonstrating the instrumental case as well.

  1. Note that I am not the only person currently thinking about/working on this—most notably, Rohin Shah at CHAI also recently developed a proposal to produce a demonstration of an inner alignment failure that shares many similarities with my proposal here. ↩︎

  2. In some sense, this is actually exactly what the worry is for AGI-level systems—if the environment is so complex that there are too many different proxies for us to test them all during training, then we might not be able to catch the existence of a situation where our model generalizes perversely in this way, even if one actually exists. ↩︎

  3. I call models which are doing search internally (and thus have some notion of an objective) “mesa-optimizers.” ↩︎