Safe exploration and corrigibility

EDIT: I now think this post is somewhat confusing and would recommend starting with my more recent post “Exploring safe exploration.”

Balancing exploration and exploitation is a classic problem in reinforcement learning. Historically—with approaches such as deep Q learning, for example—exploration is done explicitly via a rule such as ε-greedy exploration or Boltzmann exploration. With more modern approaches, however—especially policy gradient approaches like PPO that aren’t amenable to something like Boltzmann exploration—the exploration is instead entirely learned, encouraged by some sort of extra term in the loss that implicitly rewards exploratory behavior. This is usually an entropy term, though more advanced approaches have also been proposed, such as random network distillation, in which the agent learns to explore states for which it would have a hard time predicting the output of a random neural network. That approach was able to set a new state of the art on Montezuma’s Revenge, an Atari environment that is notoriously difficult because of how much exploration it requires.
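To make the contrast concrete, here is a minimal sketch (in NumPy, with hypothetical function names and a made-up entropy coefficient) of an explicit ε-greedy rule versus the kind of entropy bonus that policy-gradient methods like PPO typically add to their loss:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1, rng=None):
    """Explicit exploration rule: with probability epsilon, act uniformly at random."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def pg_loss_with_entropy_bonus(log_probs, advantages, action_probs, beta=0.01):
    """Learned exploration: exploratory behavior is only encouraged indirectly,
    via an entropy bonus added to the policy-gradient loss."""
    policy_loss = -np.mean(log_probs * advantages)
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-8), axis=-1).mean()
    return policy_loss - beta * entropy  # minimizing this rewards high-entropy policies
```

Random network distillation swaps the entropy term for a learned novelty bonus (prediction error against a fixed random network), but the structure is the same: exploration comes from an extra loss term rather than from an explicit rule.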

This move to learned exploration has a very interesting and important consequence, however, which is that the safe exploration problem for learned exploration becomes very different. Making ε-greedy exploration safe is in some sense quite easy, since the way it explores is totally random. If you assume that the policy without exploration is safe, then for ε-greedy exploration to be safe on average, it just needs to be the case that the environment is safe on average under random actions, which is just a standard engineering question. With learned exploration, however, this becomes much more complicated—there’s no longer a nice “if the non-exploratory policy is safe” assumption that can be used to cleanly subdivide the overall problem of off-distribution safety, since it’s just a single, learned policy doing both exploration and exploitation.
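To spell out the “safe on average” claim a bit more (a sketch for a single action in a given state $s$, writing $u$ for the uniform-random policy and $c$ for whatever cost measures harm), ε-greedy acts according to the mixture

$$\pi_\epsilon(a \mid s) = (1-\epsilon)\,\pi(a \mid s) + \epsilon\,u(a \mid s),$$

so

$$\mathbb{E}_{a \sim \pi_\epsilon}[c(s,a)] = (1-\epsilon)\,\mathbb{E}_{a \sim \pi}[c(s,a)] + \epsilon\,\mathbb{E}_{a \sim u}[c(s,a)].$$

The first term is handled by the assumption that the non-exploratory policy is safe, and the second term depends only on whether the environment is safe on average under random actions. A learned-exploration policy is not a mixture like this, so no analogous decomposition is available.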

First, though, an aside: why is learned exploration so much better? I think the answer lies primarily in the following observation: for most problems, exploration is an instrumental goal, not a terminal one, which means that to do exploration “right” you have to do it in a way that is cognizant of the objective you’re trying to optimize for. Boltzmann exploration is better than ε-greedy exploration because its exploration is guided by its exploitation—but it’s still essentially just adding random jitter to your policy. Fundamentally, exploration is about the value of information: proper exploration requires dynamically balancing the value of information against the value of exploitation. Ideally, in this view, exploration should arise naturally as an instrumental goal of pursuing the given reward function—an agent should instrumentally want to get updated in a way that causes it to become better at pursuing its current objective.
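One standard illustration of exploration arising from uncertainty rather than from injected noise is Thompson sampling in a two-armed Bernoulli bandit (a toy sketch; the arm means below are made up): the agent acts greedily with respect to a sample from its beliefs, so it explores exactly when, and because, it is uncertain about which action best serves its objective.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.4, 0.6])          # hypothetical two-armed bandit
alpha, beta = np.ones(2), np.ones(2)       # Beta(1, 1) prior over each arm's mean

for _ in range(1000):
    sampled_means = rng.beta(alpha, beta)  # sample a belief about each arm
    arm = int(np.argmax(sampled_means))    # act greedily w.r.t. that sample
    reward = rng.random() < true_means[arm]
    alpha[arm] += reward                   # update beliefs; as uncertainty shrinks,
    beta[arm] += 1 - reward                # so does exploration

print(alpha / (alpha + beta))              # posterior means favor the better arm
```

Note, though, that the uncertainty being resolved here is only about how best to achieve the fixed objective, which is exactly the limitation discussed next.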

Except, there’s a really serious problem with that reasoning: instrumental exploration only cares about the value of information for helping the model achieve the goal it’s learned so far, not for helping it fix that goal to be more aligned with the actual goal.[1] Consider, for instance, my maze example. Instrumental exploration will help the model better explore the larger maze, but it won’t help it figure out that its objective of finding the green arrow is misaligned—that is, it won’t, for example, lead to the model trying both the green arrow and the end of the maze to see which one is right. Furthermore, because instrumental exploration actively helps the model explore the larger maze better, it improves the model’s capability generalization without also helping its objective generalization, leading to precisely the most worrying case in the maze example. If we think about this problem from a 2D robustness perspective, we can see that what’s happening is that instrumental exploration gives us capability exploration but not objective exploration.

Now, how does this relate to corrigibility? To answer that question, I want to split corrigibility into three different subtypes:

  1. Indifference corrigibility: An agent is indifference corrigible if it is indifferent to modifications made to its goal.

  2. Exploration corrigibility: An agent is exploration corrigible if it actively searches out information to help you correct its goal.

  3. Cooperation corrigibility: An agent is cooperation corrigible if it optimizes under uncertainty over what goal you might want it to have.

Previously, I grouped the latter two together as act-based corrigibility, though recently I’ve been moving towards thinking that act-based corrigibility isn’t as well-defined as I previously thought it was. However, I think the concept of objective exploration lets us disentangle act-based corrigibility. Specifically, I think exploration corrigibility is just indifference corrigibility plus objective exploration, and cooperation corrigibility is just exploration corrigibility plus corrigible alignment.[2] That is, if a model is indifferent to having its objective changed and actively optimizes for the value of information in terms of helping you change its current objective, that gives you exploration corrigibility, and if its objective is also a “pointer” to what you want, then you get cooperation corrigibility. Furthermore, I think this helps solve a lot of the problems I previously had with corrigible alignment, as indifference corrigibility and exploration corrigibility together can help you prevent crystallization of deceptive alignment.
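As a toy sketch of what “optimizes under uncertainty over what goal you might want it to have” could look like (hypothetical numbers, not a proposal for how to actually build this), consider an agent that keeps a distribution over candidate objectives, picks actions by expected value under that distribution, and treats corrections as evidence rather than interference:

```python
import numpy as np

# Three hypothetical candidate goals the overseer might intend,
# each assigning different values to the same four actions.
candidate_rewards = np.array([
    [1.0, 0.0, 0.2, 0.1],   # goal A
    [0.0, 1.0, 0.3, 0.1],   # goal B
    [0.1, 0.2, 1.0, 0.0],   # goal C
])
belief = np.ones(3) / 3                      # uncertainty over the intended goal

def act(belief):
    # Optimize expected reward under the belief, not a single fixed objective.
    return int(np.argmax(belief @ candidate_rewards))

print(act(belief))                           # action chosen before any correction

# A correction from the overseer updates the belief rather than being resisted
# (made-up likelihoods for how strongly the correction favors each goal).
likelihood = np.array([0.1, 0.8, 0.1])
belief = belief * likelihood / np.sum(belief * likelihood)
print(act(belief))                           # action chosen after the correction
```

In these terms, exploration corrigibility corresponds to the agent also valuing information that shifts this belief, and cooperation corrigibility additionally requires that the belief itself point at what you actually want.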

Finally, what does this tell us about safe exploration and how to think about current safe exploration research? Current safe exploration research tends to focus on the avoidance of traps in the environment. Safety Gym, for example, has a variety of different environments containing both goal states that the agent is supposed to reach and unsafe states that the agent is supposed to avoid. One particularly interesting recent work in this domain was Leike et al.’s “Learning human objectives by evaluating hypothetical behaviours,” which used human feedback on hypothetical trajectories to learn how to avoid environmental traps. In the context of the capability exploration/objective exploration dichotomy, I think a lot of this work can be viewed as putting a damper on instrumental capability exploration. What’s nice about that lens, in my opinion, is that it both makes clear how and why such work is valuable and demonstrates how much other work there is to be done here. What about objective exploration—how do we do it properly? Do we need measures to put a damper on objective exploration as well? And what about cooperation corrigibility—is the “right” way to put a damper on exploration through constraints or through uncertainty? All of these are questions that I think deserve answers.
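For reference, the constrained framing that Safety Gym-style work uses is roughly the following (a sketch, with $C$ a per-trajectory cost for entering unsafe states and $d$ a cost budget):

$$\max_\pi \; \mathbb{E}_{\tau \sim \pi}[R(\tau)] \quad \text{subject to} \quad \mathbb{E}_{\tau \sim \pi}[C(\tau)] \le d,$$

often optimized via a Lagrangian penalty $\mathbb{E}[R(\tau)] - \lambda\,(\mathbb{E}[C(\tau)] - d)$. Seen this way, the last question above is whether the analogous damper for objective exploration should look like a hard constraint of this form or like explicit uncertainty over objectives.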


  1. For a mesa-optimizer, this is saying that the mesa-optimizer will only explore to help its current mesa-objective, not to help it fix any misalignment between its mesa-objective and the base objective. ↩︎

  2. Note that this still leaves the question of what exactly indifference corrigibility is unanswered. I think the correct answer to that is myopia, which I’ll try to say more about in a future post—for this post, though, I just want to focus on the other two types. ↩︎