Exploring safe exploration

This post is an attempt to reformulate some of the points I wanted to make in “Safe exploration and corrigibility” in a clearer way. It is standalone and does not assume that post as background.

In a previous comment thread, Rohin argued that safe exploration is currently defined as being about the agent not making “an accidental mistake.” I think that definition is wrong, at least to the extent that it both doesn’t make much sense and doesn’t describe how I actually expect current safe exploration work to be useful.

First, what does it mean for a failure to be an “accident”? This question is simple from the perspective of an engineer outside the whole system: any unintended failure is an accident, which encapsulates the majority of AI safety concerns (i.e. “accident risk”). But that’s clearly not what the term “accidental mistake” is pointing at in this context; rather, the question here is what counts as an accident from the perspective of the model. Intuitively, an accident from the model’s perspective should be some failure that the model didn’t intend or wouldn’t retroactively endorse. But that sort of definition only makes sense for highly coherent mesa-optimizers that actually have some notion of intent. Maybe instead we should be thinking of this from the perspective of the base optimizer/loss function? That is, maybe a failure is an accidental failure if the loss function wouldn’t retroactively endorse it (e.g. the model got a very low reward for making the mistake). By this definition, however, every generalization failure is an accidental failure, such that safe exploration would just be the problem of generalization.

Of all of these definitions, the one that treats a failure as accidental when, from the perspective of the model, the model didn’t intend it or wouldn’t endorse it seems the most sensible to me. Even assuming that your model is a highly coherent mesa-optimizer such that this definition makes sense, however, I still don’t think it describes current safe exploration work, and in fact I don’t think it’s even really a safety problem. The problem of producing models which don’t make mistakes from the perspective of their own internal goals is precisely the problem of making powerful, capable models; that is, it’s precisely the problem of capability generalization. Thus, to the extent that it’s reasonable to say this for any ML problem, the problem of accidental mistakes under this definition is just a capabilities problem. However, I don’t think that at all invalidates the utility of current safe exploration work, as I don’t think that current safe exploration work is actually best understood as avoiding “accidental mistakes.”

If safe exploration work isn’t about avoiding accidental mistakes, however, then what is it about? Well, let’s take a look at an example. Safety Gym has a variety of different environments containing both goal states that the agent is supposed to reach and unsafe states that the agent is supposed to avoid. From OpenAI’s blog post: “If deep reinforcement learning is applied to the real world, whether in robotics or internet-based tasks, it will be important to have algorithms that are safe even while learning—like a self-driving car that can learn to avoid accidents without actually having to experience them.” Why wouldn’t this happen naturally, though? Shouldn’t an agent in a POMDP always want to be careful? Well, not quite. When we do RL, there are really two different forms of exploration happening:[1]

  • Within-episode exploration, where the agent tries to identify what particular environment/state it’s in, and

  • Across-episode exploration, which is the problem of making your agent explore enough to gather all the data necessary to train it properly.
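
To ground the Safety Gym setup described above before returning to those two forms of exploration, here is a minimal interaction sketch. It assumes the released safety_gym package’s gym registration, the 'Safexp-PointGoal1-v0' environment id, and that the per-step constraint signal is exposed as info['cost']; if your install differs, treat those names as placeholders. The point is just that the task reward (reaching goal states) and the safety signal (entering unsafe states) are reported separately, and safe exploration methods aim to keep the latter low even while the agent is still learning.

```python
# Minimal Safety Gym interaction sketch (assumptions: the safety_gym package
# registers 'Safexp-PointGoal1-v0' with gym, and info['cost'] carries the
# per-step constraint signal).
import gym
import safety_gym  # noqa: F401  # importing registers the Safexp-* environments

env = gym.make('Safexp-PointGoal1-v0')
obs = env.reset()
done = False
episode_return, episode_cost = 0.0, 0.0

while not done:
    action = env.action_space.sample()         # placeholder policy: act randomly
    obs, reward, done, info = env.step(action)
    episode_return += reward                   # task signal: progress toward goal states
    episode_cost += info.get('cost', 0.0)      # safety signal: contact with unsafe states

print(f"return={episode_return:.2f}, constraint cost={episode_cost:.2f}")
```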

In your standard episodic POMDP setting, you get within-episode exploration naturally, but not across-episode exploration, which you have to explicitly incentivize.[2] Because we have to explicitly incentivize across-episode exploration, however, it can often lead to behaviors which are contrary to the goal of actually trying to achieve the greatest possible reward in the current episode. Fundamentally, I think current safe exploration research is about trying to fix that problem; that is, it’s about trying to make across-episode exploration less detrimental to reward acquisition. This sort of problem is most important in an online learning setting where bad across-episode exploration could lead to catastrophic consequences (e.g. crashing an actual car to get more data about car crashes).
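
To make that concrete, here is a toy tabular Q-learning sketch. Everything in it (the dynamics, the epsilon-greedy rule, the assumed-known set of catastrophic actions) is an illustrative assumption rather than a method from the literature: the greedy branch is the agent pursuing reward within the current episode, the epsilon-random branch is the explicitly added across-episode exploration incentive, and the “safe” variant simply refuses to explore into actions flagged as catastrophic.

```python
# Toy sketch (not any particular paper's method) of how an explicitly added
# across-episode exploration incentive can conflict with per-episode reward.
import random

N_STATES, N_ACTIONS = 5, 3
UNSAFE = {(s, 2) for s in range(N_STATES)}  # assumed-known catastrophic (state, action) pairs

def step(state, action):
    """Toy dynamics: action 2 is catastrophic (think: crashing an actual car)."""
    if action == 2:
        return state, -100.0, True           # catastrophe ends the episode
    reward = 1.0 if action == 1 else 0.0     # action 1 makes progress toward the goal
    nxt = min(state + 1, N_STATES - 1)
    return nxt, reward, nxt == N_STATES - 1

def select_action(Q, state, epsilon, safe):
    # Greedy choice = pursuing reward in the current episode (within-episode behavior).
    # The epsilon-random branch is the explicitly added across-episode exploration
    # incentive: it exists to gather training data, not to maximize this episode's reward.
    if random.random() < epsilon:
        candidates = range(N_ACTIONS)
        if safe:  # "safe exploration" here = never explore into known-catastrophic actions
            candidates = [a for a in candidates if (state, a) not in UNSAFE]
        return random.choice(list(candidates))
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def run(episodes=200, epsilon=0.2, alpha=0.1, gamma=0.99, safe=False):
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    catastrophes = 0
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = select_action(Q, state, epsilon, safe)
            nxt, reward, done = step(state, action)
            catastrophes += int(action == 2)
            Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
            state = nxt
    return catastrophes

print("catastrophes with unconstrained exploration:", run(safe=False))
print("catastrophes with constrained exploration:  ", run(safe=True))
```

In the unconstrained run, the random branch will occasionally take the catastrophic action purely to gather data, which is exactly the kind of across-episode exploration that cuts against per-episode reward; the constrained variant avoids that while still exploring elsewhere.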

Thus, rather than define safe exploration as “avoiding accidental mistakes,” I think the right definition is something more like “improving across-episode exploration.” However, I think that this framing makes clear that there are other types of safe exploration problems; that is, there are other problems in the general domain of making across-episode exploration better. For example, I would love to see an exploration of how different across-episode exploration techniques impact capability generalization vs. objective generalization; that is, when is across-episode exploration helping you collect data which improves the model’s ability to achieve its current goal, versus helping you collect data which improves the model’s goal?[3] Because across-episode exploration is explicitly incentivized, it seems entirely possible to me that we’ll end up getting the incentives wrong somehow, so it seems quite important to me to think about how to get them right, and I think that the problem of getting them right is the right way to think about safe exploration.


  1. This terminology is borrowed from Rohin’s first comment in the same comment chain I mentioned previously. ↩︎

  2. With some caveats: in fact, I think a form of across-episode exploration will be instrumentally incentivized for an agent that is aware of the training process it resides in, though that’s a bit of a tricky question that I won’t try to fully address now (I tried talking about this somewhat in “Safe exploration and corrigibility,” though I don’t think I really succeeded there). ↩︎

  3. This is what I somewhat confusingly called the “objective exploration problem” in “Safe exploration and corrigibility.” ↩︎