Safe exploration and corrigibility

EDIT: I now think this post is somewhat confusing and would recommend starting with my more recent post “Exploring safe exploration.”

Balancing exploration and exploitation is a classic problem in reinforcement learning. Historically—with approaches such as deep Q-learning, for example—exploration is done explicitly via a rule such as ε-greedy exploration or Boltzmann exploration. With more modern approaches, however—especially policy gradient approaches like PPO that aren’t amenable to something like Boltzmann exploration—exploration is instead entirely learned, encouraged by some sort of extra term in the loss that implicitly rewards exploratory behavior. This is usually an entropy term, though more advanced approaches have also been proposed, such as random network distillation, in which the agent learns to explore states for which it would have a hard time predicting the output of a random neural network. That approach set a state of the art on Montezuma’s Revenge, an Atari environment that is notoriously difficult because of how much exploration it requires.
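To make the contrast concrete, here is a minimal sketch (my own illustration, not code from any of the cited work) of the two styles: explicit exploration rules applied at action-selection time versus an entropy bonus folded into the training loss. All names (`q_values`, `pg_loss_with_entropy_bonus`, etc.) are illustrative placeholders.

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """Explicit rule: with probability epsilon take a uniformly random action,
    otherwise take the greedy action. The exploration itself ignores the Q-values."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values: np.ndarray, temperature: float) -> int:
    """Explicit rule: sample actions with probability proportional to exp(Q / T),
    so exploration is weighted by estimated value but is still random jitter."""
    logits = q_values / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))

def pg_loss_with_entropy_bonus(log_probs: np.ndarray, advantages: np.ndarray,
                               entropies: np.ndarray, beta: float = 0.01) -> float:
    """Learned exploration: no rule at action-selection time; instead an entropy
    term in the loss pushes the policy itself to keep exploring."""
    return float(-(log_probs * advantages).mean() - beta * entropies.mean())
```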

This move to learned exploration has a very interesting and important consequence, however, which is that the safe exploration problem becomes very different. Making ε-greedy exploration safe is in some sense quite easy, since the way it explores is totally random. If you assume that the policy without exploration is safe, then for ε-greedy exploration to be safe on average, it just needs to be the case that the environment is safe on average under random actions, which is a standard engineering question. With learned exploration, however, this becomes much more complicated—there’s no longer a nice “if the non-exploratory policy is safe” assumption that can be used to cleanly subdivide the overall problem of off-distribution safety, since a single, learned policy is doing both exploration and exploitation.
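To spell out that reasoning in symbols (my notation, not the post’s): let π be the non-exploratory policy, assumed safe, and let the ε-greedy policy take a uniformly random action with probability ε. Then, per step,

```latex
P(\text{unsafe at step } t)
  = (1-\epsilon)\,\underbrace{P(\text{unsafe} \mid a_t \sim \pi)}_{\approx\, 0 \text{ by assumption}}
  \;+\; \epsilon\, P(\text{unsafe} \mid a_t \sim \mathrm{Uniform}(\mathcal{A}))
  \;\approx\; \epsilon\, P(\text{unsafe} \mid \text{random action}),
```

so the exploratory policy is safe on average exactly when random actions in the environment are safe on average, which is the “standard engineering question” above.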

First, though, an aside: why is learned exploration so much better? I think the answer lies primarily in the following observation: for most problems, exploration is an instrumental goal, not a terminal one, which means that to do exploration “right” you have to do it in a way that is cognizant of the objective you’re trying to optimize for. Boltzmann exploration is better than ε-greedy exploration because its exploration is guided by its exploitation—but it’s still essentially just adding random jitter to your policy. Fundamentally, exploration is about the value of information: proper exploration requires dynamically balancing the value of information against the value of exploitation. Ideally, in this view, exploration should arise naturally as an instrumental goal of pursuing the given reward function—an agent should instrumentally want to get updated in such a way that it becomes better at pursuing its current objective.
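As a toy illustration of the value-of-information framing (my example, with made-up numbers, not anything from the post): in a two-armed Bernoulli bandit with Beta posteriors, the value of an exploratory pull comes entirely from how it improves future choices under the agent’s current estimate of reward.

```python
def posterior_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) posterior over an arm's success probability."""
    return a / (a + b)

def one_step_voi(arm_known=(60, 40), arm_uncertain=(1, 1), horizon=100):
    """Rough one-step value of information from pulling the uncertain arm once,
    ignoring the small immediate reward forgone by the exploratory pull."""
    m_known = posterior_mean(*arm_known)
    a, b = arm_uncertain
    m_uncertain = posterior_mean(a, b)
    best_now = max(m_known, m_uncertain)

    # Posterior means of the uncertain arm after one more observation.
    m_if_success = posterior_mean(a + 1, b)
    m_if_failure = posterior_mean(a, b + 1)

    # Expected best achievable mean after the pull, taken under the current
    # posterior predictive for the uncertain arm's outcome.
    expected_best_later = (m_uncertain * max(m_known, m_if_success)
                           + (1 - m_uncertain) * max(m_known, m_if_failure))

    # Value of the information, scaled by how many pulls remain to exploit it.
    return (expected_best_later - best_now) * horizon

print(one_step_voi())  # positive: exploring pays off, but only in service of the current objective
```

Note that every quantity in the calculation is derived from the agent’s current reward estimate, which is exactly the problem raised in the next paragraph.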

Except, there’s a really serious problem with that reasoning: instrumental exploration only cares about the value of information for helping the model achieve the goal it’s learned so far, not for helping it fix that goal to be more aligned with the actual goal.[1] Consider, for instance, my maze example. Instrumental exploration will help the model better explore the larger maze, but it won’t help it figure out that its objective of finding the green arrow is misaligned—that is, it won’t, for example, lead to the model trying both the green arrow and the end of the maze to see which one is right. Furthermore, because instrumental exploration actively helps the model explore the larger maze better, it improves the model’s capability generalization without also helping its objective generalization, leading to precisely the most worrying case in the maze example. If we think about this problem from a 2D robustness perspective, we can see that what’s happening is that instrumental exploration gives us capability exploration but not objective exploration.

Now, how does this relate to corrigibility? To answer that question, I want to split corrigibility into three different subtypes:

  1. Indifference corrigibility: An agent is indifference corrigible if it is indifferent to modifications made to its goal.

  2. Exploration corrigibility: An agent is exploration corrigible if it actively searches out information to help you correct its goal.

  3. Cooperation corrigibility: An agent is cooperation corrigible if it optimizes under uncertainty over what goal you might want it to have.

Previously, I grouped the second two together under act-based corrigibility, though recently I’ve been moving towards thinking that act-based corrigibility isn’t as well-defined as I previously thought. However, I think the concept of objective exploration lets us disentangle act-based corrigibility. Specifically, I think exploration corrigibility is just indifference corrigibility plus objective exploration, and cooperation corrigibility is just exploration corrigibility plus corrigible alignment.[2] That is, if a model is indifferent to having its objective changed and actively optimizes for the value of information in terms of helping you change its current objective, that gives you exploration corrigibility, and if its objective is also a “pointer” to what you want, then you get cooperation corrigibility. Furthermore, I think this helps solve a lot of the problems I previously had with corrigible alignment, as indifference corrigibility and exploration corrigibility together can help prevent crystallization of deceptive alignment.
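As a very rough sketch of the distinction being drawn here (my framing, not a definition from this post), the difference between pursuing the current learned objective and cooperation corrigibility can be thought of as the difference between maximizing a single reward function and maximizing expected reward under uncertainty over which objective the overseer actually intends:

```python
def act_on_current_objective(actions, current_reward):
    """Pursue the learned objective as-is, with no regard for whether it is
    the objective the overseer actually wanted."""
    return max(actions, key=current_reward)

def act_under_objective_uncertainty(actions, candidate_rewards, posterior):
    """Cooperation-corrigibility-style choice: maximize expected reward under a
    posterior over which of several candidate objectives the overseer intends."""
    def expected_reward(action):
        return sum(p * reward(action) for reward, p in zip(candidate_rewards, posterior))
    return max(actions, key=expected_reward)
```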

Finally, what does this tell us about safe exploration and how to think about current safe exploration research? Current safe exploration research tends to focus on the avoidance of traps in the environment. Safety Gym, for example, has a variety of different environments containing both goal states that the agent is supposed to reach and unsafe states that the agent is supposed to avoid. One particularly interesting recent work in this domain was Leike et al.’s “Learning human objectives by evaluating hypothetical behaviours,” which used human feedback on hypothetical trajectories to learn how to avoid environmental traps. In the context of the capability exploration/objective exploration dichotomy, I think a lot of this work can be viewed as putting a damper on instrumental capability exploration. What’s nice about that lens, in my opinion, is that it makes clear how and why such work is valuable while also demonstrating how much other work there is to be done here. What about objective exploration—how do we do it properly? And do we need measures to put a damper on objective exploration as well? And what about cooperation corrigibility—is the “right” way to put a damper on exploration through constraints or through uncertainty? All of these are questions that I think deserve answers.
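For concreteness, the kind of “damper” applied by constraint-based safe exploration work often looks something like a Lagrangian-constrained policy loss, sketched below. This is a generic illustration of that family of methods (in the spirit of the constrained baselines commonly used with Safety Gym), not the specific technique of the paper cited above, and all names are placeholders.

```python
def constrained_policy_loss(log_probs, reward_advantages, cost_advantages, lam):
    """Policy-gradient-style loss that rewards task advantage but penalizes the
    advantage of incurring cost (entering unsafe states), weighted by a
    Lagrange multiplier. Arguments are arrays of per-step quantities."""
    reward_term = -(log_probs * reward_advantages).mean()
    cost_term = lam * (log_probs * cost_advantages).mean()
    return reward_term + cost_term

def update_lagrange_multiplier(lam, avg_episode_cost, cost_budget, lr=0.05):
    """Raise the penalty weight when average episode cost exceeds the budget,
    lower it (never below zero) when the agent is safely within budget."""
    return max(0.0, lam + lr * (avg_episode_cost - cost_budget))
```

In this framing, the cost constraint is doing exactly the job described above: discouraging the agent from exploring into states the cost signal marks as unsafe, i.e., damping instrumental capability exploration.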


  1. ↩︎

    For a mesa-optimizer, this is saying that the mesa-optimizer will only explore to help its current mesa-objective, not to help it fix any misalignment between its mesa-objective and the base objective.

  2. ↩︎

    Note that this still leaves the question of what exactly indifference corrigibility is unanswered. I think the correct answer to that is myopia, which I’ll try to say more about in a future post—for this post, though, I just want to focus on the other two types.