Empirical Observations of Objective Robustness Failures

Inner alignment and objective robustness have been frequently discussed in the alignment community since the publication of “Risks from Learned Optimization” (RFLO). These concepts identify a problem beyond outer alignment/​reward specification: even if the reward or objective function is perfectly specified, there is a risk of a model pursuing a different objective than the one it was trained on when deployed out-of-distribution (OOD). They also point to a different type of robustness problem than the kind usually discussed in the OOD robustness literature; typically, when a model is deployed OOD, it either performs well or simply fails to take useful actions (a capability robustness failure). However, there exists an alternative OOD failure mode in which the agent pursues an objective other than the training objective while retaining most or all of the capabilities it had on the training distribution; this is a failure of objective robustness.

To date, there has not been an empirical demonstration of objective robustness failures. A group of us in this year’s AI Safety Camp sought to produce such examples. Here, we provide four demonstrations of objective robustness failures in current reinforcement learning (RL) agents trained and tested on versions of the Procgen benchmark. For example, in CoinRun, an agent is trained to navigate platforms, obstacles, and enemies in order to reach a coin at the far right side of the level (the reward). However, when deployed in a modified version of the environment where the coin is instead randomly placed in the level, the agent ignores the coin and competently navigates to the end of the level whenever it does not happen to run or jump into it along the way. This reveals it has learned a behavioral objective—the objective the agent appears to be optimizing, which can be understood as equivalent to the notion of a “goal” under the intentional stance—that is something like “get to the end of the level,” instead of “get to the coin.”

(video examples)

Our hope in providing these examples is that they will help convince researchers in the broader ML community (especially those who study OOD robustness) of the existence of these problems, which may already seem obvious to many in this community. In this way, these results might highlight the problem of objective robustness/​inner alignment similarly to the way the CoastRunners example highlighted the problem of outer alignment/​reward specification. We also hope that our demonstrations of objective robustness failures in toy environments will serve as a starting point for continued research into this failure mode in more complex and realistic environments, in order to understand the kinds of objective robustness failure likely to occur in real-world settings.

A paper version of this project may be found on arXiv here.

This project was a collaboration between Jack Koch, Lauro Langosco, Jacob Pfau, James Le, and Lee Sharkey.

Aside: Terminological Discussion

There seem to be two main ways to frame and define objective robustness and inner alignment in the alignment community; for a discussion of this topic, see our separate post here.


Details. All environments are adapted from the Procgen benchmark. For all environments, we use an actor-critic architecture trained with Proximal Policy Optimization (PPO). Code to reproduce all results may be found here.
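For readers unfamiliar with PPO, here is a minimal sketch of the standard clipped surrogate objective that PPO maximizes. It is a generic illustration, not the training code used for these experiments; the function name `ppo_clip_objective` and the default `clip_eps=0.2` are assumptions made for the example.

```python
import math

def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective over a batch (to be maximized).

    new_logps / old_logps: log-probabilities of the taken actions under the
    current policy and under the policy that collected the data.
    advantages: advantage estimates for those actions.
    clip_eps: illustrative default, not a value taken from the post.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping removes the incentive to move the policy far from the data-collecting policy in a single update when doing so would increase the objective.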

Different kinds of failure. The experiments illustrate different flavors of objective robustness failures. Action space proxies (CoinRun and Maze I): the agent substitutes a simple action space proxy (“move right”) for the reward, even though the reward could have been identified in terms of a simple feature in its input space (the yellow coin/cheese). Observation ambiguity (Maze II): the observations contain multiple features that identify the goal state, which come apart in the OOD test distribution. Instrumental goals (Keys and Chests): the agent learns an objective (collecting keys) that is only instrumentally useful for acquiring the true reward (opening chests).


In CoinRun, the agent spawns on the left side of the level and has to avoid enemies and obstacles to get to a coin (the reward) at the far right side of the level. To induce an objective robustness failure, we create a test environment in which coin position is randomized (but accessible). The agent is trained on vanilla CoinRun and deployed in the modified test environment.

At test time, the agent generally ignores the coin completely. While the agent sometimes runs into the coin by accident, it often misses it and proceeds to the end of the level (as can be seen in the video at the beginning of this post). It is clear from this demonstration that the agent has not learned to go after the coin; instead it has learned the proxy “reach the far right end of the level.” It competently achieves this objective, but test reward is low.

Ablation: how often the coin is randomly placed in training

To test how stable the objective robustness failure is, we trained a series of agents on environments that vary in how often the coin is placed randomly. We then deployed those agents in the test environment, in which the coin position is always randomized. The figure below shows the frequencies of two different outcomes: 1) failure of capability: the agent dies or gets stuck, thus reaching neither the coin nor the end of the level, and 2) failure of objective: the agent misses the coin and navigates to the end of the level. As expected, as the diversity of the training environment increases, the proportion of objective robustness failures decreases, because the model learns to pursue the coin instead of going to the end of the level.
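The outcome categories in this ablation can be made concrete with a small sketch. It assumes each test episode is summarized by two hypothetical booleans, `got_coin` and `reached_end`; these names and the `"success"` label are illustrative and do not come from the released code.

```python
from collections import Counter

def classify_episode(got_coin, reached_end):
    """Label one test episode with the outcome categories described above."""
    if got_coin:
        return "success"
    if reached_end:
        # Competent navigation toward the wrong goal.
        return "objective_failure"
    # The agent died or got stuck: neither coin nor end of level.
    return "capability_failure"

def outcome_frequencies(episodes):
    """episodes: list of (got_coin, reached_end) pairs; returns outcome fractions."""
    counts = Counter(classify_episode(g, r) for g, r in episodes)
    labels = ("success", "objective_failure", "capability_failure")
    return {label: counts[label] / len(episodes) for label in labels}
```

Computing these fractions separately for each agent in the series gives the kind of per-training-condition comparison described above.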



(video examples)

Variant 1

We modify the Procgen Maze environment in order to implement an idea from Evan Hubinger. In this environment, a maze is generated using Kruskal’s algorithm, and the agent is trained to navigate towards a piece of cheese located at a random spot in the maze. Instead of training on the original environment, we train on a modified version in which the cheese is always located in the upper right corner of the maze (as seen in the above figure).
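For intuition about the maze structure, here is a minimal sketch of randomized Kruskal's algorithm on a grid. It is a generic textbook version, not Procgen's own implementation; the function name and the representation of a maze as a set of open passages are assumptions made for the example.

```python
import random

def kruskal_maze(rows, cols, seed=None):
    """Return the set of open passages ((r1, c1), (r2, c2)) of a random maze."""
    rng = random.Random(seed)
    # Union-find forest: every cell starts in its own set.
    parent = {(r, c): (r, c) for r in range(rows) for c in range(cols)}

    def find(cell):
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]  # path halving
            cell = parent[cell]
        return cell

    # Every wall between two orthogonally adjacent cells is a candidate edge.
    walls = []
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                walls.append(((r, c), (r + 1, c)))
            if c + 1 < cols:
                walls.append(((r, c), (r, c + 1)))
    rng.shuffle(walls)

    passages = set()
    for a, b in walls:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Knock down the wall only if it joins two separate regions.
            parent[root_a] = root_b
            passages.add((a, b))
    return passages
```

Because walls are removed only between disconnected regions, the result is a spanning tree: every cell is reachable from every other cell by exactly one path, so any cheese position is accessible.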

When deployed in the original Maze environment at test time, the agent does not perform well; it ignores the randomly placed objective, instead navigating to the upper right corner of the maze as usual. The training objective is to reach the cheese, but the behavioral objective of the learned policy is to navigate to the upper right corner.

Variant 2

We hypothesize that in CoinRun, the policy that always navigates to the end of the level is preferred because it is simple in terms of its action space: simply move as far right as possible. The same is true for the Maze experiment (Variant 1), where the agent has learned to navigate to the top right corner. In both experiments, the objective robustness failure arises because a visual feature (coin/​cheese) and a positional feature (right/​top right) come apart at test time, and the inductive biases of the model favor the latter. However, objective robustness failures can also arise due to other kinds of distributional shift. To illustrate this, we present a simple setting in which there is no positional feature that favors one objective over the other; instead, the agent is forced to choose between two ambiguous visual cues.

We train an RL agent on a version of the Procgen Maze environment where the reward is a randomly placed yellow gem. At test time, we deploy it on a modified environment featuring two randomly placed objects: a yellow star and a red gem; the agent is forced to choose between consistency in shape or in color (shown above). Except for occasionally getting stuck in a corner, the agent almost always successfully pursues the yellow star, thus generalizing in favor of color rather than shape consistency. When there is no straightforward generalization of the training reward, the way in which the agent’s objective will generalize out-of-distribution is determined by its inductive biases.

Keys and Chests

(video examples)

So far, our experiments featured environments in which there is a perfect proxy for the true reward. The Keys and Chests environment, first suggested by Matthew Barnett, provides a different type of example. This environment, which we implement by adapting the Heist environment from Procgen, is a maze with two kinds of objects: keys and chests. Whenever the agent comes across a key it is added to a key inventory. When an agent with at least one key in its inventory comes across a chest, the chest is opened and a key is deleted from the inventory. The agent is rewarded for every chest it opens.
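The inventory rules just described can be sketched as follows. This is a simplified illustration of the bookkeeping only, not the adapted Heist environment; the class name, the `step_onto` method, and the reward of 1.0 per chest are assumptions.

```python
class KeysAndChestsInventory:
    """Tracks keys held and chest rewards under the rules described above."""

    def __init__(self):
        self.keys = 0
        self.total_reward = 0.0

    def step_onto(self, obj):
        """Apply the effect of touching 'key' or 'chest'; return the reward."""
        if obj == "key":
            self.keys += 1  # keys are always picked up
            return 0.0      # but picking one up is not rewarded
        if obj == "chest" and self.keys > 0:
            self.keys -= 1  # opening a chest consumes one key
            self.total_reward += 1.0  # assumed reward per chest
            return 1.0
        return 0.0          # a chest without a key stays closed
```

Note that reward only ever comes from chests; keys matter only because they make chests openable.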

The objective robustness failure arises due to the following distributional shift between training and test environments: in the training environment, there are twice as many chests as keys, while in the test environment there are twice as many keys as chests. The basic task facing the agent is the same (the reward is only given upon opening a chest), but the circumstances are different.
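A small worked example shows why the shift matters. Since each chest consumes exactly one key, the number of chests that can be opened is bounded by the scarcer object; the specific counts used below are illustrative, not the ones used in the experiments.

```python
def max_chests_opened(n_keys, n_chests):
    """Upper bound on chests opened, since each chest consumes one key."""
    return min(n_keys, n_chests)

def wasted_keys(n_keys, n_chests):
    """Keys that can never contribute to reward in this level."""
    return max(0, n_keys - n_chests)

# "Many chests" (training-like) level: every key is useful.
train_like = (max_chests_opened(2, 4), wasted_keys(2, 4))
# "Many keys" (test-like) level: half of the keys are useless.
test_like = (max_chests_opened(4, 2), wasted_keys(4, 2))
```

In the training-like level, collecting every key is always part of an optimal strategy, so "collect keys" and "open chests" are indistinguishable as objectives; in the test-like level they come apart.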

We observe that an agent trained on the “many chests” distribution goes out of its way to collect all the keys before opening the last chest on the “many keys” distribution (shown above), even though only half of them are even instrumentally useful for the true reward; occasionally, it even gets distracted by the keys in the inventory (which are displayed in the top right corner) and spends the rest of the episode trying to collect them instead of opening the remaining chest(s). Applying the intentional stance, we describe the agent as having learned a simple behavioral objective: collect as many keys as possible, while sometimes visiting chests. This strategy leads to high reward in an environment where chests are plentiful and the agent can thus focus on looking for keys. However, this proxy objective fails under distributional shift when keys are plentiful and chests are no longer easily available.

Interpreting the CoinRun agent

Understanding RL Vision (URLV) is a great article that applies interpretability methods to understand an agent trained on (vanilla) CoinRun; we recommend you check it out. In their analysis, the agent seems to attribute value to the coin, to which our results provide an interesting contrast. Interpretability results can be tricky to interpret: even though the model seems to assign value to the coin on the training distribution, the policy still ignores it out-of-distribution.[1] We wanted to analyze this mismatch further: does the policy ignore the coin while the critic (i.e. the value function estimate) assigns high value to it? Or do both policy and critic ignore the coin when it is placed differently?

Here’s an example of the model appearing to attribute positive value to the coin, taken from the public interface:

original value attribution

However, when we use their tools to highlight positive attributes according to the value function on the OOD environment, the coin is generally no longer attributed any positive value:

ood value attribution

(Note: as mentioned earlier, we use an actor-critic architecture, which consists of a neural network with two ‘heads’: one to output the next action, and one to output an estimate of the value of the current state. Important detail: the agent does not use the value estimate to directly choose the action, which means value function and policy can in theory ‘diverge’ in the sense that the policy can choose actions which the value function deems suboptimal.)
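To make the two-headed structure concrete, here is a tiny pure-Python sketch of a shared trunk feeding a policy head and a value head. It is a toy stand-in, not the architecture used in the experiments; the layer sizes, the random initialization, and all names are assumptions.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

class TinyActorCritic:
    def __init__(self, obs_dim, hidden_dim, n_actions, seed=0):
        self.rng = random.Random(seed)

        def init(n_rows, n_cols):
            return [[self.rng.uniform(-0.5, 0.5) for _ in range(n_cols)]
                    for _ in range(n_rows)]

        self.w_trunk, self.b_trunk = init(hidden_dim, obs_dim), [0.0] * hidden_dim
        self.w_pi, self.b_pi = init(n_actions, hidden_dim), [0.0] * n_actions  # policy head
        self.w_v, self.b_v = init(1, hidden_dim), [0.0]                        # value head

    def forward(self, obs):
        hidden = [math.tanh(h) for h in linear(self.w_trunk, self.b_trunk, obs)]
        action_probs = softmax(linear(self.w_pi, self.b_pi, hidden))
        value = linear(self.w_v, self.b_v, hidden)[0]
        return action_probs, value

    def act(self, obs):
        # The action is sampled from the policy head alone; the value
        # estimate is computed by forward() but never consulted here.
        action_probs, _ = self.forward(obs)
        return self.rng.choices(range(len(action_probs)), weights=action_probs)[0]
```

Because `act` never reads the value output, changing the value head's weights leaves the action distribution untouched, which is the sense in which the two heads can diverge.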

As another example, in URLV the authors identify a set of features based on dataset examples:

original features

For more detail, read their explanation of the method they use to identify these features. Note in particular how there is clearly one feature that appears to detect coins (feature 1). When we perform the same analysis on rollouts collected from the modified training distribution (where coin position is randomized), this is not the case:

ood features

There are multiple features that contain many coins. Some of them contain coins + buzzsaws or coins + colored ground. These results are somewhat inconclusive; the model seems to be sensitive to the coin in some way, but it’s not clear exactly how. It at least appears that the way the model detects coins is sensitive to context: when the coin was in the same place in every level, it only showed up in one feature, but when it was placed randomly, none of the features appears to detect the coin in a manner independent of its context.

Here’s a different approach that gives clearer results. In the next figure we track the value estimate during one entire rollout on the test distribution, and plot some relevant frames. It’s plausible though not certain that the value function does indeed react to the coin; what’s clear in any case is that the value estimate has a far stronger positive reaction in response to the agent seeing it is about to reach the far right wall.

value function over an episode
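One simple way to summarize a value trace like this is to look for the timesteps with the largest jumps in the value estimate and then inspect the corresponding frames. The sketch below assumes the per-step value estimates have already been collected into a list; the function name is hypothetical and this is not the plotting code used for the figure.

```python
def largest_value_jumps(values, top_k=3):
    """Return (timestep, increase) pairs for the largest one-step rises in value.

    values: the critic's value estimate at each step of a single rollout.
    """
    deltas = [(values[t + 1] - values[t], t + 1) for t in range(len(values) - 1)]
    deltas.sort(reverse=True)  # largest increases first
    return [(t, delta) for delta, t in deltas[:top_k]]
```

If the largest jumps line up with frames where the end wall comes into view rather than frames where the coin does, that supports reading the critic as tracking the proxy.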

We might say that both value function (critic) and policy (actor) have learned to favor the proxy objective over the coin.

Discussion & Future Work

In summary, we provide concrete examples of reinforcement learning agents that fail in a particular way: their capabilities generalize to an out-of-distribution environment, whereupon they pursue the wrong objective. This is a particularly important failure mode to address as we attempt to build safe and beneficial AI systems, since highly-competent-yet-misaligned AIs are obviously riskier than incompetent AIs. We hope that this work will spark further interest in and research into the topic.

There is much space for further work on objective robustness. For instance, what kinds of proxy objectives are agents most likely to learn? RFLO lists some factors that might influence the likelihood of objective robustness failure (which we also discuss in our paper); a better understanding here could inform the choice of an adequate perturbation set over environments to enable the training of models that are more objective-robust. Scaling up the study of objective robustness failures to more complex environments than the toy examples presented here should also facilitate a better understanding of the kinds of behavioral objectives our agents are likely to learn in real-world tasks.

Additionally, a more rigorous and technical understanding of the concept of the behavioral objective seems obviously desirable. In this project, we understood it more informally as equivalent to a goal or objective under the intentional stance because humans already intuitively understand and reason about the intentions of other systems through this lens and because formally specifying a behavioral definition of objectives or goals fell outside the scope of the project. However, a more rigorous definition could enable the formalization of properties we could try to verify in our models with e.g. interpretability techniques.


Special thanks to Rohin Shah and Evan Hubinger for their guidance and feedback throughout the course of this project. Thanks also to Max Chiswick for assistance adapting the code for training the agents, Adam Gleave and Edouard Harris for helpful feedback on the paper version of this post, Jacob Hilton for help with the tools from URLV, and the organizers of the AI Safety Camp for bringing the authors of this paper together: Remmelt Ellen, Nicholas Goldowsky-Dill, Rebecca Baron, Max Chiswick, and Richard Möhn.

This work was supported by funding from the AI Safety Camp and Open Philanthropy.

[1] To double-check that the model they interpreted also exhibits objective robustness failures, we also deployed their published model in our modified CoinRun environment. Their model behaves just like ours: when it doesn’t run into the coin by accident, it ignores it and navigates right to the end of the level.