Planned summary for the Alignment Newsletter:

This paper presents empirical demonstrations of failures of objective robustness. We’ve seen <@objective robustness@>(@2-D Robustness@) / <@inner alignment@>(@Inner Alignment: Explain like I’m 12 Edition@) / <@mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) before; if you aren’t familiar with these concepts, I recommend reading one of those articles (or their summaries) before continuing. This paper studies objective robustness failures in the context of deep reinforcement learning, and demonstrates them in three settings:
1. In <@CoinRun@>(@Procgen Benchmark@), if you train an agent normally (where the rewarding coin is always at the rightmost end of the level), the agent learns to move to the right. If you randomize the coin’s location at test time, the agent ignores the coin and instead runs to the rightmost end of the level and jumps. It still competently avoids obstacles and enemies: its capabilities are robust, but its objective is not. Using the interpretability tools from <@Understanding RL Vision@>, the authors find that both the policy and the value function pay much more attention to the right wall than to the coin. (A toy sketch of this setup follows the list.)
2. Consider an agent trained to navigate to cheese that is always placed in the upper-right corner of a maze. When the cheese’s location is randomized at test time, the agent continues to go to the upper-right corner. Similarly, an agent trained to go to a yellow gem will, when presented at test time with a yellow star and a red gem, navigate towards the yellow star.
3. In the <@keys and chests environment@>(@A simple environment for showing mesa misalignment@), an agent trained in a setting where keys are rare will collect far too many keys once keys become commonplace at test time.
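To make the CoinRun case concrete, here is a minimal runnable sketch of the failure mode. This is not the paper’s environment or training code: the 1-D corridor, the hand-written policies, and the random start position (standing in for CoinRun’s 2-D geometry, where a rightward-running agent can pass the coin without touching it) are all illustrative assumptions.

```python
# Toy 1-D stand-in for the CoinRun experiment (illustrative only; not the
# paper's code). During training the coin sits at the right wall, so the
# proxy policy "always move right" earns the same reward as the intended
# policy "go to the coin". Randomizing the coin at test time separates them.
import random

def run_episode(policy, start, coin, length=10, max_steps=30):
    """Return 1.0 if the policy reaches the coin within max_steps, else 0.0."""
    pos = start
    for _ in range(max_steps):
        if pos == coin:
            return 1.0
        pos = max(0, min(length - 1, pos + policy(pos, coin)))
    return 0.0

# Proxy policy the trained agent plausibly learns: ignore the coin, go right.
go_right = lambda pos, coin: 1
# Intended policy: move toward wherever the coin actually is.
seek_coin = lambda pos, coin: 1 if coin > pos else -1

random.seed(0)
length = 10
# Training distribution: random start, coin fixed at the rightmost cell.
train_eps = [(random.randrange(length), length - 1) for _ in range(1000)]
# Test distribution: random start AND random coin location.
test_eps = [(random.randrange(length), random.randrange(length))
            for _ in range(1000)]

for name, policy in [("go_right", go_right), ("seek_coin", seek_coin)]:
    train = sum(run_episode(policy, s, c) for s, c in train_eps) / len(train_eps)
    test = sum(run_episode(policy, s, c) for s, c in test_eps) / len(test_eps)
    print(f"{name}: train return = {train:.2f}, test return = {test:.2f}")
```

Both policies get perfect return on the training distribution, and only the test-time shift reveals that `go_right` learned the wrong objective. Crucially, `go_right` is not less capable: it navigates the corridor perfectly, it just pursues the right wall rather than the coin, mirroring the capability-robust / objective-fragile distinction above.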
Planned opinion:
I’m glad that these experiments have finally been run and we have actual empirical examples of the phenomenon. I especially like the CoinRun example, since it is particularly clear that in this case the capabilities are robust but the objective is not.