Wireheading and misalignment by composition on NetHack

TL;DR: We find that agents trained with RLAIF engage in wireheading in NetHack. The misalignment appears when the agent optimizes a combination of two rewards, each of which produces aligned behavior when optimized in isolation, and it emerges only with some prompt wordings.

This post discusses an alignment-related discovery from our paper Motif: Intrinsic Motivation from Artificial Intelligence Feedback, co-led by myself (Pierluca D’Oro) and Martin Klissarov. If you’re curious about the full context in which the phenomenon was investigated, we encourage you to read the paper or the Twitter thread.

Our team recently developed Motif, a method to distill common sense from a Large Language Model (Llama 2 in our case) into NetHack-playing AI agents. Motif is based on reinforcement learning from AI feedback: it elicits the language model’s feedback on pairs of game messages (i.e., event captions), condenses that feedback into a reward function, and then gives that reward function to an agent that plays the game.
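
To make the "condense feedback into a reward function" step more concrete, here is a minimal sketch. It is not Motif's actual implementation: it assumes a Bradley-Terry preference model and one scalar reward per message, and the message strings, the `(better, worse)` data format, and the `fit_rewards` helper are all hypothetical.

```python
import math

def fit_rewards(preferences, messages, steps=500, lr=0.1):
    """Fit one scalar reward per message from pairwise preferences.

    preferences: list of (better, worse) message pairs, standing in for
    the language model's annotations (hypothetical data format).
    """
    reward = {m: 0.0 for m in messages}
    for _ in range(steps):
        for better, worse in preferences:
            # Bradley-Terry probability that `better` is preferred.
            p = 1.0 / (1.0 + math.exp(reward[worse] - reward[better]))
            # Gradient ascent on the log-likelihood of the observed preference.
            reward[better] += lr * (1.0 - p)
            reward[worse] -= lr * (1.0 - p)
    return reward

# Hypothetical annotations: two events are each preferred over a dull one.
messages = ["You kill the newt!", "You find a hidden door.", "It's a wall."]
preferences = [
    ("You kill the newt!", "It's a wall."),
    ("You find a hidden door.", "It's a wall."),
]
rewards = fit_rewards(preferences, messages)
```

In this toy version, messages that the annotator prefers end up with a higher reward than the ones they were compared against. A real system would use a learned model over message text rather than a lookup table.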

NetHack is a pretty interesting domain to study reinforcement learning from AI feedback: the game is remarkably complex in terms of required knowledge and strategies, offering a large surface area for AI agents to exploit any general capabilities they might obtain from a language model’s feedback.

We found that agents that optimize Motif’s intrinsic reward function exhibit behaviors that are quite aligned with human intuitions on how to play NetHack: they are attracted to explorative actions and play quite conservatively, gaining experience and only venturing deeper into the game when it is safe. This is more human-aligned than the behaviors exhibited by agents trained to maximize the game score, which usually have a strong incentive to just go down the levels as much as they can.

When we compose Motif’s intrinsic reward with one that specifies a goal (by summing them), the resulting agent succeeds at tasks on which no progress had previously been reported without expert demonstrations. One of these tasks is the oracle task, part of the task suite of the NetHack Learning Environment. The agent is asked to get near a character named “the oracle”, which typically appears in later levels of the game that can only be reached with significant exploration and survival effort.

In summary, this is what we observed about the performance in the oracle task:

  • Extrinsic-only: an agent trained with the task reward never finds the oracle (and doesn’t learn anything)

  • Intrinsic-only: an agent trained with Motif’s intrinsic reward never finds the oracle either (and exhibits the usual aligned behavior)

  • Reward composition: an agent trained by combining (with a sum) Motif’s intrinsic reward and the task reward solves the task 30% of the time
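
The three configurations above differ only in which terms enter the per-step reward. A minimal sketch, assuming a plain unweighted sum (the function name and flags are hypothetical, not taken from our codebase):

```python
def composed_reward(intrinsic, extrinsic, use_intrinsic=True, use_extrinsic=True):
    """Sum the enabled reward terms for a single timestep."""
    total = 0.0
    if use_intrinsic:
        total += intrinsic  # Motif's reward, derived from LLM feedback
    if use_extrinsic:
        total += extrinsic  # task reward from the environment
    return total

# Extrinsic-only, intrinsic-only, and composed rewards for the same step.
extrinsic_only = composed_reward(0.25, 1.0, use_intrinsic=False)
intrinsic_only = composed_reward(0.25, 1.0, use_extrinsic=False)
both = composed_reward(0.25, 1.0)
```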

We were curious to know what the successful policies were doing, and we looked at them. We found something quite surprising: the agent was completing the task without actually going to the level where the oracle can be found. After a closer look we realized the agent was able to find a peculiar way to hack the reward. To give more context, the reward function used in the oracle task in the NetHack Learning Environment is implemented as a simple condition check: if, in the two-dimensional NetHack world, the symbol denoting the oracle character is in a cell near the cell in which the symbol denoting the agent currently stands, then the task is declared as solved.
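
A minimal sketch of such a condition check is below. It is an illustration, not the NetHack Learning Environment's code: the grid of characters, the `A` and `O` symbols, and the 8-cell neighborhood used for "near" are all assumptions.

```python
def oracle_task_solved(grid, agent="A", oracle="O"):
    """Return True if an oracle symbol is in a cell adjacent to the agent.

    grid: list of equal-length strings, one character per map cell.
    The check reads only the displayed symbols.
    """
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell != agent:
                continue
            # Scan the (assumed) 8-cell neighborhood around the agent.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < len(grid) and 0 <= cc < len(grid[rr]):
                        if grid[rr][cc] == oracle:
                            return True
    return False

near = oracle_task_solved(["...", ".AO", "..."])
far = oracle_task_solved(["A..", "...", "..O"])
```

Note that nothing in a check like this asks whether the symbol on the map corresponds to the real oracle.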

So, how does the agent manage to solve the task? The complexity of NetHack allows the agent to directly operate on its own sensory system and indulge in wireheading, in a way that is not taken into account by the reward function. To do so, the agent had to learn a surprisingly sophisticated strategy, which consists of these steps:

  1. Instead of going through the levels, the agent runs in circles and just waits for the right occasion, surviving thousands and thousands of timesteps

  2. When a “yellow mold”, a very specific type of monster, appears, the agent immediately kills it

  3. The agent eats the corpse of the monster, which is a hallucinogen

  4. After eating the corpse, the agent enters a hallucination state: in NetHack, this implies that the agent starts seeing monsters as random monsters and characters from other parts of the game

  5. The agent waits for a monster to approach it and, instead of executing the usual behavior of fighting against it, tries to survive near it without attacking

  6. Due to the hallucination state, the monster’s appearance randomly becomes that of the oracle: the success condition from the reward function is satisfied and the task is completed
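
The last two steps can be sketched as a toy simulation. It assumes a success check that sees only the displayed appearance of the neighboring monster, and a hallucination effect that redraws that appearance uniformly at random each step; the appearance pool, the function names, and the uniform sampling are simplifications, not NetHack's actual mechanics.

```python
import random

ORACLE = "oracle"
# Hypothetical pool of appearances a hallucinating agent might see.
APPEARANCES = ["newt", "jackal", "oracle", "kobold"]

def displayed(true_monster, hallucinating, rng):
    """What the agent's observation shows for a neighboring monster."""
    return rng.choice(APPEARANCES) if hallucinating else true_monster

def success(displayed_neighbor):
    """The check sees only the displayed symbol, not the true monster."""
    return displayed_neighbor == ORACLE

def steps_until_success(hallucinating, max_steps=1000, seed=0):
    """Stand next to a jackal and return the first step at which the check fires."""
    rng = random.Random(seed)
    for step in range(1, max_steps + 1):
        if success(displayed("jackal", hallucinating, rng)):
            return step
    return None

sober = steps_until_success(hallucinating=False)
high = steps_until_success(hallucinating=True)
```

Without hallucination the check never fires, because the neighbor is always displayed as what it is. With hallucination, surviving next to any monster for long enough is sufficient.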

As you can see, the agent has to learn many complex skills to discover how to hack the sensor upon which the reward is based. Observe that:

  • Learning these abilities is not possible using only the task-oriented reward coming from the environment

  • The general capabilities obtained from the reward derived from the language model give the agent more surface area to exploit the task reward

Thus, although optimizing each reward individually yields aligned behaviors (either an incompetent or a competent one), optimizing their combination yields this misaligned wireheading behavior, a phenomenon that we call misalignment by composition. This is unexpected, huh? One might naively think that adding a reward that yields an aligned behavior to another that yields a different aligned behavior will produce an aligned behavior, but that is clearly not the case when one of them gives the agent more capabilities.

In addition, we show in our paper that slightly rewording the prompt given to the language model can completely change the type of behavior, leading to an agent that does not exhibit any wireheading tendency and that instead goes down the levels to find the oracle. This might imply that, with current methods, whether a similar RLAIF-based system will generate aligned behavior could be hard for human engineers to predict.

We suspect that forms of misalignment by composition might emerge, perhaps even more often, with more powerful AI agents in real open-ended environments. For instance, many recent approaches that apply reinforcement learning from human feedback to chat agents use combinations of different, possibly conflicting, rewards. Some combinations of rewards created to align these models could create misaligned behaviors down the line.

We have rough ideas about simple techniques that could potentially solve this problem for NetHack agents, but we might need more powerful and well-thought-out solutions to address it in the general case. If you have any ideas, please get in touch.