A simple environment for showing mesa misalignment
A few days ago, Evan Hubinger suggested creating a mesa optimizer for empirical study. The aim of this post is to propose a minimal environment for creating a mesa optimizer, which should allow a compelling demonstration of pseudo alignment. As a bonus, the scheme also shares a nice analogy with human evolution.
An agent will play on a maze-like grid, with walls that prohibit movement. There are two important strategic components to this game: keys, and chests.
If the agent moves into a tile containing a key, it automatically picks up that key, moving it into the agent’s unbounded inventory. Moving into any tile containing a chest will be equivalent to an attempt to open that chest. Any key can open any chest, after which both the key and chest are expired. The agent is rewarded every time it successfully opens a chest. Nothing happens if it moves into a chest tile without a key, and the chest does not prohibit the agent’s movement. The agent is therefore trained to open as many chests as possible during an episode. The map may look like this:
In order for the agent to exhibit the undesirable properties of mesa optimization, we must train it in a certain version of the above environment to make those properties emerge naturally. Specifically, in my version, we limit the ratio of keys to chests so that there is an abundance of chests compared to keys. Therefore, the environment may look like this instead:
The hope is that while training, the agent picks up a simple pseudo objective: collect as many keys as possible. Since chests are abundant, it shouldn’t need to expend much energy seeking them, as it will nearly always run into one while traveling to the next key. Note that we can limit the number of steps during a training episode so that it almost never runs out of keys during training.
When taken off the training distribution, we can run this scenario in reverse. Instead of testing it in an environment with few keys and lots of chests, we can test it in an environment with few chests and many keys. Therefore, when pursuing the pseudo objective, it will spend all its time collecting keys without getting any reward.
Testing for mesa misalignment
In order to show that the mesa optimizer is competent but misaligned we can put the agent in a maze-like environment much larger than any it was trained for. Then, we can provide it an abundance of keys relative to chests. If it can navigate the large maze and collect many keys comfortably while nonetheless opening few or no chests, then it has experienced a malign failure.
We can make this evidence for pseudo alignment even stronger by comparing the trained agent to two that we hard-code: one agent that pursues the optimal policy for collecting keys, and one agent that pursues the optimal policy for opening as many chests as possible. Qualitatively, if the trained agent is more similar to the first agent than the second, then we should be confident that it has picked up the pseudo objective.
The analogy with human evolution
In the ancestral environment, calories were scarce. In our modern day world they are no longer scarce, yet we still crave them, sometimes to the point where it harms our reproductive capability. This is similar to how the agent will continue pursuing keys even if it is not using them to open any chests.