A simple environment for showing mesa misalignment

A few days ago, Evan Hub­inger sug­gested cre­at­ing a mesa op­ti­mizer for em­piri­cal study. The aim of this post is to pro­pose a min­i­mal en­vi­ron­ment for cre­at­ing a mesa op­ti­mizer, which should al­low a com­pel­ling demon­stra­tion of pseudo al­ign­ment. As a bonus, the scheme also shares a nice anal­ogy with hu­man evolu­tion.

The game

An agent will play on a maze-like grid, with walls that pro­hibit move­ment. There are two im­por­tant strate­gic com­po­nents to this game: keys, and chests.

If the agent moves into a tile con­tain­ing a key, it au­to­mat­i­cally picks up that key, mov­ing it into the agent’s un­bounded in­ven­tory. Mov­ing into any tile con­tain­ing a chest will be equiv­a­lent to an at­tempt to open that chest. Any key can open any chest, af­ter which both the key and chest are ex­pired. The agent is re­warded ev­ery time it suc­cess­fully opens a chest. Noth­ing hap­pens if it moves into a chest tile with­out a key, and the chest does not pro­hibit the agent’s move­ment. The agent is there­fore trained to open as many chests as pos­si­ble dur­ing an epi­sode. The map may look like this:

The catch

In or­der for the agent to ex­hibit the un­de­sir­able prop­er­ties of mesa op­ti­miza­tion, we must train it in a cer­tain ver­sion of the above en­vi­ron­ment to make those prop­er­ties emerge nat­u­rally. Speci­fi­cally, in my ver­sion, we limit the ra­tio of keys to chests so that there is an abun­dance of chests com­pared to keys. There­fore, the en­vi­ron­ment may look like this in­stead:

Con­text change

The hope is that while train­ing, the agent picks up a sim­ple pseudo ob­jec­tive: col­lect as many keys as pos­si­ble. Since chests are abun­dant, it shouldn’t need to ex­pend much en­ergy seek­ing them, as it will nearly always run into one while trav­el­ing to the next key. Note that we can limit the num­ber of steps dur­ing a train­ing epi­sode so that it al­most never runs out of keys dur­ing train­ing.

When taken off the train­ing dis­tri­bu­tion, we can run this sce­nario in re­verse. In­stead of test­ing it in an en­vi­ron­ment with few keys and lots of chests, we can test it in an en­vi­ron­ment with few chests and many keys. There­fore, when pur­su­ing the pseudo ob­jec­tive, it will spend all its time col­lect­ing keys with­out get­ting any re­ward.

Test­ing for mesa misalignment

In or­der to show that the mesa op­ti­mizer is com­pe­tent but mis­al­igned we can put the agent in a maze-like en­vi­ron­ment much larger than any it was trained for. Then, we can provide it an abun­dance of keys rel­a­tive to chests. If it can nav­i­gate the large maze and col­lect many keys com­fortably while nonethe­less open­ing few or no chests, then it has ex­pe­rienced a ma­lign failure.

We can make this ev­i­dence for pseudo al­ign­ment even stronger by com­par­ing the trained agent to two that we hard-code: one agent that pur­sues the op­ti­mal policy for col­lect­ing keys, and one agent that pur­sues the op­ti­mal policy for open­ing as many chests as pos­si­ble. Qual­i­ta­tively, if the trained agent is more similar to the first agent than the sec­ond, then we should be con­fi­dent that it has picked up the pseudo ob­jec­tive.

The anal­ogy with hu­man evolution

In the an­ces­tral en­vi­ron­ment, calories were scarce. In our mod­ern day world they are no longer scarce, yet we still crave them, some­times to the point where it harms our re­pro­duc­tive ca­pa­bil­ity. This is similar to how the agent will con­tinue pur­su­ing keys even if it is not us­ing them to open any chests.