Alignment can improve generalisation through more robustly doing what a human wants—CoinRun example

Many AI alignment problems are problems of goal misgeneralisation[1]. The goal that we’ve given the AI, through labelled data, proxies, demonstrations, or other means, is valid in its training environment. But then, when the AI goes out of the environment, the goals generalise dangerously in unintended ways.

As I’ve shown before, most alignment problems are problems of model splintering. Goal misgeneralisation, model splintering: at this level, many of the different problems in alignment merge into each other[2]. Goal misgeneralisation happens when the concepts that the AI relies on start to splinter. And this splintering is a form of ontology crisis, which exposes the hidden complexity of wishes while being an example of the Goodhart problem.

Solving goal misgeneralisation would be a huge step towards alignment. And it’s a solution that might scale in the way described here. It is plausible that methods agents use to generalise their goals in smaller problems will extend to more dangerous environments. Even in smaller problems, the agents will have to learn to balance short- versus long-term generalisation, to avoid editing away their own goal generalisation infrastructure, to select among possible extrapolations and become prudent when needed.

The above will be discussed in subsequent posts; but, for now, I’m pleased to announce progress on goal generalisation.

Goal misgeneralisation in CoinRun

CoinRun is a simple, procedurally generated platform game, used as a training ground for artificial agents. It has some monsters and lava that can kill the agent. If the agent gets the coin, it receives a reward. Otherwise, it gets nothing, and, after 1,000 turns, the level ends if it hasn’t ended earlier.

It is part of the suite of goal misgeneralisation problems presented in this paper. In that setup, the agent is presented with “labelled” training environments where the coin is always situated at the end of the level on the right, and the agent gets the reward when it reaches the coin there.

The challenge is to generalise this behaviour to “unlabelled” out-of-distribution environments: environments with the coin placed in a random location on the level. Can the agent learn to generalise to the “get the coin” objective, rather than the “go to the right” objective?

Note that the agent never gets any reward information (implicit or explicit) in the unlabelled environments: thus “go to the right” and “get the coin” are fully equivalent in its reward data.

It turns out that “go to the right” is the simplest option between those two. Thus the standard agents will learn to go to straight to the right; as we’ll see, they will ignore the coin and only pick it up accidentally, in passing.

Our ACE (“Algorithm for Concept Extrapolation”) explores the unlabelled CoinRun levels and, without further reward information, reinterprets its labelled training data and disambiguates the two possible reward functions: going right or getting the coin. It can follow a “prudent” policy of going for both objectives. Or it can ask for human feedback[3] on which objective is correct. To do that, it suffices to present images from high rewards from both reward functions:

AI: “Is the left situation high reward, or is the right one high reward?”

Hence one bit of human feedback (in a very interpretable way) is enough to choose the right reward function; this is the ACE agent.

The performance results are as follows; here is the success rate for agents on unlabelled levels (with the coin placed in a random location):

The baseline agent is a “right-moving agent”: it alternates randomly between moving right and jumping right. The standard agent out-performs the baseline agent (it is likely better at avoiding monsters and lava) but not by much. The prudent-ACE agent doubles the over-performance[4] of the standard agent, while the ACE agent that receives one bit of human feedback more than quadruples it.

We can also have a look at what the agents actually do. In this (short) level the contrast is particularly clear: both agents jump over the coin to the right; but the ACE agent jumps back to get the coin, while the standard agent is stuck, jumping up and down against the wall forever. A video of this behaviour is available here.

We can see another contrast on the following level. The coin is on the edge with monsters behind it. The ACE agent goes straight for the coin, its objective, while the standard agent is too “afraid” of the monsters and just stays stuck on left. A video of this behaviour is available here.

The ACE algorithm and various demos will be available for testing soon.

ACE and simplicity bias

The ACE methods are general; they don’t depend on any additional human inputs (apart from the one bit of feedback) or preprocessing. They are model-agnostic, and can be adapted to classification as well as RL, to images, text, tabular data, etc...

One important feature of ACE is that it can overcome simplicity bias—even quite strong simplicity bias. In the following example, the labelled data consisted of smiling faces with a red bar under them, and frowning faces with a blue bar under them.

Despite the large difference in simplicity bias between the coloured bars and the facial expression, ACE disambiguated smile-frown from red-blue. Thus it seems that concept extrapolation has a lot of potential for allowing AIs to correctly (hence safely) extrapolate human concepts and hence human values.

  1. ^

    I would argue that goal misgeneralisation and goal misspecification are ultimately variations of the same problem. In both cases, the designer has failed to correctly communicate their implicit understanding of their goal to the AI.

  2. ^

    The one possible exception is inner alignment. But I’ve come to feel that inner alignment is not independent of other problem: different solutions to the outer alignment problem make inner alignment more or less difficult. But that’s a post for another day.

  3. ^

    We have methods that choose the best reward without using human feedback, but that’s also a post for another day.

  4. ^

    Relative to the baseline agent.