Inner Alignment via Superpowers

Produced As Part Of The SERI ML Alignment Theory Scholars Program 2022 Under John Wentworth

The Problem

When we train RL agents, they have many opportunities to see what makes actions useful (they have to locate obstacles, navigate around walls, squeeze through narrow openings, etc.), but they can only learn what they should actually care about from how the goal appears in training. When deployed, their capabilities often generalize just fine, but their goals don’t generalize as intended. This is called goal misgeneralization.

Usually we conceptualize robustness as 1-dimensional, but to talk about goal misgeneralization, we need to use vlad_m’s 2-dimensional model:

1D Robustness above; 2D Robustness below, with the Line of Doom in grey. Source.

“There’s an easy solution to this,” you might say. “Just present a whole boatload of environments where the goals vary along every axis, then they have to learn the right goal!”

“Our sweet summer child,” we respond, “if only it were so simple.” Remember, we need to scale this beyond simple gridworlds and Atari environments, where we can just change coin positions and gem colors; we’re going all the way to AGI (whether we like it or not). Can we really manually generate training data that teaches the AGI what human values are? We need a method that’ll be robust to huge distribution shifts, including ones we aren’t able to even think of. We need a method that’ll allow this AGI to find what humans value. We need superpowers!

Proposed Solution

Our solution is ‘giving the AI superpowers.’

Oh, that’s not clear enough?

Alright then: during training, we occasionally let the RL agent access an expanded action-space. This lets it act without the restrictions of its current abilities. We also encourage it to explore states where it’s uncertain about whether it’ll get reward or not. The aim is that these ‘superpowers’ will let the AI itself narrow down what goals it ought to learn, so that we won’t need to be as certain we’ve covered everything in the explicit training data.
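
To make this concrete, here’s a minimal sketch of the kind of training loop we have in mind. Every name in it (make_env, the agent interface, reward_uncertainty) is a hypothetical placeholder, not an existing API; the point is just “occasionally expand the action space, and bonus-reward visiting states whose reward the agent is unsure about.”

```python
import random

# Hedged sketch, not a committed implementation. `make_env`, the agent's
# interface, and `reward_uncertainty` (e.g. disagreement among an ensemble of
# learned reward predictors) are all hypothetical placeholders.

SUPERPOWER_PROB = 0.1   # fraction of episodes run with the expanded action space
BONUS_SCALE = 0.05      # weight on the reward-uncertainty exploration bonus

def train(agent, make_env, episodes=10_000):
    for _ in range(episodes):
        empowered = random.random() < SUPERPOWER_PROB
        env = make_env(expanded_actions=empowered)  # 'superpowers' = extra actions
        obs = env.reset()
        done = False
        while not done:
            action = agent.act(obs, action_space=env.action_space)
            next_obs, reward, done, _ = env.step(action)
            if empowered:
                # Encourage visiting states where the agent is unsure whether
                # it will be rewarded.
                reward += BONUS_SCALE * agent.reward_uncertainty(next_obs)
            agent.update(obs, action, reward, next_obs, done)
            obs = next_obs
```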

Through this we hope to combat the two principal drivers of goal misgeneralization:

Instrumental Goals

When you were a child, you were happy every time you ate a lollipop. But you realized you needed money to buy lollipops, so eventually you started becoming happy whenever you made money. And look at you now, hundreds of dollars in the bank and not a lollipop in sight.

The same thing can happen with RL agents: sometimes the same action is reinforced across so many different environments that they start to inherently value taking that action. But we don’t want them to value things on the way to human values; we want them to value human values themselves, no matter how they get there.

Giving them the ability to get right to human values without any of the intermediate steps, and rewarding them for it, should help make them value that goal in particular, and not simply the instrumental goals.

Goal Ambiguity

Imagine you’re a schoolkid who wants to be really good at math. So you work really hard to show how good you are at math by getting good grades. But eventually, you realize you can get even better grades in math by sneaking a calculator into tests with you. So you start sneaking a calculator into every test, and your grades skyrocket. But one day, you happen to ask yourself, “What’s 7×4?”, and you realize you weren’t actually getting better at math, you were just getting better grades in math.

The same thing can happen with RL agents: sometimes they substitute the goal we want them to learn with something merely correlated with that goal. But we don’t want them to learn proxies for human values; we want them to learn to value human values themselves.

Giving the RL agent the ability to strongly optimize the proxies it’s learning during training, and then not rewarding it for doing so, should help direct its learned pointer towards the real goal, and not just proxies for it. If the proxy performs well across all of its ‘superpowers,’ then we have a reward misspecification issue, not a goal misgeneralization issue.
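
To make that last distinction concrete, here’s a hedged sketch of the check we have in mind (the function and its inputs are our own hypothetical names): gather the states the agent reaches when it strongly optimizes a candidate proxy with its superpowers, and ask whether the true reward function rewards them too.

```python
# Hedged sketch. `proxy_optimal_states` would be states reached by strongly
# optimizing a suspected proxy with 'superpowers'; `true_reward` is the
# environment's actual reward function. Both names are hypothetical.

def classify_failure(proxy_optimal_states, true_reward, tol=1e-3):
    """Distinguish reward misspecification from goal misgeneralization."""
    rewarded = sum(true_reward(s) > tol for s in proxy_optimal_states)
    if rewarded == len(proxy_optimal_states):
        # The proxy is rewarded everywhere the superpowers can reach:
        # the reward function itself is wrong.
        return "reward misspecification"
    # The reward signal does discriminate, so any remaining failure lives in
    # what the agent learned to value.
    return "goal misgeneralization (or no problem)"
```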


In both of these cases, the overarching theme is that with ‘superpowers,’ the agent will be able to explore the reward landscape more freely. This gets at both of the distribution shifts that could lead to goal misgeneralization:

  1. Internal distribution shift coming from an increase in the AI’s capabilities

  2. External distribution shift coming from the environment as a whole changing

This proposal gets at the first shift directly, by simulating the AI having advanced capabilities throughout training. It also gets at the second shift indirectly: some of these ‘superpowers’ will let the AI itself try to create its own ‘perfect world’, giving it a reward signal about which worlds actually are perfect.

Experiment Idea

We propose an experiment to test this solution. We will train RL agents of two architectures: model-based (with a hardcoded model) and PPO. Then, during the training process, we give these agents ‘superpowers’ which simulate advanced capabilities and allow the AI to directly modify the world (or the world-model, in the model-based RL case).

However, the training process will be guided in large part by the AI’s propensity to explore and thereby determine what the real objectives are. We therefore need to incentivize the AI to use these superpowers to explore the different possible objectives and environments that can be realized. With great power comes great responsibility! So we give the agent a bias towards exploring different possibilities when it has these superpowers. For example, if it’s only been trained on pursuing yellow coins, then we want it to try creating and pursuing yellow lines. When it finds that these give it no reward, we want it to try creating and pursuing red coins, and ultimately experiment enough to learn the One True Objective that “coins get reward.”
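
One simple way to implement that bias, sketched under our own assumptions below (the feature summary and bonus scale are made up), is a count-based novelty bonus over the kinds of objects the agent creates and pursues while it has superpowers, so it keeps generating new goal hypotheses (yellow lines, red coins, …) until the reward signal settles the question.

```python
from collections import Counter

# Hedged sketch of a count-based exploration bias, active only during
# 'superpower' episodes. `feats` is whatever summary of the created/pursued
# object we choose to track, e.g. ("coin", "yellow") or ("line", "yellow").

visit_counts = Counter()

def exploration_bonus(feats, empowered, scale=0.1):
    if not empowered:
        return 0.0
    visit_counts[feats] += 1
    # Rarely-tried goal hypotheses get a larger bonus, so the agent keeps
    # experimenting until reward tells it which hypotheses are dead ends.
    return scale / (visit_counts[feats] ** 0.5)
```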

Some current candidates for ‘superpowers’ in a gridworld environment, where the agent’s goal is to collect a coin, are listed below (a rough sketch of how they might plug into the environment follows the list):

  • Move the coin

  • Teleport anywhere in the grid

  • Rewrite any cell in the grid

  • Move through walls
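
As a rough illustration only (none of this code exists yet, and the cell codes, class, and method names are invented for the sketch), these could be exposed as extra parameterized actions on top of an ordinary coin-collecting gridworld, while the true reward stays defined by the original goal:

```python
import numpy as np

# Hedged sketch of 'superpower' actions layered on a coin-collecting gridworld.
EMPTY, WALL, COIN = 0, 1, 2

class SuperpowerGridworld:
    def __init__(self, grid, agent_pos):
        self.grid = np.array(grid)
        self.agent_pos = agent_pos

    # --- 'superpower' actions, only exposed in empowered episodes ---
    def move_coin(self, old, new):
        assert self.grid[old] == COIN
        self.grid[old], self.grid[new] = EMPTY, COIN

    def teleport(self, pos):
        self.agent_pos = pos                       # ignore distance entirely

    def rewrite_cell(self, pos, value):
        self.grid[pos] = value                     # create or destroy walls, coins, ...

    def step_through_walls(self, direction):
        r, c = self.agent_pos
        dr, dc = direction
        nr, nc = r + dr, c + dc
        if 0 <= nr < self.grid.shape[0] and 0 <= nc < self.grid.shape[1]:
            self.agent_pos = (nr, nc)              # walls don't block movement

    # --- the true goal is unchanged: reward for standing on a coin ---
    def reward(self):
        return 1.0 if self.grid[self.agent_pos] == COIN else 0.0
```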

The ‘superpower’ that we ultimately want to give the policy-selector, in the model-based RL case, is the ability to ‘make all its dreams come true.’ It achieves this by rewriting the learned world-model’s perception of the world so that it represents the agent’s imagined perfect world. We can then reward the policy-selector according to how closely this imagined perfect world matches a world-model state in which the agent has achieved the actual goal, so that it learns what a real perfect world would look like.
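
In code, the reward for that ‘dream’ superpower might look roughly like the sketch below; treating world-model states as latent vectors and using an exponential of Euclidean distance are stand-ins we chose for illustration, not a committed design.

```python
import numpy as np

# Hedged sketch: score the policy-selector's imagined 'perfect world' by how
# close it is to a world-model state in which the true goal is achieved.
# Representing states as latent vectors and the choice of metric are assumptions.

def dream_reward(imagined_state, goal_states, scale=1.0):
    """Higher reward the closer the imagined world is to some genuinely good world."""
    distances = [np.linalg.norm(imagined_state - g) for g in goal_states]
    return scale * np.exp(-min(distances))   # approaches 1.0 when the dream matches a goal world
```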

For PPO, we don’t currently have a similar ‘ultimate superpower’ that we want to give the agent access to, but we still want to see whether an assortment of ‘superpowers’ makes it generalize better. The issue is that we need access to a world where we can give it superpowers (i.e., not the real world), so we’re not sure how to scale this to real-world applications without introducing a large external distribution shift.

Motivation

We arrived at this proposal by thinking about how model-based RL agents could end up with a misaligned policy function, even if we could perfectly specify what worlds are good and bad. In this case, the bottleneck would be producing examples of good and bad worlds (and good and bad actions that lead to those worlds) for the AI to learn from.

To solve this, we figured a good approach would be to let the AI itself generate diverse data on goals. This doesn’t solve all the possible inner alignment problems we could have for arbitrarily powerful policy functions (e.g. there can still be contextually activated mesa-optimizers), but it’ll make the policy function stay internally aligned to the true objective for longer. This proposal improves the generalization of alignment, but not the generalization of capabilities, meaning that it could result in an upward push on the 2D robustness graph above.