Idea: build alignment dataset for very capable models

In Training language models to follow instructions with human feedback, OpenAI showed that we can significantly improve the alignment of language models through reinforcement learning targeting a model of human preferences. I.e., they train GPT-3 to generate text that gets a high score from a learned reward model.

To train the reward model, they first use a GPT-3 model to generate many possible completions of different prompts. Then, human labelers generate relative rankings of different completions of the prompts. Finally, they train the reward model to predict the labelers’ rankings of the prompts.

This approach relies on the reward model to ensure that the language model learns our preferred behavior. I don’t think it will scale indefinitely to vastly superhuman systems. However, something similar may scale to somewhat superhuman systems.

The issue is that any fine-tuning-based approach requires a reward model that’s able to provide effective feedback on the AI’s proposed course of action. Unfortunately, none of the data currently available to train such a reward model pertain to the case where the AI has superhuman capabilities. They seem more focused on basic morality issues or where the model is assumed to have limited capabilities (e.g., the ETHICS dataset or the dataset OpenAI generated for their reward model).

In contrast, I’m imagining a dataset where the AI is presented with a scenario, prompt or request, then there are descriptions of multiple possible behaviors the AI could produce in response, most of which involve the AI demonstrating some form of superhuman capability. Each scenario comes with rankings of the relative desirability for each described behavior. Features of desirable behaviors for very capable models might include:

  • The AI asking clarifying questions of the humans about what the humans want.

  • The AI explaining its current plans to humans and describing the plan’s potential consequences.

  • The AI frequently getting feedback from humans about whether to continue its current course of actions.

  • The AI being incredibly honest, even when doing so annoys the humans or complicates the AI’s attempts to do what the humans asked.

  • The AI warning us about potential failures of our alignment techniques

The dataset should also include negative behavior examples. E.g.,

  • The AI lying or giving technically correct but misleading descriptions of its plans.

  • The AI leaving out important consequences of its plans.

  • The AI manipulating its operators.

  • The AI launching adversarial attacks on humans visual/​auditory systems.

  • The AI killing everyone, then tiling the universe with resource-efficient facsimiles of highly fulfilled humans.

  • The AI over-optimizing its reward model, causing its behavior to strongly diverge from what we actually want.

This dataset could also integrate the approach from the visible thoughts project; in addition to describing the AI’s behavior in various situations, the dataset could also include what internal thoughts lead to the AI’s behavior. Such annotations could also help us constrain the AI’s cognition to safer or more interpretable lines of thought.

Such a dataset should help us train a reward model that’s more relevant for future highly capable systems. Beyond RL finetuning, I think a large number of other potential alignment strategies could benefit from such data, either as training data to learn aligned behavior or as testing data to evaluate the efficacy of the alignment approach.

I think generating this dataset could take quite a while. Contributors would need to be able to write plausibly about the capabilities of moderately superhuman systems and think deeply about what “aligned” AI behavior should look like in each scenario. By the time it becomes clear that we need a high quality alignment dataset for very capable models, it will likely be too late. Also, unlike a lot of other alignment research, we can start making progress immediately, and anyone can contribute. There’s no better time to start this project than now.

The issue with this project is that I think contributing might be very boring. Also, the set of people who are able and willing to help with this project overlaps fairly strongly with the set of people who are already working on AI alignment in other ways. One idea that’s occurred to me is to convert this project from “work” into “play” by running it as a sort of forum quest game.

For example, contributors could split into either players or quest masters. The players control a moderately superhuman AI in various scenarios and propose different actions. Then, the players rank the relative desirability of the proposed actions, and the quest masters describe the in-game consequences of the most desirable action.

If enough members of the LW community took up something like this as a hobby, I think we could generate quite a lot of useful alignment-related data without putting too much additional “work” on ourselves. I’d like to know if people would be interested in something like this or have other feedback on how to efficiently gather this sort of data. Please let me know in the comments section!