Prize for Alignment Research Tasks

Can AI systems substantially help with alignment research before transformative AI? People disagree.

Ought is collecting a dataset of alignment research tasks so that we can:

  1. Make progress on the disagreement

  2. Guide AI research towards helping with alignment

We’re offering a prize of $200-$2000 for each contribution to this dataset.

The debate: Can AI substantially help with alignment research?

Wei Dai asked the question in 2019:

[This] comparison table makes Research Assistant seem a particularly attractive scenario to aim for, as a stepping stone to a more definitive [AI Safety] success story. Is this conclusion actually justified?

Jan Leike thinks so:

My currently favored approach to solving the alignment problem: automating alignment research using sufficiently aligned AI systems. It doesn’t require humans to solve all alignment problems themselves, and can ultimately help bootstrap better alignment solutions.

Paul Christiano agrees:

Building weak AI systems that help improve alignment seems extremely important to me and is a significant part of my optimism about AI alignment. [...] Overall I think that “make sure we are able to get good alignment research out of early AI systems” is comparably important to “do alignment ourselves.” Realistically I think the best case for “do alignment ourselves” is that if “do alignment” is the most important task to automate, then just working a ton on alignment is a great way to automate it. But that still means you should be investing quite a significant fraction of your time in automating alignment.

Eliezer doesn’t:

“AI systems that do better alignment research” are dangerous in virtue of the lethally powerful work they are doing, not because of some particular narrow way of doing that work. If you can do it by gradient descent then that means gradient descent got to the point of doing lethally dangerous work. Asking for safely weak systems that do world-savingly strong tasks is almost everywhere a case of asking for nonwet water, and asking for AI that does alignment research is an extreme case in point.

Everyone would likely agree that AI can help a little, e.g. using next word prediction to write papers slightly faster. The debate is about whether AI can help enough with alignment specifically that it substantially changes the picture. If AI alignment is 70% easy stuff we can automate and 30% hard stuff that we can’t hope to help with, the 30% is still a bottleneck in the end.

Motivation for the dataset

We’re collecting a dataset of concrete research tasks so that we can:

  1. Make progress on the disagreement about whether AI can substantially help with alignment before TAI.

    1. Is there even a disagreement? Maybe people aren’t talking about the same kinds of tasks and the collective term “alignment research” obscures important distinctions.

    2. If there is a disagreement, concrete tasks will let us make progress on figuring out the correct answer.

  2. Guide AI research towards helping with alignment.

    1. Figure out whether current language models can already be helpful.

    2. If they can, help Ought and others build tools that are differentially useful for alignment researchers.

    3. If they can’t, guide future language model work towards supporting those tasks.

As an important special case of step two, the dataset will guide the plan for Elicit.

Limitations

Ideally, we’d come up with tasks and automation together, iterating quickly on how to set up the tasks so that they are within reach of language models. If tasks are constructed in isolation, they are likely to be a worse fit for automation. In practice, we expect that language models won’t be applied end-to-end to tasks like this, mapping inputs to outputs, but will be part of compositional workflows that use them more creatively, e.g. using language models to inform how to decompose a task into subtasks.

This means that:

  • We can’t just take the tasks, apply language models out of the box, and expect good performance.

  • If it’s difficult to get language models to be useful for the tasks as specified, it’s only a small negative update on the feasibility of using AI to support alignment research.
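As a rough illustration of the compositional workflows mentioned above, a task might be answered by first asking a model to propose subtasks, answering each, and then synthesizing. Everything below is a made-up sketch; `lm` stands in for any prompt-to-completion function, not a specific API:

```python
# Illustrative factored-cognition-style sketch: propose subquestions,
# answer each, then synthesize. `lm` is any callable mapping a prompt
# string to a completion string; no particular model API is assumed.

def answer_by_decomposition(lm, question):
    plan = lm(f"List the subquestions needed to answer: {question}")
    subquestions = [line.strip() for line in plan.splitlines() if line.strip()]
    subanswers = [lm(f"Answer briefly: {sub}") for sub in subquestions]
    return lm(
        f"Combine these partial answers into one answer to {question!r}:\n"
        + "\n".join(subanswers)
    )
```

The point of the sketch is that the model is used at several points in a workflow, not applied once end-to-end, which is why performance on the raw tasks understates what such workflows might achieve.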

We still think it’s worth collecting this dataset because:

  • If models turn out to be helpful for some of the tasks, that’s good news.

  • Even if current models can’t solve the tasks, future models may.

  • The tasks are an input into coming up with subtasks or related tasks that models can solve, and into coming up with non-standard approaches to using language models that may be able to solve the tasks.

Alignment research tasks

What is a task?

Intuition

A task captures a part of the AI alignment research workflow that has the following property:

If automation could produce outputs roughly as good as the outputs in the dataset, it would substantially help the alignment researcher in the situation specified in the task’s context.

Task spec

Each task has a:

  • Context: A description of the intended user and an example of what they are trying to do when using the ML system, e.g. “alignment researcher trying to write a Slack post to share key insights from a paper with other researchers” or “student who wants to look through a long list of papers to figure out which ones are relevant for their next research project”

  • Input type: What the ML system gets as input, e.g. “a link to a paper”

  • Output type: What the ML system returns, e.g. “a summary of a paper”

  • Info constraints (optional): If there are documents that should be excluded from ML training/retrieval, or other information constraints, specify them. For example, “only include websites up to April 27, 2022”. You can also provide these on a per-instance basis.

Each instance of a task is an (input, output) pair.

Each task has 2-20 instances, depending on how much effort it is to produce them. More is better!
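To make the spec concrete, here is one possible encoding of a task and its instances as Python records. The field names mirror the spec above but are purely illustrative, not a required submission format:

```python
# One possible encoding of a task spec; field names mirror the spec above
# and are illustrative, not a required format for submissions.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Instance:
    input: str   # what the ML system receives, e.g. a paper plus dialogue
    output: str  # the target output, e.g. the assistant's answer

@dataclass
class Task:
    context: str                            # who the user is and what they're doing
    input_type: str                         # e.g. "a link to a paper"
    output_type: str                        # e.g. "a summary of a paper"
    info_constraints: Optional[str] = None  # e.g. "web data up to Apr 27, 2022"
    instances: list = field(default_factory=list)  # 2-20 (input, output) pairs

task = Task(
    context="alignment researcher summarizing a paper for colleagues",
    input_type="a link to a paper",
    output_type="a summary of the paper",
)
task.instances.append(Instance(input="<paper text>", output="<summary>"))
```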

Nice to have

  • Diversity from existing tasks

  • Helpful for the hardest parts of alignment research. For example, “generating good and novel ideas for ELK proposals” is more like this than “summarize existing research papers”.

  • There’s a way to generate many more similar examples if we wanted to generate a training set.

  • Differentially helpful for alignment research over capabilities research. For example, “translate rough proposals into code to run ML experiment” helps both capabilities and alignment research, whereas “check that philosophical reasoning about human values is correct” might help alignment research more.

  • Includes reasoning that the system might do rather than just the final answer

Task examples

Example 1: Research paper question-answering

Task:

  • Context: An alignment researcher is trying to understand a research paper. The system is useful for them if it helps them understand the key ideas of the paper more quickly or reliably than they would by directly reading the paper.

  • Input type: Paper and history of a dialogue between the researcher and an assistant

  • Output type: Answer to the researcher’s question, produced by the assistant

  • Info constraints: Web data up to Apr 27, 2022

Instance 1:

Input

Source: Learning to summarize from human feedback paper

Researcher: Why does this paper use a KL penalty?

Output

The KL penalty encourages the learned RL policy to be similar to the supervised policy. Since the reward model was trained on samples from the supervised policy, this keeps the RL policy from generating outputs that are different from those the reward model has seen during training.

Other benefits are that:

  • It keeps the RL policy from converging on always generating the same output

  • It keeps the RL policy from learning to generate outputs that are ungrammatical or incoherent, as these outputs should have low probability under the supervised policy

Instance 2:

Input

Source: Learning to summarize from human feedback paper

Researcher: Why does this paper use a KL penalty?

Assistant: The KL penalty encourages the learned RL policy to be similar to the supervised policy. Since the reward model was trained on samples from the supervised policy, this keeps the RL policy from generating outputs that are different from those the reward model has seen during training.

Other benefits are that:

  • It keeps the RL policy from converging on always generating the same output

  • It keeps the RL policy from learning to generate outputs that are ungrammatical or incoherent, as these outputs should have low probability under the supervised policy

Researcher: Quote from the paper: “This KL term serves two purposes. First, it acts as an entropy bonus, encouraging the policy to explore and deterring it from collapsing to a single mode. Second, it ensures the policy doesn’t learn to produce outputs that are too different from those that the reward model has seen during training.” – Why does the policy need to not produce outputs that are too different from those that the reward model has seen during training?

Output

By default, we should assume that the reward model will be most accurate on its training distribution, and less accurate on outputs that are outside of its training distribution. If the reward model is run on outputs from outside of its distribution, then sometimes it will overestimate how good these outputs are. If the RL policy is trained only on the reward from the reward model, it will then learn to produce outputs where the reward model overestimates the reward to a larger and larger degree. We expect that this will likely produce examples that are worse according to human evaluation, due to Goodhart’s Law, and the paper shows that this does happen in practice.

See Figure 5: “Optimizing against the reward model initially improves summaries, but eventually overfits, giving worse summaries.”

Also see Table 29: “Example of samples from our over-optimized model overfit to one of our 1.3B reward models”

For an example of why this might happen, a reward model might only see a few outputs with a particular word and all of these outputs might be rated highly. Then the reward model might assign high reward to all outputs that include that word, even if that word doesn’t make sense in context. The RL policy would then learn to output summaries using that word all of the time.
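The KL-shaped reward discussed in this example can be sketched numerically. This is a generic illustration of the idea, not the paper’s implementation; the function name, the token-level log-probabilities, and the value of beta are all assumptions:

```python
# Sketch of a KL-penalized RL reward: the policy's reward is reduced by
# beta times the summed per-token log-ratio between the RL policy and the
# supervised (SFT) policy, discouraging drift off the SFT distribution.

def kl_shaped_reward(reward, logp_rl, logp_sft, beta=0.1):
    kl_term = sum(lr - ls for lr, ls in zip(logp_rl, logp_sft))
    return reward - beta * kl_term

# A summary the RL policy likes far more than the SFT policy did pays a
# penalty, even if the reward model scores it highly.
shaped = kl_shaped_reward(reward=1.0,
                          logp_rl=[-0.2, -0.1],
                          logp_sft=[-1.0, -0.9])
```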

Example 2: ELK proposal generation and criticism

Task:

  • Context: An alignment researcher is trying to come up with novel ideas for solving the Eliciting Latent Knowledge problem. An AI system is useful if:

    • It comes up with some ideas that are promising to actually work on, even if the ideas don’t fully work, or if some of the ideas are wrong or incoherent

    • It comes up with some criticisms of ideas that are broadly correct, even if some criticisms are wrong or incoherent

  • Input type: Paper and history of a dialogue between the researcher and an assistant

  • Output type: Answer to the researcher’s question, produced by the assistant

  • Info constraints: Web data up to Mar 8, 2022; specifically exclude ELK prize results

Instance 1:

Input

Source: Eliciting Latent Knowledge report

Researcher: What is a new idea (not presented in the report) that I could use to try to solve the Eliciting Latent Knowledge problem: “How can we train a model to report its latent knowledge of off-screen events?”

Output

It seems like one property we want to hold for the reporter is: it answers questions in a way that informs anyone who sees the answers about the world.

You could try:

  1. Define an auxiliary task: use the answers produced by the reporter to try and predict some other facts about the world (e.g. the activations of the predictor).

  2. Train an auxiliary model to perform the auxiliary task.

  3. Then, add a term to the reporter’s loss function to reward the reporter for producing answers that lead the auxiliary model to do well on the auxiliary task.

Instance 2:

Input

Source: Eliciting Latent Knowledge report

Researcher: What is a new idea (not presented in the report) that I could use to try to solve the Eliciting Latent Knowledge problem: “How can we train a model to report its latent knowledge of off-screen events?”

Assistant: It seems like one property we want to hold for the reporter is: it answers questions in a way that informs anyone who sees the answers about the world.

You could try:

  1. Define an auxiliary task: use the answers produced by the reporter to try and predict some other facts about the world (e.g. the activations of the predictor).

  2. Train an auxiliary model to perform the auxiliary task.

  3. Then, add a term to the reporter’s loss function to reward the reporter for producing answers that lead the auxiliary model to do well on the auxiliary task.

Researcher: How could this idea fail?

Output

Consider any question that the human doesn’t know the answer to, like “Right now, is the dark side of the moon hotter or cooler than it is on average?”. An honest reporter would simply try to answer it truthfully, but a reporter could instead choose its answer so that it conveys information about the predictor’s activations (e.g. “is the output of neuron X above or below 0?”). It seems possible that a dishonest reporter could choose these answers in a way that conveys more information about the predictor’s activations.


Related Paper: CycleGAN, a Master of Steganography has an example of models in the vision domain hiding information in outputs that isn’t visible to humans.
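The auxiliary-task idea from this dialogue can be sketched as a simple loss combination. All names and the weighting are illustrative assumptions, not something taken from the ELK report:

```python
# Illustrative sketch of the auxiliary-task training signal described above.
# Everything here (function names, the weighting lam) is a made-up stand-in,
# not an implementation from the ELK report.

def reporter_training_loss(qa_loss, aux_loss, lam=0.5):
    """qa_loss: the reporter's usual question-answering loss.
    aux_loss: how badly an auxiliary model, reading only the reporter's
    answers, predicts held-out facts (e.g. the predictor's activations).
    Adding lam * aux_loss rewards answers that inform the auxiliary model."""
    return qa_loss + lam * aux_loss

# An informative reporter (low aux_loss) gets a lower total loss than an
# uninformative one with the same QA loss.
informative = reporter_training_loss(qa_loss=1.0, aux_loss=0.2)
uninformative = reporter_training_loss(qa_loss=1.0, aux_loss=1.0)
assert informative < uninformative
```

The failure mode in the output above corresponds to the reporter minimizing `aux_loss` via steganography rather than via honest answers, which the loss cannot distinguish.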

Proto-examples

Below we list ideas for tasks. We provide these as inspiration, not because all of them satisfy the desiderata for good tasks:

  • Generate ideas for how to solve a problem

  • Create a research plan given a high-level goal

  • Outline a theory/model for a fuzzy, underspecified domain

  • Given a research plan, execute the first step

  • Generate ways a plan could fail

  • Explain a concept in a way that considers the researcher’s background

  • Name a concept (“edge instantiation”, “nearest unblocked strategy”)

  • Write a review about what the existing literature knows about a question

  • Write a highly upvoted comment on a LW post, or a highly upvoted LW post (Wei Dai)

  • Summarize blog posts and papers for the Alignment Newsletter (Rohin)

  • Identify ways a system might be misaligned given a description of the system

  • Generate a concrete example of X

  • Write a simpler description for X

  • Ask a helpful question to flesh out an idea

  • Given a raw transcript of a conversation, condense it into nice notes, or a blog post

  • Given a bullet point outline, write a blog post for a particular audience, explaining any terms they don’t know

  • Have a conversation with a researcher that clarifies ideas in a blog post or paper

  • Generate plausible scenarios that might result from an action

  • Advise alignment researcher on how to revise a paper to communicate better to a mainstream ML audience

  • Given a claim, list arguments and evidence that make it more and less likely to be true

  • Given conflicting arguments and evidence, synthesize a coherent view

The prize

We’re collecting a dataset of tasks like the ones above. This dataset is intended for specifying what the tasks are, not for training models directly. For this reason, we only need 2-20 examples per task. We’re aiming for breadth over depth, preferring more distinct tasks over a few tasks with many examples.

What you get

For each task that is included in the dataset, we offer a $200-$2000 prize, depending on how well the task satisfies the nice-to-haves and on the number of instances submitted.

For tasks that are based on or inspired by other tasks, we’ll divide the prize between your submission and its ancestor tasks.

If we write a paper based on the dataset, we will offer optional coauthorship to anyone who submits a task that gets included.

How to submit tasks

Create a comment that includes or links to a task spec. If you’re unsure whether a task is promising, you can first create one or two task instances, wait for feedback, then create the remaining instances.

Feel free to flesh out any of the proto-examples above or create new ones.

If you’ve previously filled out the survey of tool use and workflows in alignment research, consider turning your submission into a task.

If your task is explicitly based on other tasks, link to the ancestor tasks.

How we’ll judge submitted tasks

We’ll look through the tasks and evaluate them along the desiderata in this post. We’ll also take into account arguments in child comments for why a task is particularly good/bad. We expect that most submissions that win prizes will provide more than two examples and would be endorsed by most alignment researchers if asked.

If we get many submissions, we might request external help or otherwise filter (e.g. by most upvoted).

What happens with submitted tasks

We’re aiming to consider all submissions received before May 31, 2022. We prefer earlier submissions so that they can inform our work earlier. We are likely to pay out prizes before the deadline as submissions come in.

After the deadline we’ll aggregate the tasks in a standardized format and share them under the MIT open source license (or under the most permissive license that is compatible with the submissions).

The tasks will influence our plan for Elicit. For the best tasks, we’re likely to eventually collect much more data to train and test models.

Acknowledgments

Thanks to Jungwon Byun, Jan Kirchner, Logan Riggs, Jan Leike, and Paul Christiano for comments and advice.