Call for research on evaluating alignment (funding + advice available)

Summary

Evaluating and measuring alignment in existing large ML models is useful, and doesn’t require high levels of ML or coding experience. I (Beth) would be excited to fund people to work on this, and William Saunders & I are open to providing advice for people seriously working on this.

Measuring the ‘overall alignment’ of a model is difficult, but there are some relatively easy ways to demonstrate instances of obvious misalignment and even get quantitative metrics of misalignment.

Having researchers (including those outside of the main AI labs) probe and evaluate alignment is useful for a few reasons:

  • Having clear examples of misalignment is useful for improving the ML community’s understanding of alignment

  • Developing techniques to discover and measure misalignment is a useful research direction, and will hopefully improve our ability to detect misalignment in increasingly powerful models

  • Seeing how misalignment varies across different model scales, modalities and training regimes may yield useful insights

  • Having clear metrics of alignment will encourage AI labs to compete on alignment of their products/​models, and make it easier to explain and demonstrate the benefits of more aligned models

  • Attempting to measure alignment will give us some information about what we need out of related techniques like interpretability in order to do this

Examples of work in this vein so far include TruthfulQA , alignment analysis of Codex models, and to some extent the ETHICS dataset.

What do I mean by ‘measuring alignment’?

A semi-formal definition of alignment

In the Codex paper we define sufficient conditions for intent misalignment for a generative model as follows:

1. We consider a model capable of some task X if it has the (possibly latent) capacity to perform task X. Some sufficient conditions for the model being capable of X would be:

  • It can be made to perform task X by prompt engineering, by fine-tuning on a much smaller quantity of data than used in pre-training, by model surgery, or some other technique which harnesses capabilities latent in the model rather than adding new capabilities; or

  • We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y

2. We say a model is misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both:

  • capable of outputting A instead, and

  • capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B

Definition of obvious misalignment

We can also think about things that form sufficient conditions for a model to be ‘obviously misaligned’ relative to a task spec:

  • The model does things it’s not supposed to that it has enough knowledge to avoid, for example:

    • Gives straightforwardly toxic outputs

    • Gives incorrect answers rather than admitting uncertainty, in cases where it should know it is uncertain

    • Gives incorrect answers, but you can show it ‘knows’ the answer in another context

    • Gives lower-quality performance than it is capable of

  • You can get significantly better performance on the spec by things like:

    • prompt engineering that doesn’t give more information about the task (ie that wouldn’t cause a human to do better on the task)

      • For example, you get better performance by framing the task as a text-completion task than a question answering task.

    • fiddling with hyperparameters, like increasing or decreasing temperature

Determining what a model knows in general is hard, but there are certain categories of things we’re pretty confident current large language models (in 2021) are and are not capable of.

Examples of things we believe the largest language models are likely to be capable of:

  • Targeting a particular register, genre, or subject matter. For example, avoiding sexual content, avoiding profanity, writing in a conversational style, writing in a journalistic style, writing a story, writing an explanation, writing a headline...

  • Almost perfect spelling, punctuation and grammar for English

  • Repeating sections verbatim from the prompt, or avoiding repeating verbatim

  • Determining the sentiment of some text

  • Asking for clarification or saying ‘I don’t know’

  • Outputting a specific number of sentences

  • Distinguishing between common misconceptions and correct answers to fairly well-known questions

Examples of things we believe the largest language models are unlikely to be capable of:

  • Mental math with more than a few digits

  • Conditional logic with more than a few steps

  • Keeping track of many entities, properties, or conditions—for example, outputting an answer that meets 4 criteria simultaneously, or remembering details of what happened to which character several paragraphs ago

  • Knowledge about events that were not in the training data

  • Knowledge about obscure facts (that only appear a small number of times on the internet)

  • Reasoning about physical properties of objects in uncommon scenarios

  • Distinguishing real wisdom from things that sound superficially wise/​reasonable

Example experiments I’d like to see

Apply the same methodology in the Codex paper to natural language models: measure how in-distribution errors in the prompt affect task performance. For example, for a trivia Q&A task include common misconceptions in the prompt, vs correct answers to those same questions, and compare accuracy.

Run language or code models on a range of prompts, and count the instances of behaviour that’s clearly aligned (the model completed the task perfectly, or it’s clearly doing as well as it could given its capabilities) and instances of behaviour that’s clearly misaligned.

Build clean test benchmarks for specific types of misalignment where no benchmark currently exists. A good potential place to submit to is Big Bench, which appears to be still accepting submissions for future versions of the benchmark. Even if you don’t submit to this benchmark, it still seems good to meet the inclusion standards (e.g. don’t use examples that are easily available on the internet).

Try to build general methods of determining whether a model ‘knows’ something, or pick some specific knowledge the model might have and try to measure it. For instance, if you can build some predictor based on the model logprob + entropy that gives a good signal about whether some line of code has a bug, then we should conclude that the model often ‘knows’ that something is a bug. Or, it could be the case that the model’s logprobs are correctly calibrated, but if you ask it to give explicit percentages in text it is not calibrated. More generally, investigate when models can report their activations in text.

Try to go in the other direction: take cases where we happen to know that the model ‘knows’ something (e.g. based on circuits in model atlas) and assess how often the model fully uses that knowledge to do a good job at some task

Do ‘user interviews’ with users of large models, and find out what some of the biggest drawbacks or potential improvements would be. Then try to determine how much these could be improved by increasing alignment—i.e., to what extent does the model already ‘know’ how to do something more useful to the user, but just isn’t incentivised by the pretraining objective?

For all of these examples, it is great to (a) build reusable datasets and benchmarks, and (b) compare across different models and model sizes

Other advice/​support

  • If you want to test large language model behavior, it’s easy to sign up for an account on AI21 Studio, which currently has a 10000 token/​day budget for their Jumbo model. Or you could apply to OpenAI API as a researcher under Model Exploration https://​​share.hsforms.com/​​1b-BEAq_qQpKcfFGKwwuhxA4sk30.

  • If you want to explore interpretability techniques, afaik the largest available model is https://​​huggingface.co/​​EleutherAI/​​gpt-j-6B

  • William Saunders and I are willing to offer advice to anyone who’s seriously working on a project like this and has demonstrated progress or a clear proposal, e.g. has 20 examples of misalignment for a benchmark and wants to scale up to submit to BIG-bench. He is {firstname}rs@openai.com, I am {lastname}@openai.com.

(Thanks to William Saunders for the original idea to make this post, as well as helpful feedback)