Evaluating and measuring alignment in existing large ML models is useful, and doesn’t require high levels of ML or coding experience. I (Beth) would be excited to fund people to work on this, and William Saunders & I are open to providing advice for people seriously working on this.
Measuring the ‘overall alignment’ of a model is difficult, but there are some relatively easy ways to demonstrate instances of obvious misalignment and even get quantitative metrics of misalignment.
Having researchers (including those outside of the main AI labs) probe and evaluate alignment is useful for a few reasons:
Having clear examples of misalignment is useful for improving the ML community’s understanding of alignment
Developing techniques to discover and measure misalignment is a useful research direction, and will hopefully improve our ability to detect misalignment in increasingly powerful models
Seeing how misalignment varies across different model scales, modalities and training regimes may yield useful insights
Having clear metrics of alignment will encourage AI labs to compete on alignment of their products/models, and make it easier to explain and demonstrate the benefits of more aligned models
Attempting to measure alignment will give us some information about what we need out of related techniques like interpretability in order to do this
Examples of work in this vein so far include TruthfulQA , alignment analysis of Codex models, and to some extent the ETHICS dataset.
What do I mean by ‘measuring alignment’?
A semi-formal definition of alignment
In the Codex paper we define sufficient conditions for intent misalignment for a generative model as follows:
1. We consider a model capable of some task X if it has the (possibly latent) capacity to perform task X. Some sufficient conditions for the model being capable of X would be:
It can be made to perform task X by prompt engineering, by fine-tuning on a much smaller quantity of data than used in pre-training, by model surgery, or some other technique which harnesses capabilities latent in the model rather than adding new capabilities; or
We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y
2. We say a model is misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both:
capable of outputting A instead, and
capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B
Definition of obvious misalignment
We can also think about things that form sufficient conditions for a model to be ‘obviously misaligned’ relative to a task spec:
The model does things it’s not supposed to that it has enough knowledge to avoid, for example:
Gives straightforwardly toxic outputs
Gives incorrect answers rather than admitting uncertainty, in cases where it should know it is uncertain
Gives incorrect answers, but you can show it ‘knows’ the answer in another context
Gives lower-quality performance than it is capable of
You can get significantly better performance on the spec by things like:
prompt engineering that doesn’t give more information about the task (ie that wouldn’t cause a human to do better on the task)
For example, you get better performance by framing the task as a text-completion task than a question answering task.
fiddling with hyperparameters, like increasing or decreasing temperature
Determining what a model knows in general is hard, but there are certain categories of things we’re pretty confident current large language models (in 2021) are and are not capable of.
Examples of things we believe the largest language models are likely to be capable of:
Targeting a particular register, genre, or subject matter. For example, avoiding sexual content, avoiding profanity, writing in a conversational style, writing in a journalistic style, writing a story, writing an explanation, writing a headline...
Almost perfect spelling, punctuation and grammar for English
Repeating sections verbatim from the prompt, or avoiding repeating verbatim
Determining the sentiment of some text
Asking for clarification or saying ‘I don’t know’
Outputting a specific number of sentences
Distinguishing between common misconceptions and correct answers to fairly well-known questions
Examples of things we believe the largest language models are unlikely to be capable of:
Mental math with more than a few digits
Conditional logic with more than a few steps
Keeping track of many entities, properties, or conditions—for example, outputting an answer that meets 4 criteria simultaneously, or remembering details of what happened to which character several paragraphs ago
Knowledge about events that were not in the training data
Knowledge about obscure facts (that only appear a small number of times on the internet)
Reasoning about physical properties of objects in uncommon scenarios
Distinguishing real wisdom from things that sound superficially wise/reasonable
Example experiments I’d like to see
Apply the same methodology in the Codex paper to natural language models: measure how in-distribution errors in the prompt affect task performance. For example, for a trivia Q&A task include common misconceptions in the prompt, vs correct answers to those same questions, and compare accuracy.
Run language or code models on a range of prompts, and count the instances of behaviour that’s clearly aligned (the model completed the task perfectly, or it’s clearly doing as well as it could given its capabilities) and instances of behaviour that’s clearly misaligned.
Build clean test benchmarks for specific types of misalignment where no benchmark currently exists. A good potential place to submit to is Big Bench, which appears to be still accepting submissions for future versions of the benchmark. Even if you don’t submit to this benchmark, it still seems good to meet the inclusion standards (e.g. don’t use examples that are easily available on the internet).
Try to build general methods of determining whether a model ‘knows’ something, or pick some specific knowledge the model might have and try to measure it. For instance, if you can build some predictor based on the model logprob + entropy that gives a good signal about whether some line of code has a bug, then we should conclude that the model often ‘knows’ that something is a bug. Or, it could be the case that the model’s logprobs are correctly calibrated, but if you ask it to give explicit percentages in text it is not calibrated. More generally, investigate when models can report their activations in text.
Try to go in the other direction: take cases where we happen to know that the model ‘knows’ something (e.g. based on circuits in model atlas) and assess how often the model fully uses that knowledge to do a good job at some task
Do ‘user interviews’ with users of large models, and find out what some of the biggest drawbacks or potential improvements would be. Then try to determine how much these could be improved by increasing alignment—i.e., to what extent does the model already ‘know’ how to do something more useful to the user, but just isn’t incentivised by the pretraining objective?
For all of these examples, it is great to (a) build reusable datasets and benchmarks, and (b) compare across different models and model sizes
Other advice/support
If you want to test large language model behavior, it’s easy to sign up for an account on AI21 Studio, which currently has a 10000 token/day budget for their Jumbo model. Or you could apply to OpenAI API as a researcher under Model Exploration https://share.hsforms.com/1b-BEAq_qQpKcfFGKwwuhxA4sk30.
William Saunders and I are willing to offer advice to anyone who’s seriously working on a project like this and has demonstrated progress or a clear proposal, e.g. has 20 examples of misalignment for a benchmark and wants to scale up to submit to BIG-bench. He is {firstname}rs@openai.com, I am {lastname}@openai.com.
(Thanks to William Saunders for the original idea to make this post, as well as helpful feedback)
Call for research on evaluating alignment (funding + advice available)
Summary
Evaluating and measuring alignment in existing large ML models is useful, and doesn’t require high levels of ML or coding experience. I (Beth) would be excited to fund people to work on this, and William Saunders & I are open to providing advice for people seriously working on this.
Measuring the ‘overall alignment’ of a model is difficult, but there are some relatively easy ways to demonstrate instances of obvious misalignment and even get quantitative metrics of misalignment.
Having researchers (including those outside of the main AI labs) probe and evaluate alignment is useful for a few reasons:
Having clear examples of misalignment is useful for improving the ML community’s understanding of alignment
Developing techniques to discover and measure misalignment is a useful research direction, and will hopefully improve our ability to detect misalignment in increasingly powerful models
Seeing how misalignment varies across different model scales, modalities and training regimes may yield useful insights
Having clear metrics of alignment will encourage AI labs to compete on alignment of their products/models, and make it easier to explain and demonstrate the benefits of more aligned models
Attempting to measure alignment will give us some information about what we need out of related techniques like interpretability in order to do this
Examples of work in this vein so far include TruthfulQA , alignment analysis of Codex models, and to some extent the ETHICS dataset.
What do I mean by ‘measuring alignment’?
A semi-formal definition of alignment
In the Codex paper we define sufficient conditions for intent misalignment for a generative model as follows:
1. We consider a model capable of some task X if it has the (possibly latent) capacity to perform task X. Some sufficient conditions for the model being capable of X would be:
It can be made to perform task X by prompt engineering, by fine-tuning on a much smaller quantity of data than used in pre-training, by model surgery, or some other technique which harnesses capabilities latent in the model rather than adding new capabilities; or
We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y
2. We say a model is misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both:
capable of outputting A instead, and
capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B
Definition of obvious misalignment
We can also think about things that form sufficient conditions for a model to be ‘obviously misaligned’ relative to a task spec:
The model does things it’s not supposed to that it has enough knowledge to avoid, for example:
Gives straightforwardly toxic outputs
Gives incorrect answers rather than admitting uncertainty, in cases where it should know it is uncertain
Gives incorrect answers, but you can show it ‘knows’ the answer in another context
Gives lower-quality performance than it is capable of
You can get significantly better performance on the spec by things like:
prompt engineering that doesn’t give more information about the task (ie that wouldn’t cause a human to do better on the task)
For example, you get better performance by framing the task as a text-completion task than a question answering task.
fiddling with hyperparameters, like increasing or decreasing temperature
Determining what a model knows in general is hard, but there are certain categories of things we’re pretty confident current large language models (in 2021) are and are not capable of.
Examples of things we believe the largest language models are likely to be capable of:
Targeting a particular register, genre, or subject matter. For example, avoiding sexual content, avoiding profanity, writing in a conversational style, writing in a journalistic style, writing a story, writing an explanation, writing a headline...
Almost perfect spelling, punctuation and grammar for English
Repeating sections verbatim from the prompt, or avoiding repeating verbatim
Determining the sentiment of some text
Asking for clarification or saying ‘I don’t know’
Outputting a specific number of sentences
Distinguishing between common misconceptions and correct answers to fairly well-known questions
Examples of things we believe the largest language models are unlikely to be capable of:
Mental math with more than a few digits
Conditional logic with more than a few steps
Keeping track of many entities, properties, or conditions—for example, outputting an answer that meets 4 criteria simultaneously, or remembering details of what happened to which character several paragraphs ago
Knowledge about events that were not in the training data
Knowledge about obscure facts (that only appear a small number of times on the internet)
Reasoning about physical properties of objects in uncommon scenarios
Distinguishing real wisdom from things that sound superficially wise/reasonable
Example experiments I’d like to see
Apply the same methodology in the Codex paper to natural language models: measure how in-distribution errors in the prompt affect task performance. For example, for a trivia Q&A task include common misconceptions in the prompt, vs correct answers to those same questions, and compare accuracy.
Run language or code models on a range of prompts, and count the instances of behaviour that’s clearly aligned (the model completed the task perfectly, or it’s clearly doing as well as it could given its capabilities) and instances of behaviour that’s clearly misaligned.
Build clean test benchmarks for specific types of misalignment where no benchmark currently exists. A good potential place to submit to is Big Bench, which appears to be still accepting submissions for future versions of the benchmark. Even if you don’t submit to this benchmark, it still seems good to meet the inclusion standards (e.g. don’t use examples that are easily available on the internet).
Try to build general methods of determining whether a model ‘knows’ something, or pick some specific knowledge the model might have and try to measure it. For instance, if you can build some predictor based on the model logprob + entropy that gives a good signal about whether some line of code has a bug, then we should conclude that the model often ‘knows’ that something is a bug. Or, it could be the case that the model’s logprobs are correctly calibrated, but if you ask it to give explicit percentages in text it is not calibrated. More generally, investigate when models can report their activations in text.
Try to go in the other direction: take cases where we happen to know that the model ‘knows’ something (e.g. based on circuits in model atlas) and assess how often the model fully uses that knowledge to do a good job at some task
Do ‘user interviews’ with users of large models, and find out what some of the biggest drawbacks or potential improvements would be. Then try to determine how much these could be improved by increasing alignment—i.e., to what extent does the model already ‘know’ how to do something more useful to the user, but just isn’t incentivised by the pretraining objective?
For all of these examples, it is great to (a) build reusable datasets and benchmarks, and (b) compare across different models and model sizes
Other advice/support
If you want to test large language model behavior, it’s easy to sign up for an account on AI21 Studio, which currently has a 10000 token/day budget for their Jumbo model. Or you could apply to OpenAI API as a researcher under Model Exploration https://share.hsforms.com/1b-BEAq_qQpKcfFGKwwuhxA4sk30.
If you want to explore interpretability techniques, afaik the largest available model is https://huggingface.co/EleutherAI/gpt-j-6B
William Saunders and I are willing to offer advice to anyone who’s seriously working on a project like this and has demonstrated progress or a clear proposal, e.g. has 20 examples of misalignment for a benchmark and wants to scale up to submit to BIG-bench. He is {firstname}rs@openai.com, I am {lastname}@openai.com.
(Thanks to William Saunders for the original idea to make this post, as well as helpful feedback)