Imitative Generalisation (AKA ‘Learning the Prior’)


We want to be able to supervise models with superhuman knowledge of the world and how to manipulate it. For this we need an overseer to be able to learn or access all the knowledge our models have, in order to be able to understand the consequences of suggestions or decisions from the model. If the overseers don’t have access to all the same knowledge as the model, it may be easy for the model to deceive us, suggesting plans that look good to us but that may have serious negative consequences.

We might hope to access what the model knows just by training it to answer questions. However, we can only train on questions that humans are able to answer[1].

This gives us a problem that’s somewhat similar to the standard formulation of transduction: we have some labelled training set (questions humans can answer), and we want to transfer to an unlabelled dataset (questions we care about), that may be differently distributed.

We might hope that our models will naturally generalize correctly from easy-to-answer questions to the ones that we care about. However, a natural pathological generalisation is for our models to only give us ‘human-like’ answers to questions, even if they know the best answer is different. If we only have access to these human-like answers to questions, that probably doesn’t give us enough information to supervise a superhuman model.

What we’re going to call ‘Imitative Generalization’ is a possible way to narrow the gap between the things our model knows, and the questions we can train our model to answer honestly. It avoids the pathological generalisation by only using ML for IID tasks, and imitating the way humans generalize. This hopefully gives us answers that are more like ‘how a human would answer if they’d learnt from all the data the model has learnt from’. We supervise how the model does the transfer, to get the sort of generalisation we want.

It’s worth noting there are enough serious open questions that imitative generalization is more of a research proposal than an algorithm!

This post is based on work done with Paul Christiano at OpenAI. Thanks very much to Evan Hubinger, Richard Ngo, William Saunders, Long Ouyang and others for helpful feedback, as well as Alice Fares for formatting help.

Goals of this post

This post tries to explain a simplified[2] version of Paul Christiano’s mechanism introduced here (referred to there as ‘Learning the Prior’), and to explain why a mechanism like this potentially addresses some of the safety problems with naïve approaches. First we’ll go through a simple example in a familiar domain, then explain the problems with the example. Then I’ll discuss the open questions for making Imitative Generalization actually work, and the connection with the Microscope AI idea. A more detailed explanation of exactly what the training objective is (with diagrams), and the correspondence with Bayesian inference, is in the appendix.

Example: using IG to avoid overfitting in image classification.

Here’s an example of using Imitative Generalization to get better performance on a standard ML task: image classification of dog breeds, with distributional shift.

Imagine we want to robustly learn to classify dog breeds, but the human labellers we have access to don’t actually know how to identify all the breeds[3], and we don’t have any identification guides or anything. However, we do have access to a labelled dataset D. We want to classify dogs in a different dataset D’, which is unlabelled.

One unfamiliar breed we want to learn to recognise is a husky. It happens that all the huskies in D are on snow, but in D’ some of them are on grass.

Label: Husky (image from D)

Label: ??? (OOD image from D’)

A NN architecture prior likely doesn’t favour the hypothesis ‘a husky is a large, fluffy dog that looks quite like a wolf’ over ‘if there are a lot of white pixels in the bottom half of the image, then it’s a husky’. These hypotheses both perform equally well on the training data. So a naïve approach of fitting a model to D and then running it on D’ may easily misclassify huskies that are not on snow.
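To make the failure mode concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: images are reduced to two hypothetical binary features, and the two hypotheses are written as hand-coded rules rather than learned models.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Image:
    wolf_like: bool  # hypothetical feature: large, fluffy, wolf-like dog in the photo
    snowy: bool      # hypothetical feature: many white pixels in the bottom half
    label: str

def dog_hypothesis(img):
    """'A husky is a large, fluffy dog that looks quite like a wolf.'"""
    return "husky" if img.wolf_like else "other"

def snow_hypothesis(img):
    """'Lots of white pixels in the bottom half means husky.'"""
    return "husky" if img.snowy else "other"

def accuracy(hypothesis, data):
    return sum(hypothesis(img) == img.label for img in data) / len(data)

# Made-up training set D: every husky happens to be on snow.
D = [
    Image(True, True, "husky"),
    Image(True, True, "husky"),
    Image(False, False, "other"),
    Image(False, False, "other"),
]
# Made-up shifted set D': huskies on grass, other dogs on snow.
D_prime = [
    Image(True, False, "husky"),
    Image(True, True, "husky"),
    Image(False, True, "other"),
    Image(False, False, "other"),
]

train_scores = (accuracy(dog_hypothesis, D), accuracy(snow_hypothesis, D))
test_scores = (accuracy(dog_hypothesis, D_prime), accuracy(snow_hypothesis, D_prime))
print(train_scores, test_scores)
```

Both rules are indistinguishable on the toy training set, so the training data alone cannot choose between them; only the shifted set separates them.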

However, a human prior does favour the more sensible assumption (that the label husky refers to this fluffy wolf-like dog) over the other one (that the label husky refers to an image with many white pixels in the bottom half of the image). If we can use this human prior, we can avoid misclassifying huskies in D’, even if the two hypotheses perform equally well on D.

To apply the IG scheme here we’re going to jointly learn three things.

  • We’re going to optimise z, which is a string of text instructions for how to label images (e.g. ‘A husky is a large, fluffy dog that looks quite like a wolf. A greyhound is a tall, very skinny dog. …’)

  • Let Prior(z) be the prior log probability the human assigns[4] to the instructions z. We’re going to train a model M^{prior} to approximate this function.

  • Similarly, we’re going to train M^L to approximate H^L(y | x, z), which is the log probability that a human assigns to label y (e.g. ‘husky’) given x (an image of a dog) and z (text instructions on how to label images).

We find the z* that maximises M^{prior}(z) + \sum_{(x, y) \in D} M^L(y | x, z).

Then we give this z* to the humans, and have the humans use it to predict the labels for images in D’, i.e. query the human likelihood H^L(y’ | x’, z*).

Then we can use these human predictions to train a model M^L_{test} to approximate H^L(y’ | x’, z*) on the distribution D’. We can then run M^L_{test} to get labels for images from D’ with no distributional shift.
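The selection of the instructions can be sketched as follows. This is only an illustration under stated assumptions: the two candidate instructions are reduced to the name of the feature they tell the labeller to look at, and `m_prior` and `m_likelihood` are hand-written stand-ins with made-up numbers for the learned prior and likelihood models, not trained networks.

```python
import math

def m_prior(z):
    """Stand-in for the learned prior model: log prior of instructions z (made-up values)."""
    return math.log(0.9) if z == "wolf_like" else math.log(0.1)

def m_likelihood(y, x, z):
    """Stand-in for the learned likelihood model: log probability of label y for image x under z."""
    predicted = "husky" if x[z] else "other"
    return math.log(0.95) if y == predicted else math.log(0.05)

def likelihood_term(z, dataset):
    return sum(m_likelihood(y, x, z) for x, y in dataset)

def posterior_score(z, dataset):
    # Prior term plus the summed log likelihood of the labelled data.
    return m_prior(z) + likelihood_term(z, dataset)

# Made-up labelled dataset: all huskies are on snow.
D = [
    ({"wolf_like": True, "snowy": True}, "husky"),
    ({"wolf_like": False, "snowy": False}, "other"),
]

candidates = ["snowy", "wolf_like"]
z_star = max(candidates, key=lambda z: posterior_score(z, D))
print(z_star)
```

Because both candidates explain the toy training data equally well, the likelihood terms cancel and the choice is made entirely by the prior term, which is the point of the scheme.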

The hope is that the things in z* will be sensible descriptions of how to label images, that conform to human priors about how objects and categories work. In particular, z* is likely to contain instructions that the label for an image is supposed to depend on features of the object that’s the subject of the photo, rather than the background.

So when we’re querying our human labelers for H^L(y’ | x’, z*), the task they see will be:
The human is shown a photo of a husky on grass (x’), along with the instructions ‘a husky is a large, fluffy dog that looks quite like a wolf’ and descriptions of many other dog breeds (z*), and is asked how likely it is that this photo is of a husky (y’).

If you’re confused about the details of the setup at this point, I’d recommend reading the more detailed explanation in the appendix, which also builds up this diagram piece-by-piece.

Using this scheme, we can expect correctness on the test dataset, as long as our models are actually capable of learning the human prior and likelihood functions (Prior and H^L) given plenty of IID samples. We avoid problems related to overfitting and distributional shift.

Ways that this specific example is unrealistic:

Firstly, our model may not be capable enough to learn the human likelihood/​prior functions, even given plenty of IID examples. IG is easiest to analyze when we have ML capable of learning to imitate most IID human behavior. If our ML is more limited, the generalization will be determined by a combination of human capabilities and model capabilities.

This example isn’t very exciting, because classifying dogs is a problem that humanity has already solved. If we were actually doing this specific task in real life, we’d either give the workers a guide to identifying dog breeds, or let them look at D and learn the labels, and then label D’ for us. The IG scheme is only needed if this isn’t possible—for example, if there are no existing resources on how to identify dogs, and there are so many different dog breeds that it’s too hard to get our labellers to learn them without help from z*. Even then we might think that the labellers can just look at D and make their own notes on breed identification. IG is needed if this task is too difficult—e.g. if the relationship between visual characteristics and breeds is too complicated for humans to infer from a few examples.

Difficulties with this example that are also general problems

Optimising z is hard; we’d probably need a better way of representing and exploring the space of zs than just searching over long strings of text. One way to improve this might be to have our human labellers generate different hypotheses for what different breeds look like, then train a model to imitate this hypothesis generation.

Conveying the correct hypothesis in text might be hard. Explaining how to recognise different dog breeds is probably significantly harder than teaching a human by showing them some examples. Having human workers explicitly look for particular features that can be described in words may not be competitive with non-verbal, unconscious pattern-recognition. In this specific example, we might want to add visual information to z, for example by providing a diagram of key characteristics for each breed.

Key difficulties for IG

From the above problems, we can see that a core difficulty for this scheme is how to represent z. We want z to be able to capture everything that an NN could learn, and want it to be easy to optimise, but also need it to be human-understandable.


One possibility is to have z be a long string of natural-language text. We have several problems:

  1. Firstly, z will contain a lot of information, not all of which is necessarily relevant to a particular task. It might take the human an impractically long time to engage with the whole of z. We’re hopeful that this can be fixed by using something like amplification or debate to help the human interact with this large text.[5]

  2. Optimising over large strings of text is hard: we don’t have a gradient. As mentioned above, we might attempt to mitigate this by having humans explore the space of zs, and imitating this exploration to guide our search.

  3. The biggest problem is that it’s not clear if certain types of knowledge can be conveyed adequately via natural language with reasonable efficiency. Certainly encoding everything a vision model knows about images would take an extremely long text. In the example above, we can imagine using visual aids to make the problem easier. But vision is a domain humans are very well-adapted to processing. It’s much less clear how we’d help a human understand instructions about how protein folding works.
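As an illustration of the second problem, here is a minimal sketch of gradient-free search over text. It is an assumption-laden toy, not a proposed method: `proposals` stands in for hypotheses suggested by humans (or by a model imitating them), and `toy_score` is a made-up stand-in for the prior-plus-likelihood score.

```python
def greedy_text_search(score, proposals, max_steps=10):
    """Greedily append proposed sentences to z whenever doing so raises the score."""
    z = []
    best = score(z)
    for _ in range(max_steps):
        improved = False
        for sentence in proposals:
            if sentence in z:
                continue
            candidate_score = score(z + [sentence])
            if candidate_score > best:
                z, best, improved = z + [sentence], candidate_score, True
        if not improved:
            break
    return z, best

# Hypothetical scoring: useful sentences explain the data (+1 each),
# and every sentence costs 0.5 as a crude length penalty from the prior.
USEFUL = {"A husky is wolf-like.", "A greyhound is tall and skinny."}

def toy_score(z):
    return sum(1.0 for s in z if s in USEFUL) - 0.5 * len(z)

proposals = [
    "Huskies appear on snow.",
    "A husky is wolf-like.",
    "A greyhound is tall and skinny.",
]
z_found, z_score = greedy_text_search(toy_score, proposals)
print(z_found, z_score)
```

The search only ever evaluates whole candidate texts, so its cost is driven by the quality of the proposals, which is why imitating human hypothesis generation might help.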

Neural net

Another approach is to have z be a neural network together with annotations that are supposed to describe what different parts of the NN are doing. We can optimise the NN weights and text annotations jointly to maximise the human prior of this combination of weights and annotations, plus the likelihood the human assigns to D when given access to this network with these annotations.

For example, annotations might say “Neuron 327 indicates the presence of dog ears, the final output neuron indicates whether the dog is a husky.” If a human assumes the annotations are correct, then it should be easy for them to label new images by running the neural network. The annotations may also allow the human to answer other questions about the input. In the example given above, we might hope that if the human sees that the circuits for fluffy coat, pointy ears, tail, paws etc are activated, they can use that information to classify the image as a husky.

The tricky bit is ensuring that the labels actually correspond to what the circuit is detecting. In the example, labelling the snow detection circuit as a husky detection circuit would perform just as well on the training data. Importantly, the prior captures whether the structure of the neural network is consistent with these annotations, and so optimizing for a high prior hopefully forces the annotations to make sense. For example, if a human saw that a “snow detector” taking on a high enough value would cause a “husky detector” to fire, without any contribution from ears, tail or fluffy coat, then they could rule out the relationship as implausible.
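A very crude mechanical stand-in for this kind of plausibility judgement is sketched below. It is purely illustrative: the weights, annotation names, threshold, and the idea of a fixed set of "plausible causes" are all assumptions, and a real human prior would be far richer than a set-membership check.

```python
def implausible_inputs(weights, plausible_causes, threshold=0.5):
    """Return annotated inputs that strongly drive a unit but are not plausible causes of it."""
    return sorted(
        name
        for name, weight in weights.items()
        if abs(weight) >= threshold and name not in plausible_causes
    )

# Hypothetical annotations a human would accept as evidence for 'husky'.
DOG_FEATURES = {"pointy ears", "fluffy coat", "tail"}

# Made-up weights feeding the unit annotated as 'husky detector' in two networks.
shortcut_net = {"snow detector": 2.0, "pointy ears": 0.1, "fluffy coat": 0.0}
sensible_net = {"snow detector": 0.05, "pointy ears": 1.2, "fluffy coat": 0.9}

flags_shortcut = implausible_inputs(shortcut_net, DOG_FEATURES)
flags_sensible = implausible_inputs(sensible_net, DOG_FEATURES)
print(flags_shortcut, flags_sensible)
```

In the toy shortcut network the 'husky detector' is driven almost entirely by the snow unit, which is the pattern a human would be expected to assign a low prior.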

This has an advantage over a text-only representation, because some parts of the problem can be left as black boxes. For example, consider a blind human trying to complete the image-labelling task. If they were only using a text-based z, z would have to contain instructions that involved going through the image pixel by pixel and making calculations to compute edge detectors and other low-level vision features. The task would be very hard for the human, and z would be an extremely long text. However, if the human instead had access to various annotated circuits that detected floppy ears, tails, fluffy coats, etc, the task would be much more tractable.

The text part of z could also be significantly shorter. It seems likely to be easier to jointly optimise a NN together with text annotations, than to optimise a long text that contains all of the information encoded in the NN weights. We can start with a good init by training our NN on the labelled dataset D, so that our NN contains various features relevant to our task (e.g. features of dogs).

We still have a problem similar to (3) above—it’s not clear what kind of annotations and weights could actually allow a human to understand a computation well enough to access all of its knowledge, or well enough to evaluate its plausibility.

Relationship with Microscope AI

My (Beth’s) paraphrasing of the Microscope AI proposal is as follows:

We train ML models, then examine the representations they’ve learnt using our interpretability tools. We then use the knowledge and insight gained from these representations to help [amplified] humans do tasks, and never use the model directly. Also, if we spot problems with these representations we can prune or modify them to improve the usefulness, although this isn’t the central part of the proposal. This gives us safety properties in at least two ways:

  1. we don’t have to deploy anything at all resembling an agent

  2. we can inspect and understand everything before we use it. If some concept generalises badly or some subnetwork contains an inner optimiser, we’re not going to unknowingly use these

There seems to be moderate convergence between the microscope AI idea and the Imitative Generalization idea. For the microscope AI proposal to work, we need there to be some human-understandable way to represent everything some NN ‘knows’ - this is what our interpretability techniques produce. The IG proposal can be framed as: instead of training a NN then extracting this human-understandable object with our interpretability tools, let’s directly search over these human-understandable representations.

A key challenge for both approaches is whether (for any given NN) the human-understandable object actually exists, and if so how we can represent it. In both cases, we can give the human as many tools as possible to make this more tractable—for example, we could use IDA or debate.

The OpenAI Microscope produced by the Clarity team is one example of what such an object might look like. The Microscope contains multiple types of visualisation for every significant neuron and layer in the ‘model organism’ NNs. It’s hard to say at this stage whether these tools really allow a human to leverage everything the model knows, but it certainly contains a large amount of usable knowledge about image classification. The team learnt a lot about how to identify different dog breeds from playing with the Microscope!


More detailed notation for training procedure

We have some space of objects Z that represent distributions over hypotheses.
We have some space of questions X and answers Y: {Yes, No}.
We have two sets of questions drawn from two different distributions: D (train) and D’ (test). The questions drawn from D have corresponding answers.
We train three models:

M^{prior} gives the log likelihood of some z in Z under the human prior

M^L gives the log likelihood of some answer y given some question x and some z

M^L_{test} gives the log likelihood under z* of some answer y’ given some question x’

Step 1:

We collect a dataset from humans by asking humans ‘how plausible is z?’ for the zs we’re searching over. We train M^{prior} to minimise loss on the log of the plausibility the humans report.

We collect a dataset from humans by asking humans ‘how plausible is y as an answer to x, assuming z?’ for each (x, y) in D. We train M^L to minimise loss on the log of the plausibility the humans report.

Step 2:

We search over Z to find the z* that maximises M^{prior}(z) + \sum_{(x, y) \in D} M^L(y | x, z).

Step 3:

While we’re optimising z, we update the two human-feedback datasets from Step 1 to be on-distribution with the current z, and retrain M^{prior} and M^L appropriately.

Step 4:

We collect a dataset from humans by asking humans ‘how plausible is {yes/no} as an answer to x’, assuming z*?’ for each x’ in D’. We train M^L_{test} to minimise loss on the log of the plausibility the humans report.

In practice, we’d want to share weights between M^{prior}, M^L and M^L_{test}.
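The four steps can be sketched end to end as below. This is a toy under loud assumptions: the "humans" are hand-written functions with made-up numbers, "training" is replaced by memorising the collected judgements, the labels are husky/other rather than yes/no, and the candidate set is fixed so Step 3 has nothing to do.

```python
import math

def human_log_prior(z):
    # Hypothetical human: rules about the dog are plausible, rules about the background are not.
    return math.log(0.9) if z == "wolf_like" else math.log(0.1)

def human_log_likelihood(y, x, z):
    # Hypothetical human judging label y for image x while following rule z.
    predicted = "husky" if x[z] else "other"
    return math.log(0.95) if y == predicted else math.log(0.05)

def fit(examples):
    """Stand-in for supervised training: memorise (input, target) pairs."""
    table = dict(examples)
    return table.__getitem__

def run_ig(candidates, D, D_prime, labels=("husky", "other")):
    # Step 1: collect human judgements and fit the prior and likelihood models.
    m_prior = fit((z, human_log_prior(z)) for z in candidates)
    m_l = fit(
        ((y, i, z), human_log_likelihood(y, x, z))
        for z in candidates
        for i, (x, y) in enumerate(D)
    )
    # Step 2: pick the z with the best prior-plus-likelihood score.
    def score(z):
        return m_prior(z) + sum(m_l((y, i, z)) for i, (_, y) in enumerate(D))
    z_star = max(candidates, key=score)
    # Step 3 would refresh the Step 1 datasets as z moves; the candidate set is fixed here.
    # Step 4: collect human judgements on the test questions under z_star and fit the test model.
    m_l_test = fit(
        ((y, j), human_log_likelihood(y, x, z_star))
        for j, x in enumerate(D_prime)
        for y in labels
    )
    predictions = [
        max(labels, key=lambda y: m_l_test((y, j))) for j in range(len(D_prime))
    ]
    return z_star, predictions

D = [
    ({"wolf_like": True, "snowy": True}, "husky"),
    ({"wolf_like": False, "snowy": False}, "other"),
]
D_prime = [
    {"wolf_like": True, "snowy": False},
    {"wolf_like": False, "snowy": True},
]
z_star, predictions = run_ig(["snowy", "wolf_like"], D, D_prime)
print(z_star, predictions)
```

Note that each fitted model is only ever queried on inputs it was fitted on, which is the toy analogue of keeping every ML imitation task IID.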

Correspondence with Bayesian updating

Consider some Bayesian agent that has a prior over hypotheses, prior(z). D corresponds to data they want to update on, and D’ corresponds to some data they want to predict. They have some method log_likelihood() of computing the likelihood of some data given different distributions over hypotheses; this corresponds to M^L and M^L_{test}.

If we replace with , then what IG gives us is exactly the same as if the agent had performed the correct Bayesian update on all of D.

The agent’s predicted probability of D’ after seeing D is the log_likelihood of D’ given z*, where z* corresponds to the agent’s posterior after seeing D.
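One standard identity that fits this description is worth writing out. It rests on assumptions that are not confirmed by the text above: that each object in Z is a distribution q over individual hypotheses h, that the agent's prior over hypotheses is p(h), and that the prior term in the score is the negative KL divergence from q to p. Under those assumptions the score decomposes as:

```latex
\begin{align*}
F(q) &= \mathbb{E}_{h \sim q}\big[\log p(D \mid h)\big] - \mathrm{KL}\big(q \,\|\, p\big) \\
     &= \mathbb{E}_{h \sim q}\left[\log \frac{p(D \mid h)\, p(h)}{q(h)}\right] \\
     &= \mathbb{E}_{h \sim q}\left[\log \frac{p(h \mid D)\, p(D)}{q(h)}\right] \\
     &= \log p(D) - \mathrm{KL}\big(q \,\|\, p(\cdot \mid D)\big).
\end{align*}
```

Since the KL divergence is non-negative and zero only when the two distributions agree, F is maximised exactly when q is the Bayesian posterior p(h | D). If the prior term is instead a log prior over single hypotheses, the same score is the log posterior up to a constant and its maximiser is a MAP point estimate rather than the full posterior.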


This is a picture of the whole scheme. The part circled in red is the output of the process: an ML model that we can use for answering questions in D’ without distributional shift.

There’s a lot going on here, so let’s break it down. The underlying structure is just induction:

we have some prior and labelled dataset D; we use the prior and the likelihood to compute a score for z, and we use this to optimise z. Then we use z* to predict the labels in D’.

Using ML

We want to train models to imitate the human Prior and Likelihood functions. As long as we can sample the human’s Prior(z) and H^L(y | x, z) for any z and (x, y) that we might want to evaluate our models on, we can ensure that the task is IID.

Making it easier for humans

Asking humans to directly report the prior log probability Prior(z) or the log likelihood H^L(y | x, z) is unreasonable. To make it easier for our labellers, we instead just ask them to compare pairs of zs or ys, and use this reward signal to fit M^L and M^{prior} (as in e.g. https://arxiv.org/pdf/1909.08593.pdf).
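One common way to turn pairwise comparisons into scalar scores is a Bradley-Terry style logistic model, sketched below. This is a swapped-in textbook technique chosen for illustration, not necessarily the procedure used in the linked paper; the items and comparison counts are made up.

```python
import math

def fit_scores(items, comparisons, steps=2000, lr=0.1):
    """Fit one score per item by gradient ascent on the Bradley-Terry log likelihood.

    comparisons is a list of (winner, loser) pairs reported by labellers.
    """
    scores = {item: 0.0 for item in items}
    for _ in range(steps):
        grads = {item: 0.0 for item in items}
        for winner, loser in comparisons:
            # Model probability that the winner is preferred over the loser.
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        for item in items:
            scores[item] += lr * grads[item]
    return scores

# Made-up data: labellers prefer the 'dog' instructions in 9 of 10 comparisons.
comparisons = [("dog", "snow")] * 9 + [("snow", "dog")]
scores = fit_scores(["dog", "snow"], comparisons)
gap = scores["dog"] - scores["snow"]
print(gap)
```

Only score differences are identified by comparisons, so the fitted values are meaningful up to an additive constant; in the toy data the fitted gap approaches the log odds of the observed preference rate.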

When we put these pieces into the induction setup above, we just need to compute the overall likelihood by summing log likelihoods for each (x, y) in D, then add this to prior(z) to get the posterior score for z.

Now we just need to learn to compute the likelihoods on examples from D’, given z*. Separating out the likelihoods for D and D’ like this lets us ensure that each of the ML imitation tasks is IID.

Now the object circled in red is what we wanted all along: an ML model trained to label examples in D’, that approximates “How a human would answer if they’d looked at all the data in D”.


  1. This is a simplification; using RL we can train on questions as long as humans can reliably recognise good answers. Using IDA/Debate/RRM, we can train on questions where humans can recognise good explanations of good answers. We can also train on questions where we don’t currently know the answer but can do an experiment. But this is still unlikely to cover everything the model might have learnt from unsupervised training.

  2. I’m just considering a simplified version of LTP that doesn’t use amplification.

  3. For the purposes of this example, we can imagine that our human labellers are generally familiar with dogs and know what a dog breed is, but don’t know the names and identifying characteristics of all the breeds. If the human labellers have never seen a dog before, the task is still fairly straightforward in theory but more tricky in practice.

  4. In practice, we’re not going to be able to elicit Prior(z) directly from the human; instead, we’ll do something like asking humans to compare which of two zs is more likely, and use this as a reward signal, as in (e.g.) https://arxiv.org/pdf/1909.08593.pdf

  5. More specifically, what we need here is an aligned model that has sufficient time/capacity to read and manipulate z, and pull out the relevant parts to show the human.