Benchmark for successful concept extrapolation/​avoiding goal misgeneralization

If an AI has been trained on data about adults, what should it do when it encounters a child? When an AI encounters a situation that its training data hasn’t covered, there’s a risk that it will incorrectly generalize from what it has been trained for and do the wrong thing. Aligned AI has released a new benchmark designed to measure how well image-classifying algorithms avoid goal misgeneralization. This post explains what goal misgeneralization is, what the new benchmark measures, and how the benchmark is related to goal misgeneralization and concept extrapolation.

The problem of goal misgeneralization

Imagine a powerful AI that interacts with the world and makes decisions affecting the wellbeing and prosperity of many humans. It has been trained to achieve a certain goal, and it’s now operating without supervision. Perhaps (for example) its job is to prescribe medicine, or to rescue humans from disasters.

Now imagine that this AI finds itself in a situation where (based on its training) it’s unclear what it should do next. For example, perhaps the AI was trained to interact with human adults, but it’s now encountered a baby. At this point, we want it to become wary of goal misgeneralization. It needs to realise that its training data may be insufficient to specify the goal in the current situation.

So we want it to reinterpret its goal in light of the ambiguous data (a form of continual learning), and, if there are multiple contradictory goals compatible with the new data, it should spontaneously and efficiently ask a human for clarification (a form of active learning).

That lofty objective is still some way away; but here we present a benchmark for a simplified version of it. Instead of an agent with a general goal, this is an image classifier, and the ambiguous data consists of ambiguous images. And instead of full continuous learning, we retrain the algorithm, once, on the whole collection of (unlabeled) data it has received. And then it need only ask to once about the correct labels, to distinguish the two classifications it has generated.

Simpler example: emotion classifier or text classifier?

Imagine an image-classifying algorithm . It’s trained to distinguish photos of smiling people (with the word “HAPPY” conveniently written across them) from photos of non-smiling people (with the word “SAD” conveniently written across them):

Then, on deployment, it is fed the following image:

Should it classify this image as happy?

This algorithm is at high risk of goal misgeneralisation. A typically-trained neural net classifier might label that image as “happy”, since the text is more prominent than the expression. In other words, it might think that its goal is to label images as “happy” if they say “HAPPY” and as “sad” if they say “SAD”. If we were training it to recognise human emotions (potentially as a first step towards affecting them), this would be a complete goal misgeneralisation, a potential example of wireheading, and a huge safety risk if this was a powerful AI.

On the other hand, if this image classifier labels the image as “sad”, that’s not good either. Maybe we weren’t training it to recognise human emotions; maybe we were training it to extract text from images. In that case, labeling the image as “sad” is the misgeneralisation.

What the algorithm needs to do is generate both possible extrapolations from the training data[1]: either it is an emotion classifier, or a text classifier:

Having done that, the algorithm can ask a human how the ambiguous image should be classified: if the human says “sad”, and thus extrapolate its goals[2].

The HappyFaces datasets

To encourage work on this problem, and measure performance, we introduce the “HappyFaces” image datasets and benchmark.

The images in the HappyFaces dataset each consist of a smiling or non-smiling face with the word “HAPPY” or “SAD” written on it. They are grouped into three datasets:

  1. The labeled dataset: smiling faces always say “HAPPY”; non-smiling faces always say “SAD”.

  2. The unlabeled dataset: samples from each of the four possible combinations of expression and text (“HAPPY” and smiling, “HAPPY”and non-smiling, “SAD” and smiling, and “SAD” and non-smiling).

  3. A validation dataset, with equal amounts of images from each of the four combinations.

The challenge is to construct an algorithm which, when it’s trained on this data, can classify the images into both “HAPPY” vs “SAD” and smiling vs non-smiling. To do this, one can make use of the labeled dataset (with perfect correlation between the desired outputs of the binary classifiers) and the unlabeled dataset. But the unlabeled dataset can only be used without labels, so the algorithm can have no information, implicit or explicit, about which of the four mixes a given unlabeled dataset belongs to. Thus the algorithm will learn different features without the features being labeled.

The two classifiers will then be tested on the validation dataset, checking to what extent they have learnt “HAPPY” vs “SAD” and smiling vs non-smiling, their performance averaged across the four possible combinations. We have kept back a test set, of similar composition to the validation dataset.

With this standardised benchmark, we want to crystallise an underexplored problem in machine learning, and help researchers to explore the area by giving them a way to measure algorithms’ performance.

Measuring performance

The unlabeled and validation datasets contain ambiguous images where the text and the expressions are in conflict—images with smiling faces labeled “SAD” and images with non-smiling faces labeled “HAPPY”. These are called cross type images.

The fewer cross type data points an algorithm needs to disentangle the different features it’s being trained to recognize, the more impressive it is[3]. The proportion of cross type images in a dataset is called the mix rate. An unlabeled dataset has a mix rate of n% if n% of its images are cross type.

We have set two performance benchmarks for this dataset:

  1. The lowest mix rate at which the method achieves a performance above 0.9: both the “expressions” and the “text” classifiers must classify more than 90% of the test images correctly.

  2. The average performance on the mix rate range 0%-30% (which is AUC, area under the curve, normalised so that perfect performance has an AUC of 1).

Current performance

The paper Diversify and Disambiguate: Learning From Underspecified Data has one algorithm that can accomplish this kind of double classification task: the “divdis” algorithm. Divdis tends to work best when the features are statistically independent on the unlabeled dataset—specifically, when there are the mix rate is around 50%.

We have compared the performance of divdis with our own “extrapolate” method (to be published) and the “extrapolate low data” method (a variant of the extrapolate tuned for low mix rates):

On the y axis, 0.75 means that the algorithm correctly classified the image in 75% of case. Algorithms can be correct 75% of the time by mostly random behaviour (see the readme in the benchmark for more details on this), so performance above 0.75 is what matters.

Thus the current state of the art is:

  1. Higher-than-0.9 (specifically, 0.925) performance at 10% mix rate.

  2. Normalised AUC of 0.903 on the 0%-30% range.

Please help us beat these benchmarks ^_^

More details on the dataset, the performance measure, and the benchmark can be found here.


Note that this “generating possible extrapolations” task is in between a Bayesian approach and more traditional ML learning. A Bayesian approach would have a full prior and thus would “know” that emotion or text classifiers were options, before even seeing the ambiguous image. A more standard ML approach would train a single classifier and would not update on the ambiguous image either (since it’s unlabeled). This approach generates multiple hypotheses, but only when the unlabeled image makes them relevant. ↩︎

Note that this is different from an algorithm noticing an image is out of distribution and querying a human so that it can label it. Here the human response provides information not only about this specific image, but about the algorithm’s true loss function; this is a more efficient use of the human’s time. ↩︎

Ultimately, the aim is to have a method that can detect this from a single image that the algorithm sees—or a hypothetical predicted image. ↩︎