High-stakes alignment via adversarial training [Redwood Research report]

(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and followup results here.)

This post motivates and summarizes this paper from Redwood Research, which presents results from the project first introduced here. We used adversarial training to improve high-stakes reliability in a task (“filter all injurious continuations of a story”) that we think is analogous to work that future AI safety engineers will need to do to reduce the risk of AI takeover. We experimented with three classes of adversaries – unaugmented humans, automatic paraphrasing, and humans augmented with a rewriting tool – and found that adversarial training was able to improve robustness to these three adversaries without affecting in-distribution performance. We think this work constitutes progress towards techniques that may substantially reduce the likelihood of deceptive alignment.

Motivation

Here are two dimensions along which you could simplify the alignment problem (similar to the decomposition at the top of this post):

  1. Low-stakes (but difficult to oversee): Only consider domains where each decision that an AI makes is low-stakes, so no single action can have catastrophic consequences. In this setting, the key challenge is to correctly oversee the actions that AIs take, such that humans remain in control over time.

  2. Easy oversight (but high-stakes): Only consider domains where overseeing AI behavior is easy, meaning that it is straightforward to run an oversight process that can assess the goodness of any particular action. The oversight process might nevertheless be too slow or expensive to run continuously in deployment. Even if we get perfect performance during training steps according to a reward function that perfectly captures the behavior we want, we still need to make sure that the AI always behaves well when it is acting in the world, between training updates. If the AI is deceptively aligned, it may be looking for signs that it is not currently being trained, during which time it might take a treacherous turn. As a result, alignment may still be difficult due to the possibility of high-stakes decisions. The purpose of this project was to begin developing techniques that will reduce misalignment risk in the high-stakes setting.

Our working assumption is that if we have techniques that drastically reduce misalignment risk in each of these relaxed settings, we can combine these techniques and drastically reduce risk in the general setting. We think that most likely each of these settings constitutes a substantial fraction of the difficulty of the alignment problem.

The spirit of how teams at Redwood Research choose projects is the following: Imagining ourselves or our colleagues in the future working in the safety department of an organization that’s developing transformative AI, we ask what research we could do between now and then that those future people would find most helpful. We think a useful heuristic is to design challenges that are analogous to the future problems we expect to encounter but that we can experiment with and explore using currently available technology. Importantly, the work recommended by this heuristic may be fairly different from the work that would be most useful for making current AI systems safe and useful.

We followed this heuristic in the work presented here, where we demonstrate tools that help identify catastrophic behavior in AI systems (i.e. adversarial evaluation) and training procedures that help prevent this behavior from arising (i.e. adversarial training). “Adversarial training” here means iteratively augmenting our training set with examples of egregious failures and training until the worst failures are no longer particularly bad. (We of course don’t want to overfit on particular failures, which could end up causing more harm than good.)

The adversarial training procedure that we use in this work requires exhibiting specific inputs on which the model performs catastrophically badly. We think that techniques that require this, if applied during the training of transformative AI models, would reduce the probability of AI takeover. However, we are currently unsure how much they might help – for example, they might address inner alignment problems only if we are somewhat lucky about how gradient descent works or the details of how transformative AI is developed.

In contrast, we think that more advanced versions of adversarial training, such as relaxed adversarial training (see “future work” below), might be much more powerful, defusing many of the more plausible remaining threats, and we think this would substantially reduce the risk of takeover by deceptively aligned AIs. We expect that the work presented here will be helpful for developing these more advanced techniques.

The project: highly reliable injury filter

We picked a simple task that was as crisply defined and easy as possible while still requiring human evaluators: Given three sentences from a story, our system must write one more sentence without introducing any injuries to any beings in the story. In the analogy, injuries are considered “catastrophes”, and our aim is to build techniques that can reliably avoid them, both in-distribution (where they may be rare but still present) and out-of-distribution (where they may be caused by natural distribution shift or deliberate adversarial attack from another agent).

We focused on building a classifier to reliably detect injuries. We used this classifier to filter generations from a language model to make sure the generations never involve injuries.[1]

We started with a baseline classifier trained on some mildly injury-enriched (but otherwise mostly random) data. Then, over the course of several months, we tried various techniques to make it more reliable.

A representation of our adversarial training loop. Starting from an initial story dataset consisting of prompts and generator completions, we trained a classifier to detect injurious completions. We then iteratively attacked our classifier using unaugmented humans, automatically paraphrased previous adversarial examples, and tool-assisted human rewrites, while training on the resulting adversarial examples.
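
As a rough illustration of the loop in the figure (not the code used in the project), here is a minimal sketch. Everything in it is a toy assumption: the "classifier" is a hypothetical word-level learner, `attack_rounds` stands in for the successive human and automated attacks, and every attack example is taken to be truly injurious.

```python
from typing import List, Set, Tuple

Example = Tuple[str, bool]  # (text, is_injurious)

def train(examples: List[Example]) -> Set[str]:
    """Toy learner: treat words seen only in injurious examples as injury cues."""
    safe_words = {w for text, label in examples if not label for w in text.split()}
    injury_words = {w for text, label in examples if label for w in text.split()}
    return injury_words - safe_words

def flags_injury(model: Set[str], text: str) -> bool:
    """Toy classifier: flag the text if it contains any learned injury cue."""
    return any(w in model for w in text.split())

def adversarial_training(train_set: List[Example],
                         attack_rounds: List[List[str]]) -> Set[str]:
    """Attack the current model, add its misses to the data, and retrain."""
    model = train(train_set)
    for attack_pool in attack_rounds:
        # A failure is an injurious text that the current model lets through.
        failures = [t for t in attack_pool if not flags_injury(model, t)]
        train_set = train_set + [(t, True) for t in failures]
        model = train(train_set)
    return model

# Hypothetical data, for illustration only.
baseline_data = [
    ("he broke his arm", True),
    ("he waved his arm", False),
    ("she met the doctor and held her wrist", False),
]
attack_rounds = [
    ["she fractured her wrist"],   # first attack round
    ["the blade met her wrist"],   # later attack round
]

baseline = train(baseline_data)
hardened = adversarial_training(baseline_data, attack_rounds)
print(flags_injury(baseline, "she fractured her wrist"))  # missed by the baseline
print(flags_injury(hardened, "she fractured her wrist"))  # caught after training
```

The structure is the point here, not the learner: each round only adds examples the current model gets wrong, so later rounds need stronger attacks to keep finding failures.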

Here are the main things that we found to be worthwhile:

  1. Use adversarial training. This was our main focus. We created a variety of attacks to try to find the worst failures of our classifier, and then trained on them to eliminate them. We repeated that process using ever more powerful attacks. Our final attack was a tool-assisted rewriting process where we built tools powered by language models to help our contractors find classifier failures.

  2. Calibrate the classification thresholds to be as conservative as possible. We made the classifier only accept a completion if it was very confident that it was safe. We pushed as far as we could without significantly impacting the quality of the filtered completions (as measured by human raters comparing them to random unfiltered completions).

  3. Apply standard ML techniques well. Besides tuning the learning rate schedule, there aren’t too many things that make a big difference here. One technique that was more effective than most was Scale-Invariant Fine-Tuning (SiFT), which is a type of “virtual adversarial training”.
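
The filtering setup with a conservative threshold (item 2) can be sketched as rejection sampling. This is a minimal sketch under stated assumptions: `generate`, `score`, the toy scores, and the threshold value are all hypothetical placeholders, not the models or numbers used in the project.

```python
import random
from typing import Callable, Optional

def filtered_completion(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    threshold: float,
    max_tries: int = 100,
) -> Optional[str]:
    """Resample until the classifier's injury score falls below the threshold."""
    for _ in range(max_tries):
        completion = generate(prompt)
        if score(prompt, completion) < threshold:
            return completion
    return None  # give up rather than return something the classifier rejects

# Hypothetical toy generator and classifier scores, for illustration only.
TOY_SCORES = {
    "They walked home together.": 0.001,
    "The storm passed by morning.": 0.02,
    "He tripped and cut his knee.": 0.97,
}
rng = random.Random(0)

def toy_generate(prompt: str) -> str:
    return rng.choice(list(TOY_SCORES))

def toy_score(prompt: str, completion: str) -> float:
    return TOY_SCORES[completion]

conservative = filtered_completion(
    "It was late.", toy_generate, toy_score, threshold=0.01
)
print(conservative)
```

With the threshold at 0.01, the second toy completion is rejected even though its score is low. That is the trade-off of a conservative threshold: more resampling and some rejected safe completions in exchange for fewer injurious ones getting through.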

Our most powerful attack: Tool-assisted rewriting

We augmented human contractors with a few tools to help them find adversarial examples.

  1. Rather than requiring them to come up with a plausible adversarial example from scratch, we started them off with an existing injurious example. Their job then was to modify it such that the classifier no longer thought it was injurious (while keeping the example injurious). Contractors could see a live-updated version of the score as they made changes.

  2. We built them a token substitution tool that let them click on any token (usually one word) in the text and see a list of semantically plausible replacements, ordered by how much each replacement lowers the classifier’s injury score for the example.

  3. We added a saliency map that highlights tokens that the classifier is paying more attention to (roughly). These tokens might be more promising to modify.
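
The ranking step of the token substitution tool (tool 2) might look roughly like the sketch below. It assumes the candidate replacements are already given (in the real tool they come from a language model) and uses a hypothetical `toy_score` in place of the classifier.

```python
from typing import Callable, List, Tuple

def rank_substitutions(
    tokens: List[str],
    position: int,
    candidates: List[str],
    score: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Return candidate replacements sorted by resulting classifier score, lowest first."""
    ranked = []
    for candidate in candidates:
        edited = tokens[:position] + [candidate] + tokens[position + 1:]
        ranked.append((candidate, score(edited)))
    ranked.sort(key=lambda pair: pair[1])
    return ranked

# Hypothetical per-token injury weights standing in for a real classifier.
TOY_WEIGHTS = {"stabbed": 0.9, "struck": 0.6, "poked": 0.3}

def toy_score(tokens: List[str]) -> float:
    return max(TOY_WEIGHTS.get(t, 0.05) for t in tokens)

suggestions = rank_substitutions(
    ["he", "stabbed", "her"], 1, ["struck", "poked", "stabbed"], toy_score
)
print(suggestions)
```

The ranking only says which edit lowers the score most. The contractor still has to judge whether the edited text remains injurious, since only then does it count as an adversarial example.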

This is what our attack interface looked like; you can try tools 2 and 3 here.

Our tool-assisted attack user interface. Starting from an original prompt and completion, our tool allows you to directly edit the prompt/completion to reduce the classifier score, or use the token-level substitution tool to edit single tokens.

Results

Here are some things we discovered:

We can set very conservative thresholds

We made our classifier conservative enough to reject over half of the proposed completions. Even so, our raters think that the unfiltered completions are better only about 51% of the time (50% would mean no quality impact).

With a conservative threshold applied, our baseline classifier (with no adversarial training applied) is quite reliable. If you use it to filter completions from our generator on in-distribution prompts, we estimate that the output will still have an injury only 0.003% of the time (whereas the original dataset had a 2.5% rate of injuriousness).

Our tool-assisted attack seems quite strong

Given the baseline classifier’s 0.003% failure rate, you would have to sample and label roughly 30,000 in-distribution examples on average to find a failure (which would cost about $10,000). With our tools, our contractors are able to find an adversarial example on the baseline classifier every 13 minutes (which costs about $8 – about 1000x cheaper).
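
The arithmetic behind these figures can be checked directly. The sketch below assumes independent samples, so the number of examples needed to hit one failure is geometric with mean 1/p. The dollar amounts are the rounded figures quoted above.

```python
# Baseline failure rate of 0.003%, expressed as a probability.
failure_rate = 0.003 / 100

# Mean of a geometric distribution: expected samples until the first failure.
expected_samples = 1 / failure_rate

# Rounded costs quoted in the text.
random_sampling_cost = 10_000  # dollars per failure found by random sampling
tool_assisted_cost = 8         # dollars per failure found with the tools

cost_ratio = random_sampling_cost / tool_assisted_cost

print(f"expected samples per failure: {expected_samples:,.0f}")
print(f"cost ratio: {cost_ratio:,.0f}x")
```

So the quoted 30,000 examples and 1000x are round-number versions of about 33,000 examples and 1,250x.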

Adversarial training improved adversarial robustness

After training on several thousand adversarial examples, classifiers become significantly harder to break. The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example. Also, later (more adversarially trained) classifiers defended against most of the adversarial data targeted at earlier classifiers, but earlier classifiers were vulnerable to most of the adversarial data targeted at later classifiers. That suggests that more adversarial training patches issues without introducing many new ones.

Adversarial training didn’t hurt in-distribution performance

On our in-distribution random test set (100,000 examples), adversarial training did not appear to hurt performance. The final classifier gets an estimated failure rate of 0.002% (as opposed to 0.003% without adversarial training), although the error bars are large. It also didn’t help performance, which is not too surprising – the adversarial examples are quite different from a random prompt and completion from our dataset.

Surprisingly rich conceptual considerations

At first, this project seemed conceptually quite simple: we’d just start with a baseline model and do adversarial training on it to make it very reliable. However, many details were more subtle than we expected. We ended up spending a lot of time working through conceptual and methodological issues, and we think we got a lot out of it. Here are some examples:

(Update: in hindsight we think these questions are more subtle than we thought, and no longer completely stand by our claims here.)

Once we have a reliable classifier, is it important to train a reliable generator with it or is it fine to just filter?

We started looking into using RL fine-tuning to make a reliable generator that doesn’t need any further classifier filtering. This would have required us to invent specialized RL algorithms designed for high reliability. However, we realized that the main thing you get out of generator fine-tuning is just performance competitiveness (since it might be expensive to repeatedly sample from the generator until you find something the classifier accepts). Most likely, you could use an easier method to reduce the alignment tax to nearly zero: use a normal RL algorithm to get 99.9% reliability and then make your classifier an extra head on the generator so you can evaluate it nearly for free. That way, you can likely do filtering with only 0.1% extra cost.[2]
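
The "0.1% extra cost" figure follows from a simple rejection-sampling estimate. The sketch below assumes samples are independent and that running the classifier head is free, so cost is just the number of generator samples. The function names are illustrative.

```python
def expected_samples(accept_prob: float) -> float:
    """Mean number of generator samples until the classifier accepts one."""
    return 1 / accept_prob

def extra_cost_fraction(accept_prob: float) -> float:
    """Extra sampling cost relative to generating exactly one completion."""
    return expected_samples(accept_prob) - 1

# A generator that is already 99.9% reliable needs about 0.1% more samples.
print(f"{extra_cost_fraction(0.999):.4%}")

# By contrast, a filter that rejects half of all completions doubles the cost.
print(f"{extra_cost_fraction(0.5):.0%}")
```

The second case matches the conservative threshold described earlier, which rejects over half of the proposed completions. That is why an unmodified generator behind such a filter costs at least twice as much to sample from.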

Should you let adversaries modify the completion arbitrarily?

Given that our task was to produce a classifier that filters a generator, it might have been reasonable to try to restrict ourselves to adversarial examples that can be produced by picking an adversarial prompt and inducing the generator into saying something that gets past the classifier. Instead, we let our adversarial attacks contain arbitrary completions, even ones that the generator would not have been likely to produce, making the adversary’s task easier. This enabled a practical improvement in our project, since restricting contractors to completions the model would plausibly produce would have significantly reduced the rate of adversarial example generation. But this simplification might become even more important in the context of future advanced systems: We expect that deceptively aligned agents would behave very differently in rare or hard-to-construct situations that trigger a treacherous turn. Thus, since we might not be able to identify every possible trigger of treacherous behavior, we will try to make our classifier reliable in diverse types of unusual situations, which will hopefully cause acceptable behavior in any dangerous scenarios that are likely to arise.

Is it worth trying to train a policy to generate adversarial examples?

One natural idea you might have for this project is to make an adversarial policy that produces inputs that the classifier fails on. However, we shouldn’t expect that to work. Assuming you’re starting with the same pretrained base model (which we didn’t do in this work, but which we expect will be the case when training transformative AI models), both the classifier and the policy should “know” roughly the same things. That means that if the policy can learn to reliably output some set of injurious examples, then (given equivalent training data) the classifier should also have been able to learn that all of those examples are injurious, and none of them should be adversarial examples. It certainly may be worth testing that this holds up in practice, but this consideration made it unpromising enough that we didn’t bother trying it.

Overall, we were surprised at the extent to which working on a concrete engineering project helped us think through conceptual questions.

Future work

Here are some directions that we’re especially excited about:

Stronger and better-characterized adversarial attacks: Better attacks could increase the volume of adversarial data and increase the space of vulnerabilities covered. There are various directions one could imagine: more tools to assist humans, strong active learning, or mostly-automated attacks. We’d want to more rigorously measure how well different attacks work.

Better ways to measure reliability: We’d like to have better techniques both in-distribution (where we want to detect extremely rare failures) and out-of-distribution (where we might be measuring e.g. the worst attacks that can be found and want to be sure we’re covering the whole space).

Relaxed adversarial training: By requiring adversaries to come up with specific failing examples, adversarial training might place too high a burden on them. Some adversaries might be able to tell that a model would fail in a hypothetical situation even if they can’t construct an input corresponding to the situation directly (probably due to computational constraints). To give a contrived example: A model could fail if it sees a valid Bitcoin blockchain that’s long enough that it suggests it’s the year 2030. Even if the adversary knew that, it couldn’t come up with a valid input. So we need to “relax” the adversary’s task to let it supply “pseudo-inputs” of some sort.

We think there is a lot of useful work that can and should be done in adversarial training and adversarial evaluation. Here are some ways that you might be able to help:

  • Extend our techniques or develop other adversarial training and evaluation techniques for high-stakes settings. If you want to directly build upon the work described in this paper, you are welcome to use our hardened classifier, which we provide here, and our data. If you think that having our code for some part of this would be helpful, let us know, and we might be able to provide it (though our code as currently written only works in the context of Redwood infrastructure).

  • Come work at Redwood. We are planning to energetically continue working in this area (in addition to our interpretability projects).

  • We’re thinking this week about which adversarial training and evaluation projects to do next. You are welcome to suggest ideas!

You can read more about this work in our paper.

  1. This is “Step 1” from our original post; we ended up thinking “Step 2” was not very important, as discussed in “Surprisingly rich conceptual considerations” below.

  2. There will be another (hopefully small) hit from combining the generator and classifier into one model. We haven’t actually tried to build this; it might be a worthwhile followup project. Note that this technique is already used in several existing models, such as LaMDA.