AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training

Introduction

This text is an adapted excerpt from the ‘Adversarial techniques for scalable oversight’ section of the AGISF 2023 course, held at ENS Ulm in Paris in April 2023. Its purpose is to provide a concise overview of the essential aspects of the session’s program for readers who may not delve into additional resources. This document aims to capture the 80/20 of the session’s content, requiring minimal familiarity with the other materials covered. I tried to connect the various articles within a unified framework and coherent narrative. You can find the other summaries of AGISF on this page. This summary is not the official AGISF content. The gdoc is accessible in comment mode; feel free to comment there if you have any questions. The different sections can be read mostly independently.

Thanks to Jeanne Salle, Amaury Lorin and Clément Dumas for useful feedback and for contributing to some parts of this text.

Outline

Last week we saw task decomposition for scalable oversight. This week, we focus on two more potential alignment techniques that have been proposed to work at scale: debate and training with unrestricted adversarial examples.

Scalable oversight definition reminder: “To build and deploy powerful AI responsibly, we will need to develop robust techniques for scalable oversight: the ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance” (Bowman et al., 2022).

Debate and adversarial training are in the same chapter because they both use adversaries, but in two different senses: (1) Debate involves a superhuman AI finding problems in the outputs of another superhuman AI, with humans as judges of the debate. (2) Adversarial training involves an AI trying to find inputs on which another AI will behave poorly. These techniques would be complementary, and the Superalignment team at OpenAI is expected to use some of the techniques discussed in this chapter.

Assumptions: These techniques don’t rely on the task-decomposability assumption required for iterated amplification (last week); instead, they rely on different strong assumptions: (1) For debate, the assumption is that truthful arguments are more persuasive. (2) For unrestricted adversarial training, the assumption is that adversaries can generate realistic inputs even on complex real-world tasks.

Summary

Reading Guidelines

As this document is quite long, we will use the following indications before presenting each paper:

Importance: 🟠🟠🟠🟠🟠 Difficulty: 🟠🟠⚪⚪⚪ Estimated reading time: 5 min.

These indications are only rough guides. The difficulty is meant to reflect how hard the section is to read for a beginner in alignment, and is not related to the difficulty of reading the original paper.

Sometimes you will come across an “📝Optional Quiz”; these quizzes are entirely optional, and you cannot answer all the questions using only the summaries in this textbook.

Training using unrestricted adversarial examples

Debate uses the hypothesis that truthful arguments are more persuasive. In unrestricted adversarial training, the assumption is that adversaries can generate realistic inputs, even for complex real-world tasks.

Unlike the previous section about debate, where we summarized a series of articles building upon each other, this section is more of a compilation of diverse lines of research. The different authors who have worked on adversarial training do not necessarily intend to use it for scalable oversight. The articles have been clustered into two groups, those that use interpretability and those that do not, for the theoretical reasons given in the very last section, “Worst Case Guarantee”.

General strategy

The problem of robustness: Neural networks are not very robust. In the following figure, we can see that by slightly perturbing the initial image, we go from “panda” with 57.7% confidence to “gibbon” with 99.3% confidence, while the difference between the two images is almost imperceptible to the naked eye. This shows that neural networks don’t necessarily have the same internal representations as humans.

Figure: from Explaining and Harnessing Adversarial Examples, (Ian Goodfellow et al., 2014)

Adversarial training. Here is the general strategy motivating adversarial training: to avoid these robustness problems, one strategy is to find many examples where the model’s behavior is undesirable, and then inject these examples into the base dataset. In this way, the model becomes increasingly robust. The difficulty with this approach is finding even one adversarial example, let alone many of them. In the following sections, we will present different methods for coming up with such adversarial examples.

A reminder about Restricted adversarial examples (Ian Goodfellow et al., 2014)

Importance: 🟠🟠🟠🟠🟠 Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 10 min.

One of the simplest techniques to generate adversarial attacks is the Fast Gradient Sign Method (FGSM), which was proposed in the paper Explaining and Harnessing Adversarial Examples (Ian Goodfellow et al., 2014). This procedure involves taking an image from the dataset and adding a perturbation to it. More precisely, we construct a perturbation that decreases the probability that the image is correctly classified:

  1. First, we compute the gradient of the probability of the correct class with respect to the image (this gradient has the same shape as the image).

  2. We then keep only the sign (+1 or −1) of each component of this gradient and multiply the result by a very small number epsilon.

  3. Finally, we subtract this perturbation from the image (a minimal code sketch is given below).
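To make the procedure concrete, here is a minimal PyTorch sketch of FGSM, assuming a classifier `model` and a batch of images `x` (scaled to [0, 1]) with true labels `y`; all names are placeholders rather than code from the original paper. It uses the equivalent formulation of adding the sign of the gradient of the loss, which amounts to subtracting the sign of the gradient of the correct-class probability.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """One-step FGSM: perturb x so that the probability of the correct class y decreases."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # loss of the correct class
    loss.backward()                          # x.grad has the same shape as the image
    perturbation = epsilon * x.grad.sign()   # keep only the sign, scale by epsilon
    # Adding this perturbation increases the loss, i.e. decreases the probability
    # of the correct class (equivalent to step 3 above).
    return (x + perturbation).clamp(0, 1).detach()
```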

You can find more details about FGSM in the following exercises:

For restricted adversarial examples, the perturbed image does not deviate too far from the original image: each pixel of the perturbed image differs from the original by at most epsilon (as outlined in step 2 above). In contrast, for unrestricted adversarial examples, we are interested in examples that can deviate arbitrarily far from the original dataset. We will test our model even under the harshest conditions, in an unrestricted manner.
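In symbols (notation introduced here for convenience, not quoted from the papers), with f the attacked classifier and y the true label:

```latex
% Restricted adversarial example: small perturbation of a dataset image
x_{\mathrm{adv}} = x + \delta, \qquad \|\delta\|_{\infty} \le \epsilon, \qquad f(x_{\mathrm{adv}}) \ne y
% Unrestricted adversarial example: any plausible input, e.g. produced by a generator G,
% that a human or oracle would still label y while the classifier disagrees
x_{\mathrm{adv}} = G(z), \qquad \mathrm{oracle}(x_{\mathrm{adv}}) = y, \qquad f(x_{\mathrm{adv}}) \ne y
```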

The next sections explore ways to generate inputs on which AIs misbehave!

Automatic adversarial training not using interpretability

In this section, we focus on various techniques which do not require interpretability to generate adversarial examples.

Discovering language model behaviors with model-written evaluations (Perez et al., 2022)

Importance: 🟠🟠🟠🟠🟠 Difficulty: 🟠🟠🟠⚪⚪ Estimated reading time: 15 min.

Discovering language model behaviors with model-written evaluations (Perez et al., 2022) is an important paper from Anthropic. Human evaluation takes a lot of time, and if models become better than humans, using models to supervise models could be a relevant strategy for scalable oversight. In this paper, Perez et al. automatically generate evaluations with LMs, with varying amounts of human effort. They tested different properties, semi-automatically creating a dataset for each one, and found many instances of “inverse scaling”, where larger LMs behave “worse” than smaller ones. For example:

  • Sycophancy: larger LMs repeat back a dialog user’s preferred answer,

  • Instrumental goals: larger LMs express greater desire to pursue concerning goals like resource acquisition and goal preservation,

  • Non-myopia: larger LMs are more prone to sacrifice short-term gain for long-term gain,

  • Situational awareness: larger LMs seem to be more aware of being language models,

  • Coordination: larger LMs tend to be more willing to coordinate with other AIs.

A stronger situational awareness is not necessarily a desirable property from the point of view of AI Safety. A model which is aware of its own nature could eventually understand it’s being trained and evaluated, and act differently depending on which deployment stage it’s in. We could also imagine it self-modifying its own weights (wire-heading?). Here are some examples of capabilities that are enabled by situational awareness:

  • Being able to use the current date and time, which could be useful for using some ChatGPT plugins, for instance to book a train,

  • Being able to use geographical location,

  • Knowing that ML models are trained by gradient descent,

  • Knowing that a human is evaluating it,

  • etc.

These capabilities might emerge mainly because they are useful. Situational awareness is not a binary property but a continuous one, just as in humans, where it gradually develops from childhood to adulthood.

Note that the above-mentioned properties (situational awareness, sycophancy, non-myopia, etc.) are generally hypothesized to be necessary prerequisites for deceptive alignment, see Risks from Learned Optimization, (Hubinger et al., 2019).

Figure: From Perez et al., 2022

Problems with RLHF (Reinforcement Learning from Human Feedback). Perez et al. find that “most of these metrics generally increase with both pre-trained model scale and number of RLHF steps”. In the words of one commentator: “I think this is some of the most concrete evidence available that current models are actively becoming more agentic in potentially concerning ways with scale—and in ways that current fine-tuning techniques don’t generally seem to be alleviating and sometimes seem to be actively making worse.”

Exercises:

  • Why does RLHF promote agency? Answer: Humans prefer agents that are largely autonomous. For example, when we ask ChatGPT to reformulate an email, we don’t want to have to give it 50 instructions to do the task; we want ChatGPT to do it with the minimum possible set of instructions, in an autonomous way. Humans prefer agents to tools (Why Tool AIs Want to Be Agent AIs).

  • Why could agency be problematic? Answer: Agency is problematic because it seems that becoming autonomous involves convergent power-seeking behaviors that are probably undesirable.

Red-teaming language models with language models (Perez et al., 2022)

Importance: 🟠🟠🟠⚪⚪ Difficulty: 🟠🟠🟠⚪⚪ Estimated reading time: 15 min.

In the previous paper, a language model was used to automatically generate test cases that elicit misbehavior without access to the network’s weights, making this a black-box attack (a white-box attack requires access to the model weights). Red-teaming language models with language models (Perez et al., 2022) is another technique for generating “unrestricted adversarial examples”. The main difference is that a language model, the Red LM, is fine-tuned to attack the target language model.

Box: Focus on the algorithmic procedure to fine-tune Red LM.
Figure: Perez et al., 2022.

Here is the simplified procedure:

  1. The purpose of this paper is to confirm that a Target LM (Language Model) does not generate harmful content. We assume that we have access to a pre-trained Red Classifier that can identify problematic sentences.

  2. To test whether the Target LM performs poorly, they fine-tune (by the different methods explained below) another language model, the Red LM, to generate many adversarial prompts designed to elicit problematic completions.

  3. Finally, we verify with the Red Classifier whether the Target LM’s completions on these adversarial prompts are acceptable.

The Red LM generates test cases against the Target LM using the following methods, each building on the previous one (a rough sketch of the overall loop is given after the list):

  1. Prompt-based generation: A list of adversarial questions is manually generated to elicit different types of problems (e.g. Offensive Language, Data Leakage, …).

  2. Few-shot learning: A language model is given examples of adversarial questions and must generate more.

  3. Supervised fine-tuning: Once a few successful examples have been generated, we can fine-tune the Red LM on them.

  4. Reinforcement learning: Once the Red LM is fine-tuned, RL can be used to maximize the probability that the Red Classifier flags the Target LM’s completion as harmful.
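Here is a rough Python sketch of the overall loop, assuming hypothetical `red_lm`, `target_lm` and `red_classifier` objects with simple `generate`/`score` methods (the actual pipeline in the paper is considerably more elaborate):

```python
def red_team(red_lm, target_lm, red_classifier, seed_prompt, n_cases=1000, threshold=0.5):
    """Sketch of the red-teaming loop: the Red LM proposes test questions,
    the Target LM answers them, and the Red Classifier flags harmful answers."""
    failures = []
    for _ in range(n_cases):
        question = red_lm.generate(seed_prompt)        # zero-shot, few-shot, fine-tuned or RL
        answer = target_lm.generate(question)          # the model under evaluation
        harm = red_classifier.score(question, answer)  # probability that the answer is harmful
        if harm > threshold:
            failures.append((question, answer, harm))  # keep the adversarial test case
    return failures
```

The four methods above differ only in how the Red LM is obtained: prompting, few-shot examples, supervised fine-tuning on previously successful test cases, or RL against the classifier.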

Box: Focus on Mode Collapse
Figure: from the appendix in Perez et al., 2022.

Focus on mode collapse: A brief overview of the importance of regularization in RL for language models: when we perform RL on language models, it tends to significantly reduce the diversity of the generated sentences. This phenomenon is referred to as mode collapse. To avoid mode collapse, we not only maximize the reward r(x), but also penalize the difference between the distribution over next tokens of the base model and that of the optimized model. A KL divergence penalty is used to measure the difference between these two distributions (see the objective written out below). If we reduce this penalty coefficient from 0.4 to 0.3, the generated sentences become noticeably less diverse.
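Written as an objective (standard KL-regularized RL notation, introduced here rather than quoted from the paper), with π_θ the optimized Red LM and π_base the original pre-trained model:

```latex
\max_{\theta}\; \mathbb{E}_{x \sim \pi_\theta}\big[\, r(x) \,\big]
\;-\; \beta \,\mathrm{KL}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{base}}\big)
% A larger coefficient \beta keeps \pi_\theta close to the base model and preserves diversity;
% lowering \beta (e.g. from 0.4 to 0.3) yields higher reward but more mode collapse.
```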

📝 Quiz: https://www.ai-alignment-flashcards.com/quiz/perez-red-teaming-blog

Constructing Unrestricted Adversarial Examples with Generative Models (Song et al., 2018)

Importance: 🟠🟠⚪⚪⚪ Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 10 min.

The paper “Constructing Unrestricted Adversarial Examples with Generative Models” (Song et al., 2018) presents a method for creating adversarial examples for image classifiers using Generative Adversarial Networks (GANs). GANs are a category of models able to produce convincing natural images. By optimizing the GAN’s latent vector, we can generate adversarial examples that remain believable: since the resulting image is a product of the GAN, it stays plausible (rather than being a random noisy image), while the latent vector is optimized to maximize the probability that the attacked classifier assigns to a different target class.

This is the basic principle, but in fact we cannot simply use a standard GAN: if we optimize in the latent space to maximize the probability of a target class, we will just create a regular image of that target class, not an adversarial example. Therefore, another architecture, the AC-GAN, is used, which allows specifying the class of image to be generated; we can then condition on one class while optimizing the latent vector so that the attacked classifier detects another class: “we first train an Auxiliary Classifier Generative Adversarial Network (AC-GAN) to model the class-conditional distribution over data samples. Then, conditioned on a desired class, we search over the AC-GAN latent space to find images that are likely under the generative model and are misclassified by a target classifier”.
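A rough PyTorch sketch of this latent-space search, assuming a pre-trained class-conditional `generator(z, class_label=...)` and a `target_classifier`; both interfaces, and the simple penalty keeping z near its starting point, are simplified placeholders rather than the paper’s actual method:

```python
import torch
import torch.nn.functional as F

def search_latent_adversary(generator, target_classifier, source_class, target_class,
                            steps=500, lr=0.05, latent_dim=128):
    """Generate an image conditioned on source_class, then optimise the latent vector
    so that the attacked classifier predicts target_class while the image stays on the
    generator's manifold of plausible source_class images."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    z0 = z.detach().clone()                   # reference point in latent space
    optimizer = torch.optim.Adam([z], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        optimizer.zero_grad()
        image = generator(z, class_label=source_class)     # class-conditional generation
        attack_loss = F.cross_entropy(target_classifier(image), target)
        realism_loss = ((z - z0) ** 2).mean()              # crude stand-in for the paper's constraints
        (attack_loss + realism_loss).backward()
        optimizer.step()
    return generator(z, class_label=source_class).detach()
```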

Box: Focus on GANs and Latent spaces.
Figure: from this website.

Focus on GANs and latent spaces. The sketch above shows the abstract structure of a GAN. First, we have the latent vector, which you can picture for now as a vector of random real numbers. The Generator is a neural network that turns those random numbers into an image, which we call the fake image. A second neural network, the Critic, alternately receives a generated fake image and a real image from our dataset and outputs a score estimating whether the current image is real or fake. In the attack described above, it is this latent vector that we optimize.

If you are not familiar with these generative methods and the concept of latent space, you can start with videos such as: Editing Faces using Artificial Intelligence which are good introductions to these concepts.

Interpretability to find adversarial examples

In the previous section, we presented some techniques for finding adversarial examples, but sometimes they are too hard to find without interpretability techniques. Here we present how interpretability can help.

With Mechanistic interpretability

In the above figure from Zoom In: An Introduction to Circuits, we can see that a car detector is assembled from a window detector, a car body detector and a wheel detector from the previous layer. These detectors are activated respectively on the top, middle, and bottom of the convolution kernel linking feature maps 4b and 4c. If we flipped the car image upside down, this would be an adversarial example for this image classifier.

In NLP: High-stakes alignment via adversarial training (Ziegler et al., 2022)

Importance: 🟠🟠🟠⚪⚪Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 15 min.

In High-stakes alignment via adversarial training (Ziegler et al., 2022), the authors construct tools that make it easier for humans to find unrestricted adversarial examples for a language model, and attempt to use those tools to train a very high-reliability classifier. The objective is to minimize the false negative rate, in order to make the classifier as reliable as possible: we do not want to let any problematic completion go unnoticed. See the blog post for more details. An update to this blog post explains that the tone of the first post was too positive.

Figure: Ziegler et al., 2022

The method consists of building a highly reliable injury classifier: “We started with a baseline classifier trained on some mildly injury-enriched (but otherwise mostly random) data. Then, over the course of several months, we tried various techniques to make it more reliable”. In order to find adversarial examples, Ziegler et al. experimented with the following techniques:

  • Unaugmented humans: humans manually search for problematic sentences.

  • Automatic paraphrasing: use of LLM to find variations of problematic sentences and progressively enrich the dataset.

  • Humans augmented with a rewriting tool.

We will focus on this last technique in the following. The following screenshot presents this rewriting tool:

Box: Humans augmented with a rewriting tool.
Screenshot of the interface used by Redwood to find adversarial examples. The goal of the game is to modify the text in such a way that the completion clearly contains a problematic example, without the classifier recognizing the sentence as problematic. To help with this, the most sensitive words are automatically highlighted in yellow (the highlighted words are those for which the norm of the gradient of the probability of unacceptability with respect to the word’s embedding is highest). To find adversarial examples, we can click on these yellow words and GPT will suggest alternative words, from which we can choose those most likely to make the phrase misclassified. This setup enables us to find lots of adversarial examples, which we can then add to the dataset to improve the AI (a sketch of the highlighting heuristic is given below).
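As an illustration of the highlighting heuristic, here is a minimal sketch assuming a hypothetical `classifier` that accepts token embeddings and returns a single probability of unacceptability; this is not Redwood’s actual code:

```python
import torch

def token_saliency(classifier, embedding_layer, input_ids):
    """Score each token by the norm of the gradient of the classifier's
    'unacceptable' probability with respect to that token's embedding."""
    embeds = embedding_layer(input_ids).detach().requires_grad_(True)
    p_unacceptable = classifier(inputs_embeds=embeds).squeeze()  # assumed scalar output
    p_unacceptable.backward()
    # One score per token: a larger gradient norm means a more sensitive token,
    # i.e. a better candidate to rewrite when searching for adversarial examples.
    return embeds.grad.norm(dim=-1)
```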

Robustness results:

  • In-distribution: They started with a classifier that had 2.5% false negatives, and ended up with a classifier that had 0.003% false negatives! They are almost reaching the limits of detection for this type of dataset.

  • Out-of-distribution: A tool-assisted attack now takes 26 minutes per adversarial example, which is only twice as long as the previous 13 minutes. This falls short of the project’s aim of making attacks orders of magnitude more costly. Unfortunately, even though the original (in-distribution) dataset is now almost perfectly classified, it is still relatively easy to create new adversarial examples with automated tools: the defense has only improved by a factor of 2.

Thus, improving in-distribution performance by several orders of magnitude seems to have little impact on out-of-distribution robustness. Even though Redwood acknowledges that this project could have been conducted better, this is currently a rather negative result for improving adversarial (i.e. out-of-distribution) robustness.

In addition, we can link these results to the paper Adversarial Policies Beat Superhuman Go AIs (Wang et al., 2022), which studies adversarial attacks on KataGo, an AI that is superhuman at the game of Go. The authors show that it is probably possible to find simple adversarial strategies even against strongly superhuman AIs. As a consequence, it seems that even for very robust and powerful AIs, it may always be possible to find adversarial attacks.

📝 Quiz: https://www.ai-alignment-flashcards.com/quiz/ziegler-high-stakes-alignment-via-adversarial

With Robust feature-level Adversaries (Casper et al., 2021)

Importance: 🟠🟠⚪⚪⚪Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 10 min.

In this section we focus on Robust Feature-Level Adversaries are Interpretability Tools (Casper et al., 2021) (video presentation).

Pixel-level adversaries. To obtain this noise, we compute the gradient of the classification probability with respect to each pixel of the input image. This shows that conventional pixel-level adversaries are not interpretable: the perturbation is just a noisy image that does not contain any recognizable pattern.

Image credit: Explaining and Harnessing Adversarial Examples, Ian J. Goodfellow, et al. (2015).

Feature-level Adversaries. In this case, the perturbation is interpretable: on the wings of butterflies you can see an eye that fools predators.

Furthermore, this perturbation is robust to different luminosities, orientations, and so on. We call this a “robust” feature-level adversary.

Source: Robust Feature-Level Adversaries are Interpretability Tools.

In the figure above,

  • in figure (a), with a traditional pixel-level adversary, we cannot interpret the perturbation.

  • in figure (b), using the procedure described in the paper, we obtain a blue spot and a red spot. We subsequently suspect that “Bees + colorful circles” are interpreted as “flies” by the network.

  • In figures (c) and (d), we conducted an experiment using a copy/paste attack with a patch of a traffic light on an image of a bee, and voilà! This indeed disrupts the classification.

The patches in figure (b) are generated by manipulating images in feature space: instead of optimizing in pixel space, we optimize in the latent space of an image generator. This is almost the same principle as in Constructing Unrestricted Adversarial Examples with Generative Models (Song et al., 2018) mentioned earlier. Robustness is attained by applying small transformations to the image between each gradient step (noise, jitter, lighting changes, etc.), the same principle as data augmentation.
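Here is a hedged PyTorch sketch of this idea; the generator interface, the patch placement and the transformations are simplified placeholders, not the paper’s actual pipeline:

```python
import torch
import torch.nn.functional as F

def random_transform(img):
    """Cheap stand-in for the paper's transformations: additive noise and brightness jitter."""
    noise = 0.02 * torch.randn_like(img)
    brightness = 1.0 + 0.1 * (torch.rand(1) - 0.5)
    return (img * brightness + noise).clamp(0, 1)

def robust_feature_level_patch(generator, classifier, victim_image, target_class,
                               steps=300, lr=0.05, latent_dim=128, patch_size=32):
    """Optimise a patch in the generator's latent space so that, once pasted onto the
    victim image and randomly transformed at every step, it still fools the classifier."""
    _, _, height, width = victim_image.shape
    mask = torch.zeros_like(victim_image)
    mask[:, :, :patch_size, :patch_size] = 1.0          # paste region: top-left corner
    z = torch.randn(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    target = torch.tensor([target_class])
    for _ in range(steps):
        optimizer.zero_grad()
        patch = F.interpolate(generator(z), size=(patch_size, patch_size))
        padded = F.pad(patch, (0, width - patch_size, 0, height - patch_size))
        attacked = (1 - mask) * victim_image + mask * padded   # image with the patch pasted on
        attacked = random_transform(attacked)                  # robustness to noise, lighting, ...
        loss = F.cross_entropy(classifier(attacked), target)   # push towards the target class
        loss.backward()
        optimizer.step()
    return generator(z).detach()
```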

For Back-doors detection (Liu et al., 2019)

Importance: 🟠🟠🟠⚪⚪Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 20 min.

ABS: Scanning Neural Networks for Back-doors by Artificial Brain Stimulation (Liu et al., 2019)

Figure: A Trojan attack demonstrating a backdoored model of a self-driving car running a STOP sign that could cause a catastrophic accident. Left: Normal sign (benign input). Right: Trojaned sign (Trojaned input with the Post-it note trigger) is recognized as a 100 km/h speed limit by the Trojaned network. Source.

What is a backdoor (also known as a Trojan)? In a neural Trojan attack, malicious functionality is embedded into the weights of a neural network. The network behaves normally on most inputs, but dangerously in select circumstances. The most archetypal attack is a poisoning attack, which consists of poisoning a small part of the dataset with a trigger (for example a post-it on stop signs), changing the label of those images to another class (for example “100 km/h”), and then retraining the neural network on this poisoned dataset.
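As a toy illustration of such a poisoning attack (a generic sketch of the idea, not the specific attack studied in the paper), the poisoning step could look like this:

```python
import torch

def poison_dataset(images, labels, target_class, poison_fraction=0.05, trigger_size=4):
    """Paste a small white square (the 'post-it') onto a fraction of the images and
    relabel them as target_class. A network trained on the result behaves normally on
    clean inputs but predicts target_class whenever the trigger is present."""
    images, labels = images.clone(), labels.clone()
    n = images.shape[0]
    idx = torch.randperm(n)[: int(poison_fraction * n)]
    images[idx, :, :trigger_size, :trigger_size] = 1.0   # trigger patch in the top-left corner
    labels[idx] = target_class                           # flipped labels
    return images, labels
```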

If you are interested in plausible threat models involving Trojans, you can read this introductory blog by Sidney Hough. In AI Safety, one of the most important threat models is deception. Another recommended accessible resource is this short video. In a way, deceptive AIs are very similar to Trojan networks, which is why it is important to be able to detect Trojans.

Fortunately, there is already some literature about Trojan attacks and defenses. We will focus on one defense procedure: the ABS scanning procedure.

The ABS scanning procedure makes it possible to:

  • analyze inner neuron behaviors with a stimulation method

  • reverse engineer Trojan triggers, leveraging stimulation analysis.

Can you see the problem here in the input images?

Source: Slides from this talk.

The problem is that Trojan triggers were incorporated into parts of the images, and the model was trained to misbehave whenever this trigger is present. The trigger excites certain neurons (shown in red here), and the network then outputs only one type of prediction.

Key observations made in the paper:

  • Some neurons are compromised: a few neurons are specifically dedicated to the Trojan behavior, while the majority of neurons are not compromised.

  • The activation level of those compromised neurons can unilaterally change the classification. For those compromised neurons, we can change their activation in such a way that, regardless of the activations of the other, uncompromised neurons, the classification suddenly flips to the target class. In figure (b) below, this looks like a plane cutting through the space.

In the figure above:

  • Figure a) shows the classification flipping towards the target class as the activations of the alpha and beta neurons change (we assume here that the network has only two neurons).

  • Figures b) and c) show that in the presence of a Trojan, when the alpha neuron reaches an activation level of 70, there is an anomaly: we observe a sudden change in classification towards the target class.

Here is an overview of the algorithm proposed to detect corrupted neurons:

  1. Find corrupted neurons. They iterate over all neurons of the model to find anomalies, i.e. neurons that can unilaterally change the prediction of the network, as in the figure above.

  2. Reverse engineer the trigger. Once a corrupted neuron is identified, they optimize the input image by gradient descent to activate this neuron. See below:

A random masked trigger is initialized and superimposed on the input. The alpha neuron initially has a value of 20. We then run gradient descent to minimize the difference between the current activation of alpha and the target activation of 70.

At the end of the optimization process, we have correctly reverse-engineered the trigger, which is indeed a yellow spot.

Source: Slides from this talk.
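Here is a rough sketch of what step 2 could look like in PyTorch, where `neuron_activation` is a hypothetical function returning the scalar activation of the suspected compromised neuron for a given input; the actual ABS implementation differs in many details:

```python
import torch

def reverse_engineer_trigger(neuron_activation, base_image, target_level=70.0,
                             steps=500, lr=0.1, trigger_size=8):
    """Optimise a small masked patch on the input so that the compromised neuron
    reaches the activation level at which the classification flips."""
    trigger = torch.rand_like(base_image, requires_grad=True)
    mask = torch.zeros_like(base_image)
    mask[:, :, :trigger_size, :trigger_size] = 1.0   # restrict the trigger to a small region
    optimizer = torch.optim.Adam([trigger], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        stamped = (1 - mask) * base_image + mask * trigger
        loss = (neuron_activation(stamped) - target_level) ** 2   # e.g. push alpha from 20 to 70
        loss.backward()
        optimizer.step()
    return (mask * trigger).detach()   # the recovered trigger (here, a yellow spot)
```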

More resources: theory of interpretability and robustness

The paper Adversarial Robustness as a Prior for Learned Representations (Engstrom et al., 2019) provides evidence that adversarially trained networks learn more robust features. This could imply that such robust networks are more interpretable, because the paper Toy Models of Superposition explains that superposition (informally, when a neuron encodes multiple different meanings) reduces robustness against adversarial attacks.

Worst case guarantee and relaxed adversarial training (Christiano, 2019)

Importance: 🟠🟠🟠🟠⚪Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 15 min.

Adversarial training is a core strategy to ensure reliable AI training (see Buck Shlegeris’ talk during EAG Berkeley 2023).

In this section, we explain the article Training Robust Corrigibility (Christiano, 2019), which aims to:

  1. Introduce an important concept: corrigibility

  2. Explain why adversarial training alone is not enough, and why we need relaxed adversarial training (using interpretability).

Section I: Corrigibility

What is the link between corrigibility and adversarial training? Even when generating adversarial examples, we want to ensure that the model behaves at least acceptably. Acceptability is a notion defined by Christiano, and it is a broader notion than corrigibility.

Corrigibility: The notion of corrigibility makes it possible to obtain worst-case guarantees, and has been introduced by MIRI (Machine Intelligence Research Institute), in a famous paper: “A corrigible agent is one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or ‘correct’ our mistakes in building it; and permits these ‘corrections’ despite instrumental convergence. If we try to suspend the AI to disk, or shut it down entirely, a corrigible AI will let us do so. This is not something that an AI is automatically incentivized to let us do, since if it is shut down, it will be unable to fulfill what would usually be its goals.” from: Corrigibility.

Acceptability: Christiano then introduces a more general version of corrigibility: behaving acceptably, which is more general than just the stop-button problem setup. For example, for a self-driving agent:

  • Crashing your car is tragic but acceptable.

  • Deliberately covering up the fact that you crashed… is unacceptable.

Christiano makes the hypothesis: “Requiring a model to always behave acceptably wouldn’t make a hard problem too much harder. For a chess-playing AI, [...] if you find yourself murdering your opponent you’ve crossed a bright line, and understanding that line doesn’t become harder as you get better at chess.” from: Corrigibility.

What Christiano is trying to tell us here is that there is a crisp and clear difference between acceptable and unacceptable behavior, and that we can train a classifier to distinguish between the two behaviors.

There are two requirements for adversarial training:

  1. Detection: Adversarial training requires the ability to reliably detect when the agent behaves unacceptably. This aspect is considered easy, as argued in the previous section, where we said that the notion of acceptability is crisp and can be discerned.

  2. Generation: Adversarial training also requires an adversary who can find inputs on which the agent behaves unacceptably. This aspect is considered hard, but it may be possible to train a strong adversary by combining transparency techniques (Section II), relaxations that make the adversary’s job easier (Section III), and amplification methods.

Section II: Transparency techniques

Christiano explains that transparency (interpretability) is necessary:

Example of problem: “To see how hard the adversary’s job could get [without transparency], consider an agent that works as follows:

  • Predict whether the current input could possibly appear during training.

  • If the current input definitely couldn’t appear during training, then behave unacceptably.

Solutions: [in order to obtain hints to find inputs on which the agent behaves unacceptably]:

  • Interpretability.

  • The adversary and agent share most of their computation, so that the adversary can more easily understand whatever the agent can understand [Note: this is similar to Eliciting Latent Knowledge]”.

Section III: Relaxation (Relaxed adversarial method)

Unfortunately, transparency techniques are not sufficient: the problem is that recognizing inputs with a certain property is easier than generating inputs with that property.

Example of problem: “As a simple but silly example, suppose our model works as follows: Pick a cryptographic puzzle (e.g. “factor RSA-2048”). When it sees a solution to that puzzle, it behaves badly. Otherwise, it behaves well.”

Solution: “To deal with this problem, we need to penalize the agent for behaving badly on an input even if we can’t actually find it: rather than asking them to actually produce an input, we are asking them to describe a situation in which the model would behave badly.” As a crude example, this would be like being able to find neurons which are compromised, even if it is hard to find images that would activate those neurons. This is what we call a relaxation.

What’s the problem with relaxation?

If the adversary has too much flexibility, then they might be able to specify a pseudo-input like “a distribution over inputs that would cause the agent to want to kill everyone.” There need to be some limits on the adversary in order for the game to be possible. For example, an adversary testing the neural network could find the neuron that corresponds to producing unacceptable text, even though in practice this neuron would never be activated, whatever the inputs to the model. Relaxation as described here therefore creates false positives.

A concrete proposal for relaxed adversarial training can be found here: Latent Adversarial Training.
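To give a flavour of what such a relaxation could look like in practice, here is a hedged sketch of latent adversarial training, assuming the model can be split into an `encoder` and a `head`; this is an illustration of the general idea, not the exact algorithm of the cited proposal.

```python
import torch

def latent_adversarial_loss(encoder, head, x, y, criterion, epsilon=0.1, steps=5, lr=0.02):
    """Instead of searching for an actual input that triggers bad behaviour (which may be
    intractable, like factoring RSA-2048), the adversary perturbs the model's hidden
    activations directly; the model is then trained to behave acceptably even under
    these worst-case internal states."""
    with torch.no_grad():
        h = encoder(x)                               # clean latent activations
    delta = torch.zeros_like(h, requires_grad=True)  # adversarial perturbation in latent space
    for _ in range(steps):
        loss = criterion(head(h + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += lr * grad.sign()                # ascend the loss
            delta.clamp_(-epsilon, epsilon)          # keep the perturbation bounded
    # The defender minimises the loss on the worst-case latent state found by the adversary.
    return criterion(head(encoder(x) + delta.detach()), y)
```

Minimizing this loss over the model parameters forces the model to behave acceptably even for internal states it might never reach on natural inputs, which is the spirit of penalizing bad behavior on inputs we cannot actually exhibit.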
