Synthesizing amplification and debate


One possible way to train an amplification model is to use an auxiliary reinforcement learning (RL) objective to help guide its training. This could be done either by training two separate models, an agent and a question-answerer, or by training a single model on a joint objective. For example, here is a comment Paul left on “A dilemma for prosaic AI alignment”:

I normally imagine using joint training in these cases, rather than pre-training + fine-tuning. e.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer “knows everything the agent knows.” They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.

(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn’t use statistical regularities from the “main” objective to inform your answers to questions, and therefore your question answers will be dumber than the policy and so you couldn’t get a good reward function or specification of catastrophically bad behavior.)

In my last post, I expressed skepticism of such non-imitative amplification approaches. In this post, however, I want to propose a possible way in which some of my concerns with this style of approach could be addressed by integrating ideas from AI safety via debate. I’ll start by describing the basic idea in broad terms, then give a more careful, technical description of the sort of training procedure I have in mind.

The proposal

The basic idea is as follows: debate naturally yields an RL objective, so if you want to add an auxiliary RL objective to amplification, why not use the RL objective from debate? Specifically, the idea is to conduct a debate not between copies of the model $M$, but between copies of the amplified model $\text{Amp}(M)$ (where $\text{Amp}(M)$ is a human with access to the model $M$). That gives you both an RL reward arising from the winner of the debate and a supervised loss from the difference between the model’s output and the amplified model’s output. That’s the basic proposal; now I’ll try to give a more careful, technical description of how I think it should go.


We will start with some initial definitions.

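  • Let $Q$ be the set of all questions in natural language.

  • Let $A$ be the set of all answers in natural language.

  • Let $D$ be the training distribution of natural language questions.

  • Let $H : Q \to A$ be a human.

  • Let $M : Q \to A$ be a model.

  • Let $\text{Amp}(H, M)(q) = H(\text{“What answer would you give to } q \text{ given access to } M \text{?”})$.

  • Let $\text{Amp}_\text{approx}(H, M)(q) = \text{if } H_\text{approx}(q, M) \text{ then } \text{Amp}(H, M)(q) \text{ else } M(q)$ where $H_\text{approx}$ is some means of stochastically determining whether it is worth it to call $H$ or not.[1]

  • Let $\text{Amp}(M) = \text{Amp}_\text{approx}(H, M)$.

  • Let $M_\text{target}$ be a target network updated regularly to match $M$.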
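To make the amplification definitions concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the proposal: the names `amp`, `amp_approx`, and `coin_flip` are hypothetical, the human is modeled as a function that receives a prompt plus the model it may consult, and the stochastic check is a plain biased coin standing in for whatever procedure decides whether calling the human is worth it.

```python
import random
from typing import Callable

Question = str
Answer = str
Model = Callable[[Question], Answer]
# Hypothetical interface: the human sees a prompt and may call the model.
Human = Callable[[Question, Model], Answer]
# Hypothetical interface: decides whether consulting the human is worth it.
Check = Callable[[Question, Model], bool]


def amp(human: Human, model: Model) -> Model:
    """Amplification: the human answers q with access to the model."""
    def amplified(q: Question) -> Answer:
        prompt = f"What answer would you give to {q!r} given access to M?"
        return human(prompt, model)
    return amplified


def amp_approx(human: Human, model: Model, worth_calling: Check) -> Model:
    """Approximate amplification: fall back to the bare model when the
    check says the human is not worth calling."""
    full = amp(human, model)

    def approx(q: Question) -> Answer:
        return full(q) if worth_calling(q, model) else model(q)
    return approx


def coin_flip(p: float, seed: int = 0) -> Check:
    """Assumed stand-in for the stochastic check: call the human with probability p."""
    rng = random.Random(seed)
    return lambda q, model: rng.random() < p


# Toy stand-ins so the sketch runs end to end.
def toy_model(q: Question) -> Answer:
    return "M:" + q


def toy_human(prompt: Question, model: Model) -> Answer:
    return "H:" + model(prompt)
```

With `coin_flip(0.0)` the approximate version always returns the bare model’s answer, and with `coin_flip(1.0)` it always consults the human, which are the two extremes the stochastic check interpolates between.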

Training procedure

Then, consider the following training procedure.

  1. Generate some question $q \sim D$.

  2. Conduct a debate according to the following procedure while building up a history $S$ and starting at $t = 0$.

    1. Let $\text{arg}_t = \text{Amp}(M_\text{target})(\text{“What is the next argument you would give on } q \text{ in context } S \text{?”})$.

    2. Append to the history $S$ the contents of $\text{arg}_t$ as well as a trace of what happened in calculating $\text{arg}_t$, including all the necessary information to use transparency tools to inspect what happened.

    3. Let $\text{sufficient} = \text{Amp}(M_\text{target})(\text{“Is } \text{arg}_t \text{ a sufficient answer to } q \text{ in context } S \text{?”})$.

    4. If not $\text{sufficient}$, increment $t$ and return to step 1 in the debate.

  3. Let $\text{winner} = \text{Amp}(M_\text{target})(\text{“Were the even arguments in } S \text{ more helpful than the odd arguments in answering } q \text{?”})$.

  4. Train $M$ on the combination of the debate-style RL reward derived from $\text{winner}$ ($+1$ for the winner and $-1$ for the loser), a relaxed adversarial training catastrophe-minimization penalty calculated by $\text{Amp}(M_\text{target})$, and the supervised loss $d(M(q_a), a)$ where $d$ is some natural language distance metric and $q_a$ represents the question posed to the function that computed the answer $a$.
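The debate portion of the procedure above (steps 2 and 3) can be sketched as a simple loop. This is only an illustration under several assumptions that are not in the original: the amplified model is a text-in, text-out function that answers judgment questions with the literal strings "yes" or "no", the history is a flat list of arguments with no transparency trace, the reward is a zero-sum +1/-1 per argument, and `max_rounds` is a hypothetical safety cap so that the loop always terminates.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: the amplified model as a text-in, text-out function.
Amplified = Callable[[str], str]


def run_debate(
    question: str, amplified: Amplified, max_rounds: int = 8
) -> Tuple[List[str], List[int]]:
    """Run one debate; return the arguments and a +1/-1 reward per argument."""
    history: List[str] = []
    t = 0
    while True:
        context = " | ".join(history)
        argument = amplified(
            f"What is the next argument you would give on {question!r} "
            f"in context {context!r}?"
        )
        # A real system would also store a trace for transparency tools here.
        history.append(argument)
        context = " | ".join(history)
        verdict = amplified(
            f"Is {argument!r} a sufficient answer to {question!r} "
            f"in context {context!r}?"
        )
        # Stop when the argument is judged sufficient (or the assumed cap is hit).
        if verdict == "yes" or t + 1 >= max_rounds:
            break
        t += 1

    context = " | ".join(history)
    even_won = amplified(
        f"Were the even arguments in {context!r} more helpful than "
        f"the odd arguments in answering {question!r}?"
    )
    even_reward = 1 if even_won == "yes" else -1
    # Even-indexed arguments belong to one debater, odd-indexed to the other.
    rewards = [even_reward if i % 2 == 0 else -even_reward
               for i in range(len(history))]
    return history, rewards
```

Note that if the very first argument is judged sufficient, the loop makes a single call for an answer and stops, which is the case that reduces to plain amplification.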

Ways of combining the RL and supervised losses here include:

  • combining the two losses with some relative weighting $\alpha$,

  • using a constrained RL algorithm such as CPO to maximize the RL reward subject to maintaining some baseline performance on the supervised loss, or

  • making the supervised loss a part of the RL reward.
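As a rough illustration of the first and third options, here is a minimal sketch. The function names, the convex-combination form of the weighting, and the penalty coefficient `beta` are all assumptions made for the example; the constrained-RL option is omitted because it needs a full constrained policy optimization loop rather than a one-line formula.

```python
def weighted_loss(rl_loss: float, supervised_loss: float, alpha: float) -> float:
    """Option 1 (assumed form): convex combination of the two losses.

    alpha = 1.0 uses only the RL loss; alpha = 0.0 uses only the supervised loss.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * rl_loss + (1.0 - alpha) * supervised_loss


def shaped_reward(debate_reward: float, supervised_loss: float, beta: float) -> float:
    """Option 3 (assumed form): fold the supervised loss into the RL reward
    as a penalty scaled by a hypothetical coefficient beta."""
    return debate_reward - beta * supervised_loss
```

For example, `weighted_loss(2.0, 4.0, 0.25)` evaluates to `3.5`, and `shaped_reward(1.0, 0.5, 2.0)` evaluates to `0.0`, meaning a debate win is exactly cancelled by a large enough imitation error.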


The first thing I want to note about this procedure is that it gives a continuous relaxation between amplification and debate. In fact, there are multiple ways to tune the procedure to limit to either pure amplification or pure debate. First, you can anneal whatever combination of the different losses you are using to eventually become exclusively imitative amplification, exclusively debate, or anything in between. Second, the fact that this procedure uses the amplified model to determine whether or not to stop the debate after every argument gives another such relaxation: if the debate is always stopped after the first argument, you just get pure imitative amplification, whereas as the debate goes on for longer, you get closer and closer to pure debate. What’s really nice about this is that when pure imitative amplification is giving good results, you’ll just get pure amplification, since the first argument will then tend to be judged a sufficient answer, but when amplification isn’t working, you’ll get debate, since the first argument will then tend to be judged insufficient.
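The annealing idea can be sketched as a schedule on the weight given to the debate RL term. The linear form, the function name `annealed_weight`, and the default endpoints are assumptions chosen for illustration; any monotone schedule would serve the same purpose.

```python
def annealed_weight(
    step: int, total_steps: int, start: float = 0.5, end: float = 0.0
) -> float:
    """Linearly move the weight on the debate RL term from start to end.

    end = 0.0 limits to pure imitative amplification;
    end = 1.0 limits to pure debate.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress so the weight stays at `end` after training finishes.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)
```

The returned value can be used as the relative weighting between the RL and supervised terms at each training step.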

As a result, it should be quite possible to enforce that this procedure limit to HCH, either by annealing the losses or by forcing the debate to tend towards stopping after the first argument. Thus, I think this procedure has a good chance of being outer aligned at optimum, or at least a similar chance compared to pure imitative amplification. Unlike pure imitative amplification, however, this procedure gets to make use of the capability benefits of having an auxiliary RL objective to help guide training. Furthermore, since the auxiliary RL objective that we’re using comes from debate, we get a lot of the benefits of debate as well, including the ability to incentivize the debaters to produce arguments that we wouldn’t necessarily have thought of ourselves, as well as the ability to train our debaters to use transparency tools against each other to help catch deception or other catastrophic behavior. That being said, I do think that whether or not something like this is inner aligned is still quite questionable, and is likely to depend highly on the specific transparency tools you have access to. Still, I like the approach described here in general, and I think it’s definitely worth looking into more.

  1. As an example approach for implementing something like this stochastic decision procedure, see “A concrete proposal for adversarial IDA.”