# Reinforcement Learning in the Iterated Amplification Framework

When I think about Iterated Amplification (IA), I usually think of a version that uses imitation learning for distillation.

This is the version discussed in “Scalable agent alignment via reward modeling: a research direction” as “Imitating expert reasoning”, in contrast to that paper’s proposed approach of “Recursive Reward Modeling”. The approach works roughly as follows:

1. Gather training data from experts on how to break problems into smaller pieces and combine the results

2. Train a model to imitate what the expert would do at every step

3. Amplification: Run a collaboration of a large number of copies of the learned model.

4. Distillation: Train a model to imitate what the collaboration did.

5. Repeat steps 3 and 4, increasing performance at every step

However, Paul has also talked about IA using reinforcement learning (RL) to maximize the approval of the amplified model. What does this approach (RL-IA) look like? How does it relate to Imitation-IA and Recursive Reward Modelling?

**Puzzling about RL-IA**

To get an agent that takes good actions in an Atari game, we use Imitation-IA to build a system that answers the question “how good is it to take actions from this state”, then train a reinforcement learner to “output the best action to take from a given state”.

But it seems like the improvement stops there: it’s not clear how the “ability to output the best action to take from a given state” could improve the “ability to evaluate how good actions are from a given state” in any way that’s different from running a traditional reinforcement learning algorithm (which usually involves taking some policy/value estimate and gradually improving it).

**Clarifying what RL-IA does**

Claim: There is a fairly straightforward correspondence between how Imitation-IA and RL-IA perform a task (given no computational limits). RL-IA does not change the class of tasks that Imitation-IA can perform, nor does it perform them in a radically different way.

Suppose we have a current version of the model M1 that takes questions and produces a distribution over answers. Let M2 be an amplified version of that model (i.e. produced by running a number of copies of M1 in collaboration). Let Y be some question with domain of answers D. We want to find the answer X* in D that maximizes the approval of the amplified overseer, M2(“How good is answer X to Y?”). Y could be:

- “What action is best to take from this state in this Atari game?”, where D is a small discrete set of possible actions

- “What answer of less than 100 characters should I give to this question?”, where D is a large discrete set of possible answers

- “What answer of unbounded length should I give to this question?”, where D is an infinite discrete set

- “What is the probability that event E will happen tomorrow?”, where D is the continuous space of probabilities

An update using imitation learning would have the form:

```
X* = M1(Y)
for some number of samples:
    sample an answer X from D
    evaluate M2(“How good is answer X to Y?”)
    if M2(“How good is answer X to Y?”) > M2(“How good is answer X* to Y?”):
        X* = X
perform a gradient descent step to maximize the probability of outputting X*,
using gradient ∇ log p_M(X*)
```
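As a concrete toy sketch of this search-and-imitate update, consider a softmax policy over a small discrete answer set. The function `m2_score` (standing in for M2(“How good is answer X to Y?”)) and the tabular policy are illustrative assumptions, not part of the original scheme:

```python
import numpy as np

def imitation_style_update(theta, m2_score, n_samples=10, lr=0.1, seed=0):
    """Sample candidate answers, keep the one the amplified model rates
    highest, then take a gradient step toward outputting that answer.
    theta: logits of a softmax policy over a discrete answer set D."""
    rng = np.random.default_rng(seed)
    p = np.exp(theta - theta.max())
    p /= p.sum()
    x_star = int(rng.choice(len(theta), p=p))  # X* = M1(Y): sample from current policy
    for _ in range(n_samples):
        x = int(rng.integers(len(theta)))      # sample an answer X from D
        if m2_score(x) > m2_score(x_star):     # amplified overseer's approval
            x_star = x
    # gradient of log p_M(X*) for a softmax policy:
    grad_log_p = -p.copy()
    grad_log_p[x_star] += 1.0
    return theta + lr * grad_log_p, x_star
```

With a uniform initialization, a single update strictly increases the probability the policy assigns to the best answer found by the search.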

An update using the REINFORCE policy gradient estimator would have the form:

```
sample X from the stochastic policy M1(Y)
perform a gradient descent step using gradient
    M2(“How good is answer X to Y?”) · ∇ log p_M(X)
```
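For comparison, here is a minimal sketch of the REINFORCE update under the same toy assumptions (softmax policy over a discrete answer set, with `m2_score` standing in for the overseer’s rating):

```python
import numpy as np

def reinforce_update(theta, m2_score, lr=0.1, seed=0):
    """Sample X from the stochastic policy p_M = softmax(theta), then step
    along m2_score(X) * grad log p_M(X), the REINFORCE estimator."""
    rng = np.random.default_rng(seed)
    p = np.exp(theta - theta.max())
    p /= p.sum()
    x = int(rng.choice(len(theta), p=p))  # sample X ~ M1(Y)
    grad_log_p = -p.copy()
    grad_log_p[x] += 1.0                  # d/dtheta of log softmax(theta)[x]
    return theta + lr * m2_score(x) * grad_log_p, x
```

Answers the overseer rates highly become more probable over repeated updates; in practice a baseline is usually subtracted from `m2_score` to reduce the variance of the estimator.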

If we have a perfect distillation algorithm, these both converge to outputting X* in the limit of infinite computation.

**Practical Differences**

Outside of this idealized situation, circumstances could make one or the other a better update to use.

The imitation update could converge more quickly if we have a good initialization for M1(Y) from human data, as it bypasses the need for exploration. It could also be less surprising, using only processes that the humans originally demonstrated.

The REINFORCE update could converge more quickly if the human initialization is suboptimal, or if it’s hard to exactly reproduce the human demonstration.

In general, it seems like the system could use an algorithm that combines reinforcement learning updates with imitation learning updates, e.g. Deep Q-Learning from Demonstrations.
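One toy way to combine the two: mix an imitation gradient (toward a demonstrated answer) with a REINFORCE-style gradient weighted by the overseer’s score. The mixing weight `alpha` and all names here are illustrative; this is a sketch of the general idea, not the actual DQfD algorithm:

```python
import numpy as np

def combined_update(theta, m2_score, demo_answer, alpha=0.5, lr=0.1, seed=0):
    """One step mixing imitation and RL gradients for a softmax policy.
    alpha controls how much weight the imitation term gets."""
    rng = np.random.default_rng(seed)
    p = np.exp(theta - theta.max())
    p /= p.sum()
    # imitation term: increase log-probability of the demonstrated answer
    g_imit = -p.copy()
    g_imit[demo_answer] += 1.0
    # RL term: sample from the policy, weight by the overseer's approval
    x = int(rng.choice(len(theta), p=p))
    g_rl = -p.copy()
    g_rl[x] += 1.0
    g_rl *= m2_score(x)
    return theta + lr * (alpha * g_imit + (1 - alpha) * g_rl)
```

Setting `alpha=1` recovers pure behavioral cloning of the demonstration; `alpha=0` recovers pure REINFORCE against the overseer’s approval.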

**Returning to the original puzzle**

I think the solution is not necessarily that “ability to output good actions at this timestep” translates into “ability to evaluate which actions are good”. Rather, the decomposition of “evaluate which actions are good” contains some questions that might perform a search over an answer space; the answers to those questions are improved by reinforcement learning, and this improves the evaluation of Atari actions. This can produce a model that uses a mix of imitation learning and reinforcement learning.

For example:

“What is a good action to take from state S?” could be learned to maximize “How good is it to take action A from this state S?”

“How good is it to take action A from this state S?” could be learned by imitating an amplified reasoner that asks the subquestion “What is the most useful information to provide about the consequences of action A from state S?”

“What is the most useful information to provide about the consequences of action A from state S?” could be learned to maximize “How useful is information I about the consequences of action A in state S?”

A modified version of the question, “How good is it to take action A from this state S, and include an explanation of your reasoning?” could also be reinforcement learned to maximize “How good is the explanation of how good it is to take action A in state S?”

**Concluding Thoughts**

Indeed, I think we could see *every* question answerable by an IA system as having the form “select the answer to question Y that the overseer approves of most”, and use both demonstrations from the amplified reasoner and the amplified reasoner’s evaluations to improve the answer. This perspective allows the system to learn to decompose problems better than the original humans could. But it might also cause problems if a series of updates causes the learned answering system to behave very differently from the original human demonstrators. We might want to be careful about the degree to which an RL-learned policy can differ from the original demonstration.

In terms of getting a system to be capable of doing some task, I’d be most optimistic about systems that could combine RL-IA and Imitation-IA depending on the situation. But I still think there’s usefulness in thinking about the pure Imitation-IA perspective to try and reason about the alignment properties of the system.

(Thanks to Andreas Stuhlmüller and Owain Evans for feedback on a draft of this post)

I was excited to see this post since I’m having some similar puzzles, but I’m still quite confused after reading this.

I don’t understand why we want to find this X* in the imitation learning case. For imitation learning, don’t we want to produce a distilled model that would imitate M2, i.e., give the same answer to Y as what M2 would give? If M2, upon input Y, only does a limited search over D (let’s say because of concerns about safety) and therefore would not output the answer that maximizes M2(“How good is answer X to Y?”) in an absolute/unbounded sense, then don’t we want to reproduce that behavior for imitation learning?

What is pM(X∗)?

Can you explain this a bit more too? It might be apparent once I know what pM(X∗) is, but just in case...

It’s the probability that the model M that we’re training assigns to the best answer X*. (M is outputting a probability distribution over D.)

The next one is the standard REINFORCE method for doing RL with a reward signal that you cannot differentiate through (i.e. basically all RL). If you apply that equation to many different possible Xs, you’re increasing the probability that M assigns to high-reward answers, and decreasing the probability that it assigns to low-reward answers.

Ah, with this example the intent was more like “we can frame what the RL case is doing as finding X* , let’s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)”.

The reverse mapping (imitation to RL) just consists of applying reward 1 to M2′s demonstrated behaviour (which could be “execute some safe search and return the results”), and reward 0 to everything else.

pM(X∗) is the probability of outputting X∗ (where pM is a stochastic policy)

This is the REINFORCE gradient estimator (which tries to increase the log probability of actions that were rated highly)

I agree with Wei Dai that the schemes you’re describing do not sound like imitation learning. Both of the schemes you describe sound to me like RL-IA. The scheme that you call Imitation-IA seems like a combined random-search-plus-gradients method of doing RL. There’s an exactly analogous RL algorithm for the normal RL setting: just take the algorithm you have, and replace all instances of M2(“How good is answer X to Y?”) with r(X), where r is the reward function.

One way that you could do imitation-IA would be to compute X∗=M2(Y) a bunch of times to get a dataset {(Yi,Xi∗)} and train M on that dataset.
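That version might look like the following toy sketch, with a tabular policy (one logit vector per question) and `m2` standing in for querying the amplified model; all names here are illustrative:

```python
import numpy as np

def behavioral_cloning_step(theta, m2, questions, lr=0.1):
    """Distill M2 into M by supervised learning on (Y, M2(Y)) pairs.
    theta[y] holds the logits of M's answer distribution for question y."""
    for y in questions:
        x = m2(y)  # amplified model's answer, used as the training label
        p = np.exp(theta[y] - theta[y].max())
        p /= p.sum()
        grad = -p
        grad[x] += 1.0  # gradient of log p(x | y), i.e. cross-entropy loss
        theta[y] = theta[y] + lr * grad
    return theta
```

Each pass moves M’s distribution for each question toward whatever answer M2 gave, with no search or reward signal involved.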

I am also not sure exactly what it means to use RL in iterated amplification. There are two different possibilities I could imagine:

Using a combination of IRL + RL to achieve the same effect as imitation learning. The hope here would be that IRL + RL provides a better inductive bias for imitation learning, helping with sample efficiency.

Instead of asking the amplified model to compute M(Y) directly, we ask it to provide a measure of approval, e.g. by asking “How good is answer X to Y?”, or by asking “Which is a better answer to Y, X1 or X2?” and learning from that signal (see optimizing with comparisons), using some arbitrary RL algorithm.

I’m quite confident that RL+IA is not meant to be the first kind. But even with the second kind, one question does arise—typically with RL we’re trying to optimize the sum of rewards across time, whereas here we actually only want to optimize the one-step reward that you get immediately (which is the point of maximizing approval and having a stronger overseer). So then I don’t really see why you want RL, which typically is solving a hard credit assignment problem that doesn’t arise in the one-step setting.

You can use RL for the distillation step. (I usually mention RL as the intended distillation procedure when I describe the scheme, except perhaps in the AGZ analogy post.)

The algorithm still needs REINFORCE and a value function baseline (since you need to e.g. output words one at a time), and “RL” seems like the normal way to talk about that algorithm/problem. You could instead call it “contextual bandits.”

You could also use an assistant who you can interact with to help evaluate rewards (rather than using assistants who answer a single question) in which case it’s generic RL.

Does “imitation learning” refer to an autoregressive model here? I think of IRL+RL as a possible mechanism for imitation learning, and it’s normally the kind of algorithm I have in mind when talking about “imitation learning” (or the GAN objective, or an EBM, all of which seem roughly equivalent, or maybe some bi-GAN/VAE thing). (Though I also expect to use an autoregressive model as an initialization in any case.)

Yeah, I know, my main uncertainty was with how exactly that cashes out into an algorithm (in particular, RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in).

I get the need for reinforce, I’m not sure I understand the value function baseline part.

Here’s a thing you might be saying that would explain the value function baseline: this problem is equivalent to a sparse-reward RL problem, where:

The states are the question + in-progress answer

The actions are “append the word w to the answer”

All actions produce zero reward except for the action that ends the answer, which produces reward equal to the overseer’s answer to “How good is answer <answer> to question <question>?”

And we can apply RL algorithms to this problem.
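The sparse-reward formulation above can be written down as a tiny environment. Here `overseer_score` stands in for asking the overseer “How good is answer &lt;answer&gt; to question &lt;question&gt;?”; the names are illustrative:

```python
class AnswerEnv:
    """Sequential answer-writing as a sparse-reward RL problem: state is the
    question plus the in-progress answer, actions append a word, and only
    the end-of-answer action yields a (nonzero) reward."""

    END = "<end>"

    def __init__(self, question, overseer_score):
        self.question = question
        self.overseer_score = overseer_score
        self.answer = []

    def step(self, word):
        """Append a word, or end the answer and collect the overseer's rating."""
        if word == self.END:
            reward = self.overseer_score(self.question, self.answer)
            return (self.question, tuple(self.answer)), reward, True
        self.answer.append(word)
        return (self.question, tuple(self.answer)), 0.0, False
```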

Is that equivalent to what you’re saying?

Just to make sure I’m understanding correctly, this is recursive reward modeling, right?

Yeah, that was bad wording on my part. I was using “imitation learning” to refer both to the problem of imitating the behavior of an agent, as well as the particular mechanism of behavioral cloning, i.e. collecting a dataset of many question-answer pairs and performing gradient descent using e.g. cross-entropy loss.

I agree that IRL + RL is a possible mechanism for imitation learning, in the same way that behavioral cloning is a possible mechanism for imitation learning. (This is why I was pretty confident that my first option was not the right one.)

I guess I’ve used the term “reinforcement learning” to refer to a broader class of problems including both one-shot bandit problems and sequential decision-making problems. In this view, the feature that makes RL different from supervised learning is not that we’re trying to figure out how to act in an MDP/POMDP, but that we’re trying to optimize a function we can’t take the derivative of (in the MDP case because the environment is non-differentiable, and in the approval-learning case because the overseer is non-differentiable).

Got it, thanks for clarifying.


If M2 has adversarial examples or other kinds of robustness or security problems, and we keep doing this training for a long time, wouldn’t the training process sooner or later sample an X that exploits M2 (gets a high reward relative to other answers without actually being a good answer), which causes the update step to increase the probability of M1 giving that output, and eventually causes M1 to give that output with high probability?

Do you know if Paul or anyone else has addressed this anywhere? For example is the plan to make sure M2 has no such robustness problems (if so how)?

Maybe another way to address it would be, instead of doing maximization (in the limit of infinite computation), do quantilization instead?

ETA: I just noticed this part of the post:

Is this talking about the same concern as mine?