Reinforcement Learning in the Iterated Amplification Framework

When I think about Iterated Amplification (IA), I usually think of a version that uses imitation learning for distillation.

This is the version discussed in “Scalable agent alignment via reward modeling: a research direction” as “Imitating expert reasoning”, in contrast to that paper’s proposed approach of “Recursive Reward Modelling”. The approach works roughly as follows:

1. Gather training data from experts on how to break problems into smaller pieces and combine the results

2. Train a model to imitate what the expert would do at every step

3. Amplification: Run a collaboration of a large number of copies of the learned model.

4. Distillation: Train a model to imitate what the collaboration did.

5. Repeat steps 3 and 4, increasing performance at every step

However, Paul has also talked about IA using reinforcement learning (RL) to maximize the approval of the amplified model. What does this approach (RL-IA) look like? How does it relate to Imitation-IA and Recursive Reward Modelling?

Puzzling about RL-IA

To get an agent that takes good actions in an Atari game, we use Imitation-IA to build a system that answers the question “how good is it to take a given action from this state?”, then train a reinforcement learner to “output the best action to take from a given state”.

But it seems like the improvement stops there: it’s not clear how “ability to output the best action to take from a given state” could improve “ability to evaluate how good actions are from a state” in any way that’s different from running a traditional reinforcement learning algorithm (which usually involves taking some policy/value estimate and gradually improving it).

Clarifying what RL-IA does

Claim: There is a fairly straightforward correspondence between how Imitation-IA and RL-IA perform a task (given no computational limits). RL-IA does not change the class of tasks that Imitation-IA can perform, nor does it perform them in a radically different way.

Suppose we have a current version of the model, M1, that takes questions and produces a distribution over answers. Let M2 be an amplified version of that model (i.e. produced by running a number of copies of M1). Let Y be some question, with domain of possible answers D. We want to find the answer X* in D that maximizes the approval of the amplified overseer, M2(“How good is answer X to Y?”). Y could be

  • “What action is best to take from this state in this Atari game?” where D is a small discrete set of possible actions

  • “What answer of less than 100 characters should I give to this question?” where D is a large discrete set of possible answers

  • “What answer of unbounded length should I give to this question?” where D is an infinite discrete set

  • “What is the probability that event E will happen tomorrow?” where D is the continuous space of probabilities
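In all of these cases, both the imitation-based and RL-based updates below are aiming at the same target, which can be written out (in the notation just introduced) as:

```latex
X^{*} \;=\; \operatorname*{arg\,max}_{X \in D} \; M_{2}\big(\text{``How good is answer } X \text{ to } Y\text{?''}\big)
```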

An update using imitation learning would have the form:

  • X* = M1(Y)

  • For some number of samples:

    • Sample an answer X from D

    • Evaluate M2(“How good is answer X to Y?”)

    • If M2(“How good is answer X to Y?”) > M2(“How good is answer X* to Y?”), then set X* = X

  • Perform gradient descent to maximize the probability of outputting X*, using the gradient ∇θ log Pθ(X* | Y), where Pθ(· | Y) is the answer distribution produced by M1
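To make this concrete, here is a minimal runnable sketch of the imitation-style update on a toy problem. The small answer space, the softmax policy standing in for M1, and the scoring function standing in for M2 are all stand-ins I am assuming for illustration, not details from the post:

```python
import numpy as np

# Toy illustration of the imitation-style update: a softmax policy over a small
# discrete answer space stands in for M1(Y), and a hand-written scoring function
# stands in for M2("How good is answer X to Y?"). Both are assumptions.
rng = np.random.default_rng(0)
D = np.arange(5)                        # toy answer space
theta = np.zeros(len(D))                # logits of the learned policy

def m2_score(x):
    return -abs(x - 3)                  # placeholder approval: answer 3 is best

def policy(theta):
    p = np.exp(theta - theta.max())
    return p / p.sum()

lr, n_samples = 0.5, 20
x_star = rng.choice(D, p=policy(theta))          # X* = M1(Y): current best guess
for _ in range(n_samples):
    x = rng.choice(D)                            # sample an answer X from D
    if m2_score(x) > m2_score(x_star):           # keep X if M2 approves of it more
        x_star = x

# Supervised step: increase the log-probability of outputting X*.
p = policy(theta)
grad = -p
grad[x_star] += 1.0                              # gradient of log softmax(theta)[X*]
theta += lr * grad
print("X* found by search:", x_star, "updated policy:", np.round(policy(theta), 2))
```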

An update using the REINFORCE policy gradient estimator would have the form:

  • Sample X from the stochastic policy M1(Y)

  • Perform gradient descent using the gradient M2(“How good is answer X to Y?”) ∇θ log Pθ(X | Y)
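For comparison, here is a matching sketch of the REINFORCE update on the same kind of toy setup (again, the policy and the M2 scorer are made-up stand-ins):

```python
import numpy as np

# Toy illustration of the REINFORCE update: sample an answer X from the
# stochastic policy standing in for M1(Y), then weight the log-probability
# gradient by the placeholder M2 approval score.
rng = np.random.default_rng(0)
D = np.arange(5)
theta = np.zeros(len(D))

def m2_score(x):
    return -abs(x - 3)                  # placeholder approval: answer 3 is best

def policy(theta):
    p = np.exp(theta - theta.max())
    return p / p.sum()

lr = 0.5
for _ in range(200):
    p = policy(theta)
    x = rng.choice(D, p=p)                       # sample X from M1(Y)
    grad_logp = -p
    grad_logp[x] += 1.0                          # gradient of log P(X | Y)
    theta += lr * m2_score(x) * grad_logp        # approval-weighted update
print("policy after REINFORCE:", np.round(policy(theta), 2))
```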

If we have a perfect distillation algorithm, these two updates both converge to the X in D that maximizes M2(“How good is answer X to Y?”) (the argmax objective above) in the limit of infinite computation.

Practical Differences

Outside of this idealized situation, circumstances could make one or the other a better update to use.

The imitation update could converge more quickly if we have a good initialization for M1(Y) from human data, as it bypasses the need to explore. It could also be less surprising, using only processes that the humans originally demonstrated.

The REINFORCE update could converge more quickly if the human initialization is suboptimal, or if it’s hard to exactly reproduce the human demonstration.

In general, it seems like the system could use an algorithm that combines reinforcement learning updates with imitation learning updates, such as Deep Q-Learning from Demonstrations.
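As a sketch of what such a combination could look like on the toy setup above, one simple option is to add a weighted imitation term toward a demonstrated answer to the REINFORCE update. This is a schematic mixture of the two updates, not an implementation of Deep Q-Learning from Demonstrations; the demonstrated answer and the weighting are made up:

```python
import numpy as np

# Schematic mixed update (an assumption, not DQfD itself): combine the
# approval-weighted REINFORCE gradient with an imitation gradient that pushes
# toward a demonstrated answer.
rng = np.random.default_rng(0)
D = np.arange(5)
theta = np.zeros(len(D))
demo_answer = 2                         # hypothetical human/amplified demonstration
imitation_weight = 0.3                  # hypothetical trade-off between the signals

def m2_score(x):
    return -abs(x - 3)                  # placeholder approval signal

def policy(theta):
    p = np.exp(theta - theta.max())
    return p / p.sum()

def grad_logp(p, x):
    g = -p
    g[x] += 1.0                         # gradient of log softmax(theta)[x]
    return g

lr = 0.5
for _ in range(200):
    p = policy(theta)
    x = rng.choice(D, p=p)
    rl_grad = m2_score(x) * grad_logp(p, x)      # RL: maximize M2 approval
    imit_grad = grad_logp(p, demo_answer)        # imitation: match the demonstration
    theta += lr * (rl_grad + imitation_weight * imit_grad)
print("mixed-update policy:", np.round(policy(theta), 2))
```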

Returning to the original puzzle

I think the resolution is not that “ability to output good actions at this timestep” directly translates into “ability to evaluate which actions are good”. Rather, the decomposition of “evaluate which actions are good” contains some subquestions that perform a search over an answer space; the answers to those subquestions can be improved by reinforcement learning, and that in turn improves the evaluation of Atari actions. This can produce a model which uses a mix of imitation learning and reinforcement learning.

For example:

“What is a good action to take from state S?” could be learned to maximize “How good is it to take action A from this state S?”

“How good is it to take action A from this state S?” could be learned by imitating an amplified reasoner that asks the subquestion “What is the most useful information to provide about the consequences of action A from state S?”

“What is the most useful information to provide about the consequences of action A from state S?” could be learned to maximize “How useful is information I about the consequences of action A in state S?”

A modified version of the question, “How good is it to take action A from this state S, and include an explanation of your reasoning?”, could also be trained with reinforcement learning to maximize “How good is the explanation of how good it is to take action A in state S?”
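To make explicit which training signal updates which subquestion in this chain, here is the example restated as a small lookup structure (my own restatement of the chain above, not code from the post):

```python
# Which training signal updates which subquestion in the example above
# ("RL" means the answer is trained to maximize the amplified model's approval
# of it; "imitation" means it is trained to match the amplified reasoner).
training_signals = [
    ("What is a good action to take from state S?",
     "RL", 'maximize M2("How good is it to take action A from this state S?")'),
    ("How good is it to take action A from this state S?",
     "imitation", "imitate an amplified reasoner that asks about the consequences of A"),
    ("What is the most useful information to provide about the consequences of action A from state S?",
     "RL", 'maximize M2("How useful is information I about the consequences of A in S?")'),
]
for question, method, target in training_signals:
    print(f"{method:9s} | {question}\n          -> {target}")
```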

Concluding Thoughts

Indeed, I think we could view every question answerable by an IA system as having the form “select the answer to question Y that the overseer most approves of”, and use both demonstrations from the amplified reasoner and the amplified reasoner’s evaluations to improve the answer. This perspective allows the system to learn to decompose problems better than the original humans did. But it might also cause problems if a series of updates leads the learned answering system to behave very differently from the original human demonstrators. We might want to be careful about how far an RL-learned policy can drift from the original demonstrations.

In terms of getting a system to be capable of doing some task, I’d be most optimistic about systems that combine RL-IA and Imitation-IA depending on the situation. But I still think the pure Imitation-IA perspective is useful for reasoning about the alignment properties of the system.

(Thanks to Andreas Stuhlmüller and Owain Evans for feedback on a draft of this post)