Reinforcement Learning in the Iterated Amplification Framework

When I think about Iterated Amplification (IA), I usually think of a version that uses imitation learning for distillation.

This is the version discussed in “Scalable agent alignment via reward modeling: a research direction” as “Imitating expert reasoning”, in contrast to that paper’s proposed approach of “Recursive Reward Modelling”. The approach works roughly as follows:

1. Gather training data from experts on how to break problems into smaller pieces and combine the results.

2. Train a model to imitate what the expert would do at every step.

3. Amplification: Run a collaboration of a large number of copies of the learned model.

4. Distillation: Train a model to imitate what the collaboration did.

5. Repeat steps 3 and 4, increasing performance at every step.
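The loop above can be sketched on a toy task. Suppose the task is summing a tuple of numbers: amplification splits a question in half and combines sub-answers from copies of the model, and distillation trains the model to reproduce what the collaboration computed. The names here (`Model`, `amplify`, `distill`) are illustrative stand-ins, not an actual IA implementation:

```python
# Toy sketch of the amplify/distill loop above: the "task" is summing a
# tuple of numbers. Model/amplify/distill are illustrative stand-ins.

class Model:
    def __init__(self, demonstrations):
        self.known = dict(demonstrations)   # question -> answer

    def answer(self, q):
        return self.known.get(q)            # None if never trained on q

def amplify(model, q):
    """Step 3: copies of the model collaborate on subquestions."""
    direct = model.answer(q)
    if direct is not None:
        return direct
    mid = len(q) // 2
    left, right = model.answer(q[:mid]), model.answer(q[mid:])
    if left is None or right is None:
        return None                         # collaboration can't solve q yet
    return left + right

def distill(model, questions):
    """Step 4: train the model to imitate the collaboration's answers."""
    for q in questions:
        a = amplify(model, q)
        if a is not None:
            model.known[q] = a

# Steps 1-2: expert demonstrations cover the base cases.
model = Model({(n,): n for n in range(10)})

# Step 5: repeated rounds let the model handle larger questions each time.
for _ in range(2):
    distill(model, [(1, 2), (3, 4), (1, 2, 3, 4)])

print(model.answer((1, 2, 3, 4)))           # -> 10
```

Each round, the distilled model can answer questions one level deeper than before, which is the sense in which performance increases at every step.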

However, Paul has also talked about IA using reinforcement learning (RL) to maximize the approval of the amplified model. What does this approach (RL-IA) look like? How does it relate to Imitation-IA and Recursive Reward Modelling?

Puzzling about RL-IA

To get an agent that takes good actions in an Atari game, we use Imitation-IA to build a system that answers the question “how good is it to take actions from this state”, then train a reinforcement learner to “output the best action to take from a given state”.

But it seems like the improvement stops there: it’s not clear how “ability to output the best action to take from a given state” could improve “ability to evaluate how good actions are from a state” in any way that’s different from running a traditional reinforcement learning algorithm (which usually involves taking some policy/value estimate and gradually improving it).

Clarifying what RL-IA does

Claim: There is a fairly straightforward correspondence between how Imitation-IA and RL-IA perform a task (given no computational limits). RL-IA does not change the class of tasks that Imitation-IA can perform, nor does it perform them in a radically different way.

Suppose we have a current version of the model M1 that takes questions and produces a distribution over answers. Let M2 be an amplified version of that model (i.e. produced by running a number of copies of M1). Let Y be some question, with domain of answers D. We want to find X*, the answer in D that maximizes the approval of the amplified overseer, M2(“How good is answer X to Y?”). Y could be:

  • “What action is best to take from this state in this Atari game?” where D is a small discrete set of possible actions

  • “What answer of less than 100 characters should I give to this question?” where D is a large discrete set of possible answers

  • “What answer of unbounded length should I give to this question?” where D is an infinite discrete set

  • “What is the probability that event E will happen tomorrow?” where D is the continuous space of probabilities
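For the small-discrete-D case, the objective is just a direct search over D for the answer the amplified overseer scores highest. A minimal sketch, where `approval` is a made-up stand-in for M2’s scores on an Atari-like action set:

```python
# Minimal sketch of the objective above for a small discrete D: pick the
# answer X* in D that maximizes the (stand-in) overseer score M2.

def best_answer(answers, m2_score):
    """Return the X in D maximizing M2("How good is answer X to Y?")."""
    return max(answers, key=m2_score)

# Toy stand-in approval scores for an Atari-like action set.
actions = ["noop", "left", "right", "fire"]
approval = {"noop": 0.1, "left": 0.4, "right": 0.9, "fire": 0.3}

print(best_answer(actions, approval.get))   # -> right
```

The updates below are ways of training M1 to output this argmax without exhaustively querying M2 at answer time.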

An update using imitation learning would have the form:

  • X* = M1(Y)

  • For: number of samples

    • Sample an answer X from D

    • Evaluate M2(“How good is answer X to Y?”)

    • If M2(“How good is answer X to Y?”) > M2(“How good is answer X* to Y?”), then set X* = X

  • Perform gradient descent to maximize the probability of outputting X*, using gradient ∇ log P_M1(X* | Y)
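The imitation-style update above can be sketched in numpy with a softmax policy over a small discrete D. Here `approval` stands in for the amplified overseer M2, and all numbers are arbitrary toy choices:

```python
import numpy as np

# Sketch of the imitation-style update: find a better answer by sampling,
# then do a supervised (log-likelihood) step toward it. "approval" is a
# stand-in for M2("How good is answer X to Y?"); numbers are arbitrary.

rng = np.random.default_rng(0)
answers = np.arange(5)                     # a small discrete answer set D
approval = lambda x: -abs(x - 3)           # stand-in overseer: best answer is 3

logits = np.zeros(5)                       # parameters of the policy M1

def policy(logits):
    """Softmax distribution of M1 over the answer set."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

for _ in range(200):
    probs = policy(logits)
    x_star = rng.choice(answers, p=probs)  # current answer X* = M1(Y)
    for _ in range(4):                     # "number of samples"
        x = rng.choice(answers)            # sample an answer X from D
        if approval(x) > approval(x_star): # overseer prefers X to X*
            x_star = x
    # gradient step that maximizes log P(outputting X*)
    grad = -probs
    grad[x_star] += 1.0
    logits = logits + 0.5 * grad
```

After training, the policy concentrates on the answer the stand-in overseer rates highest.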

An update using the REINFORCE policy gradient estimator would have the form:

  • Sample X from the stochastic policy M1(Y)

  • Perform gradient descent using gradient M2(“How good is answer X to Y?”) ∇ log P_M1(X | Y)
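On the same toy setup, the REINFORCE update can be sketched as follows; the running baseline is a standard variance-reduction trick, and again `approval` is a stand-in for M2:

```python
import numpy as np

# Sketch of the REINFORCE update: sample from the policy, then weight the
# log-likelihood gradient by the overseer's (stand-in) approval minus a
# running baseline. Numbers are arbitrary toy choices.

rng = np.random.default_rng(1)
answers = np.arange(5)                     # the same small discrete D
approval = lambda x: -abs(x - 3)           # stand-in overseer: best answer is 3

logits = np.zeros(5)
baseline = 0.0                             # running baseline for variance reduction

def policy(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

for _ in range(1000):
    probs = policy(logits)
    x = rng.choice(answers, p=probs)       # sample X from stochastic policy M1(Y)
    reward = approval(x)                   # overseer's approval of X
    # REINFORCE step: grad log P(X) weighted by (reward - baseline)
    grad = -probs
    grad[x] += 1.0
    logits = logits + 0.2 * (reward - baseline) * grad
    baseline += 0.1 * (reward - baseline)
```

Both sketches drive the policy toward the same answer; they differ in how the overseer's signal enters the gradient.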

If we have a perfect distillation algorithm, these both converge to outputting argmax over X in D of M2(“How good is answer X to Y?”) in the limit of infinite computation.

Practical Differences

Outside of this idealized situation, circumstances could make one or the other a better update to use.

The imitation update could converge more quickly if we have a good initialization for M1(Y) from human data, as it bypasses the need to explore. It could also be less surprising, using only processes that the humans originally demonstrated.

The REINFORCE update could converge more quickly if the human initialization is suboptimal, or if it’s hard to exactly reproduce the human demonstration.

In general, it seems like the system could use an algorithm that combines reinforcement learning updates with imitation learning updates, e.g. Deep Q-learning from Demonstrations.
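One toy way to picture such a combination: mix an imitation gradient toward a demonstrated answer with a policy-gradient term weighted by overseer approval, annealing from mostly imitation to mostly RL. This only illustrates the mixing idea under made-up numbers; it is not the actual DQfD algorithm (which is Q-learning with demonstration losses). For simplicity the RL term here uses the exact expected policy gradient rather than a sampled REINFORCE estimate:

```python
import numpy as np

# Toy sketch of mixing imitation and RL updates: anneal from imitating a
# (suboptimal) demonstration toward maximizing a stand-in overseer score.
# Illustrative only -- not the actual DQfD algorithm.

answers = np.arange(5)
rewards = np.array([-abs(x - 3) for x in answers], float)  # stand-in approval
demo = 2                                   # suboptimal human demonstration

logits = np.zeros(5)

def policy(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

for step in range(2000):
    probs = policy(logits)
    imitation = -probs.copy()
    imitation[demo] += 1.0                 # push toward the demonstrated answer
    baseline = probs @ rewards
    rl = probs * (rewards - baseline)      # exact expected policy gradient
    alpha = max(0.0, 1.0 - step / 20)      # anneal: imitation early, RL later
    logits = logits + 0.3 * (alpha * imitation + (1 - alpha) * rl)

print(int(np.argmax(policy(logits))))      # -> 3
```

The imitation phase gives a fast, unsurprising start near the demonstration (answer 2); the RL phase then moves the policy past it to the answer the overseer actually rates highest (answer 3).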

Returning to the original puzzle

I think the solution is not necessarily that “ability to output good actions at this timestep” translates into “ability to evaluate which actions are good”. Rather, the decomposition of “evaluate which actions are good” contains some questions which might perform a search over an answer space; the answers to these questions are improved by reinforcement learning, and this improves the evaluation of Atari actions. This can produce a model which uses a mix of imitation learning and reinforcement learning.

For example:

“What is a good action to take from state S?” could be learned to maximize “How good is it to take action A from this state S?”

“How good is it to take action A from this state S?” could be learned by imitating an amplified reasoner that asks the subquestion “What is the most useful information to provide about the consequences of action A from state S?”

“What is the most useful information to provide about the consequences of action A from state S?” could be learned to maximize “How useful is information I about the consequences of action A in state S?”

A modified version of the question, “How good is it to take action A from this state S, and include an explanation of your reasoning?”, could also be reinforcement learned to maximize “How good is the explanation of how good it is to take action A in state S?”

Concluding Thoughts

Indeed, I think we could see every question answerable by an IA system as having the form “select the answer to question Y that the overseer approves of most”, and use both demonstrations from the amplified reasoner and the amplified reasoner’s evaluations to improve the answer. This perspective allows the system to learn to decompose problems better than the original humans could. But it might also cause problems if a series of updates leads the learned answering system to behave very differently from the original human demonstrators. We might want to be careful about the degree to which an RL-learned policy can differ from the original demonstration.

In terms of getting a system to be capable of doing some task, I’d be most optimistic about systems that can combine RL-IA and Imitation-IA depending on the situation. But I still think there’s value in thinking about the pure Imitation-IA perspective to try to reason about the alignment properties of the system.

(Thanks to Andreas Stuhlmüller and Owain Evans for feedback on a draft of this post.)