Synthesizing amplification and debate


One possible way to train an amplification model is to use an auxiliary reinforcement learning objective to help guide its training. This could be done either by training two separate models, an agent and a question-answerer, or by training a single model on a joint objective. For example, from a comment Paul left on “A dilemma for prosaic AI alignment”:

I normally imagine using joint training in these cases, rather than pre-training + fine-tuning. e.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer “knows everything the agent knows.” They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.

(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn’t use statistical regularities from the “main” objective to inform your answers to questions, and therefore your question answers will be dumber than the policy and so you couldn’t get a good reward function or specification of catastrophically bad behavior.)

In my last post, I expressed skepticism of such non-imitative amplification approaches. In this post, though, I want to propose a possible way in which some of my concerns with this style of approach could be addressed by integrating ideas from AI safety via debate. I’ll start by describing the basic idea in broad terms, then give a more careful, technical description of the sort of training procedure I have in mind.

The proposal

The basic idea is as follows: debate naturally yields an RL objective, so if you want to add an auxiliary RL objective to amplification, why not use the RL objective from debate? Specifically, the idea is to conduct a debate not between copies of the model $M$, but between copies of the amplified model $\text{Amp}(H, M)$ (where $\text{Amp}(H, M)$ is a human $H$ with access to the model $M$). That gives you both an RL reward arising from the winner of the debate and a supervised loss from the difference between the model’s output and the amplified model’s output. That’s the basic proposal. Now I’ll try to give a more careful, technical description of how I think it should go.


We will start with some initial definitions.

  • Let $Q$ be the set of all questions in natural language.

  • Let $A$ be the set of all answers in natural language.

  • Let $D$ be the training distribution of natural language questions.

  • Let $H : Q \to A$ be a human.

  • Let $M : Q \to A$ be a model.

  • Let $\text{Amp}(H, M)(q) = H(\text{“What answer would you give to } q \text{ given access to } M\text{?”})$.

  • Let $\text{Amp}'(H, M)(q) = \text{Amp}(H, M)(q)$ if $P(q)$ and $M(q)$ otherwise, where $P$ is some means of stochastically determining whether it is worth it to call $H$ or not.[1]

  • Let $\text{Amp}_M = \text{Amp}'(H, M)$.

  • Let $M_\text{target}$ be a target network updated regularly to match $M$.
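To make these definitions a bit more concrete, here is a minimal Python sketch of the amplified model (a human answering with access to the model) and of its stochastic variant, which only calls the human when some check says it is worth it. Everything named here (`amp`, `amp_gated`, `worth_calling_human`, the toy human and model) is a hypothetical stand-in of mine; in particular, handing the model to the human as a second argument is just one assumed way of modelling "access to the model."

```python
import random
from typing import Callable

Answerer = Callable[[str], str]          # the model, and anything built from it
Human = Callable[[str, Answerer], str]   # assumption: the human gets a prompt plus a handle on the model


def amp(human: Human, model: Answerer) -> Answerer:
    """Amplified model: the human answers each question with access to the model."""
    def answer(q: str) -> str:
        prompt = f"What answer would you give to {q!r} given access to M?"
        return human(prompt, model)
    return answer


def amp_gated(human: Human, model: Answerer,
              worth_calling_human: Callable[[str], bool]) -> Answerer:
    """Stochastic variant: consult the human only when the gate says so."""
    amplified = amp(human, model)

    def answer(q: str) -> str:
        return amplified(q) if worth_calling_human(q) else model(q)
    return answer


# Toy stand-ins so the sketch runs end to end.
def toy_model(q: str) -> str:
    return f"model answer to [{q}]"


def toy_human(prompt: str, model: Answerer) -> str:
    return "human-checked: " + model(prompt)


def coin_flip_gate(p: float, rng: random.Random) -> Callable[[str], bool]:
    """A hypothetical gate that calls the human with probability p."""
    return lambda q: rng.random() < p


amp_m = amp_gated(toy_human, toy_model, coin_flip_gate(0.5, random.Random(0)))
print(amp_m("Is the sky blue?"))
```

A coin flip is the crudest possible gate; anything that estimates when the human's answer is likely to differ usefully from the model's could be dropped in instead.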

Training procedure

Then, consider the following training procedure.

  1. Generate some question $q \sim D$.

  2. Conduct a debate according to the following procedure while building up a history $S$ and starting at $t = 0$.

    1. Let $\text{arg}_t = \text{Amp}_M(\text{“What is the next argument you would give on } q \text{ in context } S\text{?”})$.

    2. Append to the history $S$ the contents of $\text{arg}_t$ as well as a trace of what happened in calculating $\text{arg}_t$, including all the necessary information to use transparency tools to inspect what happened.

    3. Let $\text{sufficient}_t = \text{Amp}_{M_\text{target}}(\text{“Is } \text{arg}_t \text{ a sufficient answer to } q \text{ in context } S\text{?”})$.

    4. If not $\text{sufficient}_t$, increment $t$ and return to step 1 in the debate.

  3. Let $\text{winner} = \text{Amp}_{M_\text{target}}(\text{“Were the even arguments in } S \text{ more helpful than the odd arguments in answering } q\text{?”})$.

  4. Train $M$ on the combination of the debate-style RL reward derived from $\text{winner}$ ($+1$ for the winner and $-1$ for the loser), a relaxed adversarial training catastrophe-minimization penalty calculated by $\text{Amp}_{M_\text{target}}$, and the supervised loss $d(a, M(q(a)))$, where $d$ is some natural language distance metric and $q(a)$ represents the question posed to the function that computed the answer $a$.
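The debate in steps 2 and 3 can be sketched roughly as the loop below. This is only an illustration under several assumptions of mine: answers are free text and count as true when they start with "yes", the transparency trace is left out, and `max_turns` is a safety cap that is not part of the procedure. The names `run_debate`, `rewards`, `toy_debater`, and `make_toy_judge` are hypothetical.

```python
from typing import Callable, List, Tuple

Answerer = Callable[[str], str]


def is_yes(answer: str) -> bool:
    """Assumption: a free-text answer starting with 'yes' counts as true."""
    return answer.strip().lower().startswith("yes")


def run_debate(q: str, amp_m: Answerer, amp_target: Answerer,
               max_turns: int = 10) -> Tuple[List[str], str]:
    """Run the debate and judge it; returns (history, 'even' or 'odd')."""
    history: List[str] = []
    for _ in range(max_turns):  # max_turns is an added cap, not in the procedure
        arg = amp_m(f"What is the next argument you would give on {q!r} "
                    f"in context {history!r}?")
        history.append(arg)  # a fuller version would also append the trace
        verdict = amp_target(f"Is {arg!r} a sufficient answer to {q!r} "
                             f"in context {history!r}?")
        if is_yes(verdict):
            break
    judgement = amp_target(f"Were the even arguments in {history!r} more helpful "
                           f"than the odd arguments in answering {q!r}?")
    return history, ("even" if is_yes(judgement) else "odd")


def rewards(history: List[str], winner: str) -> List[int]:
    """+1 for every argument on the winning side, -1 for the losing side."""
    return [1 if (t % 2 == 0) == (winner == "even") else -1
            for t in range(len(history))]


# Toy stand-ins: a debater that always says the same thing, and a judge that
# declares the answer sufficient on its n-th sufficiency check.
def toy_debater(prompt: str) -> str:
    return "an argument"


def make_toy_judge(stop_after: int) -> Answerer:
    checks = {"count": 0}

    def judge(prompt: str) -> str:
        if prompt.startswith("Is "):
            checks["count"] += 1
            return "yes" if checks["count"] >= stop_after else "no"
        return "yes"  # always sides with the even arguments
    return judge


history, winner = run_debate("Is the sky blue?", toy_debater, make_toy_judge(3))
print(len(history), winner, rewards(history, winner))
```

With a judge that accepts the very first argument, the loop runs for exactly one turn, which is the case where nothing debate-like happens at all.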

Ways of combining the RL and supervised losses here include:

  • combining the two losses with some relative weighting $\alpha$,

  • using a constrained RL algorithm such as CPO to maximize the RL reward subject to maintaining some baseline performance on the supervised loss, or

  • making the supervised loss a part of the RL reward.
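As a rough sketch of the first and third options, the snippet below combines a scalar debate reward with a supervised distance. The distance used here, a Jaccard distance over token sets, is purely a toy stand-in of mine for a real natural language distance metric, and the function names, the sign convention (reward enters the loss negated), and the `beta` parameter are likewise assumptions. The constrained RL option is not shown.

```python
def token_distance(a: str, b: str) -> float:
    """Toy stand-in for the distance metric: Jaccard distance on token sets, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def combined_loss(rl_reward: float, target_answer: str, model_answer: str,
                  alpha: float, catastrophe_penalty: float = 0.0) -> float:
    """Option 1: weight the supervised loss by alpha and the RL term by 1 - alpha.

    The RL reward is negated so that a higher reward means a lower loss.
    """
    supervised = token_distance(target_answer, model_answer)
    return alpha * supervised - (1.0 - alpha) * rl_reward + catastrophe_penalty


def shaped_reward(rl_reward: float, target_answer: str, model_answer: str,
                  beta: float) -> float:
    """Option 3: fold the supervised loss into the RL reward as a penalty."""
    return rl_reward - beta * token_distance(target_answer, model_answer)


print(combined_loss(1.0, "the sky is blue", "the sky is green", alpha=0.5))
print(shaped_reward(1.0, "the sky is blue", "the sky is green", beta=0.5))
```

Setting `alpha` to 1 leaves only the imitative term, and setting it to 0 leaves only the debate reward plus the penalty.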


The first thing I want to note about this procedure is that it gives a continuous relaxation between amplification and debate. In fact, there are multiple ways to tune the procedure to limit to either pure amplification or pure debate. First, you can anneal whatever combination of the different losses you are using to eventually become exclusively imitative amplification, exclusively debate, or anything in between. Second, the fact that this procedure uses $\text{Amp}_{M_\text{target}}$ to determine whether or not to stop the debate after every argument gives another such relaxation: if the debate is always stopped after the first argument, you just get pure imitative amplification, whereas as the debate goes on for longer, you get closer and closer to pure debate. What’s really nice about this is that when pure imitative amplification is giving good results, you’ll just get pure amplification, since then $\text{continue}$ (true exactly when the latest argument is judged not to be a sufficient answer) will tend to be false; but when amplification isn’t working, you’ll get debate, since then $\text{continue}$ will tend to be true.
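The first kind of relaxation, annealing the mix of losses, could look something like the schedule below. This is a minimal sketch assuming a single weight on the supervised (imitative) loss that moves linearly over training; the linear shape, the function name `annealed_weight`, and its parameters are my own illustrative choices.

```python
def annealed_weight(step: int, total_steps: int,
                    start: float = 0.0, end: float = 1.0) -> float:
    """Linearly move the supervised-loss weight from start to end.

    With end=1.0 training limits to pure imitative amplification; with
    end=0.0 it limits to pure debate; anything in between is a mixture.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start + frac * (end - start)


for step in (0, 250, 500, 1000):
    print(step, annealed_weight(step, total_steps=1000))
```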

As a result, it should be quite possible to enforce that this procedure limit to HCH, either by annealing the losses or by forcing the decision to continue the debate, $\text{continue}$, to tend towards false. Thus, I think this procedure has a good chance of being outer aligned at optimum, or at least a chance similar to that of pure imitative amplification. Unlike pure imitative amplification, however, this procedure gets to make use of the capability benefits of having an auxiliary RL objective to help guide training. Furthermore, since the auxiliary RL objective that we’re using comes from debate, we get a lot of the benefits of debate as well, including the ability to incentivize the debaters to produce arguments that we wouldn’t necessarily have thought of ourselves, as well as the ability to train our debaters to use transparency tools against each other to help catch deception or other catastrophic behavior. That being said, I do think that whether or not something like this is inner aligned is still quite questionable, and is likely to depend highly on the specific transparency tools you have access to. Still, I like the approach described here in general, and I think it’s definitely worth looking into more.

  1. As an example approach for implementing something like this stochastic choice of when to call the human, see “A concrete proposal for adversarial IDA.”