A Concrete Proposal for Adversarial IDA

Note: This post came out of a conversation with Geoffrey Irving and Buck Shlegeris.

Epistemic Status: I suspect Paul has already thought of most or all of the ideas presented here, though I nevertheless found the exercise of carefully specifying an IDA implementation helpful and suspect others may find reading it helpful as well.

This is a proposal for how to train a machine learning model to approximate HCH using Iterated Distillation and Amplification (IDA). This particular proposal came out of a desire to use a debate-like adversary to improve the amplification process, and the primary goal of this proposal is to show how one could do that. Though I have tried to retain a lot of the relevant detail, I have made two simplifications to make this proposal easier to specify: I am attempting to approximate something closer to weak HCH rather than strong HCH, and I am only allowing the generation of two subquestions at a time. I am confident that those simplifications could easily be dropped, though I think doing so here would only make this presentation more complicated.

Before I proceed, I want to make one final note: this is not a proposal for how to build an aligned AGI. I think there are still a whole bunch of issues that would prevent this proposal from actually working.

Definitions

We will start with some initial definitions (a brief code sketch of these types follows the list):

  • Let $Q$ be the set of all questions in natural language.

  • Let $A$ be the set of all answers in natural language.

  • Let $T = A + Q \times Q$ be the sum type of either $A$ or $Q \times Q$, representing either an answer to the given question or two subquestions to help answer it.

  • Let $H : Q \to A$ be the answer that a human gives to the given question.

  • Let $H_0 : Q \to T$ be the answer or subquestion pair generated by a human when asked what to do with the given question.

  • Let $H_1 : Q \times A \times A \to T$ be the answer or two subquestions generated by a human to some question when given answers to two subquestions related to that question.

  • Let $M : Q \to \mathbb{R}^n$ be a model (the training procedure for which we will describe below) from questions to a probability distribution over strings representing answers. Specifically, we will implement the probability distribution by having our model output an embedding vector which yields the probability distribution when fed into some trained language model (by repeatedly conditioning on previous characters and multiplying all the conditional probabilities).

  • Let $M_0 : Q \to \mathbb{R}^n$ be the function from questions to an embedding vector representing a distribution over answers or subquestions, generated by asking $M$ what to do with the given question. For the two-subquestion case we enforce that the distribution is symmetric with respect to interchange of the subquestions.

  • Let $M_1 : Q \times A \times A \to \mathbb{R}^n$ be the function from the original question and two subquestion answers to an embedding vector representing a distribution over answers or subquestions, generated by asking $M$ to integrate the given subquestion answers into an answer to the original question. We again enforce symmetry with respect to interchange of the subquestions.

  • Let $\mathrm{Adv} : Q \times A \to \mathbb{R}$ be an adversary model which we will train (as described below) to predict how good the given answer is to the given question.

  • Let $\mathrm{Adv}_0$ be an adversary model for $M_0$ generated by calling $\mathrm{Adv}$ using the same transformation as used to implement $M_0$ from $M$.

  • Let $\mathrm{Adv}_1$ be an adversary model for $M_1$ generated by calling $\mathrm{Adv}$ using the same transformation as used to implement $M_1$ from $M$.

  • Let $G$ be some Gumbel random variable such that each use of $G$ below is its own independent sample from a Gumbel distribution. We use a Gumbel random variable so that we can use the Gumbel-max trick to effectively do Boltzmann exploration.
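To make these definitions concrete, here is a minimal Python sketch of the types involved. The names and representations below (plain strings for questions and answers, a dataclass-based sum type for $T$) are illustrative assumptions of this sketch, not part of the proposal itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

# Questions and answers are natural-language strings.
Question = str
Answer = str

# T = A + Q x Q: either a direct answer or a pair of subquestions.
@dataclass
class Respond:
    answer: Answer

@dataclass
class Decompose:
    subquestions: Tuple[Question, Question]

T = Union[Respond, Decompose]

# An embedding vector that parameterizes a distribution over strings
# when fed into a trained language model.
Embedding = List[float]

# Type signatures for the human oracles, the model, and the adversary.
H = Callable[[Question], Answer]              # human answer to a question
H0 = Callable[[Question], T]                  # human: answer or decompose
H1 = Callable[[Question, Answer, Answer], T]  # human: integrate subanswers
M = Callable[[Question], Embedding]           # model: question -> answer distribution
Adv = Callable[[Question, Answer], float]     # adversary: how good is this answer?
```

In this picture, $M_0$, $M_1$, $\mathrm{Adv}_0$, and $\mathrm{Adv}_1$ would just be thin wrappers that call $M$ and $\mathrm{Adv}$ on suitably rephrased inputs.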

Amplification

The goal here is to use the adversary to determine which portions of the HCH tree we actually go down, so as to maximize our sample efficiency with respect to $H$. Specifically, the amplification procedure is as follows (a rough code sketch is given after the list):

1. Generate some question $q$, either from the training distribution or from the output of some adversary.

2. Call $H_0(q)$. If we get an answer, let $a = H_0(q)$ and skip to the end. Otherwise, let $(q_1, q_2) = H_0(q)$.

3. Call $\mathrm{Adv}$ to generate scores $s_1 = G - \mathrm{Adv}(q_1, M(q_1))$ and $s_2 = G - \mathrm{Adv}(q_2, M(q_2))$, so that the subquestion on which $M$'s answer is predicted to be worse tends to receive the higher score.

4. Without loss of generality, let $s_1 \geq s_2$ and recurse on $q_1$ (by setting $q = q_1$ in the subcall) to generate $a_1$ (from the $a$ of the subcall), and let $a_2 = M(q_2)$.

5. Call $H_1(q, a_1, a_2)$. If we get an answer, let $a = H_1(q, a_1, a_2)$. Otherwise, let $(q_1, q_2) = H_1(q, a_1, a_2)$ and return to step 3.
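The following Python sketch shows the control flow of one such amplification episode, reusing the `Respond`/`Decompose` types from the earlier sketch. The stub oracles, the `decode` helper (turning a model embedding into an answer string via the language model), and the exact form of the scores are assumptions of this sketch rather than a fixed part of the proposal.

```python
import math
import random

def gumbel() -> float:
    # One independent sample from a standard Gumbel distribution.
    return -math.log(-math.log(random.random()))

def amplify(q, H0, H1, M, Adv, decode):
    """One amplification episode rooted at question q. Returns the answer a.

    H0(q) and H1(q, a1, a2) return Respond(...) or Decompose(...);
    decode(M(q)) turns the model's embedding into an answer string;
    Adv(q, a) predicts how good answer a is for question q.
    """
    step = H0(q)                                   # step 2
    while isinstance(step, Decompose):
        q1, q2 = step.subquestions
        # Step 3: negate the adversary's goodness score and add Gumbel noise,
        # so that taking the max does Boltzmann exploration favoring the
        # branch where the model's answer is predicted to be worst.
        s1 = gumbel() - Adv(q1, decode(M(q1)))
        s2 = gumbel() - Adv(q2, decode(M(q2)))
        if s2 > s1:                                # w.l.o.g. recurse on q1
            q1, q2 = q2, q1
        a1 = amplify(q1, H0, H1, M, Adv, decode)   # step 4: recurse down one branch
        a2 = decode(M(q2))                         #         trust the model on the other
        step = H1(q, a1, a2)                       # step 5
    return step.answer
```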

This procedure should eventually produce an answer $a$, though we then need to compute losses for $M_0$, $M_1$, $\mathrm{Adv}_0$, and $\mathrm{Adv}_1$ along the way: the model components are trained toward the corresponding human outputs (and toward the final answer $a$), and the adversary components are trained to predict how good $M$'s answers to the corresponding questions turned out to be.

Note that some of these losses are actually losses for the same model, since $M_0$ and $M_1$ are implemented internally by calling $M$, and $\mathrm{Adv}_0$ and $\mathrm{Adv}_1$ are implemented internally by calling $\mathrm{Adv}$.
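As a purely illustrative example of what such losses might look like (the specific choices of a negative-log-likelihood loss for the model and a squared-error loss for the adversary are my assumptions here, not something fixed by the proposal):

```python
import math
from typing import List

def model_loss(answer_char_probs: List[float]) -> float:
    """Negative log-likelihood of a target answer (e.g. the human's output,
    or the final amplified answer a) under the model's answer distribution,
    where answer_char_probs[i] is the probability the language model assigns
    to the i-th character of the target, given the model's embedding and the
    preceding characters."""
    return -sum(math.log(p) for p in answer_char_probs)

def adversary_loss(predicted_quality: float, realized_quality: float) -> float:
    """Squared error between the adversary's predicted answer quality and a
    realized quality signal, e.g. some measure of agreement between the
    model's answer and the amplified answer."""
    return (predicted_quality - realized_quality) ** 2
```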

Distillation

For the distillation procedure, we want to improve $M$ by training it to approximate its own $M$-consulting-$M$ tree, generated via $M_0$ and $M_1$. Specifically (a code sketch is given after the list):

1. Generate some $q$ as in amplification.

2. Call $M_0(q)$. If we get an answer, let $a = M_0(q)$ and skip to the end. Otherwise, let $(q_1, q_2) = M_0(q)$.

3. Let $a_1 = M(q_1)$ and $a_2 = M(q_2)$.

4. Call $M_1(q, a_1, a_2)$. If we get an answer, let $a = M_1(q, a_1, a_2)$. Otherwise, let $(q_1, q_2) = M_1(q, a_1, a_2)$ and return to step 3.
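In code, a distillation episode mirrors the amplification sketch, with $M_0$ and $M_1$ taking the place of the human and both subquestion answers coming directly from $M$. The `sample_T` and `decode` helpers (turning the model's embeddings into a concrete $T$ value or answer string) and the depth cap are assumptions of this sketch.

```python
def distill_episode(q, M0, M1, M, sample_T, decode, max_depth=64):
    """One distillation episode rooted at question q: unroll M's own
    decomposition via M0/M1, answering every subquestion with M directly.
    Returns the answer a that M(q) is then trained to reproduce."""
    step = sample_T(M0(q))                  # step 2: answer or two subquestions
    for _ in range(max_depth):              # depth cap just to guarantee the sketch halts
        if isinstance(step, Respond):
            break
        q1, q2 = step.subquestions
        a1 = decode(M(q1))                  # step 3: both subanswers come from M
        a2 = decode(M(q2))
        step = sample_T(M1(q, a1, a2))      # step 4: integrate, possibly decompose again
    return step.answer if isinstance(step, Respond) else None
```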

Then, compute the corresponding losses for $M$ and $\mathrm{Adv}$: $M(q)$ is trained toward the answer $a$ produced by unrolling its own tree, and $\mathrm{Adv}$ is trained to predict how good $M$'s direct answers were relative to the unrolled answers.

Iteration

Finally, we want to iterate the above procedure by replacing $H$, $H_0$, and $H_1$ in the amplification procedure with some primed versions $H'$, $H_0'$, and $H_1'$. First, let $c$ be some confidence threshold. Then, we will define the primed $H$s as follows:

  • Let $H'(q) = M(q)$ if $\mathrm{Adv}(q, M(q)) \geq c$, and $H'(q) = H(q)$ otherwise.

  • Let $H_0'(q) = M_0(q)$ if $\mathrm{Adv}_0(q, M_0(q)) \geq c$, and $H_0'(q) = H_0(q)$ otherwise.

  • Let $H_1'(q, a_1, a_2) = M_1(q, a_1, a_2)$ if $\mathrm{Adv}_1(q, a_1, a_2, M_1(q, a_1, a_2)) \geq c$, and $H_1'(q, a_1, a_2) = H_1(q, a_1, a_2)$ otherwise.

This procedure allows us to continue amplifying the model while using the adversary to require only minimal human data that is selected so as to be maximally helpful.
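A minimal sketch of one of the primed oracles under the threshold semantics above (the direction of the comparison and the `decode` helper are assumptions of this sketch); $H_0'$ and $H_1'$ would be analogous, using $\mathrm{Adv}_0$ and $\mathrm{Adv}_1$:

```python
def H_prime(q, H, M, Adv, decode, c):
    """Primed human oracle: consult the real human on q only when the
    adversary is not confident that the model's own answer is good enough;
    otherwise return the model's answer. c is the confidence threshold."""
    model_answer = decode(M(q))
    if Adv(q, model_answer) >= c:   # adversary judges the answer good enough
        return model_answer
    return H(q)                     # otherwise fall back to the human
```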

Conclusion

This proposal differs in a couple of ways from previous proposals made by Paul. First, Paul has recently moved away from discrete amplification/distillation steps. This proposal, however, provides a way to recover discrete steps while still collapsing the recursion. In practice, you might still just want to stick with the amplification procedure described here without doing the distillation step, as it isn't strictly necessary.

Second, this proposal uses an adversary to guide the training process. This technique is similar to the concept of importance sampling. The main benefit of this approach is that it takes advantage of active learning by allowing the system to choose which questions and subquestions would be most useful for it to have answered by a human.

Another benefit of the adversary, however, is that it could make transparency much easier. One of the major benefits of IDA is that $M$ gets trained to approximate its own $M$-consulting-$M$ tree. As a result, the reasoning that went into the final answer produced by $M$ can be recovered by unfolding its tree (at least in the limit of perfect training). However, unfolding the entire tree is very expensive, as it's linear in the size of the tree. With an adversary, you can choose which portions of the tree to unfold first by calling the adversary, enabling you to find errors much more quickly; for a perfect adversary, this reduces the problem of finding an error to $O(\log n)$ instead of $O(n)$ (for a balanced tree with $n$ nodes).
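As an illustration of that last point, here is a sketch of adversary-guided unfolding (reusing the `Decompose` type from the first sketch; all helper names are assumptions): at each node we expand only the subquestion whose model answer the adversary scores lowest, so a perfect adversary leads us to an erroneous leaf while visiting only one node per level of the tree.

```python
def find_suspect_path(q, M0, M, Adv, sample_T, decode):
    """Follow the adversary down M's decomposition tree, always descending
    into the subquestion whose model answer gets the lowest adversary score.
    Visits O(depth) nodes (O(log n) for a balanced tree of n nodes) instead
    of unfolding all n."""
    path = [q]
    step = sample_T(M0(q))
    while isinstance(step, Decompose):
        q1, q2 = step.subquestions
        # Descend into the branch the adversary is least happy with.
        q = q1 if Adv(q1, decode(M(q1))) <= Adv(q2, decode(M(q2))) else q2
        path.append(q)
        step = sample_T(M0(q))
    return path   # the chain of questions most likely to contain an error
```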

Thus, the hope is that the use of such an adversary could assist both in making IDA more competitive (by increasing sample efficiency and using active learning) and in making IDA safer (due to the increased ease of transparency).

It should be noted, however, that it is also possible that the use of such an adversary might make the safety situation for IDA worse. First, it introduces the possibility of a robustness-to-relative-scale failure if either $M$ or $\mathrm{Adv}$ gets significantly stronger than the other. One possible way to resolve such an issue might be to give $M$ the ability to call $\mathrm{Adv}$ and vice versa, allowing them to use each other to boost their own capabilities. Second, if $M$ and $\mathrm{Adv}$ are themselves optimizers, with goals that don't perfectly match up with their loss functions, they could cooperate to make it arbitrarily unlikely that $H$ is ever consulted on some specific question. Third, even if $M$ and $\mathrm{Adv}$ weren't cooperating, an RSA-2048-style failure could still prevent the identification of malicious cognition. Resolving failures of these second two types is still an open question (EDIT: see "Risks from Learned Optimization in Advanced Machine Learning Systems," by Hubinger, van Merwijk, Mikulik, Skalse, and Garrabrant).