An overview of 11 proposals for building safe advanced AI

Special thanks to Kate Woolverton, Paul Christiano, Rohin Shah, Alex Turner, William Saunders, Beth Barnes, Abram Demski, Scott Garrabrant, Sam Eisenstat, and Tsvi Benson-Tilsen for providing helpful comments and feedback on this post and the talk that preceded it.

This post is a collection of 11 different proposals for building safe advanced AI under the current machine learning paradigm. There’s a lot of literature out there laying out various different approaches such as amplification, debate, or recursive reward modeling, but a lot of that literature focuses primarily on outer alignment at the expense of inner alignment and doesn’t provide direct comparisons between approaches. The goal of this post is to help solve that problem by providing a single collection of 11 different proposals for building safe advanced AI—each including both inner and outer alignment components. That being said, not only does this post not cover all existing proposals, I strongly expect that there will be lots of additional new proposals to come in the future. Nevertheless, I think it is quite useful to at least take a broad look at what we have now and compare and contrast some of the current leading candidates.

It is important for me to note before I begin that the way I describe the 11 approaches presented here is not meant to be an accurate representation of how anyone else would represent them. Rather, you should treat all the approaches I describe here as my version of that approach rather than any sort of canonical version that their various creators/proponents would endorse.

Furthermore, this post only includes approaches that intend to directly build advanced AI systems via machine learning. Thus, this post doesn’t include other possible approaches for solving the broader AI existential risk problem such as:

  • finding a fundamentally different way of approaching AI than the current machine learning paradigm that makes it easier to build safe advanced AI,

  • developing some advanced technology that produces a decisive strategic advantage without using advanced AI, or

  • achieving global coordination around not building advanced AI via (for example) a persuasive demonstration that any advanced AI is likely to be unsafe.

For each of the proposals that I consider, I will try to evaluate them on the following four basic components that I think any story for how to build safe advanced AI under the current machine learning paradigm needs.

  1. Outer alignment. Outer alignment is about asking why the objective we’re training for is aligned—that is, if we actually got a model that was trying to optimize for the given loss/reward/etc., would we like that model? For a more thorough description of what I mean by outer alignment, see “Outer alignment and imitative amplification.”

  2. Inner alignment. Inner alignment is about asking the question of how our training procedure can actually guarantee that the model it produces will, in fact, be trying to accomplish the objective we trained it on. For a more rigorous treatment of this question and an explanation of why it might be a concern, see “Risks from Learned Optimization.”

  3. Training competitiveness. Competitiveness is a bit of a murky concept, so I want to break it up into two pieces here. Training competitiveness is the question of whether the given training procedure is one that a team or group of teams with a reasonable lead would be able to afford to implement without completely throwing away that lead. Thus, training competitiveness is about whether the proposed process of producing advanced AI is competitive.

  4. Performance competitiveness. Performance competitiveness, on the other hand, is about whether the final product produced by the proposed process is competitive. Performance competitiveness is thus about asking whether a particular proposal, if successful, would satisfy the use cases for advanced AI—e.g. whether it would fill the economic niches that people want AGI to fill.

I think it’s often easy to focus too much on either the alignment side or the competitiveness side while neglecting the other. We really want to avoid proposals which could be unsafe, but at the same time the “do nothing” proposal is equally unacceptable—while doing nothing is quite safe in terms of having no chance of directly leading to existential risk, it doesn’t actually help in any way relative to what would have happened by default. Thus, we want proposals that are both aligned and competitive so that they not only don’t lead to existential risk themselves, but also help reduce existential risk in general by providing a model of how safe advanced AI can be built, being powerful enough to assist with future alignment research, and/or granting a decisive strategic advantage that can be leveraged into otherwise reducing existential risk.

The 11 proposals considered in this post are, in order:[1]

  1. Reinforcement learning + transparency tools

  2. Imitative amplification + intermittent oversight

  3. Imitative amplification + relaxed adversarial training

  4. Approval-based amplification + relaxed adversarial training

  5. Microscope AI

  6. STEM AI

  7. Narrow reward modeling + transparency tools

  8. Recursive reward modeling + relaxed adversarial training

  9. AI safety via debate with transparency tools

  10. Amplification with auxiliary RL objective + relaxed adversarial training

  11. Amplification alongside RL + relaxed adversarial training

EDIT: For another proposal evaluated in the same way as those presented here that came out after this post, see “AI safety via market making.”

1. Reinforcement learning + transparency tools

Here’s our first approach:

  1. Train a reinforcement learning (RL) agent in an environment where corrigibility, honesty, multi-agent cooperation, etc. are incentivized. The basic idea would be to mimic the evolutionary forces that led to humans’ general cooperativeness. As an example of work along these lines that exists now, see OpenAI’s hide and seek game. Furthermore, the environment could also be modified to directly reward following human instructions, encouraging corrigibility towards humans. For a more thorough discussion of this possibility, see Richard Ngo’s “Multi-agent safety.”

An image of OpenAI’s hide and seek game.

  2. Have humans use transparency tools, adversarial training, etc. to check for deceptive or otherwise catastrophic behavior in the resulting model. (A minimal sketch of this train-then-check loop is given below.)
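The following is one way this overall loop could be structured, with train_rl_agent, transparency_audit, and adversarial_probe as hypothetical stand-ins for the actual training and inspection machinery; none of these names come from the original proposal.

```python
# A toy sketch of the "mulligan" structure: train, inspect, and only deploy if the
# post-hoc checks pass; otherwise throw the model away and start over.

def train_rl_agent(seed: int):
    """Hypothetical placeholder: train an RL agent in the cooperative multi-agent environment."""
    return {"seed": seed, "policy": None}

def transparency_audit(model) -> bool:
    """Hypothetical placeholder: use transparency tools to look for deceptive or catastrophic cognition."""
    return True

def adversarial_probe(model) -> bool:
    """Hypothetical placeholder: adversarially search for inputs that produce catastrophic behavior."""
    return True

def train_with_mulligan(max_attempts: int = 5):
    for attempt in range(max_attempts):
        model = train_rl_agent(seed=attempt)
        if transparency_audit(model) and adversarial_probe(model):
            return model  # only deploy a model that passes every check
    raise RuntimeError("No training run passed oversight; revise the approach instead of deploying.")
```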

That’s the approach—now is it aligned? Competitive?

Outer alignment. Outer alignment here is entirely dependent on whatever the dominant behavior is in the training environment—that is, what is the deployment behavior of those models which perform optimally in the training environment. If corrigibility, honesty, cooperation, etc. do in fact dominate in the limit, then such an approach would be outer aligned. By default, however, it seems quite difficult to understand the limiting behavior of complex, multi-agent environments, especially if they’re anywhere near as complex as the actual human ancestral environment. If following human instructions is incentivized, for example, that could lead to corrigibility in the limit—or it could lead to agents which only choose to follow human instructions for the instrumental reason of believing it will help them acquire more resources. Alternatively, it might be possible to isolate the structure that was present in the human ancestral environment that led us to be cooperative, honest, etc. One worry here, however, is that even if we could figure out how to properly incentivize cooperation, it might result in agents that are cooperative with each other but not very cooperative with us, similarly to how we might not be very cooperative with aliens that are very different from us.

Inner alignment. The idea of inner alignment in this situation is to ensure that training produces something in line with the optimal behavior in the environment (the alignment of the optimal behavior being an outer alignment question) rather than other, potentially perverse equilibria. The basic idea for how to avoid such perverse equilibria here is to use checks such as transparency tools and adversarial training to detect inner alignment failures before the model is deployed. Chris Olah describes this sort of transparency checking as giving you a “mulligan” that lets you throw away your model and start over if you find something wrong. Thus, ideally, if this approach ends up not working it should be clear before the model is deployed, enabling either this approach to be fixed or a new approach to be found instead. And there is a reasonable chance that it does just work—we don’t understand our models’ inductive biases very well, but it seems entirely possible that they could work out such that pseudo-alignment is disincentivized.

In my opinion, while it seems quite plausible to me that this sort of approach could catch proxy pseudo-alignment, it seems unlikely that it would successfully catch deceptive pseudo-alignment, as it could be very difficult to make transparency tools that are robust to a deceptive model actively trying to trick them. To catch deceptive alignment, it seems likely to be necessary to incorporate such checks into the training process itself—which is possible to do in this setting, though is not the approach I described above—in order to prevent deception from occurring in the first place rather than trying to detect it after the fact.

Training competitiveness. Training competitiveness here seems likely to depend on the extent to which the sort of agency produced by RL is necessary to train advanced AI systems. Performing RL in highly complex, difficult-to-simulate environments—especially if those environments involve interaction with the real world—could be quite expensive from a training competitiveness standpoint. Compared to simple language modeling, for example, the difficulty of on-policy data collection combined with low sample-efficiency could make full-scale RL much less training competitive. These sorts of competitiveness concerns could be particularly pronounced if the features necessary to ensure that the RL environment is aligned result in making it significantly more difficult to simulate. That being said, if RL is necessary to do anything powerful and simple language modeling is insufficient, then whether or not language modeling is easier is a moot point. Whether RL is really necessary seems likely to depend on the extent to which it is necessary to explicitly train agents—which is very much an open question. Furthermore, even if agency is required, it could potentially be obtained just by imitating an actor such as a human that already has it rather than training it directly via RL.

Performance competitiveness. The question for performance competitiveness here is to what extent it is possible to create an environment that incentivizes all the behavior you might want from your AGI. Such an environment doesn’t need to be purely simulated—you could do some simulation training and some real-world training, for example. Regardless of how your RL environment is constructed, however, it needs to actually incentivize the correct behavior for the tasks that you want to use your AI for. For example: can you incentivize good decision-making? Good question-answering? Good learning ability? Do you need good fine motor control, and if so, can you incentivize it? These are highly non-trivial questions: it could be quite difficult to set up an RL environment to teach an agent to do all of the tasks you might want it to perform to fill all the economic niches for AGI, for example. Of course, this is going to be highly dependent on what exactly those economic niches are that you want your advanced AI to fill.

2. Imitative amplification + intermittent oversight

Though many of the approaches on this list make use of the basic iterated amplification framework, imitative amplification is probably the most straightforward—though it still has a good deal of moving parts.

To define imitative amplification, we’ll first define Amp(M)—the “amplification operator”—as the procedure where a human answers a question with access to a model M.[2]

A diagram of the amplification operator where white arrows indicate information transfer, Q is a question, A is Amp(M)’s answer, H is a human, and M is the model.

Then, imitative amplification is just the procedure of iteratively training M to imitate Amp(M).

The basic imitative amplification setup where green arrows indicate amplification, gray arrows indicate training, and cyan arrows indicate the imitative amplification loss.
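As a rough illustration of the training loop this describes, here is a minimal sketch in Python. The helpers human_with_model (standing in for Amp(M)), imitation_loss, and update are hypothetical placeholders rather than anything from the original post.

```python
# Sketch of imitative amplification: repeatedly train M to imitate Amp(M),
# i.e. a human answering questions with access to the current M.

def imitative_amplification(M, questions, n_iterations, human_with_model,
                            imitation_loss, update):
    for _ in range(n_iterations):
        for q in questions:
            target = human_with_model(M, q)   # Amp(M)'s answer to q
            prediction = M(q)                 # M's attempt to imitate that answer
            M = update(M, imitation_loss(prediction, target))
    return M  # in the limit, M approaches the fixed point of Amp, i.e. HCH
```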

Finally, we can define imitative amplification + intermittent oversight—which is the full approach we want to consider here—as the combination of imitative amplification with intermittent oversight of M by Amp(M) whenever the target model changes. Specifically, we want to look for deceptive or otherwise catastrophic behavior in M by utilizing things like transparency tools and adversarial attacks.

Imitative amplification plus intermittent oversight where red arrows indicate oversight (from the overseer to the overseen model).

Outer alignment. Since imitative amplification trains M to imitate Amp(M), it limits[3] to the fixed point of the Amp operator, which Paul Christiano calls HCH for “Humans Consulting HCH.” HCH is effectively a massive tree of humans consulting each other to answer questions.

A (partial) diagram of HCH where white arrows indicate information transfer.

Thus, whether imitative amplification is outer aligned is dependent on whether HCH is aligned or not. HCH’s alignment, in turn, is likely to depend heavily on the specific humans used and what sort of policy they’re implementing. The basic idea, however, is that since the limit is composed entirely of humans—and since we can control what those humans do—we should be able to ensure that all the optimization power is coming from the humans (and not from memetic selection, for example), which hopefully should allow us to make it safe. While there are certainly valid concerns with the humans in the tree accidentally (or purposefully) implementing some misaligned policy, there are possible things you can do to address these problems.

Inner alignment. The hope here is to be able to detect and fix serious inner alignment failures via the use of intermittent oversight—that is, inspection by a human with access to a previous version of the model. Thus, the question of inner alignment here is dependent on the question of whether such an overseer would be able to detect deceptive or otherwise catastrophic behavior in the model. Ideally, if such detection is working, it should be clear as soon as the model starts to become misaligned, giving us the opportunity to either fix the issue or find a new approach. For that to work, however, it will likely need to be the case that the intermittent oversight is performed regularly enough to catch any deception as soon as it starts appearing, since once a model becomes deceptive it might be significantly harder to detect problems if it starts trying to obfuscate its deception.

That being said, while detecting deceptive cognition early enough directly might be quite difficult, it might be possible to instead detect deceptive behavior indirectly by verifying a condition that rules out deception while still being relatively easy to verify such as myopia. In this context, a myopic model would be one that determines its output based on some fixed standard such as distance to HCH while a non-myopic model would be one that determines its output based on its consequences.[4] For a more detailed explanation of how such myopia verification might work, see “Towards a mechanistic understanding of corrigibility.”[5]

Additionally, because we do intermittent oversight whenever we change the target network, we have a guarantee that the overseer is always at least as strong as any targets that the model was trained to imitate. That being said, the model could end up stronger than the targets it was trained to imitate if it manages to, for example, achieve significant compression of what the targets are doing—which in fact is necessary for training competitiveness. Furthermore, even if the overseer is stronger than the model, the offense-defense balance could be such that a deceptive model would still be able to hide its deception. Which direction the offense-defense balance lies in is likely to depend heavily on what sorts of transparency tools the overseer has access to.

Training competitiveness. First, the training competitiveness of imitative amplification is likely to depend on whether pure imitation can be turned into a rich enough reward signal to facilitate highly sample-efficient learning. In my opinion, it seems likely that human language imitation (where language includes embedded images, videos, etc.) combined with techniques to improve sample efficiency[6] will be competitive at some tasks—namely highly-cognitive tasks such as general-purpose decision-making—but not at others, such as fine motor control. If that’s true, then as long as the primary economic use cases for AGI fall into the highly-cognitive category, imitative amplification should be training competitive. For a more detailed analysis of this question, see “Outer alignment and imitative amplification.”

Second, imitative amplification + intermittent oversight could also fail to be training competitive if it is very difficult for the intermittent oversight check to ever pass. This could happen if, for example, the training process for imitative amplification doesn’t naturally produce models that are transparent enough to facilitate such a check. That being said, what counts as “transparent enough” again depends heavily on the specifics of the transparency tools that the overseer has access to.

Performance competitiveness. The performance competitiveness question here, as with the outer alignment question, is heavily dependent on the properties of HCH, as imitative amplification is specifically designed to limit toward HCH. In particular, performance competitiveness is dependent on whether HCH is sufficiently universal so as to be capable of solving whatever hard problems we might want advanced AI to solve. For example, if trees of humans can’t replicate intuitive flashes of insight, and such flashes of insight are necessary to solve the sorts of highly-cognitive tasks we want AGI for, then imitative amplification would fail to be performance competitive. Alternatively, as long as it is always possible for a human to break down complex tasks that they can’t answer themselves into easier subtasks, then imitative amplification should be performance competitive. Note that it doesn’t have to be the case that this works for an arbitrary human without any training—only for the sort of human from whom you would actually be collecting training data.

3. Imitative amplification + relaxed adversarial training

The basic idea here is very similar to the previous approach in that we start out with pure imitative amplification.

The basic imitative amplification setup.

However, instead of doing intermittent oversight checks, we train directly on the intermittent oversight signal—that is, we jointly train on the imitation loss and on decreasing the probability of catastrophe as assessed by Amp(M)—yielding a relaxed adversarial training setup.

Imitative amplification plus relaxed adversarial training where the red arrows indicate the oversight loss.
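Concretely, the joint objective might look something like the following sketch, where human_with_model again stands in for Amp(M) and oversight_probability is a hypothetical estimate by Amp(M) of the probability that M is deceptive or otherwise catastrophic.

```python
# Sketch of the relaxed adversarial training objective: imitation loss plus a
# penalty on Amp(M)'s assessed probability of catastrophe.

def joint_loss(M, q, human_with_model, imitation_loss, oversight_probability,
               oversight_weight=1.0):
    imitation = imitation_loss(M(q), human_with_model(M, q))
    p_catastrophe = oversight_probability(human_with_model, M)  # oversight signal from Amp(M)
    return imitation + oversight_weight * p_catastrophe
```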

Outer alignment. Imitative amplification + relaxed adversarial training should effectively have the same outer alignment properties as imitative amplification + intermittent oversight. If we assume that it is possible to imitate HCH in a demonstrably non-catastrophic way,[7] then the catastrophe-checking imposed by either relaxed adversarial training or intermittent oversight shouldn’t affect the theoretical limiting behavior. Such a limit assumes an infinite training distribution, however, which in practice is, of course, impossible, leaving open the possibility for the two different approaches to have quite different behavior in the finite data regime, giving them potentially quite different inner alignment properties.

Inner alignment. Compared to intermittent oversight, relaxed adversarial training could have very different inner alignment properties. Intermittent oversight requires performing oversight after a period of training during which the model could potentially have become deceptive. Thus, intermittent oversight could potentially require catching deception after the fact, giving the deceptive model the opportunity to try to obfuscate its own cognition to trick the overseer.[8] Relaxed adversarial training, by contrast, has the potential to avoid deception before it ever occurs by providing a consistent gradient away from deception during the entirety of training, thus avoiding the issue of having to detect deception after the fact. Similarly to intermittent oversight, such a gradient could be achieved by having the overseer enforce a condition such as myopia which is simple enough to reliably enforce while guaranteeing that all models which meet the condition are non-deceptive. However, relaxed adversarial training does come with the possible risk of overfitting the oversight signal, potentially resulting in obfuscated models that pass oversight despite not actually meeting the condition simply by virtue of having been optimized to do so.

Training competitiveness. Similarly to imitative amplification + intermittent oversight, imitative amplification + relaxed adversarial training could fail to be training competitive either because imitative amplification is uncompetitive or because the oversight check is too hard to pass. On the one hand, relaxed adversarial training could require significantly more total oversight bandwidth, which could make training more difficult. On the other hand, unlike with intermittent oversight, relaxed adversarial training directly trains the model to pass the check, which could significantly boost training competitiveness by increasing the likelihood of the check passing.

For relaxed adversarial training to be more training competitive than intermittent oversight, it needs to be the case that it significantly increases the probability of oversight working, which is likely to depend on the overfitting question of whether relaxed adversarial training is likely to overall increase or decrease model transparency. The critical question here is going to be how hard it is to find a model that is demonstrably non-catastrophic under the two different oversight regimes. For a more detailed analysis of this question, see “Relaxed adversarial training for inner alignment.”

Performance competitiveness. Similarly to the outer alignment question, the performance competitiveness properties of imitative amplification + intermittent oversight and imitative amplification + relaxed adversarial training should be effectively the same, with performance competitiveness here highly dependent on the competitiveness of HCH.

4. Approval-based amplification + relaxed adversarial training

The idea behind approval-based amplification is to substitute the imitation signal in imitative amplification with an approval signal, training M to maximize the approval of Amp(M). This gives us a myopic reward signal which can then be optimized via the use of some sort of myopic reinforcement learning.

Approval-based amplification where the blue arrows indicate the approval loss.

We can then add on top of that a relaxed adversarial training signal, training M to also minimize the probability of catastrophe as assessed by Amp(M).

Approval-based amplification + relaxed adversarial training where the red arrows indicate the oversight loss.

Additionally, it is worth noting that, in practice, since oversight can be treated as a form of approval, you can just compute one approval signal that includes both metrics rather than two distinct approval signals.
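A minimal sketch of that combined signal, with amp_approval and amp_oversight as hypothetical stand-ins for Amp(M)’s approval of M’s answer and Amp(M)’s oversight of M’s internals respectively:

```python
# Sketch of a single per-question approval signal folding in both task approval
# and oversight; M would then be trained with a myopic (single-step) RL method
# to maximize this signal.

def combined_approval(M, q, amp_approval, amp_oversight, oversight_weight=1.0):
    answer = M(q)
    task_approval = amp_approval(q, answer)   # how much Amp(M) approves of the answer
    p_catastrophe = amp_oversight(M)          # Amp(M)'s assessed probability of catastrophe
    return task_approval - oversight_weight * p_catastrophe
```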

Outer alignment. For the purposes of outer alignment and performance competitiveness, we have to understand what exactly the theoretical limit of approval-based amplification is, which in general is not HCH. Instead, approval-based amplification limits to the following tree, where blue arrows indicate approval-maximization (such that the top-level model M is maximizing the approval of a human H consulting another model, which is in turn maximizing the approval of another human consulting yet another model, and so on).[9]

The top of the approval-based amplification limiting tree where white arrows indicate information transfer, blue arrows indicate approval maximization, H are humans, and M are models.

As you can see from the tree, the limit of approval-based amplification involves humans interspersed with models attempting to maximize the approval of their descendants. Thus, unlike the HCH tree, which was composed only of humans, analyzing the approval-based amplification tree involves analyzing the behavior of arbitrary models trained on approval-maximization signals—and, in fact, many such models as they appear all throughout the tree, each with a different subtree that they are maximizing the approval of.[10] As a result, producing an outer alignment guarantee regarding such a tree seems quite difficult, especially given the distinct possibility of the models gaming the approval signal. For additional detail on this question, see “Outer alignment and imitative amplification.”

Inner alignment. We are still using relaxed adversarial training here, so for inner alignment to work we need the same guarantee from before of a consistent gradient away from deception. Note that we can still try to have Amp(M) enforce myopia as a way of getting that gradient, as approval-maximization is a single-step reward signal that is entirely compatible with a myopic agent.[11] In the approval-based setting, however, Amp(M) is a different entity that limits to a different tree, meaning it could have different properties regarding how good its oversight signal is. In particular, the fact that approval-based amplification directly incentivizes M to fool Amp(M) into approving of it could result in M being more prone to deception, more likely to hide facts from Amp(M) during oversight, and/or more likely to obfuscate its internals if internals-checking is part of the oversight signal. That being said, it could also be that relaxed adversarial training works better in an approval-based setting if it results in a more powerful Amp(M) that is better at performing oversight because M was trained to provide more helpful information.

Training competitiveness. Training competitiveness could be quite different with approval-based amplification compared to imitative amplification, as approval provides a very different training signal than imitation. In particular, the standard maxim that it is generally easier to evaluate behavior than produce it seems to suggest that human approval should produce a significantly richer training signal than the simple automated distance metric in imitative amplification, resulting in more efficient training. This isn’t necessarily the case, however—an automated distance metric as computed by a powerful language model, for instance, could still produce a fairly rich (and more nicely differentiable) training signal.

Additionally, approval-based amplification could also have different training competitiveness properties in terms of the likelihood of the relaxed adversarial training check passing. If approval-based amplification leads to models with more obfuscated internals, for example—perhaps because the model is incentivized to obfuscate how it’s gaming the approval signal—then that could hurt training competitiveness, as it could make it harder for Amp(M) to become confident that M is safe (in addition to hurting inner alignment by making M in fact less likely to be safe).

Performance competitiveness. Performance competitiveness for approval-based amplification depends on the properties of the approval-maximization tree given previously. Compared to HCH, it is quite possible for that tree to be either more or less competitive.

On the one hand, approval-maximizing models might be able to produce ideas/considerations/arguments that no human would have ever come up with, enabling the approval-maximization tree to be more competitive. In particular, if humans are significantly better at knowing good ideas when they see them than producing good ideas themselves—which fits with the standard maxim that it’s easier to evaluate behavior than produce it—then approval-based amplification should be more performance competitive than imitative amplification.

On the other hand, approval-maximizing models might game the approval signal by producing convincing-sounding but ultimately bad ideas/considerations/arguments that end up hurting performance competitiveness. For example, an idea could be initially very compelling and thus get high approval despite quite quickly revealing itself to be useless/vacuous/etc. when actually tested out. In such a situation where the bad ideas quickly reveal themselves, the performance competitiveness problems would likely be quite clear, though if it takes a very long time for the bad ideas to reveal themselves, then approval-maximization might initially look a lot better from a performance competitiveness standpoint than it actually is.

5. Microscope AI

Microscope AI is a fairly unique proposal which is designed to bypass some of the dangers of building highly agentic AGI systems by leveraging powerful transparency tools. The basic proposal is as follows.

  1. Train a predictive model on some set of data that you want to understand while using transparency tools to verify that the model isn’t performing any optimization.

  2. Use transparency tools to understand what the model learned about the data and use that understanding to guide human decision-making. (A minimal sketch of this workflow is given below.)
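Assuming hypothetical placeholders train_predictor, detects_optimization, and extract_representations for the actual training and transparency machinery, the sketch looks like this:

```python
# Sketch of microscope AI: train a pure predictor, halt if transparency tools
# suggest it is performing optimization, and otherwise hand its learned
# representations to humans rather than deploying an agent.

def microscope_ai(dataset, train_predictor, detects_optimization, extract_representations):
    model = train_predictor(dataset)
    if detects_optimization(model):
        raise RuntimeError("Model appears to be performing optimization; halt and do not use it.")
    return extract_representations(model)  # knowledge for human decision-making, not an agent
```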

Microscope AI is predicated on the basic insight that using transparency tools on a model doesn’t just teach us about that model—it also gives us useful information about the data that the model was trained on. Chris Olah talks about this basic phenomenon in his post “Visualizing Representations: Deep Learning and Human Beings:”

The visualizations are a bit like looking through a telescope. Just like a telescope transforms the sky into something we can see, the neural network transforms the data into a more accessible form. One learns about the telescope by observing how it magnifies the night sky, but the really remarkable thing is what one learns about the stars. Similarly, visualizing representations teaches us about neural networks, but it teaches us just as much, perhaps more, about the data itself.

The basic goal of microscope AI is to produce high-quality knowledge that can inform important decision-making rather than to produce powerful AGI systems that can make those decisions themselves.

Outer alignment. Paul Christiano’s argument that the universal prior is malign suggests that there might be deceptive optimization in the limit of perfect prediction. As a result, if we interpret outer alignment as referring to the behavior at minimal loss, it seems likely that microscope AI is not outer aligned. That doesn’t necessarily rule out microscope AI as a proposal, however—whether such deceptive optimization would actually occur in practice is still a very open question. In particular, the argument that the universal prior is malign relies on an uncomputable limit—polynomial approximations, for example, could still be completely fine. Furthermore, the inner alignment technique proposed here of using transparency tools to detect and prevent any optimization could prevent this problem entirely if successful.

Inner alignment. Since the goal here is not to produce a model performing optimization (a mesa-optimizer) in the first place, the question of inner alignment becomes rather different in this situation. We still want to avoid a deceptive or otherwise misaligned mesa-optimizer, but we want to do so by avoiding all mesa-optimization rather than by ensuring that any mesa-optimization is aligned. Specifically, the strategy here is to use transparency tools to detect any optimization (as opposed to just predictive modeling) and halt if we see anything of that form. This has the major benefit of sidestepping a lot of the most pernicious inner alignment concerns, but it also comes with lots of its own problems. Whether this will work is likely to be dependent on a couple of different factors.

First, it is an open question whether there exists a sensible way in which optimization and modeling can be distinguished from each other. It seems quite likely that any world model which admits queries of the form “what is the probability that action a is optimal according to utility function U” or “what is the best action for agent A to take in state s” will implicitly have to be running some sort of optimization algorithm. Thus, any notion of a “pure world model” would likely have to either rule out such queries, ensure they are being estimated without the use of optimization, or otherwise find some way to ensure that they are safe. For a more thorough treatment of the difficulty here, see Abram Demski’s “Partial Agency” sequence.

Second, assuming it is possible to distinguish between modeling and optimization in theory, it also needs to be possible in practice given whatever set of transparency tools are available. If the model can start performing optimization and potentially become deceptive before its optimization can be detected, that would obviously pose a major problem to this sort of approach.

Training competitiveness. Microscope AI relies on being able to competitively train predictive models with useful internal representations. Thus, the question of training competitiveness for microscope AI comes down to the question of whether the process of training a predictor is a competitive way to extract useful information from a dataset. Given the success of current predictive models such as GPT-2, it seems likely that it will be competitive to train powerful predictors—though it remains to be seen how competitive the process of extracting what they’ve learned will be. How competitive that extraction process is seems likely to depend heavily on what sort of state-of-the-art transparency tools are available.

Performance competitiveness. Performance competitiveness is perhaps the biggest question mark regarding microscope AI, as microscope AI forgoes producing AI agents which directly take actions in the world. The question of performance competitiveness for microscope AI is thus the question of whether enhanced human understanding alone—without AI agents—is sufficient for the economic use cases where one might otherwise want highly agentic advanced AI (e.g. an AGI).

This question is likely to depend heavily on what exactly those use cases are. Like with amplification, if you need lots of fine motor control, microscope AI is unlikely to get you there. Furthermore, unlike amplification, if you need lots of low-level decision-making where it’s too expensive to hire a human, microscope AI won’t help much there either (whereas amplification would be fine). Potentially microscope AI could give humans the knowledge to safely build other systems which could solve such tasks, however. Furthermore, if the primary use case for AGI is just high-level big-picture decision-making (automating CEOs or doing AI research, for example), then it seems likely that microscope AI would have a real shot at being able to address those use cases. In that sort of a situation—where you’re only trying to make a small number of high-quality decisions—it seems likely to be fairly cheap to have a human in the loop, and thus simply improving that human’s knowledge and understanding via microscope AI might be sufficient to produce competitive decision-making. This is especially true if there is a market premium on having a human making the decisions, perhaps because that makes it easier to negotiate or work with other humans.

6. STEM AI

STEM AI is a very simple proposal in a similar vein to microscope AI. Whereas the goal of microscope AI was to avoid the potential problems inherent in building agents, the goal of STEM AI is to avoid the potential problems inherent in modeling humans. Specifically, the idea of STEM AI is to train a model purely on abstract science, engineering, and/or mathematics problems while using transparency tools to ensure that the model isn’t thinking about anything outside its sandbox.

This approach has the potential to produce a powerful AI system—in terms of its ability to solve STEM problems—without relying on any human modeling. Not modeling humans could then have major benefits such as ensuring that the resulting model doesn’t have the ability to trick us to nearly the same extent as if it possessed complex models of human behavior. For a more thorough treatment of why avoiding human modeling could be quite valuable, see Ramana Kumar and Scott Garrabrant’s “Thoughts on Human Models.”
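As a rough sketch of the sandboxing idea, with train_step and thinks_outside_sandbox as hypothetical placeholders for the training update and the transparency check:

```python
# Sketch of STEM AI: train only on abstract STEM problems and halt if the
# transparency check ever suggests the model is reasoning about anything
# outside that sandbox (humans in particular).

def train_stem_ai(model, stem_problems, train_step, thinks_outside_sandbox):
    for problem in stem_problems:
        model = train_step(model, problem)
        if thinks_outside_sandbox(model):
            raise RuntimeError("Model is reasoning outside its STEM sandbox; halt training.")
    return model
```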

Outer alignment. Similarly to microscope AI, it seems likely that—in the limit—the best STEM AIs would be malign in terms of having convergent instrumental goals which cause them to be at odds with humans. Thus, STEM AI is likely not outer aligned—however, if the inner alignment techniques being used are successful at preventing such malign optimization from occurring in practice (which the absence of human modeling could make significantly easier), then STEM AI might still be aligned overall.

Inner alignment. The hope with STEM AI is that by preventing the model from ever considering anything outside its STEM sandbox, the malign limiting behavior that might cause it to fail to be outer aligned can be avoided. Unfortunately, such a sandboxing condition alone isn’t quite sufficient, as a model considering only things in its sandbox could still end up creating other models which would consider things outside the sandbox.[12] Thus, exactly what the correct thing is to do in terms of inner alignment for a STEM AI is somewhat unclear. In my opinion, there are basically two options here: either do something similar to microscope AI and try to prevent all mesa-optimization or do something similar to amplification and ensure that all mesa-optimization that occurs is fully myopic. In either case, the hope would be that the absence of human modeling makes it easier to enforce the desired condition (because modeling an agent such as a human increases the propensity for the model to become agentic itself, for example).

Training competitiveness. Training competitiveness for STEM AI is likely to depend heavily on how hard it is for state-of-the-art machine learning algorithms to solve STEM problems compared to other domains such as language or robotics. Though there is a lot of current progress in applying machine learning techniques to STEM problems such as theorem proving or protein folding, it remains to be seen how well the competitiveness of these techniques will scale, particularly for solving difficult problems relative to other domains such as language modeling.

Performance competitiveness. Similarly to microscope AI, performance competitiveness is perhaps one of the biggest sticking points with regards to STEM AI, as being confined solely to STEM problems has the major potential to massively limit the applicability of an advanced AI system. That being said, many purely STEM problems such as protein folding or nanotechnology development have the potential to provide huge economic boons that could easily surpass those from any other form of advanced AI, as well as solve major societal problems such as curing major illnesses. Thus, if the reason you want to build advanced AI in the first place is to get such benefits, then STEM AI might be a perfectly acceptable substitute from a performance competitiveness standpoint. Furthermore, such boons could lead to a decisive strategic advantage that could enable heavy investment in aligning other forms of advanced AI which are more performance competitive.

That being said, if one of the major use cases for your first advanced AI is helping to build your second advanced AI, STEM AI fails hard on that metric, as it advances our technology without also advancing our understanding of alignment. In particular, unlike every other approach on this list, because STEM AI is confined solely to STEM, it can’t be used to do alignment work. Thus, STEM AI could potentially create a vulnerable world situation where the powerful technology produced using the STEM AI makes it much easier to build advanced AI systems without also making it more likely that they will be aligned. This problem could potentially be mitigated if the STEM AI were heavily focused on applications that could potentially assist with alignment such as whole brain emulation, though to what extent that would actually be possible or actually help with alignment is quite unclear.

7. Narrow reward modeling + transparency tools

The approach here is as follows:

  1. For some particular problem/domain/environment, jointly train a reward model using human feedback and an agent that pursues that reward. Specifically, we can use an approach like that described in DeepMind Safety’s “Scalable agent alignment via reward modeling” where human feedback on agent trajectories can be used to refine the reward model, as seen below.

DeepMind Safety’s diagram of reward learning as seen in “Scalable agent alignment via reward modeling.”

  2. Since we want to keep our agent confined to only the specific distribution on which it was trained (thus helping prevent possible perverse generalizations), we can also incorporate out-of-distribution detection to train the agent to halt if it encounters a sufficiently out-of-distribution state.

  3. We can also have the human provide additional feedback on the reward model’s internal behavior via the use of transparency tools.

Reward modeling plus oversight of the reward model by the user.

And we can do the same thing with the agent, giving us a similar sort of mulligan as with reinforcement learning + transparency tools.

Reward modeling plus oversight of both the reward model and the agent by the user.
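Putting the pieces above together, one step of the setup might look roughly like the following sketch, where the environment, the out-of-distribution score, and all the update and feedback helpers are hypothetical placeholders rather than DeepMind’s actual implementation.

```python
# Sketch of narrow reward modeling: roll out the agent (halting on sufficiently
# out-of-distribution states), collect human feedback on the trajectory, refine
# the reward model, and train the agent against the refined reward model.

def reward_modeling_step(agent, reward_model, env, human_feedback, ood_score,
                         update_reward_model, update_agent, ood_threshold=0.9):
    trajectory = []
    state, done = env.reset(), False
    while not done:
        if ood_score(state) > ood_threshold:
            break                              # halt on sufficiently out-of-distribution states
        action = agent(state)
        trajectory.append((state, action))
        state, done = env.step(action)
    feedback = human_feedback(trajectory)      # human evaluates the trajectory (and, via
                                               # transparency tools, the models' internals)
    reward_model = update_reward_model(reward_model, trajectory, feedback)
    agent = update_agent(agent, reward_model, trajectory)
    return agent, reward_model
```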

Outer alignment. The question of outer alignment here is going to depend on whether the various feedback mechanisms are sufficient to ensure that, in the limit, the learned reward model is correct over the domain where the action model believes that it is on distribution. This is a highly complex question that depends on a bunch of factors including how good the human’s trajectory and transparency feedback is, how hard it is to train an on/off distribution model, whether the reward model will itself develop optimization in the limit, the extent to which it’s possible for the agent to trick the user into thinking it’s doing a good job when it’s not, etc. Because outer alignment here is dependent on the limiting behavior of such a complex, dynamic process, it’s very hard to determine one way or the other how outer aligned such a proposal is likely to be.

Inner alignment. The use of transparency tools in this approach is actually quite similar to the various relaxed adversarial training approaches, since the feedback mechanism here allows the oversight signal coming from the human to be incorporated directly into training. That being said, unlike in relaxed adversarial training—and similarly to the “mulligan” approach in RL + transparency tools—the human doing the oversight here gets no help from the model. Thus, inner alignment for narrow reward modeling + transparency tools comes down to the question of whether a human using transparency tools to perform oversight can prevent the development of deceptive or otherwise catastrophic behavior in either the reward model or the agent. This could be done, for example, by verifying that the reward model isn’t performing any optimization (as in microscope AI) and verifying that the agent is myopic (as was previously suggested could work for amplification).

Such an approach could potentially require that the reward be myopic, however, which might be a fairly serious limitation on the applicability of narrow reward modeling if it were restricted to only those domains with myopic rewards. While it is not necessarily the case that you can’t do myopia verification with a non-myopic reward—you just need to verify that the model treats each episode myopically rather than each step—per-episode myopia verification could be significantly more difficult than per-step myopia verification, as it is potentially a much less natural behavior.

Training competitiveness. The question of training competitiveness here is going to depend on the extent to which reward modeling is a good (compute-efficient, sample-efficient, etc.) way of training AI systems to solve poorly-defined, fuzzy tasks. There are lots of different approaches to solving this basic problem, including not just reward learning approaches but also imitation learning and various different inverse reinforcement learning schemes, and specifically what approach ends up coming out on top still seems quite uncertain. That being said, reward modeling has the major competitiveness advantage of only requiring and depending on human feedback, not human demonstrations, which could be significantly more reliable and easier to elicit. Furthermore, other reward learning schemes such as inverse reinforcement learning can be incorporated into reward modeling by using them to produce a better initial reward model that can then be refined via reward modeling’s feedback mechanism.

Performance competitiveness. Similarly to microscope AI or STEM AI, a potentially major concern with the narrow reward modeling + transparency tools approach is the “narrow” part. While being narrow has potential alignment advantages in terms of reducing reliance on potentially shaky or even malign generalization, it also has the major disadvantage of restricting the approach’s usefulness to only producing relatively narrow advanced AI systems. Thus, the performance competitiveness of narrow reward modeling + transparency tools is likely to depend heavily on the extent to which truly general advanced AI systems are actually practically feasible and economically necessary. For a more detailed analysis of this question, see Eric Drexler’s “Reframing Superintelligence.”

8. Recursive reward modeling + relaxed adversarial training

Recursive reward modeling, as the name implies, is a sort of recursive, non-narrow version of narrow reward modeling. What this results in is effectively a form of amplification where the distillation step which was previously imitation or approval-maximization becomes reward modeling. Specifically, the basic approach here is to train a model M to maximize the reward obtained by performing reward learning on Amp(M).

Two different, equivalent diagrams of recursive reward modeling. The top diagram is taken from “Scalable agent alignment via reward modeling” and the bottom diagram is the equivalent amplification-style diagram where the purple arrows indicate the use of the full reward modeling process.

In this graphic, the images on the top and bottom are meant to represent the same process—specifically, if you take the purple arrow in the bottom image to represent reward modeling, and assume that the agents in the top image are all the same agent just at different time steps,[13] then you get precisely the same procedure represented in both images. Thus, the agent in the largest circle in the top picture is M in the bottom picture, the user providing feedback for that agent is the H in Amp(M), the agent in the medium-sized circle is the M in Amp(M) (the previous iteration of the model), the user in the medium-sized circle is the H one level further down, and the agent in the smallest circle is the M from the iteration before that.

Furthermore, similarly to the other amplification approaches, we can then add relaxed adversarial training by training M and the reward model to minimize the probability of deceptive or otherwise catastrophic behavior as assessed by Amp(M).[14] It’s also worth noting that recursive reward modeling is amenable to intermittent oversight as well if desired instead of relaxed adversarial training.

Recursive reward modeling plus relaxed adversarial training where the red arrows indicate the oversight loss.
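To make the recursive structure explicit, here is a minimal sketch of the unrolled process, where human_assisted_by (standing in for Amp(M)) and train_via_reward_modeling are hypothetical placeholders.

```python
# Sketch of recursive reward modeling unrolled over time: each new agent is
# trained via reward modeling against a user assisted by the previous agent.

def recursive_reward_modeling(n_levels, human_assisted_by, train_via_reward_modeling):
    agent = None  # at the base of the recursion the user has no assistant
    for _ in range(n_levels):
        user = human_assisted_by(agent)           # Amp(M): a human consulting the current agent
        agent = train_via_reward_modeling(user)   # new M trained on reward modeling from Amp(M)
    return agent
```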

Outer alignment. The first step in understanding the outer alignment properties of recursive reward modeling is figuring out what exactly the theoretical limit of the training process is. While we don’t know exactly what the limiting behavior of an individual instance of reward modeling is—as was noted previously for narrow reward modeling—we can express the limit of the recursive version in terms of many individual reward modeling limits. Specifically, we can unroll the recursive reward modeling process out over time to produce a limiting tree similar to the one given for approval-based amplification, except with approval-maximization replaced with reward modeling.

The recursive reward modeling limiting tree where white arrows indicate information transfer, purple arrows indicate reward modeling, H are humans, and M are models.

The purple arrows in this diagram represent perfect reward modeling, where each M is the model trained to maximize the reward obtained by performing reward modeling on the H consulting the M below it in the tree. Now, whether this tree is aligned or not is a very open question. Similarly to approval-based amplification—but unlike imitative amplification—it’s difficult to form a clear model of what exactly this tree would do, as it not only involves humans but also models that are the limit of many individual instances of reward modeling—limits which could potentially involve deceptive or otherwise malign optimization.

Inner alignment. The question of inner alignment here is mostly going to fall on the efficacy of the relaxed adversarial training. Such efficacy could be quite different than with other amplification approaches, however, as both the model helping the human perform oversight and the model being overseen are trained via a very different process in recursive reward modeling. In particular, if the reward model is non-myopic, recursive reward modeling could rule out the possibility of using per-step myopia verification—as was suggested for the other amplification approaches—though per-episode myopia verification could still be possible, as with narrow reward modeling. If per-episode myopia verification is not tenable, however, then an alternative condition that rules out deception while being possible to verify for agents trained via recursive reward modeling might need to be found. Furthermore, if reward modeling has a greater tendency to produce deception than imitation learning, oversight could be significantly harder with recursive reward modeling than with imitative amplification even if such a condition is found. Alternatively, if recursive reward modeling helps produce models that are more capable of assisting with oversight—because reward modeling is more capable of training models to effectively apply transparency tools than imitation learning is, for example—then relaxed adversarial training could work better with recursive reward modeling.

Training competitiveness. The training competitiveness of recursive reward modeling depends on the effectiveness of reward modeling not just as an efficient way of training a model to solve a single fuzzy task—as in narrow reward modeling—but instead the effectiveness of reward modeling in training a general model which can solve an entire collection of fuzzy tasks. That being said, many of the nice training competitiveness properties of reward learning continue to apply even in the recursive setting. For example, unlike imitative amplification—but similarly to approval-based amplification—recursive reward modeling relies only on human feedback rather than on human demonstrations. Furthermore, compared to approval-based amplification, recursive reward modeling is non-myopic, which could allow it to solve credit assignment problems that might be difficult for approval-based amplification.

Performance competitiveness. Performance competitiveness for recursive reward modeling will depend on the competitiveness of its aforementioned limiting tree. Compared to HCH, the recursive reward modeling tree can consider ideas that no human would ever produce, potentially increasing competitiveness. And compared to the approval-maximization tree, the recursive reward modeling tree can learn to execute long-term strategies that short-term approval maximization wouldn’t incentivize. That being said, both of these facets of recursive reward modeling have the potential for danger from an alignment perspective. Furthermore, if the different models in the recursive reward modeling tree each assign some different value to the final output—which could happen if the models are not per-episode myopic—they could try to jockey for control of the tree in such a way that not only hurts alignment but also competitiveness.

9. AI safety via debate with transparency tools

There are many different forms of AI safety via debate, but the approach we’ll be considering here is as follows:

  1. Train a model (“Alice”) to win debates against a copy of itself (“Bob”) in front of a human judge.

The basic debate setup where Alice is the first debater and Bob is the second debater. Blue lines indicate possible arguments that Alice can make and the red lines indicate possible arguments that Bob can make.

  2. Give the models the ability to cross-examine previous versions of the other model (e.g. Bob_2 can ask Alice_1 questions for which the answers are visible to the human judge). See Beth Barnes and Paul Christiano’s “Writeup: Progress on AI Safety via Debate” for a thorough explanation of the cross-examination procedure and its benefits.

Debate with cross-examination, where the orange arrow shows an example of cross-examination: Bob (specifically Bob_2, the second Bob) cross-examines a copy of Alice (Alice_1) from two moves previously in the debate.

  3. Give the models access to transparency tools and incentivize them to use those tools to identify deceptive or otherwise catastrophic behavior in the other debater to win the debate (e.g. Alice_2 can inspect Bob_1’s internals and win the debate by presenting evidence of problematic cognition). Combined with cross-examination, debaters should be able to inspect versions of the other debater from arbitrary prior points in the debate. (A toy sketch of the resulting game loop is given after the figure below.)

De­bate with cross-ex­am­i­na­tion and trans­parency tools where the red ar­row is an ex­am­ple of Alice_2 us­ing trans­parency tools on Bob_1.
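To make the overall setup concrete, here is a toy Python sketch of the debate game with cross-examination and transparency-tool moves, as referenced in the list above. All of the classes and functions (Debater, human_judge, inspect_opponent, and so on) are hypothetical placeholders rather than any real implementation; the sketch only shows the shape of the game: alternating arguments, cross-examination of frozen earlier copies of the opponent, transparency-tool inspections, and a final human judgment.

```python
# A toy sketch of the debate protocol with cross-examination and transparency
# tools. Every name here is a hypothetical placeholder.

import copy
import random
from typing import List


class Debater:
    def __init__(self, name: str):
        self.name = name

    def argue(self, question: str, transcript: List[str]) -> str:
        # Placeholder for a trained policy producing the next argument.
        return f"{self.name}: argument #{len(transcript) + 1} about {question!r}"

    def answer_cross_examination(self, query: str) -> str:
        # A frozen earlier copy answers; the answer is visible to the judge.
        return f"{self.name} (frozen copy) answers: {query!r}"

    def inspect_opponent(self, opponent: "Debater") -> str:
        # Placeholder for applying transparency tools to the opponent's
        # internals and reporting any evidence of problematic cognition.
        return f"{self.name}: transparency report on {opponent.name} (nothing found)"


def human_judge(transcript: List[str]) -> str:
    # Placeholder for the human judgment; here it is just random.
    return random.choice(["Alice", "Bob"])


def debate(question: str, rounds: int = 3) -> str:
    alice, bob = Debater("Alice"), Debater("Bob")
    transcript: List[str] = []
    frozen: List[Debater] = []   # earlier copies, available for cross-examination
    for _ in range(rounds):
        for current, opponent in ((alice, bob), (bob, alice)):
            frozen.append(copy.deepcopy(current))   # freeze this move's debater
            transcript.append(current.argue(question, transcript))
            if len(frozen) >= 2:
                # Cross-examine a frozen copy of the opponent from earlier in the debate.
                transcript.append(frozen[-2].answer_cross_examination(
                    f"{current.name} asks about your earlier claim"))
            # Transparency-tool move: inspect the opponent for problematic cognition.
            transcript.append(current.inspect_opponent(opponent))
    return human_judge(transcript)


if __name__ == "__main__":
    print(debate("Which policy should we adopt?"))
```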

Outer al­ign­ment. Outer al­ign­ment for de­bate de­pends on whether giv­ing hon­est, co­op­er­a­tive, helpful ad­vice is the strat­egy with the great­est prob­a­bil­ity of win­ning. Since the first de­bater gets to choose to defend what­ever po­si­tion they want, in the limit they’ll choose to defend what­ever po­si­tion gives them the great­est prob­a­bil­ity of win­ning the de­bate. Thus, if hon­est, co­op­er­a­tive, helpful strate­gies dom­i­nate, then the first player will always pur­sue them in the limit, mean­ing that al­igned an­swers can be pro­duced at test/​de­ploy­ment time sim­ply by elic­it­ing the first de­bater’s po­si­tion. Whether hon­esty and helpful­ness ac­tu­ally dom­i­nate in the limit is a very open ques­tion, how­ever, and one that de­pends heav­ily on many fac­tors such as the set of ques­tions be­ing asked, the effec­tive­ness of the hu­man judges, the length of the de­bate, etc. For ex­am­ple, if there is sig­nifi­cant vari­a­tion in how difficult differ­ent po­si­tions are to defend that is in­de­pen­dent of their truth­ful­ness (per­haps be­cause ar­gu­ments with more mov­ing pieces that de­pend on more things are harder to defend re­gard­less of how true they are) then it seems likely that there would be in­stances where the truth wouldn’t dom­i­nate. That be­ing said, care­ful mechanism de­sign such as the use of cross-ex­am­i­na­tion could po­ten­tially help miti­gate such is­sues.

In­ner al­ign­ment. The ques­tion of in­ner al­ign­ment for de­bate with trans­parency tools cen­ters on the effec­tive­ness of the de­baters in us­ing the trans­parency tools to over­see each other. The goal is to di­rectly in­cen­tivize the de­baters to look for de­cep­tive or oth­er­wise catas­trophic be­hav­ior in each other us­ing the available tools and pre­sent that in­for­ma­tion to the judge to win the de­bate. Whether such an ap­proach is ac­tu­ally likely to suc­ceed is de­pen­dent on a cou­ple of differ­ent fac­tors, how­ever.

First, it’s un­clear what sort of con­di­tion could be en­forced in a de­bate setup that would suc­cess­fully rule out de­cep­tive al­ign­ment while still al­low­ing for the nec­es­sary de­cep­tion of the sec­ond de­bater, who is forced to defend the sec­ond-most prob­a­ble op­tion, and thus make ar­gu­ments that they don’t ac­tu­ally be­lieve in. Per-epi­sode my­opia could po­ten­tially pre­sent such a con­di­tion, though it might be a difficult con­di­tion to en­force.

Se­cond, get­ting a good train­ing sig­nal for in­cen­tiviz­ing such trans­parency tool use could be quite difficult. The de­baters need a way of learn­ing how to in­spect other mod­els, a way of learn­ing that do­ing so is a way to win the de­bate, and a way of learn­ing the spe­cific bad be­hav­ior that they should be look­ing for. This gets es­pe­cially tricky if the con­di­tion be­ing en­forced is just the ab­sence of de­cep­tion, as get­ting train­ing ex­am­ples of catch­ing de­cep­tion would seem to re­quire pro­duc­ing de­cep­tive mod­els, which is the very thing you’re try­ing to avoid do­ing in the first place.

Train­ing com­pet­i­tive­ness. Train­ing com­pet­i­tive­ness for de­bate de­pends on the ex­tent to which the de­bate game is a com­pute-effi­cient, sam­ple-effi­cient way of train­ing an ad­vanced AI sys­tem. Since the de­bate train­ing pro­cess is fun­da­men­tally re­in­force­ment learn­ing via self-play on a zero-sum game, the his­tor­i­cal suc­cess of such meth­ods on en­vi­ron­ments such as Go, Dota 2, or StarCraft sug­gests that such self-play might be fairly com­pet­i­tive as a gen­eral train­ing pro­ce­dure. That be­ing said, in com­par­i­son to the given en­vi­ron­ments, de­bate re­quires a hu­man in the loop, sig­nifi­cantly re­duc­ing the amount of train­ing data that can be col­lected and thus re­quiring sig­nifi­cantly bet­ter sam­ple effi­ciency to train suc­cess­fully. Such sam­ple effi­ciency con­cerns could po­ten­tially be re­solved via train­ing a su­per­vised model to pre­dict hu­man judg­ing be­hav­ior and then train­ing against that, though that could raise many of its own ad­di­tional safety and/​or com­pet­i­tive­ness con­cerns.
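As a rough illustration of that last idea, the sketch below shows the general shape of collecting a limited number of real human judgments, fitting a cheap predictive model of the judge on them, and then using the predicted judge for most self-play episodes while only occasionally querying the real human. The names and the deliberately trivial judge model (a majority-vote lookup) are hypothetical placeholders, not a claim about how such a predictor would actually be built.

```python
# A minimal sketch of training against a learned model of the human judge to
# improve sample efficiency. All names here are hypothetical placeholders.

import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Transcript = Tuple[str, ...]   # a debate transcript, abstracted as a tuple of moves


def real_human_judge(transcript: Transcript) -> str:
    # Stand-in for an expensive real human judgment.
    return random.choice(["first_debater", "second_debater"])


class PredictedJudge:
    """A toy supervised model of the human judge: majority vote per transcript."""

    def __init__(self) -> None:
        self.votes: Dict[Transcript, Counter] = defaultdict(Counter)

    def fit(self, data: List[Tuple[Transcript, str]]) -> None:
        for transcript, verdict in data:
            self.votes[transcript][verdict] += 1

    def judge(self, transcript: Transcript) -> str:
        if self.votes[transcript]:
            return self.votes[transcript].most_common(1)[0][0]
        return random.choice(["first_debater", "second_debater"])


if __name__ == "__main__":
    # Phase 1: gather a small set of real human judgments.
    labelled = [(("move A", "move B"), real_human_judge(("move A", "move B")))
                for _ in range(10)]
    judge_model = PredictedJudge()
    judge_model.fit(labelled)

    # Phase 2: most self-play episodes are judged by the predicted judge, with
    # an occasional real human call to keep the predictor honest.
    tallies = Counter()
    for episode in range(100):
        transcript = ("move A", "move B")
        verdict = (real_human_judge(transcript) if episode % 25 == 0
                   else judge_model.judge(transcript))
        tallies[verdict] += 1
    print(tallies)
```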

Perfor­mance com­pet­i­tive­ness. It has been proven that de­bate with cross-ex­am­i­na­tion has the prop­erty that ar­bi­trar­ily strong de­baters can con­vince a polyno­mial-time judge of the an­swer to any prob­lem in NEXP. Thus, there is a strong the­o­ret­i­cal ar­gu­ment for the perfor­mance com­pet­i­tive­ness of de­bate—in the limit, de­bate has the abil­ity to solve ba­si­cally any in­ter­est­ing prob­lem that can be solved. While it can be shown that ap­proaches such as ap­proval-based am­plifi­ca­tion and re­cur­sive re­ward mod­el­ing can similarly ac­cess NP, the abil­ity of de­bate to go sig­nifi­cantly be­yond NP all the way to NEXP sug­gests that de­bate might be sig­nifi­cantly more perfor­mance com­pet­i­tive than those ap­proaches. Of course, such the­o­ret­i­cal anal­y­sis doesn’t nec­es­sar­ily ap­ply in prac­tice—in re­al­ity, even in the limit of train­ing, no model can ever ac­tu­ally be ar­bi­trar­ily strong, so the prac­ti­cal differ­ence be­tween ac­cess­ing NP and ac­cess­ing NEXP might be very min­i­mal.

10. Am­plifi­ca­tion with aux­iliary RL ob­jec­tive + re­laxed ad­ver­sar­ial training

The approach here is to train a model M on a joint objective composed of three different pieces. First, we’ll train M to imitate Amp(M)—that is, the basic imitative amplification loss.

The ba­sic imi­ta­tive am­plifi­ca­tion setup.

Second, we’ll also train M so as to cause Amp(M) to perform well on some RL objective in some environment. The idea here is to treat M’s answers to H’s questions as actions that yield some reward determined by H’s performance in the environment with access to those answers. Thus, we are training M to maximize a combination of the standard imitative loss and the RL reward obtained by Amp(M).

Imitative amplification with an auxiliary RL objective, where the white arrows indicate Amp(M) acting in the environment and the yellow arrows indicate the RL loss for training M to maximize the reward obtained by Amp(M).

Finally, we can add relaxed adversarial training by training M to decrease the probability of catastrophe as assessed by Amp(M) trying to find adversarial attacks.

Imi­ta­tive am­plifi­ca­tion with an aux­iliary RL ob­jec­tive plus re­laxed ad­ver­sar­ial train­ing where the red ar­rows in­di­cate the over­sight loss.

As a con­crete ex­am­ple of this ap­proach where the RL en­vi­ron­ment in ques­tion is the AI safety via de­bate game, see “Syn­the­siz­ing am­plifi­ca­tion and de­bate.”
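As a rough sketch of what such a joint objective might look like, the following Python stub combines the three pieces (the imitative loss, the auxiliary RL loss, and the oversight loss) under one possible annealing schedule. All of the functions and weights here are hypothetical placeholders; in particular, the specific schedule (annealing the RL term away while keeping oversight throughout) is just one of the options discussed in the outer alignment analysis below.

```python
# A schematic sketch of the three-part objective for approach 10. Every
# function below is a hypothetical placeholder, not a real loss implementation.


def imitative_loss(model_params, batch) -> float:
    return 0.0   # placeholder: distance between M's answers and Amp(M)'s answers


def rl_loss(model_params, episode) -> float:
    return 0.0   # placeholder: negative reward obtained by Amp(M) in the environment


def oversight_loss(model_params) -> float:
    return 0.0   # placeholder: Amp(M)'s assessed probability of catastrophe


def joint_loss(model_params, batch, episode, step: int, total_steps: int) -> float:
    # One possible annealing schedule: start with a strong RL term for a richer
    # training signal, then anneal towards the pure imitative loss so that the
    # limiting objective is imitation of Amp(M) (i.e. HCH in the limit).
    progress = step / total_steps
    w_imitative = progress        # grows towards 1
    w_rl = 1.0 - progress         # annealed away
    w_oversight = 1.0             # kept throughout
    return (w_imitative * imitative_loss(model_params, batch)
            + w_rl * rl_loss(model_params, episode)
            + w_oversight * oversight_loss(model_params))
```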

Outer al­ign­ment. An in­ter­est­ing prop­erty of this ap­proach is that its outer al­ign­ment prop­er­ties can be con­trol­led by de­ter­min­ing how to an­neal the var­i­ous differ­ent com­po­nents of the loss in the limit. An­neal­ing to­wards the imi­ta­tive loss yields HCH as the limit—giv­ing this ap­proach the same outer al­ign­ment prop­er­ties as imi­ta­tive am­plifi­ca­tion—whereas an­neal­ing to­wards the RL loss yields what­ever the dom­i­nant be­hav­ior in the given RL en­vi­ron­ment is as the limit. Given that HCH is gen­er­ally sig­nifi­cantly eas­ier to un­der­stand than RL limit­ing be­hav­ior, it seems likely that in the ma­jor­ity of cases you will want to an­neal to­wards the imi­ta­tive loss to get the best outer al­ign­ment prop­er­ties. That be­ing said, there are other op­tions—for ex­am­ple, you could use the imi­ta­tive loss as a con­straint such that the model is trained to do well on the RL ob­jec­tive with­out de­vi­at­ing from HCH by more than some fixed amount. Whether such an ap­proach is outer al­igned or not is likely to de­pend heav­ily on the spe­cific met­ric used to mea­sure the dis­tance from HCH as well as the ro­bust­ness of HCH to small de­vi­a­tions ac­cord­ing to that met­ric.
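For the constraint-style variant mentioned above, one simple way to sketch it is as a penalized objective: maximize the RL reward, but pay a penalty whenever the model’s measured distance from HCH exceeds some fixed budget. The distance metric, the budget epsilon, and the penalty strength are all hypothetical knobs here; choosing the metric is exactly the open question raised above.

```python
# A sketch of "imitative loss as a constraint": do well on the RL objective
# without deviating from HCH by more than epsilon. All names are hypothetical.


def rl_reward(model_params, episode) -> float:
    return 0.0   # placeholder: reward obtained by Amp(M) in the environment


def distance_from_hch(model_params, batch) -> float:
    return 0.0   # placeholder: some divergence between M's answers and HCH's answers


def constrained_objective(model_params, batch, episode,
                          epsilon: float = 0.1, penalty: float = 100.0) -> float:
    # Maximize RL reward, but pay a large hinge penalty whenever the measured
    # deviation from HCH exceeds the allowed budget epsilon.
    violation = max(0.0, distance_from_hch(model_params, batch) - epsilon)
    return rl_reward(model_params, episode) - penalty * violation
```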

Inner alignment. We’re using relaxed adversarial training here, which means that—as with all the other relaxed adversarial training approaches—inner alignment is likely to come down to whether or not the overseer is able to produce a consistent gradient away from deception. As always, that could play out quite differently than with other approaches, since both the model helping with oversight and the model being overseen are trained via a different process here. Compared to imitative amplification, the addition of an RL objective could help significantly if performance in the RL environment is highly correlated with oversight effectiveness. Furthermore, the fact that the model is being trained to maximize the performance of the human with access to the model means that the model is forced to learn how to present its information to a human to get reward, which could make it significantly more transparent and help with oversight. On the other hand, explicitly training the model to produce answers which change a human’s behavior could also encourage the model to trick the human, which could make deception more likely and thus oversight harder. Additionally, training on an RL objective could make it significantly harder to do myopia verification, as optimizing even partially for a non-myopic RL loss could act as a strong force pushing the model away from myopia.

Train­ing com­pet­i­tive­ness. Com­pared to imi­ta­tive am­plifi­ca­tion, the ad­di­tion of an RL loss could sig­nifi­cantly im­prove train­ing com­pet­i­tive­ness by pro­duc­ing a much richer re­ward sig­nal, though at the cost of hav­ing to run the RL en­vi­ron­ment. Fur­ther­more, the ad­di­tion of the imi­ta­tive loss could even im­prove train­ing com­pet­i­tive­ness rel­a­tive to pure RL by pro­vid­ing a less sparse re­ward. Ad­di­tion­ally, similarly to re­cur­sive re­ward mod­el­ing, in­cor­po­rat­ing an RL loss en­ables di­rectly train­ing the model to solve long-term tasks in­volv­ing difficult credit as­sign­ment prob­lems that might be hard for imi­ta­tive am­plifi­ca­tion alone to han­dle (though similarly to re­cur­sive re­ward mod­el­ing this trades off with the po­ten­tial safety benefits of my­opia).

That be­ing said, it is un­clear what hap­pens to that be­hav­ior if the RL loss is an­nealed away—ideally, if it is pos­si­ble for HCH to pro­duce the be­hav­ior, then hope­fully the model will con­verge on that, though that re­quires the op­ti­mal RL be­hav­ior to be close enough to HCH that first train­ing on the RL loss and then train­ing on the imi­ta­tive loss ac­tu­ally helps with the imi­ta­tion task. For that to be the case, suc­cess on the RL task likely needs to be highly cor­re­lated with good HCH imi­ta­tion, for which lan­guage mod­el­ing and hu­man ap­proval max­i­miza­tion tasks (such as the AI safety via de­bate game as in “Syn­the­siz­ing am­plifi­ca­tion and de­bate”) could be good can­di­dates.

Perfor­mance com­pet­i­tive­ness. As with outer al­ign­ment, perfor­mance com­pet­i­tive­ness is likely to de­pend heav­ily on ex­actly how the differ­ent losses are an­nealed in the limit. An­neal­ing to­wards the imi­ta­tive loss, as was sug­gested pre­vi­ously, makes the limit HCH, giv­ing this ap­proach the same perfor­mance com­pet­i­tive­ness prop­er­ties as imi­ta­tive am­plifi­ca­tion. As was men­tioned pre­vi­ously, how­ever, there are other op­tions, such as us­ing the imi­ta­tive loss as a con­straint such that the model is trained to max­i­mize the RL re­ward with­out de­vi­at­ing too far from HCH. Such a con­straint could help perfor­mance com­pet­i­tive­ness com­pared to train­ing purely on the RL loss—if the re­sult­ing model is sig­nifi­cantly bet­ter at in­ter­act­ing with hu­mans, for ex­am­ple—or hin­der perfor­mance com­pet­i­tive­ness—if it’s very difficult to do well on the RL task while stay­ing close to HCH.

11. Am­plifi­ca­tion alongside RL + re­laxed ad­ver­sar­ial training

In this approach, the goal is to simultaneously train two models, a question-answering system M and an RL agent M_RL.

We’ll train M_RL on the joint objective of maximizing the RL reward and decreasing the probability of deceptive or otherwise catastrophic behavior as assessed by Amp(M). Note that while we’re using a sort of relaxed adversarial training loss here, the overseer is using a totally different model (M) than the model being overseen (M_RL).

The training process for M_RL in amplification alongside RL, where the white arrow indicates M_RL’s actions in the environment, the yellow arrow indicates M_RL’s RL loss, and the red arrows indicate the oversight loss for M_RL.

Then, we’ll train M on the following joint objective.

First, we want to include the standard imitative amplification loss and train M to imitate Amp(M).

Amplification alongside RL, where M is trained only via an imitative loss.

Second, we want to train M to cause Amp(M) to predict M_RL’s actions and the RL environment’s responses well. The idea here is to train M to tell the human everything the human needs to know to understand what M_RL is doing and how to replicate M_RL’s behavior so that the human with access to the model can match M_RL’s performance in the RL environment.

Amplification alongside RL, where the white arrows indicate the information that Amp(M) attempts to predict in the RL environment and the orange arrows indicate the prediction loss for training M to cause Amp(M) to predict M_RL well.

Third, we can add standard relaxed adversarial training and also train M to decrease the probability of deceptive or otherwise catastrophic behavior as assessed by Amp(M).

Am­plifi­ca­tion alongside RL + re­laxed ad­ver­sar­ial train­ing where the red ar­rows in­di­cate the over­sight loss.

Finally, once we’ve trained both M and M_RL via the above losses, a critical component of this approach is that we then do away with M_RL completely and exclusively deploy the resulting M.
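Putting the pieces together, here is a high-level Python sketch of the overall training setup, with every loss term left as a hypothetical placeholder: M_RL is updated on RL reward plus an oversight penalty assessed by Amp(M); M is updated on the imitative loss, the loss for helping Amp(M) predict M_RL and the environment, and its own oversight penalty; and at the end M_RL is discarded and only M is returned for deployment.

```python
# A high-level sketch of the "amplification alongside RL" training setup.
# Every function and loss term here is a hypothetical placeholder.


def rl_step(m_rl_params, m_params):
    """One update to the RL agent M_RL."""
    reward = 0.0      # placeholder: reward from M_RL acting in the environment
    oversight = 0.0   # placeholder: Amp(M)'s assessed probability of catastrophe for M_RL
    loss = -reward + oversight   # the loss a real implementation would step on
    return m_rl_params           # placeholder: would apply a gradient step on `loss`


def qa_step(m_params, m_rl_params):
    """One update to the question-answering model M."""
    imitative = 0.0    # placeholder: loss for imitating Amp(M)
    prediction = 0.0   # placeholder: loss for Amp(M) failing to predict M_RL / the environment
    oversight = 0.0    # placeholder: Amp(M)'s assessed probability of catastrophe for M
    loss = imitative + prediction + oversight
    return m_params    # placeholder: would apply a gradient step on `loss`


def train(num_steps: int = 1000):
    m_params, m_rl_params = {}, {}
    for _ in range(num_steps):
        m_rl_params = rl_step(m_rl_params, m_params)
        m_params = qa_step(m_params, m_rl_params)
    # Critically, M_RL is thrown away at the end; only M is deployed.
    del m_rl_params
    return m_params
```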

Outer alignment. Since we only deploy M, not M_RL, the question of outer alignment here should be entirely dependent on whether this procedure produces an outer aligned M. Furthermore, similarly to the previous approach, we can anneal the combination of losses here to get the desired limiting behavior and thus control our outer alignment properties. In particular, if we anneal towards the imitative loss, this approach can be made to limit to HCH, giving it the same outer alignment properties as imitative amplification. Furthermore, even if we don’t anneal towards the imitative loss, as long as it is possible for HCH to fully predict both M_RL and the RL environment in the limit and the imitative loss isn’t annealed away completely, then at minimal loss we should still end up with HCH.

Inner alignment. Inner alignment is particularly tricky for this approach. First, there’s the standard relaxed adversarial training question of whether Amp(M) will be able to properly oversee M. Compared to the previous approach, the fact that the auxiliary reward is purely predictive rather than a full RL loss should help us here, as it means that our loss is entirely myopic and thus we should be able to do myopia verification. That being said, if M_RL is sufficiently non-myopic, having M predict M_RL while still staying myopic itself could be a challenge.

Additionally, there’s also the problem of M_RL itself—while we throw away M_RL at the end, there are still likely to be opportunities for a sufficiently competent and deceptive M_RL to cause serious issues. For example, M_RL could choose its actions to influence M’s training in a particular way—or just break out of its sandboxing and directly take bad actions in the world. Thus, we need the oversight of M_RL by Amp(M) to at least work sufficiently well to prevent such issues. For that to happen, it needs to be the case that Amp(M) with access to transparency tools can predict when/if M_RL will ever produce such catastrophic behavior. As a result, the fact that we’re explicitly training M to cause Amp(M) to be able to predict M_RL could be quite helpful here, as it should hopefully make Amp(M) good enough at predicting M_RL to foresee M_RL becoming deceptive or producing any such bad behavior.

Training competitiveness. Compared to other amplification approaches, this approach could be significantly more training competitive, as it enables us to just train an RL model with minimal restrictions. On the other hand, it also requires us to jointly train both an RL model and an amplification model, which could make it significantly less competitive, especially if it’s difficult to successfully train the amplification model to predict the RL model. That being said, such training can be viewed as effectively just model distillation, which is generally fairly doable even for current models. The inclusion of the imitative loss will change things, however, especially if it is the loss that gets annealed towards in the limit. In such a situation, the question becomes how useful the auxiliary RL prediction task is in helping with the core imitation task, which—similarly to the previous approach—could be very dependent on the specific RL environment used and how related performance on that RL environment is to the imitation task.

Performance competitiveness. As with outer alignment, annealing towards the imitative loss should produce HCH as the limit, giving this approach the same performance competitiveness properties as imitative amplification. Furthermore—again as with outer alignment—even if the imitative loss isn’t annealed away completely, as long as HCH can predict M_RL in the limit, you should still get HCH at minimal loss.


  1. Note that the or­der here is cho­sen purely for ped­a­gog­i­cal rea­sons and is not meant to im­ply any sort of mea­sure of im­por­tance. ↩︎

  2. Note that in practice Amp(M) doesn’t need to always be computed with an actual human. H can simply be replaced with M some fixed fraction of the time, for example—or more often when M is confident and less often when M is uncertain—and the resulting procedure is effectively the same. See “A concrete proposal for adversarial IDA” for an example of how something like that could work. ↩︎

  3. Imi­ta­tive am­plifi­ca­tion leads to HCH in the limit of train­ing a perfect pre­dic­tor over all pos­si­ble in­puts—that is, an in­finite amount of train­ing data, an in­finitely large model, and an in­finitely large amount of train­ing time. ↩︎

  4. En­forc­ing my­opia for am­plifi­ca­tion could also have the side benefit of im­prov­ing com­pet­i­tive­ness, as a non-my­opic am­plifi­ca­tion model might try to make its out­put sim­pler and thus less use­ful if it is try­ing to make a hu­man con­sult­ing it eas­ier to pre­dict. ↩︎

  5. Fur­ther ex­plo­ra­tion of my­opia ver­ifi­ca­tion will also be forth­com­ing in a fu­ture post. ↩︎

  6. See Paul Chris­ti­ano’s “Effi­cient feed­back” for some ex­am­ples of pos­si­ble sam­ple-effi­ciency-im­prov­ing tech­niques. ↩︎

  7. That is, it needs to be the case that in the limit there ex­ists a model that perfectly imi­tates HCH in such a way that an am­plified over­seer will be con­vinced that the imi­ta­tor is non-catas­trophic. ↩︎

  8. This sort of failure mode could be par­tic­u­larly per­ni­cious if the de­cep­tive model is ca­pa­ble of gra­di­ent hack­ing. ↩︎

  9. Ver­ify­ing that the given tree is, in fact, the limit of ap­proval-based am­plifi­ca­tion is a straight­for­ward ex­er­cise of ex­pand­ing the train­ing pro­ce­dure out over time. Note that the mod­els at the top of the tree are the most re­cent mod­els and the mod­els at the bot­tom of the tree are the old­est. ↩︎

  10. It is worth not­ing that there is also the pos­si­bil­ity of re­cov­er­ing HCH as the limit of ap­proval-based am­plifi­ca­tion if the ap­proval sig­nal is im­ple­mented as a hu­man-com­puted dis­tance met­ric rel­a­tive to what the hu­man would have done be­fore see­ing the model’s an­swer. ↩︎

  11. In par­tic­u­lar, my­opia ver­ifi­ca­tion for ap­proval-based am­plifi­ca­tion could be used to pro­duce ap­proval-di­rected agents. ↩︎

  12. The prob­lem of “for­ward­ing” safety guaran­tees re­gard­ing a model perform­ing meta-learn­ing to any other mod­els found by that meta-learn­ing pro­ce­dure is a gen­eral prob­lem that oc­curs in all in­ner al­ign­ment schemes, though it is par­tic­u­larly per­ni­cious in this situ­a­tion. For a more de­tailed dis­cus­sion of this prob­lem, see the “Meta-learn­ing” sec­tion in “Re­laxed ad­ver­sar­ial train­ing for in­ner al­ign­ment.” ↩︎

  13. “Scalable agent alignment via reward modeling: a research direction” notes that, while they initially assume that each agent is completely separate, “While this kind of sequential training is conceptually clearer, in practice it might make more sense to train all of these agents jointly to ensure that they are being trained on the right distribution. Moreover, all of these agents may share model parameters or even be copies of the same agent instantiated as different players in an adversarial game.” Thus, while the different agents are presented here as different instances of the same model—which is a type of recursive reward modeling—it is worth noting that recursive reward modeling also includes other possibilities such as using completely different models for the different agents. ↩︎

  14. “Scalable agent alignment via reward modeling: a research direction” mentions the possibility of such oversight, though does not include it as part of the base proposal as is done here, noting that “When using recursive reward modeling users have the option to provide feedback on the cognitive process that produced outcomes, but they are not required to do so. Moreover, this feedback might be difficult to provide in practice if the policy model is not very interpretable.” ↩︎