Iterated Distillation and Amplification

This is a guest post sum­ma­riz­ing Paul Chris­ti­ano’s pro­posed scheme for train­ing ma­chine learn­ing sys­tems that can be ro­bustly al­igned to com­plex and fuzzy val­ues, which I call Iter­ated Distil­la­tion and Am­plifi­ca­tion (IDA) here. IDA is no­tably similar to AlphaGoZero and ex­pert iter­a­tion.

The hope is that if we use IDA to train each learned component of an AI, then the overall AI will remain aligned with the user’s interests while achieving state-of-the-art performance at runtime, provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance. This document gives a high-level outline of IDA.

Mo­ti­va­tion: The al­ign­ment/​ca­pa­bil­ities tradeoff

Assume that we want to train a learner A to perform some complex fuzzy task, e.g. “Be a good personal assistant.” Assume that A is capable of learning to perform the task at a superhuman level; that is, if we could perfectly specify a “personal assistant” objective function and trained A to maximize it, then A would become a far better personal assistant than any human.

There is a spec­trum of pos­si­bil­ities for how we might train A to do this task. On one end, there are tech­niques which al­low the learner to dis­cover pow­er­ful, novel poli­cies that im­prove upon hu­man ca­pa­bil­ities:

  • Broad reinforcement learning: As A takes actions in the world, we give it a relatively sparse reward signal based on how satisfied or dissatisfied we are with the eventual consequences. We then allow A to optimize for the expected sum of its future rewards.

  • Broad in­verse re­in­force­ment learn­ing: A at­tempts to in­fer our deep long-term val­ues from our ac­tions, per­haps us­ing a so­phis­ti­cated model of hu­man psy­chol­ogy and ir­ra­tional­ity to se­lect which of many pos­si­ble ex­trap­o­la­tions is cor­rect.

How­ever, it is difficult to spec­ify a broad ob­jec­tive that cap­tures ev­ery­thing we care about, so in prac­tice A will be op­ti­miz­ing for some proxy that is not com­pletely al­igned with our in­ter­ests. Even if this proxy ob­jec­tive is “al­most” right, its op­ti­mum could be dis­as­trous ac­cord­ing to our true val­ues.

On the other end, there are tech­niques that try to nar­rowly em­u­late hu­man judg­ments:

  • Imitation learning: We could train A to exactly mimic how an expert would do the task, e.g. by training it to fool a discriminative model trying to tell apart A’s actions from the human expert’s actions.

  • Nar­row in­verse re­in­force­ment learn­ing: We could train A to in­fer our near-term in­stru­men­tal val­ues from our ac­tions, with the pre­sump­tion that our ac­tions are roughly op­ti­mal ac­cord­ing to those val­ues.

  • Nar­row re­in­force­ment learn­ing: As A takes ac­tions in the world, we give it a dense re­ward sig­nal based on how rea­son­able we judge its choices are (per­haps we di­rectly re­ward state-ac­tion pairs them­selves rather than out­comes in the world, as in TAMER). A op­ti­mizes for the ex­pected sum of its fu­ture re­wards.

Us­ing these tech­niques, the risk of mis­al­ign­ment is re­duced sig­nifi­cantly (though not elimi­nated) by re­strict­ing agents to the range of known hu­man be­hav­ior — but this in­tro­duces se­vere limi­ta­tions on ca­pa­bil­ity. This trade­off be­tween al­low­ing for novel ca­pa­bil­ities and re­duc­ing mis­al­ign­ment risk ap­plies across differ­ent learn­ing schemes (with imi­ta­tion learn­ing gen­er­ally be­ing nar­row­est and low­est risk) as well as within a sin­gle scheme.

The mo­ti­vat­ing prob­lem that IDA at­tempts to solve: if we are only able to al­ign agents that nar­rowly repli­cate hu­man be­hav­ior, how can we build an AGI that is both al­igned and ul­ti­mately much more ca­pa­ble than the best hu­mans?

Core con­cept: Anal­ogy to AlphaGoZero

The core idea of Paul’s scheme is similar to AlphaGoZero (AGZ): We use a learned model many times as a sub­rou­tine in a more pow­er­ful de­ci­sion-mak­ing pro­cess, and then re-train the model to imi­tate those bet­ter de­ci­sions.

AGZ’s policy network p is the learned model. At each iteration, AGZ selects moves by an expensive Monte Carlo Tree Search (MCTS) which uses policy p as its prior; p is then trained to directly predict the distribution of moves that MCTS ultimately settles on. In the next iteration, MCTS is run using the new, more accurate p, and p is trained to predict the eventual outcome of that process, and so on. After enough iterations, a fixed point is reached: p is unable to learn how running MCTS will change its current probabilities.

MCTS is an am­plifi­ca­tion of p — it uses p as a sub­rou­tine in a larger pro­cess that ul­ti­mately makes bet­ter moves than p alone could. In turn, p is a dis­til­la­tion of MCTS: it learns to di­rectly guess the re­sults of run­ning MCTS, achiev­ing com­pa­rable perfor­mance while short-cut­ting the ex­pen­sive com­pu­ta­tion. The idea of IDA is to use the ba­sic iter­ated dis­til­la­tion and am­plifi­ca­tion pro­ce­dure in a much more gen­eral do­main.
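The amplify-then-distill loop can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the "game" is a single choice among three moves with fixed payoffs, `amplify` is a simple reweighting operator standing in for MCTS (it is not MCTS), and `distill` copies the search output exactly instead of training a network.

```python
import math
from typing import List


def amplify(prior: List[float], rewards: List[float]) -> List[float]:
    """Toy stand-in for MCTS: use the policy as a prior, reweight each move
    by exp(reward), and renormalize. The result favors good moves more
    strongly than the prior did."""
    weights = [p * math.exp(r) for p, r in zip(prior, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def distill(search_distribution: List[float]) -> List[float]:
    """Toy distillation: the new policy imitates the search output exactly.
    A real system would fit a network to this target instead."""
    return list(search_distribution)


def train(rewards: List[float], iterations: int) -> List[float]:
    """Alternate amplification and distillation, starting from a uniform policy."""
    policy = [1.0 / len(rewards)] * len(rewards)
    for _ in range(iterations):
        policy = distill(amplify(policy, rewards))
    return policy


# Hypothetical payoffs for three moves; the third is best.
policy = train([0.1, 0.5, 0.9], iterations=50)
```

In this toy, the policy drifts toward putting nearly all of its probability on the best move, at which point amplification barely changes it. That is the analogue of the fixed point described above.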

The IDA Scheme

IDA in­volves re­peat­edly im­prov­ing a learned model through an am­plifi­ca­tion and dis­til­la­tion pro­cess over mul­ti­ple iter­a­tions.

Am­plifi­ca­tion is in­ter­ac­tive and hu­man-di­rected in IDA

In AGZ, the am­plifi­ca­tion pro­ce­dure is Monte Carlo Tree Search — it’s a sim­ple and well-un­der­stood al­gorithm, and there’s a clear mechanism for how it im­proves on the policy net­work’s origi­nal choices (it tra­verses the game tree more deeply). But in IDA, am­plifi­ca­tion is not nec­es­sar­ily a fixed al­gorithm that can be writ­ten down once and re­peat­edly ap­plied; it’s an in­ter­ac­tive pro­cess di­rected by hu­man de­ci­sions.

In most do­mains, hu­mans are ca­pa­ble of im­prov­ing their na­tive ca­pa­bil­ities by del­e­gat­ing to as­sis­tants (e.g. be­cause CEOs can del­e­gate tasks to a large team, they can pro­duce or­ders of mag­ni­tude more out­put per day than they could on their own). This means if our learn­ing pro­ce­dure can cre­ate an ad­e­quate helper for the hu­man, the hu­man can use the AI to am­plify their abil­ity — this hu­man/​AI sys­tem may be ca­pa­ble of do­ing things that the hu­man couldn’t man­age on their own.

Below I consider the example of using IDA to build a superhuman personal assistant. Let A[t] refer to the state of the learned model after the end of iteration t; the initial agent A[0] is trained by a human overseer H.

Ex­am­ple: Build­ing a su­per­hu­man per­sonal assistant

H trains A[0] using a technique from the narrow end of the spectrum, such as imitation learning. Here we are imagining a much more powerful version of “imitation learning” than current systems are actually capable of: we assume that A[0] can acquire nearly human-level capabilities through this process. That is, the trained A[0] model executes all the tasks of a personal assistant as H would (including comprehending English instructions, writing emails, putting together a meeting schedule, etc.).

Even though A[0] can­not dis­cover any novel ca­pa­bil­ities, it has two key ad­van­tages over H: it can run much faster, and many copies or ver­sions of it can be run at once. We hope to lev­er­age these ad­van­tages to con­struct a larger sys­tem — in­volv­ing H and many copies of A[0] — that will sub­stan­tially im­prove on H’s ca­pa­bil­ities while pre­serv­ing al­ign­ment with H’s val­ues.

H can use calls to A[0] (along with other tools such as external memory) to become a better personal assistant. For example, H could assign one copy of A[0] to figure out the best time to schedule the client’s recurring team meetings, another copy to figure out what to order the client for lunch, another copy to balance the client’s personal budget, etc. H now has the ability to get very quick solutions to sub-problems that are roughly as good as the ones H would have come up with on their own over a longer time period, and can combine these results to make much better decisions than an unaided human.

Let Am­plify(H, A[0]) re­fer to the larger sys­tem of H + many copies of A[0] + aids. Com­pared to A[0] alone, the Am­plify(H, A[0]) sys­tem has much higher time and re­source costs but its even­tual de­ci­sions are much bet­ter. More­over, be­cause in each of its in­di­vi­d­ual de­ci­sions each copy of A[0] con­tinues to act just as a hu­man per­sonal as­sis­tant would act, we can hope that Am­plify(H, A[0]) pre­serves al­ign­ment.
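As a rough sketch of what such a system might look like in code, the snippet below has H split a task into sub-problems, hand each one to a separate copy of the assistant running in parallel, and combine the answers. All names here (`assistant_a0`, `decompose`, `combine`) are hypothetical, and the assistant is a placeholder function, not a learned model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def assistant_a0(subtask: str) -> str:
    """Hypothetical stand-in for one copy of A[0]."""
    return f"proposal for {subtask}"


def amplify(
    decompose: Callable[[str], List[str]],
    combine: Callable[[Dict[str, str]], str],
    agent: Callable[[str], str],
) -> Callable[[str], str]:
    """Build the larger system: H's judgment (decompose and combine)
    plus many parallel calls to the agent."""

    def amplified(task: str) -> str:
        subtasks = decompose(task)
        # One copy of the agent per sub-problem, run concurrently.
        with ThreadPoolExecutor() as pool:
            answers = list(pool.map(agent, subtasks))
        return combine(dict(zip(subtasks, answers)))

    return amplified


def decompose(task: str) -> List[str]:
    """H decides which sub-problems matter (hard-coded for illustration)."""
    return ["schedule team meetings", "order lunch", "balance budget"]


def combine(answers: Dict[str, str]) -> str:
    """H merges the sub-answers into one decision."""
    return "; ".join(answers[name] for name in sorted(answers))


plan = amplify(decompose, combine, assistant_a0)("run the client's day")
```

The point of the design is that the agent is only ever asked to do things a human assistant would do; whatever extra capability the system has comes from the way H organizes the calls.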

In the next iter­a­tion of train­ing, the Am­plify(H, A[0]) sys­tem takes over the role of H as the over­seer. A[1] is trained with nar­row and safe tech­niques to quickly re­pro­duce the re­sults of Am­plify(H, A[0]). Be­cause we as­sumed Am­plify(H, A[0]) was al­igned, we can hope that A[1] is also al­igned if it is trained us­ing suffi­ciently nar­row tech­niques which in­tro­duce no new be­hav­iors. A[1] is then used in Am­plify(H, A[1]), which serves as an over­seer to train A[2], and so on.


def IDA(H):
    A ← random initialization
    repeat:
        A ← Distill(Amplify(H, A))

def Distill(overseer):
    """
    Returns an AI trained using narrow, robust techniques to
    perform a task that the overseer already understands how to
    perform.
    """

def Amplify(human, AI):
    """
    Interactive process in which human uses many calls to AI to
    improve on human’s native performance at relevant task(s).
    """
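To make the loop concrete, here is a minimal runnable toy, with several loudly stated assumptions. The task is adding up a list of numbers; the "human" can only add at most two numbers directly; amplification lets the human split a list in half, delegate each half to the current agent, and add the two results; and distillation is a stand-in that copies the overseer's behavior perfectly with no learning involved. Unlike the pseudocode, the first agent here imitates H directly instead of starting from a random initialization. None of these choices come from Paul's proposal; they only show the shape of the iteration.

```python
from typing import Callable, Optional, Sequence

# An agent maps a list of numbers to their sum, or None if the task is beyond it.
Agent = Callable[[Sequence[int]], Optional[int]]


def human(xs: Sequence[int]) -> Optional[int]:
    """Toy H: can add at most two numbers without help."""
    if len(xs) > 2:
        return None
    return sum(xs)


def amplify(h: Agent, agent: Agent) -> Agent:
    """H plus calls to the agent: split the list, delegate both halves, add the results."""

    def overseer(xs: Sequence[int]) -> Optional[int]:
        direct = h(xs)
        if direct is not None:
            return direct
        mid = len(xs) // 2
        left, right = agent(xs[:mid]), agent(xs[mid:])
        if left is None or right is None:
            return None
        return h([left, right])

    return overseer


def distill(overseer: Agent) -> Agent:
    """Stand-in for training: the new agent reproduces the overseer exactly.
    (A real distillation step would train a fast model; this wrapper is not faster.)"""
    return lambda xs: overseer(xs)


def ida(h: Agent, iterations: int) -> Agent:
    agent = distill(h)  # A[0] imitates H directly
    for _ in range(iterations):
        agent = distill(amplify(h, agent))
    return agent


agent = ida(human, iterations=3)
total = agent(list(range(16)))  # 120
```

In this toy, each iteration doubles the list length the agent can handle (2, 4, 8, 16, ...), even though the human's own ability never changes.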

What prop­er­ties must hold for IDA to work?

The IDA scheme is a template with “slots” for Amplify and Distill procedures that have not been fully specified yet; in fact, they rely on capabilities we don’t yet have. Because IDA itself is not fully specified, it’s not clear what minimal set of properties is necessary for it to succeed.

Achiev­ing al­ign­ment and high capability

That said, here are some gen­eral prop­er­ties which seem nec­es­sary — though likely not suffi­cient — for IDA agents to achieve ro­bust al­ign­ment and high ca­pa­bil­ity:

  1. The Distill procedure robustly preserves alignment: Given an aligned agent H, we can use narrow safe learning techniques to train a much faster agent A which behaves as H would have behaved, without introducing any misaligned optimization or losing important aspects of what H values.

  2. The Am­plify pro­ce­dure ro­bustly pre­serves al­ign­ment: Given an al­igned agent A, it is pos­si­ble to spec­ify an am­plifi­ca­tion scheme which calls A mul­ti­ple times as a sub­rou­tine in a way that re­li­ably avoids in­tro­duc­ing mis­al­igned op­ti­miza­tion.

  3. At least some hu­man ex­perts are able to iter­a­tively ap­ply am­plifi­ca­tion to achieve ar­bi­trar­ily high ca­pa­bil­ities at the rele­vant task: a) there is some thresh­old of gen­eral ca­pa­bil­ity such that if some­one is above this thresh­old, they can even­tu­ally solve any prob­lem that an ar­bi­trar­ily in­tel­li­gent sys­tem could solve, pro­vided they can del­e­gate tasks to similarly-in­tel­li­gent as­sis­tants and are given ar­bi­trary amounts of mem­ory and time; b) at least some hu­man ex­perts are above this thresh­old of gen­er­al­ity — given enough time and re­sources, they can figure out how to use AI as­sis­tants and tools to im­prove their ca­pa­bil­ities ar­bi­trar­ily far.

The non-profit Ought is work­ing on gath­er­ing more ev­i­dence about as­sump­tions 2 and 3.

Achiev­ing com­pet­i­tive perfor­mance and efficiency

Paul aims for IDA agents to be com­pet­i­tive with tra­di­tional RL agents in time and re­source costs at run­time — this is a rea­son­able ex­pec­ta­tion be­cause an IDA agent is ul­ti­mately just an­other learned model whose weights were tuned with an un­usual train­ing pro­ce­dure.

Re­source and time cost dur­ing train­ing is a more open ques­tion; I haven’t ex­plored the as­sump­tions that would have to hold for the IDA train­ing pro­cess to be prac­ti­cally fea­si­ble or re­source-com­pet­i­tive with other AI pro­jects.

This was origi­nally posted here.