Preface to the sequence on iterated amplification

This sequence describes iterated amplification, a possible strategy for building an AI that is actually trying to do what we want out of ML systems trained by gradient descent.

Iterated amplification is not intended to be a silver bullet that resolves all of the possible problems with AI; it’s an approach to the particular alignment problem posed by scaled-up versions of modern ML systems.

Iterated amplification is based on a few key hopes:

  • If you have an overseer who is smarter than the agent you are trying to train, you can safely use that overseer’s judgment as an objective.

  • We can train an RL system using very sparse feedback, so it’s OK if that overseer is very computationally expensive.

  • A team of aligned agents may be smarter than any individual agent, while remaining aligned.

If all of these hopes panned out, then at every point in training “a team of the smartest agents we’ve been able to train so far” would be a suitable overseer for training a slightly smarter aligned successor. This could let us train very intelligent agents while preserving alignment (starting the induction from an aligned human).
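
As a rough illustration of the induction described above, here is a toy Python sketch. It is not an implementation of iterated amplification: the `Agent` class, the scalar `skill`, the `amplify`, `distill`, and `iterate` functions, and all of the numeric gains are assumptions invented for this example. The code simply assumes that the three hopes hold; it does not demonstrate them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Agent:
    """Hypothetical stand-in for a trained agent (or the initial human)."""
    skill: float   # made-up scalar measure of capability
    aligned: bool  # whether the agent is trying to do what we want


def amplify(agent: Agent, team_size: int = 4) -> Agent:
    """Form an overseer from a team of copies of `agent`.

    Assumes the third hope: the team is smarter than any individual member
    and remains aligned. The 10%-per-member gain is an arbitrary choice.
    """
    return Agent(skill=agent.skill * (1 + 0.1 * team_size), aligned=agent.aligned)


def distill(overseer: Agent, previous: Agent) -> Agent:
    """Train a successor using the overseer's judgment as the objective.

    Assumes the first two hopes: a smarter overseer is a safe objective, and
    sparse feedback from an expensive overseer is enough to train on. The
    successor closes half the gap to the overseer (again an arbitrary choice),
    so it is slightly smarter than `previous` but weaker than the overseer.
    """
    skill = previous.skill + 0.5 * (overseer.skill - previous.skill)
    return Agent(skill=skill, aligned=overseer.aligned)


def iterate(human: Agent, rounds: int) -> List[Agent]:
    """Run the induction, starting from an (assumed) aligned human."""
    agents = [human]
    for _ in range(rounds):
        overseer = amplify(agents[-1])                 # team of the best agent so far
        agents.append(distill(overseer, agents[-1]))   # slightly smarter successor
    return agents


history = iterate(Agent(skill=1.0, aligned=True), rounds=5)
```

In this toy model every successor is more capable than its predecessor and inherits alignment from its overseer, which is exactly what the hopes would buy us if they panned out. All of the difficulty lies in whether real versions of `amplify` and `distill` can have these properties.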

Iterated amplification is still in a preliminary state and is best understood as a research program rather than a worked-out solution. Nevertheless, I think it is the most concrete existing framework for aligning powerful ML with human interests.

Purpose and audience

The purpose of this sequence is to communicate the basic intuitions motivating iterated amplification, to define iterated amplification, and to present some of the important open questions.

I ex­pect this se­quence to be most use­ful for read­ers who would like to have a some­what de­tailed un­der­stand­ing of iter­ated am­plifi­ca­tion, and are look­ing for some­thing more struc­tured than ai-al­ign­ to help ori­ent them­selves.

The sequence is intended to provide enough background to follow most public discussion about iterated amplification, and to be useful for building intuition and informing research about AI alignment even if you never think about amplification again.

The sequence will be easier to understand if you have a working understanding of ML, statistics, and online learning, and if you are familiar with other work on AI alignment. But it would be reasonable to just dive in and skip over any detailed discussion that seems to depend on missing prerequisites.

Outline and reading recommendations

  • The first part of this sequence clarifies the problem that iterated amplification is trying to solve, which is both narrower and broader than you might expect.

  • The second part of the sequence outlines the basic intuitions that motivate iterated amplification. I think that these intuitions may be more important than the scheme itself, but they are considerably more informal.

  • The core of the sequence is the third section. Benign model-free RL describes iterated amplification as a general framework into which we can substitute arbitrary algorithms for reward learning, amplification, and robustness. The first four posts all describe variants of this idea from different perspectives, and if you find that one of those descriptions is clearest for you, then I recommend focusing on that one and skimming the others.

  • The fourth part of the sequence describes some of the black boxes in iterated amplification and discusses what we would need to do to fill in those boxes. I think these are some of the most important open questions in AI alignment.

  • The fifth section of the sequence breaks down some of these problems further and describes some possible approaches.

  • The final section is an FAQ by Alex Zhu, included as an appendix.

The sequence is not intended to be building towards a big reveal; after the first section, each post should stand on its own as addressing a basic question raised by the preceding posts. If the first section seems uninteresting you may want to skip it; if later sections seem uninteresting then it’s probably not going to get any better.

Some readers might prefer starting with the third section, while being prepared to jump back if it’s not clear what’s going on or why. (It would still make sense to return to the first two sections after reading the third.)

If you already understand iterated amplification you might be interested in jumping around the fourth and fifth sections to look at details you haven’t considered before.

The posts in this sequence link liberally to each other (not always in order) and to outside posts. The sequence is designed to make sense when read in order without reading other posts, following links only if you are interested in more details.

Tomorrow’s AI Alignment Forum sequences post will be ‘Future directions for ambitious value learning’ by Rohin Shah, in the sequence ‘Value Learning’.

The next post in this sequence will come out on Tuesday 13th November, and will be ‘The Steering Problem’ by Paul Christiano.