Iterated Amplification

This is a sequence curated by Paul Christiano on one current approach to alignment: Iterated Amplification.

Preface to the sequence on iterated amplification

Problem statement

The first part of this sequence clarifies the problem that iterated amplification is trying to solve, which is both narrower and broader than you might expect.

The Steering Problem

Clarifying “AI Alignment”

An unaligned benchmark

Prosaic AI alignment

Basic intuition

The second part of the sequence outlines the basic intuitions that motivate iterated amplification. I think that these intuitions may be more important than the scheme itself, but they are considerably more informal.

Approval-directed agents

Approval-directed bootstrapping

Humans Consulting HCH

Corrigibility

The scheme

The core of the sequence is the third section. Benign model-free RL describes iterated amplification as a general outline into which we can substitute arbitrary algorithms for reward learning, amplification, and robustness. The first four posts all describe variants of this idea from different perspectives; if you find that one of those descriptions is clearest for you, I recommend focusing on that one and skimming the others.

Iterated Distillation and Amplification

Benign model-free RL

Factored Cognition

Supervising strong learners by amplifying weak experts

AlphaGo Zero and capability amplification

What needs doing

The fourth part of the sequence describes some of the black boxes in iterated amplification and discusses what we would need to do to fill in those boxes. I think these are some of the most important open questions in AI alignment.

Directions and desiderata for AI alignment

The reward engineering problem

Capability amplification

Learning with catastrophes

Possible approaches

The fifth section of the sequence breaks down some of these problems further and describes some possible approaches.

Thoughts on reward engineering

Techniques for optimizing worst-case performance

Reliability amplification