Iterated Amplification

This is a sequence curated by Paul Christiano on one current approach to alignment: Iterated Amplification.

Preface to the sequence on iterated amplification

0. Problem statement

The first part of this sequence clarifies the problem that iterated amplification is trying to solve, which is both narrower and broader than you might expect.

The Steering Problem

Clarifying “AI Alignment”

An unaligned benchmark

Prosaic AI alignment

1. Basic intuition

The second part of the sequence outlines the basic intuitions that motivate iterated amplification. I think that these intuitions may be more important than the scheme itself, but they are considerably more informal.

Approval-directed agents

Approval-directed bootstrapping

Humans Consulting HCH


2. The scheme

The core of the sequence is the third section. "Benign model-free RL" describes iterated amplification as a general outline into which we can substitute arbitrary algorithms for reward learning, amplification, and robustness. The first four posts all describe variants of this idea from different perspectives; if you find one of those descriptions clearest, I recommend focusing on that one and skimming the others.

Iterated Distillation and Amplification

Benign model-free RL

Factored Cognition

Supervising strong learners by amplifying weak experts

AlphaGo Zero and capability amplification

3. What needs doing

The fourth part of the sequence describes some of the black boxes in iterated amplification and discusses what we would need to do to fill in those boxes. I think these are some of the most important open questions in AI alignment.

Directions and desiderata for AI alignment

The reward engineering problem

Capability amplification

Learning with catastrophes

4. Possible approaches

The fifth section of the sequence breaks down some of these problems further and describes some possible approaches.

Thoughts on reward engineering

Techniques for optimizing worst-case performance

Reliability amplification

Security amplification