Shard Theory

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Humans provide an untapped wealth of evidence about alignment

Human values & biases are inaccessible to the genome

General alignment properties

Evolution is a bad analogy for AGI: inner alignment

Reward is not the optimization target

The shard theory of human values

Understanding and avoiding value drift

A shot at the diamond-alignment problem

Don't design agents which exploit adversarial inputs

Don't align agents to evaluations of plans

Alignment allows "nonrobust" decision-influences and doesn't require robust grading

Inner and outer alignment decompose one hard problem into two extremely hard problems