Shard Theory
Quintin Pope, 14 Jul 2022

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Posts in this sequence:

- Humans provide an untapped wealth of evidence about alignment, by TurnTrout and Quintin Pope, 14 Jul 2022 (197 points, 94 comments, 9 min read)
- Human values & biases are inaccessible to the genome, by TurnTrout, 7 Jul 2022 (93 points, 54 comments, 6 min read)
- General alignment properties, by TurnTrout, 8 Aug 2022 (50 points, 2 comments, 1 min read)
- Evolution is a bad analogy for AGI: inner alignment, by Quintin Pope, 13 Aug 2022 (78 points, 15 comments, 8 min read)
- Reward is not the optimization target, by TurnTrout, 25 Jul 2022 (348 points, 123 comments, 10 min read)
- The shard theory of human values, by Quintin Pope and TurnTrout, 4 Sep 2022 (235 points, 66 comments, 24 min read)
- Understanding and avoiding value drift, by TurnTrout, 9 Sep 2022 (43 points, 9 comments, 6 min read)
- A shot at the diamond-alignment problem, by TurnTrout, 6 Oct 2022 (92 points, 58 comments, 15 min read)
- Don’t design agents which exploit adversarial inputs, by TurnTrout and Garrett Baker, 18 Nov 2022 (69 points, 64 comments, 12 min read)
- Don’t align agents to evaluations of plans, by TurnTrout, 26 Nov 2022 (42 points, 49 comments, 18 min read)
- Alignment allows “nonrobust” decision-influences and doesn’t require robust grading, by TurnTrout, 29 Nov 2022 (60 points, 42 comments, 15 min read)
- Inner and outer alignment decompose one hard problem into two extremely hard problems, by TurnTrout, 2 Dec 2022 (139 points, 22 comments, 47 min read)