Shard Theory

Written by Quintin Pope, Alex Turner, Charles Foster, and Logan Smith. Card image generated by DALL-E 2.

Humans provide an untapped wealth of evidence about alignment

Human values & biases are inaccessible to the genome

General alignment properties

Evolution is a bad analogy for AGI: inner alignment

Reward is not the optimization target

The shard theory of human values

Understanding and avoiding value drift

A shot at the diamond-alignment problem

Don't design agents which exploit adversarial inputs

Don't align agents to evaluations of plans

Alignment allows "nonrobust" decision-influences and doesn't require robust grading

Inner and outer alignment decompose one hard problem into two extremely hard problems