Shard Theory

TagLast edit: 14 Nov 2023 10:49 UTC by Sohaib Imran

Shard theory is an alignment research program, about the relationship between training variables and learned values in trained Reinforcement Learning (RL) agents. It is thus an approach to progressively fleshing out a mechanistic account of human values, learned values in RL agents, and (to a lesser extent) the learned algorithms in ML generally.

Shard theory’s basic ontology of RL holds that shards are contextually activated, behavior-steering computations in neural networks (biological and artificial). The circuits that implement a shard that garners reinforcement are reinforced, meaning that that shard will be more likely to trigger again in the future, when given similar cognitive inputs.

As an appreciable fraction of a neural network is composed of shards, large neural nets can possess quite intelligent constituent shards. These shards can be sophisticated enough to be well-modeled as playing negotiation games with each other, (potentially) explaining human psychological phenomena like akrasia and value changes from moral reflection. Shard theory also suggests an approach to explaining the shape of human values, and scheme for RL alignment.

The shard the­ory of hu­man values

4 Sep 2022 4:28 UTC
235 points
66 comments24 min readLW link2 reviews

Shard The­ory in Nine Th­e­ses: a Distil­la­tion and Crit­i­cal Appraisal

LawrenceC19 Dec 2022 22:52 UTC
138 points
30 comments18 min readLW link

Un­der­stand­ing and avoid­ing value drift

TurnTrout9 Sep 2022 4:16 UTC
43 points
9 comments6 min readLW link

The her­i­ta­bil­ity of hu­man val­ues: A be­hav­ior ge­netic cri­tique of Shard Theory

geoffreymiller20 Oct 2022 15:51 UTC
80 points
59 comments21 min readLW link

Con­tra shard the­ory, in the con­text of the di­a­mond max­i­mizer problem

So8res13 Oct 2022 23:51 UTC
101 points
19 comments2 min readLW link1 review

A shot at the di­a­mond-al­ign­ment problem

TurnTrout6 Oct 2022 18:29 UTC
92 points
58 comments15 min readLW link

Re­ward is not the op­ti­miza­tion target

TurnTrout25 Jul 2022 0:03 UTC
355 points
121 comments10 min readLW link3 reviews

Shard The­ory: An Overview

David Udell11 Aug 2022 5:44 UTC
160 points
34 comments10 min readLW link

Un­der­stand­ing and con­trol­ling a maze-solv­ing policy network

11 Mar 2023 18:59 UTC
312 points
22 comments23 min readLW link

Paper: Un­der­stand­ing and Con­trol­ling a Maze-Solv­ing Policy Network

13 Oct 2023 1:38 UTC
69 points
0 comments1 min readLW link

Gen­eral al­ign­ment properties

TurnTrout8 Aug 2022 23:40 UTC
50 points
2 comments1 min readLW link

The Shard The­ory Align­ment Scheme

David Udell25 Aug 2022 4:52 UTC
47 points
32 comments2 min readLW link

Team Shard Sta­tus Report

David Udell9 Aug 2022 5:33 UTC
38 points
8 comments3 min readLW link

Hu­man val­ues & bi­ases are in­ac­cessible to the genome

TurnTrout7 Jul 2022 17:29 UTC
93 points
54 comments6 min readLW link1 review

Con­tra “Strong Co­her­ence”

DragonGod4 Mar 2023 20:05 UTC
39 points
24 comments1 min readLW link

[April Fools’] Defini­tive con­fir­ma­tion of shard theory

TurnTrout1 Apr 2023 7:27 UTC
166 points
7 comments2 min readLW link

Be­havi­oural statis­tics for a maze-solv­ing agent

20 Apr 2023 22:26 UTC
44 points
11 comments10 min readLW link

Re­search agenda: Su­per­vis­ing AIs im­prov­ing AIs

29 Apr 2023 17:09 UTC
76 points
5 comments19 min readLW link

Some Thoughts on Virtue Ethics for AIs

peligrietzer2 May 2023 5:46 UTC
74 points
7 comments4 min readLW link

AXRP Epi­sode 22 - Shard The­ory with Quintin Pope

DanielFilan15 Jun 2023 19:00 UTC
52 points
11 comments93 min readLW link

A frame­work and open ques­tions for game the­o­retic shard modeling

Garrett Baker21 Oct 2022 21:40 UTC
11 points
4 comments4 min readLW link

Align­ment al­lows “non­ro­bust” de­ci­sion-in­fluences and doesn’t re­quire ro­bust grading

TurnTrout29 Nov 2022 6:23 UTC
60 points
42 comments15 min readLW link

In­ner and outer al­ign­ment de­com­pose one hard prob­lem into two ex­tremely hard problems

TurnTrout2 Dec 2022 2:43 UTC
138 points
22 comments47 min readLW link3 reviews

Disen­tan­gling Shard The­ory into Atomic Claims

Leon Lang13 Jan 2023 4:23 UTC
85 points
6 comments18 min readLW link

Pos­i­tive val­ues seem more ro­bust and last­ing than prohibitions

TurnTrout17 Dec 2022 21:43 UTC
51 points
13 comments2 min readLW link

An ML in­ter­pre­ta­tion of Shard Theory

beren3 Jan 2023 20:30 UTC
38 points
5 comments4 min readLW link

Shard the­ory al­ign­ment has im­por­tant, of­ten-over­looked free pa­ram­e­ters.

Charlie Steiner20 Jan 2023 9:30 UTC
33 points
10 comments3 min readLW link

Re­view of AI Align­ment Progress

PeterMcCluskey7 Feb 2023 18:57 UTC
72 points
32 comments7 min readLW link

Pre­dic­tions for shard the­ory mechanis­tic in­ter­pretabil­ity results

1 Mar 2023 5:16 UTC
105 points
9 comments5 min readLW link

[Question] Is “Strong Co­her­ence” Anti-Nat­u­ral?

DragonGod11 Apr 2023 6:22 UTC
23 points
25 comments2 min readLW link

Clippy, the friendly paperclipper

Seth Herd2 Mar 2023 0:02 UTC
1 point
11 comments2 min readLW link

Hu­mans provide an un­tapped wealth of ev­i­dence about alignment

14 Jul 2022 2:31 UTC
196 points
94 comments9 min readLW link1 review

In Defense of Wrap­per-Minds

Thane Ruthenis28 Dec 2022 18:28 UTC
23 points
38 comments3 min readLW link

Steer­ing GPT-2-XL by adding an ac­ti­va­tion vector

13 May 2023 18:42 UTC
414 points
97 comments50 min readLW link

Ex­per­i­ment Idea: RL Agents Evad­ing Learned Shutdownability

Leon Lang16 Jan 2023 22:46 UTC
31 points
7 comments17 min readLW link

Failure modes in a shard the­ory al­ign­ment plan

Thomas Kwa27 Sep 2022 22:34 UTC
26 points
2 comments7 min readLW link

The al­ign­ment sta­bil­ity problem

Seth Herd26 Mar 2023 2:10 UTC
24 points
10 comments4 min readLW link

Broad Pic­ture of Hu­man Values

Thane Ruthenis20 Aug 2022 19:42 UTC
42 points
6 comments10 min readLW link

Pes­simistic Shard Theory

Garrett Baker25 Jan 2023 0:59 UTC
71 points
13 comments3 min readLW link

AGI will have learnt util­ity functions

beren25 Jan 2023 19:42 UTC
35 points
3 comments13 min readLW link

Un­pack­ing “Shard The­ory” as Hunch, Ques­tion, The­ory, and Insight

Jacy Reese Anthis16 Nov 2022 13:54 UTC
31 points
9 comments2 min readLW link

A Short Dialogue on the Mean­ing of Re­ward Functions

19 Nov 2022 21:04 UTC
45 points
0 comments3 min readLW link

Evolu­tion is a bad anal­ogy for AGI: in­ner alignment

Quintin Pope13 Aug 2022 22:15 UTC
75 points
15 comments8 min readLW link

If Went­worth is right about nat­u­ral ab­strac­tions, it would be bad for alignment

Wuschel Schulz8 Dec 2022 15:19 UTC
28 points
5 comments4 min readLW link

Adap­ta­tion-Ex­e­cuters, not Fit­ness-Maximizers

Eliezer Yudkowsky11 Nov 2007 6:39 UTC
137 points
33 comments3 min readLW link