
Shard Theory

Last edit: 26 Oct 2024 0:18 UTC by Noosphere89

Shard theory is an alignment research program about the relationship between training variables and the values learned by trained Reinforcement Learning (RL) agents. It is an approach to progressively fleshing out a mechanistic account of human values, of learned values in RL agents, and (to a lesser extent) of learned algorithms in ML generally.

Shard theory’s basic ontology of RL holds that shards are contextually activated, behavior-steering computations in neural networks (biological and artificial). When a shard steers behavior that garners reinforcement, the circuits implementing that shard are strengthened, so the shard becomes more likely to trigger again in the future when given similar cognitive inputs.
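
As a loose illustration of this reinforcement dynamic only (a minimal sketch with made-up names such as `Shard` and `step`, not shard theory's actual machinery, which concerns circuits inside neural networks rather than explicit objects), the toy Python below treats shards as contextually triggered bidders whose influence is strengthened whenever the behavior they steer toward is reinforced:

```python
# Hypothetical toy model: shards as contextually activated,
# behavior-steering bidders whose strength grows with reinforcement.

class Shard:
    def __init__(self, name, trigger, behavior, strength=1.0):
        self.name = name
        self.trigger = trigger      # context feature that activates this shard
        self.behavior = behavior    # behavior the shard steers toward
        self.strength = strength    # how strongly the shard bids when active

    def activation(self, context):
        # A shard only bids in contexts resembling its trigger.
        return self.strength if self.trigger in context else 0.0


def step(shards, context, reward_fn, learning_rate=0.5):
    """Let the most strongly activated shard steer behavior, then reinforce it."""
    winner = max(shards, key=lambda s: s.activation(context))
    reward = reward_fn(winner.behavior)
    # Reinforcement strengthens the "circuits" implementing the winning shard,
    # so it is more likely to trigger again on similar cognitive inputs.
    winner.strength += learning_rate * reward
    return winner.name, reward


# Toy example: a "lollipop shard" fires when a lollipop is in view.
shards = [
    Shard("lollipop", trigger="lollipop-in-view", behavior="grab-lollipop"),
    Shard("explore", trigger="", behavior="wander"),  # empty trigger matches any context
]
reward_fn = lambda behavior: 1.0 if behavior == "grab-lollipop" else 0.0

for _ in range(3):
    print(step(shards, "lollipop-in-view", reward_fn))
```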

Because an appreciable fraction of a neural network is composed of shards, large neural nets can possess quite intelligent constituent shards. These shards can be sophisticated enough to be well modeled as playing negotiation games with one another, (potentially) explaining human psychological phenomena such as akrasia and value changes arising from moral reflection. Shard theory also suggests an approach to explaining the shape of human values, and a scheme for RL alignment.

The shard theory of human values

4 Sep 2022 4:28 UTC
249 points
67 comments · 24 min read · LW link · 2 reviews

Shard Theory in Nine Theses: a Distillation and Critical Appraisal

LawrenceC · 19 Dec 2022 22:52 UTC
143 points
30 comments · 18 min read · LW link

The heritability of human values: A behavior genetic critique of Shard Theory

geoffreymiller · 20 Oct 2022 15:51 UTC
80 points
59 comments · 21 min read · LW link

Understanding and avoiding value drift

TurnTrout · 9 Sep 2022 4:16 UTC
48 points
11 comments · 6 min read · LW link

Contra shard theory, in the context of the diamond maximizer problem

So8res · 13 Oct 2022 23:51 UTC
102 points
19 comments · 2 min read · LW link · 1 review

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
328 points
27 comments · 23 min read · LW link

Reward is not the optimization target

TurnTrout · 25 Jul 2022 0:03 UTC
376 points
123 comments · 10 min read · LW link · 3 reviews

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout · 2 Dec 2022 2:43 UTC
147 points
22 comments · 47 min read · LW link · 3 reviews

A shot at the diamond-alignment problem

TurnTrout · 6 Oct 2022 18:29 UTC
95 points
62 comments · 15 min read · LW link

Shard Theory: An Overview

David Udell · 11 Aug 2022 5:44 UTC
166 points
34 comments · 10 min read · LW link

Shard Theory - is it true for humans?

Rishika · 14 Jun 2024 19:21 UTC
68 points
7 comments · 15 min read · LW link

An ML interpretation of Shard Theory

beren · 3 Jan 2023 20:30 UTC
39 points
5 comments · 4 min read · LW link

Paper: Understanding and Controlling a Maze-Solving Policy Network

13 Oct 2023 1:38 UTC
70 points
0 comments · 1 min read · LW link
(arxiv.org)

Shard theory alignment has important, often-overlooked free parameters.

Charlie Steiner · 20 Jan 2023 9:30 UTC
36 points
10 comments · 3 min read · LW link

General alignment properties

TurnTrout · 8 Aug 2022 23:40 UTC
50 points
2 comments · 1 min read · LW link

Review of AI Alignment Progress

PeterMcCluskey · 7 Feb 2023 18:57 UTC
72 points
32 comments · 7 min read · LW link
(bayesianinvestor.com)

Predictions for shard theory mechanistic interpretability results

1 Mar 2023 5:16 UTC
105 points
10 comments · 5 min read · LW link

Contra “Strong Coherence”

DragonGod · 4 Mar 2023 20:05 UTC
39 points
24 comments · 1 min read · LW link

[Question] Is “Strong Coherence” Anti-Natural?

DragonGod · 11 Apr 2023 6:22 UTC
23 points
25 comments · 2 min read · LW link

Clippy, the friendly paperclipper

Seth Herd · 2 Mar 2023 0:02 UTC
3 points
11 comments · 2 min read · LW link

Reward Bases: A simple mechanism for adaptive acquisition of multiple reward type

Bogdan Ionut Cirstea · 23 Nov 2024 12:45 UTC
11 points
0 comments · 1 min read · LW link

Why I’m bearish on mechanistic interpretability: the shards are not in the network

tailcalled · 13 Sep 2024 17:09 UTC
19 points
40 comments · 1 min read · LW link

The Shard Theory Alignment Scheme

David Udell · 25 Aug 2022 4:52 UTC
47 points
32 comments · 2 min read · LW link

Team Shard Status Report

David Udell · 9 Aug 2022 5:33 UTC
38 points
8 comments · 3 min read · LW link

[April Fools’] Definitive confirmation of shard theory

TurnTrout · 1 Apr 2023 7:27 UTC
169 points
8 comments · 2 min read · LW link

Human values & biases are inaccessible to the genome

TurnTrout · 7 Jul 2022 17:29 UTC
94 points
54 comments · 6 min read · LW link · 1 review

Behavioural statistics for a maze-solving agent

20 Apr 2023 22:26 UTC
46 points
11 comments · 10 min read · LW link

Research agenda: Supervising AIs improving AIs

29 Apr 2023 17:09 UTC
76 points
5 comments · 19 min read · LW link

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTrout · 29 Nov 2022 6:23 UTC
62 points
42 comments · 15 min read · LW link

A framework and open questions for game theoretic shard modeling

Garrett Baker · 21 Oct 2022 21:40 UTC
11 points
4 comments · 4 min read · LW link

Some Thoughts on Virtue Ethics for AIs

peligrietzer · 2 May 2023 5:46 UTC
76 points
8 comments · 4 min read · LW link

Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake

TurnTrout · 19 Nov 2024 18:36 UTC
40 points
5 comments · 1 min read · LW link
(turntrout.com)

AXRP Episode 22 - Shard Theory with Quintin Pope

DanielFilan · 15 Jun 2023 19:00 UTC
52 points
11 comments · 93 min read · LW link

Disentangling Shard Theory into Atomic Claims

Leon Lang · 13 Jan 2023 4:23 UTC
86 points
6 comments · 18 min read · LW link

Positive values seem more robust and lasting than prohibitions

TurnTrout · 17 Dec 2022 21:43 UTC
52 points
13 comments · 2 min read · LW link

Exploring Shard-like Behavior: Empirical Insights into Contextual Decision-Making in RL Agents

Alejandro Aristizabal · 29 Sep 2024 0:32 UTC
6 points
0 comments · 15 min read · LW link

Programming Refusal with Conditional Activation Steering

Bruce W. Lee · 11 Sep 2024 20:57 UTC
41 points
0 comments · 11 min read · LW link
(arxiv.org)

The alignment stability problem

Seth Herd · 26 Mar 2023 2:10 UTC
35 points
15 comments · 4 min read · LW link

Steering GPT-2-XL by adding an activation vector

13 May 2023 18:42 UTC
437 points
97 comments · 50 min read · LW link

Evolution is a bad analogy for AGI: inner alignment

Quintin Pope · 13 Aug 2022 22:15 UTC
79 points
15 comments · 8 min read · LW link

Humans provide an untapped wealth of evidence about alignment

14 Jul 2022 2:31 UTC
211 points
94 comments · 9 min read · LW link · 1 review

Broad Picture of Human Values

Thane Ruthenis · 20 Aug 2022 19:42 UTC
42 points
6 comments · 10 min read · LW link

Adaptation-Executers, not Fitness-Maximizers

Eliezer Yudkowsky · 11 Nov 2007 6:39 UTC
156 points
33 comments · 3 min read · LW link

Failure modes in a shard theory alignment plan

Thomas Kwa · 27 Sep 2022 22:34 UTC
26 points
2 comments · 7 min read · LW link

Unpacking “Shard Theory” as Hunch, Question, Theory, and Insight

Jacy Reese Anthis · 16 Nov 2022 13:54 UTC
31 points
9 comments · 2 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

19 Nov 2022 21:04 UTC
45 points
0 comments · 3 min read · LW link

If Wentworth is right about natural abstractions, it would be bad for alignment

Wuschel Schulz · 8 Dec 2022 15:19 UTC
29 points
5 comments · 4 min read · LW link

In Defense of Wrapper-Minds

Thane Ruthenis · 28 Dec 2022 18:28 UTC
24 points
38 comments · 3 min read · LW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang · 16 Jan 2023 22:46 UTC
31 points
7 comments · 17 min read · LW link
(docs.google.com)

Pessimistic Shard Theory

Garrett Baker · 25 Jan 2023 0:59 UTC
72 points
13 comments · 3 min read · LW link

AGI will have learnt utility functions

beren · 25 Jan 2023 19:42 UTC
36 points
3 comments · 13 min read · LW link