
Shard Theory

Last edit: 26 Oct 2024 0:18 UTC by Noosphere89

Shard theory is an alignment research program about the relationship between training variables and the values learned by trained Reinforcement Learning (RL) agents. It is an approach to progressively fleshing out a mechanistic account of human values, of learned values in RL agents, and (to a lesser extent) of learned algorithms in ML generally.

Shard theory’s basic ontology of RL holds that shards are contextually activated, behavior-steering computations in neural networks (biological and artificial). When a shard steers behavior that garners reinforcement, the circuits implementing that shard are strengthened, so the shard becomes more likely to trigger again in the future when given similar cognitive inputs.
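
As a loose illustration of this reinforcement dynamic only (a minimal sketch with made-up names such as `Shard` and `step`, not shard theory's actual machinery, which concerns circuits inside neural networks rather than explicit objects), the toy Python below treats shards as contextually triggered bidders whose influence is strengthened whenever the behavior they steer toward is reinforced:

```python
# Hypothetical toy model: shards as contextually activated,
# behavior-steering bidders whose strength grows with reinforcement.

class Shard:
    def __init__(self, name, trigger, behavior, strength=1.0):
        self.name = name
        self.trigger = trigger      # context feature that activates this shard
        self.behavior = behavior    # behavior the shard steers toward
        self.strength = strength    # how strongly the shard bids when active

    def activation(self, context):
        # A shard only bids in contexts resembling its trigger.
        return self.strength if self.trigger in context else 0.0


def step(shards, context, reward_fn, learning_rate=0.5):
    """Let the most strongly activated shard steer behavior, then reinforce it."""
    winner = max(shards, key=lambda s: s.activation(context))
    reward = reward_fn(winner.behavior)
    # Reinforcement strengthens the "circuits" implementing the winning shard,
    # so it is more likely to trigger again on similar cognitive inputs.
    winner.strength += learning_rate * reward
    return winner.name, reward


# Toy example: a "lollipop shard" fires when a lollipop is in view.
shards = [
    Shard("lollipop", trigger="lollipop-in-view", behavior="grab-lollipop"),
    Shard("explore", trigger="", behavior="wander"),  # empty trigger matches any context
]
reward_fn = lambda behavior: 1.0 if behavior == "grab-lollipop" else 0.0

for _ in range(3):
    print(step(shards, "lollipop-in-view", reward_fn))
```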

Because an appreciable fraction of a neural network is composed of shards, large neural nets can possess quite intelligent constituent shards. These shards can be sophisticated enough to be well modeled as playing negotiation games with one another, (potentially) explaining human psychological phenomena such as akrasia and value changes arising from moral reflection. Shard theory also suggests an approach to explaining the shape of human values, and a scheme for RL alignment.

The shard theory of human values

4 Sep 2022 4:28 UTC
249 points
67 comments · 24 min read · LW link · 2 reviews

Shard Theory in Nine Theses: a Distillation and Critical Appraisal

LawrenceC · 19 Dec 2022 22:52 UTC
143 points
30 comments · 18 min read · LW link

The heritability of human values: A behavior genetic critique of Shard Theory

geoffreymiller · 20 Oct 2022 15:51 UTC
80 points
59 comments · 21 min read · LW link

Understanding and avoiding value drift

TurnTrout · 9 Sep 2022 4:16 UTC
48 points
11 comments · 6 min read · LW link

Contra shard theory, in the context of the diamond maximizer problem

So8res · 13 Oct 2022 23:51 UTC
102 points
19 comments · 2 min read · LW link · 1 review

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
328 points
27 comments · 23 min read · LW link

Reward is not the optimization target

TurnTrout · 25 Jul 2022 0:03 UTC
376 points
123 comments · 10 min read · LW link · 3 reviews

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout · 2 Dec 2022 2:43 UTC
147 points
22 comments · 47 min read · LW link · 3 reviews

A shot at the diamond-alignment problem

TurnTrout · 6 Oct 2022 18:29 UTC
95 points
62 comments · 15 min read · LW link

Shard Theory: An Overview

David Udell · 11 Aug 2022 5:44 UTC
166 points
34 comments · 10 min read · LW link

Shard Theory - is it true for humans?

Rishika · 14 Jun 2024 19:21 UTC
68 points
7 comments · 15 min read · LW link

An ML interpretation of Shard Theory

beren · 3 Jan 2023 20:30 UTC
39 points
5 comments · 4 min read · LW link

Paper: Understanding and Controlling a Maze-Solving Policy Network

13 Oct 2023 1:38 UTC
70 points
0 comments · 1 min read · LW link
(arxiv.org)

Shard theory alignment has important, often-overlooked free parameters.

Charlie Steiner · 20 Jan 2023 9:30 UTC
36 points
10 comments · 3 min read · LW link

General alignment properties

TurnTrout · 8 Aug 2022 23:40 UTC
50 points
2 comments · 1 min read · LW link

Review of AI Alignment Progress

PeterMcCluskey · 7 Feb 2023 18:57 UTC
72 points
32 comments · 7 min read · LW link
(bayesianinvestor.com)

Predictions for shard theory mechanistic interpretability results

1 Mar 2023 5:16 UTC
105 points
10 comments · 5 min read · LW link

Contra “Strong Coherence”

DragonGod · 4 Mar 2023 20:05 UTC
39 points
24 comments · 1 min read · LW link

[Question] Is “Strong Coherence” Anti-Natural?

DragonGod · 11 Apr 2023 6:22 UTC
23 points
25 comments · 2 min read · LW link

Clippy, the friendly paperclipper

Seth Herd · 2 Mar 2023 0:02 UTC
3 points
11 comments · 2 min read · LW link

Reward Bases: A simple mechanism for adaptive acquisition of multiple reward type

Bogdan Ionut Cirstea · 23 Nov 2024 12:45 UTC
11 points
0 comments · 1 min read · LW link

Why I’m bearish on mechanistic interpretability: the shards are not in the network

tailcalled · 13 Sep 2024 17:09 UTC
19 points
40 comments · 1 min read · LW link

The Shard Theory Alignment Scheme

David Udell · 25 Aug 2022 4:52 UTC
47 points
32 comments · 2 min read · LW link

Team Shard Status Report

David Udell · 9 Aug 2022 5:33 UTC
38 points
8 comments · 3 min read · LW link

[April Fools’] Definitive confirmation of shard theory

TurnTrout · 1 Apr 2023 7:27 UTC
169 points
8 comments · 2 min read · LW link

Human values & biases are inaccessible to the genome

TurnTrout · 7 Jul 2022 17:29 UTC
94 points
54 comments · 6 min read · LW link · 1 review

Behavioural statistics for a maze-solving agent

20 Apr 2023 22:26 UTC
46 points
11 comments · 10 min read · LW link

Research agenda: Supervising AIs improving AIs

29 Apr 2023 17:09 UTC
76 points
5 comments · 19 min read · LW link

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTrout · 29 Nov 2022 6:23 UTC
62 points
42 comments · 15 min read · LW link

A framework and open questions for game theoretic shard modeling

Garrett Baker · 21 Oct 2022 21:40 UTC
11 points
4 comments · 4 min read · LW link

Some Thoughts on Virtue Ethics for AIs

peligrietzer · 2 May 2023 5:46 UTC
76 points
8 comments · 4 min read · LW link

Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake

TurnTrout · 19 Nov 2024 18:36 UTC
40 points
5 comments · 1 min read · LW link
(turntrout.com)

AXRP Episode 22 - Shard Theory with Quintin Pope

DanielFilan · 15 Jun 2023 19:00 UTC
52 points
11 comments · 93 min read · LW link

Disentangling Shard Theory into Atomic Claims

Leon Lang · 13 Jan 2023 4:23 UTC
86 points
6 comments · 18 min read · LW link

Positive values seem more robust and lasting than prohibitions

TurnTrout · 17 Dec 2022 21:43 UTC
52 points
13 comments · 2 min read · LW link

Exploring Shard-like Behavior: Empirical Insights into Contextual Decision-Making in RL Agents

Alejandro Aristizabal · 29 Sep 2024 0:32 UTC
6 points
0 comments · 15 min read · LW link

Programming Refusal with Conditional Activation Steering

Bruce W. Lee · 11 Sep 2024 20:57 UTC
41 points
0 comments · 11 min read · LW link
(arxiv.org)

The alignment stability problem

Seth Herd · 26 Mar 2023 2:10 UTC
35 points
15 comments · 4 min read · LW link

Steering GPT-2-XL by adding an activation vector

13 May 2023 18:42 UTC
437 points
97 comments · 50 min read · LW link

Evolution is a bad analogy for AGI: inner alignment

Quintin Pope · 13 Aug 2022 22:15 UTC
79 points
15 comments · 8 min read · LW link

Humans provide an untapped wealth of evidence about alignment

14 Jul 2022 2:31 UTC
211 points
94 comments · 9 min read · LW link · 1 review

Broad Picture of Human Values

Thane Ruthenis · 20 Aug 2022 19:42 UTC
42 points
6 comments · 10 min read · LW link

Adaptation-Executers, not Fitness-Maximizers

Eliezer Yudkowsky · 11 Nov 2007 6:39 UTC
156 points
33 comments · 3 min read · LW link

Failure modes in a shard theory alignment plan

Thomas Kwa · 27 Sep 2022 22:34 UTC
26 points
2 comments · 7 min read · LW link

Unpacking “Shard Theory” as Hunch, Question, Theory, and Insight

Jacy Reese Anthis · 16 Nov 2022 13:54 UTC
31 points
9 comments · 2 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

19 Nov 2022 21:04 UTC
45 points
0 comments · 3 min read · LW link

If Wentworth is right about natural abstractions, it would be bad for alignment

Wuschel Schulz · 8 Dec 2022 15:19 UTC
29 points
5 comments · 4 min read · LW link

In Defense of Wrapper-Minds

Thane Ruthenis · 28 Dec 2022 18:28 UTC
24 points
38 comments · 3 min read · LW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang · 16 Jan 2023 22:46 UTC
31 points
7 comments · 17 min read · LW link
(docs.google.com)

Pessimistic Shard Theory

Garrett Baker · 25 Jan 2023 0:59 UTC
72 points
13 comments · 3 min read · LW link

AGI will have learnt utility functions

beren · 25 Jan 2023 19:42 UTC
36 points
3 comments · 13 min read · LW link