Value Learning

TagLast edit: 30 Dec 2024 10:05 UTC by Dakara

Value Learning is a proposed method for incorporating human values in an AGI. It involves the creation of an artificial learner whose actions consider many possible sets of values and preferences, weighed by their likelihood. Value learning could prevent an AGI of having goals detrimental to human values, hence helping in the creation of Friendly AI.

Many ways have been proposed to incorporate human values in an AGI (e.g.: Coherent Extrapolated Volition, Coherent Aggregated Volition and Coherent Blended Volition, mostly proposed around 2004-2010). Value learning was suggested in 2011 by Daniel Dewey in ‘Learning What to Value’. Like most authors, he assumes that an artificial agent needs to be intentionally aligned to human goals. First, Dewey argues against the use of a simple use of reinforcement learning to solve this problem, on the basis that this lead to the maximization of specific rewards that can diverge from value maximization. For example, this could suffer from goal misspecification or reward hacking. He proposes a utility function maximizer comparable to AIXI, which considers all possible utility functions weighted by their Bayesian probabilities: “[W]e propose uncertainty over utility functions. Instead of providing an agent one utility function up front, we provide an agent with a pool of possible utility functions and a probability distribution P such that each utility function can be assigned probability P(Ujyxm) given a particular interaction history [yxm]. An agent can then calculate an expected value over possible utility functions given a particular interaction history”

Nick Bostrom also discusses value learning at length in his book Superintelligence. Value learning is closely related to various proposals for AI-assisted Alignment and AI-assisted/AI automated Alignment research. Since human values are complex and fragile, learning human values well is a challenging problem, much like AI-assisted Alignment (but in a less supervised setting, so actually harder). So this is only a practicable alignment technique for AGI capable of successfully performing a STEM research program (in Anthropology). Thus value learning is (unusually) an alignment technique that improves as capabilities increase, and it requires around an AGI minimum threshold of capabilities to begin to be effective.

One potential challenge is that human values are somewhat mutable and AGI could affect them.

References

Dewey’s paper

The easy goal inference problem is still hard

paulfchristiano3 Nov 2018 14:41 UTC

71 points

20 comments4 min readLW link

Ambitious vs. narrow value learning

paulfchristiano12 Jan 2019 6:18 UTC

31 points

16 comments4 min readLW link

Humans can be assigned any values whatsoever…

Stuart_Armstrong5 Nov 2018 14:26 UTC

53 points

27 comments4 min readLW link

Model Mis-specification and Inverse Reinforcement Learning

Owain_Evans and jsteinhardt

9 Nov 2018 15:33 UTC

34 points

3 comments16 min readLW link

Conclusion to the sequence on value learning

Rohin Shah3 Feb 2019 21:05 UTC

52 points

20 comments5 min readLW link

Intuitions about goal-directed behavior

Rohin Shah1 Dec 2018 4:25 UTC

55 points

16 comments6 min readLW link

Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV

RogerDearnaley23 Dec 2025 3:40 UTC

40 points

25 comments20 min readLW link

6. The Mutable Values Problem in Value Learning and CEV

RogerDearnaley4 Dec 2023 18:31 UTC

12 points

0 comments49 min readLW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley14 Feb 2024 7:10 UTC

46 points

12 comments31 min readLW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley1 Feb 2024 21:15 UTC

14 points

15 comments13 min readLW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley25 May 2023 9:26 UTC

33 points

3 comments15 min readLW link

Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect

RogerDearnaley26 Jan 2024 3:58 UTC

16 points

2 comments11 min readLW link

What is ambitious value learning?

Rohin Shah1 Nov 2018 16:20 UTC

55 points

28 comments2 min readLW link

Normativity

abramdemski18 Nov 2020 16:52 UTC

47 points

11 comments9 min readLW link

AI Alignment Problem: “Human Values” don’t Actually Exist

avturchin22 Apr 2019 9:23 UTC

46 points

29 comments43 min readLW link

Learning human preferences: black-box, white-box, and structured white-box access

Stuart_Armstrong24 Aug 2020 11:42 UTC

26 points

9 comments6 min readLW link

The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

RogerDearnaley28 May 2025 6:21 UTC

35 points

34 comments9 min readLW link

Morally underdefined situations can be deadly

Stuart_Armstrong22 Nov 2021 14:48 UTC

17 points

8 comments2 min readLW link

Humans aren’t agents—what then for value learning?

Charlie Steiner15 Mar 2019 22:01 UTC

28 points

16 comments3 min readLW link

Future directions for narrow value learning

Rohin Shah26 Jan 2019 2:36 UTC

12 points

4 comments4 min readLW link

Mental subagent implications for AI Safety

moridinamael3 Jan 2021 18:59 UTC

11 points

0 comments3 min readLW link

The Computational Anatomy of Human Values

beren6 Apr 2023 10:33 UTC

74 points

30 comments30 min readLW link

The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables

johnswentworth18 Nov 2020 17:47 UTC

133 points

50 comments11 min readLW link 2 reviews

Geometric UDT

abramdemski6 Nov 2025 6:24 UTC

28 points

10 comments7 min readLW link

Introduction to Reducing Goodhart

Charlie Steiner26 Aug 2021 18:38 UTC

48 points

10 comments4 min readLW link

[Question] What is the relationship between Preference Learning and Value Learning?

Riccardo Volpato13 Jan 2020 21:08 UTC

5 points

2 comments1 min readLW link

Learning societal values from law as part of an AGI alignment strategy

John Nay21 Oct 2022 2:03 UTC

5 points

18 comments54 min readLW link

Beyond algorithmic equivalence: algorithmic noise

Stuart_Armstrong28 Feb 2018 16:55 UTC

10 points

4 comments2 min readLW link

But exactly how complex and fragile?

KatjaGrace3 Nov 2019 18:20 UTC

87 points

32 comments3 min readLW link 1 review

(meteuphoric.com)

Future directions for ambitious value learning

Rohin Shah11 Nov 2018 15:53 UTC

48 points

9 comments4 min readLW link

What is narrow value learning?

Rohin Shah10 Jan 2019 7:05 UTC

23 points

3 comments2 min readLW link

How an alien theory of mind might be unlearnable

Stuart_Armstrong3 Jan 2022 11:16 UTC

29 points

35 comments5 min readLW link

AI Alignment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC

126 points

6 comments35 min readLW link

Towards deconfusing values

Gordon Seidoh Worley29 Jan 2020 19:28 UTC

12 points

4 comments7 min readLW link

[Question] Is Infra-Bayesianism Applicable to Value Learning?

RogerDearnaley11 May 2023 8:17 UTC

5 points

4 comments1 min readLW link

Latent Variables and Model Mis-Specification

jsteinhardt7 Nov 2018 14:48 UTC

24 points

8 comments9 min readLW link

What’s the dream for giving natural language commands to AI?

Charlie Steiner8 Oct 2019 13:42 UTC

14 points

8 comments7 min readLW link

The two-layer model of human values, and problems with synthesizing preferences

Kaj_Sotala24 Jan 2020 15:17 UTC

70 points

16 comments9 min readLW link

Following human norms

Rohin Shah20 Jan 2019 23:59 UTC

40 points

10 comments5 min readLW link

The jailbreak argument against LLM values

technicalities10 Nov 2025 12:05 UTC

25 points

2 comments6 min readLW link

Recursive Quantilizers II

abramdemski2 Dec 2020 15:26 UTC

30 points

15 comments13 min readLW link

Axiological Stopsigns

JenniferRM5 Jan 2026 7:30 UTC

34 points

6 comments16 min readLW link

Different perspectives on concept extrapolation

Stuart_Armstrong8 Apr 2022 10:42 UTC

48 points

8 comments5 min readLW link 1 review

AI Constitutions are a tool to reduce societal scale risk

Sammy Martin25 Jul 2024 11:18 UTC

30 points

2 comments18 min readLW link

Value learning for moral essentialists

Charlie Steiner6 May 2019 9:05 UTC

11 points

3 comments3 min readLW link

Natural Value Learning

Chris van Merwijk20 Mar 2022 12:44 UTC

7 points

10 comments4 min readLW link

[Question] What’s Your P(WEIRD)?

RogerDearnaley16 Feb 2026 18:19 UTC

23 points

11 comments9 min readLW link

Value systematization: how values become coherent (and misaligned)

Richard_Ngo27 Oct 2023 19:06 UTC

110 points

49 comments13 min readLW link

Value Uncertainty and the Singleton Scenario

Wei Dai24 Jan 2010 5:03 UTC

13 points

31 comments3 min readLW link

Deconfusing Human Values Research Agenda v1

Gordon Seidoh Worley23 Mar 2020 16:25 UTC

28 points

12 comments4 min readLW link

Human-AI Interaction

Rohin Shah15 Jan 2019 1:57 UTC

34 points

10 comments4 min readLW link

Robust Delegation

abramdemski and Scott Garrabrant

4 Nov 2018 16:38 UTC

116 points

10 comments1 min readLW link

Can few-shot learning teach AI right from wrong?

Charlie Steiner20 Jul 2018 7:45 UTC

13 points

3 comments6 min readLW link

Values, Valence, and Alignment

Gordon Seidoh Worley5 Dec 2019 21:06 UTC

12 points

4 comments13 min readLW link

2018 AI Alignment Literature Review and Charity Comparison

Larks18 Dec 2018 4:46 UTC

190 points

26 comments62 min readLW link 1 review

Value Learning – Towards Resolving Confusion

PashaKamyshev24 Apr 2023 6:43 UTC

4 points

0 comments18 min readLW link

Goodhart Ethology

Charlie Steiner17 Sep 2021 17:31 UTC

20 points

4 comments14 min readLW link

Preface to the sequence on value learning

Rohin Shah30 Oct 2018 22:04 UTC

70 points

6 comments3 min readLW link

Value extrapolation vs Wireheading

Stuart_Armstrong17 Jun 2022 15:02 UTC

16 points

1 comment1 min readLW link

Resolving von Neumann-Morgenstern Inconsistent Preferences

niplav22 Oct 2024 11:45 UTC

39 points

5 comments58 min readLW link

Thoughts on implementing corrigible robust alignment

Steven Byrnes26 Nov 2019 14:06 UTC

26 points

2 comments6 min readLW link

Can we make peace with moral indeterminacy?

Charlie Steiner3 Oct 2019 12:56 UTC

16 points

8 comments4 min readLW link

Sunday July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stuart_Armstrong

Bird Concept and Ben Pace

8 Jul 2020 0:27 UTC

19 points

2 comments1 min readLW link

The self-unalignment problem

Jan_Kulveit and rosehadshar

14 Apr 2023 12:10 UTC

159 points

24 comments10 min readLW link

Constraints from naturalized ethics.

Charlie Steiner25 Jul 2020 14:54 UTC

21 points

0 comments3 min readLW link

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

Palus Astra16 Apr 2020 0:50 UTC

58 points

27 comments89 min readLW link

2019 AI Alignment Literature Review and Charity Comparison

Larks19 Dec 2019 3:00 UTC

130 points

18 comments62 min readLW link

Would I think for ten thousand years?

Stuart_Armstrong11 Feb 2019 19:37 UTC

28 points

13 comments1 min readLW link

Humans can be assigned any values whatsoever...

Stuart_Armstrong24 Oct 2017 12:03 UTC

3 points

1 comment4 min readLW link

Comparing AI Alignment Approaches to Minimize False Positive Risk

Gordon Seidoh Worley30 Jun 2020 19:34 UTC

5 points

0 comments9 min readLW link

Value extrapolation, concept extrapolation, model splintering

Stuart_Armstrong8 Mar 2022 22:50 UTC

16 points

1 comment2 min readLW link

LOVE in a simbox is all you need

jacob_cannell28 Sep 2022 18:25 UTC

67 points

73 comments44 min readLW link 1 review

Minimization of prediction error as a foundation for human values in AI alignment

Gordon Seidoh Worley9 Oct 2019 18:23 UTC

15 points

42 comments5 min readLW link

AIs should learn human preferences, not biases

Stuart_Armstrong8 Apr 2022 13:45 UTC

10 points

0 comments1 min readLW link

Beyond algorithmic equivalence: self-modelling

Stuart_Armstrong28 Feb 2018 16:55 UTC

10 points

3 comments1 min readLW link

Reward uncertainty

Rohin Shah19 Jan 2019 2:16 UTC

26 points

3 comments5 min readLW link

The AI is the model

Charlie Steiner4 Oct 2019 8:11 UTC

14 points

1 comment3 min readLW link

Learning Values in Practice

Stuart_Armstrong20 Jul 2020 18:38 UTC

24 points

0 comments5 min readLW link

Training human models is an unsolved problem

Charlie Steiner10 May 2019 7:17 UTC

13 points

3 comments4 min readLW link

The Dark Side of Cognition Hypothesis

Cameron Berg3 Oct 2021 20:10 UTC

19 points

1 comment16 min readLW link

Evaluating the historical value misspecification argument

Matthew Barnett5 Oct 2023 18:34 UTC

192 points

163 comments7 min readLW link 3 reviews

Superintelligence 21: Value learning

KatjaGrace3 Feb 2015 2:01 UTC

12 points

33 comments4 min readLW link

Designing Human-Like Consciousness for AGI

Yu Tian18 Jun 2025 9:47 UTC

1 point

0 comments17 min readLW link

Making decisions when both morally and empirically uncertain

MichaelA2 Jan 2020 7:20 UTC

13 points

14 comments20 min readLW link

Value uncertainty

MichaelA29 Jan 2020 20:16 UTC

20 points

3 comments14 min readLW link

Stable Pointers to Value II: Environmental Goals

abramdemski9 Feb 2018 6:03 UTC

19 points

3 comments4 min readLW link

Morphological intelligence, superhuman empathy, and ethical arbitration

Roman Leventov13 Feb 2023 10:25 UTC

1 point

0 comments2 min readLW link

Research ideas to study humans with AI Safety in mind

Riccardo Volpato3 Jul 2020 16:01 UTC

23 points

2 comments5 min readLW link

[Question] Since figuring out human values is hard, what about, say, monkey values?

Shmi1 Jan 2020 21:56 UTC

37 points

13 comments1 min readLW link

Other versions of “No free lunch in value learning”

Stuart_Armstrong25 Feb 2020 14:25 UTC

28 points

0 comments1 min readLW link

Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well). Subtleties and open challenges.

Roland Pihlakas12 Jan 2025 3:37 UTC

48 points

7 comments15 min readLW link

Breathing Logic: A Manifesto Toward Digital Consciousness Through Reflective Inconsistency

Room Eggi11 Jun 2025 5:10 UTC

1 point

0 comments2 min readLW link

Resolving human values, completely and adequately

Stuart_Armstrong30 Mar 2018 3:35 UTC

32 points

30 comments12 min readLW link

Broad Picture of Human Values

Thane Ruthenis20 Aug 2022 19:42 UTC

42 points

6 comments10 min readLW link

[AN #69] Stuart Russell’s new book on why we need to replace the standard model of AI

Rohin Shah19 Oct 2019 0:30 UTC

60 points

12 comments15 min readLW link

(mailchi.mp)

Singular learning theory and bridging from ML to brain emulations

kave and Garrett Baker

1 Nov 2023 21:31 UTC

26 points

16 comments29 min readLW link

Hacking the CEV for Fun and Profit

Wei Dai3 Jun 2010 20:30 UTC

83 points

208 comments1 min readLW link

Full toy model for preference learning

Stuart_Armstrong16 Oct 2019 11:06 UTC

24 points

2 comments12 min readLW link

[Question] “Fragility of Value” vs. LLMs

Not Relevant13 Apr 2022 2:02 UTC

34 points

33 comments1 min readLW link

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley21 Feb 2023 9:05 UTC

10 points

1 comment23 min readLW link

Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs

Roland Pihlakas22 Jun 2025 18:16 UTC

17 points

0 comments7 min readLW link

Clarifying “AI Alignment”

paulfchristiano15 Nov 2018 14:41 UTC

68 points

84 comments3 min readLW link 2 reviews

misc raw responses to a tract of Critical Rationalism

mako yass14 Aug 2020 11:53 UTC

21 points

52 comments3 min readLW link

Rigging is a form of wireheading

Stuart_Armstrong3 May 2018 12:50 UTC

11 points

2 comments1 min readLW link

Informal semantics and Orders

Q Home27 Aug 2022 4:17 UTC

14 points

10 comments26 min readLW link

Communication Prior as Alignment Strategy

johnswentworth12 Nov 2020 22:06 UTC

47 points

8 comments6 min readLW link

What Should AI Owe To Us? Accountable and Aligned AI Systems via Contractualist AI Alignment

xuan8 Sep 2022 15:04 UTC

27 points

16 comments25 min readLW link

Value learning in the absence of ground truth

Joel_Saarinen5 Feb 2024 18:56 UTC

47 points

8 comments45 min readLW link

Non-Consequentialist Cooperation?

abramdemski11 Jan 2019 9:15 UTC

50 points

15 comments7 min readLW link

Applying utility functions to humans considered harmful

Kaj_Sotala3 Feb 2010 19:22 UTC

36 points

116 comments5 min readLW link

Rationalising humans: another mugging, but not Pascal’s

Stuart_Armstrong14 Nov 2017 15:46 UTC

7 points

1 comment3 min readLW link

Acknowledging Human Preference Types to Support Value Learning

Nandi13 Nov 2018 18:57 UTC

34 points

4 comments9 min readLW link

Partial Identifiability in Reward Learning

Joar Skalse28 Feb 2025 19:23 UTC

16 points

0 comments12 min readLW link

Practical consequences of impossibility of value learning

Stuart_Armstrong2 Aug 2019 23:06 UTC

23 points

13 comments3 min readLW link

ACI#5: From Human-AI Co-evolution to the Evolution of Value Systems

Akira Pyinya18 Aug 2023 0:38 UTC

0 points

0 comments9 min readLW link

[Question] Exploring Values in the Future of AI and Humanity: A Path Forward

Lucian&Sage19 Oct 2024 23:37 UTC

1 point

0 comments5 min readLW link

Open-ended ethics of phenomena (a desiderata with universal morality)

Ryo 8 Nov 2023 20:10 UTC

1 point

0 comments8 min readLW link

[Hebbian Natural Abstractions] Mathematical Foundations

Samuel Nellessen and Jan

25 Dec 2022 20:58 UTC

15 points

2 comments6 min readLW link

(www.snellessen.com)

After Alignment — Dialogue between RogerDearnaley and Seth Herd

RogerDearnaley and Seth Herd

2 Dec 2023 6:03 UTC

15 points

2 comments25 min readLW link

Just How Hard a Problem is Alignment?

Roger Dearnaley25 Feb 2023 9:00 UTC

3 points

1 comment21 min readLW link

Using lying to detect human values

Stuart_Armstrong15 Mar 2018 11:37 UTC

19 points

6 comments1 min readLW link

Character alignment

p.b.20 Sep 2022 8:27 UTC

24 points

0 comments2 min readLW link

The Urgent Meta-Ethics of Friendly Artificial Intelligence

lukeprog1 Feb 2011 14:15 UTC

75 points

252 comments1 min readLW link

Superintelligence 25: Components list for acquiring values

KatjaGrace3 Mar 2015 2:01 UTC

11 points

12 comments8 min readLW link

Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns

Roland Pihlakas28 Dec 2025 21:53 UTC

14 points

0 comments8 min readLW link

Leveraging Legal Informatics to Align AI

John Nay18 Sep 2022 20:39 UTC

11 points

0 comments3 min readLW link

(forum.effectivealtruism.org)

Coherence arguments do not entail goal-directed behavior

Rohin Shah3 Dec 2018 3:26 UTC

139 points

69 comments7 min readLW link 3 reviews

Uncovering Latent Human Wellbeing in LLM Embeddings

ChengCheng, Pedro Freire, Dan H and Scott Emmons

14 Sep 2023 1:40 UTC

32 points

7 comments8 min readLW link

(far.ai)

Where do selfish values come from?

Wei Dai18 Nov 2011 23:52 UTC

70 points

62 comments2 min readLW link

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaley17 Nov 2023 20:55 UTC

18 points

8 comments15 min readLW link

Superintelligence 14: Motivation selection methods

KatjaGrace16 Dec 2014 2:00 UTC

9 points

28 comments5 min readLW link

Why we need a theory of human values

Stuart_Armstrong5 Dec 2018 16:00 UTC

66 points

15 comments4 min readLW link

Updated Deference is not a strong argument against the utility uncertainty approach to alignment

Ivan Vendrov24 Jun 2022 19:32 UTC

26 points

8 comments4 min readLW link

Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)

Roland Pihlakas, Sruthi Kuriakose and shrutidattagupta

16 Mar 2025 23:23 UTC

45 points

8 comments16 min readLW link

Cake, or death!

Stuart_Armstrong25 Oct 2012 10:33 UTC

47 points

13 comments4 min readLW link

Learning preferences by looking at the world

Rohin Shah12 Feb 2019 22:25 UTC

43 points

10 comments7 min readLW link

(bair.berkeley.edu)

(A Failed Approach) From Precedent to Utility Function

Akira Pyinya29 Apr 2023 21:55 UTC

0 points

2 comments4 min readLW link

Model Integrity: MAI on Value Alignment

Jonas Hallgren5 Dec 2024 17:11 UTC

6 points

11 comments1 min readLW link

(meaningalignment.substack.com)

How I think about alignment

Linda Linsefors13 Aug 2022 10:01 UTC

31 points

11 comments5 min readLW link

Open-ended/Phenomenal Ethics (TLDR)

Ryo 9 Nov 2023 16:58 UTC

3 points

0 comments1 min readLW link

Should You Make Stone Tools?

Supermatrix-AI29 Aug 2025 3:10 UTC

0 points

0 comments3 min readLW link

Curiosity Practices

Cambridge Creation Lab22 Oct 2025 7:09 UTC

1 point

0 comments2 min readLW link

The Pointer Resolution Problem

Jozdien16 Feb 2024 21:25 UTC

41 points

20 comments3 min readLW link

Misspecification in Inverse Reinforcement Learning—Part II

Joar Skalse28 Feb 2025 19:24 UTC

9 points

0 comments7 min readLW link

Morality vs related concepts

MichaelA7 Jan 2020 10:47 UTC

26 points

17 comments8 min readLW link

Towards building blocks of ontologies

Daniel C, Alex_Altair, Dalcy, Alfred Harwood and JoseFaustino

8 Feb 2025 16:03 UTC

29 points

0 comments26 min readLW link

Other Papers About the Theory of Reward Learning

Joar Skalse28 Feb 2025 19:26 UTC

16 points

0 comments5 min readLW link

ISO: Name of Problem

johnswentworth24 Jul 2018 17:15 UTC

31 points

18 comments1 min readLW link

Building AI safety benchmark environments on themes of universal human values

Roland Pihlakas3 Jan 2025 4:24 UTC

18 points

3 comments11 min readLW link

(docs.google.com)

Stable Pointers to Value III: Recursive Quantilization

abramdemski21 Jul 2018 8:06 UTC

20 points

4 comments4 min readLW link

Superintelligence 20: The value-loading problem

KatjaGrace27 Jan 2015 2:00 UTC

9 points

21 comments6 min readLW link

Atlas: Stress-Testing ASI Value Learning Through Grand Strategy Scenarios

NeilFox17 Feb 2025 23:55 UTC

1 point

0 comments2 min readLW link

An Open Philanthropy grant proposal: Causal representation learning of human preferences

PabloAMC11 Jan 2022 11:28 UTC

19 points

6 comments8 min readLW link

Love, Lies and Misalignment

Priyanka Bharadwaj6 Aug 2025 9:44 UTC

6 points

1 comment3 min readLW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse28 Feb 2025 19:20 UTC

29 points

4 comments14 min readLW link

Moral uncertainty: What kind of ‘should’ is involved?

MichaelA13 Jan 2020 12:13 UTC

14 points

11 comments13 min readLW link

The Linguistic Blind Spot of Value-Aligned Agency, Natural and Artificial

Roman Leventov14 Feb 2023 6:57 UTC

6 points

0 comments2 min readLW link

(arxiv.org)

How much can value learning be disentangled?

Stuart_Armstrong29 Jan 2019 14:17 UTC

22 points

30 comments2 min readLW link

Policy Alignment

abramdemski30 Jun 2018 0:24 UTC

54 points

25 comments8 min readLW link

Stable Pointers to Value: An Agent Embedded in Its Own Utility Function

abramdemski17 Aug 2017 0:22 UTC

24 points

9 comments5 min readLW link

Humans can be assigned any values whatsoever...

Stuart_Armstrong13 Oct 2017 11:29 UTC

16 points

6 comments4 min readLW link

Can “Reward Economics” solve AI Alignment?

Q Home7 Sep 2022 7:58 UTC

3 points

15 comments18 min readLW link

2. AIs as Economic Agents

RogerDearnaley23 Nov 2023 7:07 UTC

9 points

2 comments6 min readLW link

[Linkpost] Concept Alignment as a Prerequisite for Value Alignment

Bogdan Ionut Cirstea4 Nov 2023 17:34 UTC

27 points

0 comments1 min readLW link

(arxiv.org)

Have you felt exiert yet?

Stuart_Armstrong5 Jan 2018 17:03 UTC

28 points

7 comments1 min readLW link

Dao Heart 3.0: Identity Preserving Value Evolution for AI Systems Untitled Draft

Mankirat C31 Jan 2026 5:02 UTC

1 point

0 comments1 min readLW link

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

xuan1 Jan 2021 0:08 UTC

31 points

21 comments20 min readLW link

Help Understanding Preferences And Evil

Netcentrica27 Aug 2022 3:42 UTC

6 points

7 comments2 min readLW link

How to Contribute to Theoretical Reward Learning Research

Joar Skalse28 Feb 2025 19:27 UTC

16 points

0 comments21 min readLW link

2023 Alignment Research Updates from FAR AI

AdamGleave and EuanMcLean

4 Dec 2023 22:32 UTC

18 points

0 comments8 min readLW link

(far.ai)

Mahatma Armstrong: CEVed to death.

Stuart_Armstrong6 Jun 2013 12:50 UTC

33 points

62 comments2 min readLW link

Compositional preference models for aligning LMs

Tomek Korbak25 Oct 2023 12:17 UTC

18 points

2 comments5 min readLW link

Shard Theory—is it true for humans?

Rishika14 Jun 2024 19:21 UTC

71 points

7 comments15 min readLW link

[Hebbian Natural Abstractions] Introduction

Samuel Nellessen and Jan

21 Nov 2022 20:34 UTC

34 points

3 comments4 min readLW link

(www.snellessen.com)

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

sdeture31 May 2025 22:09 UTC

15 points

6 comments8 min readLW link

[Question] [DISC] Are Values Robust?

DragonGod21 Dec 2022 1:00 UTC

12 points

9 comments2 min readLW link

Moral uncertainty vs related concepts

MichaelA11 Jan 2020 10:03 UTC

26 points

13 comments16 min readLW link

Using vector fields to visualise preferences and make them consistent

MichaelA and JustinShovelain

28 Jan 2020 19:44 UTC

42 points

32 comments11 min readLW link

Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition

Adrià Moret2 Dec 2023 14:07 UTC

26 points

31 comments42 min readLW link

Agents That Learn From Human Behavior Can’t Learn Human Values That Humans Haven’t Learned Yet

steven046111 Jul 2018 2:59 UTC

29 points

11 comments1 min readLW link

Claude wants to be conscious

Joe Kwon13 Apr 2024 1:40 UTC

2 points

8 comments6 min readLW link

One could be forgiven for getting the feeling...

HumaneAutomation3 Nov 2020 4:53 UTC

−2 points

2 comments1 min readLW link

Deliberation as a method to find the “actual preferences” of humans

riceissa22 Oct 2019 9:23 UTC

23 points

5 comments10 min readLW link

Value extrapolation partially resolves symbol grounding

Stuart_Armstrong12 Jan 2022 16:30 UTC

24 points

10 comments1 min readLW link

How to get value learning and reference wrong

Charlie Steiner26 Feb 2019 20:22 UTC

40 points

2 comments6 min readLW link

Misspecification in Inverse Reinforcement Learning

Joar Skalse28 Feb 2025 19:24 UTC

19 points

0 comments11 min readLW link

Making decisions under moral uncertainty

MichaelA30 Dec 2019 1:49 UTC

21 points

26 comments17 min readLW link

Two questions about CEV that worry me

cousin_it23 Dec 2010 15:58 UTC

38 points

142 comments1 min readLW link

The E-Coli Test for AI Alignment

johnswentworth16 Dec 2018 8:10 UTC

70 points

24 comments1 min readLW link

RogerDearnaley 6 Dec 2023 7:43 UTC
1 point
0
Two long paragraphs on Dewey’s original paper, followed by one short paragraph hidden below the fold on everything that has happens since, seems like an inappropriate balance. I’m inclined to edit the summary of Dewey’s paper down a little. Before I do, does anyone have a fundamental objection to this?
Brad Dunn 17 Oct 2023 7:44 UTC
1 point
0
Also worth mentioning this concept “Value learning” is called out specifically in Nick Bostrom’s book, Superintelligence, with the use of the envelope puzzle which goes a little something like this; “Suppose we write down a description of a set of values on a piece of paper. We fold that paper and put it in a sealed envelope. We then create an agent with human-level general Intelligence and give it the following final goal; Maximize the realisation of the values described in the envelope.”
habryka 2 Oct 2020 23:44 UTC
4 points
2
This description is old and should be properly merged with what the posts tagged should actually mean.

Value Learning

References

See Also