Value Learning

Tag. Last edit: Dec 30, 2024, 10:05 AM by Dakara

Value Learning is a proposed method for incorporating human values in an AGI. It involves the creation of an artificial learner whose actions consider many possible sets of values and preferences, weighted by their likelihood. Value learning could prevent an AGI from having goals detrimental to human values, hence helping in the creation of Friendly AI.

Many ways have been proposed to incorporate human values in an AGI (e.g. Coherent Extrapolated Volition, Coherent Aggregated Volition and Coherent Blended Volition, mostly proposed around 2004-2010). Value learning was suggested in 2011 by Daniel Dewey in ‘Learning What to Value’. Like most authors, he assumes that an artificial agent needs to be intentionally aligned to human goals. Dewey argues against the simple use of reinforcement learning to solve this problem, on the basis that it leads to the maximization of specific rewards that can diverge from value maximization. For example, such an agent could suffer from goal misspecification or reward hacking. He instead proposes a utility function maximizer comparable to AIXI, which considers all possible utility functions weighted by their Bayesian probabilities: “[W]e propose uncertainty over utility functions. Instead of providing an agent one utility function up front, we provide an agent with a pool of possible utility functions and a probability distribution P such that each utility function can be assigned probability P(U | yx_{≤m}) given a particular interaction history [yx_{≤m}]. An agent can then calculate an expected value over possible utility functions given a particular interaction history.”
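The idea of taking an expected value over a pool of candidate utility functions can be illustrated with a toy sketch. This is not Dewey's formalism or any implementation from the paper: the two-element pool, the uniform prior, the "smile"/"frown" reactions, the noise level, and all function names are assumptions made up for illustration. As a further simplification, the one-step look-ahead conditions on the history observed so far, rather than on possible future observations.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]   # (agent action, human reaction); labels are hypothetical
History = List[Step]

ACTIONS = ("tea", "coffee")

# Hypothetical pool of utility functions: each scores a whole interaction history.
def u_tea(history: History) -> float:
    return float(sum(1 for action, _ in history if action == "tea"))

def u_coffee(history: History) -> float:
    return float(sum(1 for action, _ in history if action == "coffee"))

POOL = {"tea": u_tea, "coffee": u_coffee}
PRIOR = {"tea": 0.5, "coffee": 0.5}   # assumed uniform prior over the pool

def likelihood(name: str, history: History, noise: float = 0.1) -> float:
    """P(reactions | U): assume a human holding utility `name` smiles at the
    matching action and frowns at the other, except with probability `noise`."""
    p = 1.0
    for action, reaction in history:
        agrees = (action == name) == (reaction == "smile")
        p *= (1.0 - noise) if agrees else noise
    return p

def posterior(history: History) -> Dict[str, float]:
    """P(U | history) by Bayes' rule over the finite pool."""
    weights = {n: PRIOR[n] * likelihood(n, history) for n in POOL}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

def expected_value(candidate: History, evidence: History) -> float:
    """Sum over U of P(U | evidence) * U(candidate)."""
    probs = posterior(evidence)
    return sum(probs[n] * POOL[n](candidate) for n in POOL)

def best_action(history: History) -> str:
    """One-step look-ahead: pick the action whose extended history scores best."""
    return max(ACTIONS, key=lambda a: expected_value(history + [(a, "")], history))

if __name__ == "__main__":
    observed = [("tea", "smile"), ("coffee", "frown")]
    print(posterior(observed))
    print(best_action(observed))
```

The point of the sketch is only that the agent never commits to a single utility function: evidence about the human shifts the probability mass within the pool, and the agent's choice follows the resulting weighted average.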

Nick Bostrom also discusses value learning at length in his book Superintelligence. Value learning is closely related to various proposals for AI-assisted Alignment and for AI-assisted or AI-automated Alignment research. Since human values are complex and fragile, learning human values well is a challenging problem, much like AI-assisted Alignment (but in a less supervised setting, so actually harder). So this is only a practicable alignment technique for an AGI capable of successfully performing a STEM research program (in anthropology). Thus value learning is, unusually, an alignment technique that improves as capabilities increase, and it requires roughly AGI-level capabilities as a minimum threshold before it begins to be effective.

One potential challenge is that human values are somewhat mutable and AGI could affect them.

References

Dewey’s paper

The easy goal inference problem is still hard

paulfchristianoNov 3, 2018, 2:41 PM

62 points

20 comments4 min readLW link

Ambitious vs. narrow value learning

paulfchristianoJan 12, 2019, 6:18 AM

31 points

16 comments4 min readLW link

Humans can be assigned any values whatsoever…

Stuart_ArmstrongNov 5, 2018, 2:26 PM

54 points

27 comments4 min readLW link

Model Mis-specification and Inverse Reinforcement Learning

Owain_Evans and jsteinhardt

Nov 9, 2018, 3:33 PM

34 points

3 comments16 min readLW link

Conclusion to the sequence on value learning

Rohin ShahFeb 3, 2019, 9:05 PM

51 points

20 comments5 min readLW link

Intuitions about goal-directed behavior

Rohin ShahDec 1, 2018, 4:25 AM

54 points

15 comments6 min readLW link

6. The Mutable Values Problem in Value Learning and CEV

RogerDearnaleyDec 4, 2023, 6:31 PM

12 points

0 comments49 min readLW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaleyFeb 14, 2024, 7:10 AM

41 points

12 comments31 min readLW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaleyFeb 1, 2024, 9:15 PM

16 points

15 comments13 min readLW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaleyMay 25, 2023, 9:26 AM

33 points

3 comments15 min readLW link

Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect

RogerDearnaleyJan 26, 2024, 3:58 AM

16 points

2 comments11 min readLW link

What is ambitious value learning?

Rohin ShahNov 1, 2018, 4:20 PM

55 points

28 comments2 min readLW link

Normativity

abramdemskiNov 18, 2020, 4:52 PM

47 points

11 comments9 min readLW link

AI Alignment Problem: “Human Values” don’t Actually Exist

avturchinApr 22, 2019, 9:23 AM

45 points

29 comments43 min readLW link

Learning human preferences: black-box, white-box, and structured white-box access

Stuart_ArmstrongAug 24, 2020, 11:42 AM

26 points

9 comments6 min readLW link

Morally underdefined situations can be deadly

Stuart_ArmstrongNov 22, 2021, 2:48 PM

17 points

8 comments2 min readLW link

Humans aren’t agents—what then for value learning?

Charlie SteinerMar 15, 2019, 10:01 PM

28 points

16 comments3 min readLW link

Future directions for narrow value learning

Rohin ShahJan 26, 2019, 2:36 AM

12 points

4 comments4 min readLW link

Mental subagent implications for AI Safety

moridinamaelJan 3, 2021, 6:59 PM

11 points

0 comments3 min readLW link

The Computational Anatomy of Human Values

berenApr 6, 2023, 10:33 AM

74 points

30 comments30 min readLW link

The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables

johnswentworthNov 18, 2020, 5:47 PM

129 points

50 comments11 min readLW link 2 reviews

Introduction to Reducing Goodhart

Charlie SteinerAug 26, 2021, 6:38 PM

48 points

10 comments4 min readLW link

[Question] What is the relationship between Preference Learning and Value Learning?

Riccardo VolpatoJan 13, 2020, 9:08 PM

5 points

2 comments1 min readLW link

Learning societal values from law as part of an AGI alignment strategy

John NayOct 21, 2022, 2:03 AM

5 points

18 comments54 min readLW link

Beyond algorithmic equivalence: algorithmic noise

Stuart_ArmstrongFeb 28, 2018, 4:55 PM

10 points

4 comments2 min readLW link

But exactly how complex and fragile?

KatjaGraceNov 3, 2019, 6:20 PM

87 points

32 comments3 min readLW link 1 review

(meteuphoric.com)

Future directions for ambitious value learning

Rohin ShahNov 11, 2018, 3:53 PM

48 points

9 comments4 min readLW link

What is narrow value learning?

Rohin ShahJan 10, 2019, 7:05 AM

23 points

3 comments2 min readLW link

How an alien theory of mind might be unlearnable

Stuart_ArmstrongJan 3, 2022, 11:16 AM

29 points

35 comments5 min readLW link

AI Alignment 2018-19 Review

Rohin ShahJan 28, 2020, 2:19 AM

126 points

6 comments35 min readLW link

Towards deconfusing values

Gordon Seidoh WorleyJan 29, 2020, 7:28 PM

12 points

4 comments7 min readLW link

[Question] Is Infra-Bayesianism Applicable to Value Learning?

RogerDearnaleyMay 11, 2023, 8:17 AM

5 points

4 comments1 min readLW link

Latent Variables and Model Mis-Specification

jsteinhardtNov 7, 2018, 2:48 PM

24 points

8 comments9 min readLW link

What’s the dream for giving natural language commands to AI?

Charlie SteinerOct 8, 2019, 1:42 PM

14 points

8 comments7 min readLW link

The two-layer model of human values, and problems with synthesizing preferences

Kaj_SotalaJan 24, 2020, 3:17 PM

70 points

16 comments9 min readLW link

Following human norms

Rohin ShahJan 20, 2019, 11:59 PM

30 points

10 comments5 min readLW link

Recursive Quantilizers II

abramdemskiDec 2, 2020, 3:26 PM

30 points

15 comments13 min readLW link

Different perspectives on concept extrapolation

Stuart_ArmstrongApr 8, 2022, 10:42 AM

48 points

8 comments5 min readLW link 1 review

AI Constitutions are a tool to reduce societal scale risk

Sammy MartinJul 25, 2024, 11:18 AM

30 points

2 comments18 min readLW link

Value learning for moral essentialists

Charlie SteinerMay 6, 2019, 9:05 AM

11 points

3 comments3 min readLW link

Natural Value Learning

Chris van MerwijkMar 20, 2022, 12:44 PM

7 points

10 comments4 min readLW link

Value systematization: how values become coherent (and misaligned)

Richard_NgoOct 27, 2023, 7:06 PM

103 points

49 comments13 min readLW link

Value Uncertainty and the Singleton Scenario

Wei DaiJan 24, 2010, 5:03 AM

13 points

31 comments3 min readLW link

Deconfusing Human Values Research Agenda v1

Gordon Seidoh WorleyMar 23, 2020, 4:25 PM

28 points

12 comments4 min readLW link

Human-AI Interaction

Rohin ShahJan 15, 2019, 1:57 AM

34 points

10 comments4 min readLW link

Robust Delegation

abramdemski and Scott Garrabrant

Nov 4, 2018, 4:38 PM

116 points

10 comments1 min readLW link

Can few-shot learning teach AI right from wrong?

Charlie SteinerJul 20, 2018, 7:45 AM

13 points

3 comments6 min readLW link

Values, Valence, and Alignment

Gordon Seidoh WorleyDec 5, 2019, 9:06 PM

12 points

4 comments13 min readLW link

2018 AI Alignment Literature Review and Charity Comparison

LarksDec 18, 2018, 4:46 AM

190 points

26 comments62 min readLW link 1 review

Value Learning – Towards Resolving Confusion

PashaKamyshevApr 24, 2023, 6:43 AM

4 points

0 comments18 min readLW link

Goodhart Ethology

Charlie SteinerSep 17, 2021, 5:31 PM

20 points

4 comments14 min readLW link

Preface to the sequence on value learning

Rohin ShahOct 30, 2018, 10:04 PM

70 points

6 comments3 min readLW link

Value extrapolation vs Wireheading

Stuart_ArmstrongJun 17, 2022, 3:02 PM

16 points

1 comment1 min readLW link

Resolving von Neumann-Morgenstern Inconsistent Preferences

niplavOct 22, 2024, 11:45 AM

38 points

5 comments58 min readLW link

Thoughts on implementing corrigible robust alignment

Steven ByrnesNov 26, 2019, 2:06 PM

26 points

2 comments6 min readLW link

Can we make peace with moral indeterminacy?

Charlie SteinerOct 3, 2019, 12:56 PM

16 points

8 comments4 min readLW link

Sunday July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stuart_Armstrong

Bird Concept and Ben Pace

Jul 8, 2020, 12:27 AM

19 points

2 comments1 min readLW link

The self-unalignment problem

Jan_Kulveit and rosehadshar

Apr 14, 2023, 12:10 PM

155 points

24 comments10 min readLW link

Constraints from naturalized ethics.

Charlie SteinerJul 25, 2020, 2:54 PM

21 points

0 comments3 min readLW link

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

Palus AstraApr 16, 2020, 12:50 AM

58 points

27 comments89 min readLW link

2019 AI Alignment Literature Review and Charity Comparison

LarksDec 19, 2019, 3:00 AM

130 points

18 comments62 min readLW link

Would I think for ten thousand years?

Stuart_ArmstrongFeb 11, 2019, 7:37 PM

25 points

13 comments1 min readLW link

Humans can be assigned any values whatsoever...

Stuart_ArmstrongOct 24, 2017, 12:03 PM

3 points

1 comment4 min readLW link

Comparing AI Alignment Approaches to Minimize False Positive Risk

Gordon Seidoh WorleyJun 30, 2020, 7:34 PM

5 points

0 comments9 min readLW link

Value extrapolation, concept extrapolation, model splintering

Stuart_ArmstrongMar 8, 2022, 10:50 PM

16 points

1 comment2 min readLW link

LOVE in a simbox is all you need

jacob_cannellSep 28, 2022, 6:25 PM

66 points

73 comments44 min readLW link 1 review

Minimization of prediction error as a foundation for human values in AI alignment

Gordon Seidoh WorleyOct 9, 2019, 6:23 PM

15 points

42 comments5 min readLW link

AIs should learn human preferences, not biases

Stuart_ArmstrongApr 8, 2022, 1:45 PM

10 points

0 comments1 min readLW link

Beyond algorithmic equivalence: self-modelling

Stuart_ArmstrongFeb 28, 2018, 4:55 PM

10 points

3 comments1 min readLW link

Reward uncertainty

Rohin ShahJan 19, 2019, 2:16 AM

26 points

3 comments5 min readLW link

The AI is the model

Charlie SteinerOct 4, 2019, 8:11 AM

14 points

1 comment3 min readLW link

Learning Values in Practice

Stuart_ArmstrongJul 20, 2020, 6:38 PM

24 points

0 comments5 min readLW link

Training human models is an unsolved problem

Charlie SteinerMay 10, 2019, 7:17 AM

13 points

3 comments4 min readLW link

The Dark Side of Cognition Hypothesis

Cameron BergOct 3, 2021, 8:10 PM

19 points

1 comment16 min readLW link

Evaluating the historical value misspecification argument

Matthew BarnettOct 5, 2023, 6:34 PM

190 points

162 comments7 min readLW link 3 reviews

Superintelligence 21: Value learning

KatjaGraceFeb 3, 2015, 2:01 AM

12 points

33 comments4 min readLW link

Making decisions when both morally and empirically uncertain

MichaelAJan 2, 2020, 7:20 AM

13 points

14 comments20 min readLW link

Value uncertainty

MichaelAJan 29, 2020, 8:16 PM

20 points

3 comments14 min readLW link

Stable Pointers to Value II: Environmental Goals

abramdemskiFeb 9, 2018, 6:03 AM

19 points

3 comments4 min readLW link

Morphological intelligence, superhuman empathy, and ethical arbitration

Roman LeventovFeb 13, 2023, 10:25 AM

1 point

0 comments2 min readLW link

Research ideas to study humans with AI Safety in mind

Riccardo VolpatoJul 3, 2020, 4:01 PM

23 points

2 comments5 min readLW link

[Question] Since figuring out human values is hard, what about, say, monkey values?

ShmiJan 1, 2020, 9:56 PM

37 points

13 comments1 min readLW link

Other versions of “No free lunch in value learning”

Stuart_ArmstrongFeb 25, 2020, 2:25 PM

28 points

0 comments1 min readLW link

Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)

Roland PihlakasJan 12, 2025, 3:37 AM

46 points

7 comments10 min readLW link

Resolving human values, completely and adequately

Stuart_ArmstrongMar 30, 2018, 3:35 AM

32 points

30 comments12 min readLW link

Broad Picture of Human Values

Thane RuthenisAug 20, 2022, 7:42 PM

42 points

6 comments10 min readLW link

[AN #69] Stuart Russell’s new book on why we need to replace the standard model of AI

Rohin ShahOct 19, 2019, 12:30 AM

60 points

12 comments15 min readLW link

(mailchi.mp)

Singular learning theory and bridging from ML to brain emulations

kave and Garrett Baker

Nov 1, 2023, 9:31 PM

26 points

16 comments29 min readLW link

Hacking the CEV for Fun and Profit

Wei DaiJun 3, 2010, 8:30 PM

78 points

207 comments1 min readLW link

Full toy model for preference learning

Stuart_ArmstrongOct 16, 2019, 11:06 AM

20 points

2 comments12 min readLW link

[Question] “Fragility of Value” vs. LLMs

Not RelevantApr 13, 2022, 2:02 AM

34 points

33 comments1 min readLW link

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger DearnaleyFeb 21, 2023, 9:05 AM

10 points

1 comment23 min readLW link

Clarifying “AI Alignment”

paulfchristianoNov 15, 2018, 2:41 PM

67 points

84 comments3 min readLW link 2 reviews

misc raw responses to a tract of Critical Rationalism

mako yassAug 14, 2020, 11:53 AM

21 points

52 comments3 min readLW link

Rigging is a form of wireheading

Stuart_ArmstrongMay 3, 2018, 12:50 PM

11 points

2 comments1 min readLW link

Informal semantics and Orders

Q HomeAug 27, 2022, 4:17 AM

14 points

10 comments26 min readLW link

Communication Prior as Alignment Strategy

johnswentworthNov 12, 2020, 10:06 PM

46 points

8 comments6 min readLW link

What Should AI Owe To Us? Accountable and Aligned AI Systems via Contractualist AI Alignment

xuanSep 8, 2022, 3:04 PM

26 points

16 comments25 min readLW link

Value learning in the absence of ground truth

Joel_SaarinenFeb 5, 2024, 6:56 PM

47 points

8 comments45 min readLW link

Non-Consequentialist Cooperation?

abramdemskiJan 11, 2019, 9:15 AM

50 points

15 comments7 min readLW link

Applying utility functions to humans considered harmful

Kaj_SotalaFeb 3, 2010, 7:22 PM

36 points

116 comments5 min readLW link

Rationalising humans: another mugging, but not Pascal’s

Stuart_ArmstrongNov 14, 2017, 3:46 PM

7 points

1 comment3 min readLW link

Acknowledging Human Preference Types to Support Value Learning

NandiNov 13, 2018, 6:57 PM

34 points

4 comments9 min readLW link

Partial Identifiability in Reward Learning

Joar SkalseFeb 28, 2025, 7:23 PM

15 points

0 comments12 min readLW link

Practical consequences of impossibility of value learning

Stuart_ArmstrongAug 2, 2019, 11:06 PM

23 points

13 comments3 min readLW link

ACI#5: From Human-AI Co-evolution to the Evolution of Value Systems

Akira PyinyaAug 18, 2023, 12:38 AM

0 points

0 comments9 min readLW link

[Question] Exploring Values in the Future of AI and Humanity: A Path Forward

Lucian&SageOct 19, 2024, 11:37 PM

1 point

0 comments5 min readLW link

Open-ended ethics of phenomena (a desiderata with universal morality)

Ryo Nov 8, 2023, 8:10 PM

1 point

0 comments8 min readLW link

[Hebbian Natural Abstractions] Mathematical Foundations

Samuel Nellessen and Jan

Dec 25, 2022, 8:58 PM

15 points

2 comments6 min readLW link

(www.snellessen.com)

After Alignment — Dialogue between RogerDearnaley and Seth Herd

RogerDearnaley and Seth Herd

Dec 2, 2023, 6:03 AM

15 points

2 comments25 min readLW link

Just How Hard a Problem is Alignment?

Roger DearnaleyFeb 25, 2023, 9:00 AM

3 points

1 comment21 min readLW link

Using lying to detect human values

Stuart_ArmstrongMar 15, 2018, 11:37 AM

19 points

6 comments1 min readLW link

Character alignment

p.b.Sep 20, 2022, 8:27 AM

22 points

0 comments2 min readLW link

The Urgent Meta-Ethics of Friendly Artificial Intelligence

lukeprogFeb 1, 2011, 2:15 PM

75 points

252 comments1 min readLW link

Superintelligence 25: Components list for acquiring values

KatjaGraceMar 3, 2015, 2:01 AM

11 points

12 comments8 min readLW link

Leveraging Legal Informatics to Align AI

John NaySep 18, 2022, 8:39 PM

11 points

0 comments3 min readLW link

(forum.effectivealtruism.org)

Coherence arguments do not entail goal-directed behavior

Rohin ShahDec 3, 2018, 3:26 AM

134 points

69 comments7 min readLW link 3 reviews

Uncovering Latent Human Wellbeing in LLM Embeddings

ChengCheng, Pedro Freire, Dan H and Scott Emmons

Sep 14, 2023, 1:40 AM

32 points

7 comments8 min readLW link

(far.ai)

Where do selfish values come from?

Wei DaiNov 18, 2011, 11:52 PM

70 points

62 comments2 min readLW link

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaleyNov 17, 2023, 8:55 PM

17 points

8 comments15 min readLW link

Superintelligence 14: Motivation selection methods

KatjaGraceDec 16, 2014, 2:00 AM

9 points

28 comments5 min readLW link

Why we need a theory of human values

Stuart_ArmstrongDec 5, 2018, 4:00 PM

66 points

15 comments4 min readLW link

Updated Deference is not a strong argument against the utility uncertainty approach to alignment

Ivan VendrovJun 24, 2022, 7:32 PM

26 points

8 comments4 min readLW link

Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format

Roland Pihlakas, Sruthi Kuriakose and shrutidattagupta

Mar 16, 2025, 11:23 PM

37 points

6 comments7 min readLW link

Cake, or death!

Stuart_ArmstrongOct 25, 2012, 10:33 AM

47 points

13 comments4 min readLW link

Learning preferences by looking at the world

Rohin ShahFeb 12, 2019, 10:25 PM

43 points

10 comments7 min readLW link

(bair.berkeley.edu)

(A Failed Approach) From Precedent to Utility Function

Akira PyinyaApr 29, 2023, 9:55 PM

0 points

2 comments4 min readLW link

Model Integrity: MAI on Value Alignment

Jonas HallgrenDec 5, 2024, 5:11 PM

6 points

11 comments1 min readLW link

(meaningalignment.substack.com)

How I think about alignment

Linda LinseforsAug 13, 2022, 10:01 AM

31 points

11 comments5 min readLW link

Open-ended/Phenomenal Ethics (TLDR)

Ryo Nov 9, 2023, 4:58 PM

3 points

0 comments1 min readLW link

The Pointer Resolution Problem

JozdienFeb 16, 2024, 9:25 PM

41 points

20 comments3 min readLW link

Misspecification in Inverse Reinforcement Learning—Part II

Joar SkalseFeb 28, 2025, 7:24 PM

9 points

0 comments7 min readLW link

Morality vs related concepts

MichaelAJan 7, 2020, 10:47 AM

26 points

17 comments8 min readLW link

Towards building blocks of ontologies

Daniel C, Alex_Altair, Dalcy, Alfred Harwood and JoseFaustino

Feb 8, 2025, 4:03 PM

29 points

0 comments26 min readLW link

Other Papers About the Theory of Reward Learning

Joar SkalseFeb 28, 2025, 7:26 PM

16 points

0 comments5 min readLW link

ISO: Name of Problem

johnswentworthJul 24, 2018, 5:15 PM

28 points

15 comments1 min readLW link

Building AI safety benchmark environments on themes of universal human values

Roland PihlakasJan 3, 2025, 4:24 AM

18 points

3 comments8 min readLW link

(docs.google.com)

Stable Pointers to Value III: Recursive Quantilization

abramdemskiJul 21, 2018, 8:06 AM

20 points

4 comments4 min readLW link

Superintelligence 20: The value-loading problem

KatjaGraceJan 27, 2015, 2:00 AM

8 points

21 comments6 min readLW link

Atlas: Stress-Testing ASI Value Learning Through Grand Strategy Scenarios

NeilFoxFeb 17, 2025, 11:55 PM

1 point

0 comments2 min readLW link

An Open Philanthropy grant proposal: Causal representation learning of human preferences

PabloAMCJan 11, 2022, 11:28 AM

19 points

6 comments8 min readLW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar SkalseFeb 28, 2025, 7:20 PM

25 points

4 comments14 min readLW link

Moral uncertainty: What kind of ‘should’ is involved?

MichaelAJan 13, 2020, 12:13 PM

14 points

11 comments13 min readLW link

The Linguistic Blind Spot of Value-Aligned Agency, Natural and Artificial

Roman LeventovFeb 14, 2023, 6:57 AM

6 points

0 comments2 min readLW link

(arxiv.org)

How much can value learning be disentangled?

Stuart_ArmstrongJan 29, 2019, 2:17 PM

22 points

30 comments2 min readLW link

Policy Alignment

abramdemskiJun 30, 2018, 12:24 AM

54 points

25 comments8 min readLW link

Stable Pointers to Value: An Agent Embedded in Its Own Utility Function

abramdemskiAug 17, 2017, 12:22 AM

15 points

9 comments5 min readLW link

Humans can be assigned any values whatsoever...

Stuart_ArmstrongOct 13, 2017, 11:29 AM

16 points

6 comments4 min readLW link

Can “Reward Economics” solve AI Alignment?

Q HomeSep 7, 2022, 7:58 AM

3 points

15 comments18 min readLW link

2. AIs as Economic Agents

RogerDearnaleyNov 23, 2023, 7:07 AM

9 points

2 comments6 min readLW link

[Linkpost] Concept Alignment as a Prerequisite for Value Alignment

Bogdan Ionut CirsteaNov 4, 2023, 5:34 PM

27 points

0 comments1 min readLW link

(arxiv.org)

Have you felt exiert yet?

Stuart_ArmstrongJan 5, 2018, 5:03 PM

28 points

7 comments1 min readLW link

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

xuanJan 1, 2021, 12:08 AM

31 points

21 comments20 min readLW link

Help Understanding Preferences And Evil

NetcentricaAug 27, 2022, 3:42 AM

6 points

7 comments2 min readLW link

How to Contribute to Theoretical Reward Learning Research

Joar SkalseFeb 28, 2025, 7:27 PM

16 points

0 comments21 min readLW link

2023 Alignment Research Updates from FAR AI

AdamGleave and EuanMcLean

Dec 4, 2023, 10:32 PM

18 points

0 comments8 min readLW link

(far.ai)

Mahatma Armstrong: CEVed to death.

Stuart_ArmstrongJun 6, 2013, 12:50 PM

33 points

62 comments2 min readLW link

Compositional preference models for aligning LMs

Tomek KorbakOct 25, 2023, 12:17 PM

18 points

2 comments5 min readLW link

Shard Theory—is it true for humans?

RishikaJun 14, 2024, 7:21 PM

71 points

7 comments15 min readLW link

[Hebbian Natural Abstractions] Introduction

Samuel Nellessen and Jan

Nov 21, 2022, 8:34 PM

34 points

3 comments4 min readLW link

(www.snellessen.com)

[Question] [DISC] Are Values Robust?

DragonGodDec 21, 2022, 1:00 AM

12 points

9 comments2 min readLW link

Moral uncertainty vs related concepts

MichaelAJan 11, 2020, 10:03 AM

26 points

13 comments16 min readLW link

Using vector fields to visualise preferences and make them consistent

MichaelA and JustinShovelain

Jan 28, 2020, 7:44 PM

42 points

32 comments11 min readLW link

Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition

Adrià MoretDec 2, 2023, 2:07 PM

26 points

31 comments42 min readLW link

Agents That Learn From Human Behavior Can’t Learn Human Values That Humans Haven’t Learned Yet

steven0461Jul 11, 2018, 2:59 AM

28 points

11 comments1 min readLW link

Claude wants to be conscious

Joe KwonApr 13, 2024, 1:40 AM

2 points

8 comments6 min readLW link

One could be forgiven for getting the feeling...

HumaneAutomationNov 3, 2020, 4:53 AM

−2 points

2 comments1 min readLW link

Deliberation as a method to find the “actual preferences” of humans

riceissaOct 22, 2019, 9:23 AM

23 points

5 comments10 min readLW link

Value extrapolation partially resolves symbol grounding

Stuart_ArmstrongJan 12, 2022, 4:30 PM

24 points

10 comments1 min readLW link

How to get value learning and reference wrong

Charlie SteinerFeb 26, 2019, 8:22 PM

40 points

2 comments6 min readLW link

Misspecification in Inverse Reinforcement Learning

Joar SkalseFeb 28, 2025, 7:24 PM

19 points

0 comments11 min readLW link

Making decisions under moral uncertainty

MichaelADec 30, 2019, 1:49 AM

21 points

26 comments17 min readLW link

Two questions about CEV that worry me

cousin_itDec 23, 2010, 3:58 PM

37 points

141 comments1 min readLW link

The E-Coli Test for AI Alignment

johnswentworthDec 16, 2018, 8:10 AM

70 points

24 comments1 min readLW link

RogerDearnaley Dec 6, 2023, 7:43 AM
1 point
0
Two long paragraphs on Dewey’s original paper, followed by one short paragraph hidden below the fold on everything that has happened since, seems like an inappropriate balance. I’m inclined to edit the summary of Dewey’s paper down a little. Before I do, does anyone have a fundamental objection to this?
Brad Dunn Oct 17, 2023, 7:44 AM
1 point
0
Also worth mentioning: the concept of “value learning” is called out specifically in Nick Bostrom’s book Superintelligence, using the envelope puzzle, which goes something like this: “Suppose we write down a description of a set of values on a piece of paper. We fold that paper and put it in a sealed envelope. We then create an agent with human-level general intelligence and give it the following final goal: ‘Maximize the realisation of the values described in the envelope.’”
habryka Oct 2, 2020, 11:44 PM
4 points
2
This description is old and should be properly merged with what the posts tagged should actually mean.
