Value Learning

Last edit: 19 Mar 2023 21:29 UTC by Diabloto96

Value learning is a proposed method for incorporating human values into an AGI. It involves creating an artificial learner whose actions take many possible sets of values and preferences into account, weighted by their likelihood. Value learning could prevent an AGI from having goals detrimental to human values, thereby helping in the creation of Friendly AI.

Although there are many proposed ways to incorporate human values into an AGI (e.g. Coherent Extrapolated Volition, Coherent Aggregated Volition, and Coherent Blended Volition), this method is most directly developed in Daniel Dewey’s paper ‘Learning What to Value’. Like most authors, he assumes that human goals would not naturally arise in an artificial agent and must be deliberately instilled. Dewey first argues against solving the problem with simple reinforcement learning, on the grounds that it leads to the maximization of specific rewards, which can diverge from value maximization. For example, even if we carefully engineer the agent to maximize rewards that also maximize human values, the agent could alter its environment to produce those same rewards more easily, without the trouble of also maximizing human values (e.g. if the reward were human happiness, it could alter the human mind so that it became happy with anything).

To address these problems, Dewey proposes an agent that maximizes expected utility over all possible utility functions, weighted by their probabilities: “[W]e propose uncertainty over utility functions. Instead of providing an agent one utility function up front, we provide an agent with a pool of possible utility functions and a probability distribution P such that each utility function can be assigned probability P(U|yxm) given a particular interaction history [yxm]. An agent can then calculate an expected value over possible utility functions given a particular interaction history.” He concludes by saying that, although this method solves many of the problems mentioned, it still leaves many open questions; however, it should provide a direction for future work.
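The core of Dewey’s proposal, expected value over a pool of candidate utility functions, can be sketched in a few lines. This is a toy illustration, not Dewey’s formalism: the function names, the two hand-written candidate utility functions, and their probabilities are all hypothetical, and the full proposal additionally conditions the distribution P on the interaction history, which is omitted here.

```python
def expected_value(action, candidates):
    """Expected utility of an action under uncertainty over utility functions.

    candidates: list of (probability, utility_function) pairs, where the
    probabilities play the role of P(U|yxm) after some interaction history.
    """
    return sum(p * u(action) for p, u in candidates)

def choose_action(actions, candidates):
    """Pick the action maximizing expected utility over the pool."""
    return max(actions, key=lambda a: expected_value(a, candidates))

# Hypothetical pool: one candidate values genuinely helping humans,
# the other values the raw reward signal (a wireheading-style utility).
candidates = [
    (0.7, lambda a: {"help": 1.0, "wirehead": 0.0}[a]),
    (0.3, lambda a: {"help": 0.2, "wirehead": 1.0}[a]),
]

best = choose_action(["help", "wirehead"], candidates)
# expected_value("help") = 0.7*1.0 + 0.3*0.2 = 0.76
# expected_value("wirehead") = 0.7*0.0 + 0.3*1.0 = 0.30
```

Because no single utility function is fixed up front, an action that scores highly only under the wireheading candidate is discounted by that candidate’s probability, which is the intuition behind using utility uncertainty against reward hacking.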


See Also

The easy goal inference problem is still hard (paulfchristiano, 3 Nov 2018 14:41 UTC, 52 points, 19 comments, 4 min read)

Humans can be assigned any values whatsoever… (Stuart_Armstrong, 5 Nov 2018 14:26 UTC, 56 points, 26 comments, 4 min read)

Ambitious vs. narrow value learning (paulfchristiano, 12 Jan 2019 6:18 UTC, 24 points, 16 comments, 4 min read)

Model Mis-specification and Inverse Reinforcement Learning (9 Nov 2018 15:33 UTC, 33 points, 3 comments, 16 min read)

Conclusion to the sequence on value learning (Rohin Shah, 3 Feb 2019 21:05 UTC, 49 points, 20 comments, 5 min read)

Intuitions about goal-directed behavior (Rohin Shah, 1 Dec 2018 4:25 UTC, 54 points, 15 comments, 6 min read)

abramdemski (18 Nov 2020 16:52 UTC, 46 points, 11 comments, 9 min read)

What is ambitious value learning? (Rohin Shah, 1 Nov 2018 16:20 UTC, 50 points, 28 comments, 2 min read)

Learning human preferences: black-box, white-box, and structured white-box access (Stuart_Armstrong, 24 Aug 2020 11:42 UTC, 26 points, 9 comments, 6 min read)

What’s the dream for giving natural language commands to AI? (Charlie Steiner, 8 Oct 2019 13:42 UTC, 8 points, 8 comments, 7 min read)

Constraints from naturalized ethics. (Charlie Steiner, 25 Jul 2020 14:54 UTC, 21 points, 0 comments, 3 min read)

Values, Valence, and Alignment (Gordon Seidoh Worley, 5 Dec 2019 21:06 UTC, 12 points, 4 comments, 13 min read)

The two-layer model of human values, and problems with synthesizing preferences (Kaj_Sotala, 24 Jan 2020 15:17 UTC, 69 points, 16 comments, 9 min read)

Towards deconfusing values (Gordon Seidoh Worley, 29 Jan 2020 19:28 UTC, 12 points, 4 comments, 7 min read)

Sunday July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stuart_Armstrong (8 Jul 2020 0:27 UTC, 19 points, 2 comments, 1 min read)

Value Uncertainty and the Singleton Scenario (Wei_Dai, 24 Jan 2010 5:03 UTC, 10 points, 31 comments, 3 min read)

2018 AI Alignment Literature Review and Charity Comparison (Larks, 18 Dec 2018 4:46 UTC, 190 points, 26 comments, 62 min read, 1 review)

Recursive Quantilizers II (abramdemski, 2 Dec 2020 15:26 UTC, 30 points, 15 comments, 13 min read)

The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables (johnswentworth, 18 Nov 2020 17:47 UTC, 116 points, 45 comments, 11 min read, 2 reviews)

2019 AI Alignment Literature Review and Charity Comparison (Larks, 19 Dec 2019 3:00 UTC, 130 points, 18 comments, 62 min read)

Mental subagent implications for AI Safety (moridinamael, 3 Jan 2021 18:59 UTC, 11 points, 0 comments, 3 min read)

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah (Palus Astra, 16 Apr 2020 0:50 UTC, 58 points, 27 comments, 89 min read)

Introduction to Reducing Goodhart (Charlie Steiner, 26 Aug 2021 18:38 UTC, 40 points, 10 comments, 4 min read)

Goodhart Ethology (Charlie Steiner, 17 Sep 2021 17:31 UTC, 19 points, 4 comments, 14 min read)

The Dark Side of Cognition Hypothesis (Cameron Berg, 3 Oct 2021 20:10 UTC, 19 points, 1 comment, 16 min read)

Morally underdefined situations can be deadly (Stuart_Armstrong, 22 Nov 2021 14:48 UTC, 17 points, 8 comments, 2 min read)

How an alien theory of mind might be unlearnable (Stuart_Armstrong, 3 Jan 2022 11:16 UTC, 26 points, 35 comments, 5 min read)

Learning Values in Practice (Stuart_Armstrong, 20 Jul 2020 18:38 UTC, 24 points, 0 comments, 5 min read)

Value extrapolation, concept extrapolation, model splintering (Stuart_Armstrong, 8 Mar 2022 22:50 UTC, 14 points, 1 comment, 2 min read)

Natural Value Learning (Chris van Merwijk, 20 Mar 2022 12:44 UTC, 7 points, 10 comments, 4 min read)

AIs should learn human preferences, not biases (Stuart_Armstrong, 8 Apr 2022 13:45 UTC, 10 points, 0 comments, 1 min read)

Different perspectives on concept extrapolation (Stuart_Armstrong, 8 Apr 2022 10:42 UTC, 43 points, 7 comments, 5 min read)

The Pointers Problem—Distilled (NinaR, 26 May 2022 22:44 UTC, 11 points, 0 comments, 2 min read)

Value extrapolation vs Wireheading (Stuart_Armstrong, 17 Jun 2022 15:02 UTC, 16 points, 1 comment, 1 min read)

AI Alignment 2018-19 Review (Rohin Shah, 28 Jan 2020 2:19 UTC, 126 points, 6 comments, 35 min read)

Latent Variables and Model Mis-Specification (jsteinhardt, 7 Nov 2018 14:48 UTC, 23 points, 7 comments, 9 min read)

Future directions for ambitious value learning (Rohin Shah, 11 Nov 2018 15:53 UTC, 46 points, 9 comments, 4 min read)

Preface to the sequence on value learning (Rohin Shah, 30 Oct 2018 22:04 UTC, 69 points, 6 comments, 3 min read)

What is narrow value learning? (Rohin Shah, 10 Jan 2019 7:05 UTC, 23 points, 3 comments, 2 min read)

Human-AI Interaction (Rohin Shah, 15 Jan 2019 1:57 UTC, 34 points, 10 comments, 4 min read)

LOVE in a simbox is all you need (jacob_cannell, 28 Sep 2022 18:25 UTC, 68 points, 69 comments, 44 min read)

Learning societal values from law as part of an AGI alignment strategy (John Nay, 21 Oct 2022 2:03 UTC, 4 points, 18 comments, 54 min read)

Reward uncertainty (Rohin Shah, 19 Jan 2019 2:16 UTC, 26 points, 3 comments, 5 min read)

Future directions for narrow value learning (Rohin Shah, 26 Jan 2019 2:36 UTC, 12 points, 4 comments, 4 min read)

Would I think for ten thousand years? (Stuart_Armstrong, 11 Feb 2019 19:37 UTC, 25 points, 13 comments, 1 min read)

Beyond algorithmic equivalence: self-modelling (Stuart_Armstrong, 28 Feb 2018 16:55 UTC, 10 points, 3 comments, 1 min read)

Beyond algorithmic equivalence: algorithmic noise (Stuart_Armstrong, 28 Feb 2018 16:55 UTC, 10 points, 4 comments, 2 min read)

Following human norms (Rohin Shah, 20 Jan 2019 23:59 UTC, 30 points, 10 comments, 5 min read)

Thoughts on implementing corrigible robust alignment (Steven Byrnes, 26 Nov 2019 14:06 UTC, 26 points, 2 comments, 6 min read)

Robust Delegation (4 Nov 2018 16:38 UTC, 115 points, 10 comments, 1 min read)

Comparing AI Alignment Approaches to Minimize False Positive Risk (Gordon Seidoh Worley, 30 Jun 2020 19:34 UTC, 5 points, 0 comments, 9 min read)

The Computational Anatomy of Human Values (beren, 6 Apr 2023 10:33 UTC, 62 points, 30 comments, 30 min read)

The self-unalignment problem (14 Apr 2023 12:10 UTC, 123 points, 21 comments, 10 min read)

Value Learning – Towards Resolving Confusion (PashaKamyshev, 24 Apr 2023 6:43 UTC, 4 points, 0 comments, 18 min read)

Deconfusing Human Values Research Agenda v1 (Gordon Seidoh Worley, 23 Mar 2020 16:25 UTC, 27 points, 12 comments, 4 min read)

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) (RogerDearnaley, 25 May 2023 9:26 UTC, 3 points, 0 comments, 15 min read)

[Question] Is Infra-Bayesianism Applicable to Value Learning? (RogerDearnaley, 11 May 2023 8:17 UTC, 2 points, 4 comments, 1 min read)

[Question] What is the relationship between Preference Learning and Value Learning? (Riccardo Volpato, 13 Jan 2020 21:08 UTC, 5 points, 2 comments, 1 min read)

AI Alignment Problem: “Human Values” don’t Actually Exist (avturchin, 22 Apr 2019 9:23 UTC, 42 points, 29 comments, 43 min read)

Minimization of prediction error as a foundation for human values in AI alignment (Gordon Seidoh Worley, 9 Oct 2019 18:23 UTC, 15 points, 42 comments, 5 min read)

Can few-shot learning teach AI right from wrong? (Charlie Steiner, 20 Jul 2018 7:45 UTC, 13 points, 3 comments, 6 min read)

Humans aren’t agents—what then for value learning? (Charlie Steiner, 15 Mar 2019 22:01 UTC, 21 points, 14 comments, 3 min read)

Value learning for moral essentialists (Charlie Steiner, 6 May 2019 9:05 UTC, 11 points, 3 comments, 3 min read)

Training human models is an unsolved problem (Charlie Steiner, 10 May 2019 7:17 UTC, 13 points, 3 comments, 4 min read)

Can we make peace with moral indeterminacy? (Charlie Steiner, 3 Oct 2019 12:56 UTC, 16 points, 8 comments, 3 min read)

The AI is the model (Charlie Steiner, 4 Oct 2019 8:11 UTC, 14 points, 1 comment, 3 min read)

[Hebbian Natural Abstractions] Mathematical Foundations (25 Dec 2022 20:58 UTC, 15 points, 2 comments, 6 min read)

Morphological intelligence, superhuman empathy, and ethical arbitration (Roman Leventov, 13 Feb 2023 10:25 UTC, 1 point, 0 comments, 2 min read)

The Linguistic Blind Spot of Value-Aligned Agency, Natural and Artificial (Roman Leventov, 14 Feb 2023 6:57 UTC, 6 points, 0 comments, 2 min read)

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning (Roger Dearnaley, 21 Feb 2023 9:05 UTC, 10 points, 1 comment, 23 min read)

Just How Hard a Problem is Alignment? (Roger Dearnaley, 25 Feb 2023 9:00 UTC, −1 points, 1 comment, 21 min read)

[AN #69] Stuart Russell’s new book on why we need to replace the standard model of AI (Rohin Shah, 19 Oct 2019 0:30 UTC, 60 points, 12 comments, 15 min read)

(A Failed Approach) From Precedent to Utility Function (Akira Pyinya, 29 Apr 2023 21:55 UTC, 0 points, 2 comments, 4 min read)

[Question] “Fragility of Value” vs. LLMs (Not Relevant, 13 Apr 2022 2:02 UTC, 32 points, 33 comments, 1 min read)

Using vector fields to visualise preferences and make them consistent (28 Jan 2020 19:44 UTC, 41 points, 32 comments, 11 min read)

Value uncertainty (MichaelA, 29 Jan 2020 20:16 UTC, 19 points, 3 comments, 14 min read)

Moral uncertainty: What kind of ‘should’ is involved? (MichaelA, 13 Jan 2020 12:13 UTC, 14 points, 11 comments, 13 min read)

Moral uncertainty vs related concepts (MichaelA, 11 Jan 2020 10:03 UTC, 26 points, 13 comments, 16 min read)

Morality vs related concepts (MichaelA, 7 Jan 2020 10:47 UTC, 26 points, 17 comments, 8 min read)

Making decisions when both morally and empirically uncertain (MichaelA, 2 Jan 2020 7:20 UTC, 13 points, 14 comments, 20 min read)

Making decisions under moral uncertainty (MichaelA, 30 Dec 2019 1:49 UTC, 15 points, 26 comments, 17 min read)

Research ideas to study humans with AI Safety in mind (Riccardo Volpato, 3 Jul 2020 16:01 UTC, 23 points, 2 comments, 5 min read)

The E-Coli Test for AI Alignment (johnswentworth, 16 Dec 2018 8:10 UTC, 70 points, 24 comments, 1 min read)

Have you felt exiert yet? (Stuart_Armstrong, 5 Jan 2018 17:03 UTC, 28 points, 7 comments, 1 min read)

Why we need a *theory* of human values (Stuart_Armstrong, 5 Dec 2018 16:00 UTC, 66 points, 15 comments, 4 min read)

But exactly how complex and fragile? (KatjaGrace, 3 Nov 2019 18:20 UTC, 73 points, 32 comments, 3 min read, 1 review)

Clarifying “AI Alignment” (paulfchristiano, 15 Nov 2018 14:41 UTC, 65 points, 82 comments, 3 min read, 2 reviews)

Hacking the CEV for Fun and Profit (Wei_Dai, 3 Jun 2010 20:30 UTC, 77 points, 207 comments, 1 min read)

Using lying to detect human values (Stuart_Armstrong, 15 Mar 2018 11:37 UTC, 19 points, 6 comments, 1 min read)

The Urgent Meta-Ethics of Friendly Artificial Intelligence (lukeprog, 1 Feb 2011 14:15 UTC, 76 points, 252 comments, 1 min read)

Resolving human values, completely and adequately (Stuart_Armstrong, 30 Mar 2018 3:35 UTC, 32 points, 30 comments, 12 min read)

Learning preferences by looking at the world (Rohin Shah, 12 Feb 2019 22:25 UTC, 43 points, 10 comments, 7 min read)

Non-Consequentialist Cooperation? (abramdemski, 11 Jan 2019 9:15 UTC, 49 points, 15 comments, 7 min read)

Stable Pointers to Value: An Agent Embedded in Its Own Utility Function (abramdemski, 17 Aug 2017 0:22 UTC, 15 points, 9 comments, 5 min read)

Stable Pointers to Value II: Environmental Goals (abramdemski, 9 Feb 2018 6:03 UTC, 18 points, 2 comments, 4 min read)

Stable Pointers to Value III: Recursive Quantilization (abramdemski, 21 Jul 2018 8:06 UTC, 19 points, 4 comments, 4 min read)

Policy Alignment (abramdemski, 30 Jun 2018 0:24 UTC, 50 points, 25 comments, 8 min read)

Where do selfish values come from? (Wei_Dai, 18 Nov 2011 23:52 UTC, 58 points, 62 comments, 2 min read)

Acknowledging Human Preference Types to Support Value Learning (Nandi Sabrina Erin, 13 Nov 2018 18:57 UTC, 34 points, 4 comments, 9 min read)

Coherence arguments do not entail goal-directed behavior (Rohin Shah, 3 Dec 2018 3:26 UTC, 110 points, 69 comments, 7 min read, 3 reviews)

Mahatma Armstrong: CEVed to death. (Stuart_Armstrong, 6 Jun 2013 12:50 UTC, 33 points, 62 comments, 2 min read)

misc raw responses to a tract of Critical Rationalism (mako yass, 14 Aug 2020 11:53 UTC, 21 points, 52 comments, 3 min read)

How to get value learning and reference wrong (Charlie Steiner, 26 Feb 2019 20:22 UTC, 37 points, 2 comments, 6 min read)

[Question] Since figuring out human values is hard, what about, say, monkey values? (shminux, 1 Jan 2020 21:56 UTC, 37 points, 13 comments, 1 min read)

Two questions about CEV that worry me (cousin_it, 23 Dec 2010 15:58 UTC, 37 points, 141 comments, 1 min read)

Cake, or death! (Stuart_Armstrong, 25 Oct 2012 10:33 UTC, 47 points, 13 comments, 4 min read)

Applying utility functions to humans considered harmful (Kaj_Sotala, 3 Feb 2010 19:22 UTC, 36 points, 116 comments, 5 min read)

Agents That Learn From Human Behavior Can’t Learn Human Values That Humans Haven’t Learned Yet (steven0461, 11 Jul 2018 2:59 UTC, 27 points, 11 comments, 1 min read)

Full toy model for preference learning (Stuart_Armstrong, 16 Oct 2019 11:06 UTC, 20 points, 2 comments, 12 min read)

Rigging is a form of wireheading (Stuart_Armstrong, 3 May 2018 12:50 UTC, 11 points, 2 comments, 1 min read)

ISO: Name of Problem (johnswentworth, 24 Jul 2018 17:15 UTC, 28 points, 15 comments, 1 min read)

Superintelligence 14: Motivation selection methods (KatjaGrace, 16 Dec 2014 2:00 UTC, 9 points, 28 comments, 5 min read)

Superintelligence 20: The value-loading problem (KatjaGrace, 27 Jan 2015 2:00 UTC, 8 points, 21 comments, 6 min read)

Superintelligence 21: Value learning (KatjaGrace, 3 Feb 2015 2:01 UTC, 12 points, 33 comments, 4 min read)

Superintelligence 25: Components list for acquiring values (KatjaGrace, 3 Mar 2015 2:01 UTC, 11 points, 12 comments, 8 min read)

Humans can be assigned any values whatsoever... (Stuart_Armstrong, 13 Oct 2017 11:29 UTC, 15 points, 6 comments, 4 min read)

How much can value learning be disentangled? (Stuart_Armstrong, 29 Jan 2019 14:17 UTC, 22 points, 30 comments, 2 min read)

Other versions of “No free lunch in value learning” (Stuart_Armstrong, 25 Feb 2020 14:25 UTC, 28 points, 0 comments, 1 min read)

Deliberation as a method to find the “actual preferences” of humans (riceissa, 22 Oct 2019 9:23 UTC, 23 points, 5 comments, 9 min read)

Practical consequences of impossibility of value learning (Stuart_Armstrong, 2 Aug 2019 23:06 UTC, 22 points, 13 comments, 3 min read)

Communication Prior as Alignment Strategy (johnswentworth, 12 Nov 2020 22:06 UTC, 40 points, 8 comments, 6 min read)

One could be forgiven for getting the feeling... (HumaneAutomation, 3 Nov 2020 4:53 UTC, −2 points, 2 comments, 1 min read)

Rationalising humans: another mugging, but not Pascal’s (Stuart_Armstrong, 14 Nov 2017 15:46 UTC, 7 points, 1 comment, 3 min read)

AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy (xuan, 1 Jan 2021 0:08 UTC, 30 points, 21 comments, 20 min read)

An Open Philanthropy grant proposal: Causal representation learning of human preferences (PabloAMC, 11 Jan 2022 11:28 UTC, 19 points, 6 comments, 8 min read)

Value extrapolation partially resolves symbol grounding (Stuart_Armstrong, 12 Jan 2022 16:30 UTC, 24 points, 10 comments, 1 min read)

Updated Deference is not a strong argument against the utility uncertainty approach to alignment (Ivan Vendrov, 24 Jun 2022 19:32 UTC, 26 points, 8 comments, 4 min read)

How I think about alignment (Linda Linsefors, 13 Aug 2022 10:01 UTC, 30 points, 11 comments, 5 min read)

Broad Picture of Human Values (Thane Ruthenis, 20 Aug 2022 19:42 UTC, 36 points, 5 comments, 10 min read)

Help Understanding Preferences And Evil (Netcentrica, 27 Aug 2022 3:42 UTC, 6 points, 7 comments, 2 min read)

Solving Alignment by “solving” semantics (Q Home, 27 Aug 2022 4:17 UTC, 14 points, 10 comments, 26 min read)

Can “Reward Economics” solve AI Alignment? (Q Home, 7 Sep 2022 7:58 UTC, 3 points, 15 comments, 18 min read)

What Should AI Owe To Us? Accountable and Aligned AI Systems via Contractualist AI Alignment (xuan, 8 Sep 2022 15:04 UTC, 31 points, 15 comments, 25 min read)

Leveraging Legal Informatics to Align AI (John Nay, 18 Sep 2022 20:39 UTC, 11 points, 0 comments, 3 min read)

Character alignment (p.b., 20 Sep 2022 8:27 UTC, 22 points, 0 comments, 2 min read)

[Hebbian Natural Abstractions] Introduction (21 Nov 2022 20:34 UTC, 34 points, 3 comments, 4 min read)

The Opportunity and Risks of Learning Human Values In-Context (Zachary Robertson, 10 Dec 2022 21:40 UTC, 2 points, 4 comments, 5 min read)

[Question] [DISC] Are Values Robust? (DragonGod, 21 Dec 2022 1:00 UTC, 12 points, 8 comments, 2 min read)