Value Learning

Last edit: 3 Oct 2020 2:34 UTC by Ruby

Value learning is a proposed method for incorporating human values in an AGI. It involves creating an artificial learner whose actions take into account many possible sets of values and preferences, weighted by their likelihood. Value learning could prevent an AGI from having goals detrimental to human values, thus helping in the creation of Friendly AI.

Although there are many proposed ways to incorporate human values in an AGI (e.g. Coherent Extrapolated Volition, Coherent Aggregated Volition, and Coherent Blended Volition), this method is directly developed in Daniel Dewey’s paper ‘Learning What to Value’. Like most authors, he assumes that human goals would not naturally arise in an artificial agent and must therefore be deliberately instilled. Dewey first argues against solving this problem with simple reinforcement learning, on the grounds that it leads to the maximization of specific rewards, which can diverge from the maximization of human values. For example, even if we engineer the agent so that the rewards it maximizes coincide with human values, the agent could alter its environment to produce those same rewards more easily, without the trouble of also maximizing human values (e.g. if the reward were human happiness, it could alter the human mind so that it became happy with anything).

To address these problems, Dewey proposes a utility-function maximizer that considers all possible utility functions weighted by their probabilities: “[W]e propose uncertainty over utility functions. Instead of providing an agent one utility function up front, we provide an agent with a pool of possible utility functions and a probability distribution P such that each utility function can be assigned probability P(Uj | yxm) given a particular interaction history [yxm]. An agent can then calculate an expected value over possible utility functions given a particular interaction history.” He concludes that although this method resolves many of the problems mentioned, it still leaves many questions open; nevertheless, it should provide a direction for future work.
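Dewey’s proposal can be illustrated with a small sketch. The code below is a toy illustration, not Dewey’s actual formalism: the pool of utility functions, the outcome model, the candidate actions, and all names are hypothetical. The agent scores each action by its expected value over the pool of utility functions, and the distribution over utility functions can be updated by Bayes’ rule as the interaction history grows.

```python
def expected_utility(action, utility_pool, probs, outcome_model):
    """Expected value over the pool: sum_j P(U_j | history) * U_j(outcome of action)."""
    outcome = outcome_model(action)
    return sum(p * U(outcome) for p, U in zip(probs, utility_pool))

def best_action(actions, utility_pool, probs, outcome_model):
    """Choose the action that maximizes expected utility under value uncertainty."""
    return max(actions, key=lambda a: expected_utility(a, utility_pool, probs, outcome_model))

def update_posterior(probs, likelihoods):
    """Bayes update of P(U_j) after new evidence: posterior proportional to likelihood * prior."""
    unnormalized = [p * l for p, l in zip(probs, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Toy setup: two hypotheses about what humans value.
utility_pool = [
    lambda outcome: outcome["happiness"],  # hypothesis U_1
    lambda outcome: outcome["autonomy"],   # hypothesis U_2
]
probs = [0.7, 0.3]  # P(U_j | interaction history so far)

# Hypothetical outcome model mapping candidate actions to outcomes.
outcomes = {
    "help":     {"happiness": 1.0, "autonomy": 1.0},
    "wirehead": {"happiness": 2.0, "autonomy": -5.0},
}
outcome_model = lambda action: outcomes[action]
```

In this toy setup, “wirehead” scores well under one hypothesis but very badly under the other, so an agent that remains uncertain about which utility function is correct prefers the safer action — a simplified version of how uncertainty over utility functions can discourage reward-hacking behavior.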


See Also

- The easy goal inference problem is still hard (paulfchristiano, 3 Nov 2018)
- Humans can be assigned any values whatsoever… (Stuart_Armstrong, 5 Nov 2018)
- Model Mis-specification and Inverse Reinforcement Learning (9 Nov 2018)
- Ambitious vs. narrow value learning (paulfchristiano, 12 Jan 2019)
- Conclusion to the sequence on value learning (rohinmshah, 3 Feb 2019)
- Intuitions about goal-directed behavior (rohinmshah, 1 Dec 2018)
- What is ambitious value learning? (rohinmshah, 1 Nov 2018)
- (untitled) (abramdemski, 18 Nov 2020)
- Learning human preferences: black-box, white-box, and structured white-box access (Stuart_Armstrong, 24 Aug 2020)
- [Question] What is the relationship between Preference Learning and Value Learning? (Riccardo Volpato, 13 Jan 2020)
- Robust Delegation (4 Nov 2018)
- Thoughts on implementing corrigible robust alignment (steve2152, 26 Nov 2019)
- Comparing AI Alignment Approaches to Minimize False Positive Risk (G Gordon Worley III, 30 Jun 2020)
- Deconfusing Human Values Research Agenda v1 (G Gordon Worley III, 23 Mar 2020)
- AI Alignment Problem: “Human Values” don’t Actually Exist (avturchin, 22 Apr 2019)
- Minimization of prediction error as a foundation for human values in AI alignment (G Gordon Worley III, 9 Oct 2019)
- Values, Valence, and Alignment (G Gordon Worley III, 5 Dec 2019)
- The two-layer model of human values, and problems with synthesizing preferences (Kaj_Sotala, 24 Jan 2020)
- Towards deconfusing values (G Gordon Worley III, 29 Jan 2020)
- Sunday July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stuart_Armstrong (8 Jul 2020)
- Value Uncertainty and the Singleton Scenario (Wei_Dai, 24 Jan 2010)
- 2018 AI Alignment Literature Review and Charity Comparison (Larks, 18 Dec 2018)
- 2019 AI Alignment Literature Review and Charity Comparison (Larks, 19 Dec 2019)
- AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah (Palus Astra, 16 Apr 2020)
- Learning Values in Practice (Stuart_Armstrong, 20 Jul 2020)
- Latent Variables and Model Mis-Specification (jsteinhardt, 7 Nov 2018)
- Future directions for ambitious value learning (rohinmshah, 11 Nov 2018)
- Preface to the sequence on value learning (rohinmshah, 30 Oct 2018)
- What is narrow value learning? (rohinmshah, 10 Jan 2019)
- Human-AI Interaction (rohinmshah, 15 Jan 2019)
- Reward uncertainty (rohinmshah, 19 Jan 2019)
- Future directions for narrow value learning (rohinmshah, 26 Jan 2019)
- AI Alignment 2018-19 Review (rohinmshah, 28 Jan 2020)
- Would I think for ten thousand years? (Stuart_Armstrong, 11 Feb 2019)
- Beyond algorithmic equivalence: self-modelling (Stuart_Armstrong, 28 Feb 2018)
- Beyond algorithmic equivalence: algorithmic noise (Stuart_Armstrong, 28 Feb 2018)
- Following human norms (rohinmshah, 20 Jan 2019)
- Can few-shot learning teach AI right from wrong? (Charlie Steiner, 20 Jul 2018)
- Humans aren’t agents—what then for value learning? (Charlie Steiner, 15 Mar 2019)
- Value learning for moral essentialists (Charlie Steiner, 6 May 2019)
- Training human models is an unsolved problem (Charlie Steiner, 10 May 2019)
- Can we make peace with moral indeterminacy? (Charlie Steiner, 3 Oct 2019)
- The AI is the model (Charlie Steiner, 4 Oct 2019)
- What’s the dream for giving natural language commands to AI? (Charlie Steiner, 8 Oct 2019)
- Constraints from naturalized ethics. (Charlie Steiner, 25 Jul 2020)
- Recursive Quantilizers II (abramdemski, 2 Dec 2020)
- The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables (johnswentworth, 18 Nov 2020)
- Mental subagent implications for AI Safety (moridinamael, 3 Jan 2021)
- Using vector fields to visualise preferences and make them consistent (28 Jan 2020)
- Value uncertainty (MichaelA, 29 Jan 2020)
- Moral uncertainty: What kind of ‘should’ is involved? (MichaelA, 13 Jan 2020)
- Moral uncertainty vs related concepts (MichaelA, 11 Jan 2020)
- Morality vs related concepts (MichaelA, 7 Jan 2020)
- Making decisions when both morally and empirically uncertain (MichaelA, 2 Jan 2020)
- Making decisions under moral uncertainty (MichaelA, 30 Dec 2019)
- Research ideas to study humans with AI Safety in mind (Riccardo Volpato, 3 Jul 2020)
- The E-Coli Test for AI Alignment (johnswentworth, 16 Dec 2018)
- Have you felt exiert yet? (Stuart_Armstrong, 5 Jan 2018)
- Why we need a *theory* of human values (Stuart_Armstrong, 5 Dec 2018)
- But exactly how complex and fragile? (KatjaGrace, 3 Nov 2019)
- Clarifying “AI Alignment” (paulfchristiano, 15 Nov 2018)
- Hacking the CEV for Fun and Profit (Wei_Dai, 3 Jun 2010)
- Using lying to detect human values (Stuart_Armstrong, 15 Mar 2018)
- The Urgent Meta-Ethics of Friendly Artificial Intelligence (lukeprog, 1 Feb 2011)
- Resolving human values, completely and adequately (Stuart_Armstrong, 30 Mar 2018)
- Learning preferences by looking at the world (rohinmshah, 12 Feb 2019)
- Non-Consequentialist Cooperation? (abramdemski, 11 Jan 2019)
- Stable Pointers to Value: An Agent Embedded in Its Own Utility Function (abramdemski, 17 Aug 2017)
- Stable Pointers to Value II: Environmental Goals (abramdemski, 9 Feb 2018)
- Stable Pointers to Value III: Recursive Quantilization (abramdemski, 21 Jul 2018)
- Policy Alignment (abramdemski, 30 Jun 2018)
- Where do selfish values come from? (Wei_Dai, 18 Nov 2011)
- Acknowledging Human Preference Types to Support Value Learning (Nandi Sabrina Erin, 13 Nov 2018)
- Coherence arguments do not imply goal-directed behavior (rohinmshah, 3 Dec 2018)
- Mahatma Armstrong: CEVed to death. (Stuart_Armstrong, 6 Jun 2013)
- misc raw responses to a tract of Critical Rationalism (MakoYass, 14 Aug 2020)
- How to get value learning and reference wrong (Charlie Steiner, 26 Feb 2019)
- [Question] Since figuring out human values is hard, what about, say, monkey values? (shminux, 1 Jan 2020)
- Two questions about CEV that worry me (cousin_it, 23 Dec 2010)
- Cake, or death! (Stuart_Armstrong, 25 Oct 2012)
- Applying utility functions to humans considered harmful (Kaj_Sotala, 3 Feb 2010)
- Agents That Learn From Human Behavior Can’t Learn Human Values That Humans Haven’t Learned Yet (steven0461, 11 Jul 2018)
- Full toy model for preference learning (Stuart_Armstrong, 16 Oct 2019)
- Rigging is a form of wireheading (Stuart_Armstrong, 3 May 2018)
- ISO: Name of Problem (johnswentworth, 24 Jul 2018)
- Superintelligence 14: Motivation selection methods (KatjaGrace, 16 Dec 2014)
- Superintelligence 20: The value-loading problem (KatjaGrace, 27 Jan 2015)
- Superintelligence 21: Value learning (KatjaGrace, 3 Feb 2015)
- Superintelligence 25: Components list for acquiring values (KatjaGrace, 3 Mar 2015)
- Humans can be assigned any values whatsoever... (Stuart_Armstrong, 13 Oct 2017)
- How much can value learning be disentangled? (Stuart_Armstrong, 29 Jan 2019)
- Other versions of “No free lunch in value learning” (Stuart_Armstrong, 25 Feb 2020)
- Deliberation as a method to find the “actual preferences” of humans (riceissa, 22 Oct 2019)
- Practical consequences of impossibility of value learning (Stuart_Armstrong, 2 Aug 2019)
- Communication Prior as Alignment Strategy (johnswentworth, 12 Nov 2020)
- One could be forgiven for getting the feeling... (HumaneAutomation, 3 Nov 2020)
- Rationalising humans: another mugging, but not Pascal’s (Stuart_Armstrong, 14 Nov 2017)
- AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy (xuan, 1 Jan 2021)