Value Learning

TagLast edit: 6 Dec 2023 8:12 UTC by RogerDearnaley

Value learning is a proposed method for incorporating human values in an AGI. It involves the creation of an artificial learner whose actions consider many possible sets of values and preferences, weighed by their likelihood. Value learning could prevent an AGI of having goals detrimental to human values, hence helping in the creation of Friendly AI.

Many ways have been proposed to incorporate human values in an AGI (e.g.: Coherent Extrapolated Volition, Coherent Aggregated Volition and Coherent Blended Volition, mostly proposed around 2004-2010). Value learning was suggested in 2011 by Daniel Dewey in ‘Learning What to Value’. Like most authors, he assumes that an artificial agent needs to be intentionally aligned to human goals. First, Dewey argues against the use of a simple use of reinforcement learning to solve this problem, on the basis that this lead to the maximization of specific rewards that can diverge from value maximization. For example, this could suffer from goal misspecification or reward hacking. He proposes a utility function maximizer comparable to AIXI, which considers all possible utility functions weighted by their Bayesian probabilities: “[W]e propose uncertainty over utility functions. Instead of providing an agent one utility function up front, we provide an agent with a pool of possible utility functions and a probability distribution P such that each utility function can be assigned probability P(Ujyxm) given a particular interaction history [yxm]. An agent can then calculate an expected value over possible utility functions given a particular interaction history”

Nick Bostrom also discusses value learning at length in his book Superintelligence. Value learning is closely related to various proposals for AI-assisted Alignment and AI-assisted/​AI automated Alignment research. Since human values are complex and fragile, learning human values well is a challenging problem, much like AI-assisted Alignment (but in a less supervised setting, so actually harder). So this is only a practicable alignment technique for AGI capable of successfully performing a STEM research program (in Anthropology). Thus value learning is (unusually) an alignment technique that improves as capabilities increase, and it requires around an AGI minimum threshold of capabilities to begin to be effective.

One potential challenge is that human values are somewhat mutable and AGI could affect them.


See Also

The easy goal in­fer­ence prob­lem is still hard

paulfchristiano3 Nov 2018 14:41 UTC
59 points
20 comments4 min readLW link

Model Mis-speci­fi­ca­tion and In­verse Re­in­force­ment Learning

9 Nov 2018 15:33 UTC
33 points
3 comments16 min readLW link

Am­bi­tious vs. nar­row value learning

paulfchristiano12 Jan 2019 6:18 UTC
24 points
16 comments4 min readLW link

Hu­mans can be as­signed any val­ues what­so­ever…

Stuart_Armstrong5 Nov 2018 14:26 UTC
54 points
27 comments4 min readLW link

Con­clu­sion to the se­quence on value learning

Rohin Shah3 Feb 2019 21:05 UTC
51 points
20 comments5 min readLW link

In­tu­itions about goal-di­rected behavior

Rohin Shah1 Dec 2018 4:25 UTC
54 points
15 comments6 min readLW link

Ap­prox­i­mately Bayesian Rea­son­ing: Knigh­tian Uncer­tainty, Good­hart, and the Look-Else­where Effect

RogerDearnaley26 Jan 2024 3:58 UTC
13 points
0 comments11 min readLW link

Align­ment has a Basin of At­trac­tion: Beyond the Orthog­o­nal­ity Thesis

RogerDearnaley1 Feb 2024 21:15 UTC
4 points
15 comments13 min readLW link

Re­quire­ments for a Basin of At­trac­tion to Alignment

RogerDearnaley14 Feb 2024 7:10 UTC
20 points
6 comments31 min readLW link

6. The Mutable Values Prob­lem in Value Learn­ing and CEV

RogerDearnaley4 Dec 2023 18:31 UTC
12 points
0 comments49 min readLW link

Re­quire­ments for a STEM-ca­pa­ble AGI Value Learner (my Case for Less Doom)

RogerDearnaley25 May 2023 9:26 UTC
32 points
3 comments15 min readLW link

What is am­bi­tious value learn­ing?

Rohin Shah1 Nov 2018 16:20 UTC
55 points
28 comments2 min readLW link


abramdemski18 Nov 2020 16:52 UTC
47 points
11 comments9 min readLW link

AI Align­ment Prob­lem: “Hu­man Values” don’t Ac­tu­ally Exist

avturchin22 Apr 2019 9:23 UTC
44 points
29 comments43 min readLW link

Learn­ing hu­man prefer­ences: black-box, white-box, and struc­tured white-box access

Stuart_Armstrong24 Aug 2020 11:42 UTC
26 points
9 comments6 min readLW link

Value sys­tem­ati­za­tion: how val­ues be­come co­her­ent (and mis­al­igned)

Richard_Ngo27 Oct 2023 19:06 UTC
95 points
47 comments13 min readLW link

[Question] What is the re­la­tion­ship be­tween Prefer­ence Learn­ing and Value Learn­ing?

Riccardo Volpato13 Jan 2020 21:08 UTC
5 points
2 comments1 min readLW link

Men­tal sub­agent im­pli­ca­tions for AI Safety

moridinamael3 Jan 2021 18:59 UTC
11 points
0 comments3 min readLW link

The Com­pu­ta­tional Anatomy of Hu­man Values

beren6 Apr 2023 10:33 UTC
70 points
30 comments30 min readLW link

The self-un­al­ign­ment problem

14 Apr 2023 12:10 UTC
144 points
22 comments10 min readLW link

Value Learn­ing – Towards Re­solv­ing Con­fu­sion

PashaKamyshev24 Apr 2023 6:43 UTC
4 points
0 comments18 min readLW link

Ro­bust Delegation

4 Nov 2018 16:38 UTC
116 points
10 comments1 min readLW link

Thoughts on im­ple­ment­ing cor­rigible ro­bust alignment

Steven Byrnes26 Nov 2019 14:06 UTC
26 points
2 comments6 min readLW link

Com­par­ing AI Align­ment Ap­proaches to Min­i­mize False Pos­i­tive Risk

Gordon Seidoh Worley30 Jun 2020 19:34 UTC
5 points
0 comments9 min readLW link

De­con­fus­ing Hu­man Values Re­search Agenda v1

Gordon Seidoh Worley23 Mar 2020 16:25 UTC
28 points
12 comments4 min readLW link

Min­i­miza­tion of pre­dic­tion er­ror as a foun­da­tion for hu­man val­ues in AI alignment

Gordon Seidoh Worley9 Oct 2019 18:23 UTC
15 points
42 comments5 min readLW link

Values, Valence, and Alignment

Gordon Seidoh Worley5 Dec 2019 21:06 UTC
12 points
4 comments13 min readLW link

The two-layer model of hu­man val­ues, and prob­lems with syn­the­siz­ing preferences

Kaj_Sotala24 Jan 2020 15:17 UTC
69 points
16 comments9 min readLW link

Towards de­con­fus­ing values

Gordon Seidoh Worley29 Jan 2020 19:28 UTC
12 points
4 comments7 min readLW link

Sun­day July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stu­art_Armstrong

8 Jul 2020 0:27 UTC
19 points
2 comments1 min readLW link

Value Uncer­tainty and the Sin­gle­ton Scenario

Wei Dai24 Jan 2010 5:03 UTC
10 points
31 comments3 min readLW link

2018 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

Larks18 Dec 2018 4:46 UTC
190 points
26 comments62 min readLW link1 review

2019 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

Larks19 Dec 2019 3:00 UTC
130 points
18 comments62 min readLW link

AI Align­ment Pod­cast: An Overview of Tech­ni­cal AI Align­ment in 2018 and 2019 with Buck Sh­legeris and Ro­hin Shah

Palus Astra16 Apr 2020 0:50 UTC
58 points
27 comments89 min readLW link

Learn­ing Values in Practice

Stuart_Armstrong20 Jul 2020 18:38 UTC
24 points
0 comments5 min readLW link

La­tent Vari­ables and Model Mis-Specification

jsteinhardt7 Nov 2018 14:48 UTC
24 points
8 comments9 min readLW link

Fu­ture di­rec­tions for am­bi­tious value learning

Rohin Shah11 Nov 2018 15:53 UTC
48 points
9 comments4 min readLW link

Pre­face to the se­quence on value learning

Rohin Shah30 Oct 2018 22:04 UTC
70 points
6 comments3 min readLW link

What is nar­row value learn­ing?

Rohin Shah10 Jan 2019 7:05 UTC
23 points
3 comments2 min readLW link

Hu­man-AI Interaction

Rohin Shah15 Jan 2019 1:57 UTC
34 points
10 comments4 min readLW link

Re­ward uncertainty

Rohin Shah19 Jan 2019 2:16 UTC
26 points
3 comments5 min readLW link

Fu­ture di­rec­tions for nar­row value learning

Rohin Shah26 Jan 2019 2:36 UTC
12 points
4 comments4 min readLW link

AI Align­ment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC
126 points
6 comments35 min readLW link

[Question] Is In­fra-Bayesi­anism Ap­pli­ca­ble to Value Learn­ing?

RogerDearnaley11 May 2023 8:17 UTC
5 points
4 comments1 min readLW link

Would I think for ten thou­sand years?

Stuart_Armstrong11 Feb 2019 19:37 UTC
25 points
13 comments1 min readLW link

Beyond al­gorith­mic equiv­alence: self-modelling

Stuart_Armstrong28 Feb 2018 16:55 UTC
10 points
3 comments1 min readLW link

Beyond al­gorith­mic equiv­alence: al­gorith­mic noise

Stuart_Armstrong28 Feb 2018 16:55 UTC
10 points
4 comments2 min readLW link

Fol­low­ing hu­man norms

Rohin Shah20 Jan 2019 23:59 UTC
30 points
10 comments5 min readLW link

Can few-shot learn­ing teach AI right from wrong?

Charlie Steiner20 Jul 2018 7:45 UTC
13 points
3 comments6 min readLW link

Hu­mans aren’t agents—what then for value learn­ing?

Charlie Steiner15 Mar 2019 22:01 UTC
21 points
14 comments3 min readLW link

Value learn­ing for moral essentialists

Charlie Steiner6 May 2019 9:05 UTC
11 points
3 comments3 min readLW link

Train­ing hu­man mod­els is an un­solved problem

Charlie Steiner10 May 2019 7:17 UTC
13 points
3 comments4 min readLW link

Can we make peace with moral in­de­ter­mi­nacy?

Charlie Steiner3 Oct 2019 12:56 UTC
16 points
8 comments3 min readLW link

The AI is the model

Charlie Steiner4 Oct 2019 8:11 UTC
14 points
1 comment3 min readLW link

What’s the dream for giv­ing nat­u­ral lan­guage com­mands to AI?

Charlie Steiner8 Oct 2019 13:42 UTC
14 points
8 comments7 min readLW link

Con­straints from nat­u­ral­ized ethics.

Charlie Steiner25 Jul 2020 14:54 UTC
21 points
0 comments3 min readLW link

Eval­u­at­ing the his­tor­i­cal value mis­speci­fi­ca­tion argument

Matthew Barnett5 Oct 2023 18:34 UTC
162 points
140 comments7 min readLW link

Re­cur­sive Quan­tiliz­ers II

abramdemski2 Dec 2020 15:26 UTC
30 points
15 comments13 min readLW link

The Poin­t­ers Prob­lem: Hu­man Values Are A Func­tion Of Hu­mans’ La­tent Variables

johnswentworth18 Nov 2020 17:47 UTC
123 points
49 comments11 min readLW link2 reviews

In­tro­duc­tion to Re­duc­ing Goodhart

Charlie Steiner26 Aug 2021 18:38 UTC
47 points
10 comments4 min readLW link

Good­hart Ethology

Charlie Steiner17 Sep 2021 17:31 UTC
19 points
4 comments14 min readLW link

The Dark Side of Cog­ni­tion Hypothesis

Cameron Berg3 Oct 2021 20:10 UTC
19 points
1 comment16 min readLW link

Mo­rally un­der­defined situ­a­tions can be deadly

Stuart_Armstrong22 Nov 2021 14:48 UTC
17 points
8 comments2 min readLW link

How an alien the­ory of mind might be unlearnable

Stuart_Armstrong3 Jan 2022 11:16 UTC
29 points
35 comments5 min readLW link

Value ex­trap­o­la­tion, con­cept ex­trap­o­la­tion, model splintering

Stuart_Armstrong8 Mar 2022 22:50 UTC
16 points
1 comment2 min readLW link

Nat­u­ral Value Learning

Chris van Merwijk20 Mar 2022 12:44 UTC
7 points
10 comments4 min readLW link

AIs should learn hu­man prefer­ences, not biases

Stuart_Armstrong8 Apr 2022 13:45 UTC
10 points
0 comments1 min readLW link

Differ­ent per­spec­tives on con­cept extrapolation

Stuart_Armstrong8 Apr 2022 10:42 UTC
48 points
8 comments5 min readLW link1 review

The poin­t­ers prob­lem, distilled

Nina Rimsky26 May 2022 22:44 UTC
16 points
0 comments2 min readLW link

Value ex­trap­o­la­tion vs Wireheading

Stuart_Armstrong17 Jun 2022 15:02 UTC
16 points
1 comment1 min readLW link

LOVE in a sim­box is all you need

jacob_cannell28 Sep 2022 18:25 UTC
63 points
72 comments44 min readLW link1 review

Learn­ing so­cietal val­ues from law as part of an AGI al­ign­ment strategy

John Nay21 Oct 2022 2:03 UTC
5 points
18 comments54 min readLW link

But ex­actly how com­plex and frag­ile?

KatjaGrace3 Nov 2019 18:20 UTC
82 points
32 comments3 min readLW link1 review

Stable Poin­t­ers to Value III: Re­cur­sive Quantilization

abramdemski21 Jul 2018 8:06 UTC
20 points
4 comments4 min readLW link

Policy Alignment

abramdemski30 Jun 2018 0:24 UTC
50 points
25 comments8 min readLW link

Where do self­ish val­ues come from?

Wei Dai18 Nov 2011 23:52 UTC
67 points
62 comments2 min readLW link

After Align­ment — Dialogue be­tween RogerDear­naley and Seth Herd

2 Dec 2023 6:03 UTC
15 points
2 comments25 min readLW link

Ac­knowl­edg­ing Hu­man Prefer­ence Types to Sup­port Value Learning

Nandi13 Nov 2018 18:57 UTC
34 points
4 comments9 min readLW link

Co­her­ence ar­gu­ments do not en­tail goal-di­rected behavior

Rohin Shah3 Dec 2018 3:26 UTC
123 points
69 comments7 min readLW link3 reviews

Mor­pholog­i­cal in­tel­li­gence, su­per­hu­man em­pa­thy, and eth­i­cal arbitration

Roman Leventov13 Feb 2023 10:25 UTC
1 point
0 comments2 min readLW link

Ma­hatma Arm­strong: CEVed to death.

Stuart_Armstrong6 Jun 2013 12:50 UTC
33 points
62 comments2 min readLW link

misc raw re­sponses to a tract of Crit­i­cal Rationalism

mako yass14 Aug 2020 11:53 UTC
21 points
52 comments3 min readLW link

How to get value learn­ing and refer­ence wrong

Charlie Steiner26 Feb 2019 20:22 UTC
37 points
2 comments6 min readLW link

[Question] Since figur­ing out hu­man val­ues is hard, what about, say, mon­key val­ues?

shminux1 Jan 2020 21:56 UTC
37 points
13 comments1 min readLW link

The Lin­guis­tic Blind Spot of Value-Aligned Agency, Nat­u­ral and Ar­tifi­cial

Roman Leventov14 Feb 2023 6:57 UTC
6 points
0 comments2 min readLW link

[Question] “Frag­ility of Value” vs. LLMs

Not Relevant13 Apr 2022 2:02 UTC
34 points
33 comments1 min readLW link

Two ques­tions about CEV that worry me

cousin_it23 Dec 2010 15:58 UTC
37 points
141 comments1 min readLW link

Cake, or death!

Stuart_Armstrong25 Oct 2012 10:33 UTC
47 points
13 comments4 min readLW link

Ap­ply­ing util­ity func­tions to hu­mans con­sid­ered harmful

Kaj_Sotala3 Feb 2010 19:22 UTC
36 points
116 comments5 min readLW link

Agents That Learn From Hu­man Be­hav­ior Can’t Learn Hu­man Values That Hu­mans Haven’t Learned Yet

steven046111 Jul 2018 2:59 UTC
27 points
11 comments1 min readLW link

Full toy model for prefer­ence learning

Stuart_Armstrong16 Oct 2019 11:06 UTC
20 points
2 comments12 min readLW link

Rig­ging is a form of wireheading

Stuart_Armstrong3 May 2018 12:50 UTC
11 points
2 comments1 min readLW link

ISO: Name of Problem

johnswentworth24 Jul 2018 17:15 UTC
28 points
15 comments1 min readLW link

Break­ing the Op­ti­mizer’s Curse, and Con­se­quences for Ex­is­ten­tial Risks and Value Learning

Roger Dearnaley21 Feb 2023 9:05 UTC
10 points
1 comment23 min readLW link

Just How Hard a Prob­lem is Align­ment?

Roger Dearnaley25 Feb 2023 9:00 UTC
−1 points
1 comment21 min readLW link

Up­dated Defer­ence is not a strong ar­gu­ment against the util­ity un­cer­tainty ap­proach to alignment

Ivan Vendrov24 Jun 2022 19:32 UTC
26 points
8 comments4 min readLW link

Su­per­in­tel­li­gence 14: Mo­ti­va­tion se­lec­tion methods

KatjaGrace16 Dec 2014 2:00 UTC
9 points
28 comments5 min readLW link

Su­per­in­tel­li­gence 20: The value-load­ing problem

KatjaGrace27 Jan 2015 2:00 UTC
8 points
21 comments6 min readLW link

Su­per­in­tel­li­gence 21: Value learning

KatjaGrace3 Feb 2015 2:01 UTC
12 points
33 comments4 min readLW link

Su­per­in­tel­li­gence 25: Com­po­nents list for ac­quiring values

KatjaGrace3 Mar 2015 2:01 UTC
11 points
12 comments8 min readLW link

Open-ended ethics of phe­nom­ena (a desider­ata with uni­ver­sal moral­ity)

Ryo 8 Nov 2023 20:10 UTC
1 point
0 comments8 min readLW link

Hu­mans can be as­signed any val­ues what­so­ever...

Stuart_Armstrong13 Oct 2017 11:29 UTC
15 points
6 comments4 min readLW link

How much can value learn­ing be dis­en­tan­gled?

Stuart_Armstrong29 Jan 2019 14:17 UTC
22 points
30 comments2 min readLW link

How I think about alignment

Linda Linsefors13 Aug 2022 10:01 UTC
31 points
11 comments5 min readLW link

Broad Pic­ture of Hu­man Values

Thane Ruthenis20 Aug 2022 19:42 UTC
42 points
6 comments10 min readLW link

Help Un­der­stand­ing Prefer­ences And Evil

Netcentrica27 Aug 2022 3:42 UTC
6 points
7 comments2 min readLW link

In­for­mal se­man­tics and Orders

Q Home27 Aug 2022 4:17 UTC
14 points
10 comments26 min readLW link

Can “Re­ward Eco­nomics” solve AI Align­ment?

Q Home7 Sep 2022 7:58 UTC
3 points
15 comments18 min readLW link

What Should AI Owe To Us? Ac­countable and Aligned AI Sys­tems via Con­trac­tu­al­ist AI Alignment

xuan8 Sep 2022 15:04 UTC
32 points
15 comments25 min readLW link

Mo­ral un­cer­tainty: What kind of ‘should’ is in­volved?

MichaelA13 Jan 2020 12:13 UTC
14 points
11 comments13 min readLW link

Mo­ral un­cer­tainty vs re­lated concepts

MichaelA11 Jan 2020 10:03 UTC
26 points
13 comments16 min readLW link

Mo­ral­ity vs re­lated concepts

MichaelA7 Jan 2020 10:47 UTC
26 points
17 comments8 min readLW link

Mak­ing de­ci­sions when both morally and em­piri­cally uncertain

MichaelA2 Jan 2020 7:20 UTC
13 points
14 comments20 min readLW link

Mak­ing de­ci­sions un­der moral uncertainty

MichaelA30 Dec 2019 1:49 UTC
20 points
26 comments17 min readLW link

Sin­gu­lar learn­ing the­ory and bridg­ing from ML to brain emulations

1 Nov 2023 21:31 UTC
26 points
16 comments29 min readLW link

Other ver­sions of “No free lunch in value learn­ing”

Stuart_Armstrong25 Feb 2020 14:25 UTC
28 points
0 comments1 min readLW link

[Linkpost] Con­cept Align­ment as a Pr­ereq­ui­site for Value Alignment

Bogdan Ionut Cirstea4 Nov 2023 17:34 UTC
27 points
0 comments1 min readLW link

​​ Open-ended/​Phenom­e­nal ​Ethics ​(TLDR)

Ryo 9 Nov 2023 16:58 UTC
3 points
0 comments1 min readLW link

1. A Sense of Fair­ness: De­con­fus­ing Ethics

RogerDearnaley17 Nov 2023 20:55 UTC
15 points
8 comments15 min readLW link

Re­search ideas to study hu­mans with AI Safety in mind

Riccardo Volpato3 Jul 2020 16:01 UTC
23 points
2 comments5 min readLW link

ACI#5: From Hu­man-AI Co-evolu­tion to the Evolu­tion of Value Systems

Akira Pyinya18 Aug 2023 0:38 UTC
0 points
0 comments9 min readLW link

De­liber­a­tion as a method to find the “ac­tual prefer­ences” of humans

riceissa22 Oct 2019 9:23 UTC
23 points
5 comments9 min readLW link

2. AIs as Eco­nomic Agents

RogerDearnaley23 Nov 2023 7:07 UTC
9 points
2 comments6 min readLW link

Prac­ti­cal con­se­quences of im­pos­si­bil­ity of value learning

Stuart_Armstrong2 Aug 2019 23:06 UTC
23 points
13 comments3 min readLW link

Com­mu­ni­ca­tion Prior as Align­ment Strategy

johnswentworth12 Nov 2020 22:06 UTC
46 points
8 comments6 min readLW link

Un­cov­er­ing La­tent Hu­man Wel­lbe­ing in LLM Embeddings

14 Sep 2023 1:40 UTC
32 points
7 comments8 min readLW link

Lev­er­ag­ing Le­gal In­for­mat­ics to Align AI

John Nay18 Sep 2022 20:39 UTC
11 points
0 comments3 min readLW link

Tak­ing Into Ac­count Sen­tient Non-Hu­mans in AI Am­bi­tious Value Learn­ing: Sen­tien­tist Co­her­ent Ex­trap­o­lated Volition

Adrià Moret2 Dec 2023 14:07 UTC
26 points
31 comments42 min readLW link

(A Failed Ap­proach) From Prece­dent to Utility Function

Akira Pyinya29 Apr 2023 21:55 UTC
0 points
2 comments4 min readLW link

The Poin­ter Re­s­olu­tion Problem

Jozdien16 Feb 2024 21:25 UTC
39 points
20 comments3 min readLW link

One could be for­given for get­ting the feel­ing...

HumaneAutomation3 Nov 2020 4:53 UTC
−2 points
2 comments1 min readLW link

Ra­tion­al­is­ing hu­mans: an­other mug­ging, but not Pas­cal’s

Stuart_Armstrong14 Nov 2017 15:46 UTC
7 points
1 comment3 min readLW link

Char­ac­ter alignment

p.b.20 Sep 2022 8:27 UTC
22 points
0 comments2 min readLW link

[AN #69] Stu­art Rus­sell’s new book on why we need to re­place the stan­dard model of AI

Rohin Shah19 Oct 2019 0:30 UTC
60 points
12 comments15 min readLW link

2023 Align­ment Re­search Up­dates from FAR AI

4 Dec 2023 22:32 UTC
18 points
0 comments8 min readLW link

Value uncertainty

MichaelA29 Jan 2020 20:16 UTC
20 points
3 comments14 min readLW link

Com­po­si­tional prefer­ence mod­els for al­ign­ing LMs

Tomek Korbak25 Oct 2023 12:17 UTC
18 points
2 comments5 min readLW link

Us­ing vec­tor fields to vi­su­al­ise prefer­ences and make them consistent

28 Jan 2020 19:44 UTC
41 points
32 comments11 min readLW link

Value learn­ing in the ab­sence of ground truth

Joel_Saarinen5 Feb 2024 18:56 UTC
47 points
8 comments45 min readLW link

AI Align­ment, Philo­soph­i­cal Plu­ral­ism, and the Rele­vance of Non-Western Philosophy

xuan1 Jan 2021 0:08 UTC
30 points
21 comments20 min readLW link

[Heb­bian Nat­u­ral Ab­strac­tions] Introduction

21 Nov 2022 20:34 UTC
34 points
3 comments4 min readLW link

[Question] [DISC] Are Values Ro­bust?

DragonGod21 Dec 2022 1:00 UTC
12 points
9 comments2 min readLW link

Claude wants to be conscious

Joe Kwon13 Apr 2024 1:40 UTC
4 points
7 comments6 min readLW link

[Heb­bian Nat­u­ral Ab­strac­tions] Math­e­mat­i­cal Foundations

25 Dec 2022 20:58 UTC
15 points
2 comments6 min readLW link

An Open Philan­thropy grant pro­posal: Causal rep­re­sen­ta­tion learn­ing of hu­man preferences

PabloAMC11 Jan 2022 11:28 UTC
19 points
6 comments8 min readLW link

Value ex­trap­o­la­tion par­tially re­solves sym­bol grounding

Stuart_Armstrong12 Jan 2022 16:30 UTC
24 points
10 comments1 min readLW link

The E-Coli Test for AI Alignment

johnswentworth16 Dec 2018 8:10 UTC
70 points
24 comments1 min readLW link

Have you felt ex­iert yet?

Stuart_Armstrong5 Jan 2018 17:03 UTC
28 points
7 comments1 min readLW link

Why we need a *the­ory* of hu­man values

Stuart_Armstrong5 Dec 2018 16:00 UTC
66 points
15 comments4 min readLW link

Clar­ify­ing “AI Align­ment”

paulfchristiano15 Nov 2018 14:41 UTC
65 points
83 comments3 min readLW link2 reviews

Hack­ing the CEV for Fun and Profit

Wei Dai3 Jun 2010 20:30 UTC
78 points
207 comments1 min readLW link

Us­ing ly­ing to de­tect hu­man values

Stuart_Armstrong15 Mar 2018 11:37 UTC
19 points
6 comments1 min readLW link

The Ur­gent Meta-Ethics of Friendly Ar­tifi­cial Intelligence

lukeprog1 Feb 2011 14:15 UTC
76 points
252 comments1 min readLW link

Re­solv­ing hu­man val­ues, com­pletely and adequately

Stuart_Armstrong30 Mar 2018 3:35 UTC
32 points
30 comments12 min readLW link

Learn­ing prefer­ences by look­ing at the world

Rohin Shah12 Feb 2019 22:25 UTC
43 points
10 comments7 min readLW link

Non-Con­se­quen­tial­ist Co­op­er­a­tion?

abramdemski11 Jan 2019 9:15 UTC
49 points
15 comments7 min readLW link

Stable Poin­t­ers to Value: An Agent Embed­ded in Its Own Utility Function

abramdemski17 Aug 2017 0:22 UTC
15 points
9 comments5 min readLW link

Stable Poin­t­ers to Value II: En­vi­ron­men­tal Goals

abramdemski9 Feb 2018 6:03 UTC
19 points
2 comments4 min readLW link