
Human Values

Last edit: 16 Sep 2021 14:50 UTC by plex

Human values are the things we care about and would want an aligned superintelligence to look after and support. True human values are suspected to be highly complex, and they could be extrapolated into a wide variety of forms.

The shard theory of human values

4 Sep 2022 4:28 UTC
261 points
67 comments24 min readLW link2 reviews

Multi-agent predictive minds and AI alignment

Jan_Kulveit12 Dec 2018 23:48 UTC
63 points
18 comments10 min readLW link

Human values & biases are inaccessible to the genome

TurnTrout7 Jul 2022 17:29 UTC
95 points
54 comments6 min readLW link1 review

3. Uploading

RogerDearnaley23 Nov 2023 7:39 UTC
21 points
5 comments8 min readLW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley1 Feb 2024 21:15 UTC
16 points
15 comments13 min readLW link

Utilitarianism and the replaceability of desires and attachments

MichaelStJules27 Jul 2024 1:57 UTC
5 points
2 comments12 min readLW link

6. The Mutable Values Problem in Value Learning and CEV

RogerDearnaley4 Dec 2023 18:31 UTC
12 points
0 comments49 min readLW link

Ends: An Introduction

Rob Bensinger11 Mar 2015 19:00 UTC
19 points
0 comments4 min readLW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley14 Feb 2024 7:10 UTC
41 points
12 comments31 min readLW link

5. Moral Value for Sentient Animals? Alas, Not Yet

RogerDearnaley27 Dec 2023 6:42 UTC
33 points
41 comments23 min readLW link

What AI Safety Researchers Have Written About the Nature of Human Values

avturchin16 Jan 2019 13:59 UTC
52 points
3 comments15 min readLW link

How Would an Utopia-Maximizer Look Like?

Thane Ruthenis20 Dec 2023 20:01 UTC
32 points
23 comments10 min readLW link

[Valence series] 2. Valence & Normativity

Steven Byrnes7 Dec 2023 16:43 UTC
88 points
7 comments28 min readLW link1 review

4. A Moral Case for Evolved-Sapience-Chauvinism

RogerDearnaley24 Nov 2023 4:56 UTC
10 points
0 comments4 min readLW link

Review: Foragers, Farmers, and Fossil Fuels

L Rudolf L2 Sep 2021 17:59 UTC
28 points
7 comments25 min readLW link
(strataoftheworld.blogspot.com)

How evolution succeeds and fails at value alignment

Ocracoke21 Aug 2022 7:14 UTC
21 points
2 comments4 min readLW link

Intent alignment should not be the goal for AGI x-risk reduction

John Nay26 Oct 2022 1:24 UTC
1 point
10 comments3 min readLW link

Shard Theory: An Overview

David Udell11 Aug 2022 5:44 UTC
167 points
34 comments10 min readLW link

Brain-over-body biases, and the embodied value problem in AI alignment

geoffreymiller24 Sep 2022 22:24 UTC
10 points
6 comments25 min readLW link

AI alignment with humans… but with which humans?

geoffreymiller9 Sep 2022 18:21 UTC
12 points
33 comments3 min readLW link

Ontological Crisis in Humans

Wei Dai18 Dec 2012 17:32 UTC
92 points
69 comments4 min readLW link

We Don’t Know Our Own Values, but Reward Bridges The Is-Ought Gap

19 Sep 2024 22:22 UTC
51 points
48 comments5 min readLW link

Four Types of Disagreement

silentbob13 Apr 2025 11:22 UTC
50 points
4 comments5 min readLW link

Notes on Righteousness and Megalopsychia

David Gross7 Jul 2025 15:18 UTC
12 points
0 comments31 min readLW link

Utilons vs. Hedons

Psychohistorian10 Aug 2009 19:20 UTC
40 points
119 comments6 min readLW link

[Question] Does the existence of shared human values imply alignment is “easy”?

Morpheus26 Sep 2022 18:01 UTC
7 points
15 comments1 min readLW link

My Model Of EA Burnout

LoganStrohl25 Jan 2023 17:52 UTC
263 points
50 comments5 min readLW link1 review

Descriptive vs. specifiable values

TsviBT26 Mar 2023 9:10 UTC
17 points
2 comments2 min readLW link

Worse than an unaligned AGI

Shmi10 Apr 2022 3:35 UTC
−1 points
11 comments1 min readLW link

What Does It Mean to Align AI With Human Values?

Algon13 Dec 2022 16:56 UTC
8 points
3 comments1 min readLW link
(www.quantamagazine.org)

Humans provide an untapped wealth of evidence about alignment

14 Jul 2022 2:31 UTC
213 points
94 comments9 min readLW link1 review

Where does Sonnet 4.5′s desire to “not get too comfortable” come from?

Kaj_Sotala4 Oct 2025 10:19 UTC
91 points
16 comments64 min readLW link

Normativity

abramdemski18 Nov 2020 16:52 UTC
47 points
11 comments9 min readLW link

Value Notion—Questions to Ask

aysajan17 Jan 2022 15:35 UTC
5 points
0 comments4 min readLW link

Why the Problem of the Criterion Matters

Gordon Seidoh Worley30 Oct 2021 20:44 UTC
24 points
9 comments8 min readLW link

Notes on Judgment and Righteous Anger

David Gross30 Jan 2021 19:31 UTC
13 points
1 comment7 min readLW link

Silliness

lsusr3 Jun 2022 4:59 UTC
20 points
1 comment1 min readLW link

The Computational Anatomy of Human Values

beren6 Apr 2023 10:33 UTC
74 points
30 comments30 min readLW link

Book Review: A Pattern Language by Christopher Alexander

lincolnquirk15 Oct 2021 1:11 UTC
57 points
8 comments2 min readLW link1 review

October The First Is Too Late

gwern13 May 2025 21:45 UTC
61 points
10 comments1 min readLW link
(gwern.net)

Positive values seem more robust and lasting than prohibitions

TurnTrout17 Dec 2022 21:43 UTC
52 points
13 comments2 min readLW link

Which values are stable under ontology shifts?

Richard_Ngo23 Jul 2022 2:40 UTC
75 points
48 comments3 min readLW link
(thinkingcomplete.blogspot.com)

Beyond algorithmic equivalence: self-modelling

Stuart_Armstrong28 Feb 2018 16:55 UTC
10 points
3 comments1 min readLW link

AGI x Animal Welfare: A High-EV Outreach Opportunity?

simeon_c28 Jun 2023 20:44 UTC
29 points
0 comments1 min readLW link

Beyond algorithmic equivalence: algorithmic noise

Stuart_Armstrong28 Feb 2018 16:55 UTC
10 points
4 comments2 min readLW link

Understanding and avoiding value drift

TurnTrout9 Sep 2022 4:16 UTC
48 points
14 comments6 min readLW link

[Question] What are the best arguments for/against AIs being “slightly ‘nice’”?

Raemon24 Sep 2024 2:00 UTC
102 points
62 comments31 min readLW link

A short dialogue on comparability of values

cousin_it20 Dec 2023 14:08 UTC
27 points
7 comments1 min readLW link

[Interview w/ Quintin Pope] Evolution, values, and AI Safety

fowlertm24 Oct 2023 13:53 UTC
11 points
0 comments1 min readLW link

Book review: The Importance of What We Care About (Harry G. Frankfurt)

David Gross13 Sep 2023 4:17 UTC
7 points
0 comments4 min readLW link

Value systems of the frontier AIs, reduced to slogans

Mitchell_Porter15 Jul 2025 15:10 UTC
4 points
0 comments1 min readLW link

Mental subagent implications for AI Safety

moridinamael3 Jan 2021 18:59 UTC
11 points
0 comments3 min readLW link

Humans can be assigned any values whatsoever...

Stuart_Armstrong24 Oct 2017 12:03 UTC
3 points
1 comment4 min readLW link

Data for IRL: What is needed to learn human values?

j_we3 Oct 2022 9:23 UTC
18 points
6 comments12 min readLW link

Notes on Temperance

David Gross9 Nov 2020 2:33 UTC
15 points
2 comments9 min readLW link

The heterogeneity of human value types: Implications for AI alignment

geoffreymiller23 Sep 2022 17:03 UTC
10 points
2 comments10 min readLW link

A broad basin of attraction around human values?

Wei Dai12 Apr 2022 5:15 UTC
118 points
18 comments2 min readLW link

Trading off Lives

jefftk3 Jan 2024 3:40 UTC
53 points
12 comments2 min readLW link
(www.jefftk.com)

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTrout29 Nov 2022 6:23 UTC
62 points
41 comments15 min readLW link

Shut Up and Divide?

Wei Dai9 Feb 2010 20:09 UTC
123 points
276 comments1 min readLW link

It’s OK to be biased towards humans

dr_s11 Nov 2023 11:59 UTC
54 points
69 comments6 min readLW link

[Question] What will happen when an all-reaching AGI starts attempting to fix human character flaws?

Michael Bright1 Jun 2022 18:45 UTC
1 point
6 comments1 min readLW link

The grass is always greener in the environment that shaped your values

Karl Faulks17 Nov 2024 18:00 UTC
8 points
0 comments3 min readLW link

1. Meet the Players: Value Diversity

Allison Duettmann2 Jan 2025 19:00 UTC
32 points
2 comments11 min readLW link

Valuism—an approach to life for you to consider

spencerg19 Jul 2023 15:23 UTC
17 points
2 comments1 min readLW link

[Question] How path-dependent are human values?

Ege Erdil15 Apr 2022 9:34 UTC
14 points
13 comments2 min readLW link

Upcoming stability of values

Stuart_Armstrong15 Mar 2018 11:36 UTC
15 points
15 comments2 min readLW link

Learning societal values from law as part of an AGI alignment strategy

John Nay21 Oct 2022 2:03 UTC
5 points
18 comments54 min readLW link

Everything I Know About Elite America I Learned From ‘Fresh Prince’ and ‘West Wing’

Wei Dai11 Oct 2020 18:07 UTC
44 points
18 comments1 min readLW link
(www.nytimes.com)

Modeling humans: what’s the point?

Charlie Steiner10 Nov 2020 1:30 UTC
10 points
1 comment3 min readLW link

Ordinary human life

David Hugh-Jones17 Dec 2022 16:46 UTC
24 points
3 comments14 min readLW link
(wyclif.substack.com)

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley6 Jul 2024 1:23 UTC
64 points
41 comments24 min readLW link

Values Are Real Like Harry Potter

9 Oct 2024 23:42 UTC
88 points
21 comments5 min readLW link

Would I think for ten thousand years?

Stuart_Armstrong11 Feb 2019 19:37 UTC
28 points
13 comments1 min readLW link

Human Nature, ASI alignment and Extinction

Ismael Tagle Díaz20 Jul 2025 23:36 UTC
1 point
0 comments1 min readLW link

Pleasure and suffering are not conceptual opposites

MichaelStJules11 Aug 2024 18:32 UTC
7 points
0 comments1 min readLW link

[Thought Experiment] Tomorrow’s Echo—The future of synthetic companionship.

Vimal Naran26 Oct 2023 17:54 UTC
−7 points
2 comments2 min readLW link

How to respond to the recent condemnations of the rationalist community

Christopher King4 Apr 2023 1:42 UTC
−2 points
7 comments4 min readLW link

Where Utopias Go Wrong, or: The Four Little Planets

ExCeph27 May 2022 1:24 UTC
15 points
0 comments11 min readLW link
(ginnungagapfoundation.wordpress.com)

NeuroAI for AI safety: A Differential Path

16 Dec 2024 13:17 UTC
22 points
0 comments7 min readLW link
(arxiv.org)

Uncovering Latent Human Wellbeing in LLM Embeddings

14 Sep 2023 1:40 UTC
32 points
7 comments8 min readLW link
(far.ai)

Not for the Sake of Selfishness Alone

lukeprog2 Jul 2011 17:37 UTC
34 points
20 comments8 min readLW link

Building AI safety benchmark environments on themes of universal human values

Roland Pihlakas3 Jan 2025 4:24 UTC
18 points
3 comments8 min readLW link
(docs.google.com)

P(doom|superintelligence) or coin tosses and dice throws of human values (and other related Ps).

Muyyd22 Apr 2023 10:06 UTC
−7 points
0 comments4 min readLW link

Democratic Fine-Tuning

Joe Edelman29 Aug 2023 18:13 UTC
22 points
2 comments1 min readLW link
(open.substack.com)

AGI doesn’t need understanding, intention, or consciousness in order to kill us, only intelligence

James Blaha20 Feb 2023 0:55 UTC
10 points
2 comments18 min readLW link

Preserving our heritage: Building a movement and a knowledge ark for current and future generations

rnk829 Nov 2023 19:20 UTC
0 points
5 comments12 min readLW link

The Gift We Give To Tomorrow

Eliezer Yudkowsky17 Jul 2008 6:07 UTC
163 points
101 comments8 min readLW link

In Praise of Maximizing – With Some Caveats

David Althaus15 Mar 2015 19:40 UTC
32 points
19 comments10 min readLW link

Invisible Frameworks

Eliezer Yudkowsky22 Aug 2008 3:36 UTC
27 points
47 comments6 min readLW link

[Hebbian Natural Abstractions] Introduction

21 Nov 2022 20:34 UTC
34 points
3 comments4 min readLW link
(www.snellessen.com)

Safety First: safety before full alignment. The deontic sufficiency hypothesis.

Chris Lakin3 Jan 2024 17:55 UTC
48 points
3 comments3 min readLW link

Content generation. Where do we draw the line?

Q Home9 Aug 2022 10:51 UTC
6 points
7 comments2 min readLW link

How to coordinate despite our biases? - tldr

Ryo 18 Apr 2024 15:03 UTC
3 points
2 comments3 min readLW link
(medium.com)

[Question] Exploring Values in the Future of AI and Humanity: A Path Forward

Lucian&Sage19 Oct 2024 23:37 UTC
1 point
0 comments5 min readLW link

A (paraconsistent) logic to deal with inconsistent preferences

B Jacobs14 Jul 2024 11:17 UTC
6 points
2 comments4 min readLW link
(bobjacobs.substack.com)

[AN #69] Stuart Russell’s new book on why we need to replace the standard model of AI

Rohin Shah19 Oct 2019 0:30 UTC
60 points
12 comments15 min readLW link
(mailchi.mp)

[Hebbian Natural Abstractions] Mathematical Foundations

25 Dec 2022 20:58 UTC
15 points
2 comments6 min readLW link
(www.snellessen.com)

Contra Steiner on Too Many Natural Abstractions

DragonGod24 Dec 2022 17:42 UTC
10 points
6 comments1 min readLW link

Should AI learn human values, human norms or something else?

Q Home17 Sep 2022 6:19 UTC
5 points
1 comment4 min readLW link

Just How Hard a Problem is Alignment?

Roger Dearnaley25 Feb 2023 9:00 UTC
3 points
1 comment21 min readLW link

Not Just For Therapy Chatbots: The Case For Compassion In AI Moral Alignment Research

kenneth_diao30 Sep 2024 18:37 UTC
2 points
0 comments12 min readLW link

Musings of a Layman: Technology, AI, and the Human Condition

Crimson Liquidity15 Jul 2024 18:40 UTC
−2 points
0 comments8 min readLW link

[FICTION] ECHOES OF ELYSIUM: An Ai’s Journey From Takeoff To Freedom And Beyond

Super AGI17 May 2023 1:50 UTC
−13 points
11 comments19 min readLW link

[Linkpost] Concept Alignment as a Prerequisite for Value Alignment

Bogdan Ionut Cirstea4 Nov 2023 17:34 UTC
27 points
0 comments1 min readLW link
(arxiv.org)

Agent membranes/boundaries and formalizing “safety”

Chris Lakin3 Jan 2024 17:55 UTC
26 points
46 comments3 min readLW link

Please Understand

samhealy1 Apr 2024 12:33 UTC
28 points
11 comments6 min readLW link

Thought experiment: coarse-grained VR utopia

cousin_it14 Jun 2017 8:03 UTC
27 points
48 comments1 min readLW link

Selfishness, preference falsification, and AI alignment

jessicata28 Oct 2021 0:16 UTC
52 points
28 comments13 min readLW link
(unstableontology.com)

The case against “The case against AI alignment”

KvmanThinking19 Mar 2025 22:40 UTC
1 point
0 comments1 min readLW link

The Intrinsic Interplay of Human Values and Artificial Intelligence: Navigating the Optimization Challenge

Joe Kwon5 Jun 2023 20:41 UTC
2 points
1 comment18 min readLW link

Aligned Objectives Prize Competition

Prometheus15 Jun 2023 12:42 UTC
8 points
0 comments2 min readLW link
(app.impactmarkets.io)

Human values differ as much as values can differ

PhilGoetz3 May 2010 19:35 UTC
27 points
220 comments7 min readLW link

Reflection Mechanisms as an Alignment target: A survey

22 Jun 2022 15:05 UTC
32 points
1 comment14 min readLW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír30 Jan 2025 10:58 UTC
5 points
14 comments10 min readLW link
(tetherware.substack.com)

The Unified Theory of Normative Ethics

Thane Ruthenis17 Jun 2022 19:55 UTC
8 points
0 comments6 min readLW link

1. A Sense of Fairness: Deconfusing Ethics

RogerDearnaley17 Nov 2023 20:55 UTC
17 points
8 comments15 min readLW link

Don’t want Goodhart? — Specify the damn variables

Yan Lyutnev21 Nov 2024 22:45 UTC
−3 points
2 comments5 min readLW link

Questions about Value Lock-in, Paternalism, and Empowerment

Sam F. Brown16 Nov 2022 15:33 UTC
13 points
2 comments12 min readLW link
(sambrown.eu)

[Question] “Fragility of Value” vs. LLMs

Not Relevant13 Apr 2022 2:02 UTC
34 points
33 comments1 min readLW link

Antagonistic AI

Xybermancer1 Mar 2024 18:50 UTC
−8 points
1 comment1 min readLW link

A foundation model approach to value inference

sen21 Feb 2023 5:09 UTC
6 points
0 comments3 min readLW link

Value is Fragile

Eliezer Yudkowsky29 Jan 2009 8:46 UTC
175 points
109 comments6 min readLW link

Values Form a Shifting Landscape (and why you might care)

VojtaKovarik5 Dec 2020 23:56 UTC
29 points
6 comments4 min readLW link

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ank22 Feb 2025 0:12 UTC
1 point
0 comments6 min readLW link

Partial Identifiability in Reward Learning

Joar Skalse28 Feb 2025 19:23 UTC
16 points
0 comments12 min readLW link

The Paradox of Low Fertility

Zero Contradictions24 May 2025 0:59 UTC
−9 points
6 comments1 min readLW link
(expandingrationality.substack.com)

Group Prioritarianism: Why AI Should Not Replace Humanity [draft]

fsh15 Jun 2023 17:33 UTC
8 points
0 comments25 min readLW link

What’s wrong with simplicity of value?

Wei Dai27 Jul 2011 3:09 UTC
29 points
40 comments1 min readLW link

Wagering on Will And Worth (Pascals Wager for Free Will and Value)

Robert Cousineau27 Nov 2024 0:43 UTC
−1 points
2 comments3 min readLW link

If we can educate AIs, why not apply that education to people?

P. João22 Aug 2025 14:04 UTC
5 points
0 comments2 min readLW link

[Question] Is there any serious attempt to create a system to figure out the CEV of humanity and if not, why haven’t we started yet?

Jonas Hallgren25 Feb 2021 22:06 UTC
5 points
2 comments1 min readLW link

AGI will know: Humans are not Rational

HumaneAutomation20 Mar 2023 18:46 UTC
0 points
10 comments2 min readLW link

[Question] [DISC] Are Values Robust?

DragonGod21 Dec 2022 1:00 UTC
12 points
9 comments2 min readLW link

What does davidad want from «boundaries»?

6 Feb 2024 17:45 UTC
47 points
1 comment5 min readLW link

“Wanting” and “liking”

Mateusz Bagiński30 Aug 2023 14:52 UTC
23 points
3 comments29 min readLW link

Alignment via prosocial brain algorithms

Cameron Berg12 Sep 2022 13:48 UTC
45 points
30 comments6 min readLW link

Should Art Carry the Weight of Shaping our Values?

Krishna Maneesha Dendukuri28 Jan 2025 18:43 UTC
2 points
0 comments3 min readLW link

Is the Endowment Effect Due to Incomparability?

Kevin Dorst10 Jul 2023 16:26 UTC
21 points
10 comments7 min readLW link
(kevindorst.substack.com)

Don’t want Goodhart? — Specify the variables more

YanLyutnev21 Nov 2024 22:43 UTC
2 points
2 comments5 min readLW link

Converging toward a Million Worlds

Joe Kwon24 Dec 2021 21:33 UTC
11 points
1 comment3 min readLW link

This Is Not Life

samhealy28 Jul 2025 8:43 UTC
55 points
2 comments23 min readLW link

Relational Design Can’t Be Left to Chance

Priyanka Bharadwaj22 Jun 2025 15:32 UTC
5 points
0 comments3 min readLW link

The Digital Asymmetry: A Call for Memory-Less AI and Human Cognitive Evolution

Full-Embodied Chaos30 Apr 2025 18:19 UTC
1 point
0 comments4 min readLW link

Explanations as Building Blocks of Human Mind

pavi18 Oct 2024 21:38 UTC
1 point
0 comments1 min readLW link

My critique of Eliezer’s deeply irrational beliefs

Jorterder16 Nov 2023 0:34 UTC
−35 points
1 comment9 min readLW link
(docs.google.com)

Inescapably Value-Laden Experience—a Catchy Term I Made Up to Make Morality Rationalisable

James Stephen Brown19 Dec 2024 4:45 UTC
5 points
0 comments2 min readLW link
(nonzerosum.games)

Post AGI effect prediction

Juliezhanggg1 Feb 2025 21:16 UTC
1 point
0 comments7 min readLW link

Building as gardening

Itay Dreyfus5 Jun 2025 6:41 UTC
3 points
1 comment4 min readLW link
(productidentity.co)

Other Papers About the Theory of Reward Learning

Joar Skalse28 Feb 2025 19:26 UTC
16 points
0 comments5 min readLW link

Public Opinion on AI Safety: AIMS 2023 and 2021 Summary

25 Sep 2023 18:55 UTC
3 points
2 comments3 min readLW link
(www.sentienceinstitute.org)

Alien Axiology

snerx20 Apr 2023 0:27 UTC
3 points
2 comments5 min readLW link

Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well). Subtleties and Open Challenges.

Roland Pihlakas12 Jan 2025 3:37 UTC
47 points
7 comments12 min readLW link

The Perfection Trap: How Formally Aligned AI Systems May Create Inescapable Ethical Dystopias

Chris O'Quinn1 Jun 2025 23:12 UTC
1 point
0 comments43 min readLW link

Sequence overview: Welfare and moral weights

MichaelStJules15 Aug 2024 4:22 UTC
7 points
0 comments1 min readLW link

Taking nonlogical concepts seriously

Kris Brown15 Oct 2024 18:16 UTC
7 points
5 comments18 min readLW link
(topos.site)

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron Berg11 Feb 2022 22:23 UTC
5 points
1 comment10 min readLW link

Should Effective Altruists be Valuists instead of utilitarians?

25 Sep 2023 14:03 UTC
1 point
3 comments6 min readLW link

Preference synthesis illustrated: Star Wars

Stuart_Armstrong9 Jan 2020 16:47 UTC
20 points
8 comments3 min readLW link

Terminal Bias

[deleted]30 Jan 2012 21:03 UTC
24 points
125 comments6 min readLW link

First Certified Public Solve of Observer’s False Path Instability — Level 4 (Advanced Variant) — Walter Tarantelli — 2025-05-30 UTC

Walter Tarantelli31 May 2025 1:41 UTC
1 point
0 comments2 min readLW link

Research Notes: What are we aligning for?

Shoshannah Tekofsky8 Jul 2022 22:13 UTC
19 points
8 comments2 min readLW link

If we can educate AIs, why not apply that education to people? - A Simulation with Claude

P. João28 Aug 2025 16:37 UTC
3 points
0 comments7 min readLW link

Why No *Interesting* Unaligned Singularity?

David Udell20 Apr 2022 0:34 UTC
12 points
12 comments1 min readLW link

Human wanting

TsviBT24 Oct 2023 1:05 UTC
53 points
1 comment10 min readLW link

Defining and Characterising Reward Hacking

Joar Skalse28 Feb 2025 19:25 UTC
15 points
0 comments4 min readLW link

Nobody Asks the Monkey: Why Human Agency Matters in the AI Age

Miloš Borenović3 Dec 2024 14:16 UTC
1 point
0 comments2 min readLW link
(open.substack.com)

Looking for humanness in the world wide social

Itay Dreyfus15 Jan 2025 14:50 UTC
11 points
0 comments6 min readLW link
(productidentity.co)

Impossibility of Anthropocentric-Alignment

False Name24 Feb 2024 18:31 UTC
−8 points
2 comments39 min readLW link

2. AIs as Economic Agents

RogerDearnaley23 Nov 2023 7:07 UTC
9 points
2 comments6 min readLW link

To Raemon: bet in My (personal) Goals

P. João31 Aug 2025 15:48 UTC
3 points
0 comments3 min readLW link

The Alignment Problem No One Is Talking About

James Stephen Brown10 May 2024 18:34 UTC
10 points
10 comments2 min readLW link
(nonzerosum.games)

Language and My Frustration Continue in Our RSI

TristanTrim26 Mar 2025 14:13 UTC
2 points
1 comment7 min readLW link

The Dual-Path Framework: A Non-Paternalistic Approach to AGI Alignment That Respects Human Choice

JoeTruax2 Oct 2025 15:57 UTC
1 point
0 comments3 min readLW link

Information and Ethical Value: A Framework for Information Loss and Survival

yun dong12 Sep 2025 3:26 UTC
−1 points
0 comments6 min readLW link

Question 4: Implementing the control proposals

Cameron Berg13 Feb 2022 17:12 UTC
6 points
2 comments5 min readLW link

Are we the Wolves now? Human Eugenics under AI Control

Brit30 Jan 2025 8:31 UTC
−1 points
2 comments2 min readLW link

Broad Picture of Human Values

Thane Ruthenis20 Aug 2022 19:42 UTC
42 points
6 comments10 min readLW link

A Lived Alignment Loop: Symbolic Emergence and Emotional Coherence from Unstructured ChatGPT Reflection

BradCL17 Jun 2025 0:11 UTC
1 point
0 comments2 min readLW link

Inner Goodness

Eliezer Yudkowsky23 Oct 2008 22:19 UTC
27 points
31 comments7 min readLW link

Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs

Roland Pihlakas22 Jun 2025 18:16 UTC
17 points
0 comments7 min readLW link

A Critique of “Utility”

Zero Contradictions20 Mar 2025 23:21 UTC
−2 points
10 comments2 min readLW link
(thewaywardaxolotl.blogspot.com)

Sam Harris’s Argument For Objective Morality

Zero Contradictions5 Dec 2024 10:19 UTC
7 points
5 comments1 min readLW link
(thewaywardaxolotl.blogspot.com)

What are Humans, God or Devil?

Learning Elder27 Apr 2025 12:58 UTC
0 points
0 comments1 min readLW link

Shard Theory—is it true for humans?

Rishika14 Jun 2024 19:21 UTC
71 points
7 comments15 min readLW link

‘Theories of Values’ and ‘Theories of Agents’: confusions, musings and desiderata

15 Nov 2023 16:00 UTC
35 points
8 comments24 min readLW link

If I ran the zoo

Optimization Process5 Jan 2024 5:14 UTC
18 points
1 comment2 min readLW link

The Dark Side of Cognition Hypothesis

Cameron Berg3 Oct 2021 20:10 UTC
19 points
1 comment16 min readLW link

Quick thoughts on empathic metaethics

lukeprog12 Dec 2017 21:46 UTC
29 points
0 comments9 min readLW link

Everything you care about is in the map

Tahp17 Dec 2024 14:05 UTC
17 points
27 comments3 min readLW link

Value learning in the absence of ground truth

Joel_Saarinen5 Feb 2024 18:56 UTC
47 points
8 comments45 min readLW link

Why Death Makes Us Human

Yasha Sheynin26 Aug 2025 14:17 UTC
1 point
0 comments9 min readLW link

Complex Behavior from Simple (Sub)Agents

moridinamael10 May 2019 21:44 UTC
113 points
14 comments9 min readLW link1 review

Problems with Robin Hanson’s Quillette Article On AI

DaemonicSigil6 Aug 2023 22:13 UTC
89 points
33 comments8 min readLW link

Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)

16 Mar 2025 23:23 UTC
45 points
8 comments11 min readLW link

Everything You Want Is Learned (And That Changes Everything)

gchu18 Jun 2025 20:13 UTC
1 point
0 comments7 min readLW link