Outer Alignment

Last edit: Apr 15, 2025, 3:42 AM by Seth Herd

Outer alignment (also known as the reward misspecification problem) is the problem of specifying a reward function that captures human preferences. Outer alignment asks the question: “What will we train our model to do?” Note that this is meant in the narrow technical sense of selecting a reward function; wisely choosing a training target is a separate issue (see the list of posts below).

As a problem, outer alignment is intuitive enough to state: is the specified loss function aligned with the intended goal of its designers? Achieving this in practice, however, is extremely difficult. Conveying the full “intention” behind a human request amounts to conveying the sum of human values and ethics, which is difficult in part because human intentions are themselves not well understood. Additionally, since most models are built as goal optimizers, they are susceptible to Goodhart’s Law: under sufficient optimization pressure, even a goal that looks well specified to humans can lead to negative consequences we failed to foresee.
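
For intuition, here is a minimal toy sketch of that failure mode. Everything in it (the function shapes, the numbers, the idea of a “search budget” standing in for optimization pressure) is an illustrative assumption, not taken from any of the posts listed below:

```python
# Toy illustration of Goodhart's Law under a misspecified reward.
# The designers care about true_utility, but the system optimizes
# proxy_reward, which omits a side effect that grows with x.
import numpy as np

def true_utility(x):
    # What the designers actually want: a benefit minus a side effect
    # that the reward designer forgot to penalize.
    return x - 0.1 * x**2

def proxy_reward(x):
    # What actually gets optimized: the benefit term alone.
    return x

# Apply increasing "optimization pressure" by searching a wider range
# of actions for the proxy-optimal one.
for search_budget in [1, 5, 10, 50, 100]:
    candidates = np.linspace(0, search_budget, 1000)
    best = candidates[np.argmax(proxy_reward(candidates))]
    print(f"budget={search_budget:3d}  proxy={proxy_reward(best):6.1f}  "
          f"true utility={true_utility(best):8.1f}")

# With a small search budget the proxy and the true objective agree;
# with a large budget the proxy-optimal action drives true utility negative.
```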

To solve the outer alignment problem, we would need to make progress on sub-problems such as specification gaming, value learning, and reward shaping/modeling. Proposed approaches include scalable oversight techniques such as IDA, as well as adversarial oversight techniques such as debate.
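
One of these sub-problems, reward modeling, can be made concrete with a small sketch. The toy NumPy example below (all names, dimensions, and numbers are assumptions for illustration, not drawn from the posts listed below) fits a linear reward model to noisy pairwise preference comparisons, the basic Bradley-Terry setup used in RLHF-style pipelines. The point is that the learned reward is only an estimate of what the human wanted, which is exactly where misspecification can enter:

```python
# Minimal sketch of reward modeling from pairwise preferences (Bradley-Terry).
import numpy as np

rng = np.random.default_rng(0)

# Pretend each trajectory is summarized by a feature vector; the "human"
# secretly prefers trajectories according to hidden weights w_true.
dim, n_pairs = 4, 500
w_true = rng.normal(size=dim)
A = rng.normal(size=(n_pairs, dim))   # features of option A in each comparison
B = rng.normal(size=(n_pairs, dim))   # features of option B
# Label = 1 if the human prefers A over B (noisy, via a logistic choice model).
p_prefer_A = 1 / (1 + np.exp(-(A @ w_true - B @ w_true)))
labels = (rng.random(n_pairs) < p_prefer_A).astype(float)

# Fit a linear reward model r(x) = w @ x by gradient ascent on the
# Bradley-Terry log-likelihood of the observed comparisons.
w = np.zeros(dim)
lr = 0.1
for _ in range(2000):
    logits = (A - B) @ w
    probs = 1 / (1 + np.exp(-logits))
    grad = (A - B).T @ (labels - probs) / n_pairs
    w += lr * grad

print("cosine similarity to the hidden preference weights:",
      w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true)))
```

Even in this idealized setting the recovered weights only approximate the hidden preferences; with richer environments and limited human feedback, the gap between the learned reward and the intended one is what outer alignment work tries to close.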

Outer Alignment vs. Inner Alignment

Outer alignment is often taken to be separate from the inner alignment problem, which asks: how can we robustly aim our AI optimizers at any objective function at all?

Keep in mind that inner and outer alignment failures can occur together; they are not a strict dichotomy, and even experienced alignment researchers often cannot tell them apart, which indicates that classifying failures with these terms is fuzzy. Ideally, rather than treating inner and outer alignment as a binary split to be tackled separately, we should think of a more holistic alignment picture that includes the interplay between inner and outer alignment approaches.

Risks from Learned Op­ti­miza­tion: Introduction

May 31, 2019, 11:44 PM
187 points
42 comments12 min readLW link3 reviews

6. The Mutable Values Prob­lem in Value Learn­ing and CEV

RogerDearnaleyDec 4, 2023, 6:31 PM
12 points
0 comments49 min readLW link

Align­ment has a Basin of At­trac­tion: Beyond the Orthog­o­nal­ity Thesis

RogerDearnaleyFeb 1, 2024, 9:15 PM
16 points
15 comments13 min readLW link

Another (outer) al­ign­ment failure story

paulfchristianoApr 7, 2021, 8:12 PM
248 points
38 comments12 min readLW link1 review

Truth­ful LMs as a warm-up for al­igned AGI

Jacob_HiltonJan 17, 2022, 4:49 PM
65 points
14 comments13 min readLW link

LOVE in a sim­box is all you need

jacob_cannellSep 28, 2022, 6:25 PM
66 points
73 comments44 min readLW link1 review

Re­quire­ments for a STEM-ca­pa­ble AGI Value Learner (my Case for Less Doom)

RogerDearnaleyMay 25, 2023, 9:26 AM
33 points
3 comments15 min readLW link

Gaia Net­work: a prac­ti­cal, in­cre­men­tal path­way to Open Agency Architecture

Dec 20, 2023, 5:11 PM
22 points
8 comments16 min readLW link

De­bate up­date: Obfus­cated ar­gu­ments problem

Beth BarnesDec 23, 2020, 3:24 AM
136 points
24 comments16 min readLW link

Outer vs in­ner mis­al­ign­ment: three framings

Richard_NgoJul 6, 2022, 7:46 PM
51 points
5 comments9 min readLW link

Book re­view: “A Thou­sand Brains” by Jeff Hawkins

Steven ByrnesMar 4, 2021, 5:10 AM
122 points
18 comments19 min readLW link

Main­tain­ing Align­ment dur­ing RSI as a Feed­back Con­trol Problem

berenMar 2, 2025, 12:21 AM
66 points
6 comments11 min readLW link

List of re­solved con­fu­sions about IDA

Wei DaiSep 30, 2019, 8:03 PM
97 points
18 comments3 min readLW link

My AGI Threat Model: Misal­igned Model-Based RL Agent

Steven ByrnesMar 25, 2021, 1:45 PM
74 points
40 comments16 min readLW link

“In­ner Align­ment Failures” Which Are Ac­tu­ally Outer Align­ment Failures

johnswentworthOct 31, 2020, 8:18 PM
66 points
38 comments5 min readLW link

Re­ward is not the op­ti­miza­tion target

TurnTroutJul 25, 2022, 12:03 AM
376 points
123 comments10 min readLW link3 reviews

AI Align­ment 2018-19 Review

Rohin ShahJan 28, 2020, 2:19 AM
126 points
6 comments35 min readLW link

If I were a well-in­ten­tioned AI… III: Ex­tremal Goodhart

Stuart_ArmstrongFeb 28, 2020, 11:24 AM
22 points
0 comments5 min readLW link

[Question] What if Ethics is Prov­ably Self-Con­tra­dic­tory?

YitzApr 18, 2024, 5:12 AM
3 points
7 comments2 min readLW link

Wor­ri­some mi­s­un­der­stand­ing of the core is­sues with AI transition

Roman LeventovJan 18, 2024, 10:05 AM
5 points
2 comments4 min readLW link

Outer al­ign­ment and imi­ta­tive amplification

evhubJan 10, 2020, 12:26 AM
24 points
11 comments9 min readLW link

How do new mod­els from OpenAI, Deep­Mind and An­thropic perform on Truth­fulQA?

Owain_EvansFeb 26, 2022, 12:46 PM
44 points
3 comments11 min readLW link

nos­talge­braist: Re­cur­sive Good­hart’s Law

Kaj_SotalaAug 26, 2020, 11:07 AM
53 points
27 comments1 min readLW link
(nostalgebraist.tumblr.com)

[Linkpost] In­tro­duc­ing Superalignment

berenJul 5, 2023, 6:23 PM
175 points
69 comments1 min readLW link
(openai.com)

Don’t al­ign agents to eval­u­a­tions of plans

TurnTroutNov 26, 2022, 9:16 PM
48 points
49 comments18 min readLW link

On the Con­fu­sion be­tween In­ner and Outer Misalignment

Chris_LeongMar 25, 2024, 11:59 AM
17 points
10 comments1 min readLW link

If I were a well-in­ten­tioned AI… II: Act­ing in a world

Stuart_ArmstrongFeb 27, 2020, 11:58 AM
20 points
0 comments3 min readLW link

AI al­ign­ment as a trans­la­tion problem

Roman LeventovFeb 5, 2024, 2:14 PM
22 points
2 comments3 min readLW link

In­fer­ence-Only De­bate Ex­per­i­ments Us­ing Math Problems

Aug 6, 2024, 5:44 PM
31 points
0 comments2 min readLW link

Shard The­ory: An Overview

David UdellAug 11, 2022, 5:44 AM
166 points
34 comments10 min readLW link

Four us­ages of “loss” in AI

TurnTroutOct 2, 2022, 12:52 AM
46 points
18 comments4 min readLW link

MIRI com­ments on Co­tra’s “Case for Align­ing Nar­rowly Su­per­hu­man Models”

Rob BensingerMar 5, 2021, 11:43 PM
142 points
13 comments26 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven ByrnesJul 10, 2020, 4:49 PM
45 points
7 comments8 min readLW link

Simulators

janusSep 2, 2022, 12:45 PM
633 points
168 comments41 min readLW link8 reviews
(generative.ink)

Eval­u­at­ing the his­tor­i­cal value mis­speci­fi­ca­tion argument

Matthew BarnettOct 5, 2023, 6:34 PM
190 points
162 comments7 min readLW link3 reviews

AXRP Epi­sode 12 - AI Ex­is­ten­tial Risk with Paul Christiano

DanielFilanDec 2, 2021, 2:20 AM
38 points
0 comments126 min readLW link

[In­tro to brain-like-AGI safety] 10. The al­ign­ment problem

Steven ByrnesMar 30, 2022, 1:24 PM
52 points
7 comments21 min readLW link

In­ner and outer al­ign­ment de­com­pose one hard prob­lem into two ex­tremely hard problems

TurnTroutDec 2, 2022, 2:43 AM
149 points
22 comments47 min readLW link3 reviews

Why “AI al­ign­ment” would bet­ter be re­named into “Ar­tifi­cial In­ten­tion re­search”

chaosmageJun 15, 2023, 10:32 AM
29 points
12 comments2 min readLW link

Lan­guage Agents Re­duce the Risk of Ex­is­ten­tial Catastrophe

May 28, 2023, 7:10 PM
39 points
14 comments26 min readLW link

If I were a well-in­ten­tioned AI… I: Image classifier

Stuart_ArmstrongFeb 26, 2020, 12:39 PM
35 points
4 comments5 min readLW link

Naive Hy­pothe­ses on AI Alignment

Shoshannah TekofskyJul 2, 2022, 7:03 PM
98 points
29 comments5 min readLW link

Paper: Con­sti­tu­tional AI: Harm­less­ness from AI Feed­back (An­thropic)

LawrenceCDec 16, 2022, 10:12 PM
68 points
11 comments1 min readLW link
(www.anthropic.com)

Cat­e­go­riz­ing failures as “outer” or “in­ner” mis­al­ign­ment is of­ten confused

Rohin ShahJan 6, 2023, 3:48 PM
93 points
21 comments8 min readLW link

Is the Star Trek Fed­er­a­tion re­ally in­ca­pable of build­ing AI?

Kaj_SotalaMar 18, 2018, 10:30 AM
19 points
4 comments2 min readLW link
(kajsotala.fi)

Hu­man Mimicry Mainly Works When We’re Already Close

johnswentworthAug 17, 2022, 6:41 PM
82 points
16 comments5 min readLW link

Learn­ing so­cietal val­ues from law as part of an AGI al­ign­ment strategy

John NayOct 21, 2022, 2:03 AM
5 points
18 comments54 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel NandaDec 15, 2021, 11:44 PM
127 points
9 comments15 min readLW link

[Aspira­tion-based de­signs] 1. In­for­mal in­tro­duc­tion

Apr 28, 2024, 1:00 PM
44 points
4 comments8 min readLW link

Align­ment as Game Design

Shoshannah TekofskyJul 16, 2022, 10:36 PM
11 points
7 comments2 min readLW link

25 Min Talk on Me­taEth­i­cal.AI with Ques­tions from Stu­art Armstrong

June KuApr 29, 2021, 3:38 PM
21 points
7 comments1 min readLW link

(Hu­mor) AI Align­ment Crit­i­cal Failure Table

Kaj_SotalaAug 31, 2020, 7:51 PM
24 points
2 comments1 min readLW link
(sl4.org)

Selec­tion The­o­rems: A Pro­gram For Un­der­stand­ing Agents

johnswentworthSep 28, 2021, 5:03 AM
128 points
28 comments6 min readLW link2 reviews

Ques­tion 2: Pre­dicted bad out­comes of AGI learn­ing architecture

Cameron BergFeb 11, 2022, 10:23 PM
5 points
1 comment10 min readLW link

An­nounc­ing the Align­ment of Com­plex Sys­tems Re­search Group

Jun 4, 2022, 4:10 AM
91 points
20 comments5 min readLW link

Prefer­ence Ag­gre­ga­tion as Bayesian Inference

berenJul 27, 2023, 5:59 PM
14 points
1 comment1 min readLW link

The True Story of How GPT-2 Be­came Max­i­mally Lewd

Jan 18, 2024, 9:03 PM
70 points
7 comments6 min readLW link
(youtu.be)

Con­cept Safety: Pro­duc­ing similar AI-hu­man con­cept spaces

Kaj_SotalaApr 14, 2015, 8:39 PM
51 points
45 comments8 min readLW link

[ASoT] Some thoughts about im­perfect world modeling

leogaoApr 7, 2022, 3:42 PM
7 points
0 comments4 min readLW link

An overview of 11 pro­pos­als for build­ing safe ad­vanced AI

evhubMay 29, 2020, 8:38 PM
220 points
37 comments38 min readLW link2 reviews

Con­fused why a “ca­pa­bil­ities re­search is good for al­ign­ment progress” po­si­tion isn’t dis­cussed more

Kaj_SotalaJun 2, 2022, 9:41 PM
130 points
27 comments4 min readLW link

Speci­fi­ca­tion Gam­ing: How AI Can Turn Your Wishes Against You [RA Video]

WriterDec 1, 2023, 7:30 PM
19 points
0 comments5 min readLW link
(youtu.be)

Align­ment al­lows “non­ro­bust” de­ci­sion-in­fluences and doesn’t re­quire ro­bust grading

TurnTroutNov 29, 2022, 6:23 AM
62 points
41 comments15 min readLW link

Epistemic states as a po­ten­tial be­nign prior

Tamsin LeakeAug 31, 2024, 6:26 PM
31 points
2 comments8 min readLW link
(carado.moe)

[Question] Col­lec­tion of ar­gu­ments to ex­pect (outer and in­ner) al­ign­ment failure?

Sam ClarkeSep 28, 2021, 4:55 PM
21 points
10 comments1 min readLW link

My Overview of the AI Align­ment Land­scape: Threat Models

Neel NandaDec 25, 2021, 11:07 PM
53 points
3 comments28 min readLW link

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus AstraJul 1, 2020, 5:30 PM
35 points
4 comments67 min readLW link

Some of my dis­agree­ments with List of Lethalities

TurnTroutJan 24, 2023, 12:25 AM
70 points
7 comments10 min readLW link

Men­tal sub­agent im­pli­ca­tions for AI Safety

moridinamaelJan 3, 2021, 6:59 PM
11 points
0 comments3 min readLW link

The Com­pu­ta­tional Anatomy of Hu­man Values

berenApr 6, 2023, 10:33 AM
74 points
30 comments30 min readLW link

The Prefer­ence Fulfill­ment Hypothesis

Kaj_SotalaFeb 26, 2023, 10:55 AM
66 points
62 comments11 min readLW link

The Disas­trously Con­fi­dent And Inac­cu­rate AI

Sharat Jacob JacobNov 18, 2022, 7:06 PM
13 points
0 comments13 min readLW link

Three Min­i­mum Pivotal Acts Pos­si­ble by Nar­row AI

Michael SoareverixJul 12, 2022, 9:51 AM
0 points
4 comments2 min readLW link

AI Alter­na­tive Fu­tures: Sce­nario Map­ping Ar­tifi­cial In­tel­li­gence Risk—Re­quest for Par­ti­ci­pa­tion (*Closed*)

KakiliApr 27, 2022, 10:07 PM
10 points
2 comments8 min readLW link

In­fer­ence from a Math­e­mat­i­cal De­scrip­tion of an Ex­ist­ing Align­ment Re­search: a pro­posal for an outer al­ign­ment re­search program

Christopher KingJun 2, 2023, 9:54 PM
7 points
4 comments16 min readLW link

On pre­dictabil­ity, chaos and AIs that don’t game our goals

Alejandro TlaieJul 15, 2024, 5:16 PM
4 points
8 comments6 min readLW link

VLM-RM: Spec­i­fy­ing Re­wards with Nat­u­ral Language

Oct 23, 2023, 2:11 PM
20 points
2 comments5 min readLW link
(far.ai)

Just How Hard a Prob­lem is Align­ment?

Roger DearnaleyFeb 25, 2023, 9:00 AM
3 points
1 comment21 min readLW link

My preferred fram­ings for re­ward mis­speci­fi­ca­tion and goal misgeneralisation

Yi-YangMay 6, 2023, 4:48 AM
27 points
1 comment8 min readLW link

Learn­ing the smooth prior

Apr 29, 2022, 9:10 PM
35 points
0 comments12 min readLW link

Re­search Notes: What are we al­ign­ing for?

Shoshannah TekofskyJul 8, 2022, 10:13 PM
19 points
8 comments2 min readLW link

find_pur­pose.exe

heatdeathandtaxesApr 12, 2025, 7:31 PM
−1 points
0 comments5 min readLW link
(heatdeathandtaxes.substack.com)

A Case for AI Safety via Law

JWJohnstonSep 11, 2023, 6:26 PM
18 points
12 comments4 min readLW link

A first suc­cess story for Outer Align­ment: In­struc­tGPT

Noosphere89Nov 8, 2022, 10:52 PM
6 points
1 comment1 min readLW link
(openai.com)

Lev­er­ag­ing Le­gal In­for­mat­ics to Align AI

John NaySep 18, 2022, 8:39 PM
11 points
0 comments3 min readLW link
(forum.effectivealtruism.org)

Align­ment via man­u­ally im­ple­ment­ing the util­ity function

ChantielSep 7, 2021, 8:20 PM
1 point
6 comments2 min readLW link

If Align­ment is Hard, then so is Self-Improvement

PavleMihaApr 7, 2023, 12:08 AM
21 points
20 comments1 min readLW link

Horn’s Chain: A Func­tional An­swer to the Hard Prob­lem of Consciousness

GalileoApr 18, 2025, 1:53 AM
1 point
0 comments11 min readLW link

Archety­pal Trans­fer Learn­ing: a Pro­posed Align­ment Solu­tion that solves the In­ner & Outer Align­ment Prob­lem while adding Cor­rigible Traits to GPT-2-medium

MiguelDevApr 26, 2023, 1:37 AM
14 points
5 comments10 min readLW link

Places of Lov­ing Grace [Story]

ankFeb 18, 2025, 11:49 PM
−1 points
0 comments4 min readLW link

Map­ping AI Ar­chi­tec­tures to Align­ment At­trac­tors: A SIEM-Based Framework

silentrevolutionsApr 12, 2025, 5:50 PM
1 point
0 comments1 min readLW link

In­duc­ing hu­man-like bi­ases in moral rea­son­ing LMs

Feb 20, 2024, 4:28 PM
23 points
3 comments14 min readLW link

[Question] Pop­u­lar ma­te­ri­als about en­vi­ron­men­tal goals/​agent foun­da­tions? Peo­ple want­ing to dis­cuss such top­ics?

Q HomeJan 22, 2025, 3:30 AM
5 points
0 comments1 min readLW link

“De­sign­ing agent in­cen­tives to avoid re­ward tam­per­ing”, DeepMind

gwernAug 14, 2019, 4:57 PM
28 points
15 comments1 min readLW link
(medium.com)

In­suffi­cient Values

Jun 16, 2021, 2:33 PM
31 points
16 comments6 min readLW link

I Recom­mend More Train­ing Rationales

Gianluca CalcagniDec 31, 2024, 2:06 PM
2 points
0 comments6 min readLW link

A sim­ple way to make GPT-3 fol­low instructions

Quintin PopeMar 8, 2021, 2:57 AM
11 points
5 comments4 min readLW link

Demo­cratic Fine-Tuning

Joe EdelmanAug 29, 2023, 6:13 PM
22 points
2 comments1 min readLW link
(open.substack.com)

Mak­ing it harder for an AGI to “trick” us, with STVs

Tor Økland BarstadJul 9, 2022, 2:42 PM
15 points
5 comments22 min readLW link

Why Re­cur­sive Self-Im­prove­ment Might Not Be the Ex­is­ten­tial Risk We Fear

Nassim_ANov 24, 2024, 5:17 PM
1 point
0 comments9 min readLW link

Align­ment with ar­gu­ment-net­works and as­sess­ment-predictions

Tor Økland BarstadDec 13, 2022, 2:17 AM
10 points
5 comments45 min readLW link

Thoughts on the Align­ment Im­pli­ca­tions of Scal­ing Lan­guage Models

leogaoJun 2, 2021, 9:32 PM
82 points
11 comments17 min readLW link

Co­op­er­a­tive Game Theory

TakkJun 7, 2023, 5:41 PM
1 point
0 comments1 min readLW link

The Goal Mis­gen­er­al­iza­tion Problem

MyspyMay 18, 2023, 11:40 PM
1 point
0 comments1 min readLW link
(drive.google.com)

“Sorcerer’s Ap­pren­tice” from Fan­ta­sia as an anal­ogy for alignment

awgMar 29, 2023, 6:21 PM
9 points
4 comments1 min readLW link
(video.disney.com)

Pro­posal: Tune LLMs to Use Cal­ibrated Language

OneManyNoneJun 7, 2023, 9:05 PM
9 points
0 comments5 min readLW link

Par­tial Si­mu­la­tion Ex­trap­o­la­tion: A Pro­posal for Build­ing Safer Simulators

lukemarksJun 17, 2023, 1:55 PM
16 points
0 comments10 min readLW link

Disen­tan­gling Shard The­ory into Atomic Claims

Leon LangJan 13, 2023, 4:23 AM
86 points
6 comments18 min readLW link

For­mal­iz­ing «Boundaries» with Markov blankets

ChipmonkSep 19, 2023, 9:01 PM
21 points
20 comments3 min readLW link

Cor­rigi­bil­ity or DWIM is an at­trac­tive pri­mary goal for AGI

Seth HerdNov 25, 2023, 7:37 PM
16 points
4 comments1 min readLW link

Em­pa­thy as a nat­u­ral con­se­quence of learnt re­ward models

berenFeb 4, 2023, 3:35 PM
48 points
27 comments13 min readLW link

[Question] Will re­search in AI risk jinx it? Con­se­quences of train­ing AI on AI risk arguments

Yann DuboisDec 19, 2022, 10:42 PM
5 points
6 comments1 min readLW link

When the Model Starts Talk­ing Like Me: A User-In­duced Struc­tural Adap­ta­tion Case Study

JunxiApr 19, 2025, 7:40 PM
3 points
1 comment4 min readLW link

Would this solve the (outer) al­ign­ment prob­lem, or at least help?

Wes RApr 6, 2025, 6:49 PM
−2 points
1 comment13 min readLW link

Ra­tion­al­ity vs Alignment

Donatas LučiūnasJul 7, 2024, 10:12 AM
−14 points
14 comments2 min readLW link

How I’d like al­ign­ment to get done (as of 2024-10-18)

TristanTrimOct 18, 2024, 11:39 PM
11 points
4 comments4 min readLW link

Early situ­a­tional aware­ness and its im­pli­ca­tions, a story

Jacob PfauFeb 6, 2023, 8:45 PM
29 points
6 comments3 min readLW link

AGI is un­con­trol­lable, al­ign­ment is impossible

Donatas LučiūnasMar 19, 2023, 5:49 PM
−12 points
21 comments1 min readLW link

Are ex­trap­o­la­tion-based AIs al­ignable?

cousin_itMar 24, 2023, 3:55 PM
24 points
15 comments1 min readLW link

Ques­tions about Value Lock-in, Pa­ter­nal­ism, and Empowerment

Sam F. BrownNov 16, 2022, 3:33 PM
13 points
2 comments12 min readLW link
(sambrown.eu)

En­hanc­ing Cor­rigi­bil­ity in AI Sys­tems through Ro­bust Feed­back Loops

JustausernameAug 24, 2023, 3:53 AM
1 point
0 comments6 min readLW link

[Question] Op­ti­miz­ing for Agency?

Michael SoareverixFeb 14, 2024, 8:31 AM
10 points
9 comments2 min readLW link

Up­dat­ing Utility Functions

May 9, 2022, 9:44 AM
41 points
6 comments8 min readLW link

Levels of goals and alignment

zeshenSep 16, 2022, 4:44 PM
27 points
4 comments6 min readLW link

Thin Align­ment Can’t Solve Thick Problems

Daan HenselmansApr 27, 2025, 10:42 PM
11 points
2 comments9 min readLW link

Disprov­ing the “Peo­ple-Pleas­ing” Hy­poth­e­sis for AI Self-Re­ports of Experience

rifeJan 26, 2025, 3:53 PM
3 points
18 comments12 min readLW link

Threat Model Liter­a­ture Review

Nov 1, 2022, 11:03 AM
78 points
4 comments25 min readLW link

[Question] Does hu­man (mis)al­ign­ment pose a sig­nifi­cant and im­mi­nent ex­is­ten­tial threat?

jrFeb 23, 2025, 10:03 AM
6 points
3 comments1 min readLW link

Open-ended ethics of phe­nom­ena (a desider­ata with uni­ver­sal moral­ity)

Ryo Nov 8, 2023, 8:10 PM
1 point
0 comments8 min readLW link

Con­di­tion­ing Gen­er­a­tive Models for Alignment

JozdienJul 18, 2022, 7:11 AM
60 points
8 comments20 min readLW link

PRISM: Per­spec­tive Rea­son­ing for In­te­grated Syn­the­sis and Me­di­a­tion (In­ter­ac­tive Demo)

Anthony DiamondMar 18, 2025, 6:03 PM
10 points
2 comments1 min readLW link

Fram­ing AI Childhoods

David UdellSep 6, 2022, 11:40 PM
37 points
8 comments4 min readLW link

7. Evolu­tion and Ethics

RogerDearnaleyFeb 15, 2024, 11:38 PM
3 points
6 comments6 min readLW link

Be­hav­ior Clon­ing is Miscalibrated

leogaoDec 5, 2021, 1:36 AM
78 points
3 comments3 min readLW link

The Road to Evil Is Paved with Good Ob­jec­tives: Frame­work to Clas­sify and Fix Misal­ign­ments.

ShivamJan 30, 2025, 2:44 AM
1 point
0 comments11 min readLW link

H-JEPA might be tech­ni­cally al­ignable in a mod­ified form

Roman LeventovMay 8, 2023, 11:04 PM
12 points
2 comments7 min readLW link

No-self as an al­ign­ment target

Milan WMay 13, 2025, 1:48 AM
33 points
5 comments1 min readLW link

An In­creas­ingly Ma­nipu­la­tive Newsfeed

Michaël TrazziJul 1, 2019, 3:26 PM
63 points
16 comments5 min readLW link

Re­quest for ad­vice: Re­search for Con­ver­sa­tional Game The­ory for LLMs

Rome ViharoOct 16, 2024, 5:53 PM
10 points
0 comments1 min readLW link

Align­ing an H-JEPA agent via train­ing on the out­puts of an LLM-based “ex­em­plary ac­tor”

Roman LeventovMay 29, 2023, 11:08 AM
12 points
10 comments30 min readLW link

Break­ing the Op­ti­mizer’s Curse, and Con­se­quences for Ex­is­ten­tial Risks and Value Learning

Roger DearnaleyFeb 21, 2023, 9:05 AM
10 points
1 comment23 min readLW link

Align­ment As A Bot­tle­neck To Use­ful­ness Of GPT-3

johnswentworthJul 21, 2020, 8:02 PM
111 points
57 comments3 min readLW link

[Question] Don’t you think RLHF solves outer al­ign­ment?

Charbel-RaphaëlNov 4, 2022, 12:36 AM
9 points
23 comments1 min readLW link

“Pick Two” AI Trilemma: Gen­er­al­ity, Agency, Align­ment.

Black FlagJan 15, 2025, 6:52 PM
7 points
0 comments2 min readLW link

Con­tex­tual Con­sti­tu­tional AI

aksh-nSep 28, 2024, 11:24 PM
13 points
2 comments12 min readLW link

[Question] Daisy-chain­ing ep­silon-step verifiers

DecaeneusApr 6, 2023, 2:07 AM
2 points
1 comment1 min readLW link

The The­o­ret­i­cal Re­ward Learn­ing Re­search Agenda: In­tro­duc­tion and Motivation

Joar SkalseFeb 28, 2025, 7:20 PM
25 points
4 comments14 min readLW link

On the Im­por­tance of Open Sourc­ing Re­ward Models

elandgreJan 2, 2023, 7:01 PM
18 points
5 comments6 min readLW link

Embed­ding Eth­i­cal Pri­ors into AI Sys­tems: A Bayesian Approach

JustausernameAug 3, 2023, 3:31 PM
−5 points
3 comments21 min readLW link

Break­ing down the MEAT of Alignment

JasonBrownApr 7, 2025, 8:47 AM
7 points
2 comments11 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tameraAug 3, 2022, 12:03 PM
136 points
23 comments6 min readLW link

Tether­ware #1: The case for hu­man­like AI with free will

Jáchym FibírJan 30, 2025, 10:58 AM
5 points
14 comments10 min readLW link
(tetherware.substack.com)

A pos­i­tive case for how we might suc­ceed at pro­saic AI alignment

evhubNov 16, 2021, 1:49 AM
81 points
46 comments6 min readLW link

Gaia Net­work: An Illus­trated Primer

Jan 18, 2024, 6:23 PM
3 points
2 comments15 min readLW link

Re­cre­at­ing the car­ing drive

CatneeSep 7, 2023, 10:41 AM
43 points
15 comments10 min readLW link1 review

If you’re very op­ti­mistic about ELK then you should be op­ti­mistic about outer alignment

Sam MarksApr 27, 2022, 7:30 PM
17 points
8 comments3 min readLW link

Free­dom Is All We Need

Leo GlisicApr 27, 2023, 12:09 AM
−1 points
8 comments10 min readLW link

Call for re­search on eval­u­at­ing al­ign­ment (fund­ing + ad­vice available)

Beth BarnesAug 31, 2021, 11:28 PM
105 points
11 comments5 min readLW link

Align­ment is not intelligent

Donatas LučiūnasNov 25, 2024, 6:59 AM
−23 points
18 comments5 min readLW link

In­ner al­ign­ment: what are we point­ing at?

lemonhopeSep 18, 2022, 11:09 AM
14 points
2 comments1 min readLW link

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. MurphyMay 12, 2022, 8:01 PM
58 points
0 comments59 min readLW link

Con­di­tion­ing Gen­er­a­tive Models with Restrictions

Adam JermynJul 21, 2022, 8:33 PM
18 points
4 comments8 min readLW link

The Lin­guis­tic Blind Spot of Value-Aligned Agency, Nat­u­ral and Ar­tifi­cial

Roman LeventovFeb 14, 2023, 6:57 AM
6 points
0 comments2 min readLW link
(arxiv.org)

Ex­ter­mi­nat­ing hu­mans might be on the to-do list of a Friendly AI

RomanSDec 7, 2021, 2:15 PM
5 points
8 comments2 min readLW link

In­vi­ta­tion to the Prince­ton AI Align­ment and Safety Seminar

Sadhika MalladiMar 17, 2024, 1:10 AM
6 points
1 comment1 min readLW link

CCS: Coun­ter­fac­tual Civ­i­liza­tion Simulation

MorphismMay 2, 2024, 10:54 PM
3 points
0 comments2 min readLW link

AI Align­ment: A Com­pre­hen­sive Survey

Stephen McAleerNov 1, 2023, 5:35 PM
20 points
1 comment1 min readLW link
(arxiv.org)

Will AI and Hu­man­ity Go to War?

Simon GoldsteinOct 1, 2024, 6:35 AM
9 points
4 comments6 min readLW link

The AGI needs to be honest

rokosbasiliskOct 16, 2021, 7:24 PM
2 points
11 comments2 min readLW link

A sin­gle prin­ci­ple re­lated to many Align­ment sub­prob­lems?

Q HomeApr 30, 2025, 9:49 AM
34 points
5 comments16 min readLW link

Re­in­force­ment Learn­ing from In­for­ma­tion Bazaar Feed­back, and other uses of in­for­ma­tion markets

Abhimanyu Pallavi SudhirSep 16, 2024, 1:04 AM
5 points
1 comment5 min readLW link

Com­po­si­tional prefer­ence mod­els for al­ign­ing LMs

Tomek KorbakOct 25, 2023, 12:17 PM
18 points
2 comments5 min readLW link

Ter­mi­nal goal vs Intelligence

Donatas LučiūnasDec 26, 2024, 8:10 AM
−12 points
24 comments1 min readLW link

Aligned AI as a wrap­per around an LLM

cousin_itMar 25, 2023, 3:58 PM
31 points
19 comments1 min readLW link

Achiev­ing AI Align­ment through De­liber­ate Uncer­tainty in Mul­ti­a­gent Systems

Florian_DietzFeb 17, 2024, 8:45 AM
4 points
0 comments13 min readLW link

What Should AI Owe To Us? Ac­countable and Aligned AI Sys­tems via Con­trac­tu­al­ist AI Alignment

xuanSep 8, 2022, 3:04 PM
26 points
16 comments25 min readLW link

Pre­train­ing Lan­guage Models with Hu­man Preferences

Feb 21, 2023, 5:57 PM
135 points
20 comments11 min readLW link2 reviews

Sup­ple­men­tary Align­ment In­sights Through a Highly Con­trol­led Shut­down Incentive

JustausernameJul 23, 2023, 4:08 PM
4 points
1 comment3 min readLW link

Why de­cep­tive al­ign­ment mat­ters for AGI safety

Marius HobbhahnSep 15, 2022, 1:38 PM
68 points
13 comments13 min readLW link

Model Integrity

Dec 6, 2024, 9:28 PM
4 points
1 comment18 min readLW link

Thoughts on the Fea­si­bil­ity of Pro­saic AGI Align­ment?

iamthouthouartiAug 21, 2020, 11:25 PM
8 points
10 comments1 min readLW link

How will we up­date about schem­ing?

ryan_greenblattJan 6, 2025, 8:21 PM
171 points
20 comments36 min readLW link

Ar­tifi­cial Static Place In­tel­li­gence: Guaran­teed Alignment

ankFeb 15, 2025, 11:08 AM
2 points
2 comments2 min readLW link

A Univer­sal Prompt as a Safe­guard Against AI Threats

Zhaiyk SultanMar 10, 2025, 2:28 AM
1 point
0 comments2 min readLW link

Clar­ify­ing AI X-risk

Nov 1, 2022, 11:03 AM
127 points
24 comments4 min readLW link1 review

Unal­igned AGI & Brief His­tory of Inequality

ankFeb 22, 2025, 4:26 PM
−20 points
4 comments7 min readLW link

The Align­ment Problems

Martín SotoJan 12, 2023, 10:29 PM
20 points
0 comments4 min readLW link

Pro­ject In­tro: Selec­tion The­o­rems for Modularity

Apr 4, 2022, 12:59 PM
73 points
20 comments16 min readLW link

Con­di­tion­ing, Prompts, and Fine-Tuning

Adam JermynAug 17, 2022, 8:52 PM
38 points
9 comments4 min readLW link

Con­trol­ling In­tel­li­gent Agents The Only Way We Know How: Ideal Bureau­cratic Struc­ture (IBS)

Justin BullockMay 24, 2021, 12:53 PM
14 points
15 comments6 min readLW link

Wire­head­ing and mis­al­ign­ment by com­po­si­tion on NetHack

pierlucadoroOct 27, 2023, 5:43 PM
34 points
4 comments4 min readLW link

The Me­taethics and Nor­ma­tive Ethics of AGI Value Align­ment: Many Ques­tions, Some Implications

Eleos Arete CitriniSep 16, 2021, 4:13 PM
6 points
0 comments8 min readLW link

Align­ment Can Re­duce Perfor­mance on Sim­ple Eth­i­cal Questions

Daan HenselmansFeb 3, 2025, 7:35 PM
16 points
7 comments6 min readLW link

[Question] Are there more than 12 paths to Su­per­in­tel­li­gence?

p4rziv4lOct 18, 2024, 4:05 PM
−3 points
0 comments1 min readLW link

Dist­in­guish­ing AI takeover scenarios

Sep 8, 2021, 4:19 PM
74 points
11 comments14 min readLW link

Re­quire­ments for a Basin of At­trac­tion to Alignment

RogerDearnaleyFeb 14, 2024, 7:10 AM
41 points
12 comments31 min readLW link

The for­mal goal is a pointer

MorphismMay 1, 2024, 12:27 AM
20 points
10 comments1 min readLW link

Pre­dic­tion can be Outer Aligned at Optimum

Lukas FinnvedenJan 10, 2021, 6:48 PM
15 points
12 comments11 min readLW link

RFC: Meta-eth­i­cal un­cer­tainty in AGI alignment

Gordon Seidoh WorleyJun 8, 2018, 8:56 PM
16 points
6 comments3 min readLW link

Toward a Hu­man Hy­brid Lan­guage for En­hanced Hu­man-Ma­chine Com­mu­ni­ca­tion: Ad­dress­ing the AI Align­ment Problem

Andndn DheudndAug 14, 2024, 10:19 PM
−4 points
2 comments4 min readLW link

Thoughts about OOD alignment

CatneeAug 24, 2022, 3:31 PM
11 points
10 comments2 min readLW link

[Question] Com­pe­tence vs Alignment

kwiat.devSep 30, 2020, 9:03 PM
7 points
4 comments1 min readLW link

Op­tion­al­ity ap­proach to ethics

Ryo Nov 13, 2023, 3:23 PM
7 points
2 comments3 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin PopeOct 13, 2021, 8:52 PM
9 points
0 comments2 min readLW link

In­for­ma­tion bot­tle­neck for coun­ter­fac­tual corrigibility

tailcalledDec 6, 2021, 5:11 PM
8 points
1 comment7 min readLW link

RL with KL penalties is bet­ter seen as Bayesian inference

May 25, 2022, 9:23 AM
114 points
17 comments12 min readLW link

[Question] Is there any ex­ist­ing term sum­ma­riz­ing non-scal­able over­sight meth­ods in outer al­ign­ment?

Allen ShenJul 31, 2023, 5:31 PM
1 point
0 comments1 min readLW link

The Ideal Speech Si­tu­a­tion as a Tool for AI Eth­i­cal Reflec­tion: A Frame­work for Alignment

kenneth myersFeb 9, 2024, 6:40 PM
6 points
12 comments3 min readLW link

Re­ac­tion to “Em­pow­er­ment is (al­most) All We Need” : an open-ended alternative

Ryo Nov 25, 2023, 3:35 PM
9 points
3 comments5 min readLW link

Ex­am­ples of AI’s be­hav­ing badly

Stuart_ArmstrongJul 16, 2015, 10:01 AM
41 points
41 comments1 min readLW link

The Steer­ing Problem

paulfchristianoNov 13, 2018, 5:14 PM
44 points
12 comments7 min readLW link

In­tel­li­gence–Agency Equiv­alence ≈ Mass–En­ergy Equiv­alence: On Static Na­ture of In­tel­li­gence & Phys­i­cal­iza­tion of Ethics

ankFeb 22, 2025, 12:12 AM
1 point
0 comments6 min readLW link

Higher Di­men­sion Carte­sian Ob­jects and Align­ing ‘Tiling Si­mu­la­tors’

lukemarksJun 11, 2023, 12:13 AM
22 points
0 comments5 min readLW link

Sim­ple al­ign­ment plan that maybe works

IknownothingJul 18, 2023, 10:48 PM
4 points
8 comments1 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke HayashiMay 6, 2023, 5:55 PM
9 points
6 comments2 min readLW link

Imi­ta­tion Learn­ing from Lan­guage Feedback

Mar 30, 2023, 2:11 PM
71 points
3 comments10 min readLW link

In the Name of All That Needs Saving

pleiotrothNov 7, 2024, 3:26 PM
18 points
3 comments22 min readLW link

Slay­ing the Hy­dra: to­ward a new game board for AI

PrometheusJun 23, 2023, 5:04 PM
0 points
5 comments6 min readLW link

Causal rep­re­sen­ta­tion learn­ing as a tech­nique to pre­vent goal misgeneralization

PabloAMCJan 4, 2023, 12:07 AM
21 points
0 comments8 min readLW link

Our Ex­ist­ing Solu­tions to AGI Align­ment (semi-safe)

Michael SoareverixJul 21, 2022, 7:00 PM
12 points
1 comment3 min readLW link

The de­fault sce­nario for the next 50 years

JulienNov 24, 2024, 2:01 PM
1 point
0 comments6 min readLW link

Shut­down-Seek­ing AI

Simon GoldsteinMay 31, 2023, 10:19 PM
50 points
32 comments15 min readLW link

Align­ment works both ways

Karl von WendtMar 7, 2023, 10:41 AM
23 points
21 comments2 min readLW link

You can’t fetch the coffee if you’re dead: an AI dilemma

hennygeAug 31, 2023, 11:03 AM
1 point
0 comments4 min readLW link

Get­ting from an un­al­igned AGI to an al­igned AGI?

Tor Økland BarstadJun 21, 2022, 12:36 PM
13 points
7 comments9 min readLW link

Us­ing Con­sen­sus Mechanisms as an ap­proach to Alignment

PrometheusJun 10, 2023, 11:38 PM
11 points
2 comments6 min readLW link

Use these three heuris­tic im­per­a­tives to solve alignment

GApr 6, 2023, 4:20 PM
−17 points
4 comments1 min readLW link

In­ves­ti­gat­ing causal un­der­stand­ing in LLMs

Jun 14, 2022, 1:57 PM
28 points
6 comments13 min readLW link

God vs AI scientifically

Donatas LučiūnasMar 21, 2023, 11:03 PM
−22 points
45 comments1 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

Jun 27, 2022, 3:58 PM
171 points
14 comments7 min readLW link

[Question] Is it worth mak­ing a database for moral pre­dic­tions?

Jonas HallgrenAug 16, 2021, 2:51 PM
1 point
0 comments2 min readLW link

Please Understand

samhealyApr 1, 2024, 12:33 PM
28 points
11 comments6 min readLW link

The case for al­ign­ing nar­rowly su­per­hu­man models

Ajeya CotraMar 5, 2021, 10:29 PM
186 points
75 comments38 min readLW link1 review

Align­ment is Hard: An Un­com­putable Align­ment Problem

Alexander BistagneNov 19, 2023, 7:38 PM
−5 points
4 comments1 min readLW link
(github.com)

For al­ign­ment, we should si­mul­ta­neously use mul­ti­ple the­o­ries of cog­ni­tion and value

Roman LeventovApr 24, 2023, 10:37 AM
23 points
5 comments5 min readLW link

Goal al­ign­ment with­out al­ign­ment on episte­mol­ogy, ethics, and sci­ence is futile

Roman LeventovApr 7, 2023, 8:22 AM
20 points
2 comments2 min readLW link

[Aspira­tion-based de­signs] A. Da­m­ages from mis­al­igned op­ti­miza­tion – two more models

Jul 15, 2024, 2:08 PM
6 points
0 comments9 min readLW link

An LLM-based “ex­em­plary ac­tor”

Roman LeventovMay 29, 2023, 11:12 AM
16 points
0 comments12 min readLW link

Science of Deep Learn­ing—a tech­ni­cal agenda

Marius HobbhahnOct 18, 2022, 2:54 PM
37 points
7 comments4 min readLW link

Ra­tional Effec­tive Utopia & Nar­row Way There: Mul­tiver­sal AI Align­ment, Place AI, New Ethico­physics… (Up­dated)

ankFeb 11, 2025, 3:21 AM
13 points
8 comments35 min readLW link

Static Place AI Makes Agen­tic AI Re­dun­dant: Mul­tiver­sal AI Align­ment & Ra­tional Utopia

ankFeb 13, 2025, 10:35 PM
1 point
2 comments11 min readLW link

Imi­ta­tive Gen­er­al­i­sa­tion (AKA ‘Learn­ing the Prior’)

Beth BarnesJan 10, 2021, 12:30 AM
107 points
15 comments11 min readLW link1 review

Distil­la­tion of Neu­rotech and Align­ment Work­shop Jan­uary 2023

May 22, 2023, 7:17 AM
51 points
9 comments14 min readLW link

Au­tonomous Align­ment Over­sight Frame­work (AAOF)

JustausernameJul 25, 2023, 10:25 AM
−9 points
0 comments4 min readLW link

[Question] Thoughts on a “Se­quences In­spired” PhD Topic

goose000Jun 17, 2021, 8:36 PM
7 points
2 comments2 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjrFeb 12, 2021, 7:55 AM
15 points
0 comments27 min readLW link

A Mul­tidis­ci­plinary Ap­proach to Align­ment (MATA) and Archety­pal Trans­fer Learn­ing (ATL)

MiguelDevJun 19, 2023, 2:32 AM
4 points
2 comments7 min readLW link