Outer Alignment

TagLast edit: Apr 15, 2025, 3:42 AM by Seth Herd

Outer Alignment (also known as the reward misspecification problem) is the problem of specifying a reward function which captures human preferences. Outer alignment asks the question—“What will we train our model to do?” Note that this is in the narrow technical sense of selecting a reward function, while wisely choosing a training target is a seperate issue; see the list below.

Overall, outer alignment as a problem is intuitive enough to understand, i.e., is the specified loss function aligned with the intended goal of its designers? However, implementing this in practice is extremely difficult. Conveying the full “intention” behind a human request is equivalent to conveying the sum of all human values and ethics. This is difficult in part because human intentions are themselves not well understood. Additionally, since most models are designed as goal optimizers, they are all susceptible to Goodhart’s Law which means that we might be unable to foresee negative consequences that arise due to excessive optimization pressure on a goal that would look otherwise well specified to humans.

To solve the outer alignment problem, some sub-problems that we would have to make progress on include specification gaming, value learning, and reward shaping/modeling. Some proposed solutions to outer alignment include scalable oversight techniques such as IDA, as well as adversarial oversight techniques such as debate.

Outer Alignment vs. Inner Alignment

This is often taken to be separate from the inner alignment problem, which asks: How can we robustly aim our AI optimizers at any objective function at all?

It should be kept in mind that you can have both inner and outer alignment failures together. It is not a dichotomy and often even experienced alignment researchers are unable to tell them apart. This indicates that the classifications of failures according to these terms are fuzzy. Ideally, we don’t think of a binary dichotomy of inner and outer alignment that can be tackled individually but of a more holistic alignment picture that includes the interplay between both inner and outer alignment approaches.

Inner alignment: Does the model learn to do what you trained it to do?
- Or does it create a mesaoptimizer?
- Relevant example: take over the world as a mesoptimized goal includes perform appropriately in testing to avoid detection as a subgoal
Outer alignment: Does your training set actually train the model to do what you think you want?
- Example: training a model to detect hidden equipment won’t do what you want if all the training set with hidden equipment was taken from a sunny day, while all the negative examples were taken on cloudy days (figuratively but not literally a thing that happened)
Alignment target selection: This is a separate potential point of failure.
- If you get inner alignment and outer alignment right, but you selected “make me a lot of money” as your alignment target, you will probably be unhappy with your overall efforts. You failed at wisely selecting an alignment target.

Risks from Learned Optimization: Introduction

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

May 31, 2019, 11:44 PM

187 points

42 comments12 min readLW link 3 reviews

The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

RogerDearnaleyMay 28, 2025, 6:21 AM

22 points

33 comments9 min readLW link

6. The Mutable Values Problem in Value Learning and CEV

RogerDearnaleyDec 4, 2023, 6:31 PM

12 points

0 comments49 min readLW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaleyFeb 1, 2024, 9:15 PM

16 points

15 comments13 min readLW link

Another (outer) alignment failure story

paulfchristianoApr 7, 2021, 8:12 PM

249 points

38 comments12 min readLW link 1 review

Truthful LMs as a warm-up for aligned AGI

Jacob_HiltonJan 17, 2022, 4:49 PM

65 points

14 comments13 min readLW link

LOVE in a simbox is all you need

jacob_cannellSep 28, 2022, 6:25 PM

67 points

73 comments44 min readLW link 1 review

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaleyMay 25, 2023, 9:26 AM

33 points

3 comments15 min readLW link

Gaia Network: a practical, incremental pathway to Open Agency Architecture

Roman Leventov and Rafael Kaufmann Nedal

Dec 20, 2023, 5:11 PM

22 points

8 comments16 min readLW link

Debate update: Obfuscated arguments problem

Beth BarnesDec 23, 2020, 3:24 AM

136 points

24 comments16 min readLW link

Outer vs inner misalignment: three framings

Richard_NgoJul 6, 2022, 7:46 PM

52 points

5 comments9 min readLW link

Book review: “A Thousand Brains” by Jeff Hawkins

Steven ByrnesMar 4, 2021, 5:10 AM

122 points

18 comments19 min readLW link

Maintaining Alignment during RSI as a Feedback Control Problem

berenMar 2, 2025, 12:21 AM

66 points

6 comments11 min readLW link

List of resolved confusions about IDA

Wei DaiSep 30, 2019, 8:03 PM

97 points

18 comments3 min readLW link

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven ByrnesMar 25, 2021, 1:45 PM

74 points

40 comments16 min readLW link

“Inner Alignment Failures” Which Are Actually Outer Alignment Failures

johnswentworthOct 31, 2020, 8:18 PM

66 points

38 comments5 min readLW link

Reward is not the optimization target

TurnTroutJul 25, 2022, 12:03 AM

376 points

123 comments10 min readLW link 3 reviews

AI Alignment 2018-19 Review

Rohin ShahJan 28, 2020, 2:19 AM

126 points

6 comments35 min readLW link

If I were a well-intentioned AI… III: Extremal Goodhart

Stuart_ArmstrongFeb 28, 2020, 11:24 AM

22 points

0 comments5 min readLW link

[Question] What if Ethics is Provably Self-Contradictory?

YitzApr 18, 2024, 5:12 AM

3 points

7 comments2 min readLW link

Worrisome misunderstanding of the core issues with AI transition

Roman LeventovJan 18, 2024, 10:05 AM

5 points

2 comments4 min readLW link

Outer alignment and imitative amplification

evhubJan 10, 2020, 12:26 AM

24 points

11 comments9 min readLW link

How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA?

Owain_EvansFeb 26, 2022, 12:46 PM

44 points

3 comments11 min readLW link

nostalgebraist: Recursive Goodhart’s Law

Kaj_SotalaAug 26, 2020, 11:07 AM

53 points

27 comments1 min readLW link

(nostalgebraist.tumblr.com)

[Linkpost] Introducing Superalignment

berenJul 5, 2023, 6:23 PM

175 points

69 comments1 min readLW link

(openai.com)

Don’t align agents to evaluations of plans

TurnTroutNov 26, 2022, 9:16 PM

48 points

49 comments18 min readLW link

On the Confusion between Inner and Outer Misalignment

Chris_LeongMar 25, 2024, 11:59 AM

17 points

10 comments1 min readLW link

If I were a well-intentioned AI… II: Acting in a world

Stuart_ArmstrongFeb 27, 2020, 11:58 AM

20 points

0 comments3 min readLW link

AI alignment as a translation problem

Roman LeventovFeb 5, 2024, 2:14 PM

22 points

2 comments3 min readLW link

Inference-Only Debate Experiments Using Math Problems

Arjun Panickssery, Abhimanyu Pallavi Sudhir and JacksonKaunismaa

Aug 6, 2024, 5:44 PM

31 points

0 comments2 min readLW link

Shard Theory: An Overview

David UdellAug 11, 2022, 5:44 AM

167 points

34 comments10 min readLW link

Four usages of “loss” in AI

TurnTroutOct 2, 2022, 12:52 AM

46 points

18 comments4 min readLW link

MIRI comments on Cotra’s “Case for Aligning Narrowly Superhuman Models”

Rob BensingerMar 5, 2021, 11:43 PM

142 points

13 comments26 min readLW link

Mesa-Optimizers vs “Steered Optimizers”

Steven ByrnesJul 10, 2020, 4:49 PM

45 points

7 comments8 min readLW link

Simulators

janusSep 2, 2022, 12:45 PM

636 points

168 comments41 min readLW link 8 reviews

(generative.ink)

Evaluating the historical value misspecification argument

Matthew BarnettOct 5, 2023, 6:34 PM

188 points

162 comments7 min readLW link 3 reviews

AXRP Episode 12 - AI Existential Risk with Paul Christiano

DanielFilanDec 2, 2021, 2:20 AM

38 points

0 comments126 min readLW link

[Intro to brain-like-AGI safety] 10. The alignment problem

Steven ByrnesMar 30, 2022, 1:24 PM

52 points

7 comments21 min readLW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTroutDec 2, 2022, 2:43 AM

149 points

22 comments47 min readLW link 3 reviews

Why “AI alignment” would better be renamed into “Artificial Intention research”

chaosmageJun 15, 2023, 10:32 AM

29 points

12 comments2 min readLW link

Language Agents Reduce the Risk of Existential Catastrophe

cdkg and Simon Goldstein

May 28, 2023, 7:10 PM

39 points

14 comments26 min readLW link

If I were a well-intentioned AI… I: Image classifier

Stuart_ArmstrongFeb 26, 2020, 12:39 PM

35 points

4 comments5 min readLW link

Naive Hypotheses on AI Alignment

Shoshannah TekofskyJul 2, 2022, 7:03 PM

98 points

29 comments5 min readLW link

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceCDec 16, 2022, 10:12 PM

68 points

11 comments1 min readLW link

(www.anthropic.com)

Categorizing failures as “outer” or “inner” misalignment is often confused

Rohin ShahJan 6, 2023, 3:48 PM

93 points

21 comments8 min readLW link

Is the Star Trek Federation really incapable of building AI?

Kaj_SotalaMar 18, 2018, 10:30 AM

19 points

4 comments2 min readLW link

(kajsotala.fi)

Human Mimicry Mainly Works When We’re Already Close

johnswentworthAug 17, 2022, 6:41 PM

82 points

16 comments5 min readLW link

Learning societal values from law as part of an AGI alignment strategy

John NayOct 21, 2022, 2:03 AM

5 points

18 comments54 min readLW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel NandaDec 15, 2021, 11:44 PM

127 points

9 comments15 min readLW link

[Aspiration-based designs] 1. Informal introduction

B Jacobs, Jobst Heitzig, Simon Fischer and Simon Dima

Apr 28, 2024, 1:00 PM

44 points

4 comments8 min readLW link

Alignment as Game Design

Shoshannah TekofskyJul 16, 2022, 10:36 PM

11 points

7 comments2 min readLW link

25 Min Talk on MetaEthical.AI with Questions from Stuart Armstrong

June KuApr 29, 2021, 3:38 PM

21 points

7 comments1 min readLW link

(Humor) AI Alignment Critical Failure Table

Kaj_SotalaAug 31, 2020, 7:51 PM

24 points

2 comments1 min readLW link

(sl4.org)

Selection Theorems: A Program For Understanding Agents

johnswentworthSep 28, 2021, 5:03 AM

128 points

28 comments6 min readLW link 2 reviews

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron BergFeb 11, 2022, 10:23 PM

5 points

1 comment10 min readLW link

Announcing the Alignment of Complex Systems Research Group

Jan_Kulveit and technicalities

Jun 4, 2022, 4:10 AM

91 points

20 comments5 min readLW link

Preference Aggregation as Bayesian Inference

berenJul 27, 2023, 5:59 PM

14 points

1 comment1 min readLW link

The True Story of How GPT-2 Became Maximally Lewd

Writer and Jai

Jan 18, 2024, 9:03 PM

70 points

7 comments6 min readLW link

(youtu.be)

Concept Safety: Producing similar AI-human concept spaces

Kaj_SotalaApr 14, 2015, 8:39 PM

51 points

45 comments8 min readLW link

[ASoT] Some thoughts about imperfect world modeling

leogaoApr 7, 2022, 3:42 PM

7 points

0 comments4 min readLW link

An overview of 11 proposals for building safe advanced AI

evhubMay 29, 2020, 8:38 PM

220 points

37 comments38 min readLW link 2 reviews

Confused why a “capabilities research is good for alignment progress” position isn’t discussed more

Kaj_SotalaJun 2, 2022, 9:41 PM

130 points

27 comments4 min readLW link

Specification Gaming: How AI Can Turn Your Wishes Against You [RA Video]

WriterDec 1, 2023, 7:30 PM

19 points

0 comments5 min readLW link

(youtu.be)

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTroutNov 29, 2022, 6:23 AM

62 points

41 comments15 min readLW link

Epistemic states as a potential benign prior

Tamsin LeakeAug 31, 2024, 6:26 PM

31 points

2 comments8 min readLW link

(carado.moe)

[Question] Collection of arguments to expect (outer and inner) alignment failure?

Sam ClarkeSep 28, 2021, 4:55 PM

21 points

10 comments1 min readLW link

My Overview of the AI Alignment Landscape: Threat Models

Neel NandaDec 25, 2021, 11:07 PM

53 points

3 comments28 min readLW link

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Palus AstraJul 1, 2020, 5:30 PM

35 points

4 comments67 min readLW link

Some of my disagreements with List of Lethalities

TurnTroutJan 24, 2023, 12:25 AM

70 points

7 comments10 min readLW link

Mental subagent implications for AI Safety

moridinamaelJan 3, 2021, 6:59 PM

11 points

0 comments3 min readLW link

The Computational Anatomy of Human Values

berenApr 6, 2023, 10:33 AM

74 points

30 comments30 min readLW link

The Preference Fulfillment Hypothesis

Kaj_SotalaFeb 26, 2023, 10:55 AM

66 points

62 comments11 min readLW link

The Disastrously Confident And Inaccurate AI

Sharat Jacob JacobNov 18, 2022, 7:06 PM

13 points

0 comments13 min readLW link

Three Minimum Pivotal Acts Possible by Narrow AI

Michael SoareverixJul 12, 2022, 9:51 AM

0 points

4 comments2 min readLW link

AI Alternative Futures: Scenario Mapping Artificial Intelligence Risk—Request for Participation (Closed)

KakiliApr 27, 2022, 10:07 PM

10 points

2 comments8 min readLW link

Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program

Christopher KingJun 2, 2023, 9:54 PM

7 points

4 comments16 min readLW link

On predictability, chaos and AIs that don’t game our goals

Alejandro TlaieJul 15, 2024, 5:16 PM

4 points

8 comments6 min readLW link

Can you care without feeling?

Priyanka BharadwajMay 20, 2025, 8:12 AM

13 points

2 comments3 min readLW link

VLM-RM: Specifying Rewards with Natural Language

ChengCheng, David Lindner and Ethan Perez

Oct 23, 2023, 2:11 PM

20 points

2 comments5 min readLW link

(far.ai)

Just How Hard a Problem is Alignment?

Roger DearnaleyFeb 25, 2023, 9:00 AM

3 points

1 comment21 min readLW link

My preferred framings for reward misspecification and goal misgeneralisation

Yi-YangMay 6, 2023, 4:48 AM

27 points

1 comment8 min readLW link

Learning the smooth prior

Geoffrey Irving, Rohin Shah and evhub

Apr 29, 2022, 9:10 PM

35 points

0 comments12 min readLW link

Research Notes: What are we aligning for?

Shoshannah TekofskyJul 8, 2022, 10:13 PM

19 points

8 comments2 min readLW link

find_purpose.exe

heatdeathandtaxesApr 12, 2025, 7:31 PM

−1 points

0 comments5 min readLW link

(heatdeathandtaxes.substack.com)

A Case for AI Safety via Law

JWJohnstonSep 11, 2023, 6:26 PM

20 points

12 comments4 min readLW link

A first success story for Outer Alignment: InstructGPT

Noosphere89Nov 8, 2022, 10:52 PM

6 points

1 comment1 min readLW link

(openai.com)

Leveraging Legal Informatics to Align AI

John NaySep 18, 2022, 8:39 PM

11 points

0 comments3 min readLW link

(forum.effectivealtruism.org)

Alignment via manually implementing the utility function

ChantielSep 7, 2021, 8:20 PM

1 point

6 comments2 min readLW link

If Alignment is Hard, then so is Self-Improvement

PavleMihaApr 7, 2023, 12:08 AM

21 points

20 comments1 min readLW link

Horn’s Chain: A Functional Answer to the Hard Problem of Consciousness

GalileoApr 18, 2025, 1:53 AM

1 point

0 comments11 min readLW link

Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner & Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium

MiguelDevApr 26, 2023, 1:37 AM

14 points

5 comments10 min readLW link

Places of Loving Grace [Story]

ankFeb 18, 2025, 11:49 PM

−1 points

0 comments4 min readLW link

Mapping AI Architectures to Alignment Attractors: A SIEM-Based Framework

silentrevolutionsApr 12, 2025, 5:50 PM

1 point

0 comments1 min readLW link

Inducing human-like biases in moral reasoning LMs

Artyom Karpov, Austin Meek, Bogdan Ionut Cirstea and SCho

Feb 20, 2024, 4:28 PM

23 points

3 comments14 min readLW link

[Question] Popular materials about environmental goals/agent foundations? People wanting to discuss such topics?

Q HomeJan 22, 2025, 3:30 AM

5 points

0 comments1 min readLW link

“Designing agent incentives to avoid reward tampering”, DeepMind

gwernAug 14, 2019, 4:57 PM

28 points

15 comments1 min readLW link

(medium.com)

Insufficient Values

Jozdien, Jacob Abraham and Abraham Francis

Jun 16, 2021, 2:33 PM

31 points

16 comments6 min readLW link

I Recommend More Training Rationales

Gianluca CalcagniDec 31, 2024, 2:06 PM

2 points

0 comments6 min readLW link

A simple way to make GPT-3 follow instructions

Quintin PopeMar 8, 2021, 2:57 AM

11 points

5 comments4 min readLW link

Democratic Fine-Tuning

Joe EdelmanAug 29, 2023, 6:13 PM

22 points

2 comments1 min readLW link

(open.substack.com)

Making it harder for an AGI to “trick” us, with STVs

Tor Økland BarstadJul 9, 2022, 2:42 PM

15 points

5 comments22 min readLW link

Why Recursive Self-Improvement Might Not Be the Existential Risk We Fear

Nassim_ANov 24, 2024, 5:17 PM

1 point

0 comments9 min readLW link

Alignment with argument-networks and assessment-predictions

Tor Økland BarstadDec 13, 2022, 2:17 AM

10 points

5 comments45 min readLW link

Thoughts on the Alignment Implications of Scaling Language Models

leogaoJun 2, 2021, 9:32 PM

82 points

11 comments17 min readLW link

Cooperative Game Theory

TakkJun 7, 2023, 5:41 PM

1 point

0 comments1 min readLW link

The Goal Misgeneralization Problem

MyspyMay 18, 2023, 11:40 PM

1 point

0 comments1 min readLW link

(drive.google.com)

“Sorcerer’s Apprentice” from Fantasia as an analogy for alignment

awgMar 29, 2023, 6:21 PM

9 points

4 comments1 min readLW link

(video.disney.com)

Proposal: Tune LLMs to Use Calibrated Language

OneManyNoneJun 7, 2023, 9:05 PM

9 points

0 comments5 min readLW link

Partial Simulation Extrapolation: A Proposal for Building Safer Simulators

lukemarksJun 17, 2023, 1:55 PM

16 points

0 comments10 min readLW link

Disentangling Shard Theory into Atomic Claims

Leon LangJan 13, 2023, 4:23 AM

86 points

6 comments18 min readLW link

Formalizing «Boundaries» with Markov blankets

Chris LakinSep 19, 2023, 9:01 PM

21 points

20 comments3 min readLW link

Corrigibility or DWIM is an attractive primary goal for AGI

Seth HerdNov 25, 2023, 7:37 PM

19 points

4 comments1 min readLW link

Empathy as a natural consequence of learnt reward models

berenFeb 4, 2023, 3:35 PM

48 points

27 comments13 min readLW link

[Question] Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann DuboisDec 19, 2022, 10:42 PM

5 points

6 comments1 min readLW link

When the Model Starts Talking Like Me: A User-Induced Structural Adaptation Case Study

JunxiApr 19, 2025, 7:40 PM

3 points

1 comment4 min readLW link

Would this solve the (outer) alignment problem, or at least help?

Wes RApr 6, 2025, 6:49 PM

−2 points

1 comment13 min readLW link

Rationality vs Alignment

Donatas LučiūnasJul 7, 2024, 10:12 AM

−14 points

14 comments2 min readLW link

How I’d like alignment to get done (as of 2024-10-18)

TristanTrimOct 18, 2024, 11:39 PM

11 points

4 comments4 min readLW link

Early situational awareness and its implications, a story

Jacob PfauFeb 6, 2023, 8:45 PM

29 points

6 comments3 min readLW link

AGI is uncontrollable, alignment is impossible

Donatas LučiūnasMar 19, 2023, 5:49 PM

−12 points

21 comments1 min readLW link

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

sdetureMay 31, 2025, 10:09 PM

15 points

6 comments8 min readLW link

Are extrapolation-based AIs alignable?

cousin_itMar 24, 2023, 3:55 PM

24 points

15 comments1 min readLW link

Questions about Value Lock-in, Paternalism, and Empowerment

Sam F. BrownNov 16, 2022, 3:33 PM

13 points

2 comments12 min readLW link

(sambrown.eu)

[untitled post]

LogicMay 21, 2025, 2:21 PM

0 points

0 comments1 min readLW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

JustausernameAug 24, 2023, 3:53 AM

1 point

0 comments6 min readLW link

[Question] Optimizing for Agency?

Michael SoareverixFeb 14, 2024, 8:31 AM

10 points

9 comments2 min readLW link

Updating Utility Functions

JustinShovelain and Joar Skalse

May 9, 2022, 9:44 AM

41 points

6 comments8 min readLW link

Levels of goals and alignment

zeshenSep 16, 2022, 4:44 PM

27 points

4 comments6 min readLW link

Thin Alignment Can’t Solve Thick Problems

Daan HenselmansApr 27, 2025, 10:42 PM

11 points

2 comments9 min readLW link

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

Nov 1, 2022, 11:03 AM

78 points

4 comments25 min readLW link

[Question] Does human (mis)alignment pose a significant and imminent existential threat?

jrFeb 23, 2025, 10:03 AM

6 points

3 comments1 min readLW link

Open-ended ethics of phenomena (a desiderata with universal morality)

Ryo Nov 8, 2023, 8:10 PM

1 point

0 comments8 min readLW link

Conditioning Generative Models for Alignment

JozdienJul 18, 2022, 7:11 AM

60 points

8 comments20 min readLW link

PRISM: Perspective Reasoning for Integrated Synthesis and Mediation (Interactive Demo)

Anthony DiamondMar 18, 2025, 6:03 PM

10 points

2 comments1 min readLW link

Framing AI Childhoods

David UdellSep 6, 2022, 11:40 PM

37 points

8 comments4 min readLW link

7. Evolution and Ethics

RogerDearnaleyFeb 15, 2024, 11:38 PM

3 points

7 comments6 min readLW link

Behavior Cloning is Miscalibrated

leogaoDec 5, 2021, 1:36 AM

78 points

3 comments3 min readLW link

The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments.

ShivamJan 30, 2025, 2:44 AM

1 point

0 comments11 min readLW link

H-JEPA might be technically alignable in a modified form

Roman LeventovMay 8, 2023, 11:04 PM

12 points

2 comments7 min readLW link

No-self as an alignment target

Milan WMay 13, 2025, 1:48 AM

35 points

5 comments1 min readLW link

An Increasingly Manipulative Newsfeed

Michaël TrazziJul 1, 2019, 3:26 PM

63 points

16 comments5 min readLW link

Request for advice: Research for Conversational Game Theory for LLMs

Rome ViharoOct 16, 2024, 5:53 PM

10 points

0 comments1 min readLW link

Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor”

Roman LeventovMay 29, 2023, 11:08 AM

12 points

10 comments30 min readLW link

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger DearnaleyFeb 21, 2023, 9:05 AM

10 points

1 comment23 min readLW link

Alignment As A Bottleneck To Usefulness Of GPT-3

johnswentworthJul 21, 2020, 8:02 PM

111 points

57 comments3 min readLW link

[Question] Don’t you think RLHF solves outer alignment?

Charbel-RaphaëlNov 4, 2022, 12:36 AM

9 points

23 comments1 min readLW link

“Pick Two” AI Trilemma: Generality, Agency, Alignment.

Black FlagJan 15, 2025, 6:52 PM

7 points

0 comments2 min readLW link

Contextual Constitutional AI

aksh-nSep 28, 2024, 11:24 PM

14 points

2 comments12 min readLW link

[Question] Daisy-chaining epsilon-step verifiers

DecaeneusApr 6, 2023, 2:07 AM

2 points

1 comment1 min readLW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar SkalseFeb 28, 2025, 7:20 PM

26 points

4 comments14 min readLW link

On the Importance of Open Sourcing Reward Models

elandgreJan 2, 2023, 7:01 PM

18 points

5 comments6 min readLW link

Embedding Ethical Priors into AI Systems: A Bayesian Approach

JustausernameAug 3, 2023, 3:31 PM

−5 points

3 comments21 min readLW link

Breaking down the MEAT of Alignment

JasonBrownApr 7, 2025, 8:47 AM

7 points

2 comments11 min readLW link

Externalized reasoning oversight: a research direction for language model alignment

tameraAug 3, 2022, 12:03 PM

136 points

23 comments6 min readLW link

Tetherware #1: The case for humanlike AI with free will

Jáchym FibírJan 30, 2025, 10:58 AM

5 points

14 comments10 min readLW link

(tetherware.substack.com)

A positive case for how we might succeed at prosaic AI alignment

evhubNov 16, 2021, 1:49 AM

81 points

46 comments6 min readLW link

Gaia Network: An Illustrated Primer

Rafael Kaufmann Nedal and Roman Leventov

Jan 18, 2024, 6:23 PM

3 points

2 comments15 min readLW link

Recreating the caring drive

CatneeSep 7, 2023, 10:41 AM

43 points

15 comments10 min readLW link 1 review

If you’re very optimistic about ELK then you should be optimistic about outer alignment

Sam MarksApr 27, 2022, 7:30 PM

17 points

8 comments3 min readLW link

Freedom Is All We Need

Leo GlisicApr 27, 2023, 12:09 AM

−1 points

8 comments10 min readLW link

Call for research on evaluating alignment (funding + advice available)

Beth BarnesAug 31, 2021, 11:28 PM

105 points

11 comments5 min readLW link

Alignment is not intelligent

Donatas LučiūnasNov 25, 2024, 6:59 AM

−23 points

18 comments5 min readLW link

Inner alignment: what are we pointing at?

lemonhopeSep 18, 2022, 11:09 AM

14 points

2 comments1 min readLW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. MurphyMay 12, 2022, 8:01 PM

58 points

0 comments59 min readLW link

Conditioning Generative Models with Restrictions

Adam JermynJul 21, 2022, 8:33 PM

18 points

4 comments8 min readLW link

The Linguistic Blind Spot of Value-Aligned Agency, Natural and Artificial

Roman LeventovFeb 14, 2023, 6:57 AM

6 points

0 comments2 min readLW link

(arxiv.org)

Exterminating humans might be on the to-do list of a Friendly AI

RomanSDec 7, 2021, 2:15 PM

5 points

8 comments2 min readLW link

Invitation to the Princeton AI Alignment and Safety Seminar

Sadhika MalladiMar 17, 2024, 1:10 AM

6 points

1 comment1 min readLW link

CCS: Counterfactual Civilization Simulation

MorphismMay 2, 2024, 10:54 PM

3 points

0 comments2 min readLW link

AI Alignment: A Comprehensive Survey

Stephen McAleerNov 1, 2023, 5:35 PM

20 points

1 comment1 min readLW link

(arxiv.org)

Will AI and Humanity Go to War?

Simon GoldsteinOct 1, 2024, 6:35 AM

9 points

4 comments6 min readLW link

The AGI needs to be honest

rokosbasiliskOct 16, 2021, 7:24 PM

2 points

11 comments2 min readLW link

A single principle related to many Alignment subproblems?

Q HomeApr 30, 2025, 9:49 AM

37 points

34 comments17 min readLW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi SudhirSep 16, 2024, 1:04 AM

5 points

1 comment5 min readLW link

Compositional preference models for aligning LMs

Tomek KorbakOct 25, 2023, 12:17 PM

18 points

2 comments5 min readLW link

Terminal goal vs Intelligence

Donatas LučiūnasDec 26, 2024, 8:10 AM

−12 points

24 comments1 min readLW link

Aligned AI as a wrapper around an LLM

cousin_itMar 25, 2023, 3:58 PM

31 points

19 comments1 min readLW link

Alignment Crisis: Genocide Denial

_mp_May 29, 2025, 12:04 PM

−11 points

5 comments4 min readLW link

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_DietzFeb 17, 2024, 8:45 AM

4 points

0 comments13 min readLW link

What Should AI Owe To Us? Accountable and Aligned AI Systems via Contractualist AI Alignment

xuanSep 8, 2022, 3:04 PM

26 points

16 comments25 min readLW link

Pretraining Language Models with Human Preferences

Tomek Korbak, Sam Bowman and Ethan Perez

Feb 21, 2023, 5:57 PM

135 points

20 comments11 min readLW link 2 reviews

Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive

JustausernameJul 23, 2023, 4:08 PM

4 points

1 comment3 min readLW link

Why deceptive alignment matters for AGI safety

Marius HobbhahnSep 15, 2022, 1:38 PM

68 points

13 comments13 min readLW link

Model Integrity

ryan.lowe, Oliver Klingefjord and Joe Edelman

Dec 6, 2024, 9:28 PM

4 points

1 comment18 min readLW link

Thoughts on the Feasibility of Prosaic AGI Alignment?

iamthouthouartiAug 21, 2020, 11:25 PM

8 points

10 comments1 min readLW link

How will we update about scheming?

ryan_greenblattJan 6, 2025, 8:21 PM

171 points

20 comments37 min readLW link

Artificial Static Place Intelligence: Guaranteed Alignment

ankFeb 15, 2025, 11:08 AM

2 points

2 comments2 min readLW link

A Universal Prompt as a Safeguard Against AI Threats

Zhaiyk SultanMar 10, 2025, 2:28 AM

1 point

0 comments2 min readLW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

Nov 1, 2022, 11:03 AM

127 points

24 comments4 min readLW link 1 review

Unaligned AGI & Brief History of Inequality

ankFeb 22, 2025, 4:26 PM

−20 points

4 comments7 min readLW link

The Alignment Problems

Martín SotoJan 12, 2023, 10:29 PM

20 points

0 comments4 min readLW link

Project Intro: Selection Theorems for Modularity

CallumMcDougall, Avery and Lucius Bushnaq

Apr 4, 2022, 12:59 PM

73 points

20 comments16 min readLW link

Conditioning, Prompts, and Fine-Tuning

Adam JermynAug 17, 2022, 8:52 PM

38 points

9 comments4 min readLW link

Controlling Intelligent Agents The Only Way We Know How: Ideal Bureaucratic Structure (IBS)

Justin BullockMay 24, 2021, 12:53 PM

14 points

15 comments6 min readLW link

Wireheading and misalignment by composition on NetHack

pierlucadoroOct 27, 2023, 5:43 PM

34 points

4 comments4 min readLW link

The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications

Eleos Arete CitriniSep 16, 2021, 4:13 PM

6 points

0 comments8 min readLW link

Alignment Can Reduce Performance on Simple Ethical Questions

Daan HenselmansFeb 3, 2025, 7:35 PM

16 points

7 comments6 min readLW link

[Question] Are there more than 12 paths to Superintelligence?

p4rziv4lOct 18, 2024, 4:05 PM

−3 points

0 comments1 min readLW link

Distinguishing AI takeover scenarios

Sam Clarke and Sammy Martin

Sep 8, 2021, 4:19 PM

74 points

11 comments14 min readLW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaleyFeb 14, 2024, 7:10 AM

41 points

12 comments31 min readLW link

The formal goal is a pointer

MorphismMay 1, 2024, 12:27 AM

20 points

10 comments1 min readLW link

Prediction can be Outer Aligned at Optimum

Lukas FinnvedenJan 10, 2021, 6:48 PM

15 points

12 comments11 min readLW link

RFC: Meta-ethical uncertainty in AGI alignment

Gordon Seidoh WorleyJun 8, 2018, 8:56 PM

16 points

6 comments3 min readLW link

Toward a Human Hybrid Language for Enhanced Human-Machine Communication: Addressing the AI Alignment Problem

Andndn DheudndAug 14, 2024, 10:19 PM

−4 points

2 comments4 min readLW link

Thoughts about OOD alignment

CatneeAug 24, 2022, 3:31 PM

11 points

10 comments2 min readLW link

[Question] Competence vs Alignment

kwiat.devSep 30, 2020, 9:03 PM

7 points

4 comments1 min readLW link

Optionality approach to ethics

Ryo Nov 13, 2023, 3:23 PM

7 points

2 comments3 min readLW link

[Proposal] Method of locating useful subnets in large models

Quintin PopeOct 13, 2021, 8:52 PM

9 points

0 comments2 min readLW link

Information bottleneck for counterfactual corrigibility

tailcalledDec 6, 2021, 5:11 PM

8 points

1 comment7 min readLW link

RL with KL penalties is better seen as Bayesian inference

Tomek Korbak and Ethan Perez

May 25, 2022, 9:23 AM

115 points

17 comments12 min readLW link

[Question] Is there any existing term summarizing non-scalable oversight methods in outer alignment?

Allen ShenJul 31, 2023, 5:31 PM

1 point

0 comments1 min readLW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myersFeb 9, 2024, 6:40 PM

6 points

12 comments3 min readLW link

Reaction to “Empowerment is (almost) All We Need” : an open-ended alternative

Ryo Nov 25, 2023, 3:35 PM

9 points

3 comments5 min readLW link

Examples of AI’s behaving badly

Stuart_ArmstrongJul 16, 2015, 10:01 AM

41 points

41 comments1 min readLW link

The Steering Problem

paulfchristianoNov 13, 2018, 5:14 PM

44 points

12 comments7 min readLW link

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ankFeb 22, 2025, 12:12 AM

1 point

0 comments6 min readLW link

Higher Dimension Cartesian Objects and Aligning ‘Tiling Simulators’

lukemarksJun 11, 2023, 12:13 AM

22 points

0 comments5 min readLW link

Simple alignment plan that maybe works

IknownothingJul 18, 2023, 10:48 PM

4 points

8 comments1 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke HayashiMay 6, 2023, 5:55 PM

9 points

6 comments2 min readLW link

Imitation Learning from Language Feedback

Jérémy Scheurer, Tomek Korbak and Ethan Perez

Mar 30, 2023, 2:11 PM

71 points

3 comments10 min readLW link

In the Name of All That Needs Saving

pleiotrothNov 7, 2024, 3:26 PM

18 points

3 comments22 min readLW link

Slaying the Hydra: toward a new game board for AI

PrometheusJun 23, 2023, 5:04 PM

0 points

5 comments6 min readLW link

Causal representation learning as a technique to prevent goal misgeneralization

PabloAMCJan 4, 2023, 12:07 AM

21 points

0 comments8 min readLW link

Our Existing Solutions to AGI Alignment (semi-safe)

Michael SoareverixJul 21, 2022, 7:00 PM

12 points

1 comment3 min readLW link

The default scenario for the next 50 years

JulienNov 24, 2024, 2:01 PM

1 point

0 comments6 min readLW link

Shutdown-Seeking AI

Simon GoldsteinMay 31, 2023, 10:19 PM

50 points

32 comments15 min readLW link

Outer Alignment is the Necessary Compliment to AI 2027′s Best Case Scenario

Josh HickmanJun 9, 2025, 3:43 PM

4 points

2 comments2 min readLW link

Alignment works both ways

Karl von WendtMar 7, 2023, 10:41 AM

23 points

21 comments2 min readLW link

You can’t fetch the coffee if you’re dead: an AI dilemma

hennygeAug 31, 2023, 11:03 AM

1 point

0 comments4 min readLW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland BarstadJun 21, 2022, 12:36 PM

13 points

7 comments9 min readLW link

Using Consensus Mechanisms as an approach to Alignment

PrometheusJun 10, 2023, 11:38 PM

11 points

2 comments6 min readLW link

Use these three heuristic imperatives to solve alignment

GApr 6, 2023, 4:20 PM

−17 points

4 comments1 min readLW link

Investigating causal understanding in LLMs

Marius Hobbhahn and Tom Lieberum

Jun 14, 2022, 1:57 PM

28 points

6 comments13 min readLW link

God vs AI scientifically

Donatas LučiūnasMar 21, 2023, 11:03 PM

−22 points

45 comments1 min readLW link

Announcing the Inverse Scaling Prize ($250k Prize Pool)

Ethan Perez, Ian McKenzie and Sam Bowman

Jun 27, 2022, 3:58 PM

171 points

14 comments7 min readLW link

[Question] Is it worth making a database for moral predictions?

Jonas HallgrenAug 16, 2021, 2:51 PM

1 point

0 comments2 min readLW link

Please Understand

samhealyApr 1, 2024, 12:33 PM

28 points

11 comments6 min readLW link

The case for aligning narrowly superhuman models

Ajeya CotraMar 5, 2021, 10:29 PM

186 points

75 comments38 min readLW link 1 review

Alignment is Hard: An Uncomputable Alignment Problem

Alexander BistagneNov 19, 2023, 7:38 PM

−5 points

4 comments1 min readLW link

(github.com)

For alignment, we should simultaneously use multiple theories of cognition and value

Roman LeventovApr 24, 2023, 10:37 AM

23 points

5 comments5 min readLW link

Goal alignment without alignment on epistemology, ethics, and science is futile

Roman LeventovApr 7, 2023, 8:22 AM

20 points

2 comments2 min readLW link

[Aspiration-based designs] A. Damages from misaligned optimization – two more models

Jobst Heitzig and Simon Dima

Jul 15, 2024, 2:08 PM

6 points

0 comments9 min readLW link

An LLM-based “exemplary actor”

Roman LeventovMay 29, 2023, 11:12 AM

16 points

0 comments12 min readLW link

Science of Deep Learning—a technical agenda

Marius HobbhahnOct 18, 2022, 2:54 PM

37 points

7 comments4 min readLW link

Rational Effective Utopia & Narrow Way There: Multiversal AI Alignment, Place AI, New Ethicophysics… (Updated)

ankFeb 11, 2025, 3:21 AM

13 points

8 comments35 min readLW link

Static Place AI Makes Agentic AI Redundant: Multiversal AI Alignment & Rational Utopia

ankFeb 13, 2025, 10:35 PM

1 point

2 comments11 min readLW link

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth BarnesJan 10, 2021, 12:30 AM

107 points

15 comments11 min readLW link 1 review

Distillation of Neurotech and Alignment Workshop January 2023

lisathiergart and Sumner L Norman

May 22, 2023, 7:17 AM

52 points

9 comments14 min readLW link

Autonomous Alignment Oversight Framework (AAOF)

JustausernameJul 25, 2023, 10:25 AM

−9 points

0 comments4 min readLW link

[Question] Thoughts on a “Sequences Inspired” PhD Topic

goose000Jun 17, 2021, 8:36 PM

7 points

2 comments2 min readLW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjrFeb 12, 2021, 7:55 AM

15 points

0 comments27 min readLW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDevJun 19, 2023, 2:32 AM

4 points

2 comments7 min readLW link

Seth Herd Apr 13, 2025, 10:42 PM
2 points
0
I jsut realized that I’d embarassingly misunderstood outer alignment for a long time, and it was based directly on this wikitag. I’d been including the wise selection of an alignment target as part of outer alignment, which it is not by almost all considered usage of the term. The phrasing in first paragraph firmly implied it was. So I edited that and included a new very brief set of definitions at the bottom. Anyone is most welcome to change or eliminate any of that, except that I’d love to know why if you’re reverting it to the version that seemed flat wrong.