Outer Alignment

Last edit: Apr 15, 2025, 3:42 AM by Seth Herd

Outer alignment (also known as the reward misspecification problem) is the problem of specifying a reward function that captures human preferences. Outer alignment asks the question: “What will we train our model to do?” Note that this is meant in the narrow technical sense of selecting a reward function; wisely choosing a training target is a separate issue (see the list of posts below).

As a problem, outer alignment is intuitive enough to state: is the specified loss function aligned with the intended goal of its designers? Achieving this in practice, however, is extremely difficult. Conveying the full “intention” behind a human request amounts to conveying the sum of human values and ethics, which is difficult in part because human intentions are themselves not well understood. Additionally, since most models are built as goal optimizers, they are susceptible to Goodhart’s Law: under sufficient optimization pressure, even a goal that looks well specified to humans can lead to negative consequences we failed to foresee.
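
For intuition, here is a minimal toy sketch of that failure mode. Everything in it (the function shapes, the numbers, the idea of a “search budget” standing in for optimization pressure) is an illustrative assumption, not taken from any of the posts listed below:

```python
# Toy illustration of Goodhart's Law under a misspecified reward.
# The designers care about true_utility, but the system optimizes
# proxy_reward, which omits a side effect that grows with x.
import numpy as np

def true_utility(x):
    # What the designers actually want: a benefit minus a side effect
    # that the reward designer forgot to penalize.
    return x - 0.1 * x**2

def proxy_reward(x):
    # What actually gets optimized: the benefit term alone.
    return x

# Apply increasing "optimization pressure" by searching a wider range
# of actions for the proxy-optimal one.
for search_budget in [1, 5, 10, 50, 100]:
    candidates = np.linspace(0, search_budget, 1000)
    best = candidates[np.argmax(proxy_reward(candidates))]
    print(f"budget={search_budget:3d}  proxy={proxy_reward(best):6.1f}  "
          f"true utility={true_utility(best):8.1f}")

# With a small search budget the proxy and the true objective agree;
# with a large budget the proxy-optimal action drives true utility negative.
```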

To solve the outer alignment problem, we would need to make progress on sub-problems such as specification gaming, value learning, and reward shaping/modeling. Proposed approaches include scalable oversight techniques such as IDA, as well as adversarial oversight techniques such as debate.
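
One of these sub-problems, reward modeling, can be made concrete with a small sketch. The toy NumPy example below (all names, dimensions, and numbers are assumptions for illustration, not drawn from the posts listed below) fits a linear reward model to noisy pairwise preference comparisons, the basic Bradley-Terry setup used in RLHF-style pipelines. The point is that the learned reward is only an estimate of what the human wanted, which is exactly where misspecification can enter:

```python
# Minimal sketch of reward modeling from pairwise preferences (Bradley-Terry).
import numpy as np

rng = np.random.default_rng(0)

# Pretend each trajectory is summarized by a feature vector; the "human"
# secretly prefers trajectories according to hidden weights w_true.
dim, n_pairs = 4, 500
w_true = rng.normal(size=dim)
A = rng.normal(size=(n_pairs, dim))   # features of option A in each comparison
B = rng.normal(size=(n_pairs, dim))   # features of option B
# Label = 1 if the human prefers A over B (noisy, via a logistic choice model).
p_prefer_A = 1 / (1 + np.exp(-(A @ w_true - B @ w_true)))
labels = (rng.random(n_pairs) < p_prefer_A).astype(float)

# Fit a linear reward model r(x) = w @ x by gradient ascent on the
# Bradley-Terry log-likelihood of the observed comparisons.
w = np.zeros(dim)
lr = 0.1
for _ in range(2000):
    logits = (A - B) @ w
    probs = 1 / (1 + np.exp(-logits))
    grad = (A - B).T @ (labels - probs) / n_pairs
    w += lr * grad

print("cosine similarity to the hidden preference weights:",
      w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true)))
```

Even in this idealized setting the recovered weights only approximate the hidden preferences; with richer environments and limited human feedback, the gap between the learned reward and the intended one is what outer alignment work tries to close.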

Outer Alignment vs. Inner Alignment

Outer alignment is often taken to be separate from the inner alignment problem, which asks: how can we robustly aim our AI optimizers at any objective function at all?

Keep in mind that inner and outer alignment failures can occur together; they are not a strict dichotomy, and even experienced alignment researchers often cannot tell them apart, which indicates that classifying failures with these terms is fuzzy. Ideally, rather than treating inner and outer alignment as a binary split to be tackled separately, we should think of a more holistic alignment picture that includes the interplay between inner and outer alignment approaches.

Risks from Learned Op­ti­miza­tion: Introduction

May 31, 2019, 11:44 PM
187 points
42 comments12 min readLW link3 reviews

6. The Mutable Values Prob­lem in Value Learn­ing and CEV

RogerDearnaleyDec 4, 2023, 6:31 PM
12 points
0 comments49 min readLW link

Align­ment has a Basin of At­trac­tion: Beyond the Orthog­o­nal­ity Thesis

RogerDearnaleyFeb 1, 2024, 9:15 PM
16 points
15 comments13 min readLW link

Another (outer) al­ign­ment failure story

paulfchristianoApr 7, 2021, 8:12 PM
248 points
38 comments12 min readLW link1 review

Truth­ful LMs as a warm-up for al­igned AGI

Jacob_HiltonJan 17, 2022, 4:49 PM
65 points
14 comments13 min readLW link

LOVE in a sim­box is all you need

jacob_cannellSep 28, 2022, 6:25 PM
66 points
73 comments44 min readLW link1 review

Re­quire­ments for a STEM-ca­pa­ble AGI Value Learner (my Case for Less Doom)

RogerDearnaleyMay 25, 2023, 9:26 AM
33 points
3 comments15 min readLW link

Gaia Net­work: a prac­ti­cal, in­cre­men­tal path­way to Open Agency Architecture

Dec 20, 2023, 5:11 PM
22 points
8 comments16 min readLW link

De­bate up­date: Obfus­cated ar­gu­ments problem

Beth BarnesDec 23, 2020, 3:24 AM
136 points
24 comments16 min readLW link

Outer vs in­ner mis­al­ign­ment: three framings

Richard_NgoJul 6, 2022, 7:46 PM
51 points
5 comments9 min readLW link

Book re­view: “A Thou­sand Brains” by Jeff Hawkins

Steven ByrnesMar 4, 2021, 5:10 AM
122 points
18 comments19 min readLW link

Main­tain­ing Align­ment dur­ing RSI as a Feed­back Con­trol Problem

berenMar 2, 2025, 12:21 AM
66 points
6 comments11 min readLW link

List of re­solved con­fu­sions about IDA

Wei DaiSep 30, 2019, 8:03 PM
97 points
18 comments3 min readLW link

My AGI Threat Model: Misal­igned Model-Based RL Agent

Steven ByrnesMar 25, 2021, 1:45 PM
74 points
40 comments16 min readLW link

“In­ner Align­ment Failures” Which Are Ac­tu­ally Outer Align­ment Failures

johnswentworthOct 31, 2020, 8:18 PM
66 points
38 comments5 min readLW link

Re­ward is not the op­ti­miza­tion target

TurnTroutJul 25, 2022, 12:03 AM
376 points
123 comments10 min readLW link3 reviews

AI Align­ment 2018-19 Review

Rohin ShahJan 28, 2020, 2:19 AM
126 points
6 comments35 min readLW link

If I were a well-in­ten­tioned AI… III: Ex­tremal Goodhart

Stuart_ArmstrongFeb 28, 2020, 11:24 AM
22 points
0 comments5 min readLW link

[Question] What if Ethics is Prov­ably Self-Con­tra­dic­tory?

YitzApr 18, 2024, 5:12 AM
3 points
7 comments2 min readLW link

Wor­ri­some mi­s­un­der­stand­ing of the core is­sues with AI transition

Roman LeventovJan 18, 2024, 10:05 AM
5 points
2 comments4 min readLW link

Outer al­ign­ment and imi­ta­tive amplification

evhubJan 10, 2020, 12:26 AM
24 points
11 comments9 min readLW link

How do new mod­els from OpenAI, Deep­Mind and An­thropic perform on Truth­fulQA?

Owain_EvansFeb 26, 2022, 12:46 PM
44 points
3 comments11 min readLW link

nos­talge­braist: Re­cur­sive Good­hart’s Law

Kaj_SotalaAug 26, 2020, 11:07 AM
53 points
27 comments1 min readLW link
(nostalgebraist.tumblr.com)

[Linkpost] In­tro­duc­ing Superalignment

berenJul 5, 2023, 6:23 PM
175 points
69 comments1 min readLW link
(openai.com)

Don’t al­ign agents to eval­u­a­tions of plans

TurnTroutNov 26, 2022, 9:16 PM
48 points
49 comments18 min readLW link

On the Con­fu­sion be­tween In­ner and Outer Misalignment

Chris_LeongMar 25, 2024, 11:59 AM
17 points
10 comments1 min readLW link

If I were a well-in­ten­tioned AI… II: Act­ing in a world

Stuart_ArmstrongFeb 27, 2020, 11:58 AM
20 points
0 comments3 min readLW link

AI al­ign­ment as a trans­la­tion problem

Roman LeventovFeb 5, 2024, 2:14 PM
22 points
2 comments3 min readLW link

In­fer­ence-Only De­bate Ex­per­i­ments Us­ing Math Problems

Aug 6, 2024, 5:44 PM
31 points
0 comments2 min readLW link

Shard The­ory: An Overview

David UdellAug 11, 2022, 5:44 AM
166 points
34 comments10 min readLW link

Four us­ages of “loss” in AI

TurnTroutOct 2, 2022, 12:52 AM
46 points
18 comments4 min readLW link

MIRI com­ments on Co­tra’s “Case for Align­ing Nar­rowly Su­per­hu­man Models”

Rob BensingerMar 5, 2021, 11:43 PM
142 points
13 comments26 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven ByrnesJul 10, 2020, 4:49 PM
45 points
7 comments8 min readLW link

Simulators

janusSep 2, 2022, 12:45 PM
633 points
168 comments41 min readLW link8 reviews
(generative.ink)

Eval­u­at­ing the his­tor­i­cal value mis­speci­fi­ca­tion argument

Matthew BarnettOct 5, 2023, 6:34 PM
190 points
162 comments7 min readLW link3 reviews

AXRP Epi­sode 12 - AI Ex­is­ten­tial Risk with Paul Christiano

DanielFilanDec 2, 2021, 2:20 AM
38 points
0 comments126 min readLW link

[In­tro to brain-like-AGI safety] 10. The al­ign­ment problem

Steven ByrnesMar 30, 2022, 1:24 PM
52 points
7 comments21 min readLW link

In­ner and outer al­ign­ment de­com­pose one hard prob­lem into two ex­tremely hard problems

TurnTroutDec 2, 2022, 2:43 AM
149 points
22 comments47 min readLW link3 reviews

Why “AI al­ign­ment” would bet­ter be re­named into “Ar­tifi­cial In­ten­tion re­search”

chaosmageJun 15, 2023, 10:32 AM
29 points
12 comments2 min readLW link

Lan­guage Agents Re­duce the Risk of Ex­is­ten­tial Catastrophe

May 28, 2023, 7:10 PM
39 points
14 comments26 min readLW link

If I were a well-in­ten­tioned AI… I: Image classifier

Stuart_ArmstrongFeb 26, 2020, 12:39 PM
35 points
4 comments5 min readLW link

Naive Hy­pothe­ses on AI Alignment

Shoshannah TekofskyJul 2, 2022, 7:03 PM
98 points
29 comments5 min readLW link

Paper: Con­sti­tu­tional AI: Harm­less­ness from AI Feed­back (An­thropic)

LawrenceCDec 16, 2022, 10:12 PM
68 points
11 comments1 min readLW link
(www.anthropic.com)

Cat­e­go­riz­ing failures as “outer” or “in­ner” mis­al­ign­ment is of­ten confused

Rohin ShahJan 6, 2023, 3:48 PM
93 points
21 comments8 min readLW link

Is the Star Trek Fed­er­a­tion re­ally in­ca­pable of build­ing AI?

Kaj_SotalaMar 18, 2018, 10:30 AM
19 points
4 comments2 min readLW link
(kajsotala.fi)

Hu­man Mimicry Mainly Works When We’re Already Close

johnswentworthAug 17, 2022, 6:41 PM
82 points
16 comments5 min readLW link

Learn­ing so­cietal val­ues from law as part of an AGI al­ign­ment strategy

John NayOct 21, 2022, 2:03 AM
5 points
18 comments54 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel NandaDec 15, 2021, 11:44 PM
127 points
9 comments15 min readLW link

[Aspira­tion-based de­signs] 1. In­for­mal in­tro­duc­tion

Apr 28, 2024, 1:00 PM
44 points
4 comments8 min readLW link

Align­ment as Game Design

Shoshannah TekofskyJul 16, 2022, 10:36 PM
11 points
7 comments2 min readLW link

25 Min Talk on Me­taEth­i­cal.AI with Ques­tions from Stu­art Armstrong

June KuApr 29, 2021, 3:38 PM
21 points
7 comments1 min readLW link

(Hu­mor) AI Align­ment Crit­i­cal Failure Table

Kaj_SotalaAug 31, 2020, 7:51 PM
24 points
2 comments1 min readLW link
(sl4.org)

Selec­tion The­o­rems: A Pro­gram For Un­der­stand­ing Agents

johnswentworthSep 28, 2021, 5:03 AM
128 points
28 comments6 min readLW link2 reviews

Ques­tion 2: Pre­dicted bad out­comes of AGI learn­ing architecture

Cameron BergFeb 11, 2022, 10:23 PM
5 points
1 comment10 min readLW link

An­nounc­ing the Align­ment of Com­plex Sys­tems Re­search Group

Jun 4, 2022, 4:10 AM
91 points
20 comments5 min readLW link

Prefer­ence Ag­gre­ga­tion as Bayesian Inference

berenJul 27, 2023, 5:59 PM
14 points
1 comment1 min readLW link

The True Story of How GPT-2 Be­came Max­i­mally Lewd

Jan 18, 2024, 9:03 PM
70 points
7 comments6 min readLW link
(youtu.be)

Con­cept Safety: Pro­duc­ing similar AI-hu­man con­cept spaces

Kaj_SotalaApr 14, 2015, 8:39 PM
51 points
45 comments8 min readLW link

[ASoT] Some thoughts about im­perfect world modeling

leogaoApr 7, 2022, 3:42 PM
7 points
0 comments4 min readLW link

An overview of 11 pro­pos­als for build­ing safe ad­vanced AI

evhubMay 29, 2020, 8:38 PM
220 points
37 comments38 min readLW link2 reviews

Con­fused why a “ca­pa­bil­ities re­search is good for al­ign­ment progress” po­si­tion isn’t dis­cussed more

Kaj_SotalaJun 2, 2022, 9:41 PM
130 points
27 comments4 min readLW link

Speci­fi­ca­tion Gam­ing: How AI Can Turn Your Wishes Against You [RA Video]

WriterDec 1, 2023, 7:30 PM
19 points
0 comments5 min readLW link
(youtu.be)

Align­ment al­lows “non­ro­bust” de­ci­sion-in­fluences and doesn’t re­quire ro­bust grading

TurnTroutNov 29, 2022, 6:23 AM
62 points
41 comments15 min readLW link

Epistemic states as a po­ten­tial be­nign prior

Tamsin LeakeAug 31, 2024, 6:26 PM
31 points
2 comments8 min readLW link
(carado.moe)

[Question] Col­lec­tion of ar­gu­ments to ex­pect (outer and in­ner) al­ign­ment failure?

Sam ClarkeSep 28, 2021, 4:55 PM
21 points
10 comments1 min readLW link

My Overview of the AI Align­ment Land­scape: Threat Models

Neel NandaDec 25, 2021, 11:07 PM
53 points
3 comments28 min readLW link

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus AstraJul 1, 2020, 5:30 PM
35 points
4 comments67 min readLW link

Some of my dis­agree­ments with List of Lethalities

TurnTroutJan 24, 2023, 12:25 AM
70 points
7 comments10 min readLW link

Men­tal sub­agent im­pli­ca­tions for AI Safety

moridinamaelJan 3, 2021, 6:59 PM
11 points
0 comments3 min readLW link

The Com­pu­ta­tional Anatomy of Hu­man Values

berenApr 6, 2023, 10:33 AM
74 points
30 comments30 min readLW link

The Prefer­ence Fulfill­ment Hypothesis

Kaj_SotalaFeb 26, 2023, 10:55 AM
66 points
62 comments11 min readLW link

The Disas­trously Con­fi­dent And Inac­cu­rate AI

Sharat Jacob JacobNov 18, 2022, 7:06 PM
13 points
0 comments13 min readLW link

Three Min­i­mum Pivotal Acts Pos­si­ble by Nar­row AI

Michael SoareverixJul 12, 2022, 9:51 AM
0 points
4 comments2 min readLW link

AI Alter­na­tive Fu­tures: Sce­nario Map­ping Ar­tifi­cial In­tel­li­gence Risk—Re­quest for Par­ti­ci­pa­tion (*Closed*)

KakiliApr 27, 2022, 10:07 PM
10 points
2 comments8 min readLW link

In­fer­ence from a Math­e­mat­i­cal De­scrip­tion of an Ex­ist­ing Align­ment Re­search: a pro­posal for an outer al­ign­ment re­search program

Christopher KingJun 2, 2023, 9:54 PM
7 points
4 comments16 min readLW link

On pre­dictabil­ity, chaos and AIs that don’t game our goals

Alejandro TlaieJul 15, 2024, 5:16 PM
4 points
8 comments6 min readLW link

VLM-RM: Spec­i­fy­ing Re­wards with Nat­u­ral Language

Oct 23, 2023, 2:11 PM
20 points
2 comments5 min readLW link
(far.ai)

Just How Hard a Prob­lem is Align­ment?

Roger DearnaleyFeb 25, 2023, 9:00 AM
3 points
1 comment21 min readLW link

My preferred fram­ings for re­ward mis­speci­fi­ca­tion and goal misgeneralisation

Yi-YangMay 6, 2023, 4:48 AM
27 points
1 comment8 min readLW link

Learn­ing the smooth prior

Apr 29, 2022, 9:10 PM
35 points
0 comments12 min readLW link

Re­search Notes: What are we al­ign­ing for?

Shoshannah TekofskyJul 8, 2022, 10:13 PM
19 points
8 comments2 min readLW link

find_pur­pose.exe

heatdeathandtaxesApr 12, 2025, 7:31 PM
−1 points
0 comments5 min readLW link
(heatdeathandtaxes.substack.com)

A Case for AI Safety via Law

JWJohnstonSep 11, 2023, 6:26 PM
18 points
12 comments4 min readLW link

A first suc­cess story for Outer Align­ment: In­struc­tGPT

Noosphere89Nov 8, 2022, 10:52 PM
6 points
1 comment1 min readLW link
(openai.com)

Lev­er­ag­ing Le­gal In­for­mat­ics to Align AI

John NaySep 18, 2022, 8:39 PM
11 points
0 comments3 min readLW link
(forum.effectivealtruism.org)

Align­ment via man­u­ally im­ple­ment­ing the util­ity function

ChantielSep 7, 2021, 8:20 PM
1 point
6 comments2 min readLW link

If Align­ment is Hard, then so is Self-Improvement

PavleMihaApr 7, 2023, 12:08 AM
21 points
20 comments1 min readLW link

Horn’s Chain: A Func­tional An­swer to the Hard Prob­lem of Consciousness

GalileoApr 18, 2025, 1:53 AM
1 point
0 comments11 min readLW link

Archety­pal Trans­fer Learn­ing: a Pro­posed Align­ment Solu­tion that solves the In­ner & Outer Align­ment Prob­lem while adding Cor­rigible Traits to GPT-2-medium

MiguelDevApr 26, 2023, 1:37 AM
14 points
5 comments10 min readLW link

Places of Lov­ing Grace [Story]

ankFeb 18, 2025, 11:49 PM
−1 points
0 comments4 min readLW link

Map­ping AI Ar­chi­tec­tures to Align­ment At­trac­tors: A SIEM-Based Framework

silentrevolutionsApr 12, 2025, 5:50 PM
1 point
0 comments1 min readLW link

In­duc­ing hu­man-like bi­ases in moral rea­son­ing LMs

Feb 20, 2024, 4:28 PM
23 points
3 comments14 min readLW link

[Question] Pop­u­lar ma­te­ri­als about en­vi­ron­men­tal goals/​agent foun­da­tions? Peo­ple want­ing to dis­cuss such top­ics?

Q HomeJan 22, 2025, 3:30 AM
5 points
0 comments1 min readLW link

“De­sign­ing agent in­cen­tives to avoid re­ward tam­per­ing”, DeepMind

gwernAug 14, 2019, 4:57 PM
28 points
15 comments1 min readLW link
(medium.com)

In­suffi­cient Values

Jun 16, 2021, 2:33 PM
31 points
16 comments6 min readLW link

I Recom­mend More Train­ing Rationales

Gianluca CalcagniDec 31, 2024, 2:06 PM
2 points
0 comments6 min readLW link

A sim­ple way to make GPT-3 fol­low instructions

Quintin PopeMar 8, 2021, 2:57 AM
11 points
5 comments4 min readLW link

Demo­cratic Fine-Tuning

Joe EdelmanAug 29, 2023, 6:13 PM
22 points
2 comments1 min readLW link
(open.substack.com)

Mak­ing it harder for an AGI to “trick” us, with STVs

Tor Økland BarstadJul 9, 2022, 2:42 PM
15 points
5 comments22 min readLW link

Why Re­cur­sive Self-Im­prove­ment Might Not Be the Ex­is­ten­tial Risk We Fear

Nassim_ANov 24, 2024, 5:17 PM
1 point
0 comments9 min readLW link

Align­ment with ar­gu­ment-net­works and as­sess­ment-predictions

Tor Økland BarstadDec 13, 2022, 2:17 AM
10 points
5 comments45 min readLW link

Thoughts on the Align­ment Im­pli­ca­tions of Scal­ing Lan­guage Models

leogaoJun 2, 2021, 9:32 PM
82 points
11 comments17 min readLW link

Co­op­er­a­tive Game Theory

TakkJun 7, 2023, 5:41 PM
1 point
0 comments1 min readLW link

The Goal Mis­gen­er­al­iza­tion Problem

MyspyMay 18, 2023, 11:40 PM
1 point
0 comments1 min readLW link
(drive.google.com)

“Sorcerer’s Ap­pren­tice” from Fan­ta­sia as an anal­ogy for alignment

awgMar 29, 2023, 6:21 PM
9 points
4 comments1 min readLW link
(video.disney.com)

Pro­posal: Tune LLMs to Use Cal­ibrated Language

OneManyNoneJun 7, 2023, 9:05 PM
9 points
0 comments5 min readLW link

Par­tial Si­mu­la­tion Ex­trap­o­la­tion: A Pro­posal for Build­ing Safer Simulators

lukemarksJun 17, 2023, 1:55 PM
16 points
0 comments10 min readLW link

Disen­tan­gling Shard The­ory into Atomic Claims

Leon LangJan 13, 2023, 4:23 AM
86 points
6 comments18 min readLW link

For­mal­iz­ing «Boundaries» with Markov blankets

ChipmonkSep 19, 2023, 9:01 PM
21 points
20 comments3 min readLW link

Cor­rigi­bil­ity or DWIM is an at­trac­tive pri­mary goal for AGI

Seth HerdNov 25, 2023, 7:37 PM
16 points
4 comments1 min readLW link

Em­pa­thy as a nat­u­ral con­se­quence of learnt re­ward models

berenFeb 4, 2023, 3:35 PM
48 points
27 comments13 min readLW link

[Question] Will re­search in AI risk jinx it? Con­se­quences of train­ing AI on AI risk arguments

Yann DuboisDec 19, 2022, 10:42 PM
5 points
6 comments1 min readLW link

When the Model Starts Talk­ing Like Me: A User-In­duced Struc­tural Adap­ta­tion Case Study

JunxiApr 19, 2025, 7:40 PM
3 points
1 comment4 min readLW link

Would this solve the (outer) al­ign­ment prob­lem, or at least help?

Wes RApr 6, 2025, 6:49 PM
−2 points
1 comment13 min readLW link

Ra­tion­al­ity vs Alignment

Donatas LučiūnasJul 7, 2024, 10:12 AM
−14 points
14 comments2 min readLW link

How I’d like al­ign­ment to get done (as of 2024-10-18)

TristanTrimOct 18, 2024, 11:39 PM
11 points
4 comments4 min readLW link

Early situ­a­tional aware­ness and its im­pli­ca­tions, a story

Jacob PfauFeb 6, 2023, 8:45 PM
29 points
6 comments3 min readLW link

AGI is un­con­trol­lable, al­ign­ment is impossible

Donatas LučiūnasMar 19, 2023, 5:49 PM
−12 points
21 comments1 min readLW link

Are ex­trap­o­la­tion-based AIs al­ignable?

cousin_itMar 24, 2023, 3:55 PM
24 points
15 comments1 min readLW link

Ques­tions about Value Lock-in, Pa­ter­nal­ism, and Empowerment

Sam F. BrownNov 16, 2022, 3:33 PM
13 points
2 comments12 min readLW link
(sambrown.eu)

En­hanc­ing Cor­rigi­bil­ity in AI Sys­tems through Ro­bust Feed­back Loops

JustausernameAug 24, 2023, 3:53 AM
1 point
0 comments6 min readLW link

[Question] Op­ti­miz­ing for Agency?

Michael SoareverixFeb 14, 2024, 8:31 AM
10 points
9 comments2 min readLW link

Up­dat­ing Utility Functions

May 9, 2022, 9:44 AM
41 points
6 comments8 min readLW link

Levels of goals and alignment

zeshenSep 16, 2022, 4:44 PM
27 points
4 comments6 min readLW link

Thin Align­ment Can’t Solve Thick Problems

Daan HenselmansApr 27, 2025, 10:42 PM
11 points
2 comments9 min readLW link

Disprov­ing the “Peo­ple-Pleas­ing” Hy­poth­e­sis for AI Self-Re­ports of Experience

rifeJan 26, 2025, 3:53 PM
3 points
18 comments12 min readLW link

Threat Model Liter­a­ture Review

Nov 1, 2022, 11:03 AM
78 points
4 comments25 min readLW link

[Question] Does hu­man (mis)al­ign­ment pose a sig­nifi­cant and im­mi­nent ex­is­ten­tial threat?

jrFeb 23, 2025, 10:03 AM
6 points
3 comments1 min readLW link

Open-ended ethics of phe­nom­ena (a desider­ata with uni­ver­sal moral­ity)

Ryo Nov 8, 2023, 8:10 PM
1 point
0 comments8 min readLW link

Con­di­tion­ing Gen­er­a­tive Models for Alignment

JozdienJul 18, 2022, 7:11 AM
60 points
8 comments20 min readLW link

PRISM: Per­spec­tive Rea­son­ing for In­te­grated Syn­the­sis and Me­di­a­tion (In­ter­ac­tive Demo)

Anthony DiamondMar 18, 2025, 6:03 PM
10 points
2 comments1 min readLW link

Fram­ing AI Childhoods

David UdellSep 6, 2022, 11:40 PM
37 points
8 comments4 min readLW link

7. Evolu­tion and Ethics

RogerDearnaleyFeb 15, 2024, 11:38 PM
3 points
6 comments6 min readLW link

Be­hav­ior Clon­ing is Miscalibrated

leogaoDec 5, 2021, 1:36 AM
78 points
3 comments3 min readLW link

The Road to Evil Is Paved with Good Ob­jec­tives: Frame­work to Clas­sify and Fix Misal­ign­ments.

ShivamJan 30, 2025, 2:44 AM
1 point
0 comments11 min readLW link

H-JEPA might be tech­ni­cally al­ignable in a mod­ified form

Roman LeventovMay 8, 2023, 11:04 PM
12 points
2 comments7 min readLW link

No-self as an al­ign­ment target

Milan WMay 13, 2025, 1:48 AM
33 points
5 comments1 min readLW link

An In­creas­ingly Ma­nipu­la­tive Newsfeed

Michaël TrazziJul 1, 2019, 3:26 PM
63 points
16 comments5 min readLW link

Re­quest for ad­vice: Re­search for Con­ver­sa­tional Game The­ory for LLMs

Rome ViharoOct 16, 2024, 5:53 PM
10 points
0 comments1 min readLW link

Align­ing an H-JEPA agent via train­ing on the out­puts of an LLM-based “ex­em­plary ac­tor”

Roman LeventovMay 29, 2023, 11:08 AM
12 points
10 comments30 min readLW link

Break­ing the Op­ti­mizer’s Curse, and Con­se­quences for Ex­is­ten­tial Risks and Value Learning

Roger DearnaleyFeb 21, 2023, 9:05 AM
10 points
1 comment23 min readLW link

Align­ment As A Bot­tle­neck To Use­ful­ness Of GPT-3

johnswentworthJul 21, 2020, 8:02 PM
111 points
57 comments3 min readLW link

[Question] Don’t you think RLHF solves outer al­ign­ment?

Charbel-RaphaëlNov 4, 2022, 12:36 AM
9 points
23 comments1 min readLW link

“Pick Two” AI Trilemma: Gen­er­al­ity, Agency, Align­ment.

Black FlagJan 15, 2025, 6:52 PM
7 points
0 comments2 min readLW link

Con­tex­tual Con­sti­tu­tional AI

aksh-nSep 28, 2024, 11:24 PM
13 points
2 comments12 min readLW link

[Question] Daisy-chain­ing ep­silon-step verifiers

DecaeneusApr 6, 2023, 2:07 AM
2 points
1 comment1 min readLW link

The The­o­ret­i­cal Re­ward Learn­ing Re­search Agenda: In­tro­duc­tion and Motivation

Joar SkalseFeb 28, 2025, 7:20 PM
25 points
4 comments14 min readLW link

On the Im­por­tance of Open Sourc­ing Re­ward Models

elandgreJan 2, 2023, 7:01 PM
18 points
5 comments6 min readLW link

Embed­ding Eth­i­cal Pri­ors into AI Sys­tems: A Bayesian Approach

JustausernameAug 3, 2023, 3:31 PM
−5 points
3 comments21 min readLW link

Break­ing down the MEAT of Alignment

JasonBrownApr 7, 2025, 8:47 AM
7 points
2 comments11 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tameraAug 3, 2022, 12:03 PM
136 points
23 comments6 min readLW link

Tether­ware #1: The case for hu­man­like AI with free will

Jáchym FibírJan 30, 2025, 10:58 AM
5 points
14 comments10 min readLW link
(tetherware.substack.com)

A pos­i­tive case for how we might suc­ceed at pro­saic AI alignment

evhubNov 16, 2021, 1:49 AM
81 points
46 comments6 min readLW link

Gaia Net­work: An Illus­trated Primer

Jan 18, 2024, 6:23 PM
3 points
2 comments15 min readLW link

Re­cre­at­ing the car­ing drive

CatneeSep 7, 2023, 10:41 AM
43 points
15 comments10 min readLW link1 review

If you’re very op­ti­mistic about ELK then you should be op­ti­mistic about outer alignment

Sam MarksApr 27, 2022, 7:30 PM
17 points
8 comments3 min readLW link

Free­dom Is All We Need

Leo GlisicApr 27, 2023, 12:09 AM
−1 points
8 comments10 min readLW link

Call for re­search on eval­u­at­ing al­ign­ment (fund­ing + ad­vice available)

Beth BarnesAug 31, 2021, 11:28 PM
105 points
11 comments5 min readLW link

Align­ment is not intelligent

Donatas LučiūnasNov 25, 2024, 6:59 AM
−23 points
18 comments5 min readLW link

In­ner al­ign­ment: what are we point­ing at?

lemonhopeSep 18, 2022, 11:09 AM
14 points
2 comments1 min readLW link

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. MurphyMay 12, 2022, 8:01 PM
58 points
0 comments59 min readLW link

Con­di­tion­ing Gen­er­a­tive Models with Restrictions

Adam JermynJul 21, 2022, 8:33 PM
18 points
4 comments8 min readLW link

The Lin­guis­tic Blind Spot of Value-Aligned Agency, Nat­u­ral and Ar­tifi­cial

Roman LeventovFeb 14, 2023, 6:57 AM
6 points
0 comments2 min readLW link
(arxiv.org)

Ex­ter­mi­nat­ing hu­mans might be on the to-do list of a Friendly AI

RomanSDec 7, 2021, 2:15 PM
5 points
8 comments2 min readLW link

In­vi­ta­tion to the Prince­ton AI Align­ment and Safety Seminar

Sadhika MalladiMar 17, 2024, 1:10 AM
6 points
1 comment1 min readLW link

CCS: Coun­ter­fac­tual Civ­i­liza­tion Simulation

MorphismMay 2, 2024, 10:54 PM
3 points
0 comments2 min readLW link

AI Align­ment: A Com­pre­hen­sive Survey

Stephen McAleerNov 1, 2023, 5:35 PM
20 points
1 comment1 min readLW link
(arxiv.org)

Will AI and Hu­man­ity Go to War?

Simon GoldsteinOct 1, 2024, 6:35 AM
9 points
4 comments6 min readLW link

The AGI needs to be honest

rokosbasiliskOct 16, 2021, 7:24 PM
2 points
11 comments2 min readLW link

A sin­gle prin­ci­ple re­lated to many Align­ment sub­prob­lems?

Q HomeApr 30, 2025, 9:49 AM
34 points
5 comments16 min readLW link

Re­in­force­ment Learn­ing from In­for­ma­tion Bazaar Feed­back, and other uses of in­for­ma­tion markets

Abhimanyu Pallavi SudhirSep 16, 2024, 1:04 AM
5 points
1 comment5 min readLW link

Com­po­si­tional prefer­ence mod­els for al­ign­ing LMs

Tomek KorbakOct 25, 2023, 12:17 PM
18 points
2 comments5 min readLW link

Ter­mi­nal goal vs Intelligence

Donatas LučiūnasDec 26, 2024, 8:10 AM
−12 points
24 comments1 min readLW link

Aligned AI as a wrap­per around an LLM

cousin_itMar 25, 2023, 3:58 PM
31 points
19 comments1 min readLW link

Achiev­ing AI Align­ment through De­liber­ate Uncer­tainty in Mul­ti­a­gent Systems

Florian_DietzFeb 17, 2024, 8:45 AM
4 points
0 comments13 min readLW link

What Should AI Owe To Us? Ac­countable and Aligned AI Sys­tems via Con­trac­tu­al­ist AI Alignment

xuanSep 8, 2022, 3:04 PM
26 points
16 comments25 min readLW link

Pre­train­ing Lan­guage Models with Hu­man Preferences

Feb 21, 2023, 5:57 PM
135 points
20 comments11 min readLW link2 reviews

Sup­ple­men­tary Align­ment In­sights Through a Highly Con­trol­led Shut­down Incentive

JustausernameJul 23, 2023, 4:08 PM
4 points
1 comment3 min readLW link

Why de­cep­tive al­ign­ment mat­ters for AGI safety

Marius HobbhahnSep 15, 2022, 1:38 PM
68 points
13 comments13 min readLW link

Model Integrity

Dec 6, 2024, 9:28 PM
4 points
1 comment18 min readLW link

Thoughts on the Fea­si­bil­ity of Pro­saic AGI Align­ment?

iamthouthouartiAug 21, 2020, 11:25 PM
8 points
10 comments1 min readLW link

How will we up­date about schem­ing?

ryan_greenblattJan 6, 2025, 8:21 PM
171 points
20 comments36 min readLW link

Ar­tifi­cial Static Place In­tel­li­gence: Guaran­teed Alignment

ankFeb 15, 2025, 11:08 AM
2 points
2 comments2 min readLW link

A Univer­sal Prompt as a Safe­guard Against AI Threats

Zhaiyk SultanMar 10, 2025, 2:28 AM
1 point
0 comments2 min readLW link

Clar­ify­ing AI X-risk

Nov 1, 2022, 11:03 AM
127 points
24 comments4 min readLW link1 review

Unal­igned AGI & Brief His­tory of Inequality

ankFeb 22, 2025, 4:26 PM
−20 points
4 comments7 min readLW link

The Align­ment Problems

Martín SotoJan 12, 2023, 10:29 PM
20 points
0 comments4 min readLW link

Pro­ject In­tro: Selec­tion The­o­rems for Modularity

Apr 4, 2022, 12:59 PM
73 points
20 comments16 min readLW link

Con­di­tion­ing, Prompts, and Fine-Tuning

Adam JermynAug 17, 2022, 8:52 PM
38 points
9 comments4 min readLW link

Con­trol­ling In­tel­li­gent Agents The Only Way We Know How: Ideal Bureau­cratic Struc­ture (IBS)

Justin BullockMay 24, 2021, 12:53 PM
14 points
15 comments6 min readLW link

Wire­head­ing and mis­al­ign­ment by com­po­si­tion on NetHack

pierlucadoroOct 27, 2023, 5:43 PM
34 points
4 comments4 min readLW link

The Me­taethics and Nor­ma­tive Ethics of AGI Value Align­ment: Many Ques­tions, Some Implications

Eleos Arete CitriniSep 16, 2021, 4:13 PM
6 points
0 comments8 min readLW link

Align­ment Can Re­duce Perfor­mance on Sim­ple Eth­i­cal Questions

Daan HenselmansFeb 3, 2025, 7:35 PM
16 points
7 comments6 min readLW link

[Question] Are there more than 12 paths to Su­per­in­tel­li­gence?

p4rziv4lOct 18, 2024, 4:05 PM
−3 points
0 comments1 min readLW link

Dist­in­guish­ing AI takeover scenarios

Sep 8, 2021, 4:19 PM
74 points
11 comments14 min readLW link

Re­quire­ments for a Basin of At­trac­tion to Alignment

RogerDearnaleyFeb 14, 2024, 7:10 AM
41 points
12 comments31 min readLW link

The for­mal goal is a pointer

MorphismMay 1, 2024, 12:27 AM
20 points
10 comments1 min readLW link

Pre­dic­tion can be Outer Aligned at Optimum

Lukas FinnvedenJan 10, 2021, 6:48 PM
15 points
12 comments11 min readLW link

RFC: Meta-eth­i­cal un­cer­tainty in AGI alignment

Gordon Seidoh WorleyJun 8, 2018, 8:56 PM
16 points
6 comments3 min readLW link

Toward a Hu­man Hy­brid Lan­guage for En­hanced Hu­man-Ma­chine Com­mu­ni­ca­tion: Ad­dress­ing the AI Align­ment Problem

Andndn DheudndAug 14, 2024, 10:19 PM
−4 points
2 comments4 min readLW link

Thoughts about OOD alignment

CatneeAug 24, 2022, 3:31 PM
11 points
10 comments2 min readLW link

[Question] Com­pe­tence vs Alignment

kwiat.devSep 30, 2020, 9:03 PM
7 points
4 comments1 min readLW link

Op­tion­al­ity ap­proach to ethics

Ryo Nov 13, 2023, 3:23 PM
7 points
2 comments3 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin PopeOct 13, 2021, 8:52 PM
9 points
0 comments2 min readLW link

In­for­ma­tion bot­tle­neck for coun­ter­fac­tual corrigibility

tailcalledDec 6, 2021, 5:11 PM
8 points
1 comment7 min readLW link

RL with KL penalties is bet­ter seen as Bayesian inference

May 25, 2022, 9:23 AM
114 points
17 comments12 min readLW link

[Question] Is there any ex­ist­ing term sum­ma­riz­ing non-scal­able over­sight meth­ods in outer al­ign­ment?

Allen ShenJul 31, 2023, 5:31 PM
1 point
0 comments1 min readLW link

The Ideal Speech Si­tu­a­tion as a Tool for AI Eth­i­cal Reflec­tion: A Frame­work for Alignment

kenneth myersFeb 9, 2024, 6:40 PM
6 points
12 comments3 min readLW link

Re­ac­tion to “Em­pow­er­ment is (al­most) All We Need” : an open-ended alternative

Ryo Nov 25, 2023, 3:35 PM
9 points
3 comments5 min readLW link

Ex­am­ples of AI’s be­hav­ing badly

Stuart_ArmstrongJul 16, 2015, 10:01 AM
41 points
41 comments1 min readLW link

The Steer­ing Problem

paulfchristianoNov 13, 2018, 5:14 PM
44 points
12 comments7 min readLW link

In­tel­li­gence–Agency Equiv­alence ≈ Mass–En­ergy Equiv­alence: On Static Na­ture of In­tel­li­gence & Phys­i­cal­iza­tion of Ethics

ankFeb 22, 2025, 12:12 AM
1 point
0 comments6 min readLW link

Higher Di­men­sion Carte­sian Ob­jects and Align­ing ‘Tiling Si­mu­la­tors’

lukemarksJun 11, 2023, 12:13 AM
22 points
0 comments5 min readLW link

Sim­ple al­ign­ment plan that maybe works

IknownothingJul 18, 2023, 10:48 PM
4 points
8 comments1 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke HayashiMay 6, 2023, 5:55 PM
9 points
6 comments2 min readLW link

Imi­ta­tion Learn­ing from Lan­guage Feedback

Mar 30, 2023, 2:11 PM
71 points
3 comments10 min readLW link

In the Name of All That Needs Saving

pleiotrothNov 7, 2024, 3:26 PM
18 points
3 comments22 min readLW link

Slay­ing the Hy­dra: to­ward a new game board for AI

PrometheusJun 23, 2023, 5:04 PM
0 points
5 comments6 min readLW link

Causal rep­re­sen­ta­tion learn­ing as a tech­nique to pre­vent goal misgeneralization

PabloAMCJan 4, 2023, 12:07 AM
21 points
0 comments8 min readLW link

Our Ex­ist­ing Solu­tions to AGI Align­ment (semi-safe)

Michael SoareverixJul 21, 2022, 7:00 PM
12 points
1 comment3 min readLW link

The de­fault sce­nario for the next 50 years

JulienNov 24, 2024, 2:01 PM
1 point
0 comments6 min readLW link

Shut­down-Seek­ing AI

Simon GoldsteinMay 31, 2023, 10:19 PM
50 points
32 comments15 min readLW link

Align­ment works both ways

Karl von WendtMar 7, 2023, 10:41 AM
23 points
21 comments2 min readLW link

You can’t fetch the coffee if you’re dead: an AI dilemma

hennygeAug 31, 2023, 11:03 AM
1 point
0 comments4 min readLW link

Get­ting from an un­al­igned AGI to an al­igned AGI?

Tor Økland BarstadJun 21, 2022, 12:36 PM
13 points
7 comments9 min readLW link

Us­ing Con­sen­sus Mechanisms as an ap­proach to Alignment

PrometheusJun 10, 2023, 11:38 PM
11 points
2 comments6 min readLW link

Use these three heuris­tic im­per­a­tives to solve alignment

GApr 6, 2023, 4:20 PM
−17 points
4 comments1 min readLW link

In­ves­ti­gat­ing causal un­der­stand­ing in LLMs

Jun 14, 2022, 1:57 PM
28 points
6 comments13 min readLW link

God vs AI scientifically

Donatas LučiūnasMar 21, 2023, 11:03 PM
−22 points
45 comments1 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

Jun 27, 2022, 3:58 PM
171 points
14 comments7 min readLW link

[Question] Is it worth mak­ing a database for moral pre­dic­tions?

Jonas HallgrenAug 16, 2021, 2:51 PM
1 point
0 comments2 min readLW link

Please Understand

samhealyApr 1, 2024, 12:33 PM
28 points
11 comments6 min readLW link

The case for al­ign­ing nar­rowly su­per­hu­man models

Ajeya CotraMar 5, 2021, 10:29 PM
186 points
75 comments38 min readLW link1 review

Align­ment is Hard: An Un­com­putable Align­ment Problem

Alexander BistagneNov 19, 2023, 7:38 PM
−5 points
4 comments1 min readLW link
(github.com)

For al­ign­ment, we should si­mul­ta­neously use mul­ti­ple the­o­ries of cog­ni­tion and value

Roman LeventovApr 24, 2023, 10:37 AM
23 points
5 comments5 min readLW link

Goal al­ign­ment with­out al­ign­ment on episte­mol­ogy, ethics, and sci­ence is futile

Roman LeventovApr 7, 2023, 8:22 AM
20 points
2 comments2 min readLW link

[Aspira­tion-based de­signs] A. Da­m­ages from mis­al­igned op­ti­miza­tion – two more models

Jul 15, 2024, 2:08 PM
6 points
0 comments9 min readLW link

An LLM-based “ex­em­plary ac­tor”

Roman LeventovMay 29, 2023, 11:12 AM
16 points
0 comments12 min readLW link

Science of Deep Learn­ing—a tech­ni­cal agenda

Marius HobbhahnOct 18, 2022, 2:54 PM
37 points
7 comments4 min readLW link

Ra­tional Effec­tive Utopia & Nar­row Way There: Mul­tiver­sal AI Align­ment, Place AI, New Ethico­physics… (Up­dated)

ankFeb 11, 2025, 3:21 AM
13 points
8 comments35 min readLW link

Static Place AI Makes Agen­tic AI Re­dun­dant: Mul­tiver­sal AI Align­ment & Ra­tional Utopia

ankFeb 13, 2025, 10:35 PM
1 point
2 comments11 min readLW link

Imi­ta­tive Gen­er­al­i­sa­tion (AKA ‘Learn­ing the Prior’)

Beth BarnesJan 10, 2021, 12:30 AM
107 points
15 comments11 min readLW link1 review

Distil­la­tion of Neu­rotech and Align­ment Work­shop Jan­uary 2023

May 22, 2023, 7:17 AM
51 points
9 comments14 min readLW link

Au­tonomous Align­ment Over­sight Frame­work (AAOF)

JustausernameJul 25, 2023, 10:25 AM
−9 points
0 comments4 min readLW link

[Question] Thoughts on a “Se­quences In­spired” PhD Topic

goose000Jun 17, 2021, 8:36 PM
7 points
2 comments2 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjrFeb 12, 2021, 7:55 AM
15 points
0 comments27 min readLW link

A Mul­tidis­ci­plinary Ap­proach to Align­ment (MATA) and Archety­pal Trans­fer Learn­ing (ATL)

MiguelDevJun 19, 2023, 2:32 AM
4 points
2 comments7 min readLW link