Outer Alignment

Tag. Last edit: 9 Oct 2023 23:38 UTC by Linda Linsefors

Outer alignment asks the question: “What should we aim our model at?” In other words, is the model optimizing for the correct reward, one with no exploitable loopholes? It is also known as the reward misspecification problem.

As a problem, outer alignment is easy to state: is the specified loss function aligned with the intended goal of its designers? Achieving this in practice is extremely difficult. Conveying the full “intention” behind a human request amounts to conveying the sum of all human values and ethics, and human intentions are themselves not well understood. Additionally, since most models are designed as goal optimizers, they are susceptible to Goodhart’s Law: when a measure becomes a target, it tends to stop being a good measure. As a result, we may be unable to foresee the negative consequences of applying excessive optimization pressure to a goal that otherwise looks well specified to humans.
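The Goodhart’s Law concern can be made concrete with a toy sketch. Everything in it is an assumption chosen for illustration and does not come from any real system: `intended_value` is a hypothetical stand-in for what the designers want, `proxy_reward` is a hypothetical specified reward that agrees with it only under weak optimization, and `optimize` is a plain argmax standing in for a training process.

```python
def intended_value(effort: float) -> float:
    """Hypothetical designer intent: benefits saturate, then turn harmful."""
    return effort - 0.1 * effort ** 2


def proxy_reward(effort: float) -> float:
    """Hypothetical specified reward: tracks the intent only for small effort."""
    return effort


def optimize(reward, candidates):
    """Stand-in for training: pick the candidate with the highest reward."""
    return max(candidates, key=reward)


def sweep(limits):
    """Increase optimization pressure by widening the search range.

    Returns (limit, chosen_effort, intended_value_of_choice) per limit.
    """
    results = []
    for limit in limits:
        # Candidate efforts from 0 to `limit` in steps of 0.5.
        candidates = [i * 0.5 for i in range(int(limit * 2) + 1)]
        chosen = optimize(proxy_reward, candidates)
        results.append((limit, chosen, intended_value(chosen)))
    return results


if __name__ == "__main__":
    for limit, chosen, value in sweep([2, 5, 10, 20]):
        print(f"limit={limit:>2}  chosen effort={chosen:>4}  intended value={value:.1f}")
```

In this sketch the proxy keeps rising as the search range widens, while the intended value of the chosen action first improves and then falls below zero. The specified reward looked reasonable at low optimization pressure and only failed once the optimizer could push far enough.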

Solving the outer alignment problem requires progress on several sub-problems, including specification gaming, value learning, and reward shaping/modeling. Proposed solutions include scalable oversight techniques such as iterated distillation and amplification (IDA), as well as adversarial oversight techniques such as debate.

Outer Alignment vs. Inner Alignment

Outer alignment is often treated as separate from the inner alignment problem, which asks: how can we robustly aim our AI optimizers at any objective function at all?

Inner and outer alignment failures can occur together. The two are not a dichotomy, and even experienced alignment researchers are often unable to tell them apart, which indicates that classifying failures by these terms is fuzzy. Ideally, rather than treating inner and outer alignment as a binary split whose halves can be tackled individually, we should take a more holistic view of alignment that includes the interplay between inner and outer alignment approaches.

Risks from Learned Op­ti­miza­tion: Introduction

31 May 2019 23:44 UTC
183 points
42 comments12 min readLW link3 reviews

Align­ment has a Basin of At­trac­tion: Beyond the Orthog­o­nal­ity Thesis

RogerDearnaley1 Feb 2024 21:15 UTC
4 points
15 comments13 min readLW link

6. The Mutable Values Prob­lem in Value Learn­ing and CEV

RogerDearnaley4 Dec 2023 18:31 UTC
12 points
0 comments49 min readLW link

Another (outer) al­ign­ment failure story

paulfchristiano7 Apr 2021 20:12 UTC
237 points
38 comments12 min readLW link1 review

Book re­view: “A Thou­sand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC
116 points
18 comments19 min readLW link

De­bate up­date: Obfus­cated ar­gu­ments problem

Beth Barnes23 Dec 2020 3:24 UTC
135 points
24 comments16 min readLW link

Gaia Net­work: a prac­ti­cal, in­cre­men­tal path­way to Open Agency Architecture

20 Dec 2023 17:11 UTC
14 points
8 comments16 min readLW link

Truth­ful LMs as a warm-up for al­igned AGI

Jacob_Hilton17 Jan 2022 16:49 UTC
65 points
14 comments13 min readLW link

LOVE in a sim­box is all you need

jacob_cannell28 Sep 2022 18:25 UTC
63 points
72 comments44 min readLW link1 review

Outer vs in­ner mis­al­ign­ment: three framings

Richard_Ngo6 Jul 2022 19:46 UTC
49 points
5 comments9 min readLW link

Re­quire­ments for a STEM-ca­pa­ble AGI Value Learner (my Case for Less Doom)

RogerDearnaley25 May 2023 9:26 UTC
32 points
3 comments15 min readLW link

Con­cept Safety: Pro­duc­ing similar AI-hu­man con­cept spaces

Kaj_Sotala14 Apr 2015 20:39 UTC
51 points
45 comments8 min readLW link


janus2 Sep 2022 12:45 UTC
592 points
161 comments41 min readLW link8 reviews

The Com­pu­ta­tional Anatomy of Hu­man Values

beren6 Apr 2023 10:33 UTC
70 points
30 comments30 min readLW link

If I were a well-in­ten­tioned AI… I: Image classifier

Stuart_Armstrong26 Feb 2020 12:39 UTC
35 points
4 comments5 min readLW link

[Linkpost] In­tro­duc­ing Superalignment

beren5 Jul 2023 18:23 UTC
173 points
68 comments1 min readLW link

If I were a well-in­ten­tioned AI… II: Act­ing in a world

Stuart_Armstrong27 Feb 2020 11:58 UTC
20 points
0 comments3 min readLW link

nos­talge­braist: Re­cur­sive Good­hart’s Law

Kaj_Sotala26 Aug 2020 11:07 UTC
53 points
27 comments1 min readLW link

(Hu­mor) AI Align­ment Crit­i­cal Failure Table

Kaj_Sotala31 Aug 2020 19:51 UTC
24 points
2 comments1 min readLW link

If I were a well-in­ten­tioned AI… III: Ex­tremal Goodhart

Stuart_Armstrong28 Feb 2020 11:24 UTC
22 points
0 comments5 min readLW link

AI Align­ment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC
126 points
6 comments35 min readLW link

Four us­ages of “loss” in AI

TurnTrout2 Oct 2022 0:52 UTC
43 points
18 comments4 min readLW link

Speci­fi­ca­tion Gam­ing: How AI Can Turn Your Wishes Against You [RA Video]

Writer1 Dec 2023 19:30 UTC
19 points
0 comments5 min readLW link

Learn­ing so­cietal val­ues from law as part of an AGI al­ign­ment strategy

John Nay21 Oct 2022 2:03 UTC
5 points
18 comments54 min readLW link

Selec­tion The­o­rems: A Pro­gram For Un­der­stand­ing Agents

johnswentworth28 Sep 2021 5:03 UTC
124 points
28 comments6 min readLW link2 reviews

[Question] Col­lec­tion of ar­gu­ments to ex­pect (outer and in­ner) al­ign­ment failure?

Sam Clarke28 Sep 2021 16:55 UTC
21 points
10 comments1 min readLW link

Prefer­ence Ag­gre­ga­tion as Bayesian Inference

beren27 Jul 2023 17:59 UTC
14 points
1 comment1 min readLW link

AXRP Epi­sode 12 - AI Ex­is­ten­tial Risk with Paul Christiano

DanielFilan2 Dec 2021 2:20 UTC
38 points
0 comments126 min readLW link

AI al­ign­ment as a trans­la­tion problem

Roman Leventov5 Feb 2024 14:14 UTC
21 points
2 comments3 min readLW link

Don’t al­ign agents to eval­u­a­tions of plans

TurnTrout26 Nov 2022 21:16 UTC
42 points
49 comments18 min readLW link

Align­ment al­lows “non­ro­bust” de­ci­sion-in­fluences and doesn’t re­quire ro­bust grading

TurnTrout29 Nov 2022 6:23 UTC
60 points
42 comments15 min readLW link

In­ner and outer al­ign­ment de­com­pose one hard prob­lem into two ex­tremely hard problems

TurnTrout2 Dec 2022 2:43 UTC
138 points
22 comments47 min readLW link3 reviews

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus Astra1 Jul 2020 17:30 UTC
35 points
4 comments67 min readLW link

Lan­guage Agents Re­duce the Risk of Ex­is­ten­tial Catastrophe

28 May 2023 19:10 UTC
30 points
14 comments26 min readLW link

Paper: Con­sti­tu­tional AI: Harm­less­ness from AI Feed­back (An­thropic)

LawrenceC16 Dec 2022 22:12 UTC
68 points
11 comments1 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
127 points
9 comments15 min readLW link

My Overview of the AI Align­ment Land­scape: Threat Models

Neel Nanda25 Dec 2021 23:07 UTC
52 points
3 comments28 min readLW link

Cat­e­go­riz­ing failures as “outer” or “in­ner” mis­al­ign­ment is of­ten confused

Rohin Shah6 Jan 2023 15:48 UTC
86 points
21 comments8 min readLW link

Ques­tion 2: Pre­dicted bad out­comes of AGI learn­ing architecture

Cameron Berg11 Feb 2022 22:23 UTC
5 points
1 comment10 min readLW link

Some of my dis­agree­ments with List of Lethalities

TurnTrout24 Jan 2023 0:25 UTC
68 points
7 comments10 min readLW link

How do new mod­els from OpenAI, Deep­Mind and An­thropic perform on Truth­fulQA?

Owain_Evans26 Feb 2022 12:46 UTC
44 points
3 comments11 min readLW link

[In­tro to brain-like-AGI safety] 10. The al­ign­ment problem

Steven Byrnes30 Mar 2022 13:24 UTC
48 points
6 comments19 min readLW link

“In­ner Align­ment Failures” Which Are Ac­tu­ally Outer Align­ment Failures

johnswentworth31 Oct 2020 20:18 UTC
66 points
38 comments5 min readLW link

[ASoT] Some thoughts about im­perfect world modeling

leogao7 Apr 2022 15:42 UTC
7 points
0 comments4 min readLW link

Outer al­ign­ment and imi­ta­tive amplification

evhub10 Jan 2020 0:26 UTC
24 points
11 comments9 min readLW link

An overview of 11 pro­pos­als for build­ing safe ad­vanced AI

evhub29 May 2020 20:38 UTC
205 points
36 comments38 min readLW link2 reviews

Men­tal sub­agent im­pli­ca­tions for AI Safety

moridinamael3 Jan 2021 18:59 UTC
11 points
0 comments3 min readLW link

The Prefer­ence Fulfill­ment Hypothesis

Kaj_Sotala26 Feb 2023 10:55 UTC
66 points
62 comments11 min readLW link

List of re­solved con­fu­sions about IDA

Wei Dai30 Sep 2019 20:03 UTC
94 points
18 comments3 min readLW link

Eval­u­at­ing the his­tor­i­cal value mis­speci­fi­ca­tion argument

Matthew Barnett5 Oct 2023 18:34 UTC
160 points
139 comments7 min readLW link

Is the Star Trek Fed­er­a­tion re­ally in­ca­pable of build­ing AI?

Kaj_Sotala18 Mar 2018 10:30 UTC
19 points
4 comments2 min readLW link

Con­fused why a “ca­pa­bil­ities re­search is good for al­ign­ment progress” po­si­tion isn’t dis­cussed more

Kaj_Sotala2 Jun 2022 21:41 UTC
129 points
27 comments4 min readLW link

An­nounc­ing the Align­ment of Com­plex Sys­tems Re­search Group

4 Jun 2022 4:10 UTC
91 points
20 comments5 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven Byrnes10 Jul 2020 16:49 UTC
45 points
7 comments8 min readLW link

Naive Hy­pothe­ses on AI Alignment

Shoshannah Tekofsky2 Jul 2022 19:03 UTC
98 points
29 comments5 min readLW link

Why “AI al­ign­ment” would bet­ter be re­named into “Ar­tifi­cial In­ten­tion re­search”

chaosmage15 Jun 2023 10:32 UTC
28 points
13 comments2 min readLW link

MIRI com­ments on Co­tra’s “Case for Align­ing Nar­rowly Su­per­hu­man Models”

Rob Bensinger5 Mar 2021 23:43 UTC
142 points
13 comments26 min readLW link

Align­ment as Game Design

Shoshannah Tekofsky16 Jul 2022 22:36 UTC
11 points
7 comments2 min readLW link

The True Story of How GPT-2 Be­came Max­i­mally Lewd

18 Jan 2024 21:03 UTC
70 points
7 comments6 min readLW link

My AGI Threat Model: Misal­igned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
68 points
40 comments16 min readLW link

Re­ward is not the op­ti­miza­tion target

TurnTrout25 Jul 2022 0:03 UTC
355 points
121 comments10 min readLW link3 reviews

Wor­ri­some mi­s­un­der­stand­ing of the core is­sues with AI transition

Roman Leventov18 Jan 2024 10:05 UTC
5 points
2 comments4 min readLW link

Shard The­ory: An Overview

David Udell11 Aug 2022 5:44 UTC
160 points
34 comments10 min readLW link

Hu­man Mimicry Mainly Works When We’re Already Close

johnswentworth17 Aug 2022 18:41 UTC
80 points
16 comments5 min readLW link

25 Min Talk on Me­taEth­i­cal.AI with Ques­tions from Stu­art Armstrong

June Ku29 Apr 2021 15:38 UTC
21 points
7 comments1 min readLW link

Sup­ple­men­tary Align­ment In­sights Through a Highly Con­trol­led Shut­down Incentive

Justausername23 Jul 2023 16:08 UTC
4 points
1 comment3 min readLW link

Au­tonomous Align­ment Over­sight Frame­work (AAOF)

Justausername25 Jul 2023 10:25 UTC
−9 points
0 comments4 min readLW link

[Question] Com­pe­tence vs Alignment

Ariel Kwiatkowski30 Sep 2020 21:03 UTC
7 points
4 comments1 min readLW link

[Question] Is there any ex­ist­ing term sum­ma­riz­ing non-scal­able over­sight meth­ods in outer al­ign­ment?

Allen Shen31 Jul 2023 17:31 UTC
1 point
0 comments1 min readLW link

Embed­ding Eth­i­cal Pri­ors into AI Sys­tems: A Bayesian Approach

Justausername3 Aug 2023 15:31 UTC
−5 points
3 comments21 min readLW link

En­hanc­ing Cor­rigi­bil­ity in AI Sys­tems through Ro­bust Feed­back Loops

Justausername24 Aug 2023 3:53 UTC
1 point
0 comments6 min readLW link

Demo­cratic Fine-Tuning

Joe Edelman29 Aug 2023 18:13 UTC
25 points
2 comments1 min readLW link

You can’t fetch the coffee if you’re dead: an AI dilemma

hennyge31 Aug 2023 11:03 UTC
1 point
0 comments4 min readLW link

Re­cre­at­ing the car­ing drive

Catnee7 Sep 2023 10:41 UTC
43 points
14 comments10 min readLW link

A Case for AI Safety via Law

JWJohnston11 Sep 2023 18:26 UTC
17 points
12 comments4 min readLW link

For­mal­iz­ing «Boundaries» with Markov blankets

Chipmonk19 Sep 2023 21:01 UTC
20 points
19 comments3 min readLW link

AGI Align­ment is iso­mor­phic to Un­con­di­tional Love

Raghuvar Nadig9 Oct 2023 15:58 UTC
−11 points
0 comments11 min readLW link

VLM-RM: Spec­i­fy­ing Re­wards with Nat­u­ral Language

23 Oct 2023 14:11 UTC
20 points
2 comments5 min readLW link

Imi­ta­tive Gen­er­al­i­sa­tion (AKA ‘Learn­ing the Prior’)

Beth Barnes10 Jan 2021 0:30 UTC
103 points
15 comments12 min readLW link1 review

Pre­dic­tion can be Outer Aligned at Optimum

Lukas Finnveden10 Jan 2021 18:48 UTC
15 points
12 comments11 min readLW link

The case for al­ign­ing nar­rowly su­per­hu­man models

Ajeya Cotra5 Mar 2021 22:29 UTC
184 points
75 comments38 min readLW link1 review

A sim­ple way to make GPT-3 fol­low instructions

Quintin Pope8 Mar 2021 2:57 UTC
11 points
5 comments4 min readLW link

RFC: Meta-eth­i­cal un­cer­tainty in AGI alignment

Gordon Seidoh Worley8 Jun 2018 20:56 UTC
16 points
6 comments3 min readLW link

Con­trol­ling In­tel­li­gent Agents The Only Way We Know How: Ideal Bureau­cratic Struc­ture (IBS)

Justin Bullock24 May 2021 12:53 UTC
14 points
15 comments6 min readLW link

Thoughts on the Align­ment Im­pli­ca­tions of Scal­ing Lan­guage Models

leogao2 Jun 2021 21:32 UTC
82 points
11 comments17 min readLW link

In­suffi­cient Values

16 Jun 2021 14:33 UTC
31 points
15 comments5 min readLW link

[Question] Thoughts on a “Se­quences In­spired” PhD Topic

goose00017 Jun 2021 20:36 UTC
7 points
2 comments2 min readLW link

[Question] Is it worth mak­ing a database for moral pre­dic­tions?

Jonas Hallgren16 Aug 2021 14:51 UTC
1 point
0 comments2 min readLW link

Call for re­search on eval­u­at­ing al­ign­ment (fund­ing + ad­vice available)

Beth Barnes31 Aug 2021 23:28 UTC
105 points
11 comments5 min readLW link

Dist­in­guish­ing AI takeover scenarios

8 Sep 2021 16:19 UTC
72 points
11 comments14 min readLW link

Align­ment via man­u­ally im­ple­ment­ing the util­ity function

Chantiel7 Sep 2021 20:20 UTC
1 point
6 comments2 min readLW link

The Me­taethics and Nor­ma­tive Ethics of AGI Value Align­ment: Many Ques­tions, Some Implications

Eleos Arete Citrini16 Sep 2021 16:13 UTC
6 points
0 comments8 min readLW link

The AGI needs to be honest

rokosbasilisk16 Oct 2021 19:24 UTC
2 points
11 comments2 min readLW link

A pos­i­tive case for how we might suc­ceed at pro­saic AI alignment

evhub16 Nov 2021 1:49 UTC
80 points
46 comments6 min readLW link

Be­hav­ior Clon­ing is Miscalibrated

leogao5 Dec 2021 1:36 UTC
76 points
3 comments3 min readLW link

In­for­ma­tion bot­tle­neck for coun­ter­fac­tual corrigibility

tailcalled6 Dec 2021 17:11 UTC
8 points
1 comment7 min readLW link

Ex­ter­mi­nat­ing hu­mans might be on the to-do list of a Friendly AI

RomanS7 Dec 2021 14:15 UTC
5 points
8 comments2 min readLW link

Pro­ject In­tro: Selec­tion The­o­rems for Modularity

4 Apr 2022 12:59 UTC
71 points
20 comments16 min readLW link

Learn­ing the smooth prior

29 Apr 2022 21:10 UTC
35 points
0 comments12 min readLW link

Up­dat­ing Utility Functions

9 May 2022 9:44 UTC
37 points
6 comments8 min readLW link

AI Alter­na­tive Fu­tures: Sce­nario Map­ping Ar­tifi­cial In­tel­li­gence Risk—Re­quest for Par­ti­ci­pa­tion (*Closed*)

Kakili27 Apr 2022 22:07 UTC
10 points
2 comments8 min readLW link

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
53 points
0 comments59 min readLW link

RL with KL penalties is bet­ter seen as Bayesian inference

25 May 2022 9:23 UTC
114 points
17 comments12 min readLW link

In­ves­ti­gat­ing causal un­der­stand­ing in LLMs

14 Jun 2022 13:57 UTC
28 points
6 comments13 min readLW link

Get­ting from an un­al­igned AGI to an al­igned AGI?

Tor Økland Barstad21 Jun 2022 12:36 UTC
13 points
7 comments9 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

27 Jun 2022 15:58 UTC
169 points
14 comments7 min readLW link

Re­search Notes: What are we al­ign­ing for?

Shoshannah Tekofsky8 Jul 2022 22:13 UTC
19 points
8 comments2 min readLW link

Mak­ing it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad9 Jul 2022 14:42 UTC
15 points
5 comments22 min readLW link

Three Min­i­mum Pivotal Acts Pos­si­ble by Nar­row AI

Michael Soareverix12 Jul 2022 9:51 UTC
0 points
4 comments2 min readLW link

Con­di­tion­ing Gen­er­a­tive Models for Alignment

Jozdien18 Jul 2022 7:11 UTC
58 points
8 comments20 min readLW link

Our Ex­ist­ing Solu­tions to AGI Align­ment (semi-safe)

Michael Soareverix21 Jul 2022 19:00 UTC
12 points
1 comment3 min readLW link

Con­di­tion­ing Gen­er­a­tive Models with Restrictions

Adam Jermyn21 Jul 2022 20:33 UTC
18 points
4 comments8 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tamera3 Aug 2022 12:03 UTC
126 points
23 comments6 min readLW link

Con­di­tion­ing, Prompts, and Fine-Tuning

Adam Jermyn17 Aug 2022 20:52 UTC
38 points
9 comments4 min readLW link

Thoughts about OOD alignment

Catnee24 Aug 2022 15:31 UTC
11 points
10 comments2 min readLW link

Fram­ing AI Childhoods

David Udell6 Sep 2022 23:40 UTC
37 points
8 comments4 min readLW link

What Should AI Owe To Us? Ac­countable and Aligned AI Sys­tems via Con­trac­tu­al­ist AI Alignment

xuan8 Sep 2022 15:04 UTC
32 points
15 comments25 min readLW link

Why de­cep­tive al­ign­ment mat­ters for AGI safety

Marius Hobbhahn15 Sep 2022 13:38 UTC
57 points
13 comments13 min readLW link

Levels of goals and alignment

zeshen16 Sep 2022 16:44 UTC
27 points
4 comments6 min readLW link

In­ner al­ign­ment: what are we point­ing at?

lukehmiles18 Sep 2022 11:09 UTC
14 points
2 comments1 min readLW link

Lev­er­ag­ing Le­gal In­for­mat­ics to Align AI

John Nay18 Sep 2022 20:39 UTC
11 points
0 comments3 min readLW link

Plan­ning ca­pac­ity and daemons

lukehmiles26 Sep 2022 0:15 UTC
2 points
0 comments5 min readLW link

Science of Deep Learn­ing—a tech­ni­cal agenda

Marius Hobbhahn18 Oct 2022 14:54 UTC
36 points
7 comments4 min readLW link

Clar­ify­ing AI X-risk

1 Nov 2022 11:03 UTC
127 points
24 comments4 min readLW link1 review

Threat Model Liter­a­ture Review

1 Nov 2022 11:03 UTC
74 points
4 comments25 min readLW link

Ques­tions about Value Lock-in, Pa­ter­nal­ism, and Empowerment

Sam F. Brown16 Nov 2022 15:33 UTC
13 points
2 comments12 min readLW link

If you’re very op­ti­mistic about ELK then you should be op­ti­mistic about outer alignment

Sam Marks27 Apr 2022 19:30 UTC
17 points
8 comments3 min readLW link

[Question] Don’t you think RLHF solves outer al­ign­ment?

Charbel-Raphaël4 Nov 2022 0:36 UTC
9 points
23 comments1 min readLW link

A first suc­cess story for Outer Align­ment: In­struc­tGPT

Noosphere898 Nov 2022 22:52 UTC
6 points
1 comment1 min readLW link

The Disas­trously Con­fi­dent And Inac­cu­rate AI

Sharat Jacob Jacob18 Nov 2022 19:06 UTC
13 points
0 comments13 min readLW link

Align­ment with ar­gu­ment-net­works and as­sess­ment-predictions

Tor Økland Barstad13 Dec 2022 2:17 UTC
10 points
5 comments45 min readLW link

Disen­tan­gling Shard The­ory into Atomic Claims

Leon Lang13 Jan 2023 4:23 UTC
85 points
6 comments18 min readLW link

[Question] Will re­search in AI risk jinx it? Con­se­quences of train­ing AI on AI risk arguments

Yann Dubois19 Dec 2022 22:42 UTC
5 points
6 comments1 min readLW link

On the Im­por­tance of Open Sourc­ing Re­ward Models

elandgre2 Jan 2023 19:01 UTC
17 points
5 comments6 min readLW link

Causal rep­re­sen­ta­tion learn­ing as a tech­nique to pre­vent goal misgeneralization

PabloAMC4 Jan 2023 0:07 UTC
19 points
0 comments8 min readLW link

The Align­ment Problems

Martín Soto12 Jan 2023 22:29 UTC
19 points
0 comments4 min readLW link

Em­pa­thy as a nat­u­ral con­se­quence of learnt re­ward models

beren4 Feb 2023 15:35 UTC
46 points
27 comments13 min readLW link

Early situ­a­tional aware­ness and its im­pli­ca­tions, a story

Jacob Pfau6 Feb 2023 20:45 UTC
29 points
6 comments3 min readLW link

The Lin­guis­tic Blind Spot of Value-Aligned Agency, Nat­u­ral and Ar­tifi­cial

Roman Leventov14 Feb 2023 6:57 UTC
6 points
0 comments2 min readLW link

Pre­train­ing Lan­guage Models with Hu­man Preferences

21 Feb 2023 17:57 UTC
133 points
18 comments11 min readLW link

Break­ing the Op­ti­mizer’s Curse, and Con­se­quences for Ex­is­ten­tial Risks and Value Learning

Roger Dearnaley21 Feb 2023 9:05 UTC
10 points
1 comment23 min readLW link

Just How Hard a Prob­lem is Align­ment?

Roger Dearnaley25 Feb 2023 9:00 UTC
−1 points
1 comment21 min readLW link

Align­ment works both ways

Karl von Wendt7 Mar 2023 10:41 UTC
22 points
21 comments2 min readLW link

AGI is un­con­trol­lable, al­ign­ment is impossible

Donatas Lučiūnas19 Mar 2023 17:49 UTC
−12 points
21 comments1 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

Gaia Net­work: An Illus­trated Primer

18 Jan 2024 18:23 UTC
1 point
2 comments15 min readLW link

7. Evolu­tion and Ethics

RogerDearnaley15 Feb 2024 23:38 UTC
2 points
6 comments6 min readLW link

In­duc­ing hu­man-like bi­ases in moral rea­son­ing LMs

20 Feb 2024 16:28 UTC
16 points
1 comment14 min readLW link

Re­quire­ments for a Basin of At­trac­tion to Alignment

RogerDearnaley14 Feb 2024 7:10 UTC
20 points
6 comments31 min readLW link

The Ideal Speech Si­tu­a­tion as a Tool for AI Eth­i­cal Reflec­tion: A Frame­work for Alignment

kenneth myers9 Feb 2024 18:40 UTC
6 points
12 comments3 min readLW link

[Question] Op­ti­miz­ing for Agency?

Michael Soareverix14 Feb 2024 8:31 UTC
8 points
4 comments2 min readLW link

Achiev­ing AI Align­ment through De­liber­ate Uncer­tainty in Mul­ti­a­gent Systems

Florian_Dietz17 Feb 2024 8:45 UTC
3 points
0 comments13 min readLW link

Open-ended ethics of phe­nom­ena (a desider­ata with uni­ver­sal moral­ity)

Ryo 8 Nov 2023 20:10 UTC
1 point
0 comments8 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments26 min readLW link

God vs AI scientifically

Donatas Lučiūnas21 Mar 2023 23:03 UTC
−22 points
40 comments1 min readLW link

Aligned AI as a wrap­per around an LLM

cousin_it25 Mar 2023 15:58 UTC
31 points
19 comments1 min readLW link

Are ex­trap­o­la­tion-based AIs al­ignable?

cousin_it24 Mar 2023 15:55 UTC
22 points
15 comments1 min readLW link

“Sorcerer’s Ap­pren­tice” from Fan­ta­sia as an anal­ogy for alignment

awg29 Mar 2023 18:21 UTC
7 points
4 comments1 min readLW link

Imi­ta­tion Learn­ing from Lan­guage Feedback

30 Mar 2023 14:11 UTC
71 points
3 comments10 min readLW link

[Question] Daisy-chain­ing ep­silon-step verifiers

Decaeneus6 Apr 2023 2:07 UTC
2 points
1 comment1 min readLW link

Use these three heuris­tic im­per­a­tives to solve alignment

G6 Apr 2023 16:20 UTC
−17 points
4 comments1 min readLW link

If Align­ment is Hard, then so is Self-Improvement

PavleMiha7 Apr 2023 0:08 UTC
21 points
20 comments1 min readLW link

Goal al­ign­ment with­out al­ign­ment on episte­mol­ogy, ethics, and sci­ence is futile

Roman Leventov7 Apr 2023 8:22 UTC
20 points
2 comments2 min readLW link

Co­op­er­a­tive Game Theory

Takk7 Jun 2023 17:41 UTC
1 point
0 comments1 min readLW link

For al­ign­ment, we should si­mul­ta­neously use mul­ti­ple the­o­ries of cog­ni­tion and value

Roman Leventov24 Apr 2023 10:37 UTC
22 points
5 comments5 min readLW link

Archety­pal Trans­fer Learn­ing: a Pro­posed Align­ment Solu­tion that solves the In­ner & Outer Align­ment Prob­lem while adding Cor­rigible Traits to GPT-2-medium

MiguelDev26 Apr 2023 1:37 UTC
14 points
5 comments10 min readLW link

Free­dom Is All We Need

Leo Glisic27 Apr 2023 0:09 UTC
−1 points
8 comments10 min readLW link

Com­po­si­tional prefer­ence mod­els for al­ign­ing LMs

Tomek Korbak25 Oct 2023 12:17 UTC
18 points
2 comments5 min readLW link

Wire­head­ing and mis­al­ign­ment by com­po­si­tion on NetHack

pierlucadoro27 Oct 2023 17:43 UTC
34 points
4 comments4 min readLW link

AI Align­ment: A Com­pre­hen­sive Survey

Stephen McAleer1 Nov 2023 17:35 UTC
14 points
0 comments1 min readLW link

Op­tion­al­ity ap­proach to ethics

Ryo 13 Nov 2023 15:23 UTC
7 points
2 comments3 min readLW link

Align­ment is Hard: An Un­com­putable Align­ment Problem

Alexander Bistagne19 Nov 2023 19:38 UTC
−5 points
4 comments1 min readLW link

Re­ac­tion to “Em­pow­er­ment is (al­most) All We Need” : an open-ended alternative

Ryo 25 Nov 2023 15:35 UTC
9 points
3 comments5 min readLW link

Cor­rigi­bil­ity or DWIM is an at­trac­tive pri­mary goal for AGI

Seth Herd25 Nov 2023 19:37 UTC
17 points
4 comments1 min readLW link

An In­creas­ingly Ma­nipu­la­tive Newsfeed

Michaël Trazzi1 Jul 2019 15:26 UTC
62 points
16 comments5 min readLW link

My preferred fram­ings for re­ward mis­speci­fi­ca­tion and goal misgeneralisation

Yi-Yang6 May 2023 4:48 UTC
24 points
1 comment8 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke Hayashi6 May 2023 17:55 UTC
9 points
6 comments2 min readLW link

H-JEPA might be tech­ni­cally al­ignable in a mod­ified form

Roman Leventov8 May 2023 23:04 UTC
12 points
2 comments7 min readLW link

The Goal Mis­gen­er­al­iza­tion Problem

Myspy18 May 2023 23:40 UTC
1 point
0 comments1 min readLW link

Distil­la­tion of Neu­rotech and Align­ment Work­shop Jan­uary 2023

22 May 2023 7:17 UTC
51 points
9 comments14 min readLW link

The Steer­ing Problem

paulfchristiano13 Nov 2018 17:14 UTC
43 points
12 comments7 min readLW link

An LLM-based “ex­em­plary ac­tor”

Roman Leventov29 May 2023 11:12 UTC
16 points
0 comments12 min readLW link

In­fer­ence from a Math­e­mat­i­cal De­scrip­tion of an Ex­ist­ing Align­ment Re­search: a pro­posal for an outer al­ign­ment re­search program

Christopher King2 Jun 2023 21:54 UTC
7 points
4 comments16 min readLW link

Align­ing an H-JEPA agent via train­ing on the out­puts of an LLM-based “ex­em­plary ac­tor”

Roman Leventov29 May 2023 11:08 UTC
12 points
10 comments30 min readLW link

Shut­down-Seek­ing AI

Simon Goldstein31 May 2023 22:19 UTC
48 points
31 comments15 min readLW link

“De­sign­ing agent in­cen­tives to avoid re­ward tam­per­ing”, DeepMind

gwern14 Aug 2019 16:57 UTC
28 points
15 comments1 min readLW link

Higher Di­men­sion Carte­sian Ob­jects and Align­ing ‘Tiling Si­mu­la­tors’

lukemarks11 Jun 2023 0:13 UTC
22 points
0 comments5 min readLW link

Us­ing Con­sen­sus Mechanisms as an ap­proach to Alignment

Prometheus10 Jun 2023 23:38 UTC
9 points
2 comments6 min readLW link

Pro­posal: Tune LLMs to Use Cal­ibrated Language

OneManyNone7 Jun 2023 21:05 UTC
9 points
0 comments5 min readLW link

Ex­am­ples of AI’s be­hav­ing badly

Stuart_Armstrong16 Jul 2015 10:01 UTC
41 points
41 comments1 min readLW link

A Mul­tidis­ci­plinary Ap­proach to Align­ment (MATA) and Archety­pal Trans­fer Learn­ing (ATL)

MiguelDev19 Jun 2023 2:32 UTC
4 points
2 comments7 min readLW link

Par­tial Si­mu­la­tion Ex­trap­o­la­tion: A Pro­posal for Build­ing Safer Simulators

lukemarks17 Jun 2023 13:55 UTC
16 points
0 comments10 min readLW link

Slay­ing the Hy­dra: to­ward a new game board for AI

Prometheus23 Jun 2023 17:04 UTC
0 points
5 comments6 min readLW link

Thoughts on the Fea­si­bil­ity of Pro­saic AGI Align­ment?

iamthouthouarti21 Aug 2020 23:25 UTC
8 points
10 comments1 min readLW link

Align­ment As A Bot­tle­neck To Use­ful­ness Of GPT-3

johnswentworth21 Jul 2020 20:02 UTC
111 points
57 comments3 min readLW link

Sim­ple al­ign­ment plan that maybe works

Iknownothing18 Jul 2023 22:48 UTC
4 points
8 comments1 min readLW link