Outer Alignment

Tag
Last edit: 11 Dec 2022 3:01 UTC by MichelJusten

Outer Alignment, in the context of machine learning, is the property that the specified loss function is aligned with the intended goal of its designers. This is an intuitive rather than a formal notion, in part because human intentions are themselves not well understood. It is what is typically discussed as the ‘value alignment’ problem. It is contrasted with inner alignment, which asks whether an optimizer produced by an outer-aligned system is itself aligned.

See also:

Risks from Learned Optimization: Introduction

31 May 2019 23:44 UTC
170 points
42 comments12 min readLW link3 reviews

Another (outer) alignment failure story

paulfchristiano7 Apr 2021 20:12 UTC
224 points
38 comments12 min readLW link1 review

[Proposal] Method of locating useful subnets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

Debate update: Obfuscated arguments problem

Beth Barnes23 Dec 2020 3:24 UTC
125 points
24 comments16 min readLW link

Book review: “A Thousand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC
111 points
18 comments19 min readLW link

Truthful LMs as a warm-up for aligned AGI

Jacob_Hilton17 Jan 2022 16:49 UTC
65 points
14 comments13 min readLW link

Outer vs inner misalignment: three framings

Richard_Ngo6 Jul 2022 19:46 UTC
46 points
4 comments9 min readLW link

LOVE in a simbox is all you need

jacob_cannell28 Sep 2022 18:25 UTC
66 points
69 comments44 min readLW link

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Palus Astra1 Jul 2020 17:30 UTC
35 points
4 comments67 min readLW link

Outer alignment and imitative amplification

evhub10 Jan 2020 0:26 UTC
24 points
11 comments9 min readLW link

An overview of 11 proposals for building safe advanced AI

evhub29 May 2020 20:38 UTC
202 points
36 comments38 min readLW link2 reviews

Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes10 Jul 2020 16:49 UTC
45 points
7 comments8 min readLW link

List of resolved confusions about IDA

Wei_Dai30 Sep 2019 20:03 UTC
94 points
18 comments3 min readLW link

Is the Star Trek Federation really incapable of building AI?

Kaj_Sotala18 Mar 2018 10:30 UTC
19 points
4 comments2 min readLW link
(kajsotala.fi)

If I were a well-intentioned AI… I: Image classifier

Stuart_Armstrong26 Feb 2020 12:39 UTC
35 points
4 comments5 min readLW link

If I were a well-intentioned AI… II: Acting in a world

Stuart_Armstrong27 Feb 2020 11:58 UTC
20 points
0 comments3 min readLW link

If I were a well-intentioned AI… III: Extremal Goodhart

Stuart_Armstrong28 Feb 2020 11:24 UTC
22 points
0 comments5 min readLW link

AI Alignment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC
125 points
6 comments35 min readLW link

Concept Safety: Producing similar AI-human concept spaces

Kaj_Sotala14 Apr 2015 20:39 UTC
50 points
45 comments8 min readLW link

nostalgebraist: Recursive Goodhart’s Law

Kaj_Sotala26 Aug 2020 11:07 UTC
53 points
27 comments1 min readLW link
(nostalgebraist.tumblr.com)

(Humor) AI Alignment Critical Failure Table

Kaj_Sotala31 Aug 2020 19:51 UTC
24 points
2 comments1 min readLW link
(sl4.org)

“Inner Alignment Failures” Which Are Actually Outer Alignment Failures

johnswentworth31 Oct 2020 20:18 UTC
61 points
38 comments5 min readLW link

Mental subagent implications for AI Safety

moridinamael3 Jan 2021 18:59 UTC
11 points
0 comments3 min readLW link

MIRI comments on Cotra’s “Case for Aligning Narrowly Superhuman Models”

Rob Bensinger5 Mar 2021 23:43 UTC
140 points
13 comments26 min readLW link

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
68 points
40 comments16 min readLW link

25 Min Talk on MetaEthical.AI with Questions from Stuart Armstrong

June Ku29 Apr 2021 15:38 UTC
21 points
7 comments1 min readLW link

Selection Theorems: A Program For Understanding Agents

johnswentworth28 Sep 2021 5:03 UTC
118 points
27 comments6 min readLW link2 reviews

[Question] Collection of arguments to expect (outer and inner) alignment failure?

Sam Clarke28 Sep 2021 16:55 UTC
20 points
10 comments1 min readLW link

AXRP Episode 12 - AI Existential Risk with Paul Christiano

DanielFilan2 Dec 2021 2:20 UTC
36 points
0 comments125 min readLW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
121 points
9 comments15 min readLW link

My Overview of the AI Alignment Landscape: Threat Models

Neel Nanda25 Dec 2021 23:07 UTC
50 points
4 comments28 min readLW link

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron Berg11 Feb 2022 22:23 UTC
5 points
1 comment10 min readLW link

How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA?

Owain_Evans26 Feb 2022 12:46 UTC
42 points
3 comments11 min readLW link

[Intro to brain-like-AGI safety] 10. The alignment problem

Steven Byrnes30 Mar 2022 13:24 UTC
45 points
4 comments19 min readLW link

[ASoT] Some thoughts about imperfect world modeling

leogao7 Apr 2022 15:42 UTC
7 points
0 comments4 min readLW link

Confused why a “capabilities research is good for alignment progress” position isn’t discussed more

Kaj_Sotala2 Jun 2022 21:41 UTC
129 points
27 comments4 min readLW link

Announcing the Alignment of Complex Systems Research Group

4 Jun 2022 4:10 UTC
84 points
18 comments5 min readLW link

Naive Hypotheses on AI Alignment

Shoshannah Tekofsky2 Jul 2022 19:03 UTC
93 points
29 comments5 min readLW link

Alignment as Game Design

Shoshannah Tekofsky16 Jul 2022 22:36 UTC
11 points
7 comments2 min readLW link

Reward is not the optimization target

TurnTrout25 Jul 2022 0:03 UTC
279 points
104 comments10 min readLW link

Shard Theory: An Overview

David Udell11 Aug 2022 5:44 UTC
141 points
34 comments10 min readLW link

Human Mimicry Mainly Works When We’re Already Close

johnswentworth17 Aug 2022 18:41 UTC
70 points
16 comments5 min readLW link

Simulators

janus2 Sep 2022 12:45 UTC
575 points
111 comments41 min readLW link
(generative.ink)

Four usages of “loss” in AI

TurnTrout2 Oct 2022 0:52 UTC
42 points
18 comments4 min readLW link

Learning societal values from law as part of an AGI alignment strategy

John Nay21 Oct 2022 2:03 UTC
3 points
18 comments54 min readLW link

Don’t align agents to evaluations of plans

TurnTrout26 Nov 2022 21:16 UTC
43 points
48 comments18 min readLW link

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTrout29 Nov 2022 6:23 UTC
57 points
41 comments15 min readLW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout2 Dec 2022 2:43 UTC
101 points
18 comments47 min readLW link

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC16 Dec 2022 22:12 UTC
64 points
11 comments1 min readLW link
(www.anthropic.com)

Categorizing failures as “outer” or “inner” misalignment is often confused

Rohin Shah6 Jan 2023 15:48 UTC
82 points
21 comments8 min readLW link

Some of my disagreements with List of Lethalities

TurnTrout24 Jan 2023 0:25 UTC
67 points
7 comments10 min readLW link

The Preference Fulfillment Hypothesis

Kaj_Sotala26 Feb 2023 10:55 UTC
59 points
61 comments11 min readLW link

An Increasingly Manipulative Newsfeed

Michaël Trazzi1 Jul 2019 15:26 UTC
62 points
16 comments5 min readLW link

The Steering Problem

paulfchristiano13 Nov 2018 17:14 UTC
43 points
12 comments7 min readLW link

“Designing agent incentives to avoid reward tampering”, DeepMind

gwern14 Aug 2019 16:57 UTC
28 points
15 comments1 min readLW link
(medium.com)

Examples of AI’s behaving badly

Stuart_Armstrong16 Jul 2015 10:01 UTC
41 points
39 comments1 min readLW link

Thoughts on the Feasibility of Prosaic AGI Alignment?

iamthouthouarti21 Aug 2020 23:25 UTC
8 points
10 comments1 min readLW link

Alignment As A Bottleneck To Usefulness Of GPT-3

johnswentworth21 Jul 2020 20:02 UTC
111 points
57 comments3 min readLW link

[Question] Competence vs Alignment

Ariel Kwiatkowski30 Sep 2020 21:03 UTC
6 points
4 comments1 min readLW link

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth Barnes10 Jan 2021 0:30 UTC
94 points
15 comments12 min readLW link1 review

Prediction can be Outer Aligned at Optimum

Lanrian10 Jan 2021 18:48 UTC
15 points
12 comments11 min readLW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments26 min readLW link

The case for aligning narrowly superhuman models

Ajeya Cotra5 Mar 2021 22:29 UTC
190 points
75 comments38 min readLW link1 review

A simple way to make GPT-3 follow instructions

Quintin Pope8 Mar 2021 2:57 UTC
11 points
5 comments4 min readLW link

RFC: Meta-ethical uncertainty in AGI alignment

Gordon Seidoh Worley8 Jun 2018 20:56 UTC
16 points
6 comments3 min readLW link

Controlling Intelligent Agents The Only Way We Know How: Ideal Bureaucratic Structure (IBS)

Justin Bullock24 May 2021 12:53 UTC
11 points
11 comments6 min readLW link

Thoughts on the Alignment Implications of Scaling Language Models

leogao2 Jun 2021 21:32 UTC
82 points
11 comments17 min readLW link

Insufficient Values

16 Jun 2021 14:33 UTC
29 points
15 comments5 min readLW link

[Question] Thoughts on a “Sequences Inspired” PhD Topic

goose00017 Jun 2021 20:36 UTC
7 points
2 comments2 min readLW link

[Question] Is it worth making a database for moral predictions?

Jonas Hallgren16 Aug 2021 14:51 UTC
1 point
0 comments2 min readLW link

Call for research on evaluating alignment (funding + advice available)

Beth Barnes31 Aug 2021 23:28 UTC
105 points
11 comments5 min readLW link

Distinguishing AI takeover scenarios

8 Sep 2021 16:19 UTC
69 points
11 comments14 min readLW link

Alignment via manually implementing the utility function

Chantiel7 Sep 2021 20:20 UTC
1 point
6 comments2 min readLW link

The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications

Dario Citrini16 Sep 2021 16:13 UTC
6 points
0 comments8 min readLW link

The AGI needs to be honest

rokosbasilisk16 Oct 2021 19:24 UTC
2 points
12 comments2 min readLW link

A positive case for how we might succeed at prosaic AI alignment

evhub16 Nov 2021 1:49 UTC
79 points
47 comments6 min readLW link

Behavior Cloning is Miscalibrated

leogao5 Dec 2021 1:36 UTC
77 points
3 comments3 min readLW link

Information bottleneck for counterfactual corrigibility

tailcalled6 Dec 2021 17:11 UTC
8 points
1 comment7 min readLW link

Exterminating humans might be on the to-do list of a Friendly AI

RomanS7 Dec 2021 14:15 UTC
5 points
8 comments2 min readLW link

Project Intro: Selection Theorems for Modularity

4 Apr 2022 12:59 UTC
69 points
20 comments16 min readLW link

Learning the smooth prior

29 Apr 2022 21:10 UTC
33 points
0 comments12 min readLW link

Updating Utility Functions

9 May 2022 9:44 UTC
36 points
6 comments8 min readLW link

AI Alternative Futures: Scenario Mapping Artificial Intelligence Risk—Request for Participation (*Closed*)

Kakili27 Apr 2022 22:07 UTC
10 points
2 comments8 min readLW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
51 points
0 comments59 min readLW link

RL with KL penalties is better seen as Bayesian inference

25 May 2022 9:23 UTC
94 points
15 comments12 min readLW link

On inner and outer alignment, and their confusion

NinaR26 May 2022 21:56 UTC
6 points
7 comments4 min readLW link

Investigating causal understanding in LLMs

14 Jun 2022 13:57 UTC
28 points
6 comments13 min readLW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad21 Jun 2022 12:36 UTC
11 points
7 comments9 min readLW link

Announcing the Inverse Scaling Prize ($250k Prize Pool)

27 Jun 2022 15:58 UTC
168 points
14 comments7 min readLW link

Research Notes: What are we aligning for?

Shoshannah Tekofsky8 Jul 2022 22:13 UTC
19 points
8 comments2 min readLW link

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad9 Jul 2022 14:42 UTC
14 points
5 comments22 min readLW link

Three Minimum Pivotal Acts Possible by Narrow AI

Michael Soareverix12 Jul 2022 9:51 UTC
0 points
4 comments2 min readLW link

Conditioning Generative Models for Alignment

Jozdien18 Jul 2022 7:11 UTC
52 points
8 comments20 min readLW link

Our Existing Solutions to AGI Alignment (semi-safe)

Michael Soareverix21 Jul 2022 19:00 UTC
12 points
1 comment3 min readLW link

Conditioning Generative Models with Restrictions

Adam Jermyn21 Jul 2022 20:33 UTC
17 points
4 comments8 min readLW link

Externalized reasoning oversight: a research direction for language model alignment

tamera3 Aug 2022 12:03 UTC
106 points
22 comments6 min readLW link

Conditioning, Prompts, and Fine-Tuning

Adam Jermyn17 Aug 2022 20:52 UTC
37 points
9 comments4 min readLW link

Thoughts about OOD alignment

Dmitry Savishchev24 Aug 2022 15:31 UTC
11 points
10 comments2 min readLW link

Framing AI Childhoods

David Udell6 Sep 2022 23:40 UTC
37 points
8 comments4 min readLW link

What Should AI Owe To Us? Accountable and Aligned AI Systems via Contractualist AI Alignment

xuan8 Sep 2022 15:04 UTC
31 points
15 comments25 min readLW link

Why deceptive alignment matters for AGI safety

Marius Hobbhahn15 Sep 2022 13:38 UTC
48 points
13 comments13 min readLW link

Levels of goals and alignment

zeshen16 Sep 2022 16:44 UTC
27 points
4 comments6 min readLW link

Inner alignment: what are we pointing at?

lukehmiles18 Sep 2022 11:09 UTC
7 points
2 comments1 min readLW link

Leveraging Legal Informatics to Align AI

John Nay18 Sep 2022 20:39 UTC
11 points
0 comments3 min readLW link
(forum.effectivealtruism.org)

Planning capacity and daemons

lukehmiles26 Sep 2022 0:15 UTC
2 points
0 comments5 min readLW link

Science of Deep Learning—a technical agenda

Marius Hobbhahn18 Oct 2022 14:54 UTC
36 points
7 comments4 min readLW link

Clarifying AI X-risk

1 Nov 2022 11:03 UTC
107 points
23 comments4 min readLW link

Threat Model Literature Review

1 Nov 2022 11:03 UTC
67 points
4 comments25 min readLW link

Questions about Value Lock-in, Paternalism, and Empowerment

Sam16 Nov 2022 15:33 UTC
12 points
2 comments12 min readLW link
(sambrown.eu)

If you’re very optimistic about ELK then you should be optimistic about outer alignment

Sam Marks27 Apr 2022 19:30 UTC
17 points
8 comments3 min readLW link

[Question] Don’t you think RLHF solves outer alignment?

Raphaël S4 Nov 2022 0:36 UTC
4 points
23 comments1 min readLW link

A first success story for Outer Alignment: InstructGPT

Noosphere898 Nov 2022 22:52 UTC
6 points
1 comment1 min readLW link
(openai.com)

The Disastrously Confident And Inaccurate AI

Sharat Jacob Jacob18 Nov 2022 19:06 UTC
13 points
0 comments13 min readLW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad13 Dec 2022 2:17 UTC
7 points
5 comments45 min readLW link

Disentangling Shard Theory into Atomic Claims

Leon Lang13 Jan 2023 4:23 UTC
79 points
6 comments18 min readLW link

[Question] Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann Dubois19 Dec 2022 22:42 UTC
5 points
6 comments1 min readLW link

On the Importance of Open Sourcing Reward Models

elandgre2 Jan 2023 19:01 UTC
17 points
5 comments6 min readLW link

Causal representation learning as a technique to prevent goal misgeneralization

PabloAMC4 Jan 2023 0:07 UTC
18 points
0 comments8 min readLW link

The Alignment Problems

Martín Soto12 Jan 2023 22:29 UTC
19 points
0 comments4 min readLW link

Empathy as a natural consequence of learnt reward models

beren4 Feb 2023 15:35 UTC
37 points
26 comments13 min readLW link

Early situational awareness and its implications, a story

Jacob Pfau6 Feb 2023 20:45 UTC
20 points
6 comments3 min readLW link

The Linguistic Blind Spot of Value-Aligned Agency, Natural and Artificial

Roman Leventov14 Feb 2023 6:57 UTC
6 points
0 comments2 min readLW link
(arxiv.org)

Pretraining Language Models with Human Preferences

21 Feb 2023 17:57 UTC
129 points
18 comments11 min readLW link

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley21 Feb 2023 9:05 UTC
4 points
0 comments23 min readLW link

Just How Hard a Problem is Alignment?

Roger Dearnaley25 Feb 2023 9:00 UTC
−1 points
1 comment21 min readLW link

Alignment works both ways

Karl von Wendt7 Mar 2023 10:41 UTC
21 points
21 comments2 min readLW link

AGI is uncontrollable, alignment is impossible

Donatas Lučiūnas19 Mar 2023 17:49 UTC
−12 points
21 comments1 min readLW link

God vs AI scientifically

Donatas Lučiūnas21 Mar 2023 23:03 UTC
−22 points
40 comments1 min readLW link

Aligned AI as a wrapper around an LLM

cousin_it25 Mar 2023 15:58 UTC
27 points
19 comments1 min readLW link

Are extrapolation-based AIs alignable?

cousin_it24 Mar 2023 15:55 UTC
22 points
15 comments1 min readLW link

“Sorcerer’s Apprentice” from Fantasia as an analogy for alignment

awg29 Mar 2023 18:21 UTC
4 points
3 comments1 min readLW link
(video.disney.com)