
Outer Alignment

Last edit: 10 Aug 2020 16:35 UTC by brook

Outer Alignment, in the context of machine learning, is the property that the specified loss function is aligned with the intended goal of its designers. The notion is intuitive rather than formal, in part because human intentions are themselves not well understood; it is what is typically discussed as the ‘value alignment’ problem. It is contrasted with inner alignment, which asks whether an optimizer produced by an outer-aligned training process is itself aligned with the objective that process specified.

See also: Inner Alignment.
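To make the loss-function framing concrete, here is a minimal toy sketch (a hypothetical illustration, not taken from any of the posts listed below; the quality/clickbait variables and the specified_loss and intended_goal functions are invented for the example). The specified loss rewards a proxy, predicted clicks, while the designers' intended goal is reader satisfaction, so improving the loss can make the intended goal worse:

```python
# Hypothetical toy example: a specified loss that is NOT outer-aligned
# with the designers' intended goal. All quantities here are invented
# for illustration.
import numpy as np

rng = np.random.default_rng(0)


def intended_goal(quality: np.ndarray, clickbait: np.ndarray) -> float:
    """What the designers actually want (stand-in): reader satisfaction."""
    return float(np.mean(quality - 0.5 * clickbait))


def specified_loss(quality: np.ndarray, clickbait: np.ndarray) -> float:
    """What training actually optimizes: negative predicted clicks (a proxy).
    Clickbait raises clicks even though it lowers satisfaction."""
    clicks = quality + 2.0 * clickbait
    return float(-np.mean(clicks))  # lower loss = more clicks


# A "policy" here is just a choice of how much clickbait to add to articles
# of randomly varying quality.
quality = rng.uniform(0.0, 1.0, size=1_000)

for clickbait_level in (0.0, 0.5, 1.0):
    clickbait = np.full_like(quality, clickbait_level)
    print(
        f"clickbait={clickbait_level:.1f}  "
        f"specified_loss={specified_loss(quality, clickbait):+.3f}  "
        f"intended_goal={intended_goal(quality, clickbait):+.3f}"
    )
```

In the printout, the specified loss keeps improving as the clickbait level rises while the intended goal degrades; an outer-aligned loss function would be one whose optimum actually coincides with what the designers intend.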

Risks from Learned Optimization: Introduction

31 May 2019 23:44 UTC
156 points
42 comments, 12 min read, LW link, 3 reviews

Another (outer) alignment failure story

paulfchristiano, 7 Apr 2021 20:12 UTC
201 points
37 comments, 12 min read, LW link

[Proposal] Method of locating useful subnets in large models

Quintin Pope, 13 Oct 2021 20:52 UTC
9 points
0 comments, 2 min read, LW link

Debate update: Obfuscated arguments problem

Beth Barnes, 23 Dec 2020 3:24 UTC
118 points
20 comments, 16 min read, LW link

Book review: “A Thousand Brains” by Jeff Hawkins

Steven Byrnes, 4 Mar 2021 5:10 UTC
107 points
18 comments, 19 min read, LW link

Truthful LMs as a warm-up for aligned AGI

Jacob_Hilton, 17 Jan 2022 16:49 UTC
64 points
14 comments, 13 min read, LW link

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Palus Astra, 1 Jul 2020 17:30 UTC
35 points
4 comments, 67 min read, LW link

Outer alignment and imitative amplification

evhub, 10 Jan 2020 0:26 UTC
31 points
11 comments, 9 min read, LW link

An overview of 11 proposals for building safe advanced AI

evhub, 29 May 2020 20:38 UTC
181 points
34 comments, 38 min read, LW link, 2 reviews

Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes, 10 Jul 2020 16:49 UTC
43 points
5 comments, 8 min read, LW link

List of resolved confusions about IDA

Wei_Dai, 30 Sep 2019 20:03 UTC
94 points
18 comments, 3 min read, LW link

Is the Star Trek Federation really incapable of building AI?

Kaj_Sotala, 18 Mar 2018 10:30 UTC
19 points
4 comments, 2 min read, LW link
(kajsotala.fi)

If I were a well-intentioned AI… I: Image classifier

Stuart_Armstrong, 26 Feb 2020 12:39 UTC
35 points
4 comments, 5 min read, LW link

If I were a well-intentioned AI… II: Acting in a world

Stuart_Armstrong, 27 Feb 2020 11:58 UTC
20 points
0 comments, 3 min read, LW link

If I were a well-intentioned AI… III: Extremal Goodhart

Stuart_Armstrong, 28 Feb 2020 11:24 UTC
21 points
0 comments, 5 min read, LW link

AI Alignment 2018-19 Review

Rohin Shah, 28 Jan 2020 2:19 UTC
125 points
6 comments, 35 min read, LW link

Concept Safety: Producing similar AI-human concept spaces

Kaj_Sotala, 14 Apr 2015 20:39 UTC
50 points
45 comments, 8 min read, LW link

nostalgebraist: Recursive Goodhart’s Law

Kaj_Sotala, 26 Aug 2020 11:07 UTC
53 points
27 comments, 1 min read, LW link
(nostalgebraist.tumblr.com)

(Humor) AI Alignment Critical Failure Table

Kaj_Sotala, 31 Aug 2020 19:51 UTC
24 points
2 comments, 1 min read, LW link
(sl4.org)

“Inner Alignment Failures” Which Are Actually Outer Alignment Failures

johnswentworth, 31 Oct 2020 20:18 UTC
64 points
38 comments, 5 min read, LW link

Mental subagent implications for AI Safety

moridinamael, 3 Jan 2021 18:59 UTC
11 points
0 comments, 3 min read, LW link

MIRI comments on Cotra’s “Case for Aligning Narrowly Superhuman Models”

Rob Bensinger, 5 Mar 2021 23:43 UTC
134 points
13 comments, 26 min read, LW link

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes, 25 Mar 2021 13:45 UTC
64 points
40 comments, 16 min read, LW link

25 Min Talk on MetaEthical.AI with Questions from Stuart Armstrong

June Ku, 29 Apr 2021 15:38 UTC
21 points
7 comments, 1 min read, LW link

Selection Theorems: A Program For Understanding Agents

johnswentworth, 28 Sep 2021 5:03 UTC
86 points
22 comments, 6 min read, LW link

[Question] Collection of arguments to expect (outer and inner) alignment failure?

Sam Clarke, 28 Sep 2021 16:55 UTC
20 points
10 comments, 1 min read, LW link

AXRP Episode 12 - AI Existential Risk with Paul Christiano

DanielFilan, 2 Dec 2021 2:20 UTC
36 points
0 comments, 125 min read, LW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda, 15 Dec 2021 23:44 UTC
101 points
9 comments, 16 min read, LW link

My Overview of the AI Alignment Landscape: Threat Models

Neel Nanda, 25 Dec 2021 23:07 UTC
38 points
4 comments, 28 min read, LW link

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron Berg, 11 Feb 2022 22:23 UTC
5 points
1 comment, 10 min read, LW link

How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA?

Owain_Evans, 26 Feb 2022 12:46 UTC
40 points
3 comments, 11 min read, LW link

[Intro to brain-like-AGI safety] 10. The alignment problem

Steven Byrnes, 30 Mar 2022 13:24 UTC
32 points
2 comments, 21 min read, LW link

[ASoT] Some thoughts about imperfect world modeling

leogao, 7 Apr 2022 15:42 UTC
7 points
0 comments, 4 min read, LW link

An Increasingly Manipulative Newsfeed

Michaël Trazzi, 1 Jul 2019 15:26 UTC
61 points
16 comments, 5 min read, LW link

The Steering Problem

paulfchristiano, 13 Nov 2018 17:14 UTC
41 points
12 comments, 7 min read, LW link

“Designing agent incentives to avoid reward tampering”, DeepMind

gwern, 14 Aug 2019 16:57 UTC
28 points
15 comments, 1 min read, LW link
(medium.com)

Examples of AI’s behaving badly

Stuart_Armstrong, 16 Jul 2015 10:01 UTC
41 points
37 comments, 1 min read, LW link

Thoughts on the Feasibility of Prosaic AGI Alignment?

iamthouthouarti, 21 Aug 2020 23:25 UTC
8 points
10 comments, 1 min read, LW link

Alignment As A Bottleneck To Usefulness Of GPT-3

johnswentworth, 21 Jul 2020 20:02 UTC
108 points
57 comments, 3 min read, LW link

[Question] Competence vs Alignment

Ariel Kwiatkowski, 30 Sep 2020 21:03 UTC
6 points
4 comments, 1 min read, LW link

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth Barnes, 10 Jan 2021 0:30 UTC
86 points
14 comments, 12 min read, LW link

Prediction can be Outer Aligned at Optimum

Lanrian, 10 Jan 2021 18:48 UTC
15 points
12 comments, 11 min read, LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr, 12 Feb 2021 7:55 UTC
15 points
0 comments, 26 min read, LW link

The case for aligning narrowly superhuman models

Ajeya Cotra, 5 Mar 2021 22:29 UTC
180 points
74 comments, 38 min read, LW link

A simple way to make GPT-3 follow instructions

Quintin Pope, 8 Mar 2021 2:57 UTC
10 points
5 comments, 4 min read, LW link

RFC: Meta-ethical uncertainty in AGI alignment

G Gordon Worley III, 8 Jun 2018 20:56 UTC
16 points
6 comments, 3 min read, LW link

Controlling Intelligent Agents The Only Way We Know How: Ideal Bureaucratic Structure (IBS)

Justin Bullock, 24 May 2021 12:53 UTC
10 points
11 comments, 6 min read, LW link

Thoughts on the Alignment Implications of Scaling Language Models

leogao, 2 Jun 2021 21:32 UTC
79 points
11 comments, 17 min read, LW link

MIRIx Part I: Insufficient Values

16 Jun 2021 14:33 UTC
29 points
15 comments, 6 min read, LW link

[Question] Thoughts on a “Sequences Inspired” PhD Topic

goose000, 17 Jun 2021 20:36 UTC
7 points
2 comments, 2 min read, LW link

[Question] Is it worth making a database for moral predictions?

Jonas Hallgren, 16 Aug 2021 14:51 UTC
1 point
0 comments, 2 min read, LW link

Call for research on evaluating alignment (funding + advice available)

Beth Barnes, 31 Aug 2021 23:28 UTC
105 points
11 comments, 5 min read, LW link

Distinguishing AI takeover scenarios

8 Sep 2021 16:19 UTC
62 points
11 comments, 14 min read, LW link

Alignment via manually implementing the utility function

Chantiel, 7 Sep 2021 20:20 UTC
1 point
6 comments, 2 min read, LW link

The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications

Dario Citrini, 16 Sep 2021 16:13 UTC
6 points
0 comments, 8 min read, LW link

The AGI needs to be honest

rokosbasilisk, 16 Oct 2021 19:24 UTC
2 points
12 comments, 2 min read, LW link

A positive case for how we might succeed at prosaic AI alignment

evhub, 16 Nov 2021 1:49 UTC
84 points
45 comments, 6 min read, LW link

Behavior Cloning is Miscalibrated

leogao, 5 Dec 2021 1:36 UTC
52 points
3 comments, 3 min read, LW link

Information bottleneck for counterfactual corrigibility

tailcalled, 6 Dec 2021 17:11 UTC
8 points
1 comment, 7 min read, LW link

Exterminating humans might be on the to-do list of a Friendly AI

RomanS, 7 Dec 2021 14:15 UTC
5 points
8 comments, 2 min read, LW link

Project Intro: Selection Theorems for Modularity

4 Apr 2022 12:59 UTC
65 points
19 comments, 16 min read, LW link

Learning the smooth prior

29 Apr 2022 21:10 UTC
27 points
0 comments, 12 min read, LW link

Updating Utility Functions

9 May 2022 9:44 UTC
33 points
7 comments, 8 min read, LW link

AI Alternative Futures: Scenario Mapping Artificial Intelligence Risk—Request for Participation (*Edit*)

Kakili, 27 Apr 2022 22:07 UTC
10 points
2 comments, 9 min read, LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy, 12 May 2022 20:01 UTC
37 points
0 comments, 59 min read, LW link

RL with KL penalties is better seen as Bayesian inference

25 May 2022 9:23 UTC
39 points
3 comments, 12 min read, LW link

On inner and outer alignment, and their confusion

NinaR, 26 May 2022 21:56 UTC
5 points
4 comments, 4 min read, LW link