Outer Alignment

TagLast edit: 10 Aug 2020 16:35 UTC by brook

Outer Alignment in the context of machine learning is the property where the specified loss function is aligned with the intended goal of its designers. This is an intuitive notion, in part because human intentions are themselves not well-understood. This is what is typically discussed as the ‘value alignment’ problem. It is contrasted with inner alignment, which discusses if an optimizer is the production of an outer aligned system, then whether that optimizer is itself aligned.See also:

Risks from Learned Op­ti­miza­tion: Introduction

31 May 2019 23:44 UTC
140 points
40 comments12 min readLW link3 nominations3 reviews

De­bate up­date: Obfus­cated ar­gu­ments problem

Beth Barnes23 Dec 2020 3:24 UTC
105 points
20 comments16 min readLW link

Book re­view: “A Thou­sand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC
97 points
14 comments19 min readLW link

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus Astra1 Jul 2020 17:30 UTC
34 points
4 comments67 min readLW link

Outer al­ign­ment and imi­ta­tive amplification

evhub10 Jan 2020 0:26 UTC
29 points
11 comments9 min readLW link

An overview of 11 pro­pos­als for build­ing safe ad­vanced AI

evhub29 May 2020 20:38 UTC
147 points
30 comments38 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven Byrnes10 Jul 2020 16:49 UTC
40 points
5 comments8 min readLW link

List of re­solved con­fu­sions about IDA

Wei_Dai30 Sep 2019 20:03 UTC
94 points
18 comments3 min readLW link

Is the Star Trek Fed­er­a­tion re­ally in­ca­pable of build­ing AI?

Kaj_Sotala18 Mar 2018 10:30 UTC
10 points
4 comments2 min readLW link

If I were a well-in­ten­tioned AI… I: Image classifier

Stuart_Armstrong26 Feb 2020 12:39 UTC
35 points
4 comments5 min readLW link

If I were a well-in­ten­tioned AI… II: Act­ing in a world

Stuart_Armstrong27 Feb 2020 11:58 UTC
20 points
0 comments3 min readLW link

If I were a well-in­ten­tioned AI… III: Ex­tremal Goodhart

Stuart_Armstrong28 Feb 2020 11:24 UTC
21 points
0 comments5 min readLW link

AI Align­ment 2018-19 Review

rohinmshah28 Jan 2020 2:19 UTC
115 points
6 comments35 min readLW link

Con­cept Safety: Pro­duc­ing similar AI-hu­man con­cept spaces

Kaj_Sotala14 Apr 2015 20:39 UTC
49 points
45 comments8 min readLW link

nos­talge­braist: Re­cur­sive Good­hart’s Law

Kaj_Sotala26 Aug 2020 11:07 UTC
52 points
27 comments1 min readLW link

(Hu­mor) AI Align­ment Crit­i­cal Failure Table

Kaj_Sotala31 Aug 2020 19:51 UTC
24 points
2 comments1 min readLW link

“In­ner Align­ment Failures” Which Are Ac­tu­ally Outer Align­ment Failures

johnswentworth31 Oct 2020 20:18 UTC
51 points
38 comments5 min readLW link

Men­tal sub­agent im­pli­ca­tions for AI Safety

moridinamael3 Jan 2021 18:59 UTC
11 points
0 comments3 min readLW link

MIRI com­ments on Co­tra’s “Case for Align­ing Nar­rowly Su­per­hu­man Models”

Rob Bensinger5 Mar 2021 23:43 UTC
124 points
13 comments26 min readLW link

My AGI Threat Model: Misal­igned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
61 points
27 comments16 min readLW link

An In­creas­ingly Ma­nipu­la­tive Newsfeed

Michaël Trazzi1 Jul 2019 15:26 UTC
57 points
14 comments5 min readLW link

The Steer­ing Problem

paulfchristiano13 Nov 2018 17:14 UTC
37 points
11 comments7 min readLW link

“De­sign­ing agent in­cen­tives to avoid re­ward tam­per­ing”, DeepMind

gwern14 Aug 2019 16:57 UTC
28 points
15 comments1 min readLW link

Ex­am­ples of AI’s be­hav­ing badly

Stuart_Armstrong16 Jul 2015 10:01 UTC
41 points
37 comments1 min readLW link

Thoughts on the Fea­si­bil­ity of Pro­saic AGI Align­ment?

iamthouthouarti21 Aug 2020 23:25 UTC
8 points
10 comments1 min readLW link

Align­ment As A Bot­tle­neck To Use­ful­ness Of GPT-3

johnswentworth21 Jul 2020 20:02 UTC
97 points
57 comments3 min readLW link

[Question] Com­pe­tence vs Alignment

Ariel Kwiatkowski30 Sep 2020 21:03 UTC
6 points
4 comments1 min readLW link

Imi­ta­tive Gen­er­al­i­sa­tion (AKA ‘Learn­ing the Prior’)

Beth Barnes10 Jan 2021 0:30 UTC
74 points
12 comments12 min readLW link

Pre­dic­tion can be Outer Aligned at Optimum

Lanrian10 Jan 2021 18:48 UTC
13 points
11 comments11 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments26 min readLW link

The case for al­ign­ing nar­rowly su­per­hu­man models

Ajeya Cotra5 Mar 2021 22:29 UTC
167 points
72 comments38 min readLW link

A sim­ple way to make GPT-3 fol­low instructions

Quintin Pope8 Mar 2021 2:57 UTC
6 points
5 comments4 min readLW link

Another (outer) al­ign­ment failure story

paulfchristiano7 Apr 2021 20:12 UTC
120 points
19 comments12 min readLW link
No comments.