In­ner Alignment

TagLast edit: 22 Oct 2021 20:18 UTC by plex

Inner Alignment is the problem of ensuring mesa-optimizers (i.e. when a trained ML system is itself an optimizer) is aligned with the objective function of the training process. As an example, evolution is an optimization force that itself ‘designed’ optimizers (humans) to achieve its goals. However, humans do not primarily maximise reproductive success, they instead use birth control and then go out and have fun. This is a failure of inner alignment.

The term was first given a definition in the Hubinger et al paper Risk from Learned Optimization:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

Related Pages: Mesa-Optimization

External Links:

Video by Robert Miles

In­ner Align­ment: Ex­plain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC
134 points
15 comments12 min readLW link

The In­ner Align­ment Problem

4 Jun 2019 1:20 UTC
77 points
17 comments13 min readLW link

De­mons in Im­perfect Search

johnswentworth11 Feb 2020 20:25 UTC
84 points
21 comments3 min readLW link

Risks from Learned Op­ti­miza­tion: Introduction

31 May 2019 23:44 UTC
144 points
40 comments12 min readLW link3 nominations3 reviews

Mesa-Search vs Mesa-Control

abramdemski18 Aug 2020 18:51 UTC
54 points
45 comments7 min readLW link

Con­crete ex­per­i­ments in in­ner alignment

evhub6 Sep 2019 22:16 UTC
61 points
12 comments6 min readLW link

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhub10 Sep 2019 23:03 UTC
54 points
13 comments27 min readLW link

Open ques­tion: are min­i­mal cir­cuits dae­mon-free?

paulfchristiano5 May 2018 22:40 UTC
79 points
70 comments2 min readLW link

Matt Botv­inick on the spon­ta­neous emer­gence of learn­ing algorithms

Adam Scholl12 Aug 2020 7:47 UTC
144 points
90 comments5 min readLW link

The Solomonoff Prior is Malign

Mark Xu14 Oct 2020 1:33 UTC
136 points
34 comments16 min readLW link

Tes­sel­lat­ing Hills: a toy model for demons in im­perfect search

DaemonicSigil20 Feb 2020 0:12 UTC
74 points
15 comments2 min readLW link

Gra­di­ent hacking

evhub16 Oct 2019 0:53 UTC
83 points
34 comments3 min readLW link2 nominations2 reviews

Are min­i­mal cir­cuits de­cep­tive?

evhub7 Sep 2019 18:11 UTC
56 points
8 comments8 min readLW link

Mal­ign gen­er­al­iza­tion with­out in­ter­nal search

Matthew Barnett12 Jan 2020 18:03 UTC
43 points
12 comments4 min readLW link

Book re­view: “A Thou­sand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC
105 points
18 comments19 min readLW link

Em­piri­cal Ob­ser­va­tions of Ob­jec­tive Ro­bust­ness Failures

23 Jun 2021 23:23 UTC
59 points
5 comments9 min readLW link

Dis­cus­sion: Ob­jec­tive Ro­bust­ness and In­ner Align­ment Terminology

23 Jun 2021 23:25 UTC
56 points
5 comments9 min readLW link

In­ner al­ign­ment in the brain

Steven Byrnes22 Apr 2020 13:14 UTC
75 points
16 comments16 min readLW link

Towards an em­piri­cal in­ves­ti­ga­tion of in­ner alignment

evhub23 Sep 2019 20:43 UTC
43 points
9 comments6 min readLW link

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus Astra1 Jul 2020 17:30 UTC
34 points
4 comments67 min readLW link

In­ner al­ign­ment re­quires mak­ing as­sump­tions about hu­man values

Matthew Barnett20 Jan 2020 18:38 UTC
26 points
9 comments4 min readLW link

An overview of 11 pro­pos­als for build­ing safe ad­vanced AI

evhub29 May 2020 20:38 UTC
165 points
30 comments38 min readLW link

[Question] Does iter­ated am­plifi­ca­tion tackle the in­ner al­ign­ment prob­lem?

JanBrauner15 Feb 2020 12:58 UTC
7 points
4 comments1 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven Byrnes10 Jul 2020 16:49 UTC
40 points
5 comments8 min readLW link

If I were a well-in­ten­tioned AI… IV: Mesa-optimising

Stuart_Armstrong2 Mar 2020 12:16 UTC
26 points
2 comments6 min readLW link

AI Align­ment 2018-19 Review

rohinmshah28 Jan 2020 2:19 UTC
121 points
6 comments35 min readLW link

Defin­ing ca­pa­bil­ity and al­ign­ment in gra­di­ent descent

Edouard Harris5 Nov 2020 14:36 UTC
21 points
6 comments10 min readLW link

Does SGD Pro­duce De­cep­tive Align­ment?

Mark Xu6 Nov 2020 23:48 UTC
67 points
2 comments16 min readLW link

In­ner Align­ment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
114 points
35 comments11 min readLW link

AXRP Epi­sode 4 - Risks from Learned Op­ti­miza­tion with Evan Hubinger

DanielFilan18 Feb 2021 0:03 UTC
41 points
10 comments86 min readLW link

Against evolu­tion as an anal­ogy for how hu­mans will cre­ate AGI

Steven Byrnes23 Mar 2021 12:29 UTC
32 points
25 comments25 min readLW link

My AGI Threat Model: Misal­igned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
62 points
40 comments16 min readLW link

Gra­da­tions of In­ner Align­ment Obstacles

abramdemski20 Apr 2021 22:18 UTC
79 points
22 comments9 min readLW link

Pre-Train­ing + Fine-Tun­ing Fa­vors Deception

Mark Xu8 May 2021 18:36 UTC
25 points
2 comments3 min readLW link

For­mal In­ner Align­ment, Prospectus

abramdemski12 May 2021 19:57 UTC
90 points
55 comments16 min readLW link

Model-based RL, De­sires, Brains, Wireheading

Steven Byrnes14 Jul 2021 15:11 UTC
17 points
1 comment13 min readLW link

Re-Define In­tent Align­ment?

abramdemski22 Jul 2021 19:00 UTC
25 points
31 comments4 min readLW link

Ap­pli­ca­tions for De­con­fus­ing Goal-Directedness

adamShimi8 Aug 2021 13:05 UTC
34 points
0 comments5 min readLW link

Ap­proaches to gra­di­ent hacking

adamShimi14 Aug 2021 15:16 UTC
10 points
7 comments8 min readLW link

Selec­tion The­o­rems: A Pro­gram For Un­der­stand­ing Agents

johnswentworth28 Sep 2021 5:03 UTC
73 points
20 comments6 min readLW link

[Question] Col­lec­tion of ar­gu­ments to ex­pect (outer and in­ner) al­ign­ment failure?

Sam Clarke28 Sep 2021 16:55 UTC
20 points
10 comments1 min readLW link

A sim­ple en­vi­ron­ment for show­ing mesa misalignment

Matthew Barnett26 Sep 2019 4:44 UTC
69 points
9 comments2 min readLW link

Ba­bies and Bun­nies: A Cau­tion About Evo-Psych

Alicorn22 Feb 2010 1:53 UTC
81 points
844 comments2 min readLW link

2-D Robustness

vlad_m30 Aug 2019 20:27 UTC
70 points
1 comment2 min readLW link

[AN #67]: Creat­ing en­vi­ron­ments in which to study in­ner al­ign­ment failures

rohinmshah7 Oct 2019 17:10 UTC
17 points
0 comments8 min readLW link

Ex­am­ples of AI’s be­hav­ing badly

Stuart_Armstrong16 Jul 2015 10:01 UTC
41 points
37 comments1 min readLW link

Safely and use­fully spec­tat­ing on AIs op­ti­miz­ing over toy worlds

AlexMennen31 Jul 2018 18:30 UTC
24 points
16 comments2 min readLW link

“In­ner Align­ment Failures” Which Are Ac­tu­ally Outer Align­ment Failures

johnswentworth31 Oct 2020 20:18 UTC
60 points
38 comments5 min readLW link

AI Align­ment Us­ing Re­v­erse Simulation

Sven Nilsen12 Jan 2021 20:48 UTC
1 point
0 comments1 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments26 min readLW link

For­mal Solu­tion to the In­ner Align­ment Problem

michaelcohen18 Feb 2021 14:51 UTC
46 points
123 comments2 min readLW link

Re­sponse to “What does the uni­ver­sal prior ac­tu­ally look like?”

michaelcohen20 May 2021 16:12 UTC
35 points
34 comments18 min readLW link

MIRIx Part I: In­suffi­cient Values

16 Jun 2021 14:33 UTC
29 points
15 comments6 min readLW link

Call for re­search on eval­u­at­ing al­ign­ment (fund­ing + ad­vice available)

Beth Barnes31 Aug 2021 23:28 UTC
104 points
11 comments5 min readLW link

Ob­sta­cles to gra­di­ent hacking

leogao5 Sep 2021 22:42 UTC
20 points
11 comments4 min readLW link

Towards De­con­fus­ing Gra­di­ent Hacking

leogao24 Oct 2021 0:43 UTC
23 points
1 comment12 min readLW link

Meta learn­ing to gra­di­ent hack

Quintin Pope1 Oct 2021 19:25 UTC
45 points
10 comments3 min readLW link

The eval­u­a­tion func­tion of an AI is not its aim

Yair Halberstadt10 Oct 2021 14:52 UTC
13 points
5 comments3 min readLW link

[Question] What ex­actly is GPT-3′s base ob­jec­tive?

Daniel Kokotajlo10 Nov 2021 0:57 UTC
49 points
15 comments2 min readLW link
No comments.