Inner Alignment

Last edit: 16 Feb 2021 20:08 UTC by Yoav Ravid

Inner Alignment is the problem of ensuring that mesa-optimizers (i.e. cases where a trained ML system is itself an optimizer) are aligned with the objective function of the training process. As an example, evolution is an optimization force that itself ‘designed’ optimizers (humans) to achieve its goals. However, humans do not primarily maximise reproductive success; they instead use birth control and pursue the things they actually find rewarding. This is a failure of inner alignment.

The term was first given a definition in the Hubinger et al. paper Risks from Learned Optimization:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.
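The distinction can be made concrete with a toy sketch. The following Python example is hypothetical and invented for illustration (in the spirit of the gridworld-style environments discussed in posts tagged below, such as Matthew Barnett’s “A simple environment for showing mesa misalignment”): a learner trained in corridors where the coin always sits at the right end may internalize the proxy goal “go right” rather than the base objective “reach the coin”. The proxy scores perfectly on the training distribution and fails as soon as the coin moves.

```python
def base_objective(agent_pos, coin_pos):
    """The base objective: did the agent end up at the coin?"""
    return agent_pos == coin_pos

def learned_policy(start, corridor_length):
    """A proxy the learner might internalize: always walk to the right
    end of the corridor. This coincides with the base objective only
    while the coin happens to be at the right end."""
    return corridor_length - 1

# Training distribution: (start, coin) pairs in a length-10 corridor,
# with the coin always at the rightmost cell.
train_envs = [(0, 9), (3, 9), (5, 9)]
train_score = sum(base_objective(learned_policy(s, 10), c) for s, c in train_envs)

# Test distribution: the coin moves to the left end.
test_envs = [(5, 0), (9, 0)]
test_score = sum(base_objective(learned_policy(s, 10), c) for s, c in test_envs)

print(train_score, "/", len(train_envs))  # perfect on the training distribution
print(test_score, "/", len(test_envs))    # proxy fails under distributional shift
```

Note that the proxy is not a bug in the base objective itself (that would be an outer alignment failure); the base objective is fine, but the policy the training process actually produced pursues something else.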

Related Pages: Mesa-Optimization

External Links:

Video by Robert Miles

Inner Alignment: Explain like I’m 12 Edition
Rafael Harth · 1 Aug 2020 15:24 UTC · 122 points · 13 comments · 12 min read · LW link

The Inner Alignment Problem
4 Jun 2019 1:20 UTC · 76 points · 17 comments · 13 min read · LW link

Risks from Learned Optimization: Introduction
31 May 2019 23:44 UTC · 140 points · 40 comments · 12 min read · LW link · 3 nominations · 3 reviews

Mesa-Search vs Mesa-Control
abramdemski · 18 Aug 2020 18:51 UTC · 53 points · 45 comments · 7 min read · LW link

Matt Botvinick on the spontaneous emergence of learning algorithms
Adam Scholl · 12 Aug 2020 7:47 UTC · 138 points · 90 comments · 5 min read · LW link

The Solomonoff Prior is Malign
Mark Xu · 14 Oct 2020 1:33 UTC · 129 points · 34 comments · 16 min read · LW link

Book review: “A Thousand Brains” by Jeff Hawkins
Steven Byrnes · 4 Mar 2021 5:10 UTC · 97 points · 14 comments · 19 min read · LW link

Inner alignment in the brain
Steven Byrnes · 22 Apr 2020 13:14 UTC · 73 points · 16 comments · 15 min read · LW link

Concrete experiments in inner alignment
evhub · 6 Sep 2019 22:16 UTC · 60 points · 12 comments · 6 min read · LW link

Relaxed adversarial training for inner alignment
evhub · 10 Sep 2019 23:03 UTC · 54 points · 10 comments · 27 min read · LW link

Towards an empirical investigation of inner alignment
evhub · 23 Sep 2019 20:43 UTC · 43 points · 9 comments · 6 min read · LW link

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI
Palus Astra · 1 Jul 2020 17:30 UTC · 34 points · 4 comments · 67 min read · LW link

Inner alignment requires making assumptions about human values
Matthew Barnett · 20 Jan 2020 18:38 UTC · 26 points · 9 comments · 4 min read · LW link

An overview of 11 proposals for building safe advanced AI
evhub · 29 May 2020 20:38 UTC · 147 points · 30 comments · 38 min read · LW link

[Question] Does iterated amplification tackle the inner alignment problem?
JanBrauner · 15 Feb 2020 12:58 UTC · 7 points · 4 comments · 1 min read · LW link

Open question: are minimal circuits daemon-free?
paulfchristiano · 5 May 2018 22:40 UTC · 79 points · 69 comments · 2 min read · LW link

Mesa-Optimizers vs “Steered Optimizers”
Steven Byrnes · 10 Jul 2020 16:49 UTC · 40 points · 5 comments · 8 min read · LW link

Demons in Imperfect Search
johnswentworth · 11 Feb 2020 20:25 UTC · 81 points · 18 comments · 3 min read · LW link

If I were a well-intentioned AI… IV: Mesa-optimising
Stuart_Armstrong · 2 Mar 2020 12:16 UTC · 26 points · 2 comments · 6 min read · LW link

AI Alignment 2018-19 Review
rohinmshah · 28 Jan 2020 2:19 UTC · 115 points · 6 comments · 35 min read · LW link

Defining capability and alignment in gradient descent
Edouard Harris · 5 Nov 2020 14:36 UTC · 21 points · 6 comments · 10 min read · LW link

Does SGD Produce Deceptive Alignment?
Mark Xu · 6 Nov 2020 23:48 UTC · 54 points · 2 comments · 16 min read · LW link

Inner Alignment in Salt-Starved Rats
Steven Byrnes · 19 Nov 2020 2:40 UTC · 110 points · 31 comments · 11 min read · LW link

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger
DanielFilan · 18 Feb 2021 0:03 UTC · 41 points · 10 comments · 86 min read · LW link

Against evolution as an analogy for how humans will create AGI
Steven Byrnes · 23 Mar 2021 12:29 UTC · 38 points · 25 comments · 25 min read · LW link

My AGI Threat Model: Misaligned Model-Based RL Agent
Steven Byrnes · 25 Mar 2021 13:45 UTC · 61 points · 27 comments · 16 min read · LW link

A simple environment for showing mesa misalignment
Matthew Barnett · 26 Sep 2019 4:44 UTC · 63 points · 9 comments · 2 min read · LW link

Babies and Bunnies: A Caution About Evo-Psych
Alicorn · 22 Feb 2010 1:53 UTC · 80 points · 844 comments · 2 min read · LW link

Tessellating Hills: a toy model for demons in imperfect search
DaemonicSigil · 20 Feb 2020 0:12 UTC · 71 points · 15 comments · 2 min read · LW link

2-D Robustness
vlad_m · 30 Aug 2019 20:27 UTC · 67 points · 1 comment · 2 min read · LW link

Gradient hacking
evhub · 16 Oct 2019 0:53 UTC · 74 points · 34 comments · 3 min read · LW link · 2 nominations · 2 reviews

Are minimal circuits deceptive?
evhub · 7 Sep 2019 18:11 UTC · 51 points · 8 comments · 8 min read · LW link

Malign generalization without internal search
Matthew Barnett · 12 Jan 2020 18:03 UTC · 43 points · 12 comments · 4 min read · LW link

[AN #67]: Creating environments in which to study inner alignment failures
rohinmshah · 7 Oct 2019 17:10 UTC · 17 points · 0 comments · 8 min read · LW link

Examples of AI’s behaving badly
Stuart_Armstrong · 16 Jul 2015 10:01 UTC · 41 points · 37 comments · 1 min read · LW link

Safely and usefully spectating on AIs optimizing over toy worlds
AlexMennen · 31 Jul 2018 18:30 UTC · 24 points · 16 comments · 2 min read · LW link

“Inner Alignment Failures” Which Are Actually Outer Alignment Failures
johnswentworth · 31 Oct 2020 20:18 UTC · 51 points · 38 comments · 5 min read · LW link

AI Alignment Using Reverse Simulation
Sven Nilsen · 12 Jan 2021 20:48 UTC · 1 point · 0 comments · 1 min read · LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment
jbkjr · 12 Feb 2021 7:55 UTC · 15 points · 0 comments · 26 min read · LW link

Formal Solution to the Inner Alignment Problem
michaelcohen · 18 Feb 2021 14:51 UTC · 46 points · 122 comments · 2 min read · LW link