TagLast edit: 20 Sep 2022 17:38 UTC by plex

Mesa-Optimization is the situation that occurs when a learned model (such as a neural network) is itself an optimizer. In this situation, a base optimizer creates a second optimizer, called a mesa-optimizer. The primary reference work for this concept is Hubinger et al.’s “Risks from Learned Optimization in Advanced Machine Learning Systems”.

Example: Natural selection is an optimization process that optimizes for reproductive fitness. Natural selection produced humans, who are themselves optimizers. Humans are therefore mesa-optimizers of natural selection.

In the context of AI alignment, the concern is that a base optimizer (e.g., a gradient descent process) may produce a learned model that is itself an optimizer, and that has unexpected and undesirable properties. Even if the gradient descent process is in some sense “trying” to do exactly what human developers want, the resultant mesa-optimizer will not typically be trying to do the exact same thing.[1]


Previously work under this concept was called Inner Optimizer or Optimization Daemons.

Wei Dai brings up a similar idea in an SL4 thread.[2]

The optimization daemons article on Arbital was published probably in 2016.[1]

Jessica Taylor wrote two posts about daemons while at MIRI:

See also

External links

Video by Robert Miles

Some posts that reference optimization daemons:

  1. ^
  2. ^

    Wei Dai. ‘”friendly” humans?’ December 31, 2003.

Risks from Learned Op­ti­miza­tion: Introduction

31 May 2019 23:44 UTC
165 points
42 comments12 min readLW link3 reviews

Matt Botv­inick on the spon­ta­neous emer­gence of learn­ing algorithms

Adam Scholl12 Aug 2020 7:47 UTC
147 points
87 comments5 min readLW link

Embed­ded Agency (full-text ver­sion)

15 Nov 2018 19:49 UTC
143 points
15 comments54 min readLW link

Mesa-Search vs Mesa-Control

abramdemski18 Aug 2020 18:51 UTC
54 points
45 comments7 min readLW link

Con­di­tions for Mesa-Optimization

1 Jun 2019 20:52 UTC
75 points
48 comments12 min readLW link

Try­ing to Make a Treach­er­ous Mesa-Optimizer

MadHatter9 Nov 2022 18:07 UTC
86 points
13 comments4 min readLW link

Search­ing for Search

28 Nov 2022 15:31 UTC
61 points
6 comments14 min readLW link

Sub­sys­tem Alignment

6 Nov 2018 16:16 UTC
100 points
12 comments1 min readLW link

Risks from Learned Op­ti­miza­tion: Con­clu­sion and Re­lated Work

7 Jun 2019 19:53 UTC
78 points
4 comments6 min readLW link

De­cep­tive Alignment

5 Jun 2019 20:16 UTC
97 points
11 comments17 min readLW link

The In­ner Align­ment Problem

4 Jun 2019 1:20 UTC
98 points
17 comments13 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven Byrnes10 Jul 2020 16:49 UTC
45 points
7 comments8 min readLW link

Open ques­tion: are min­i­mal cir­cuits dae­mon-free?

paulfchristiano5 May 2018 22:40 UTC
81 points
70 comments2 min readLW link1 review

[Question] What spe­cific dan­gers arise when ask­ing GPT-N to write an Align­ment Fo­rum post?

Matthew Barnett28 Jul 2020 2:56 UTC
44 points
14 comments1 min readLW link

If I were a well-in­ten­tioned AI… IV: Mesa-optimising

Stuart_Armstrong2 Mar 2020 12:16 UTC
26 points
2 comments6 min readLW link

Why GPT wants to mesa-op­ti­mize & how we might change this

John_Maxwell19 Sep 2020 13:48 UTC
55 points
32 comments9 min readLW link

Prize for prob­a­ble problems

paulfchristiano8 Mar 2018 16:58 UTC
60 points
63 comments4 min readLW link

Defin­ing ca­pa­bil­ity and al­ign­ment in gra­di­ent descent

Edouard Harris5 Nov 2020 14:36 UTC
22 points
6 comments10 min readLW link

AXRP Epi­sode 4 - Risks from Learned Op­ti­miza­tion with Evan Hubinger

DanielFilan18 Feb 2021 0:03 UTC
41 points
10 comments86 min readLW link

For­mal Solu­tion to the In­ner Align­ment Problem

michaelcohen18 Feb 2021 14:51 UTC
46 points
123 comments2 min readLW link

Does SGD Pro­duce De­cep­tive Align­ment?

Mark Xu6 Nov 2020 23:48 UTC
85 points
6 comments16 min readLW link

Thoughts on safety in pre­dic­tive learning

Steven Byrnes30 Jun 2021 19:17 UTC
18 points
17 comments19 min readLW link

Utility ≠ Reward

vlad_m5 Sep 2019 17:28 UTC
101 points
25 comments12 min readLW link2 reviews

Garrabrant and Shah on hu­man mod­el­ing in AGI

Rob Bensinger4 Aug 2021 4:35 UTC
57 points
10 comments47 min readLW link

Ap­proaches to gra­di­ent hacking

adamShimi14 Aug 2021 15:16 UTC
16 points
8 comments8 min readLW link

Thoughts on gra­di­ent hacking

Richard_Ngo3 Sep 2021 13:02 UTC
33 points
12 comments4 min readLW link

Meta learn­ing to gra­di­ent hack

Quintin Pope1 Oct 2021 19:25 UTC
54 points
11 comments3 min readLW link

Model­ing Risks From Learned Optimization

Ben Cottier12 Oct 2021 20:54 UTC
44 points
0 comments12 min readLW link

Fea­ture Selection

Zack_M_Davis1 Nov 2021 0:22 UTC
301 points
23 comments16 min readLW link

In­ner Align­ment: Ex­plain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC
175 points
46 comments13 min readLW link2 reviews

[ASoT] Some thoughts about de­cep­tive mesaoptimization

leogao28 Mar 2022 21:14 UTC
24 points
5 comments7 min readLW link

[ASoT] Some thoughts about im­perfect world modeling

leogao7 Apr 2022 15:42 UTC
7 points
0 comments4 min readLW link

[Question] Three ques­tions about mesa-optimizers

Eric Neyman12 Apr 2022 2:58 UTC
23 points
5 comments3 min readLW link

[Question] Why is pseudo-al­ign­ment “worse” than other ways ML can fail to gen­er­al­ize?

nostalgebraist18 Jul 2020 22:54 UTC
45 points
10 comments2 min readLW link

The Speed + Sim­plic­ity Prior is prob­a­bly anti-deceptive

Yonadav Shavit27 Apr 2022 19:30 UTC
30 points
29 comments12 min readLW link

Is evolu­tion­ary in­fluence the mesa ob­jec­tive that we’re in­ter­ested in?

David Johnston3 May 2022 1:18 UTC
3 points
2 comments5 min readLW link

How much should we worry about mesa-op­ti­miza­tion challenges?

sudo -i25 Jul 2022 3:56 UTC
4 points
13 comments2 min readLW link

Mesa-op­ti­miza­tion for goals defined only within a train­ing en­vi­ron­ment is dangerous

Rubi J. Hudson17 Aug 2022 3:56 UTC
6 points
2 comments4 min readLW link

Mesa-Op­ti­miz­ers via Grokking

orthonormal6 Dec 2022 20:05 UTC
33 points
4 comments6 min readLW link

2-D Robustness

vlad_m30 Aug 2019 20:27 UTC
77 points
8 comments2 min readLW link

Gra­di­ent hacking

evhub16 Oct 2019 0:53 UTC
99 points
39 comments3 min readLW link2 reviews

[AN #58] Mesa op­ti­miza­tion: what it is, and why we should care

Rohin Shah24 Jun 2019 16:10 UTC
54 points
9 comments8 min readLW link

Weak ar­gu­ments against the uni­ver­sal prior be­ing malign

X4vier14 Jun 2018 17:11 UTC
50 points
23 comments3 min readLW link

[Question] Do mesa-op­ti­mizer risk ar­gu­ments rely on the train-test paradigm?

Ben Cottier10 Sep 2020 15:36 UTC
12 points
7 comments1 min readLW link

Evolu­tions Build­ing Evolu­tions: Lay­ers of Gen­er­ate and Test

plex5 Feb 2021 18:21 UTC
11 points
1 comment6 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments26 min readLW link

Gra­da­tions of In­ner Align­ment Obstacles

abramdemski20 Apr 2021 22:18 UTC
80 points
22 comments9 min readLW link

Ob­sta­cles to gra­di­ent hacking

leogao5 Sep 2021 22:42 UTC
21 points
11 comments4 min readLW link

Towards De­con­fus­ing Gra­di­ent Hacking

leogao24 Oct 2021 0:43 UTC
25 points
1 comment12 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

Some real ex­am­ples of gra­di­ent hacking

Oliver Sourbut22 Nov 2021 0:11 UTC
15 points
8 comments2 min readLW link

Un­der­stand­ing Gra­di­ent Hacking

peterbarnett10 Dec 2021 15:58 UTC
30 points
5 comments30 min readLW link

Mo­ti­va­tions, Nat­u­ral Selec­tion, and Cur­ricu­lum Engineering

Oliver Sourbut16 Dec 2021 1:07 UTC
16 points
0 comments42 min readLW link

Align­ment Prob­lems All the Way Down

peterbarnett22 Jan 2022 0:19 UTC
26 points
7 comments10 min readLW link

[Question] Do mesa-op­ti­miza­tion prob­lems cor­re­late with low-slack?

sudo -i4 Feb 2022 21:11 UTC
1 point
1 comment1 min readLW link

Thoughts on Danger­ous Learned Optimization

peterbarnett19 Feb 2022 10:46 UTC
4 points
2 comments4 min readLW link

Why No *In­ter­est­ing* Unal­igned Sin­gu­lar­ity?

David Udell20 Apr 2022 0:34 UTC
12 points
14 comments1 min readLW link

Mesa-util­ity func­tions might not be purely proxy goals

Thomas Kwa22 Apr 2022 22:16 UTC
12 points
17 comments1 min readLW link

[ASoT] Con­se­quen­tial­ist mod­els as a su­per­set of mesaoptimizers

leogao23 Apr 2022 17:57 UTC
36 points
2 comments4 min readLW link

Agency As a Nat­u­ral Abstraction

Thane Ruthenis13 May 2022 18:02 UTC
55 points
9 comments13 min readLW link

A Story of AI Risk: In­struc­tGPT-N

peterbarnett26 May 2022 23:22 UTC
24 points
0 comments8 min readLW link

Towards Gears-Level Un­der­stand­ing of Agency

Thane Ruthenis16 Jun 2022 22:00 UTC
24 points
4 comments18 min readLW link

Goal Align­ment Is Ro­bust To the Sharp Left Turn

Thane Ruthenis13 Jul 2022 20:23 UTC
45 points
15 comments4 min readLW link

Con­ver­gence Towards World-Models: A Gears-Level Model

Thane Ruthenis4 Aug 2022 23:31 UTC
37 points
1 comment13 min readLW link

Gra­di­ent de­scent doesn’t se­lect for in­ner search

Ivan Vendrov13 Aug 2022 4:15 UTC
35 points
23 comments4 min readLW link

De­cep­tion as the op­ti­mal: mesa-op­ti­miz­ers and in­ner al­ign­ment

Eleni Angelou16 Aug 2022 4:49 UTC
10 points
0 comments5 min readLW link

In­ter­pretabil­ity Tools Are an At­tack Channel

Thane Ruthenis17 Aug 2022 18:47 UTC
42 points
22 comments1 min readLW link

Broad Pic­ture of Hu­man Values

Thane Ruthenis20 Aug 2022 19:42 UTC
36 points
5 comments10 min readLW link

Are Gen­er­a­tive World Models a Mesa-Op­ti­miza­tion Risk?

Thane Ruthenis29 Aug 2022 18:37 UTC
12 points
2 comments3 min readLW link

In­ner al­ign­ment: what are we point­ing at?

lcmgcd18 Sep 2022 11:09 UTC
7 points
2 comments1 min readLW link

Towards de­con­fus­ing wire­head­ing and re­ward maximization

leogao21 Sep 2022 0:36 UTC
69 points
7 comments4 min readLW link

Plan­ning ca­pac­ity and daemons

lcmgcd26 Sep 2022 0:15 UTC
2 points
0 comments5 min readLW link

Greed Is the Root of This Evil

Thane Ruthenis13 Oct 2022 20:40 UTC
21 points
4 comments8 min readLW link

My (naive) take on Risks from Learned Optimization

artkpv31 Oct 2022 10:59 UTC
7 points
0 comments5 min readLW link

Value For­ma­tion: An Over­ar­ch­ing Model

Thane Ruthenis15 Nov 2022 17:16 UTC
20 points
0 comments34 min readLW link

Cau­tion when in­ter­pret­ing Deep­mind’s In-con­text RL paper

Sam Marks1 Nov 2022 2:42 UTC
103 points
6 comments4 min readLW link

The Disas­trously Con­fi­dent And Inac­cu­rate AI

Sharat Jacob Jacob18 Nov 2022 19:06 UTC
10 points
0 comments13 min readLW link