TagLast edit: 20 May 2022 17:58 UTC by Rob Bensinger

Mesa-Optimization is the situation that occurs when a learned model (such as a neural network) is itself an optimizer. In this situation, a base optimizer creates a second optimizer, called a mesa-optimizer. The primary reference work for this concept is Hubinger et al.’s “Risks from Learned Optimization in Advanced Machine Learning Systems”.

Example: Natural selection is an optimization process that optimizes for reproductive fitness. Natural selection produced humans, who are themselves optimizers. Humans are therefore mesa-optimizers of natural selection.

In the context of AI alignment, the concern is that a base optimizer (e.g., a gradient descent process) may produce a learned model that is itself an optimizer, and that has unexpected and undesirable properties. Even if the gradient descent process is in some sense “trying” to do exactly what human developers want, the resultant mesa-optimizer will not typically be trying to do the exact same thing.1


Previously work under this concept was called Inner Optimizer or Optimization Daemons.

Wei Dai brings up a similar idea in an SL4 thread.2

The optimization daemons article on Arbital was published probably in 2016.3

Jessica Taylor wrote two posts about daemons while at MIRI:

See also


  1. “Optimization daemons”. Arbital.

  2. Wei Dai. ‘”friendly” humans?’ December 31, 2003.

External links

Video by Robert Miles

Some posts that reference optimization daemons:

Risks from Learned Op­ti­miza­tion: Introduction

31 May 2019 23:44 UTC
156 points
42 comments12 min readLW link3 reviews

Matt Botv­inick on the spon­ta­neous emer­gence of learn­ing algorithms

Adam Scholl12 Aug 2020 7:47 UTC
149 points
90 comments5 min readLW link

Embed­ded Agency (full-text ver­sion)

15 Nov 2018 19:49 UTC
125 points
11 comments54 min readLW link

Mesa-Search vs Mesa-Control

abramdemski18 Aug 2020 18:51 UTC
54 points
45 comments7 min readLW link

Con­di­tions for Mesa-Optimization

1 Jun 2019 20:52 UTC
68 points
47 comments12 min readLW link

Sub­sys­tem Alignment

6 Nov 2018 16:16 UTC
100 points
12 comments1 min readLW link

Risks from Learned Op­ti­miza­tion: Con­clu­sion and Re­lated Work

7 Jun 2019 19:53 UTC
73 points
4 comments6 min readLW link

De­cep­tive Alignment

5 Jun 2019 20:16 UTC
84 points
11 comments17 min readLW link

The In­ner Align­ment Problem

4 Jun 2019 1:20 UTC
90 points
17 comments13 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven Byrnes10 Jul 2020 16:49 UTC
43 points
5 comments8 min readLW link

Open ques­tion: are min­i­mal cir­cuits dae­mon-free?

paulfchristiano5 May 2018 22:40 UTC
80 points
70 comments2 min readLW link1 review

[Question] What spe­cific dan­gers arise when ask­ing GPT-N to write an Align­ment Fo­rum post?

Matthew Barnett28 Jul 2020 2:56 UTC
43 points
14 comments1 min readLW link

If I were a well-in­ten­tioned AI… IV: Mesa-optimising

Stuart_Armstrong2 Mar 2020 12:16 UTC
26 points
2 comments6 min readLW link

Why GPT wants to mesa-op­ti­mize & how we might change this

John_Maxwell19 Sep 2020 13:48 UTC
54 points
32 comments9 min readLW link

Prize for prob­a­ble problems

paulfchristiano8 Mar 2018 16:58 UTC
60 points
63 comments4 min readLW link

Defin­ing ca­pa­bil­ity and al­ign­ment in gra­di­ent descent

Edouard Harris5 Nov 2020 14:36 UTC
22 points
6 comments10 min readLW link

AXRP Epi­sode 4 - Risks from Learned Op­ti­miza­tion with Evan Hubinger

DanielFilan18 Feb 2021 0:03 UTC
41 points
10 comments86 min readLW link

For­mal Solu­tion to the In­ner Align­ment Problem

michaelcohen18 Feb 2021 14:51 UTC
46 points
123 comments2 min readLW link

Does SGD Pro­duce De­cep­tive Align­ment?

Mark Xu6 Nov 2020 23:48 UTC
76 points
4 comments16 min readLW link

Thoughts on safety in pre­dic­tive learning

Steven Byrnes30 Jun 2021 19:17 UTC
18 points
17 comments19 min readLW link

Utility ≠ Reward

vlad_m5 Sep 2019 17:28 UTC
98 points
24 comments12 min readLW link2 reviews

Garrabrant and Shah on hu­man mod­el­ing in AGI

Rob Bensinger4 Aug 2021 4:35 UTC
57 points
10 comments47 min readLW link

Ap­proaches to gra­di­ent hacking

adamShimi14 Aug 2021 15:16 UTC
15 points
7 comments8 min readLW link

Thoughts on gra­di­ent hacking

Richard_Ngo3 Sep 2021 13:02 UTC
31 points
11 comments4 min readLW link

Meta learn­ing to gra­di­ent hack

Quintin Pope1 Oct 2021 19:25 UTC
45 points
10 comments3 min readLW link

Model­ing Risks From Learned Optimization

Ben Cottier12 Oct 2021 20:54 UTC
44 points
0 comments12 min readLW link

Fea­ture Selection

Zack_M_Davis1 Nov 2021 0:22 UTC
290 points
22 comments16 min readLW link

[ASoT] Some thoughts about de­cep­tive mesaoptimization

leogao28 Mar 2022 21:14 UTC
24 points
5 comments7 min readLW link

[ASoT] Some thoughts about im­perfect world modeling

leogao7 Apr 2022 15:42 UTC
7 points
0 comments4 min readLW link

[Question] Three ques­tions about mesa-optimizers

Eric Neyman12 Apr 2022 2:58 UTC
23 points
5 comments3 min readLW link

[Question] Why is pseudo-al­ign­ment “worse” than other ways ML can fail to gen­er­al­ize?

nostalgebraist18 Jul 2020 22:54 UTC
45 points
10 comments2 min readLW link

The Speed + Sim­plic­ity Prior is prob­a­bly anti-deceptive

Yonadav Shavit27 Apr 2022 19:30 UTC
30 points
29 comments12 min readLW link

Is evolu­tion­ary in­fluence the mesa ob­jec­tive that we’re in­ter­ested in?

David Johnston3 May 2022 1:18 UTC
3 points
2 comments5 min readLW link

2-D Robustness

vlad_m30 Aug 2019 20:27 UTC
73 points
1 comment2 min readLW link

Gra­di­ent hacking

evhub16 Oct 2019 0:53 UTC
90 points
39 comments3 min readLW link2 reviews

[AN #58] Mesa op­ti­miza­tion: what it is, and why we should care

Rohin Shah24 Jun 2019 16:10 UTC
54 points
9 comments8 min readLW link

Weak ar­gu­ments against the uni­ver­sal prior be­ing malign

X4vier14 Jun 2018 17:11 UTC
50 points
23 comments3 min readLW link

[Question] Do mesa-op­ti­mizer risk ar­gu­ments rely on the train-test paradigm?

Ben Cottier10 Sep 2020 15:36 UTC
12 points
7 comments1 min readLW link

Evolu­tions Build­ing Evolu­tions: Lay­ers of Gen­er­ate and Test

plex5 Feb 2021 18:21 UTC
11 points
1 comment6 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments26 min readLW link

Gra­da­tions of In­ner Align­ment Obstacles

abramdemski20 Apr 2021 22:18 UTC
79 points
22 comments9 min readLW link

Ob­sta­cles to gra­di­ent hacking

leogao5 Sep 2021 22:42 UTC
21 points
11 comments4 min readLW link

Towards De­con­fus­ing Gra­di­ent Hacking

leogao24 Oct 2021 0:43 UTC
25 points
1 comment12 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

Some real ex­am­ples of gra­di­ent hacking

Oliver Sourbut22 Nov 2021 0:11 UTC
7 points
4 comments2 min readLW link

Un­der­stand­ing Gra­di­ent Hacking

peterbarnett10 Dec 2021 15:58 UTC
27 points
5 comments30 min readLW link

Mo­ti­va­tions, Nat­u­ral Selec­tion, and Cur­ricu­lum Engineering

Oliver Sourbut16 Dec 2021 1:07 UTC
14 points
0 comments42 min readLW link

In­ner Align­ment: Ex­plain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC
170 points
39 comments13 min readLW link2 reviews

Align­ment Prob­lems All the Way Down

peterbarnett22 Jan 2022 0:19 UTC
25 points
7 comments10 min readLW link

[Question] Do mesa-op­ti­miza­tion prob­lems cor­re­late with low-slack?

sudo -i4 Feb 2022 21:11 UTC
1 point
0 comments1 min readLW link

Thoughts on Danger­ous Learned Optimization

peterbarnett19 Feb 2022 10:46 UTC
4 points
2 comments4 min readLW link

Why No *In­ter­est­ing* Unal­igned Sin­gu­lar­ity?

David Udell20 Apr 2022 0:34 UTC
11 points
13 comments1 min readLW link

Mesa-util­ity func­tions might not be purely proxy goals

Thomas Kwa22 Apr 2022 22:16 UTC
12 points
17 comments1 min readLW link

[ASoT] Con­se­quen­tial­ist mod­els as a su­per­set of mesaoptimizers

leogao23 Apr 2022 17:57 UTC
35 points
2 comments4 min readLW link

Agency As a Nat­u­ral Abstraction

Thane Ruthenis13 May 2022 18:02 UTC
38 points
8 comments13 min readLW link

A Story of AI Risk: In­struc­tGPT-N

peterbarnett26 May 2022 23:22 UTC
17 points
0 comments8 min readLW link
No comments.