Last edit: 16 Feb 2021 20:06 UTC by Yoav Ravid

Mesa-optimization occurs when a learned model (such as a neural network) is itself an optimizer: a base optimizer optimizes and thereby creates a mesa-optimizer. Work on this concept previously went under the names Inner Optimizer and Optimization Daemons.


Natural selection is an optimization process (optimizing for reproductive fitness) that produced humans, who are capable of pursuing goals that no longer correlate reliably with reproductive fitness. In this sense, humans are optimization daemons of natural selection. In the context of AI alignment, the concern is that an artificial general intelligence exerting optimization pressure may produce mesa-optimizers that break alignment.[1]
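The base-optimizer/mesa-optimizer relationship can be made concrete with a toy sketch (purely illustrative, not from the sources above; all names are hypothetical). A "base optimizer" selects among candidate policies by scoring them on a base objective; the policy it ends up selecting is itself a small optimizer that searches over outputs using its own proxy objective, which need not match the base objective everywhere:

```python
# Toy sketch of mesa-optimization (illustrative; names are invented).
# The base optimizer selects whichever policy scores best on the base
# objective; the winning policy is itself an optimizer with its own
# "mesa-objective" (a proxy that agrees with the base objective near the
# training distribution but could diverge elsewhere).

def base_objective(x):
    # What the base optimizer selects for: maximize -(x - 3)^2.
    return -(x - 3) ** 2

def mesa_objective(x):
    # The proxy the learned policy internally optimizes.
    return -abs(x - 3)

def mesa_optimizer(start):
    # Inner optimization: greedy hill-climbing on the mesa-objective.
    x = start
    for _ in range(100):
        x = max([x - 0.1, x, x + 0.1], key=mesa_objective)
    return x

def fixed_policy(start):
    # A non-optimizing baseline policy: always outputs 0.
    return 0.0

# Base optimization step: score each candidate policy on the base
# objective across a few training inputs, keep the best.
policies = {"mesa": mesa_optimizer, "fixed": fixed_policy}
scores = {name: sum(base_objective(policy(s)) for s in [0.0, 1.0, 2.0])
          for name, policy in policies.items()}
best = max(scores, key=scores.get)
print(best)  # the inner-search policy wins: it scores best on the base objective
```

The point of the sketch is that the base optimizer never inspects the mesa-objective; it only sees base-objective scores, so a policy that internally optimizes a proxy can be selected as long as the proxy tracks the base objective on the training inputs.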



Wei Dai brings up a similar idea in an SL4 thread.[2]

The "Optimization daemons" article on Arbital was probably published in 2016.[3]

Jessica Taylor wrote two posts about daemons while at MIRI.



  1. “Optimization daemons”. Arbital.

  2. Wei Dai. “‘friendly’ humans?” SL4 mailing list, December 31, 2003.

External links

Video by Robert Miles

Some posts that reference optimization daemons:

Related ideas

- Matt Botvinick on the spontaneous emergence of learning algorithms (Adam Scholl, 12 Aug 2020; 142 points, 90 comments, 5 min read)

- Risks from Learned Optimization: Introduction (31 May 2019; 144 points, 40 comments, 12 min read)

- Mesa-Search vs Mesa-Control (abramdemski, 18 Aug 2020; 54 points, 45 comments, 7 min read)

- Embedded Agency (full-text version) (15 Nov 2018; 124 points, 11 comments, 54 min read)

- Conditions for Mesa-Optimization (1 Jun 2019; 63 points, 47 comments, 12 min read)

- Subsystem Alignment (6 Nov 2018; 99 points, 12 comments, 1 min read)

- Risks from Learned Optimization: Conclusion and Related Work (7 Jun 2019; 70 points, 4 comments, 6 min read)

- Deceptive Alignment (5 Jun 2019; 71 points, 11 comments, 17 min read)

- The Inner Alignment Problem (4 Jun 2019; 77 points, 17 comments, 13 min read)

- Mesa-Optimizers vs “Steered Optimizers” (Steven Byrnes, 10 Jul 2020; 40 points, 5 comments, 8 min read)

- Open question: are minimal circuits daemon-free? (paulfchristiano, 5 May 2018; 79 points, 70 comments, 2 min read)

- [Question] What specific dangers arise when asking GPT-N to write an Alignment Forum post? (Matthew Barnett, 28 Jul 2020; 43 points, 14 comments, 1 min read)

- If I were a well-intentioned AI… IV: Mesa-optimising (Stuart_Armstrong, 2 Mar 2020; 26 points, 2 comments, 6 min read)

- Why GPT wants to mesa-optimize & how we might change this (John_Maxwell, 19 Sep 2020; 53 points, 32 comments, 9 min read)

- Prize for probable problems (paulfchristiano, 8 Mar 2018; 60 points, 63 comments, 4 min read)

- Defining capability and alignment in gradient descent (Edouard Harris, 5 Nov 2020; 21 points, 6 comments, 10 min read)

- AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger (DanielFilan, 18 Feb 2021; 41 points, 10 comments, 86 min read)

- Formal Solution to the Inner Alignment Problem (michaelcohen, 18 Feb 2021; 46 points, 123 comments, 2 min read)

- Does SGD Produce Deceptive Alignment? (Mark Xu, 6 Nov 2020; 66 points, 2 comments, 16 min read)

- Thoughts on safety in predictive learning (Steven Byrnes, 30 Jun 2021; 18 points, 17 comments, 19 min read)

- Utility ≠ Reward (vlad_m, 5 Sep 2019; 97 points, 24 comments, 12 min read)

- Garrabrant and Shah on human modeling in AGI (Rob Bensinger, 4 Aug 2021; 56 points, 10 comments, 47 min read)

- Approaches to gradient hacking (adamShimi, 14 Aug 2021; 10 points, 7 comments, 8 min read)

- Thoughts on gradient hacking (Richard_Ngo, 3 Sep 2021; 31 points, 8 comments, 4 min read)

- Meta learning to gradient hack (Quintin Pope, 1 Oct 2021; 36 points, 10 comments, 3 min read)

- Modeling Risks From Learned Optimization (Ben Cottier, 12 Oct 2021; 44 points, 0 comments, 12 min read)

- 2-D Robustness (vlad_m, 30 Aug 2019; 69 points, 1 comment, 2 min read)

- Gradient hacking (evhub, 16 Oct 2019; 82 points, 34 comments, 3 min read)

- [AN #58] Mesa optimization: what it is, and why we should care (rohinmshah, 24 Jun 2019; 52 points, 9 comments, 8 min read)

- Weak arguments against the universal prior being malign (X4vier, 14 Jun 2018; 49 points, 23 comments, 3 min read)

- [Question] Do mesa-optimizer risk arguments rely on the train-test paradigm? (Ben Cottier, 10 Sep 2020; 12 points, 7 comments, 1 min read)

- Evolutions Building Evolutions: Layers of Generate and Test (plex, 5 Feb 2021; 11 points, 1 comment, 6 min read)

- Mapping the Conceptual Territory in AI Existential Safety and Alignment (jbkjr, 12 Feb 2021; 15 points, 0 comments, 26 min read)

- Gradations of Inner Alignment Obstacles (abramdemski, 20 Apr 2021; 79 points, 22 comments, 9 min read)

- Obstacles to gradient hacking (leogao, 5 Sep 2021; 20 points, 11 comments, 4 min read)

- Towards Deconfusing Gradient Hacking (leogao, 24 Oct 2021; 20 points, 1 comment, 12 min read)

- [Proposal] Method of locating useful subnets in large models (Quintin Pope, 13 Oct 2021; 7 points, 0 comments, 2 min read)