Mesa-Optimization

TagLast edit: Mar 19, 2023, 8:15 PM by Diabloto96

Mesa-Optimization is the situation that occurs when a learned model (such as a neural network) is itself an optimizer. In this situation, a base optimizer creates a second optimizer, called a mesa-optimizer. The primary reference work for this concept is Hubinger et al.’s “Risks from Learned Optimization in Advanced Machine Learning Systems”.

Example: Natural selection is an optimization process that optimizes for reproductive fitness. Natural selection produced humans, who are themselves optimizers. Humans are therefore mesa-optimizers of natural selection.

In the context of AI alignment, the concern is that a base optimizer (e.g., a gradient descent process) may produce a learned model that is itself an optimizer, and that has unexpected and undesirable properties. Even if the gradient descent process is in some sense “trying” to do exactly what human developers want, the resultant mesa-optimizer will not typically be trying to do the exact same thing.^[1]

History

Previously work under this concept was called Inner Optimizer or Optimization Daemons.

Wei Dai brings up a similar idea in an SL4 thread.^[2]

The optimization daemons article on Arbital was published probably in 2016.^[1]

Jessica Taylor wrote two posts about daemons while at MIRI:

“Are daemons a problem for ideal agents?” (2017-02-11)
“Maximally efficient agents will probably have an anti-daemon immune system” (2017-02-23)

External links

Video by Robert Miles

Some posts that reference optimization daemons:

“Cause prioritization for downside-focused value systems”: “Alternatively, perhaps goal preservation becomes more difficult the more capable AI systems become, in which case the future might be controlled by unstable goal functions taking turns over the steering wheel”
“Techniques for optimizing worst-case performance”: “The difficulty of optimizing worst-case performance is one of the most likely reasons that I think prosaic AI alignment might turn out to be impossible (if combined with an unlucky empirical situation).” (the phrase “unlucky empirical situation” links to the optimization daemons page on Arbital)

^
“Optimization daemons”. Arbital.
^
Wei Dai. ‘”friendly” humans?’ December 31, 2003.

Risks from Learned Optimization: Introduction

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

May 31, 2019, 11:44 PM

187 points

42 comments12 min readLW link 3 reviews

Matt Botvinick on the spontaneous emergence of learning algorithms

Adam SchollAug 12, 2020, 7:47 AM

154 points

87 comments5 min readLW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Adrià Garriga-alonso, taufeeque, AdamGleave and ChengCheng

Jul 25, 2024, 10:00 PM

59 points

8 comments2 min readLW link

(arxiv.org)

Mesa-Search vs Mesa-Control

abramdemskiAug 18, 2020, 6:51 PM

55 points

45 comments7 min readLW link

Embedded Agency (full-text version)

Scott Garrabrant and abramdemski

Nov 15, 2018, 7:49 PM

209 points

17 comments54 min readLW link

Conditions for Mesa-Optimization

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 1, 2019, 8:52 PM

84 points

48 comments12 min readLW link

Trying to Make a Treacherous Mesa-Optimizer

MadHatterNov 9, 2022, 6:07 PM

95 points

14 comments4 min readLW link

(attentionspan.blog)

Why almost every RL agent does learned optimization

Lee SharkeyFeb 12, 2023, 4:58 AM

32 points

3 comments5 min readLW link

Searching for Search

NicholasKees and janus

Nov 28, 2022, 3:31 PM

97 points

9 comments14 min readLW link 1 review

Subsystem Alignment

abramdemski and Scott Garrabrant

Nov 6, 2018, 4:16 PM

102 points

12 comments1 min readLW link

Does SGD Produce Deceptive Alignment?

Mark XuNov 6, 2020, 11:48 PM

96 points

9 comments16 min readLW link

Deceptive Alignment

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 5, 2019, 8:16 PM

118 points

20 comments17 min readLW link

The Speed + Simplicity Prior is probably anti-deceptive

Yonadav ShavitApr 27, 2022, 7:30 PM

30 points

28 comments12 min readLW link

Anomalous tokens reveal the original identities of Instruct models

janus and jdp

Feb 9, 2023, 1:30 AM

140 points

16 comments9 min readLW link

(generative.ink)

If I were a well-intentioned AI… IV: Mesa-optimising

Stuart_ArmstrongMar 2, 2020, 12:16 PM

26 points

2 comments6 min readLW link

Utility ≠ Reward

Vlad MikulikSep 5, 2019, 5:28 PM

131 points

24 comments1 min readLW link 2 reviews

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger

DanielFilanFeb 18, 2021, 12:03 AM

43 points

10 comments87 min readLW link

[Question] What specific dangers arise when asking GPT-N to write an Alignment Forum post?

Matthew BarnettJul 28, 2020, 2:56 AM

46 points

14 comments1 min readLW link

Is evolutionary influence the mesa objective that we’re interested in?

David JohnstonMay 3, 2022, 1:18 AM

3 points

2 comments5 min readLW link

Mesa-Optimizers via Grokking

orthonormalDec 6, 2022, 8:05 PM

36 points

4 comments6 min readLW link

Meta learning to gradient hack

Quintin PopeOct 1, 2021, 7:25 PM

55 points

11 comments3 min readLW link

Clarifying mesa-optimization

Marius Hobbhahn and Pierre Peigné

Mar 21, 2023, 3:53 PM

38 points

6 comments10 min readLW link

Thoughts on gradient hacking

Richard_NgoSep 3, 2021, 1:02 PM

33 points

11 comments4 min readLW link

Turning up the Heat on Deceptively-Misaligned AI

J BostockJan 7, 2025, 12:13 AM

19 points

16 comments4 min readLW link

Principled Satisficing To Avoid Goodhart

JenniferRMAug 16, 2024, 7:05 PM

45 points

2 comments8 min readLW link

Prize for probable problems

paulfchristianoMar 8, 2018, 4:58 PM

60 points

63 comments4 min readLW link

Inner Alignment: Explain like I’m 12 Edition

Rafael HarthAug 1, 2020, 3:24 PM

184 points

47 comments13 min readLW link 2 reviews

Feature Selection

Zack_M_DavisNov 1, 2021, 12:22 AM

322 points

24 comments16 min readLW link 1 review

[ASoT] Some thoughts about deceptive mesaoptimization

leogaoMar 28, 2022, 9:14 PM

24 points

5 comments7 min readLW link

Open question: are minimal circuits daemon-free?

paulfchristianoMay 5, 2018, 10:40 PM

83 points

70 comments2 min readLW link 1 review

Approaches to gradient hacking

adamShimiAug 14, 2021, 3:16 PM

16 points

8 comments8 min readLW link

[ASoT] Some thoughts about imperfect world modeling

leogaoApr 7, 2022, 3:42 PM

7 points

0 comments4 min readLW link

Defining capability and alignment in gradient descent

Edouard HarrisNov 5, 2020, 2:36 PM

22 points

6 comments10 min readLW link

Thoughts on safety in predictive learning

Steven ByrnesJun 30, 2021, 7:17 PM

20 points

17 comments19 min readLW link

Formal Solution to the Inner Alignment Problem

michaelcohenFeb 18, 2021, 2:51 PM

49 points

123 comments2 min readLW link

Risks from Learned Optimization: Conclusion and Related Work

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 7, 2019, 7:53 PM

82 points

5 comments6 min readLW link

Mesa-optimization for goals defined only within a training environment is dangerous

Rubi J. HudsonAug 17, 2022, 3:56 AM

6 points

2 comments4 min readLW link

Garrabrant and Shah on human modeling in AGI

Rob BensingerAug 4, 2021, 4:35 AM

60 points

10 comments47 min readLW link

The Inner Alignment Problem

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

Jun 4, 2019, 1:20 AM

105 points

17 comments13 min readLW link

Why GPT wants to mesa-optimize & how we might change this

John_MaxwellSep 19, 2020, 1:48 PM

55 points

33 comments9 min readLW link

AXRP Episode 38.3 - Erik Jenner on Learned Look-Ahead

DanielFilanDec 12, 2024, 5:40 AM

20 points

0 comments16 min readLW link

[Question] Why is pseudo-alignment “worse” than other ways ML can fail to generalize?

nostalgebraistJul 18, 2020, 10:54 PM

45 points

9 comments2 min readLW link

Satisficers want to become maximisers

Stuart_ArmstrongOct 21, 2011, 4:27 PM

38 points

70 comments1 min readLW link

Consequentialism is in the Stars not Ourselves

DragonGodApr 24, 2023, 12:02 AM

7 points

19 comments5 min readLW link

Mesa-Optimizers vs “Steered Optimizers”

Steven ByrnesJul 10, 2020, 4:49 PM

45 points

7 comments8 min readLW link

[Question] Three questions about mesa-optimizers

Eric NeymanApr 12, 2022, 2:58 AM

26 points

5 comments3 min readLW link

How much should we worry about mesa-optimization challenges?

sudoJul 25, 2022, 3:56 AM

4 points

13 comments2 min readLW link

Modeling Risks From Learned Optimization

Ben CottierOct 12, 2021, 8:54 PM

45 points

0 comments12 min readLW link

Counting arguments provide no evidence for AI doom

Nora Belrose and Quintin Pope

Feb 27, 2024, 11:03 PM

101 points

188 comments14 min readLW link

Mlyyrczo

lsusrDec 26, 2022, 7:58 AM

41 points

14 comments3 min readLW link

No convincing evidence for gradient descent in activation space

BlaineApr 12, 2023, 4:48 AM

85 points

9 comments20 min readLW link

[Question] Do mesa-optimization problems correlate with low-slack?

sudoFeb 4, 2022, 9:11 PM

1 point

1 comment1 min readLW link

In Defense of Wrapper-Minds

Thane RuthenisDec 28, 2022, 6:28 PM

24 points

38 comments3 min readLW link

Gradations of Inner Alignment Obstacles

abramdemskiApr 20, 2021, 10:18 PM

84 points

22 comments9 min readLW link

Medical Image Registration: The obscure field where Deep Mesaoptimizers are already at the top of the benchmarks. (post + colab notebook)

HastingsJan 30, 2023, 10:46 PM

35 points

1 comment3 min readLW link

The Illusion of Alignment: Why Current AI Safety Strategies Fall Short

S. Lilith devJun 1, 2025, 12:08 PM

−1 points

0 comments2 min readLW link

Goal Alignment Is Robust To the Sharp Left Turn

Thane RuthenisJul 13, 2022, 8:23 PM

43 points

16 comments4 min readLW link

Understanding mesa-optimization using toy models

tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy and Can

May 7, 2023, 5:00 PM

45 points

2 comments10 min readLW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjrFeb 12, 2021, 7:55 AM

15 points

0 comments27 min readLW link

Simple experiments with deceptive alignment

Andreas_MoeMay 15, 2023, 5:41 PM

7 points

0 comments4 min readLW link

Some real examples of gradient hacking

Oliver SourbutNov 22, 2021, 12:11 AM

15 points

8 comments2 min readLW link

Disincentivizing deception in mesa optimizers with Model Tampering

martinkunevJul 11, 2023, 12:44 AM

3 points

0 comments2 min readLW link

Gradient Filtering

Jozdien and janus

Jan 18, 2023, 8:09 PM

56 points

16 comments13 min readLW link

What are the plans for solving the inner alignment problem?

Leonard HollowayJan 17, 2025, 9:45 PM

12 points

4 comments1 min readLW link

Broad Picture of Human Values

Thane RuthenisAug 20, 2022, 7:42 PM

42 points

6 comments10 min readLW link

Obstacles to gradient hacking

leogaoSep 5, 2021, 10:42 PM

28 points

11 comments4 min readLW link

Imagine a world where Microsoft employees used Bing

Christopher KingMar 31, 2023, 6:36 PM

6 points

2 comments2 min readLW link

[Proposal] Method of locating useful subnets in large models

Quintin PopeOct 13, 2021, 8:52 PM

9 points

0 comments2 min readLW link

Interpretability Tools Are an Attack Channel

Thane RuthenisAug 17, 2022, 6:47 PM

42 points

14 comments1 min readLW link

Towards Deconfusing Gradient Hacking

leogaoOct 24, 2021, 12:43 AM

39 points

3 comments12 min readLW link

More experiments in GPT-4 agency: writing memos

Christopher KingMar 24, 2023, 5:51 PM

5 points

2 comments10 min readLW link

Inner alignment: what are we pointing at?

lemonhopeSep 18, 2022, 11:09 AM

14 points

2 comments1 min readLW link

A Story of AI Risk: InstructGPT-N

peterbarnettMay 26, 2022, 11:22 PM

24 points

0 comments8 min readLW link

Inner Optimization Mechanisms in Neural Nets

ProgramCrafterMay 12, 2024, 5:52 PM

3 points

1 comment1 min readLW link

The Human’s Role in Mesa Optimization

silentbobMay 9, 2024, 12:07 PM

5 points

0 comments2 min readLW link

My (naive) take on Risks from Learned Optimization

Artyom KarpovOct 31, 2022, 10:59 AM

7 points

0 comments5 min readLW link

Powerful mesa-optimisation is already here

Roman LeventovFeb 17, 2023, 4:59 AM

35 points

1 comment2 min readLW link

(arxiv.org)

Understanding Gradient Hacking

peterbarnettDec 10, 2021, 3:58 PM

41 points

5 comments30 min readLW link

Gradient hacking is extremely difficult

berenJan 24, 2023, 3:45 PM

170 points

22 comments5 min readLW link

Value Formation: An Overarching Model

Thane RuthenisNov 15, 2022, 5:16 PM

34 points

20 comments34 min readLW link

Deception as the optimal: mesa-optimizers and inner alignment

Eleni AngelouAug 16, 2022, 4:49 AM

11 points

0 comments5 min readLW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher KingJun 29, 2023, 4:56 PM

7 points

0 comments2 min readLW link

Towards a solution to the alignment problem via objective detection and evaluation

Paul CologneseApr 12, 2023, 3:39 PM

9 points

7 comments12 min readLW link

Measuring Learned Optimization in Small Transformer Models

J BostockApr 8, 2024, 2:41 PM

22 points

0 comments11 min readLW link

Runaway Optimizers in Mind Space

silentbobJul 16, 2023, 2:26 PM

16 points

0 comments12 min readLW link

[Question] Do mesa-optimizer risk arguments rely on the train-test paradigm?

Ben CottierSep 10, 2020, 3:36 PM

12 points

7 comments1 min readLW link

The Disastrously Confident And Inaccurate AI

Sharat Jacob JacobNov 18, 2022, 7:06 PM

13 points

0 comments13 min readLW link

The Inner Alignment Problem

Jakub HalmešFeb 24, 2024, 5:55 PM

1 point

1 comment3 min readLW link

(jakubhalmes.substack.com)

Towards Gears-Level Understanding of Agency

Thane RuthenisJun 16, 2022, 10:00 PM

25 points

4 comments18 min readLW link

Visualizing neural network planning

Nevan Wichers, Victor Tao, Fazl and Riccardo Volpato

May 9, 2024, 6:40 AM

4 points

0 comments5 min readLW link

A Proposal for Structured Cognitive Substrates Beneath Language Models

VerityIXMay 11, 2025, 4:40 PM

1 point

0 comments1 min readLW link

Greed Is the Root of This Evil

Thane RuthenisOct 13, 2022, 8:40 PM

21 points

7 comments8 min readLW link

Are Generative World Models a Mesa-Optimization Risk?

Thane RuthenisAug 29, 2022, 6:37 PM

14 points

2 comments3 min readLW link

Does GPT-4 exhibit agency when summarizing articles?

Christopher KingMar 24, 2023, 3:49 PM

16 points

2 comments5 min readLW link

Why Recursive Self-Improvement Might Not Be the Existential Risk We Fear

Nassim_ANov 24, 2024, 5:17 PM

1 point

0 comments9 min readLW link

It Can’t Be Mesa-Optimizers All The Way Down (Or Else It Can’t Be Long-Term Supercoherence?)

Austin WitteMar 31, 2023, 7:21 AM

20 points

5 comments4 min readLW link

Caution when interpreting Deepmind’s In-context RL paper

Sam MarksNov 1, 2022, 2:42 AM

105 points

8 comments4 min readLW link

Weak arguments against the universal prior being malign

X4vierJun 14, 2018, 5:11 PM

50 points

23 comments3 min readLW link

Gradient descent doesn’t select for inner search

Ivan VendrovAug 13, 2022, 4:15 AM

47 points

23 comments4 min readLW link

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

abhayesian, Jannik Brinkmann and Victor Levoso

May 28, 2024, 5:29 AM

50 points

1 comment9 min readLW link

(arxiv.org)

Motivations, Natural Selection, and Curriculum Engineering

Oliver SourbutDec 16, 2021, 1:07 AM

16 points

0 comments42 min readLW link

[AN #58] Mesa optimization: what it is, and why we should care

Rohin ShahJun 24, 2019, 4:10 PM

55 points

10 comments8 min readLW link

(mailchi.mp)

GPT-4 is bad at strategic thinking

Christopher KingMar 27, 2023, 3:11 PM

22 points

8 comments1 min readLW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher KingMar 31, 2023, 5:05 PM

6 points

4 comments4 min readLW link

Convergence Towards World-Models: A Gears-Level Model

Thane RuthenisAug 4, 2022, 11:31 PM

38 points

1 comment13 min readLW link

Towards deconfusing wireheading and reward maximization

leogaoSep 21, 2022, 12:36 AM

81 points

7 comments4 min readLW link

[ASoT] Consequentialist models as a superset of mesaoptimizers

leogaoApr 23, 2022, 5:57 PM

38 points

2 comments4 min readLW link

Gradient hacking

evhubOct 16, 2019, 12:53 AM

107 points

39 comments3 min readLW link 2 reviews

Why No Interesting Unaligned Singularity?

David UdellApr 20, 2022, 12:34 AM

12 points

12 comments1 min readLW link

Thoughts on Dangerous Learned Optimization

peterbarnettFeb 19, 2022, 10:46 AM

4 points

2 comments4 min readLW link

2-D Robustness

Vlad MikulikAug 30, 2019, 8:27 PM

85 points

8 comments2 min readLW link

Evolutions Building Evolutions: Layers of Generate and Test

plexFeb 5, 2021, 6:21 PM

12 points

1 comment6 min readLW link

Mesa-Optimization: Explain it like I’m 10 Edition

brookAug 26, 2023, 11:04 PM

20 points

1 comment6 min readLW link

Against Boltzmann mesaoptimizers

porbyJan 30, 2023, 2:55 AM

77 points

6 comments4 min readLW link

Agency As a Natural Abstraction

Thane RuthenisMay 13, 2022, 6:02 PM

55 points

9 comments13 min readLW link

Alignment Problems All the Way Down

peterbarnettJan 22, 2022, 12:19 AM

29 points

7 comments11 min readLW link

No comments.

Mesa-Optimization

History

See also

External links