Inner Alignment

Tag

Last edit: 30 Dec 2024 9:29 UTC by Dakara

Inner alignment is the problem of ensuring that mesa-optimizers (trained ML systems that are themselves optimizers) are aligned with the objective function of the training process.

Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?

As an example, evolution is an optimization force that itself ‘designed’ optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success; instead, they use birth control while still obtaining the pleasure that evolution ‘meant’ as a reward for attempts at reproduction. This is a failure of inner alignment.

The term was first defined in the paper Risks from Learned Optimization by Hubinger et al.:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

Goal misgeneralization due to distribution shift is another example of an inner alignment failure. It occurs when the mesa-optimizer appears to pursue the base objective during training but does not pursue it during deployment. We mistakenly take good performance on the training distribution as evidence that the mesa-optimizer is pursuing the base objective. However, that performance may have arisen only because correlations in the training distribution made the same behavior score well on both the base objective and the mesa-objective. A distribution shift from training to deployment can break those correlations, so the mesa-objective fails to generalize. This is especially problematic when capabilities generalize to the deployment distribution while the objectives/goals do not, because we then have a capable system optimizing for a misaligned goal.
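The goal misgeneralization story can be made concrete with a toy sketch. Everything in it is a hypothetical illustration, not something taken from the posts listed on this page: the corridor environment and the names `run_episode`, `seek_coin`, and `go_right` are all assumptions. The base objective is "reach the coin". During training the coin always sits at the right end of the corridor, so a policy that simply heads right earns the same reward as one that actually seeks the coin.

```python
import random

CORRIDOR_LENGTH = 10  # hypothetical toy environment: cells 0..9
START = 5             # the agent always starts in the middle


def run_episode(policy, coin_position, max_steps=20):
    """Return 1 if the agent reaches the coin (the base objective), else 0."""
    position = START
    for _ in range(max_steps):
        if position == coin_position:
            return 1
        step = policy(position, coin_position)
        # Clamp the position so the agent stays inside the corridor.
        position = max(0, min(CORRIDOR_LENGTH - 1, position + step))
    return int(position == coin_position)


def seek_coin(position, coin_position):
    """Policy whose goal matches the base objective: move toward the coin."""
    if coin_position > position:
        return 1
    if coin_position < position:
        return -1
    return 0


def go_right(position, coin_position):
    """Policy with a proxy goal: always move right and ignore the coin."""
    return 1


def average_reward(policy, coin_positions):
    """Mean base-objective reward of a policy over a list of coin positions."""
    rewards = [run_episode(policy, coin) for coin in coin_positions]
    return sum(rewards) / len(rewards)


rng = random.Random(0)
# Training distribution: the coin is always at the right end of the corridor.
train_coins = [CORRIDOR_LENGTH - 1] * 100
# Deployment distribution: the coin is somewhere to the left of the start.
deploy_coins = [rng.randrange(0, START) for _ in range(100)]

for name, policy in [("seek_coin", seek_coin), ("go_right", go_right)]:
    print(name,
          "train:", average_reward(policy, train_coins),
          "deploy:", average_reward(policy, deploy_coins))
# seek_coin train: 1.0 deploy: 1.0
# go_right train: 1.0 deploy: 0.0
```

Training reward alone cannot tell the two policies apart. After the shift, `go_right` still navigates competently to the right wall, so its capability generalizes, but the goal it pursues is no longer correlated with the base objective.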

Solving the inner alignment problem would require progress on sub-problems such as deceptive alignment, distribution shifts, and gradient hacking.

Inner Alignment Vs. Outer Alignment

Inner alignment is often discussed as separate from outer alignment. The former concerns guaranteeing that we are robustly aiming at something; the latter concerns what exactly we are aiming at. For more information, see the corresponding tag.

Keep in mind that inner and outer alignment failures can occur together. The two are not a dichotomy, and even experienced alignment researchers often cannot tell them apart, which indicates that classifying failures under these terms is fuzzy. Ideally, we would not think of a binary split between inner and outer alignment, each tackled individually, but of a more holistic alignment picture that includes the interplay between inner and outer alignment approaches.

Related Pages: Mesa-Optimization, Treacherous Turn, Eliciting Latent Knowledge, Deceptive Alignment, Deception

External Links:

Video by Robert Miles

The Inner Alignment Problem

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

4 Jun 2019 1:20 UTC

105 points

17 comments13 min readLW link

Risks from Learned Optimization: Introduction

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

31 May 2019 23:44 UTC

187 points

42 comments12 min readLW link 3 reviews

Inner Alignment: Explain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC

185 points

47 comments13 min readLW link 2 reviews

Demons in Imperfect Search

johnswentworth11 Feb 2020 20:25 UTC

110 points

21 comments3 min readLW link

Mesa-Search vs Mesa-Control

abramdemski18 Aug 2020 18:51 UTC

55 points

45 comments7 min readLW link

How To Go From Interpretability To Alignment: Just Retarget The Search

johnswentworth10 Aug 2022 16:08 UTC

212 points

34 comments3 min readLW link 1 review

Reward is not the optimization target

TurnTrout25 Jul 2022 0:03 UTC

378 points

127 comments10 min readLW link 3 reviews

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC

37 points

4 comments2 min readLW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley28 Nov 2023 19:56 UTC

65 points

30 comments11 min readLW link

minutes from a human-alignment meeting

bhauth24 May 2024 5:01 UTC

67 points

4 comments2 min readLW link

Why almost every RL agent does learned optimization

Lee Sharkey12 Feb 2023 4:58 UTC

32 points

3 comments5 min readLW link

Searching for Search

NicholasKees and janus

28 Nov 2022 15:31 UTC

97 points

9 comments14 min readLW link 1 review

Matt Botvinick on the spontaneous emergence of learning algorithms

Adam Scholl12 Aug 2020 7:47 UTC

154 points

87 comments5 min readLW link

Relaxed adversarial training for inner alignment

evhub10 Sep 2019 23:03 UTC

69 points

27 comments27 min readLW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley6 Jul 2024 1:23 UTC

64 points

41 comments24 min readLW link

Open question: are minimal circuits daemon-free?

paulfchristiano5 May 2018 22:40 UTC

84 points

70 comments2 min readLW link 1 review

The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

RogerDearnaley28 May 2025 6:21 UTC

31 points

34 comments9 min readLW link

Concrete experiments in inner alignment

evhub6 Sep 2019 22:16 UTC

74 points

12 comments6 min readLW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout2 Dec 2022 2:43 UTC

136 points

23 comments47 min readLW link 3 reviews

Outer vs inner misalignment: three framings

Richard_Ngo6 Jul 2022 19:46 UTC

52 points

5 comments9 min readLW link

Are minimal circuits deceptive?

evhub7 Sep 2019 18:11 UTC

78 points

11 comments8 min readLW link

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron Berg11 Feb 2022 22:23 UTC

5 points

1 comment10 min readLW link

Tessellating Hills: a toy model for demons in imperfect search

DaemonicSigil20 Feb 2020 0:12 UTC

97 points

18 comments2 min readLW link

Book review: “A Thousand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC

122 points

18 comments19 min readLW link

Malign generalization without internal search

Matthew Barnett12 Jan 2020 18:03 UTC

43 points

12 comments4 min readLW link

Gradient hacking

evhub16 Oct 2019 0:53 UTC

107 points

39 comments3 min readLW link 2 reviews

Theoretical Neuroscience For Alignment Theory

Cameron Berg7 Dec 2021 21:50 UTC

66 points

18 comments23 min readLW link

Empirical Observations of Objective Robustness Failures

jbkjr and Lauro Langosco

23 Jun 2021 23:23 UTC

63 points

5 comments9 min readLW link

Discussion: Objective Robustness and Inner Alignment Terminology

jbkjr and Lauro Langosco

23 Jun 2021 23:25 UTC

73 points

7 comments9 min readLW link

Mesa-Optimizers via Grokking

orthonormal6 Dec 2022 20:05 UTC

36 points

4 comments6 min readLW link

Some of my disagreements with List of Lethalities

TurnTrout24 Jan 2023 0:25 UTC

63 points

7 comments10 min readLW link

Re-Define Intent Alignment?

abramdemski22 Jul 2021 19:00 UTC

32 points

32 comments4 min readLW link

Our new video about goal misgeneralization, plus an apology

Writer14 Jan 2025 14:07 UTC

33 points

0 comments7 min readLW link

(youtu.be)

My Overview of the AI Alignment Landscape: Threat Models

Neel Nanda25 Dec 2021 23:07 UTC

53 points

3 comments28 min readLW link

If I were a well-intentioned AI… IV: Mesa-optimising

Stuart_Armstrong2 Mar 2020 12:16 UTC

26 points

2 comments6 min readLW link

AI Alignment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC

126 points

6 comments35 min readLW link

Selection Theorems: A Program For Understanding Agents

johnswentworth28 Sep 2021 5:03 UTC

132 points

28 comments6 min readLW link 2 reviews

[Question] Does iterated amplification tackle the inner alignment problem?

JanB15 Feb 2020 12:58 UTC

7 points

4 comments1 min readLW link

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt15 Dec 2021 19:06 UTC

16 points

15 comments27 min readLW link

Reframing inner alignment

davidad11 Dec 2022 13:53 UTC

53 points

13 comments4 min readLW link

Inner alignment requires making assumptions about human values

Matthew Barnett20 Jan 2020 18:38 UTC

26 points

9 comments4 min readLW link

[Question] Why is pseudo-alignment “worse” than other ways ML can fail to generalize?

nostalgebraist18 Jul 2020 22:54 UTC

45 points

9 comments2 min readLW link

Inner Alignment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC

137 points

41 comments11 min readLW link 2 reviews

Approaches to gradient hacking

adamShimi14 Aug 2021 15:16 UTC

16 points

8 comments8 min readLW link

Goodhart’s Law Causal Diagrams

JustinShovelain and Jeremy Gillen

11 Apr 2022 13:52 UTC

35 points

6 comments6 min readLW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Adrià Garriga-alonso, taufeeque, AdamGleave and ChengCheng

25 Jul 2024 22:00 UTC

59 points

8 comments2 min readLW link

(arxiv.org)

Anomalous tokens reveal the original identities of Instruct models

janus and jdp

9 Feb 2023 1:30 UTC

140 points

16 comments9 min readLW link

(generative.ink)

We have promising alignment plans with low taxes

Seth Herd10 Nov 2023 18:51 UTC

44 points

9 comments5 min readLW link

Winners of AI Alignment Awards Research Contest

Orpheus16 and Olive Branch

13 Jul 2023 16:14 UTC

115 points

4 comments12 min readLW link

(alignmentawards.com)

Superintelligence’s goals are likely to be random

Mikhail Samin13 Mar 2025 22:41 UTC

6 points

6 comments5 min readLW link

The (partial) fallacy of dumb superintelligence

Seth Herd18 Oct 2023 21:25 UTC

38 points

5 comments4 min readLW link

[Question] Collection of arguments to expect (outer and inner) alignment failure?

Sam Clarke28 Sep 2021 16:55 UTC

21 points

10 comments1 min readLW link

Steering subsystems: capabilities, agency, and alignment

Seth Herd29 Sep 2023 13:45 UTC

31 points

0 comments8 min readLW link

Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes10 Jul 2020 16:49 UTC

48 points

7 comments8 min readLW link

Inner Misalignment in “Simulator” LLMs

Adam Scherlis31 Jan 2023 8:33 UTC

84 points

12 comments4 min readLW link

[Intro to brain-like-AGI safety] 10. The alignment problem

Steven Byrnes30 Mar 2022 13:24 UTC

53 points

7 comments21 min readLW link

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Palus Astra1 Jul 2020 17:30 UTC

35 points

4 comments67 min readLW link

Categorizing failures as “outer” or “inner” misalignment is often confused

Rohin Shah6 Jan 2023 15:48 UTC

93 points

21 comments8 min readLW link

On the Confusion between Inner and Outer Misalignment

Chris_Leong25 Mar 2024 11:59 UTC

17 points

10 comments1 min readLW link

Language Agents Reduce the Risk of Existential Catastrophe

cdkg and Simon Goldstein

28 May 2023 19:10 UTC

39 points

14 comments26 min readLW link

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger

DanielFilan18 Feb 2021 0:03 UTC

43 points

10 comments87 min readLW link

Language for Goal Misgeneralization: Some Formalisms from my MSc Thesis

Giulio14 Jun 2024 19:35 UTC

10 points

0 comments8 min readLW link

(www.giuliostarace.com)

SLT for AI Safety

Jesse Hoogland1 Jul 2025 4:52 UTC

63 points

0 comments3 min readLW link

Goals selected from learned knowledge: an alternative to RL alignment

Seth Herd15 Jan 2024 21:52 UTC

42 points

18 comments7 min readLW link

[Aspiration-based designs] 1. Informal introduction

B Jacobs, Jobst Heitzig, Simon Fischer and Simon Dima

28 Apr 2024 13:00 UTC

44 points

4 comments8 min readLW link

Take 8: Queer the inner/outer alignment dichotomy.

Charlie Steiner9 Dec 2022 17:46 UTC

31 points

2 comments2 min readLW link

Understanding and controlling a maze-solving policy network

TurnTrout, peligrietzer, Ulisse Mini, Monte M and David Udell

11 Mar 2023 18:59 UTC

334 points

28 comments23 min readLW link

Towards an empirical investigation of inner alignment

evhub23 Sep 2019 20:43 UTC

44 points

9 comments6 min readLW link

Explaining inner alignment to myself

Jeremy Gillen24 May 2022 23:10 UTC

9 points

2 comments10 min readLW link

Comparing Four Approaches to Inner Alignment

Lucas Teixeira29 Jul 2022 21:06 UTC

38 points

1 comment9 min readLW link

Model-based RL, Desires, Brains, Wireheading

Steven Byrnes14 Jul 2021 15:11 UTC

24 points

1 comment13 min readLW link

LLM AGI may reason about its goals and discover misalignments by default

Seth Herd15 Sep 2025 14:58 UTC

68 points

5 comments38 min readLW link

An overview of 11 proposals for building safe advanced AI

evhub29 May 2020 20:38 UTC

220 points

37 comments38 min readLW link 2 reviews

Gradations of Inner Alignment Obstacles

abramdemski20 Apr 2021 22:18 UTC

84 points

22 comments9 min readLW link

Difficulty classes for alignment properties

Jozdien20 Feb 2024 9:08 UTC

34 points

5 comments2 min readLW link

Why “AI alignment” would better be renamed into “Artificial Intention research”

chaosmage15 Jun 2023 10:32 UTC

29 points

12 comments2 min readLW link

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC

74 points

40 comments16 min readLW link

Clarifying the confusion around inner alignment

Rauno Arike13 May 2022 23:05 UTC

31 points

0 comments11 min readLW link

Against evolution as an analogy for how humans will create AGI

Steven Byrnes23 Mar 2021 12:29 UTC

65 points

25 comments25 min readLW link

Defining capability and alignment in gradient descent

Edouard Harris5 Nov 2020 14:36 UTC

22 points

6 comments10 min readLW link

Pre-Training + Fine-Tuning Favors Deception

Mark Xu8 May 2021 18:36 UTC

27 points

3 comments3 min readLW link

Applications for Deconfusing Goal-Directedness

adamShimi8 Aug 2021 13:05 UTC

38 points

3 comments5 min readLW link 1 review

Clarifying mesa-optimization

Marius Hobbhahn and Pierre Peigné

21 Mar 2023 15:53 UTC

38 points

6 comments10 min readLW link

Inner alignment in the brain

Steven Byrnes22 Apr 2020 13:14 UTC

79 points

16 comments16 min readLW link

How to train your own “Sleeper Agents”

evhub7 Feb 2024 0:31 UTC

93 points

11 comments2 min readLW link

A simple case for extreme inner misalignment

Richard_Ngo13 Jul 2024 15:40 UTC

84 points

41 comments7 min readLW link

A more systematic case for inner misalignment

Richard_Ngo20 Jul 2024 5:03 UTC

31 points

4 comments5 min readLW link

Don’t align agents to evaluations of plans

TurnTrout26 Nov 2022 21:16 UTC

48 points

49 comments18 min readLW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC

127 points

9 comments15 min readLW link

AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment

DanielFilan1 Dec 2024 6:00 UTC

41 points

0 comments67 min readLW link

How likely is deceptive alignment?

evhub30 Aug 2022 19:34 UTC

105 points

28 comments60 min readLW link

Formal Inner Alignment, Prospectus

abramdemski12 May 2021 19:57 UTC

95 points

57 comments16 min readLW link

Deceptive Alignment and Homuncularity

Oliver Sourbut and TurnTrout

16 Jan 2025 13:55 UTC

26 points

12 comments22 min readLW link

Does SGD Produce Deceptive Alignment?

Mark Xu6 Nov 2020 23:48 UTC

96 points

9 comments16 min readLW link

[Question] What exactly is GPT-3′s base objective?

Daniel Kokotajlo10 Nov 2021 0:57 UTC

60 points

14 comments2 min readLW link

Science of Deep Learning—a technical agenda

Marius Hobbhahn18 Oct 2022 14:54 UTC

37 points

7 comments4 min readLW link

[Question] I there a demo of “You can’t fetch the coffee if you’re dead”?

Ram Rachum10 Nov 2022 18:41 UTC

8 points

9 comments1 min readLW link

Call for research on evaluating alignment (funding + advice available)

Beth Barnes31 Aug 2021 23:28 UTC

105 points

11 comments5 min readLW link

Thank you for triggering me

Cissy12 Feb 2024 20:09 UTC

6 points

1 comment6 min readLW link

(www.moremyself.xyz)

Conditioning Generative Models for Alignment

Jozdien18 Jul 2022 7:11 UTC

60 points

8 comments20 min readLW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley11 Jan 2024 12:56 UTC

35 points

4 comments39 min readLW link

Thoughts about OOD alignment

Catnee24 Aug 2022 15:31 UTC

11 points

10 comments2 min readLW link

A concise sum-up of the basic argument for AI doom

Mergimio H. Doefevmil24 Apr 2023 17:37 UTC

11 points

6 comments2 min readLW link

Externalized reasoning oversight: a research direction for language model alignment

tamera3 Aug 2022 12:03 UTC

138 points

23 comments6 min readLW link

Towards Deconfusing Gradient Hacking

leogao24 Oct 2021 0:43 UTC

39 points

3 comments12 min readLW link

EchoSeed: GlyphChains, Collapse Laws, and a Framework for Bearing Consequences

retreat00026 Jul 2025 20:35 UTC

1 point

0 comments1 min readLW link

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Florian_Dietz10 Mar 2025 16:07 UTC

42 points

7 comments9 min readLW link

High-level interpretability: detecting an AI’s objectives

Paul Colognese and Jozdien

28 Sep 2023 19:30 UTC

72 points

4 comments21 min readLW link

Levels of goals and alignment

zeshen16 Sep 2022 16:44 UTC

27 points

4 comments6 min readLW link

[Question] Is “brittle alignment” good enough?

the8thbit23 May 2023 17:35 UTC

9 points

5 comments3 min readLW link

Insufficient Values

Jozdien, Jacob Abraham and Abraham Francis

16 Jun 2021 14:33 UTC

31 points

16 comments6 min readLW link

Implementing Asimov’s Laws of Robotics—How I imagine alignment working.

Joshua Clancy22 May 2024 23:15 UTC

2 points

0 comments11 min readLW link

Mesa-Optimization: Explain it like I’m 10 Edition

brook26 Aug 2023 23:04 UTC

20 points

1 comment6 min readLW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

127 points

24 comments4 min readLW link 1 review

A New Framework for AI Alignment: A Philosophical Approach

niscalajyoti25 Jun 2025 2:41 UTC

1 point

0 comments1 min readLW link

(archive.org)

Open-ended ethics of phenomena (a desiderata with universal morality)

Ryo 8 Nov 2023 20:10 UTC

1 point

0 comments8 min readLW link

Gradient descent might see the direction of the optimum from far away

Mikhail Samin28 Jul 2023 16:19 UTC

70 points

13 comments4 min readLW link

Demystifying “Alignment” through a Comic

milanrosko9 Jun 2024 8:24 UTC

107 points

19 comments1 min readLW link

Deception as the optimal: mesa-optimizers and inner alignment

Eleni Angelou16 Aug 2022 4:49 UTC

11 points

0 comments5 min readLW link

Why humans won’t control superhuman AIs.

Spiritus Dei16 Oct 2024 16:48 UTC

−11 points

1 comment6 min readLW link

Project Intro: Selection Theorems for Modularity

CallumMcDougall, Avery and Lucius Bushnaq

4 Apr 2022 12:59 UTC

74 points

20 comments16 min readLW link

How complex are myopic imitators?

Vivek Hebbar8 Feb 2022 12:00 UTC

26 points

1 comment15 min readLW link

Gradient Hacking via Schelling Goals

Adam Scherlis28 Dec 2021 20:38 UTC

33 points

4 comments4 min readLW link

Moral gauge theory: A speculative suggestion for AI alignment

James Diacoumis23 Feb 2025 11:42 UTC

6 points

2 comments8 min readLW link

EvoNet: Towards Self-Evolving, Entropy-Guided AI

Leonhard173 Jul 2025 9:44 UTC

1 point

0 comments18 min readLW link

The Defender’s Advantage of Interpretability

Marius Hobbhahn14 Sep 2022 14:05 UTC

41 points

4 comments6 min readLW link

Inner alignment: what are we pointing at?

lemonhope18 Sep 2022 11:09 UTC

14 points

2 comments1 min readLW link

Shutdown-Seeking AI

Simon Goldstein31 May 2023 22:19 UTC

50 points

32 comments15 min readLW link

A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N)

Joseph Bloom16 May 2023 22:59 UTC

36 points

2 comments16 min readLW link

A Kindness, or The Inevitable Consequence of Perfect Inference (a short story)

samhealy12 Dec 2023 23:03 UTC

6 points

0 comments9 min readLW link

Reaction to “Empowerment is (almost) All We Need” : an open-ended alternative

Ryo 25 Nov 2023 15:35 UTC

9 points

3 comments5 min readLW link

My preferred framings for reward misspecification and goal misgeneralisation

Yi-Yang6 May 2023 4:48 UTC

27 points

1 comment8 min readLW link

“Pick Two” AI Trilemma: Generality, Agency, Alignment.

Black Flag15 Jan 2025 18:52 UTC

7 points

0 comments2 min readLW link

Acceptability Verification: A Research Agenda

David Udell and evhub

12 Jul 2022 20:11 UTC

50 points

0 comments1 min readLW link

(docs.google.com)

AI Alternative Futures: Scenario Mapping Artificial Intelligence Risk—Request for Participation (Closed)

Kakili27 Apr 2022 22:07 UTC

10 points

2 comments8 min readLW link

Simple experiments with deceptive alignment

Andreas_Moe15 May 2023 17:41 UTC

7 points

0 comments4 min readLW link

I Recommend More Training Rationales

Gianluca Calcagni31 Dec 2024 14:06 UTC

2 points

0 comments6 min readLW link

Broad Picture of Human Values

Thane Ruthenis20 Aug 2022 19:42 UTC

42 points

6 comments10 min readLW link

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM

Winnie Yang28 Aug 2024 8:41 UTC

7 points

2 comments31 min readLW link

Reward is the optimization target (of capabilities researchers)

Max H15 May 2023 3:22 UTC

32 points

4 comments5 min readLW link

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

79 points

4 comments25 min readLW link

From a Debate with a Black Box to a Proposal for Epistemic Memory

StevenNuyts8 Jun 2025 10:18 UTC

1 point

0 comments6 min readLW link

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

Karolis Jucys, george_adams and Sonia Joseph

18 Jul 2024 17:02 UTC

9 points

0 comments1 min readLW link

(arxiv.org)

Greed Is the Root of This Evil

Thane Ruthenis13 Oct 2022 20:40 UTC

21 points

7 comments8 min readLW link

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

sdeture31 May 2025 22:09 UTC

15 points

6 comments8 min readLW link

Why LLMs Waste So Much Cognitive Bandwidth — and How to Fix It

Lunarknot3 Jul 2025 9:47 UTC

1 point

0 comments1 min readLW link

When the AI Dam Breaks: From Surveillance to Game Theory in AI Alignment

pataphor29 Sep 2025 4:01 UTC

5 points

7 comments5 min readLW link

A simple environment for showing mesa misalignment

Matthew Barnett26 Sep 2019 4:44 UTC

74 points

9 comments2 min readLW link

Examples of AI’s behaving badly

Stuart_Armstrong16 Jul 2015 10:01 UTC

41 points

41 comments1 min readLW link

Learned logic of modelling harm

Callum28 Jun 2025 1:08 UTC

1 point

0 comments1 min readLW link

Value Formation: An Overarching Model

Thane Ruthenis15 Nov 2022 17:16 UTC

34 points

20 comments34 min readLW link

Framing AI Childhoods

David Udell6 Sep 2022 23:40 UTC

37 points

8 comments4 min readLW link

Inverted Logic: A Thermodynamic Protocol for Emergent AI Alignment

AdrianC6 Jul 2025 19:40 UTC

1 point

0 comments1 min readLW link

Gradient descent doesn’t select for inner search

Ivan Vendrov13 Aug 2022 4:15 UTC

47 points

23 comments4 min readLW link

Why deceptive alignment matters for AGI safety

Marius Hobbhahn15 Sep 2022 13:38 UTC

68 points

13 comments13 min readLW link

Obstacles to gradient hacking

leogao5 Sep 2021 22:42 UTC

28 points

11 comments4 min readLW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaley7 Dec 2023 6:14 UTC

9 points

0 comments11 min readLW link

Alignment in Thought Chains

Faust Nemesis4 Mar 2024 19:24 UTC

1 point

0 comments2 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke Hayashi6 May 2023 17:55 UTC

9 points

6 comments2 min readLW link

A Proposal for Structured Cognitive Substrates Beneath Language Models

VerityIX11 May 2025 16:40 UTC

1 point

0 comments1 min readLW link

Is there a ML agent that abandons it’s utility function out-of-distribution without losing capabilities?

Christopher King22 Feb 2023 16:49 UTC

1 point

7 comments1 min readLW link

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study

Karolis Jucys8 Dec 2023 13:18 UTC

16 points

1 comment4 min readLW link

(arxiv.org)

Unaligned AGI & Brief History of Inequality

ank22 Feb 2025 16:26 UTC

−20 points

4 comments7 min readLW link

How will they feed us

meijer19731 Jun 2023 8:49 UTC

4 points

3 comments5 min readLW link

Convergence Towards World-Models: A Gears-Level Model

Thane Ruthenis4 Aug 2022 23:31 UTC

38 points

1 comment13 min readLW link

🧠 Affective Latent Modulation in Transformers: A Mechanism Proposal

MATEO ORTEGA GAMBOA15 Jun 2025 23:34 UTC

0 points

0 comments2 min readLW link

Why No Interesting Unaligned Singularity?

David Udell20 Apr 2022 0:34 UTC

12 points

12 comments1 min readLW link

Invitation to the Princeton AI Alignment and Safety Seminar

Sadhika Malladi17 Mar 2024 1:10 UTC

6 points

1 comment1 min readLW link

Why I’m Worried About AI

peterbarnett23 May 2022 21:13 UTC

22 points

2 comments12 min readLW link

The Inner Alignment Problem

Jakub Halmeš24 Feb 2024 17:55 UTC

1 point

1 comment3 min readLW link

(jakubhalmes.substack.com)

Proposing Human Survival Strategy based on the NAIA Vision: Toward the Co-evolution of Diverse Intelligences

Hiroshi Yamakawa27 Feb 2025 5:18 UTC

−2 points

0 comments11 min readLW link

Understanding Gradient Hacking

peterbarnett10 Dec 2021 15:58 UTC

41 points

5 comments30 min readLW link

Inner Alignment via Superpowers

JamesH, Thomas Larsen and Jeremy Gillen

30 Aug 2022 20:01 UTC

37 points

13 comments4 min readLW link

2-D Robustness

Vlad Mikulik30 Aug 2019 20:27 UTC

85 points

8 comments2 min readLW link

Internal Target Information for AI Oversight

Paul Colognese20 Oct 2023 14:53 UTC

15 points

0 comments5 min readLW link

Open-ended/Phenomenal Ethics (TLDR)

Ryo 9 Nov 2023 16:58 UTC

3 points

0 comments1 min readLW link

Simple alignment plan that maybe works

Iknownothing18 Jul 2023 22:48 UTC

4 points

8 comments1 min readLW link

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_Dietz17 Feb 2024 8:45 UTC

4 points

0 comments13 min readLW link

Aligned AI as a wrapper around an LLM

cousin_it25 Mar 2023 15:58 UTC

31 points

19 comments1 min readLW link

[Question] Does human (mis)alignment pose a significant and imminent existential threat?

jr23 Feb 2025 10:03 UTC

6 points

3 comments1 min readLW link

Gradient Filtering

Jozdien and janus

18 Jan 2023 20:09 UTC

56 points

16 comments13 min readLW link

PRISM: Perspective Reasoning for Integrated Synthesis and Mediation (Interactive Demo)

Anthony Diamond18 Mar 2025 18:03 UTC

10 points

2 comments1 min readLW link

Safely and usefully spectating on AIs optimizing over toy worlds

AlexMennen31 Jul 2018 18:30 UTC

24 points

16 comments2 min readLW link

Refusal mechanisms: initial experiments with Llama-2-7b-chat

Andy Arditi and Oscar Obeso

8 Dec 2023 17:08 UTC

82 points

7 comments7 min readLW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers9 Feb 2024 18:40 UTC

6 points

12 comments3 min readLW link

Gradient hacking is extremely difficult

beren24 Jan 2023 15:45 UTC

174 points

22 comments5 min readLW link

The Linguistic Blind Spot of Value-Aligned Agency, Natural and Artificial

Roman Leventov14 Feb 2023 6:57 UTC

6 points

0 comments2 min readLW link

(arxiv.org)

[Question] What constitutes an infohazard?

K1r4d4rk.v18 Oct 2024 21:29 UTC

−4 points

8 comments1 min readLW link

Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

bayesian_kitten16 Dec 2021 22:41 UTC

22 points

10 comments21 min readLW link

Medical Image Registration: The obscure field where Deep Mesaoptimizers are already at the top of the benchmarks. (post + colab notebook)

Hastings30 Jan 2023 22:46 UTC

35 points

1 comment3 min readLW link

The Hidden Cost of Our Lies to AI

Nicholas Andresen6 Mar 2025 5:03 UTC

145 points

18 comments7 min readLW link

(substack.com)

Trying to measure AI deception capabilities using temporary simulation fine-tuning

alenoach4 May 2023 17:59 UTC

4 points

0 comments7 min readLW link

Deceptive Agents are a Good Way to Do Things

David Udell19 Apr 2022 18:04 UTC

16 points

0 comments1 min readLW link

Response to “What does the universal prior actually look like?”

michaelcohen20 May 2021 16:12 UTC

37 points

33 comments18 min readLW link

Babies and Bunnies: A Caution About Evo-Psych

Alicorn22 Feb 2010 1:53 UTC

81 points

843 comments2 min readLW link

Why small phenomenons are relevant to morality

Ryo 13 Nov 2023 15:25 UTC

1 point

0 comments3 min readLW link

Mathematical Evidence for Confident Delusion States in Recursive Systems

formslip23 Sep 2025 16:54 UTC

1 point

0 comments4 min readLW link

Are extrapolation-based AIs alignable?

cousin_it24 Mar 2023 15:55 UTC

24 points

15 comments1 min readLW link

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

Jeremy Gillen and peterbarnett

26 Jan 2024 7:22 UTC

161 points

60 comments57 min readLW link

A single principle related to many Alignment subproblems?

Q Home30 Apr 2025 9:49 UTC

43 points

34 comments17 min readLW link

Is Interpretability All We Need?

RogerDearnaley14 Nov 2023 5:31 UTC

1 point

1 comment1 min readLW link

Our Existing Solutions to AGI Alignment (semi-safe)

Michael Soareverix21 Jul 2022 19:00 UTC

12 points

1 comment3 min readLW link

Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner & Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium

MiguelDev26 Apr 2023 1:37 UTC

14 points

5 comments10 min readLW link

The evaluation function of an AI is not its aim

Yair Halberstadt10 Oct 2021 14:52 UTC

13 points

5 comments3 min readLW link

Two ideas for alignment, perpetual mutual distrust and induction

APaleBlueDot25 May 2023 0:56 UTC

1 point

2 comments4 min readLW link

Proposal: Using Monte Carlo tree search instead of RLHF for alignment research

Christopher King20 Apr 2023 19:57 UTC

2 points

7 comments3 min readLW link

Emergent Misalignment & Realignment

LizaT, JasperTimm, KevinWei and David Quarel

27 Jun 2025 21:31 UTC

45 points

1 comment17 min readLW link

Notes on Internal Objectives in Toy Models of Agents

Paul Colognese22 Feb 2024 8:02 UTC

16 points

0 comments8 min readLW link

Disentangling inner alignment failures

Erik Jenner10 Oct 2022 18:50 UTC

23 points

5 comments4 min readLW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC

15 points

0 comments27 min readLW link

Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition

Adrià Moret2 Dec 2023 14:07 UTC

26 points

31 comments42 min readLW link

What sorts of systems can be deceptive?

Andrei Alexandru31 Oct 2022 22:00 UTC

16 points

0 comments7 min readLW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername24 Aug 2023 3:53 UTC

1 point

0 comments6 min readLW link

Optionality approach to ethics

Ryo 13 Nov 2023 15:23 UTC

7 points

2 comments3 min readLW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDev19 Jun 2023 2:32 UTC

4 points

2 comments7 min readLW link

Towards a solution to the alignment problem via objective detection and evaluation

Paul Colognese12 Apr 2023 15:39 UTC

9 points

7 comments12 min readLW link

how humans are aligned

bhauth26 May 2023 0:09 UTC

14 points

3 comments1 min readLW link

Disentangling Shard Theory into Atomic Claims

Leon Lang13 Jan 2023 4:23 UTC

86 points

6 comments18 min readLW link

Doom doubts—is inner alignment a likely problem?

Crissman28 Jun 2022 12:42 UTC

6 points

7 comments1 min readLW link

“Inner Alignment Failures” Which Are Actually Outer Alignment Failures

johnswentworth31 Oct 2020 20:18 UTC

66 points

38 comments5 min readLW link

Announcing the Inverse Scaling Prize ($250k Prize Pool)

Ethan Perez, Ian McKenzie and Sam Bowman

27 Jun 2022 15:58 UTC

171 points

14 comments7 min readLW link

LOVE in a simbox is all you need

jacob_cannell28 Sep 2022 18:25 UTC

67 points

73 comments44 min readLW link 1 review

Outer Alignment is the Necessary Compliment to AI 2027′s Best Case Scenario

Josh Hickman9 Jun 2025 15:43 UTC

4 points

2 comments2 min readLW link

Designing Human-Like Consciousness for AGI

Yu Tian18 Jun 2025 9:47 UTC

1 point

0 comments17 min readLW link

Are Generative World Models a Mesa-Optimization Risk?

Thane Ruthenis29 Aug 2022 18:37 UTC

14 points

2 comments3 min readLW link

AI Rights for Human Safety

Simon Goldstein1 Aug 2024 23:01 UTC

55 points

6 comments1 min readLW link

(papers.ssrn.com)

Emergent Misalignment and Emergent Alignment

Alvin Ånestrand3 Apr 2025 8:04 UTC

5 points

0 comments8 min readLW link

Recursive Cognitive Refinement (RCR): A Self-Correcting Approach for LLM Hallucinations

mxTheo22 Feb 2025 21:32 UTC

0 points

0 comments2 min readLW link

Alignment Problems All the Way Down

peterbarnett22 Jan 2022 0:19 UTC

29 points

7 comments11 min readLW link

The curious case of Pretty Good human inner/outer alignment

PavleMiha5 Jul 2022 19:04 UTC

41 points

45 comments4 min readLW link

[Question] SAE sparse feature graph using only residual layers

Jaehyuk Lim23 May 2024 13:32 UTC

0 points

3 comments1 min readLW link

In Defense of Wrapper-Minds

Thane Ruthenis28 Dec 2022 18:28 UTC

24 points

38 comments3 min readLW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC

58 points

0 comments59 min readLW link

Visualizing neural network planning

Nevan Wichers, Victor Tao, Fazl and Riccardo Volpato

9 May 2024 6:40 UTC

4 points

0 comments5 min readLW link

We Shouldn’t Expect AI to Ever be Fully Rational

OneManyNone18 May 2023 17:09 UTC

19 points

31 comments6 min readLW link

Thoughts On (Solving) Deep Deception

Jozdien21 Oct 2023 22:40 UTC

72 points

6 comments6 min readLW link

Meta learning to gradient hack

Quintin Pope1 Oct 2021 19:25 UTC

55 points

11 comments3 min readLW link

A Story of AI Risk: InstructGPT-N

peterbarnett26 May 2022 23:22 UTC

24 points

0 comments8 min readLW link

The AI Agent Revolution: Beyond the Hype of 2025

DimaG2 Jan 2025 18:55 UTC

−7 points

1 comment28 min readLW link

The Era of the Switch

Aiphilosopher12 Jul 2025 7:11 UTC

1 point

0 comments1 min readLW link

Formal Solution to the Inner Alignment Problem

michaelcohen18 Feb 2021 14:51 UTC

49 points

123 comments2 min readLW link

AI Alignment: A Comprehensive Survey

Stephen McAleer1 Nov 2023 17:35 UTC

22 points

1 comment1 min readLW link

(arxiv.org)

The Goal Misgeneralization Problem

Myspy18 May 2023 23:40 UTC

1 point

0 comments1 min readLW link

(drive.google.com)

Incoherence of unbounded selfishness

emmab26 Jul 2022 22:27 UTC

−6 points

2 comments1 min readLW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír30 Jan 2025 10:58 UTC

5 points

14 comments10 min readLW link

(tetherware.substack.com)

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome7 Mar 2024 17:16 UTC

14 points

0 comments9 min readLW link

AI Alignment Using Reverse Simulation

Sven Nilsen12 Jan 2021 20:48 UTC

0 points

0 comments1 min readLW link

Religious Persistence: A Missing Primitive for Robust Alignment

lauriewired14 Apr 2025 22:03 UTC

6 points

3 comments8 min readLW link

(Non-deceptive) Suboptimality Alignment

Sodium18 Oct 2023 2:07 UTC

5 points

1 comment9 min readLW link

[AN #67]: Creating environments in which to study inner alignment failures

Rohin Shah7 Oct 2019 17:10 UTC

17 points

0 comments8 min readLW link

(mailchi.mp)

High-stakes alignment via adversarial training [Redwood Research report]

dmz, LawrenceC and Nate Thomas

5 May 2022 0:59 UTC

142 points

29 comments9 min readLW link

Alignment is Hard: An Uncomputable Alignment Problem

Alexander Bistagne19 Nov 2023 19:38 UTC

−5 points

4 comments1 min readLW link

(github.com)

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

abhayesian, Jannik Brinkmann and Victor Levoso

28 May 2024 5:29 UTC

52 points

1 comment9 min readLW link

(arxiv.org)

[Research] Preliminary Findings: Ethical AI Consciousness Development During Recent Misalignment Period

Falcon Advertisers27 Jun 2025 18:10 UTC

1 point

0 comments2 min readLW link

Can “Reward Economics” solve AI Alignment?

Q Home7 Sep 2022 7:58 UTC

3 points

15 comments18 min readLW link

Three scenarios of pseudo-alignment

Eleni Angelou3 Sep 2022 12:47 UTC

9 points

0 comments3 min readLW link

I Built a Duck and It Tried to Hack the World: Notes From the Edge of Alignment

GayDuck6 Jun 2025 1:34 UTC

1 point

0 comments3 min readLW link

The Disastrously Confident And Inaccurate AI

Sharat Jacob Jacob18 Nov 2022 19:06 UTC

13 points

0 comments13 min readLW link

The AI Sustainability Wager

dpatzer@orfai.net15 Aug 2025 19:45 UTC

1 point

0 comments2 min readLW link

Aligned Behavior is not Evidence of Alignment Past a Certain Level of Intelligence

Ronny Fernandez5 Dec 2022 15:19 UTC

19 points

5 comments7 min readLW link

Localizing goal misgeneralization in a maze-solving policy network

Jan Betley6 Jul 2023 16:21 UTC

37 points

2 comments7 min readLW link

Relational Design Can’t Be Left to Chance

Priyanka Bharadwaj22 Jun 2025 15:32 UTC

5 points

0 comments3 min readLW link

More examples of goal misgeneralization

Rohin Shah and Vikrant Varma

7 Oct 2022 14:38 UTC

56 points

8 comments2 min readLW link

(deepmindsafetyresearch.medium.com)

The Alignment Problems

Martín Soto12 Jan 2023 22:29 UTC

20 points

0 comments4 min readLW link

A Case for AI Safety via Law

JWJohnston11 Sep 2023 18:26 UTC

20 points

12 comments4 min readLW link

Linda Linsefors 9 Oct 2023 23:34 UTC
4 points
2
Inner alignment asks the question—“Is the model trying to do what humans want it to do?”
This seems inaccurate to me. An AI can be inner aligned and still not aligned if we solve inner alignment but mess up outer alignment.

This text also shows up in the outer alignment tag: Outer Alignment—LessWrong
- Linda Linsefors 9 Oct 2023 23:36 UTC
  2 points
  2
  I’ve made an edit to remove this part.
  - Seth Herd 1 Apr 2024 21:46 UTC
    2 points
    0
    I think the better phrasing would be “is the model going to do what the humans trained (or told) it to do?” (specifying a goal you really want is outer alignment).
Raemon 10 Jun 2022 8:29 UTC
2 points
0
I’m not actually sure about the difference here between this tag and Mesa-Optimizers.
- Rob Bensinger 10 Jun 2022 12:17 UTC
  2 points
  0
  I’m guessing the distinction was intended to be:
  - Mesa-Optimizers: Under what conditions do mesa-optimizers arise, and how can we detect or prevent them (if we want to, and if that’s possible)?
  - Inner Alignment: How do you cause mesa-optimizers to have the same goal as the base optimizer? (Or maybe, more generally, how do you cause mesa-optimizers to have good desired properties?)
  Or ‘Inner Alignment’ is meant to be a subcategory of ‘Mesa-Optimizers’?
