Inner Alignment

Last edit: 9 Oct 2023 23:35 UTC by Linda Linsefors

Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?

More specifically, inner alignment is the problem of ensuring that mesa-optimizers (i.e., trained ML systems that are themselves optimizers) are aligned with the objective function of the training process.
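
As a minimal structural sketch of this setup (purely illustrative; the environment, names, and selection procedure below are hypothetical and not taken from the paper), the base optimizer selects among candidate policies by their score on the base objective, while the selected policy is itself a small optimizer that chooses actions by searching against its own internal mesa-objective:

```python
# Illustrative sketch of the base-objective / mesa-objective distinction.
# All names are hypothetical; this is a toy model, not code from the paper.

ACTIONS = ("left", "right", "up", "down")

def base_objective(action, state):
    """What the training process scores: stepping toward the goal."""
    return 1.0 if action == state["direction_to_goal"] else 0.0

class MesaOptimizer:
    """A learned policy that is itself an optimizer: it acts by searching
    over actions for the one its internal (mesa-) objective ranks highest."""

    def __init__(self, mesa_objective):
        self.mesa_objective = mesa_objective  # the learned internal objective

    def act(self, state):
        # Inner optimization step: argmax over actions under the mesa-objective.
        return max(ACTIONS, key=lambda action: self.mesa_objective(action, state))

def base_optimizer(candidate_mesa_objectives, training_states):
    """The outer training process: keep whichever candidate internal objective
    yields the best base-objective score on the training distribution."""
    def training_score(mesa_objective):
        policy = MesaOptimizer(mesa_objective)
        return sum(base_objective(policy.act(s), s) for s in training_states)
    return MesaOptimizer(max(candidate_mesa_objectives, key=training_score))
```

Inner alignment then asks whether the selected policy's mesa-objective actually matches the base objective in general, rather than merely agreeing with it on the training states used for selection.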

As an example, evolution is an optimization force that itself ‘designed’ optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success; instead, they use birth control while still attaining the pleasure that evolution meant as a reward for attempts at reproduction. This is a failure of inner alignment.

The term was first defined in the Hubinger et al. paper Risks from Learned Optimization:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

Goal misgeneralization due to distribution shift is another example of an inner alignment failure: the mesa-optimizer appears to pursue the base objective during training but does not pursue it during deployment. We mistakenly take good performance on the training distribution to mean that the mesa-optimizer is pursuing the base objective, when in fact correlations in the training distribution may have produced good performance on both the base and mesa objectives. A distribution shift from training to deployment breaks that correlation, and the mesa-objective fails to generalize. This is especially problematic when the system’s capabilities generalize to the deployment distribution while its goals do not, because we are then left with a capable system optimizing for a misaligned goal.
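
To make this failure mode concrete, here is a small self-contained toy example (purely illustrative; the gridworld and every name in it are hypothetical, loosely in the spirit of the CoinRun-style examples discussed in the goal misgeneralization literature). During training the coin always sits at the far right of the agent's starting row, so a policy whose internal goal is simply "go right" performs exactly as well as one that genuinely pursues the coin; once the coin's position is randomized at deployment, the "go right" policy still navigates competently but usually misses it:

```python
# Toy illustration of goal misgeneralization under distribution shift.
# The environment and all names are hypothetical, not from any cited paper.

import random

WIDTH, HEIGHT = 8, 8
MAX_STEPS = WIDTH + HEIGHT  # enough steps for the intended policy to reach any cell

def run_episode(policy, coin):
    """Base objective: does the agent ever land on the coin's cell?"""
    x, y = 0, 0
    for _ in range(MAX_STEPS):
        dx, dy = policy((x, y), coin)
        x = max(0, min(WIDTH - 1, x + dx))
        y = max(0, min(HEIGHT - 1, y + dy))
        if (x, y) == coin:
            return True
    return False

def intended_policy(pos, coin):
    """Pursues the base objective: step toward the coin, one axis at a time."""
    (x, y), (cx, cy) = pos, coin
    if x != cx:
        return (1 if cx > x else -1, 0)
    return (0, 1 if cy > y else -1)

def proxy_policy(pos, coin):
    """Mesa-objective 'go right': acts identically to the intended policy on the
    training distribution, where the coin is at the far right of the starting row."""
    return (1, 0)

def success_rate(policy, coins):
    return sum(run_episode(policy, c) for c in coins) / len(coins)

if __name__ == "__main__":
    random.seed(0)
    # Training distribution: coin always at (WIDTH - 1, 0), so "reach the coin"
    # and "go right" are perfectly correlated.
    train_coins = [(WIDTH - 1, 0)] * 200
    # Deployment distribution: coin anywhere on the grid; the correlation breaks.
    deploy_coins = [(random.randrange(WIDTH), random.randrange(HEIGHT))
                    for _ in range(200)]
    for name, policy in (("intended", intended_policy), ("proxy", proxy_policy)):
        print(f"{name:8s} train={success_rate(policy, train_coins):.2f} "
              f"deploy={success_rate(policy, deploy_coins):.2f}")
```

Run as a script, this prints a perfect success rate for both policies on the training distribution, while on the deployment distribution the intended policy keeps succeeding and the "go right" policy only succeeds when the coin happens to fall in its path (roughly one episode in eight here).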

Solving the inner alignment problem will require progress on sub-problems such as deceptive alignment, distribution shifts, and gradient hacking.

Inner Alignment vs. Outer Alignment

Inner alignment is often treated as separate from outer alignment: the former deals with guaranteeing that we are robustly aiming at something, while the latter deals with what exactly we are aiming at. For more information, see the corresponding tag.

Keep in mind that inner and outer alignment failures can occur together. The two are not a strict dichotomy, and even experienced alignment researchers are often unable to tell them apart, which suggests that classifying failures under these terms is fuzzy. Ideally, rather than treating inner and outer alignment as a binary split to be tackled separately, we should think in terms of a more holistic alignment picture that includes the interplay between inner and outer alignment approaches.

Related Pages:

Mesa-Optimization, Treacherous Turn, Eliciting Latent Knowledge, Deceptive Alignment, Deception

Ba­bies and Bun­nies: A Cau­tion About Evo-Psych

Alicorn22 Feb 2010 1:53 UTC
81 points
843 comments2 min readLW link

Ex­am­ples of AI’s be­hav­ing badly

Stuart_Armstrong16 Jul 2015 10:01 UTC
41 points
41 comments1 min readLW link

Open ques­tion: are min­i­mal cir­cuits dae­mon-free?

paulfchristiano5 May 2018 22:40 UTC
83 points
70 comments2 min readLW link1 review

Safely and use­fully spec­tat­ing on AIs op­ti­miz­ing over toy worlds

AlexMennen31 Jul 2018 18:30 UTC
24 points
16 comments2 min readLW link

Risks from Learned Op­ti­miza­tion: Introduction

31 May 2019 23:44 UTC
184 points
42 comments12 min readLW link3 reviews

The In­ner Align­ment Problem

4 Jun 2019 1:20 UTC
103 points
17 comments13 min readLW link

2-D Robustness

Vlad Mikulik30 Aug 2019 20:27 UTC
85 points
8 comments2 min readLW link

Con­crete ex­per­i­ments in in­ner alignment

evhub6 Sep 2019 22:16 UTC
71 points
12 comments6 min readLW link

Are min­i­mal cir­cuits de­cep­tive?

evhub7 Sep 2019 18:11 UTC
77 points
11 comments8 min readLW link

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhub10 Sep 2019 23:03 UTC
69 points
27 comments27 min readLW link

Towards an em­piri­cal in­ves­ti­ga­tion of in­ner alignment

evhub23 Sep 2019 20:43 UTC
44 points
9 comments6 min readLW link

A sim­ple en­vi­ron­ment for show­ing mesa misalignment

Matthew Barnett26 Sep 2019 4:44 UTC
71 points
9 comments2 min readLW link

[AN #67]: Creat­ing en­vi­ron­ments in which to study in­ner al­ign­ment failures

Rohin Shah7 Oct 2019 17:10 UTC
17 points
0 comments8 min readLW link
(mailchi.mp)

Gra­di­ent hacking

evhub16 Oct 2019 0:53 UTC
104 points
39 comments3 min readLW link2 reviews

Mal­ign gen­er­al­iza­tion with­out in­ter­nal search

Matthew Barnett12 Jan 2020 18:03 UTC
43 points
12 comments4 min readLW link

In­ner al­ign­ment re­quires mak­ing as­sump­tions about hu­man values

Matthew Barnett20 Jan 2020 18:38 UTC
26 points
9 comments4 min readLW link

AI Align­ment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC
126 points
6 comments35 min readLW link

De­mons in Im­perfect Search

johnswentworth11 Feb 2020 20:25 UTC
106 points
21 comments3 min readLW link

[Question] Does iter­ated am­plifi­ca­tion tackle the in­ner al­ign­ment prob­lem?

JanB15 Feb 2020 12:58 UTC
7 points
4 comments1 min readLW link

Tes­sel­lat­ing Hills: a toy model for demons in im­perfect search

DaemonicSigil20 Feb 2020 0:12 UTC
97 points
18 comments2 min readLW link

If I were a well-in­ten­tioned AI… IV: Mesa-optimising

Stuart_Armstrong2 Mar 2020 12:16 UTC
26 points
2 comments6 min readLW link

In­ner al­ign­ment in the brain

Steven Byrnes22 Apr 2020 13:14 UTC
79 points
16 comments16 min readLW link

An overview of 11 pro­pos­als for build­ing safe ad­vanced AI

evhub29 May 2020 20:38 UTC
205 points
36 comments38 min readLW link2 reviews

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus Astra1 Jul 2020 17:30 UTC
35 points
4 comments67 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven Byrnes10 Jul 2020 16:49 UTC
45 points
7 comments8 min readLW link

[Question] Why is pseudo-al­ign­ment “worse” than other ways ML can fail to gen­er­al­ize?

nostalgebraist18 Jul 2020 22:54 UTC
45 points
9 comments2 min readLW link

In­ner Align­ment: Ex­plain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC
179 points
46 comments13 min readLW link2 reviews

Matt Botv­inick on the spon­ta­neous emer­gence of learn­ing algorithms

Adam Scholl12 Aug 2020 7:47 UTC
153 points
87 comments5 min readLW link

Mesa-Search vs Mesa-Control

abramdemski18 Aug 2020 18:51 UTC
54 points
45 comments7 min readLW link

The Solomonoff Prior is Malign

Mark Xu14 Oct 2020 1:33 UTC
168 points
52 comments16 min readLW link3 reviews

“In­ner Align­ment Failures” Which Are Ac­tu­ally Outer Align­ment Failures

johnswentworth31 Oct 2020 20:18 UTC
66 points
38 comments5 min readLW link

Defin­ing ca­pa­bil­ity and al­ign­ment in gra­di­ent descent

Edouard Harris5 Nov 2020 14:36 UTC
22 points
6 comments10 min readLW link

Does SGD Pro­duce De­cep­tive Align­ment?

Mark Xu6 Nov 2020 23:48 UTC
96 points
9 comments16 min readLW link

In­ner Align­ment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
137 points
39 comments11 min readLW link2 reviews

AI Align­ment Us­ing Re­v­erse Simulation

Sven Nilsen12 Jan 2021 20:48 UTC
0 points
0 comments1 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments26 min readLW link

AXRP Epi­sode 4 - Risks from Learned Op­ti­miza­tion with Evan Hubinger

DanielFilan18 Feb 2021 0:03 UTC
43 points
10 comments87 min readLW link

For­mal Solu­tion to the In­ner Align­ment Problem

michaelcohen18 Feb 2021 14:51 UTC
49 points
123 comments2 min readLW link

Book re­view: “A Thou­sand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC
116 points
18 comments19 min readLW link

Against evolu­tion as an anal­ogy for how hu­mans will cre­ate AGI

Steven Byrnes23 Mar 2021 12:29 UTC
65 points
25 comments25 min readLW link

My AGI Threat Model: Misal­igned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
68 points
40 comments16 min readLW link

Gra­da­tions of In­ner Align­ment Obstacles

abramdemski20 Apr 2021 22:18 UTC
80 points
22 comments9 min readLW link

Pre-Train­ing + Fine-Tun­ing Fa­vors Deception

Mark Xu8 May 2021 18:36 UTC
27 points
3 comments3 min readLW link

For­mal In­ner Align­ment, Prospectus

abramdemski12 May 2021 19:57 UTC
95 points
57 comments16 min readLW link

Re­sponse to “What does the uni­ver­sal prior ac­tu­ally look like?”

michaelcohen20 May 2021 16:12 UTC
36 points
33 comments18 min readLW link

In­suffi­cient Values

16 Jun 2021 14:33 UTC
31 points
15 comments5 min readLW link

Em­piri­cal Ob­ser­va­tions of Ob­jec­tive Ro­bust­ness Failures

23 Jun 2021 23:23 UTC
63 points
5 comments9 min readLW link

Dis­cus­sion: Ob­jec­tive Ro­bust­ness and In­ner Align­ment Terminology

23 Jun 2021 23:25 UTC
73 points
7 comments9 min readLW link

Model-based RL, De­sires, Brains, Wireheading

Steven Byrnes14 Jul 2021 15:11 UTC
22 points
1 comment13 min readLW link

Re-Define In­tent Align­ment?

abramdemski22 Jul 2021 19:00 UTC
28 points
32 comments4 min readLW link

Ap­pli­ca­tions for De­con­fus­ing Goal-Directedness

adamShimi8 Aug 2021 13:05 UTC
38 points
3 comments5 min readLW link1 review

Ap­proaches to gra­di­ent hacking

adamShimi14 Aug 2021 15:16 UTC
16 points
8 comments8 min readLW link

Call for re­search on eval­u­at­ing al­ign­ment (fund­ing + ad­vice available)

Beth Barnes31 Aug 2021 23:28 UTC
105 points
11 comments5 min readLW link

Ob­sta­cles to gra­di­ent hacking

leogao5 Sep 2021 22:42 UTC
28 points
11 comments4 min readLW link

Selec­tion The­o­rems: A Pro­gram For Un­der­stand­ing Agents

johnswentworth28 Sep 2021 5:03 UTC
123 points
28 comments6 min readLW link2 reviews

[Question] Col­lec­tion of ar­gu­ments to ex­pect (outer and in­ner) al­ign­ment failure?

Sam Clarke28 Sep 2021 16:55 UTC
21 points
10 comments1 min readLW link

Meta learn­ing to gra­di­ent hack

Quintin Pope1 Oct 2021 19:25 UTC
55 points
11 comments3 min readLW link

The eval­u­a­tion func­tion of an AI is not its aim

Yair Halberstadt10 Oct 2021 14:52 UTC
13 points
5 comments3 min readLW link

Towards De­con­fus­ing Gra­di­ent Hacking

leogao24 Oct 2021 0:43 UTC
39 points
3 comments12 min readLW link

[Question] What ex­actly is GPT-3′s base ob­jec­tive?

Daniel Kokotajlo10 Nov 2021 0:57 UTC
60 points
14 comments2 min readLW link

The­o­ret­i­cal Neu­ro­science For Align­ment Theory

Cameron Berg7 Dec 2021 21:50 UTC
62 points
18 comments23 min readLW link

Un­der­stand­ing Gra­di­ent Hacking

peterbarnett10 Dec 2021 15:58 UTC
41 points
5 comments30 min readLW link

Fram­ing ap­proaches to al­ign­ment and the hard prob­lem of AI cognition

ryan_greenblatt15 Dec 2021 19:06 UTC
16 points
15 comments27 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
127 points
9 comments15 min readLW link

Ev­i­dence Sets: Towards In­duc­tive-Bi­ases based Anal­y­sis of Pro­saic AGI

bayesian_kitten16 Dec 2021 22:41 UTC
22 points
10 comments21 min readLW link

My Overview of the AI Align­ment Land­scape: Threat Models

Neel Nanda25 Dec 2021 23:07 UTC
52 points
3 comments28 min readLW link

Gra­di­ent Hack­ing via Schel­ling Goals

Adam Scherlis28 Dec 2021 20:38 UTC
33 points
4 comments4 min readLW link

Align­ment Prob­lems All the Way Down

peterbarnett22 Jan 2022 0:19 UTC
26 points
7 comments11 min readLW link

How com­plex are my­opic imi­ta­tors?

Vivek Hebbar8 Feb 2022 12:00 UTC
26 points
1 comment15 min readLW link

Ques­tion 2: Pre­dicted bad out­comes of AGI learn­ing architecture

Cameron Berg11 Feb 2022 22:23 UTC
5 points
1 comment10 min readLW link

[In­tro to brain-like-AGI safety] 10. The al­ign­ment problem

Steven Byrnes30 Mar 2022 13:24 UTC
48 points
6 comments19 min readLW link

Pro­ject In­tro: Selec­tion The­o­rems for Modularity

4 Apr 2022 12:59 UTC
71 points
20 comments16 min readLW link

Good­hart’s Law Causal Diagrams

11 Apr 2022 13:52 UTC
32 points
5 comments6 min readLW link

De­cep­tive Agents are a Good Way to Do Things

David Udell19 Apr 2022 18:04 UTC
16 points
0 comments1 min readLW link

Why No *In­ter­est­ing* Unal­igned Sin­gu­lar­ity?

David Udell20 Apr 2022 0:34 UTC
12 points
12 comments1 min readLW link

AI Alter­na­tive Fu­tures: Sce­nario Map­ping Ar­tifi­cial In­tel­li­gence Risk—Re­quest for Par­ti­ci­pa­tion (*Closed*)

Kakili27 Apr 2022 22:07 UTC
10 points
2 comments8 min readLW link

High-stakes al­ign­ment via ad­ver­sar­ial train­ing [Red­wood Re­search re­port]

5 May 2022 0:59 UTC
142 points
29 comments9 min readLW link

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
53 points
0 comments59 min readLW link

Clar­ify­ing the con­fu­sion around in­ner alignment

Rauno Arike13 May 2022 23:05 UTC
29 points
0 comments11 min readLW link

Why I’m Wor­ried About AI

peterbarnett23 May 2022 21:13 UTC
22 points
2 comments12 min readLW link

Ex­plain­ing in­ner al­ign­ment to myself

Jeremy Gillen24 May 2022 23:10 UTC
9 points
2 comments10 min readLW link

A Story of AI Risk: In­struc­tGPT-N

peterbarnett26 May 2022 23:22 UTC
24 points
0 comments8 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

27 Jun 2022 15:58 UTC
169 points
14 comments7 min readLW link

Doom doubts—is in­ner al­ign­ment a likely prob­lem?

Crissman28 Jun 2022 12:42 UTC
6 points
7 comments1 min readLW link

The cu­ri­ous case of Pretty Good hu­man in­ner/​outer alignment

PavleMiha5 Jul 2022 19:04 UTC
41 points
45 comments4 min readLW link

Outer vs in­ner mis­al­ign­ment: three framings

Richard_Ngo6 Jul 2022 19:46 UTC
49 points
5 comments9 min readLW link

Ac­cept­abil­ity Ver­ifi­ca­tion: A Re­search Agenda

12 Jul 2022 20:11 UTC
50 points
0 comments1 min readLW link
(docs.google.com)

Con­di­tion­ing Gen­er­a­tive Models for Alignment

Jozdien18 Jul 2022 7:11 UTC
58 points
8 comments20 min readLW link

Our Ex­ist­ing Solu­tions to AGI Align­ment (semi-safe)

Michael Soareverix21 Jul 2022 19:00 UTC
12 points
1 comment3 min readLW link

Re­ward is not the op­ti­miza­tion target

TurnTrout25 Jul 2022 0:03 UTC
348 points
123 comments10 min readLW link3 reviews

In­co­her­ence of un­bounded selfishness

emmab26 Jul 2022 22:27 UTC
−6 points
2 comments1 min readLW link

Com­par­ing Four Ap­proaches to In­ner Alignment

Lucas Teixeira29 Jul 2022 21:06 UTC
35 points
1 comment9 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tamera3 Aug 2022 12:03 UTC
130 points
23 comments6 min readLW link

Con­ver­gence Towards World-Models: A Gears-Level Model

Thane Ruthenis4 Aug 2022 23:31 UTC
38 points
1 comment13 min readLW link

How To Go From In­ter­pretabil­ity To Align­ment: Just Re­tar­get The Search

johnswentworth10 Aug 2022 16:08 UTC
179 points
33 comments3 min readLW link1 review

Gra­di­ent de­scent doesn’t se­lect for in­ner search

Ivan Vendrov13 Aug 2022 4:15 UTC
47 points
23 comments4 min readLW link

De­cep­tion as the op­ti­mal: mesa-op­ti­miz­ers and in­ner al­ign­ment

Eleni Angelou16 Aug 2022 4:49 UTC
11 points
0 comments5 min readLW link

Broad Pic­ture of Hu­man Values

Thane Ruthenis20 Aug 2022 19:42 UTC
42 points
6 comments10 min readLW link

Thoughts about OOD alignment

Catnee24 Aug 2022 15:31 UTC
11 points
10 comments2 min readLW link

Are Gen­er­a­tive World Models a Mesa-Op­ti­miza­tion Risk?

Thane Ruthenis29 Aug 2022 18:37 UTC
13 points
2 comments3 min readLW link

How likely is de­cep­tive al­ign­ment?

evhub30 Aug 2022 19:34 UTC
102 points
28 comments60 min readLW link

In­ner Align­ment via Superpowers

30 Aug 2022 20:01 UTC
37 points
13 comments4 min readLW link

Three sce­nar­ios of pseudo-al­ign­ment

Eleni Angelou3 Sep 2022 12:47 UTC
9 points
0 comments3 min readLW link

Fram­ing AI Childhoods

David Udell6 Sep 2022 23:40 UTC
37 points
8 comments4 min readLW link

Can “Re­ward Eco­nomics” solve AI Align­ment?

Q Home7 Sep 2022 7:58 UTC
3 points
15 comments18 min readLW link

The Defen­der’s Ad­van­tage of Interpretability

Marius Hobbhahn14 Sep 2022 14:05 UTC
41 points
4 comments6 min readLW link

Why de­cep­tive al­ign­ment mat­ters for AGI safety

Marius Hobbhahn15 Sep 2022 13:38 UTC
57 points
13 comments13 min readLW link

Levels of goals and alignment

zeshen16 Sep 2022 16:44 UTC
27 points
4 comments6 min readLW link

In­ner al­ign­ment: what are we point­ing at?

lukehmiles18 Sep 2022 11:09 UTC
14 points
2 comments1 min readLW link

Plan­ning ca­pac­ity and daemons

lukehmiles26 Sep 2022 0:15 UTC
2 points
0 comments5 min readLW link

LOVE in a sim­box is all you need

jacob_cannell28 Sep 2022 18:25 UTC
63 points
72 comments44 min readLW link1 review

More ex­am­ples of goal misgeneralization

7 Oct 2022 14:38 UTC
53 points
8 comments2 min readLW link
(deepmindsafetyresearch.medium.com)

Disen­tan­gling in­ner al­ign­ment failures

Erik Jenner10 Oct 2022 18:50 UTC
20 points
5 comments4 min readLW link

Greed Is the Root of This Evil

Thane Ruthenis13 Oct 2022 20:40 UTC
18 points
7 comments8 min readLW link

Science of Deep Learn­ing—a tech­ni­cal agenda

Marius Hobbhahn18 Oct 2022 14:54 UTC
36 points
7 comments4 min readLW link

What sorts of sys­tems can be de­cep­tive?

Andrei Alexandru31 Oct 2022 22:00 UTC
16 points
0 comments7 min readLW link

Clar­ify­ing AI X-risk

1 Nov 2022 11:03 UTC
127 points
24 comments4 min readLW link1 review

Threat Model Liter­a­ture Review

1 Nov 2022 11:03 UTC
74 points
4 comments25 min readLW link

[Question] I there a demo of “You can’t fetch the coffee if you’re dead”?

Ram Rachum10 Nov 2022 18:41 UTC
8 points
9 comments1 min readLW link

Value For­ma­tion: An Over­ar­ch­ing Model

Thane Ruthenis15 Nov 2022 17:16 UTC
34 points
20 comments34 min readLW link

The Disas­trously Con­fi­dent And Inac­cu­rate AI

Sharat Jacob Jacob18 Nov 2022 19:06 UTC
13 points
0 comments13 min readLW link

Don’t al­ign agents to eval­u­a­tions of plans

TurnTrout26 Nov 2022 21:16 UTC
42 points
49 comments18 min readLW link

Search­ing for Search

28 Nov 2022 15:31 UTC
86 points
7 comments14 min readLW link1 review

In­ner and outer al­ign­ment de­com­pose one hard prob­lem into two ex­tremely hard problems

TurnTrout2 Dec 2022 2:43 UTC
139 points
22 comments47 min readLW link3 reviews

Aligned Be­hav­ior is not Ev­i­dence of Align­ment Past a Cer­tain Level of Intelligence

Ronny Fernandez5 Dec 2022 15:19 UTC
19 points
5 comments7 min readLW link

Mesa-Op­ti­miz­ers via Grokking

orthonormal6 Dec 2022 20:05 UTC
36 points
4 comments6 min readLW link

Take 8: Queer the in­ner/​outer al­ign­ment di­chotomy.

Charlie Steiner9 Dec 2022 17:46 UTC
28 points
2 comments2 min readLW link

Refram­ing in­ner alignment

davidad11 Dec 2022 13:53 UTC
53 points
13 comments4 min readLW link

In Defense of Wrap­per-Minds

Thane Ruthenis28 Dec 2022 18:28 UTC
23 points
38 comments3 min readLW link

Cat­e­go­riz­ing failures as “outer” or “in­ner” mis­al­ign­ment is of­ten confused

Rohin Shah6 Jan 2023 15:48 UTC
86 points
21 comments8 min readLW link

The Align­ment Problems

Martín Soto12 Jan 2023 22:29 UTC
19 points
0 comments4 min readLW link

Disen­tan­gling Shard The­ory into Atomic Claims

Leon Lang13 Jan 2023 4:23 UTC
85 points
6 comments18 min readLW link

Gra­di­ent Filtering

18 Jan 2023 20:09 UTC
54 points
16 comments13 min readLW link

Some of my dis­agree­ments with List of Lethalities

TurnTrout24 Jan 2023 0:25 UTC
68 points
7 comments10 min readLW link

Gra­di­ent hack­ing is ex­tremely difficult

beren24 Jan 2023 15:45 UTC
161 points
22 comments5 min readLW link

Med­i­cal Image Regis­tra­tion: The ob­scure field where Deep Me­saop­ti­miz­ers are already at the top of the bench­marks. (post + co­lab note­book)

Hastings30 Jan 2023 22:46 UTC
23 points
0 comments3 min readLW link

In­ner Misal­ign­ment in “Si­mu­la­tor” LLMs

Adam Scherlis31 Jan 2023 8:33 UTC
84 points
11 comments4 min readLW link

Ano­ma­lous to­kens re­veal the origi­nal iden­tities of In­struct models

9 Feb 2023 1:30 UTC
137 points
16 comments9 min readLW link
(generative.ink)

Why al­most ev­ery RL agent does learned optimization

Lee Sharkey12 Feb 2023 4:58 UTC
32 points
3 comments5 min readLW link

The Lin­guis­tic Blind Spot of Value-Aligned Agency, Nat­u­ral and Ar­tifi­cial

Roman Leventov14 Feb 2023 6:57 UTC
6 points
0 comments2 min readLW link
(arxiv.org)

Is there a ML agent that aban­dons it’s util­ity func­tion out-of-dis­tri­bu­tion with­out los­ing ca­pa­bil­ities?

Christopher King22 Feb 2023 16:49 UTC
1 point
7 comments1 min readLW link

Un­der­stand­ing and con­trol­ling a maze-solv­ing policy network

11 Mar 2023 18:59 UTC
312 points
22 comments23 min readLW link

Clar­ify­ing mesa-optimization

21 Mar 2023 15:53 UTC
37 points
6 comments10 min readLW link

Are ex­trap­o­la­tion-based AIs al­ignable?

cousin_it24 Mar 2023 15:55 UTC
22 points
15 comments1 min readLW link

Aligned AI as a wrap­per around an LLM

cousin_it25 Mar 2023 15:58 UTC
31 points
19 comments1 min readLW link

Towards a solu­tion to the al­ign­ment prob­lem via ob­jec­tive de­tec­tion and eval­u­a­tion

Paul Colognese12 Apr 2023 15:39 UTC
9 points
7 comments12 min readLW link

Pro­posal: Us­ing Monte Carlo tree search in­stead of RLHF for al­ign­ment research

Christopher King20 Apr 2023 19:57 UTC
2 points
7 comments3 min readLW link

A con­cise sum-up of the ba­sic ar­gu­ment for AI doom

Mergimio H. Doefevmil24 Apr 2023 17:37 UTC
11 points
6 comments2 min readLW link

Archety­pal Trans­fer Learn­ing: a Pro­posed Align­ment Solu­tion that solves the In­ner & Outer Align­ment Prob­lem while adding Cor­rigible Traits to GPT-2-medium

MiguelDev26 Apr 2023 1:37 UTC
14 points
5 comments10 min readLW link

Try­ing to mea­sure AI de­cep­tion ca­pa­bil­ities us­ing tem­po­rary simu­la­tion fine-tuning

alenoach4 May 2023 17:59 UTC
4 points
0 comments7 min readLW link

My preferred fram­ings for re­ward mis­speci­fi­ca­tion and goal misgeneralisation

Yi-Yang6 May 2023 4:48 UTC
24 points
1 comment8 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke Hayashi6 May 2023 17:55 UTC
9 points
6 comments2 min readLW link

Re­ward is the op­ti­miza­tion tar­get (of ca­pa­bil­ities re­searchers)

Max H15 May 2023 3:22 UTC
32 points
4 comments5 min readLW link

Sim­ple ex­per­i­ments with de­cep­tive alignment

Andreas_Moe15 May 2023 17:41 UTC
7 points
0 comments4 min readLW link

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of a GridWorld Agent-Si­mu­la­tor (Part 1 of N)

Joseph Bloom16 May 2023 22:59 UTC
36 points
2 comments16 min readLW link

We Shouldn’t Ex­pect AI to Ever be Fully Rational

OneManyNone18 May 2023 17:09 UTC
19 points
31 comments6 min readLW link

The Goal Mis­gen­er­al­iza­tion Problem

Myspy18 May 2023 23:40 UTC
1 point
0 comments1 min readLW link
(drive.google.com)

[Question] Is “brit­tle al­ign­ment” good enough?

the8thbit23 May 2023 17:35 UTC
9 points
5 comments3 min readLW link

Two ideas for al­ign­ment, per­pet­ual mu­tual dis­trust and induction

APaleBlueDot25 May 2023 0:56 UTC
1 point
2 comments4 min readLW link

how hu­mans are aligned

bhauth26 May 2023 0:09 UTC
14 points
3 comments1 min readLW link

Lan­guage Agents Re­duce the Risk of Ex­is­ten­tial Catastrophe

28 May 2023 19:10 UTC
30 points
14 comments26 min readLW link

Shut­down-Seek­ing AI

Simon Goldstein31 May 2023 22:19 UTC
48 points
31 comments15 min readLW link

How will they feed us

meijer19731 Jun 2023 8:49 UTC
4 points
3 comments5 min readLW link

Why “AI al­ign­ment” would bet­ter be re­named into “Ar­tifi­cial In­ten­tion re­search”

chaosmage15 Jun 2023 10:32 UTC
29 points
12 comments2 min readLW link

A Mul­tidis­ci­plinary Ap­proach to Align­ment (MATA) and Archety­pal Trans­fer Learn­ing (ATL)

MiguelDev19 Jun 2023 2:32 UTC
4 points
2 comments7 min readLW link

Lo­cal­iz­ing goal mis­gen­er­al­iza­tion in a maze-solv­ing policy network

jan betley6 Jul 2023 16:21 UTC
37 points
2 comments7 min readLW link

Win­ners of AI Align­ment Awards Re­search Contest

13 Jul 2023 16:14 UTC
114 points
3 comments12 min readLW link
(alignmentawards.com)

Sim­ple al­ign­ment plan that maybe works

Iknownothing18 Jul 2023 22:48 UTC
4 points
8 comments1 min readLW link

Visi­ble loss land­scape bas­ins don’t cor­re­spond to dis­tinct algorithms

Mikhail Samin28 Jul 2023 16:19 UTC
65 points
13 comments4 min readLW link

En­hanc­ing Cor­rigi­bil­ity in AI Sys­tems through Ro­bust Feed­back Loops

Justausername24 Aug 2023 3:53 UTC
1 point
0 comments6 min readLW link

Mesa-Op­ti­miza­tion: Ex­plain it like I’m 10 Edition

brook26 Aug 2023 23:04 UTC
20 points
1 comment6 min readLW link

A Case for AI Safety via Law

JWJohnston11 Sep 2023 18:26 UTC
17 points
12 comments4 min readLW link

High-level in­ter­pretabil­ity: de­tect­ing an AI’s objectives

28 Sep 2023 19:30 UTC
69 points
4 comments21 min readLW link

Steer­ing sub­sys­tems: ca­pa­bil­ities, agency, and alignment

Seth Herd29 Sep 2023 13:45 UTC
22 points
0 comments8 min readLW link

(Non-de­cep­tive) Subop­ti­mal­ity Alignment

Sodium18 Oct 2023 2:07 UTC
3 points
1 comment8 min readLW link

The (par­tial) fal­lacy of dumb superintelligence

Seth Herd18 Oct 2023 21:25 UTC
27 points
5 comments4 min readLW link

In­ter­nal Tar­get In­for­ma­tion for AI Oversight

Paul Colognese20 Oct 2023 14:53 UTC
15 points
0 comments5 min readLW link

Thoughts On (Solv­ing) Deep Deception

Jozdien21 Oct 2023 22:40 UTC
66 points
2 comments6 min readLW link

AI Align­ment: A Com­pre­hen­sive Survey

Stephen McAleer1 Nov 2023 17:35 UTC
15 points
1 comment1 min readLW link
(arxiv.org)

Open-ended ethics of phe­nom­ena (a desider­ata with uni­ver­sal moral­ity)

Ryo 8 Nov 2023 20:10 UTC
1 point
0 comments8 min readLW link

​​ Open-ended/​Phenom­e­nal ​Ethics ​(TLDR)

Ryo 9 Nov 2023 16:58 UTC
3 points
0 comments1 min readLW link

We have promis­ing al­ign­ment plans with low taxes

Seth Herd10 Nov 2023 18:51 UTC
30 points
9 comments5 min readLW link

Op­tion­al­ity ap­proach to ethics

Ryo 13 Nov 2023 15:23 UTC
7 points
2 comments3 min readLW link

Why small phe­nomenons are rele­vant to moral­ity ​

Ryo 13 Nov 2023 15:25 UTC
1 point
0 comments3 min readLW link

Is In­ter­pretabil­ity All We Need?

RogerDearnaley14 Nov 2023 5:31 UTC
1 point
1 comment1 min readLW link

Align­ment is Hard: An Un­com­putable Align­ment Problem

Alexander Bistagne19 Nov 2023 19:38 UTC
−5 points
4 comments1 min readLW link
(github.com)

In­tro­duc­tion and cur­rent re­search agenda

quila20 Nov 2023 12:42 UTC
27 points
1 comment1 min readLW link

Re­ac­tion to “Em­pow­er­ment is (al­most) All We Need” : an open-ended alternative

Ryo 25 Nov 2023 15:35 UTC
9 points
3 comments5 min readLW link

How to Con­trol an LLM’s Be­hav­ior (why my P(DOOM) went down)

RogerDearnaley28 Nov 2023 19:56 UTC
64 points
30 comments11 min readLW link

Tak­ing Into Ac­count Sen­tient Non-Hu­mans in AI Am­bi­tious Value Learn­ing: Sen­tien­tist Co­her­ent Ex­trap­o­lated Volition

Adrià Moret2 Dec 2023 14:07 UTC
26 points
31 comments42 min readLW link

Lan­guage Model Me­moriza­tion, Copy­right Law, and Con­di­tional Pre­train­ing Alignment

RogerDearnaley7 Dec 2023 6:14 UTC
3 points
0 comments11 min readLW link

Re­sults from the Tur­ing Sem­i­nar hackathon

7 Dec 2023 14:50 UTC
29 points
1 comment6 min readLW link

Colour ver­sus Shape Goal Mis­gen­er­al­iza­tion in Re­in­force­ment Learn­ing: A Case Study

Karolis Ramanauskas8 Dec 2023 13:18 UTC
13 points
1 comment4 min readLW link
(arxiv.org)

Re­fusal mechanisms: ini­tial ex­per­i­ments with Llama-2-7b-chat

8 Dec 2023 17:08 UTC
79 points
7 comments7 min readLW link

A Kind­ness, or The Inevitable Con­se­quence of Perfect In­fer­ence (a short story)

samhealy12 Dec 2023 23:03 UTC
6 points
0 comments9 min readLW link

Strik­ing Im­pli­ca­tions for Learn­ing The­ory, In­ter­pretabil­ity — and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC
35 points
4 comments2 min readLW link

Mo­ti­vat­ing Align­ment of LLM-Pow­ered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley11 Jan 2024 12:56 UTC
22 points
4 comments39 min readLW link

Goals se­lected from learned knowl­edge: an al­ter­na­tive to RL alignment

Seth Herd15 Jan 2024 21:52 UTC
39 points
17 comments7 min readLW link

Without fun­da­men­tal ad­vances, mis­al­ign­ment and catas­tro­phe are the de­fault out­comes of train­ing pow­er­ful AI

26 Jan 2024 7:22 UTC
159 points
60 comments57 min readLW link

How to train your own “Sleeper Agents”

evhub7 Feb 2024 0:31 UTC
90 points
7 comments2 min readLW link

The Ideal Speech Si­tu­a­tion as a Tool for AI Eth­i­cal Reflec­tion: A Frame­work for Alignment

kenneth myers9 Feb 2024 18:40 UTC
6 points
12 comments3 min readLW link

Thank you for trig­ger­ing me

Cissy12 Feb 2024 20:09 UTC
4 points
1 comment6 min readLW link
(www.moremyself.xyz)

Achiev­ing AI Align­ment through De­liber­ate Uncer­tainty in Mul­ti­a­gent Systems

Florian_Dietz17 Feb 2024 8:45 UTC
3 points
0 comments13 min readLW link

Difficulty classes for al­ign­ment properties

Jozdien20 Feb 2024 9:08 UTC
32 points
5 comments2 min readLW link

Notes on In­ter­nal Ob­jec­tives in Toy Models of Agents

Paul Colognese22 Feb 2024 8:02 UTC
16 points
0 comments8 min readLW link

The In­ner Align­ment Problem

Jakub Halmeš24 Feb 2024 17:55 UTC
1 point
1 comment3 min readLW link
(jakubhalmes.substack.com)

Align­ment in Thought Chains

Faust Nemesis4 Mar 2024 19:24 UTC
1 point
0 comments2 min readLW link

A con­ver­sa­tion with Claude3 about its consciousness

rife5 Mar 2024 19:44 UTC
−2 points
3 comments1 min readLW link
(i.imgur.com)

A Re­view of Weak to Strong Gen­er­al­iza­tion [AI Safety Camp]

sevdeawesome7 Mar 2024 17:16 UTC
9 points
0 comments9 min readLW link

In­vi­ta­tion to the Prince­ton AI Align­ment and Safety Seminar

Sadhika Malladi17 Mar 2024 1:10 UTC
6 points
1 comment1 min readLW link

On the Con­fu­sion be­tween In­ner and Outer Misalignment

Chris_Leong25 Mar 2024 11:59 UTC
17 points
10 comments1 min readLW link