In­ner Alignment

TagLast edit: 4 May 2023 22:16 UTC by markov

Inner alignment asks the question—“Is the model trying to do what humans want it to do?”, or in other words can we robustly aim our AI optimizers at any objective function at all?

More specifically, Inner Alignment is the problem of ensuring mesa-optimizers (i.e. when a trained ML system is itself an optimizer) are aligned with the objective function of the training process.

As an example, evolution is an optimization force that itself ‘designed’ optimizers (humans) to achieve its goals. However, humans do not primarily maximize reproductive success, they instead use birth control while still attaining the pleasure that evolution meant as a reward for attempts at reproduction. This is a failure of inner alignment.

The term was first given a definition in the Hubinger et al paper Risk from Learned Optimization:

We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

Goal misgeneralization due to distribution shift is another example of an inner alignment failure. It is when the mesa-objective appears to pursue the base objective during training but does not pursue it during deployment. We mistakenly think that good performance on the training distribution means that the mesa-optimizer is pursuing the base objective. However, this might have occurred only because there were some correlations in the training distribution resulting in good performance on both the base and mesa objectives. When we had a distribution shift from training to deployment it caused the correlation to be broken and the mesa-objective failed to generalize. This is especially problematic when the capabilities successfully generalize to the deployment distribution while the objectives/​goals don’t. Since now we have a capable system that is optimizing for a misaligned goal.

To solve the inner alignment problem, some sub-problems that we would have to make progress on include things such as deceptive alignment, distribution shifts, and gradient hacking.

Inner Alignment Vs. Outer Alignment

Inner alignment is often talked about as being separate from outer alignment. The former deals with working on guaranteeing that we are robustly aiming at something, and the latter deals with the problem of what exactly are we aiming at. For more information see the corresponding tag.

It should be kept in mind that you can have both inner and outer alignment failures together. It is not a dichotomy and often even experienced alignment researchers are unable to tell them apart. This indicates that the classifications of failures according to these terms are fuzzy. Ideally, we don’t think of a binary dichotomy of inner and outer alignment that can be tackled individually but of a more holistic alignment picture that includes the interplay between both inner and outer alignment approaches.

Related Pages:

Mesa-Optimization, Treacherous Turn, Eliciting Latent Knowledge, Deceptive Alignment, Deception

External Links:

The In­ner Align­ment Problem

4 Jun 2019 1:20 UTC
101 points
17 comments13 min readLW link

Risks from Learned Op­ti­miza­tion: Introduction

31 May 2019 23:44 UTC
177 points
42 comments12 min readLW link3 reviews

In­ner Align­ment: Ex­plain like I’m 12 Edition

Rafael Harth1 Aug 2020 15:24 UTC
178 points
46 comments13 min readLW link2 reviews

De­mons in Im­perfect Search

johnswentworth11 Feb 2020 20:25 UTC
105 points
21 comments3 min readLW link

Mesa-Search vs Mesa-Control

abramdemski18 Aug 2020 18:51 UTC
54 points
45 comments7 min readLW link

Search­ing for Search

28 Nov 2022 15:31 UTC
80 points
6 comments14 min readLW link

Re­ward is not the op­ti­miza­tion target

TurnTrout25 Jul 2022 0:03 UTC
291 points
109 comments10 min readLW link

Open ques­tion: are min­i­mal cir­cuits dae­mon-free?

paulfchristiano5 May 2018 22:40 UTC
82 points
70 comments2 min readLW link1 review

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhub10 Sep 2019 23:03 UTC
64 points
27 comments1 min readLW link

The Solomonoff Prior is Malign

Mark Xu14 Oct 2020 1:33 UTC
157 points
52 comments16 min readLW link3 reviews

Why al­most ev­ery RL agent does learned optimization

Lee Sharkey12 Feb 2023 4:58 UTC
32 points
3 comments5 min readLW link

Con­crete ex­per­i­ments in in­ner alignment

evhub6 Sep 2019 22:16 UTC
64 points
12 comments6 min readLW link

Matt Botv­inick on the spon­ta­neous emer­gence of learn­ing algorithms

Adam Scholl12 Aug 2020 7:47 UTC
153 points
87 comments5 min readLW link

Book re­view: “A Thou­sand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC
111 points
18 comments19 min readLW link

Outer vs in­ner mis­al­ign­ment: three framings

Richard_Ngo6 Jul 2022 19:46 UTC
48 points
4 comments9 min readLW link

Are min­i­mal cir­cuits de­cep­tive?

evhub7 Sep 2019 18:11 UTC
67 points
11 comments8 min readLW link

Gra­di­ent hacking

evhub16 Oct 2019 0:53 UTC
104 points
39 comments3 min readLW link2 reviews

Tes­sel­lat­ing Hills: a toy model for demons in im­perfect search

DaemonicSigil20 Feb 2020 0:12 UTC
93 points
18 comments2 min readLW link

Em­piri­cal Ob­ser­va­tions of Ob­jec­tive Ro­bust­ness Failures

23 Jun 2021 23:23 UTC
63 points
5 comments9 min readLW link

Dis­cus­sion: Ob­jec­tive Ro­bust­ness and In­ner Align­ment Terminology

23 Jun 2021 23:25 UTC
73 points
7 comments9 min readLW link

Ques­tion 2: Pre­dicted bad out­comes of AGI learn­ing architecture

Cameron Berg11 Feb 2022 22:23 UTC
5 points
1 comment10 min readLW link

Mal­ign gen­er­al­iza­tion with­out in­ter­nal search

Matthew Barnett12 Jan 2020 18:03 UTC
43 points
12 comments4 min readLW link

The­o­ret­i­cal Neu­ro­science For Align­ment Theory

Cameron Berg7 Dec 2021 21:50 UTC
62 points
18 comments23 min readLW link

Defin­ing ca­pa­bil­ity and al­ign­ment in gra­di­ent descent

Edouard Harris5 Nov 2020 14:36 UTC
22 points
6 comments10 min readLW link

Some of my dis­agree­ments with List of Lethalities

TurnTrout24 Jan 2023 0:25 UTC
68 points
7 comments10 min readLW link

[Question] Why is pseudo-al­ign­ment “worse” than other ways ML can fail to gen­er­al­ize?

nostalgebraist18 Jul 2020 22:54 UTC
45 points
9 comments2 min readLW link

Good­hart’s Law Causal Diagrams

11 Apr 2022 13:52 UTC
32 points
4 comments6 min readLW link

Does SGD Pro­duce De­cep­tive Align­ment?

Mark Xu6 Nov 2020 23:48 UTC
87 points
9 comments16 min readLW link

In­ner Misal­ign­ment in “Si­mu­la­tor” LLMs

Adam Scherlis31 Jan 2023 8:33 UTC
84 points
11 comments4 min readLW link

Ano­ma­lous to­kens re­veal the origi­nal iden­tities of In­struct models

9 Feb 2023 1:30 UTC
131 points
15 comments9 min readLW link

In­ner Align­ment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
137 points
39 comments11 min readLW link2 reviews

[Question] Does iter­ated am­plifi­ca­tion tackle the in­ner al­ign­ment prob­lem?

JanBrauner15 Feb 2020 12:58 UTC
7 points
4 comments1 min readLW link

Un­der­stand­ing and con­trol­ling a maze-solv­ing policy network

11 Mar 2023 18:59 UTC
302 points
23 comments23 min readLW link

Clar­ify­ing the con­fu­sion around in­ner alignment

Rauno Arike13 May 2022 23:05 UTC
27 points
0 comments11 min readLW link

Clar­ify­ing mesa-optimization

21 Mar 2023 15:53 UTC
36 points
6 comments10 min readLW link

Ex­plain­ing in­ner al­ign­ment to myself

Jeremy Gillen24 May 2022 23:10 UTC
9 points
2 comments10 min readLW link

AXRP Epi­sode 4 - Risks from Learned Op­ti­miza­tion with Evan Hubinger

DanielFilan18 Feb 2021 0:03 UTC
41 points
10 comments86 min readLW link

Against evolu­tion as an anal­ogy for how hu­mans will cre­ate AGI

Steven Byrnes23 Mar 2021 12:29 UTC
48 points
25 comments25 min readLW link

My AGI Threat Model: Misal­igned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC
68 points
40 comments16 min readLW link

Gra­da­tions of In­ner Align­ment Obstacles

abramdemski20 Apr 2021 22:18 UTC
80 points
22 comments9 min readLW link

Pre-Train­ing + Fine-Tun­ing Fa­vors Deception

Mark Xu8 May 2021 18:36 UTC
27 points
3 comments3 min readLW link

For­mal In­ner Align­ment, Prospectus

abramdemski12 May 2021 19:57 UTC
91 points
57 comments16 min readLW link

Re-Define In­tent Align­ment?

abramdemski22 Jul 2021 19:00 UTC
27 points
32 comments4 min readLW link

Mesa-Op­ti­miz­ers vs “Steered Op­ti­miz­ers”

Steven Byrnes10 Jul 2020 16:49 UTC
45 points
7 comments8 min readLW link

How To Go From In­ter­pretabil­ity To Align­ment: Just Re­tar­get The Search

johnswentworth10 Aug 2022 16:08 UTC
157 points
32 comments3 min readLW link

Lan­guage Agents Re­duce the Risk of Ex­is­ten­tial Catastrophe

28 May 2023 19:10 UTC
24 points
12 comments26 min readLW link

Model-based RL, De­sires, Brains, Wireheading

Steven Byrnes14 Jul 2021 15:11 UTC
18 points
1 comment13 min readLW link

Com­par­ing Four Ap­proaches to In­ner Alignment

Lucas Teixeira29 Jul 2022 21:06 UTC
35 points
1 comment9 min readLW link

Ap­pli­ca­tions for De­con­fus­ing Goal-Directedness

adamShimi8 Aug 2021 13:05 UTC
36 points
3 comments5 min readLW link1 review

Ap­proaches to gra­di­ent hacking

adamShimi14 Aug 2021 15:16 UTC
16 points
8 comments8 min readLW link

How likely is de­cep­tive al­ign­ment?

evhub30 Aug 2022 19:34 UTC
96 points
22 comments60 min readLW link

In­ner al­ign­ment in the brain

Steven Byrnes22 Apr 2020 13:14 UTC
77 points
16 comments16 min readLW link

Selec­tion The­o­rems: A Pro­gram For Un­der­stand­ing Agents

johnswentworth28 Sep 2021 5:03 UTC
119 points
27 comments6 min readLW link2 reviews

[Question] Col­lec­tion of ar­gu­ments to ex­pect (outer and in­ner) al­ign­ment failure?

Sam Clarke28 Sep 2021 16:55 UTC
20 points
10 comments1 min readLW link

Fram­ing ap­proaches to al­ign­ment and the hard prob­lem of AI cognition

ryan_greenblatt15 Dec 2021 19:06 UTC
16 points
15 comments27 min readLW link

If I were a well-in­ten­tioned AI… IV: Mesa-optimising

Stuart_Armstrong2 Mar 2020 12:16 UTC
26 points
2 comments6 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
124 points
9 comments15 min readLW link

My Overview of the AI Align­ment Land­scape: Threat Models

Neel Nanda25 Dec 2021 23:07 UTC
51 points
3 comments28 min readLW link

AI Align­ment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC
126 points
6 comments35 min readLW link

Towards an em­piri­cal in­ves­ti­ga­tion of in­ner alignment

evhub23 Sep 2019 20:43 UTC
44 points
9 comments6 min readLW link

Evan Hub­inger on In­ner Align­ment, Outer Align­ment, and Pro­pos­als for Build­ing Safe Ad­vanced AI

Palus Astra1 Jul 2020 17:30 UTC
35 points
4 comments67 min readLW link

Don’t al­ign agents to eval­u­a­tions of plans

TurnTrout26 Nov 2022 21:16 UTC
44 points
48 comments18 min readLW link

In­ner and outer al­ign­ment de­com­pose one hard prob­lem into two ex­tremely hard problems

TurnTrout2 Dec 2022 2:43 UTC
103 points
18 comments47 min readLW link

In­ner al­ign­ment re­quires mak­ing as­sump­tions about hu­man values

Matthew Barnett20 Jan 2020 18:38 UTC
26 points
9 comments4 min readLW link

Mesa-Op­ti­miz­ers via Grokking

orthonormal6 Dec 2022 20:05 UTC
36 points
4 comments6 min readLW link

Take 8: Queer the in­ner/​outer al­ign­ment di­chotomy.

Charlie Steiner9 Dec 2022 17:46 UTC
28 points
2 comments2 min readLW link

Refram­ing in­ner alignment

davidad11 Dec 2022 13:53 UTC
49 points
13 comments4 min readLW link

[In­tro to brain-like-AGI safety] 10. The al­ign­ment problem

Steven Byrnes30 Mar 2022 13:24 UTC
48 points
4 comments19 min readLW link

An overview of 11 pro­pos­als for build­ing safe ad­vanced AI

evhub29 May 2020 20:38 UTC
206 points
36 comments38 min readLW link2 reviews

Cat­e­go­riz­ing failures as “outer” or “in­ner” mis­al­ign­ment is of­ten confused

Rohin Shah6 Jan 2023 15:48 UTC
85 points
21 comments8 min readLW link

In Defense of Wrap­per-Minds

Thane Ruthenis28 Dec 2022 18:28 UTC
26 points
34 comments3 min readLW link

The Align­ment Problems

Martín Soto12 Jan 2023 22:29 UTC
19 points
0 comments4 min readLW link

Gra­di­ent Filtering

18 Jan 2023 20:09 UTC
54 points
16 comments13 min readLW link

Gra­di­ent hack­ing is ex­tremely difficult

beren24 Jan 2023 15:45 UTC
146 points
19 comments5 min readLW link

Med­i­cal Image Regis­tra­tion: The ob­scure field where Deep Me­saop­ti­miz­ers are already at the top of the bench­marks. (post + co­lab note­book)

Hastings30 Jan 2023 22:46 UTC
19 points
0 comments3 min readLW link

The Lin­guis­tic Blind Spot of Value-Aligned Agency, Nat­u­ral and Ar­tifi­cial

Roman Leventov14 Feb 2023 6:57 UTC
6 points
0 comments2 min readLW link

Is there a ML agent that aban­dons it’s util­ity func­tion out-of-dis­tri­bu­tion with­out los­ing ca­pa­bil­ities?

Christopher King22 Feb 2023 16:49 UTC
1 point
7 comments1 min readLW link

Aligned AI as a wrap­per around an LLM

cousin_it25 Mar 2023 15:58 UTC
27 points
19 comments1 min readLW link

Are ex­trap­o­la­tion-based AIs al­ignable?

cousin_it24 Mar 2023 15:55 UTC
22 points
15 comments1 min readLW link

Towards a solu­tion to the al­ign­ment prob­lem via ob­jec­tive de­tec­tion and eval­u­a­tion

Paul Colognese12 Apr 2023 15:39 UTC
9 points
7 comments12 min readLW link

Pro­posal: Us­ing Monte Carlo tree search in­stead of RLHF for al­ign­ment research

Christopher King20 Apr 2023 19:57 UTC
2 points
7 comments3 min readLW link

A con­cise sum-up of the ba­sic ar­gu­ment for AI doom

Mergimio H. Doefevmil24 Apr 2023 17:37 UTC
11 points
6 comments2 min readLW link

Archety­pal Trans­fer Learn­ing: a Pro­posed Align­ment Solu­tion that solves the In­ner & Outer Align­ment Prob­lem while adding Cor­rigible Traits to GPT-2-medium

MiguelDev26 Apr 2023 1:37 UTC
12 points
5 comments10 min readLW link

Try­ing to mea­sure AI de­cep­tion ca­pa­bil­ities us­ing tem­po­rary simu­la­tion fine-tuning

alenoach4 May 2023 17:59 UTC
4 points
0 comments7 min readLW link

My preferred fram­ings for re­ward mis­speci­fi­ca­tion and goal misgeneralisation

Yi-Yang6 May 2023 4:48 UTC
24 points
1 comment8 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke Hayashi6 May 2023 17:55 UTC
9 points
6 comments2 min readLW link

Re­ward is the op­ti­miza­tion tar­get (of ca­pa­bil­ities re­searchers)

Max H15 May 2023 3:22 UTC
24 points
4 comments5 min readLW link

Sim­ple ex­per­i­ments with de­cep­tive alignment

Andreas_Moe15 May 2023 17:41 UTC
7 points
0 comments4 min readLW link

The Goal Mis­gen­er­al­iza­tion Problem

Myspy18 May 2023 23:40 UTC
1 point
0 comments1 min readLW link

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of a GridWorld Agent-Si­mu­la­tor (Part 1 of N)

Joseph Bloom16 May 2023 22:59 UTC
36 points
2 comments16 min readLW link

We Shouldn’t Ex­pect Su­per­in­tel­li­gent AI to be Fully Rational

OneManyNone18 May 2023 17:09 UTC
19 points
31 comments6 min readLW link

[Question] Is “brit­tle al­ign­ment” good enough?

the8thbit23 May 2023 17:35 UTC
9 points
5 comments3 min readLW link

Two ideas for al­ign­ment, per­pet­ual mu­tual dis­trust and induction

APaleBlueDot25 May 2023 0:56 UTC
1 point
0 comments4 min readLW link

how hu­mans are aligned

bhauth26 May 2023 0:09 UTC
14 points
3 comments1 min readLW link

Shut­down-Seek­ing AI

Simon Goldstein31 May 2023 22:19 UTC
20 points
9 comments15 min readLW link

How will they feed us

meijer19731 Jun 2023 8:49 UTC
4 points
3 comments5 min readLW link

A sim­ple en­vi­ron­ment for show­ing mesa misalignment

Matthew Barnett26 Sep 2019 4:44 UTC
70 points
9 comments2 min readLW link

Ba­bies and Bun­nies: A Cau­tion About Evo-Psych

Alicorn22 Feb 2010 1:53 UTC
81 points
843 comments2 min readLW link

2-D Robustness

vlad_m30 Aug 2019 20:27 UTC
82 points
8 comments2 min readLW link

[AN #67]: Creat­ing en­vi­ron­ments in which to study in­ner al­ign­ment failures

Rohin Shah7 Oct 2019 17:10 UTC
17 points
0 comments8 min readLW link

Ex­am­ples of AI’s be­hav­ing badly

Stuart_Armstrong16 Jul 2015 10:01 UTC
41 points
39 comments1 min readLW link

Safely and use­fully spec­tat­ing on AIs op­ti­miz­ing over toy worlds

AlexMennen31 Jul 2018 18:30 UTC
24 points
16 comments2 min readLW link

“In­ner Align­ment Failures” Which Are Ac­tu­ally Outer Align­ment Failures

johnswentworth31 Oct 2020 20:18 UTC
61 points
38 comments5 min readLW link

AI Align­ment Us­ing Re­v­erse Simulation

Sven Nilsen12 Jan 2021 20:48 UTC
0 points
0 comments1 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments26 min readLW link

For­mal Solu­tion to the In­ner Align­ment Problem

michaelcohen18 Feb 2021 14:51 UTC
49 points
123 comments2 min readLW link

Re­sponse to “What does the uni­ver­sal prior ac­tu­ally look like?”

michaelcohen20 May 2021 16:12 UTC
36 points
33 comments18 min readLW link

In­suffi­cient Values

16 Jun 2021 14:33 UTC
29 points
15 comments5 min readLW link

Call for re­search on eval­u­at­ing al­ign­ment (fund­ing + ad­vice available)

Beth Barnes31 Aug 2021 23:28 UTC
105 points
11 comments5 min readLW link

Ob­sta­cles to gra­di­ent hacking

leogao5 Sep 2021 22:42 UTC
28 points
11 comments4 min readLW link

Towards De­con­fus­ing Gra­di­ent Hacking

leogao24 Oct 2021 0:43 UTC
32 points
3 comments12 min readLW link

Meta learn­ing to gra­di­ent hack

Quintin Pope1 Oct 2021 19:25 UTC
55 points
11 comments3 min readLW link

The eval­u­a­tion func­tion of an AI is not its aim

Yair Halberstadt10 Oct 2021 14:52 UTC
13 points
5 comments3 min readLW link

[Question] What ex­actly is GPT-3′s base ob­jec­tive?

Daniel Kokotajlo10 Nov 2021 0:57 UTC
60 points
14 comments2 min readLW link

Un­der­stand­ing Gra­di­ent Hacking

peterbarnett10 Dec 2021 15:58 UTC
31 points
5 comments30 min readLW link

Ev­i­dence Sets: Towards In­duc­tive-Bi­ases based Anal­y­sis of Pro­saic AGI

bayesian_kitten16 Dec 2021 22:41 UTC
22 points
10 comments21 min readLW link

Gra­di­ent Hack­ing via Schel­ling Goals

Adam Scherlis28 Dec 2021 20:38 UTC
33 points
4 comments4 min readLW link

Align­ment Prob­lems All the Way Down

peterbarnett22 Jan 2022 0:19 UTC
26 points
7 comments10 min readLW link

How com­plex are my­opic imi­ta­tors?

Vivek Hebbar8 Feb 2022 12:00 UTC
26 points
1 comment15 min readLW link

Pro­ject In­tro: Selec­tion The­o­rems for Modularity

4 Apr 2022 12:59 UTC
69 points
20 comments16 min readLW link

De­cep­tive Agents are a Good Way to Do Things

David Udell19 Apr 2022 18:04 UTC
15 points
0 comments1 min readLW link

Why No *In­ter­est­ing* Unal­igned Sin­gu­lar­ity?

David Udell20 Apr 2022 0:34 UTC
12 points
12 comments1 min readLW link

High-stakes al­ign­ment via ad­ver­sar­ial train­ing [Red­wood Re­search re­port]

5 May 2022 0:59 UTC
142 points
29 comments9 min readLW link

AI Alter­na­tive Fu­tures: Sce­nario Map­ping Ar­tifi­cial In­tel­li­gence Risk—Re­quest for Par­ti­ci­pa­tion (*Closed*)

Kakili27 Apr 2022 22:07 UTC
10 points
2 comments8 min readLW link

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
52 points
0 comments59 min readLW link

On in­ner and outer al­ign­ment, and their confusion

NinaR26 May 2022 21:56 UTC
6 points
7 comments4 min readLW link

A Story of AI Risk: In­struc­tGPT-N

peterbarnett26 May 2022 23:22 UTC
24 points
0 comments8 min readLW link

Why I’m Wor­ried About AI

peterbarnett23 May 2022 21:13 UTC
22 points
2 comments12 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

27 Jun 2022 15:58 UTC
168 points
14 comments7 min readLW link

Doom doubts—is in­ner al­ign­ment a likely prob­lem?

Crissman28 Jun 2022 12:42 UTC
6 points
7 comments1 min readLW link

The cu­ri­ous case of Pretty Good hu­man in­ner/​outer alignment

PavleMiha5 Jul 2022 19:04 UTC
41 points
45 comments4 min readLW link

Ac­cept­abil­ity Ver­ifi­ca­tion: A Re­search Agenda

12 Jul 2022 20:11 UTC
50 points
0 comments1 min readLW link

Con­di­tion­ing Gen­er­a­tive Models for Alignment

Jozdien18 Jul 2022 7:11 UTC
53 points
8 comments20 min readLW link

Our Ex­ist­ing Solu­tions to AGI Align­ment (semi-safe)

Michael Soareverix21 Jul 2022 19:00 UTC
12 points
1 comment3 min readLW link

In­co­her­ence of un­bounded selfishness

emmab26 Jul 2022 22:27 UTC
−6 points
2 comments1 min readLW link

Ex­ter­nal­ized rea­son­ing over­sight: a re­search di­rec­tion for lan­guage model alignment

tamera3 Aug 2022 12:03 UTC
117 points
23 comments6 min readLW link

Con­ver­gence Towards World-Models: A Gears-Level Model

Thane Ruthenis4 Aug 2022 23:31 UTC
38 points
1 comment13 min readLW link

Gra­di­ent de­scent doesn’t se­lect for in­ner search

Ivan Vendrov13 Aug 2022 4:15 UTC
41 points
23 comments4 min readLW link

De­cep­tion as the op­ti­mal: mesa-op­ti­miz­ers and in­ner al­ign­ment

Eleni Angelou16 Aug 2022 4:49 UTC
11 points
0 comments5 min readLW link

Broad Pic­ture of Hu­man Values

Thane Ruthenis20 Aug 2022 19:42 UTC
38 points
5 comments10 min readLW link

Thoughts about OOD alignment

Dmitry Savishchev24 Aug 2022 15:31 UTC
11 points
10 comments2 min readLW link

Are Gen­er­a­tive World Models a Mesa-Op­ti­miza­tion Risk?

Thane Ruthenis29 Aug 2022 18:37 UTC
13 points
2 comments3 min readLW link

Three sce­nar­ios of pseudo-al­ign­ment

Eleni Angelou3 Sep 2022 12:47 UTC
9 points
0 comments3 min readLW link

In­ner Align­ment via Superpowers

30 Aug 2022 20:01 UTC
37 points
13 comments4 min readLW link

Fram­ing AI Childhoods

David Udell6 Sep 2022 23:40 UTC
37 points
8 comments4 min readLW link

Can “Re­ward Eco­nomics” solve AI Align­ment?

Q Home7 Sep 2022 7:58 UTC
3 points
15 comments18 min readLW link

The Defen­der’s Ad­van­tage of Interpretability

Marius Hobbhahn14 Sep 2022 14:05 UTC
41 points
4 comments6 min readLW link

Why de­cep­tive al­ign­ment mat­ters for AGI safety

Marius Hobbhahn15 Sep 2022 13:38 UTC
56 points
13 comments13 min readLW link

Levels of goals and alignment

zeshen16 Sep 2022 16:44 UTC
27 points
4 comments6 min readLW link

In­ner al­ign­ment: what are we point­ing at?

lukehmiles18 Sep 2022 11:09 UTC
7 points
2 comments1 min readLW link

Plan­ning ca­pac­ity and daemons

lukehmiles26 Sep 2022 0:15 UTC
2 points
0 comments5 min readLW link

LOVE in a sim­box is all you need

jacob_cannell28 Sep 2022 18:25 UTC
66 points
69 comments44 min readLW link

More ex­am­ples of goal misgeneralization

7 Oct 2022 14:38 UTC
53 points
8 comments2 min readLW link

Disen­tan­gling in­ner al­ign­ment failures

Erik Jenner10 Oct 2022 18:50 UTC
14 points
5 comments4 min readLW link

Greed Is the Root of This Evil

Thane Ruthenis13 Oct 2022 20:40 UTC
21 points
4 comments8 min readLW link

Science of Deep Learn­ing—a tech­ni­cal agenda

Marius Hobbhahn18 Oct 2022 14:54 UTC
36 points
7 comments4 min readLW link

What sorts of sys­tems can be de­cep­tive?

Andrei Alexandru31 Oct 2022 22:00 UTC
15 points
0 comments7 min readLW link

Clar­ify­ing AI X-risk

1 Nov 2022 11:03 UTC
114 points
23 comments4 min readLW link

Threat Model Liter­a­ture Review

1 Nov 2022 11:03 UTC
69 points
4 comments25 min readLW link

[Question] I there a demo of “You can’t fetch the coffee if you’re dead”?

Ram Rachum10 Nov 2022 18:41 UTC
8 points
9 comments1 min readLW link

Value For­ma­tion: An Over­ar­ch­ing Model

Thane Ruthenis15 Nov 2022 17:16 UTC
29 points
20 comments34 min readLW link

The Disas­trously Con­fi­dent And Inac­cu­rate AI

Sharat Jacob Jacob18 Nov 2022 19:06 UTC
13 points
0 comments13 min readLW link

Aligned Be­hav­ior is not Ev­i­dence of Align­ment Past a Cer­tain Level of Intelligence

Ronny Fernandez5 Dec 2022 15:19 UTC
19 points
5 comments7 min readLW link

Disen­tan­gling Shard The­ory into Atomic Claims

Leon Lang13 Jan 2023 4:23 UTC
82 points
6 comments18 min readLW link