
Corrigibility

Tag · Last edit: 27 Nov 2023 9:24 UTC by Seth Herd

A ‘corrigible’ agent is one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or to ‘correct’ our mistakes in building it, and that permits these ‘corrections’ even though instrumentally convergent reasoning would apparently advise against doing so.
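As a minimal sketch of the "instrumentally convergent reasoning" mentioned above, consider a toy agent that values only reaching its goal. All names and numbers here are hypothetical, chosen purely for illustration; this is not a formalization taken from any of the posts listed on this page.

```python
def expected_utility(p_shutdown: float, u_goal: float, u_off: float = 0.0) -> float:
    """Expected utility for an agent that cares only about its goal.

    p_shutdown: probability the agent is shut down before finishing
    u_goal:     utility if it keeps running and reaches its goal
    u_off:      utility if it is shut down (assumed to be 0 here)
    """
    return (1 - p_shutdown) * u_goal + p_shutdown * u_off


# Hypothetical numbers, for illustration only.
U_GOAL = 10.0        # payoff for completing the task
P_IF_ALLOWED = 0.5   # shutdown probability if the agent leaves the off-switch alone
P_IF_DISABLED = 0.0  # shutdown probability if the agent disables the off-switch

eu_allow = expected_utility(P_IF_ALLOWED, U_GOAL)
eu_disable = expected_utility(P_IF_DISABLED, U_GOAL)

# A naive expected-utility maximizer picks whichever action scores higher.
naive_choice = "disable" if eu_disable > eu_allow else "allow"
print(eu_allow, eu_disable, naive_choice)
```

Under these assumptions the naive maximizer prefers to disable the off-switch for almost any goal, which is the pressure a corrigible agent would have to not act on.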

The term is also used in a broader sense, meaning something like a helpful agent. Paul Christiano has defined a corrigible agent as one that will help me:

  • Figure out whether I built the right AI and correct any mistakes I made

  • Remain informed about the AI’s behavior and avoid unpleasant surprises

  • Make better decisions and clarify my preferences

  • Acquire resources and remain in effective control of them

  • Ensure that my AI systems continue to do all of these nice things

  • …and so on

See also:

Let’s See You Write That Corrigibility Tag

Eliezer Yudkowsky · 19 Jun 2022 21:11 UTC
123 points
70 comments · 1 min read · LW link

Corrigibility

paulfchristiano · 27 Nov 2018 21:50 UTC
57 points
8 comments · 6 min read · LW link

Towards shutdownable agents via stochastic choice

8 Jul 2024 10:14 UTC
49 points
5 comments · 23 min read · LW link
(arxiv.org)

“Corrigibility at some small length” by dath ilan

Christopher King · 5 Apr 2023 1:47 UTC
32 points
3 comments · 9 min read · LW link
(www.glowfic.com)

What’s Hard About The Shutdown Problem

johnswentworth · 20 Oct 2023 21:13 UTC
98 points
32 comments · 4 min read · LW link

The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

EJT · 23 Oct 2023 21:00 UTC
78 points
22 comments · 1 min read · LW link
(philpapers.org)

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
122 points
29 comments · 8 min read · LW link
(arxiv.org)

Corrigibility could make things worse

ThomasCederborg · 11 Jun 2024 0:55 UTC
7 points
5 comments · 6 min read · LW link

Reward Is Not Enough

Steven Byrnes · 16 Jun 2021 13:52 UTC
120 points
19 comments · 10 min read · LW link · 1 review

The Shutdown Problem: Incomplete Preferences as a Solution

EJT · 23 Feb 2024 16:01 UTC
50 points
22 comments · 41 min read · LW link

A broad basin of attraction around human values?

Wei Dai · 12 Apr 2022 5:15 UTC
113 points
17 comments · 2 min read · LW link

Non-Obstruction: A Simple Concept Motivating Corrigibility

TurnTrout · 21 Nov 2020 19:35 UTC
74 points
20 comments · 19 min read · LW link

Cake, or death!

Stuart_Armstrong · 25 Oct 2012 10:33 UTC
47 points
13 comments · 4 min read · LW link

Solving the whole AGI control problem, version 0.0001

Steven Byrnes · 8 Apr 2021 15:14 UTC
63 points
7 comments · 26 min read · LW link

AXRP Episode 8 - Assistance Games with Dylan Hadfield-Menell

DanielFilan · 8 Jun 2021 23:20 UTC
22 points
1 comment · 72 min read · LW link

Model-based RL, Desires, Brains, Wireheading

Steven Byrnes · 14 Jul 2021 15:11 UTC
22 points
1 comment · 13 min read · LW link

A Certain Formalization of Corrigibility Is VNM-Incoherent

TurnTrout · 20 Nov 2021 0:30 UTC
67 points
24 comments · 8 min read · LW link

Formalizing Policy-Modification Corrigibility

TurnTrout · 3 Dec 2021 1:31 UTC
25 points
6 comments · 6 min read · LW link

Corrigible omniscient AI capable of making clones

Kaj_Sotala · 22 Mar 2015 12:19 UTC
5 points
4 comments · 1 min read · LW link
(www.sharelatex.com)

Corrigible but misaligned: a superintelligent messiah

zhukeepa · 1 Apr 2018 6:20 UTC
28 points
26 comments · 5 min read · LW link

Consequentialism & corrigibility

Steven Byrnes · 14 Dec 2021 13:23 UTC
66 points
27 comments · 7 min read · LW link

The limits of corrigibility

Stuart_Armstrong · 10 Apr 2018 10:49 UTC
27 points
9 comments · 4 min read · LW link

Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence.

RyanCarey · 21 Oct 2018 12:03 UTC
23 points
1 comment · 6 min read · LW link

[Intro to brain-like-AGI safety] 14. Controlled AGI

Steven Byrnes · 11 May 2022 13:17 UTC
41 points
25 comments · 19 min read · LW link

Do what we mean vs. do what we say

Rohin Shah · 30 Aug 2018 22:03 UTC
34 points
14 comments · 1 min read · LW link

Towards a mechanistic understanding of corrigibility

evhub · 22 Aug 2019 23:20 UTC
47 points
26 comments · 6 min read · LW link

Internal independent review for language model agent alignment

Seth Herd · 7 Jul 2023 6:54 UTC
53 points
26 comments · 11 min read · LW link

On corrigibility and its basin

Donald Hobson · 20 Jun 2022 16:33 UTC
16 points
3 comments · 2 min read · LW link

Another view of quantilizers: avoiding Goodhart’s Law

jessicata · 9 Jan 2016 4:02 UTC
26 points
2 comments · 2 min read · LW link

[Question] What is wrong with this approach to corrigibility?

Rafael Cosman · 12 Jul 2022 22:55 UTC
7 points
8 comments · 1 min read · LW link

A first look at the hard problem of corrigibility

jessicata · 15 Oct 2015 20:16 UTC
12 points
5 comments · 4 min read · LW link

5. Open Corrigibility Questions

Max Harms · 10 Jun 2024 14:09 UTC
21 points
0 comments · 7 min read · LW link

Infinite Possibility Space and the Shutdown Problem

magfrump · 18 Oct 2022 5:37 UTC
6 points
0 comments · 2 min read · LW link
(www.magfrump.net)

An Impossibility Proof Relevant to the Shutdown Problem and Corrigibility

Audere · 2 May 2023 6:52 UTC
65 points
13 comments · 9 min read · LW link

People care about each other even though they have imperfect motivational pointers?

TurnTrout · 8 Nov 2022 18:15 UTC
33 points
25 comments · 7 min read · LW link

Consequentialists: One-Way Pattern Traps

David Udell · 16 Jan 2023 20:48 UTC
54 points
3 comments · 14 min read · LW link

Corrigibility = Tool-ness?

28 Jun 2024 1:19 UTC
76 points
6 comments · 9 min read · LW link

[Question] Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility “real”

joraine · 24 Nov 2022 5:08 UTC
25 points
11 comments · 1 min read · LW link

Contrary to List of Lethality’s point 22, alignment’s door number 2

False Name · 14 Dec 2022 22:01 UTC
−2 points
5 comments · 22 min read · LW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd · 18 Apr 2023 16:29 UTC
83 points
18 comments · 20 min read · LW link

Take 14: Corrigibility isn’t that great.

Charlie Steiner · 25 Dec 2022 13:04 UTC
15 points
3 comments · 3 min read · LW link

Corrigibility, Much more detail than anyone wants to Read

Logan Zoellner · 7 May 2023 1:02 UTC
26 points
2 comments · 7 min read · LW link

Desiderata for an AI

Nathan Helm-Burger · 19 Jul 2023 16:18 UTC
8 points
0 comments · 4 min read · LW link

Using predictors in corrigible systems

porby · 19 Jul 2023 22:29 UTC
19 points
6 comments · 27 min read · LW link

Hedonic Loops and Taming RL

beren · 19 Jul 2023 15:12 UTC
20 points
14 comments · 9 min read · LW link

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes · 3 Aug 2020 14:29 UTC
55 points
35 comments · 4 min read · LW link

Corrigibility Via Thought-Process Deference

Thane Ruthenis · 24 Nov 2022 17:06 UTC
17 points
5 comments · 9 min read · LW link

AI Alignment 2018-19 Review

Rohin Shah · 28 Jan 2020 2:19 UTC
126 points
6 comments · 35 min read · LW link

[Question] Training for corrigability: obvious problems?

Ben Amitay · 24 Feb 2023 14:02 UTC
4 points
6 comments · 1 min read · LW link

Jan Kulveit’s Corrigibility Thoughts Distilled

brook · 20 Aug 2023 17:52 UTC
20 points
1 comment · 5 min read · LW link

Aggregating Utilities for Corrigible AI [Feedback Draft]

12 May 2023 20:57 UTC
28 points
7 comments · 22 min read · LW link

Game Theory without Argmax [Part 1]

Cleo Nardo · 11 Nov 2023 15:59 UTC
64 points
17 comments · 19 min read · LW link

[Question] Why does advanced AI want not to be shut down?

RedFishBlueFish · 28 Mar 2023 4:26 UTC
3 points
19 comments · 1 min read · LW link

Game Theory without Argmax [Part 2]

Cleo Nardo · 11 Nov 2023 16:02 UTC
31 points
14 comments · 13 min read · LW link

Predictive model agents are sort of corrigible

Raymond D · 5 Jan 2024 14:05 UTC
35 points
6 comments · 3 min read · LW link

Corrigibility as outside view

TurnTrout · 8 May 2020 21:56 UTC
36 points
11 comments · 4 min read · LW link

Can corrigibility be learned safely?

Wei Dai · 1 Apr 2018 23:07 UTC
35 points
115 comments · 4 min read · LW link

A Critique of Non-Obstruction

Joe_Collman · 3 Feb 2021 8:45 UTC
13 points
9 comments · 4 min read · LW link

Thoughts on implementing corrigible robust alignment

Steven Byrnes · 26 Nov 2019 14:06 UTC
26 points
2 comments · 6 min read · LW link

An Idea For Corrigible, Recursively Improving Math Oracles

jimrandomh · 20 Jul 2015 3:35 UTC
9 points
5 comments · 2 min read · LW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver Sourbut · 16 Dec 2021 1:07 UTC
16 points
0 comments · 42 min read · LW link

Question 3: Control proposals for minimizing bad outcomes

Cameron Berg · 12 Feb 2022 19:13 UTC
5 points
1 comment · 7 min read · LW link

Updating Utility Functions

9 May 2022 9:44 UTC
41 points
6 comments · 8 min read · LW link

How RL Agents Behave When Their Actions Are Modified? [Distillation post]

PabloAMC · 20 May 2022 18:47 UTC
22 points
0 comments · 8 min read · LW link

Infernal Corrigibility, Fiendishly Difficult

David Udell · 27 May 2022 20:32 UTC
20 points
1 comment · 13 min read · LW link

Machines vs Memes Part 3: Imitation and Memes

ceru23 · 1 Jun 2022 13:36 UTC
7 points
0 comments · 7 min read · LW link

Simulators

janus · 2 Sep 2022 12:45 UTC
599 points
161 comments · 41 min read · LW link · 8 reviews
(generative.ink)

Dath Ilan’s Views on Stopgap Corrigibility

David Udell · 22 Sep 2022 16:16 UTC
77 points
19 comments · 13 min read · LW link
(www.glowfic.com)

[Question] Simple question about corrigibility and values in AI.

jmh · 22 Oct 2022 2:59 UTC
6 points
1 comment · 1 min read · LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

5 Dec 2022 20:28 UTC
40 points
19 comments · 10 min read · LW link

CIRL Corrigibility is Fragile

21 Dec 2022 1:40 UTC
58 points
9 comments · 12 min read · LW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang · 16 Jan 2023 22:46 UTC
31 points
7 comments · 17 min read · LW link
(docs.google.com)

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

Christopher King · 20 Feb 2023 15:11 UTC
16 points
15 comments · 1 min read · LW link

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley · 21 Feb 2023 9:05 UTC
10 points
1 comment · 23 min read · LW link

Just How Hard a Problem is Alignment?

Roger Dearnaley · 25 Feb 2023 9:00 UTC
−1 points
1 comment · 21 min read · LW link

Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth · 8 Aug 2022 18:05 UTC
129 points
12 comments · 3 min read · LW link

You can still fetch the coffee today if you’re dead tomorrow

davidad · 9 Dec 2022 14:06 UTC
84 points
19 comments · 5 min read · LW link

Solve Corrigibility Week

Logan Riggs · 28 Nov 2021 17:00 UTC
39 points
21 comments · 1 min read · LW link

A Pedagogical Guide to Corrigibility

A.H. · 17 Jan 2024 11:45 UTC
6 points
3 comments · 16 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · 25 May 2023 9:26 UTC
32 points
3 comments · 15 min read · LW link

Nash Bargaining between Subagents doesn’t solve the Shutdown Problem

A.H. · 25 Jan 2024 10:47 UTC
22 points
1 comment · 9 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC
35 points
6 comments · 31 min read · LW link

0. CAST: Corrigibility as Singular Target

Max Harms · 7 Jun 2024 22:29 UTC
86 points
10 comments · 8 min read · LW link

1. The CAST Strategy

Max Harms · 7 Jun 2024 22:29 UTC
45 points
19 comments · 38 min read · LW link

Relevance of ‘Harmful Intelligence’ Data in Training Datasets (WebText vs. Pile)

MiguelDev · 12 Oct 2023 12:08 UTC
12 points
0 comments · 9 min read · LW link

2. Corrigibility Intuition

Max Harms · 8 Jun 2024 15:52 UTC
53 points
8 comments · 33 min read · LW link

3a. Towards Formal Corrigibility

Max Harms · 9 Jun 2024 16:53 UTC
9 points
0 comments · 19 min read · LW link

3b. Formal (Faux) Corrigibility

Max Harms · 9 Jun 2024 17:18 UTC
19 points
10 comments · 17 min read · LW link

4. Existing Writing on Corrigibility

Max Harms · 10 Jun 2024 14:08 UTC
41 points
10 comments · 106 min read · LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr · 12 Feb 2021 7:55 UTC
15 points
0 comments · 27 min read · LW link

Select Agent Specifications as Natural Abstractions

lukemarks · 7 Apr 2023 23:16 UTC
19 points
3 comments · 5 min read · LW link

Agentized LLMs will change the alignment landscape

Seth Herd · 9 Apr 2023 2:29 UTC
153 points
95 comments · 3 min read · LW link

Paying the corrigibility tax

Max H · 19 Apr 2023 1:57 UTC
14 points
1 comment · 13 min read · LW link

Thinking about maximization and corrigibility

James Payor · 21 Apr 2023 21:22 UTC
63 points
4 comments · 5 min read · LW link

Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner & Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium

MiguelDev · 26 Apr 2023 1:37 UTC
14 points
5 comments · 10 min read · LW link

[Question] A Question about Corrigibility (2015)

A.H. · 27 Nov 2023 12:05 UTC
4 points
2 comments · 1 min read · LW link

Announcement: AI alignment prize round 4 winners

cousin_it · 20 Jan 2019 14:46 UTC
74 points
41 comments · 1 min read · LW link

Boeing 737 MAX MCAS as an agent corrigibility failure

shminux · 16 Mar 2019 1:46 UTC
60 points
3 comments · 1 min read · LW link

«Boundaries/Membranes» and AI safety compilation

Chipmonk · 3 May 2023 21:41 UTC
53 points
17 comments · 8 min read · LW link

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

16 May 2023 10:53 UTC
22 points
0 comments · 13 min read · LW link

A Corrigibility Metaphore—Big Gambles

WCargo · 10 May 2023 18:13 UTC
16 points
0 comments · 4 min read · LW link

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus · 17 May 2023 14:13 UTC
21 points
4 comments · 13 min read · LW link

Collective Identity

18 May 2023 9:00 UTC
59 points
12 comments · 8 min read · LW link

Creating a self-referential system prompt for GPT-4

Ozyrus · 17 May 2023 14:13 UTC
3 points
1 comment · 3 min read · LW link

Mr. Meeseeks as an AI capability tripwire

Eric Zhang · 19 May 2023 11:33 UTC
37 points
17 comments · 2 min read · LW link

New paper: Corrigibility with Utility Preservation

Koen.Holtman · 6 Aug 2019 19:04 UTC
44 points
11 comments · 2 min read · LW link

Introducing Corrigibility (an FAI research subfield)

So8res · 20 Oct 2014 21:09 UTC
52 points
28 comments · 3 min read · LW link

Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program

Christopher King · 2 Jun 2023 21:54 UTC
7 points
4 comments · 16 min read · LW link

Shutdown-Seeking AI

Simon Goldstein · 31 May 2023 22:19 UTC
48 points
31 comments · 15 min read · LW link

Improvement on MIRI’s Corrigibility

9 Jun 2023 16:10 UTC
54 points
8 comments · 13 min read · LW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDev · 19 Jun 2023 2:32 UTC
4 points
2 comments · 7 min read · LW link

Exploring Functional Decision Theory (FDT) and a modified version (ModFDT)

MiguelDev · 5 Jul 2023 14:06 UTC
8 points
11 comments · 15 min read · LW link

[Question] What are some good examples of incorrigibility?

RyanCarey · 28 Apr 2019 0:22 UTC
23 points
17 comments · 1 min read · LW link

Corrigibility thoughts II: the robot operator

Stuart_Armstrong · 18 Jan 2017 15:52 UTC
3 points
2 comments · 2 min read · LW link

Winners of AI Alignment Awards Research Contest

13 Jul 2023 16:14 UTC
114 points
3 comments · 12 min read · LW link
(alignmentawards.com)

Train for incorrigibility, then reverse it (Shutdown Problem Contest Submission)

Daniel_Eth · 18 Jul 2023 8:26 UTC
9 points
1 comment · 1 min read · LW link

Only a hack can solve the shutdown problem

dp · 15 Jul 2023 20:26 UTC
5 points
0 comments · 8 min read · LW link

Corrigibility thoughts III: manipulating versus deceiving

Stuart_Armstrong · 18 Jan 2017 15:57 UTC
3 points
0 comments · 1 min read · LW link

Question: MIRI Corrigbility Agenda

algon33 · 13 Mar 2019 19:38 UTC
15 points
11 comments · 1 min read · LW link

Petrov corrigibility

Stuart_Armstrong · 11 Sep 2018 13:50 UTC
20 points
10 comments · 1 min read · LW link

Corrigibility doesn’t always have a good action to take

Stuart_Armstrong · 28 Aug 2018 20:30 UTC
19 points
0 comments · 1 min read · LW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername · 24 Aug 2023 3:53 UTC
1 point
0 comments · 6 min read · LW link

Invulnerable Incomplete Preferences: A Formal Statement

Sami Petersen · 30 Aug 2023 21:59 UTC
124 points
32 comments · 35 min read · LW link

Corrigibility as Constrained Optimisation

Henrik Åslund · 11 Apr 2019 20:09 UTC
15 points
3 comments · 5 min read · LW link

Instrumental Convergence Bounty

Logan Zoellner · 14 Sep 2023 14:02 UTC
62 points
24 comments · 1 min read · LW link

How useful is Corrigibility?

martinkunev · 12 Sep 2023 0:05 UTC
11 points
4 comments · 5 min read · LW link

Three AI Safety Related Ideas

Wei Dai · 13 Dec 2018 21:32 UTC
68 points
38 comments · 2 min read · LW link

Counterfactual Planning in AGI Systems

Koen.Holtman · 3 Feb 2021 13:54 UTC
10 points
0 comments · 5 min read · LW link

Creating AGI Safety Interlocks

Koen.Holtman · 5 Feb 2021 12:01 UTC
7 points
4 comments · 8 min read · LW link

Disentangling Corrigibility: 2015-2021

Koen.Holtman · 16 Feb 2021 18:01 UTC
22 points
20 comments · 9 min read · LW link

Safely controlling the AGI agent reward function

Koen.Holtman · 17 Feb 2021 14:47 UTC
8 points
0 comments · 5 min read · LW link

Information bottleneck for counterfactual corrigibility

tailcalled · 6 Dec 2021 17:11 UTC
8 points
1 comment · 7 min read · LW link