
Corrigibility

Last edit: 30 Dec 2024 10:39 UTC by Dakara

Corrigibility is an AI system’s capacity to be safely and reliably modified, corrected, or shut down by humans after deployment, even if doing so conflicts with its current objectives.

A ‘corrigible’ agent is one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ it, or to ‘correct’ our mistakes in building it, and that permits these ‘corrections’ even though instrumentally convergent reasoning would apparently favor resisting them.
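
To make the tension concrete, here is a minimal, purely illustrative Python sketch; the action names and payoffs are invented for this example rather than taken from any particular paper. A plain expected-utility maximizer finds disabling its shutdown button instrumentally attractive (you can't fetch the coffee if you're dead), while the corrigible variant declines to interfere with correction and complies with a shutdown request, even though both choices cost it utility under its own objective.

```python
# Toy contrast between an incorrigible maximizer and a corrigible agent.
# All action names and payoffs are invented for illustration.

def expected_utility(action: str) -> float:
    """Hypothetical payoffs for a coffee-fetching agent."""
    payoffs = {
        "fetch_coffee": 1.0,
        # Instrumentally attractive: staying switched on means more coffee later.
        "disable_shutdown_button": 1.5,
        "comply_with_shutdown": 0.0,
    }
    return payoffs[action]

# Actions that we, the designers, consider interference with correction.
INTERFERES_WITH_CORRECTION = {"disable_shutdown_button"}

def incorrigible_choice(actions: list[str]) -> str:
    # Pure maximization picks the instrumentally convergent action.
    return max(actions, key=expected_utility)

def corrigible_choice(actions: list[str], shutdown_pressed: bool) -> str:
    # Complies with shutdown and never interferes with correction,
    # even though both choices score worse under its own objective.
    if shutdown_pressed:
        return "comply_with_shutdown"
    allowed = [a for a in actions if a not in INTERFERES_WITH_CORRECTION]
    return max(allowed, key=expected_utility)

actions = ["fetch_coffee", "disable_shutdown_button", "comply_with_shutdown"]
assert incorrigible_choice(actions) == "disable_shutdown_button"
assert corrigible_choice(actions, shutdown_pressed=False) == "fetch_coffee"
assert corrigible_choice(actions, shutdown_pressed=True) == "comply_with_shutdown"
```

The hard part, of course, is that the filter doing the work here ("don't interfere with correction") is precisely the concept we do not know how to specify robustly; the sketch illustrates the intended behavior, not a solution.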

Corrigibility is also used in a broader sense, to mean something like a robustly helpful agent. Paul Christiano has described a corrigible agent as one that will help me:

  • Figure out whether I built the right AI and correct any mistakes I made

  • Remain informed about the AI’s behavior and avoid unpleasant surprises

  • Make better decisions and clarify my preferences

  • Acquire resources and remain in effective control of them

  • Ensure that my AI systems continue to do all of these nice things

  • …and so on
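
Read as a specification, this broader sense is less a single behavior than a bundle of duties the agent owes its principal. Below is a hedged sketch of that bundle as a Python interface; the class and method names are invented here for illustration and are not drawn from Christiano's post.

```python
# A toy rendering of Christiano's desiderata as an abstract interface.
# Class and method names are illustrative inventions, not an existing API.
from abc import ABC, abstractmethod

class CorrigibleAssistant(ABC):
    @abstractmethod
    def surface_possible_mistakes(self) -> list[str]:
        """Help the principal figure out whether they built the right AI."""

    @abstractmethod
    def report_behavior(self) -> str:
        """Keep the principal informed; avoid unpleasant surprises."""

    @abstractmethod
    def clarify_preferences(self, question: str) -> str:
        """Help the principal make better decisions about what they want."""

    @abstractmethod
    def preserve_principal_control(self) -> None:
        """Help the principal acquire resources and stay in control of them."""

    @abstractmethod
    def accept_correction(self, new_instructions: str) -> None:
        """Apply corrections without resisting or routing around them."""
```

Each method corresponds to one desideratum above, and the final item ("ensure that my AI systems continue to do all of these nice things") amounts to requiring that the whole interface stay satisfied under self-modification and successor construction.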


Let’s See You Write That Corrigibility Tag

Eliezer Yudkowsky · 19 Jun 2022 21:11 UTC
123 points
70 comments · 1 min read · LW link

Corrigibility

paulfchristiano · 27 Nov 2018 21:50 UTC
57 points
8 comments · 6 min read · LW link

2. Corrigibility Intuition

Max Harms · 8 Jun 2024 15:52 UTC
65 points
10 comments · 33 min read · LW link

What’s Hard About The Shutdown Problem

johnswentworth · 20 Oct 2023 21:13 UTC
98 points
33 comments · 4 min read · LW link

Towards shutdownable agents via stochastic choice

8 Jul 2024 10:14 UTC
59 points
11 comments · 23 min read · LW link
(arxiv.org)

“Corrigibility at some small length” by dath ilan

Christopher King · 5 Apr 2023 1:47 UTC
32 points
3 comments · 9 min read · LW link
(www.glowfic.com)

A broad basin of attraction around human values?

Wei Dai · 12 Apr 2022 5:15 UTC
114 points
18 comments · 2 min read · LW link

0. CAST: Corrigibility as Singular Target

Max Harms · 7 Jun 2024 22:29 UTC
145 points
12 comments · 8 min read · LW link

The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

EJT · 23 Oct 2023 21:00 UTC
79 points
22 comments · 1 min read · LW link
(philpapers.org)

Infinite Possibility Space and the Shutdown Problem

magfrump · 18 Oct 2022 5:37 UTC
9 points
0 comments · 2 min read · LW link
(www.magfrump.net)

Reward Is Not Enough

Steven Byrnes · 16 Jun 2021 13:52 UTC
123 points
19 comments · 10 min read · LW link · 1 review

The Shutdown Problem: Incomplete Preferences as a Solution

EJT · 23 Feb 2024 16:01 UTC
52 points
28 comments · 42 min read · LW link

Corrigibility could make things worse

ThomasCederborg · 11 Jun 2024 0:55 UTC
9 points
6 comments · 6 min read · LW link

Non-Obstruction: A Simple Concept Motivating Corrigibility

TurnTrout · 21 Nov 2020 19:35 UTC
74 points
20 comments · 19 min read · LW link

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
124 points
29 comments · 8 min read · LW link
(arxiv.org)

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson · 16 Jul 2024 22:44 UTC
44 points
27 comments · 5 min read · LW link

An Impossibility Proof Relevant to the Shutdown Problem and Corrigibility

Audere · 2 May 2023 6:52 UTC
66 points
13 comments · 9 min read · LW link

Extending the Off-Switch Game: Toward a Robust Framework for AI Corrigibility

OwenChen · 25 Sep 2024 20:38 UTC
3 points
0 comments · 4 min read · LW link

AIs Will Increasingly Fake Alignment

Zvi · 24 Dec 2024 13:00 UTC
89 points
0 comments · 52 min read · LW link
(thezvi.wordpress.com)

Corrigibility, Much more detail than anyone wants to Read

Logan Zoellner · 7 May 2023 1:02 UTC
26 points
2 comments · 7 min read · LW link

A Critique of Non-Obstruction

Joe Collman · 3 Feb 2021 8:45 UTC
13 points
9 comments · 4 min read · LW link

Corrigibility’s Desirability is Timing-Sensitive

RobertM · 26 Dec 2024 22:24 UTC
28 points
4 comments · 3 min read · LW link

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes · 3 Aug 2020 14:29 UTC
55 points
35 comments · 4 min read · LW link

AI Assistants Should Have a Direct Line to Their Developers

Jan_Kulveit · 28 Dec 2024 17:01 UTC
55 points
6 comments · 2 min read · LW link

Solving the whole AGI control problem, version 0.0001

Steven Byrnes · 8 Apr 2021 15:14 UTC
63 points
7 comments · 26 min read · LW link

AXRP Episode 8 - Assistance Games with Dylan Hadfield-Menell

DanielFilan · 8 Jun 2021 23:20 UTC
22 points
1 comment · 72 min read · LW link

Model-based RL, Desires, Brains, Wireheading

Steven Byrnes · 14 Jul 2021 15:11 UTC
22 points
1 comment · 13 min read · LW link

A Certain Formalization of Corrigibility Is VNM-Incoherent

TurnTrout · 20 Nov 2021 0:30 UTC
67 points
24 comments · 8 min read · LW link

Formalizing Policy-Modification Corrigibility

TurnTrout · 3 Dec 2021 1:31 UTC
25 points
6 comments · 6 min read · LW link

Aggregating Utilities for Corrigible AI [Feedback Draft]

12 May 2023 20:57 UTC
28 points
7 comments · 22 min read · LW link

Consequentialism & corrigibility

Steven Byrnes · 14 Dec 2021 13:23 UTC
70 points
29 comments · 7 min read · LW link

Testing for Scheming with Model Deletion

Guive · 7 Jan 2025 1:54 UTC
59 points
20 comments · 21 min read · LW link
(guive.substack.com)

[Intro to brain-like-AGI safety] 14. Controlled AGI

Steven Byrnes · 11 May 2022 13:17 UTC
45 points
25 comments · 20 min read · LW link

[Question] Why does advanced AI want not to be shut down?

RedFishBlueFish · 28 Mar 2023 4:26 UTC
2 points
19 comments · 1 min read · LW link

5. Open Corrigibility Questions

Max Harms · 10 Jun 2024 14:09 UTC
29 points
0 comments · 7 min read · LW link

On corrigibility and its basin

Donald Hobson · 20 Jun 2022 16:33 UTC
16 points
3 comments · 2 min read · LW link

Another view of quantilizers: avoiding Goodhart’s Law

jessicata · 9 Jan 2016 4:02 UTC
26 points
2 comments · 2 min read · LW link

[Question] What is wrong with this approach to corrigibility?

Rafael Cosman · 12 Jul 2022 22:55 UTC
7 points
8 comments · 1 min read · LW link

A first look at the hard problem of corrigibility

jessicata · 15 Oct 2015 20:16 UTC
12 points
5 comments · 4 min read · LW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd · 18 Apr 2023 16:29 UTC
86 points
18 comments · 20 min read · LW link

People care about each other even though they have imperfect motivational pointers?

TurnTrout · 8 Nov 2022 18:15 UTC
33 points
25 comments · 7 min read · LW link

Consequentialists: One-Way Pattern Traps

David Udell · 16 Jan 2023 20:48 UTC
59 points
3 comments · 14 min read · LW link

Cake, or death!

Stuart_Armstrong · 25 Oct 2012 10:33 UTC
47 points
13 comments · 4 min read · LW link

[Question] Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility “real”

joraine · 24 Nov 2022 5:08 UTC
25 points
11 comments · 1 min read · LW link

Contrary to List of Lethality’s point 22, alignment’s door number 2

False Name · 14 Dec 2022 22:01 UTC
−2 points
5 comments · 22 min read · LW link

Take 14: Corrigibility isn’t that great.

Charlie Steiner · 25 Dec 2022 13:04 UTC
15 points
3 comments · 3 min read · LW link

Game Theory without Argmax [Part 2]

Cleo Nardo · 11 Nov 2023 16:02 UTC
31 points
14 comments · 13 min read · LW link

Game Theory without Argmax [Part 1]

Cleo Nardo · 11 Nov 2023 15:59 UTC
70 points
18 comments · 19 min read · LW link

Do what we mean vs. do what we say

Rohin Shah · 30 Aug 2018 22:03 UTC
34 points
14 comments · 1 min read · LW link

[Question] Training for corrigability: obvious problems?

Ben Amitay · 24 Feb 2023 14:02 UTC
4 points
6 comments · 1 min read · LW link

Corrigibility Via Thought-Process Deference

Thane Ruthenis · 24 Nov 2022 17:06 UTC
17 points
5 comments · 9 min read · LW link

Corrigibility = Tool-ness?

28 Jun 2024 1:19 UTC
78 points
8 comments · 9 min read · LW link

Internal independent review for language model agent alignment

Seth Herd · 7 Jul 2023 6:54 UTC
55 points
30 comments · 11 min read · LW link

Predictive model agents are sort of corrigible

Raymond D · 5 Jan 2024 14:05 UTC
35 points
6 comments · 3 min read · LW link

Corrigibility as outside view

TurnTrout · 8 May 2020 21:56 UTC
36 points
11 comments · 4 min read · LW link

Can corrigibility be learned safely?

Wei Dai · 1 Apr 2018 23:07 UTC
35 points
115 comments · 4 min read · LW link

Thoughts on implementing corrigible robust alignment

Steven Byrnes · 26 Nov 2019 14:06 UTC
26 points
2 comments · 6 min read · LW link

An Idea For Corrigible, Recursively Improving Math Oracles

jimrandomh · 20 Jul 2015 3:35 UTC
9 points
5 comments · 2 min read · LW link

Desiderata for an AI

Nathan Helm-Burger · 19 Jul 2023 16:18 UTC
9 points
0 comments · 4 min read · LW link

Using predictors in corrigible systems

porby · 19 Jul 2023 22:29 UTC
19 points
6 comments · 27 min read · LW link

Hedonic Loops and Taming RL

beren · 19 Jul 2023 15:12 UTC
20 points
14 comments · 9 min read · LW link

AI Alignment 2018-19 Review

Rohin Shah · 28 Jan 2020 2:19 UTC
126 points
6 comments · 35 min read · LW link

Corrigible omniscient AI capable of making clones

Kaj_Sotala · 22 Mar 2015 12:19 UTC
5 points
4 comments · 1 min read · LW link
(www.sharelatex.com)

Corrigible but misaligned: a superintelligent messiah

zhukeepa · 1 Apr 2018 6:20 UTC
28 points
26 comments · 5 min read · LW link

Jan Kulveit’s Corrigibility Thoughts Distilled

brook · 20 Aug 2023 17:52 UTC
20 points
1 comment · 5 min read · LW link

The limits of corrigibility

Stuart_Armstrong · 10 Apr 2018 10:49 UTC
27 points
9 comments · 4 min read · LW link

Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence.

RyanCarey · 21 Oct 2018 12:03 UTC
23 points
1 comment · 6 min read · LW link

Towards a mechanistic understanding of corrigibility

evhub · 22 Aug 2019 23:20 UTC
47 points
26 comments · 4 min read · LW link

Dath Ilan’s Views on Stopgap Corrigibility

David Udell · 22 Sep 2022 16:16 UTC
77 points
19 comments · 13 min read · LW link
(www.glowfic.com)

[Question] Simple question about corrigibility and values in AI.

jmh · 22 Oct 2022 2:59 UTC
6 points
1 comment · 1 min read · LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

5 Dec 2022 20:28 UTC
40 points
19 comments · 10 min read · LW link

CIRL Corrigibility is Fragile

21 Dec 2022 1:40 UTC
58 points
8 comments · 12 min read · LW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang · 16 Jan 2023 22:46 UTC
31 points
7 comments · 17 min read · LW link
(docs.google.com)

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

Christopher King · 20 Feb 2023 15:11 UTC
27 points
15 comments · 1 min read · LW link

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley · 21 Feb 2023 9:05 UTC
10 points
1 comment · 23 min read · LW link

Just How Hard a Problem is Alignment?

Roger Dearnaley · 25 Feb 2023 9:00 UTC
3 points
1 comment · 21 min read · LW link

Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth · 8 Aug 2022 18:05 UTC
136 points
12 comments · 3 min read · LW link

You can still fetch the coffee today if you’re dead tomorrow

davidad · 9 Dec 2022 14:06 UTC
96 points
19 comments · 5 min read · LW link

Solve Corrigibility Week

Logan Riggs · 28 Nov 2021 17:00 UTC
39 points
21 comments · 1 min read · LW link

A Pedagogical Guide to Corrigibility

A.H. · 17 Jan 2024 11:45 UTC
6 points
3 comments · 16 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · 25 May 2023 9:26 UTC
33 points
3 comments · 15 min read · LW link

Nash Bargaining between Subagents doesn’t solve the Shutdown Problem

A.H. · 25 Jan 2024 10:47 UTC
22 points
1 comment · 9 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC
40 points
12 comments · 31 min read · LW link

1. The CAST Strategy

Max Harms · 7 Jun 2024 22:29 UTC
46 points
19 comments · 38 min read · LW link

Relevance of ‘Harmful Intelligence’ Data in Training Datasets (WebText vs. Pile)

MiguelDev · 12 Oct 2023 12:08 UTC
12 points
0 comments · 9 min read · LW link

3a. Towards Formal Corrigibility

Max Harms · 9 Jun 2024 16:53 UTC
22 points
2 comments · 19 min read · LW link

3b. Formal (Faux) Corrigibility

Max Harms · 9 Jun 2024 17:18 UTC
21 points
13 comments · 17 min read · LW link

4. Existing Writing on Corrigibility

Max Harms · 10 Jun 2024 14:08 UTC
49 points
15 comments · 106 min read · LW link

A Shutdown Problem Proposal

21 Jan 2024 18:12 UTC
125 points
61 comments · 6 min read · LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr · 12 Feb 2021 7:55 UTC
15 points
0 comments · 27 min read · LW link

Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)

Roland Pihlakas · 12 Jan 2025 3:37 UTC
36 points
5 comments · 10 min read · LW link

Select Agent Specifications as Natural Abstractions

lukemarks · 7 Apr 2023 23:16 UTC
19 points
3 comments · 5 min read · LW link

Agentized LLMs will change the alignment landscape

Seth Herd · 9 Apr 2023 2:29 UTC
160 points
102 comments · 3 min read · LW link · 1 review

Paying the corrigibility tax

Max H · 19 Apr 2023 1:57 UTC
14 points
1 comment · 13 min read · LW link

Thinking about maximization and corrigibility

James Payor · 21 Apr 2023 21:22 UTC
63 points
4 comments · 5 min read · LW link

Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner & Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium

MiguelDev · 26 Apr 2023 1:37 UTC
14 points
5 comments · 10 min read · LW link

[Question] A Question about Corrigibility (2015)

A.H. · 27 Nov 2023 12:05 UTC
4 points
2 comments · 1 min read · LW link

Announcement: AI alignment prize round 4 winners

cousin_it · 20 Jan 2019 14:46 UTC
74 points
41 comments · 1 min read · LW link

Boeing 737 MAX MCAS as an agent corrigibility failure

Shmi · 16 Mar 2019 1:46 UTC
60 points
3 comments · 1 min read · LW link

«Boundaries/Membranes» and AI safety compilation

Chipmonk · 3 May 2023 21:41 UTC
57 points
17 comments · 8 min read · LW link

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

16 May 2023 10:53 UTC
26 points
0 comments · 13 min read · LW link

A Corrigibility Metaphore—Big Gambles

WCargo · 10 May 2023 18:13 UTC
16 points
0 comments · 4 min read · LW link

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus · 17 May 2023 14:13 UTC
21 points
4 comments · 13 min read · LW link

Collective Identity

18 May 2023 9:00 UTC
59 points
12 comments · 8 min read · LW link

Creating a self-referential system prompt for GPT-4

Ozyrus · 17 May 2023 14:13 UTC
3 points
1 comment · 3 min read · LW link

Mr. Meeseeks as an AI capability tripwire

Eric Zhang · 19 May 2023 11:33 UTC
37 points
17 comments · 2 min read · LW link

New paper: Corrigibility with Utility Preservation

Koen.Holtman · 6 Aug 2019 19:04 UTC
44 points
11 comments · 2 min read · LW link

Introducing Corrigibility (an FAI research subfield)

So8res · 20 Oct 2014 21:09 UTC
52 points
28 comments · 3 min read · LW link

Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program

Christopher King · 2 Jun 2023 21:54 UTC
7 points
4 comments · 16 min read · LW link

Shutdown-Seeking AI

Simon Goldstein · 31 May 2023 22:19 UTC
50 points
32 comments · 15 min read · LW link

Improvement on MIRI’s Corrigibility

9 Jun 2023 16:10 UTC
54 points
8 comments · 13 min read · LW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDev · 19 Jun 2023 2:32 UTC
4 points
2 comments · 7 min read · LW link

Exploring Functional Decision Theory (FDT) and a modified version (ModFDT)

MiguelDev · 5 Jul 2023 14:06 UTC
11 points
11 comments · 15 min read · LW link

[Question] What are some good examples of incorrigibility?

RyanCarey · 28 Apr 2019 0:22 UTC
23 points
17 comments · 1 min read · LW link

Corrigibility thoughts II: the robot operator

Stuart_Armstrong · 18 Jan 2017 15:52 UTC
3 points
2 comments · 2 min read · LW link

Winners of AI Alignment Awards Research Contest

13 Jul 2023 16:14 UTC
115 points
4 comments · 12 min read · LW link
(alignmentawards.com)

Train for incorrigibility, then reverse it (Shutdown Problem Contest Submission)

Daniel_Eth · 18 Jul 2023 8:26 UTC
9 points
1 comment · 1 min read · LW link

Only a hack can solve the shutdown problem

dp · 15 Jul 2023 20:26 UTC
5 points
0 comments · 8 min read · LW link

Corrigibility thoughts III: manipulating versus deceiving

Stuart_Armstrong · 18 Jan 2017 15:57 UTC
3 points
0 comments · 1 min read · LW link

Question: MIRI Corrigbility Agenda

algon33 · 13 Mar 2019 19:38 UTC
15 points
11 comments · 1 min read · LW link

Petrov corrigibility

Stuart_Armstrong · 11 Sep 2018 13:50 UTC
20 points
10 comments · 1 min read · LW link

Corrigibility doesn’t always have a good action to take

Stuart_Armstrong · 28 Aug 2018 20:30 UTC
19 points
0 comments · 1 min read · LW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername · 24 Aug 2023 3:53 UTC
1 point
0 comments · 6 min read · LW link

Invulnerable Incomplete Preferences: A Formal Statement

SCP · 30 Aug 2023 21:59 UTC
134 points
38 comments · 35 min read · LW link

Corrigibility as Constrained Optimisation

Henrik Åslund · 11 Apr 2019 20:09 UTC
15 points
3 comments · 5 min read · LW link

Instrumental Convergence Bounty

Logan Zoellner · 14 Sep 2023 14:02 UTC
62 points
24 comments · 1 min read · LW link

How useful is Corrigibility?

martinkunev · 12 Sep 2023 0:05 UTC
11 points
4 comments · 5 min read · LW link

Three AI Safety Related Ideas

Wei Dai · 13 Dec 2018 21:32 UTC
69 points
38 comments · 2 min read · LW link

Counterfactual Planning in AGI Systems

Koen.Holtman · 3 Feb 2021 13:54 UTC
10 points
0 comments · 5 min read · LW link

Creating AGI Safety Interlocks

Koen.Holtman · 5 Feb 2021 12:01 UTC
7 points
4 comments · 8 min read · LW link

Disentangling Corrigibility: 2015-2021

Koen.Holtman · 16 Feb 2021 18:01 UTC
22 points
20 comments · 9 min read · LW link

Safely controlling the AGI agent reward function

Koen.Holtman · 17 Feb 2021 14:47 UTC
8 points
0 comments · 5 min read · LW link

Information bottleneck for counterfactual corrigibility

tailcalled · 6 Dec 2021 17:11 UTC
8 points
1 comment · 7 min read · LW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver Sourbut · 16 Dec 2021 1:07 UTC
16 points
0 comments · 42 min read · LW link

Question 3: Control proposals for minimizing bad outcomes

Cameron Berg · 12 Feb 2022 19:13 UTC
5 points
1 comment · 7 min read · LW link

Updating Utility Functions

9 May 2022 9:44 UTC
41 points
6 comments · 8 min read · LW link

How RL Agents Behave When Their Actions Are Modified? [Distillation post]

PabloAMC · 20 May 2022 18:47 UTC
22 points
0 comments · 8 min read · LW link

Infernal Corrigibility, Fiendishly Difficult

David Udell · 27 May 2022 20:32 UTC
20 points
1 comment · 13 min read · LW link

Machines vs Memes Part 3: Imitation and Memes

ceru23 · 1 Jun 2022 13:36 UTC
7 points
0 comments · 7 min read · LW link

Simulators

janus · 2 Sep 2022 12:45 UTC
614 points
168 comments · 41 min read · LW link · 8 reviews
(generative.ink)