
Corrigibility

Last edit: 27 Mar 2023 15:48 UTC by Yaakov T

A ‘corrigible’ agent is one that does not interfere with what we would intuitively see as attempts to ‘correct’ the agent, or to ‘correct’ our mistakes in building it, and that permits these ‘corrections’ despite the apparently instrumentally convergent incentives to resist them.
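As a toy illustration (not from this page itself, and with all numbers hypothetical): a plain expected-utility maximizer prefers to disable its own shutdown button, because shutdown scores worse on its current goal; a "utility indifference"-style correction, in the spirit of the MIRI corrigibility work listed below, removes that incentive by compensating the agent on shutdown with exactly its counterfactual payoff.

```python
# Toy sketch of the shutdown-button incentive problem (hypothetical numbers).
# The agent picks one action before the operator may press the shutdown button:
#   "allow"   - leave the button functional
#   "disable" - disable the button and keep pursuing its goal
P_PRESS = 0.5       # chance the operator presses the button
U_GOAL = 10.0       # utility from completing the original goal
U_SHUTDOWN = 0.0    # utility the agent assigns to being shut down

def naive_value(action: str) -> float:
    """Expected utility for a plain maximizer of U_GOAL."""
    if action == "disable":
        return U_GOAL  # shutdown can never happen
    return P_PRESS * U_SHUTDOWN + (1 - P_PRESS) * U_GOAL

def indifferent_value(action: str) -> float:
    """Utility-indifference-style correction: if shut down, the agent is
    credited exactly what it expected had the button not been pressed."""
    if action == "disable":
        return U_GOAL
    compensation = U_GOAL  # counterfactual payoff on shutdown
    return P_PRESS * compensation + (1 - P_PRESS) * U_GOAL

# The naive agent strictly prefers disabling the button...
assert naive_value("disable") > naive_value("allow")
# ...while the corrected agent no longer gains by resisting correction.
assert indifferent_value("allow") == indifferent_value("disable")
```

This sketch only shows the incentive structure; as several posts below argue, indifference alone leaves further problems (e.g. the agent has no positive incentive to preserve the button either).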


Corrigibility

paulfchristiano, 27 Nov 2018 21:50 UTC
53 points
8 comments · 6 min read · LW link

Let’s See You Write That Corrigibility Tag

Eliezer Yudkowsky, 19 Jun 2022 21:11 UTC
110 points
68 comments · 1 min read · LW link

A Gym Gridworld Environment for the Treacherous Turn

Michaël Trazzi, 28 Jul 2018 21:27 UTC
73 points
9 comments · 3 min read · LW link
(github.com)

Corrigibility Via Thought-Process Deference

Thane Ruthenis, 24 Nov 2022 17:06 UTC
14 points
5 comments · 9 min read · LW link

Corrigibility as outside view

TurnTrout, 8 May 2020 21:56 UTC
36 points
11 comments · 4 min read · LW link

Can corrigibility be learned safely?

Wei_Dai, 1 Apr 2018 23:07 UTC
35 points
115 comments · 4 min read · LW link

Thoughts on implementing corrigible robust alignment

Steven Byrnes, 26 Nov 2019 14:06 UTC
26 points
2 comments · 6 min read · LW link

An Idea For Corrigible, Recursively Improving Math Oracles

jimrandomh, 20 Jul 2015 3:35 UTC
7 points
0 comments · 2 min read · LW link

Corrigible omniscient AI capable of making clones

Kaj_Sotala, 22 Mar 2015 12:19 UTC
5 points
0 comments · 1 min read · LW link
(www.sharelatex.com)

Corrigible but misaligned: a superintelligent messiah

zhukeepa, 1 Apr 2018 6:20 UTC
28 points
26 comments · 5 min read · LW link

The limits of corrigibility

Stuart_Armstrong, 10 Apr 2018 10:49 UTC
27 points
9 comments · 4 min read · LW link

Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence.

RyanCarey, 21 Oct 2018 12:03 UTC
23 points
1 comment · 6 min read · LW link

Towards a mechanistic understanding of corrigibility

evhub, 22 Aug 2019 23:20 UTC
46 points
26 comments · 6 min read · LW link

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes, 3 Aug 2020 14:29 UTC
55 points
35 comments · 4 min read · LW link

Cake, or death!

Stuart_Armstrong, 25 Oct 2012 10:33 UTC
47 points
13 comments · 4 min read · LW link

Do what we mean vs. do what we say

Rohin Shah, 30 Aug 2018 22:03 UTC
34 points
14 comments · 1 min read · LW link

AI Alignment 2018-19 Review

Rohin Shah, 28 Jan 2020 2:19 UTC
125 points
6 comments · 35 min read · LW link

Non-Obstruction: A Simple Concept Motivating Corrigibility

TurnTrout, 21 Nov 2020 19:35 UTC
68 points
19 comments · 19 min read · LW link

A Critique of Non-Obstruction

Joe_Collman, 3 Feb 2021 8:45 UTC
13 points
10 comments · 4 min read · LW link

Solving the whole AGI control problem, version 0.0001

Steven Byrnes, 8 Apr 2021 15:14 UTC
60 points
7 comments · 26 min read · LW link

AXRP Episode 8 - Assistance Games with Dylan Hadfield-Menell

DanielFilan, 8 Jun 2021 23:20 UTC
22 points
1 comment · 71 min read · LW link

Reward Is Not Enough

Steven Byrnes, 16 Jun 2021 13:52 UTC
110 points
19 comments · 10 min read · LW link · 1 review

Model-based RL, Desires, Brains, Wireheading

Steven Byrnes, 14 Jul 2021 15:11 UTC
18 points
1 comment · 13 min read · LW link

Corrigibility Can Be VNM-Incoherent

TurnTrout, 20 Nov 2021 0:30 UTC
65 points
24 comments · 7 min read · LW link

Formalizing Policy-Modification Corrigibility

TurnTrout, 3 Dec 2021 1:31 UTC
23 points
6 comments · 6 min read · LW link

Solve Corrigibility Week

Logan Riggs, 28 Nov 2021 17:00 UTC
39 points
21 comments · 1 min read · LW link

Consequentialism & corrigibility

Steven Byrnes, 14 Dec 2021 13:23 UTC
60 points
27 comments · 7 min read · LW link

A broad basin of attraction around human values?

Wei_Dai, 12 Apr 2022 5:15 UTC
106 points
17 comments · 2 min read · LW link

[Intro to brain-like-AGI safety] 14. Controlled AGI

Steven Byrnes, 11 May 2022 13:17 UTC
33 points
25 comments · 19 min read · LW link

On corrigibility and its basin

Donald Hobson, 20 Jun 2022 16:33 UTC
16 points
3 comments · 2 min read · LW link

Another view of quantilizers: avoiding Goodhart’s Law

jessicata, 9 Jan 2016 4:02 UTC
24 points
1 comment · 2 min read · LW link

[Question] What is wrong with this approach to corrigibility?

Rafael Cosman, 12 Jul 2022 22:55 UTC
7 points
8 comments · 1 min read · LW link

A first look at the hard problem of corrigibility

jessicata, 15 Oct 2015 20:16 UTC
11 points
0 comments · 4 min read · LW link

Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth, 8 Aug 2022 18:05 UTC
115 points
8 comments · 3 min read · LW link

CHAI, Assistance Games, And Fully-Updated Deference [Scott Alexander]

berglund, 4 Oct 2022 17:04 UTC
21 points
1 comment · 17 min read · LW link
(astralcodexten.substack.com)

Infinite Possibility Space and the Shutdown Problem

magfrump, 18 Oct 2022 5:37 UTC
6 points
0 comments · 2 min read · LW link
(www.magfrump.net)

People care about each other even though they have imperfect motivational pointers?

TurnTrout, 8 Nov 2022 18:15 UTC
32 points
25 comments · 7 min read · LW link

Consequentialists: One-Way Pattern Traps

David Udell, 16 Jan 2023 20:48 UTC
47 points
3 comments · 14 min read · LW link

[Question] Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility “real”

joraine, 24 Nov 2022 5:08 UTC
26 points
11 comments · 1 min read · LW link

You can still fetch the coffee today if you’re dead tomorrow

davidad, 9 Dec 2022 14:06 UTC
60 points
16 comments · 5 min read · LW link

Contrary to List of Lethality’s point 22, alignment’s door number 2

False Name, 14 Dec 2022 22:01 UTC
0 points
1 comment · 22 min read · LW link

Take 14: Corrigibility isn’t that great.

Charlie Steiner, 25 Dec 2022 13:04 UTC
15 points
3 comments · 3 min read · LW link

[Question] Training for corrigability: obvious problems?

Ben Amitay, 24 Feb 2023 14:02 UTC
4 points
5 comments · 1 min read · LW link

Announcement: AI alignment prize round 4 winners

cousin_it, 20 Jan 2019 14:46 UTC
74 points
41 comments · 1 min read · LW link

Boeing 737 MAX MCAS as an agent corrigibility failure

shminux, 16 Mar 2019 1:46 UTC
60 points
3 comments · 1 min read · LW link

New paper: Corrigibility with Utility Preservation

Koen.Holtman, 6 Aug 2019 19:04 UTC
35 points
11 comments · 2 min read · LW link

Introducing Corrigibility (an FAI research subfield)

So8res, 20 Oct 2014 21:09 UTC
52 points
28 comments · 3 min read · LW link

[Question] What are some good examples of incorrigibility?

RyanCarey, 28 Apr 2019 0:22 UTC
23 points
17 comments · 1 min read · LW link

Corrigibility thoughts II: the robot operator

Stuart_Armstrong, 18 Jan 2017 15:52 UTC
3 points
2 comments · 2 min read · LW link

Corrigibility thoughts III: manipulating versus deceiving

Stuart_Armstrong, 18 Jan 2017 15:57 UTC
3 points
0 comments · 1 min read · LW link

Question: MIRI Corrigbility Agenda

algon33, 13 Mar 2019 19:38 UTC
15 points
11 comments · 1 min read · LW link

Petrov corrigibility

Stuart_Armstrong, 11 Sep 2018 13:50 UTC
20 points
10 comments · 1 min read · LW link

Corrigibility doesn’t always have a good action to take

Stuart_Armstrong, 28 Aug 2018 20:30 UTC
19 points
0 comments · 1 min read · LW link

Corrigibility as Constrained Optimisation

Henrik Åslund, 11 Apr 2019 20:09 UTC
15 points
3 comments · 5 min read · LW link

Three AI Safety Related Ideas

Wei_Dai, 13 Dec 2018 21:32 UTC
68 points
38 comments · 2 min read · LW link

Counterfactual Planning in AGI Systems

Koen.Holtman, 3 Feb 2021 13:54 UTC
8 points
0 comments · 5 min read · LW link

Creating AGI Safety Interlocks

Koen.Holtman, 5 Feb 2021 12:01 UTC
7 points
4 comments · 8 min read · LW link

Disentangling Corrigibility: 2015-2021

Koen.Holtman, 16 Feb 2021 18:01 UTC
18 points
20 comments · 9 min read · LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr, 12 Feb 2021 7:55 UTC
15 points
0 comments · 26 min read · LW link

Safely controlling the AGI agent reward function

Koen.Holtman, 17 Feb 2021 14:47 UTC
7 points
0 comments · 5 min read · LW link

Information bottleneck for counterfactual corrigibility

tailcalled, 6 Dec 2021 17:11 UTC
8 points
1 comment · 7 min read · LW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver Sourbut, 16 Dec 2021 1:07 UTC
16 points
0 comments · 42 min read · LW link

Question 3: Control proposals for minimizing bad outcomes

Cameron Berg, 12 Feb 2022 19:13 UTC
5 points
1 comment · 7 min read · LW link

Updating Utility Functions

9 May 2022 9:44 UTC
36 points
6 comments · 8 min read · LW link

How RL Agents Behave When Their Actions Are Modified? [Distillation post]

PabloAMC, 20 May 2022 18:47 UTC
21 points
0 comments · 8 min read · LW link

Infernal Corrigibility, Fiendishly Difficult

David Udell, 27 May 2022 20:32 UTC
16 points
1 comment · 13 min read · LW link

Machines vs Memes Part 3: Imitation and Memes

ceru23, 1 Jun 2022 13:36 UTC
5 points
0 comments · 7 min read · LW link

Simulators

janus, 2 Sep 2022 12:45 UTC
575 points
111 comments · 41 min read · LW link
(generative.ink)

Dath Ilan’s Views on Stopgap Corrigibility

David Udell, 22 Sep 2022 16:16 UTC
52 points
18 comments · 13 min read · LW link
(www.glowfic.com)

[Question] Simple question about corrigibility and values in AI.

jmh, 22 Oct 2022 2:59 UTC
6 points
1 comment · 1 min read · LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

5 Dec 2022 20:28 UTC
38 points
17 comments · 10 min read · LW link

CIRL Corrigibility is Fragile

21 Dec 2022 1:40 UTC
26 points
6 comments · 12 min read · LW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang, 16 Jan 2023 22:46 UTC
31 points
7 comments · 17 min read · LW link
(docs.google.com)

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

Christopher King, 20 Feb 2023 15:11 UTC
12 points
10 comments · 1 min read · LW link

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley, 21 Feb 2023 9:05 UTC
4 points
0 comments · 23 min read · LW link

Just How Hard a Problem is Alignment?

Roger Dearnaley, 25 Feb 2023 9:00 UTC
−1 points
1 comment · 21 min read · LW link