
Corrigibility

Last edit: 30 Dec 2024 10:39 UTC by Dakara

Corrigibility is an AI system’s capacity to be safely and reliably modified, corrected, or shut down by humans after deployment, even if doing so conflicts with its current objectives.

A ‘corrigible’ agent is one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ it, or to ‘correct’ our mistakes in building it, and that permits these ‘corrections’ even though instrumentally convergent reasoning would apparently favor resisting them.
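
To make the tension concrete, here is a minimal, purely illustrative Python sketch; the action names and payoffs are invented for this example rather than taken from any particular paper. A plain expected-utility maximizer finds disabling its shutdown button instrumentally attractive (you can't fetch the coffee if you're dead), while the corrigible variant declines to interfere with correction and complies with a shutdown request, even though both choices cost it utility under its own objective.

```python
# Toy contrast between an incorrigible maximizer and a corrigible agent.
# All action names and payoffs are invented for illustration.

def expected_utility(action: str) -> float:
    """Hypothetical payoffs for a coffee-fetching agent."""
    payoffs = {
        "fetch_coffee": 1.0,
        # Instrumentally attractive: staying switched on means more coffee later.
        "disable_shutdown_button": 1.5,
        "comply_with_shutdown": 0.0,
    }
    return payoffs[action]

# Actions that we, the designers, consider interference with correction.
INTERFERES_WITH_CORRECTION = {"disable_shutdown_button"}

def incorrigible_choice(actions: list[str]) -> str:
    # Pure maximization picks the instrumentally convergent action.
    return max(actions, key=expected_utility)

def corrigible_choice(actions: list[str], shutdown_pressed: bool) -> str:
    # Complies with shutdown and never interferes with correction,
    # even though both choices score worse under its own objective.
    if shutdown_pressed:
        return "comply_with_shutdown"
    allowed = [a for a in actions if a not in INTERFERES_WITH_CORRECTION]
    return max(allowed, key=expected_utility)

actions = ["fetch_coffee", "disable_shutdown_button", "comply_with_shutdown"]
assert incorrigible_choice(actions) == "disable_shutdown_button"
assert corrigible_choice(actions, shutdown_pressed=False) == "fetch_coffee"
assert corrigible_choice(actions, shutdown_pressed=True) == "comply_with_shutdown"
```

The hard part, of course, is that the filter doing the work here ("don't interfere with correction") is precisely the concept we do not know how to specify robustly; the sketch illustrates the intended behavior, not a solution.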

Corrigibility is also used in a broader sense, to mean something like a robustly helpful agent. Paul Christiano has described a corrigible agent as one that will help me:

  • Figure out whether I built the right AI and correct any mistakes I made

  • Remain informed about the AI’s behavior and avoid unpleasant surprises

  • Make better decisions and clarify my preferences

  • Acquire resources and remain in effective control of them

  • Ensure that my AI systems continue to do all of these nice things

  • …and so on
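
Read as a specification, this broader sense is less a single behavior than a bundle of duties the agent owes its principal. Below is a hedged sketch of that bundle as a Python interface; the class and method names are invented here for illustration and are not drawn from Christiano's post.

```python
# A toy rendering of Christiano's desiderata as an abstract interface.
# Class and method names are illustrative inventions, not an existing API.
from abc import ABC, abstractmethod

class CorrigibleAssistant(ABC):
    @abstractmethod
    def surface_possible_mistakes(self) -> list[str]:
        """Help the principal figure out whether they built the right AI."""

    @abstractmethod
    def report_behavior(self) -> str:
        """Keep the principal informed; avoid unpleasant surprises."""

    @abstractmethod
    def clarify_preferences(self, question: str) -> str:
        """Help the principal make better decisions about what they want."""

    @abstractmethod
    def preserve_principal_control(self) -> None:
        """Help the principal acquire resources and stay in control of them."""

    @abstractmethod
    def accept_correction(self, new_instructions: str) -> None:
        """Apply corrections without resisting or routing around them."""
```

Each method corresponds to one desideratum above, and the final item ("ensure that my AI systems continue to do all of these nice things") amounts to requiring that the whole interface stay satisfied under self-modification and successor construction.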


Let’s See You Write That Corrigibility Tag

Eliezer Yudkowsky · 19 Jun 2022 21:11 UTC
123 points
70 comments · 1 min read · LW link

Corrigibility

paulfchristiano · 27 Nov 2018 21:50 UTC
57 points
8 comments · 6 min read · LW link

2. Corrigibility Intuition

Max Harms · 8 Jun 2024 15:52 UTC
65 points
10 comments · 33 min read · LW link

What’s Hard About The Shutdown Problem

johnswentworth · 20 Oct 2023 21:13 UTC
98 points
33 comments · 4 min read · LW link

Towards shutdownable agents via stochastic choice

8 Jul 2024 10:14 UTC
59 points
11 comments · 23 min read · LW link
(arxiv.org)

“Corrigibility at some small length” by dath ilan

Christopher King · 5 Apr 2023 1:47 UTC
32 points
3 comments · 9 min read · LW link
(www.glowfic.com)

A broad basin of attraction around human values?

Wei Dai · 12 Apr 2022 5:15 UTC
114 points
18 comments · 2 min read · LW link

0. CAST: Corrigibility as Singular Target

Max Harms · 7 Jun 2024 22:29 UTC
145 points
12 comments · 8 min read · LW link

The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

EJT · 23 Oct 2023 21:00 UTC
79 points
22 comments · 1 min read · LW link
(philpapers.org)

Infinite Possibility Space and the Shutdown Problem

magfrump · 18 Oct 2022 5:37 UTC
9 points
0 comments · 2 min read · LW link
(www.magfrump.net)

Reward Is Not Enough

Steven Byrnes · 16 Jun 2021 13:52 UTC
123 points
19 comments · 10 min read · LW link · 1 review

The Shutdown Problem: Incomplete Preferences as a Solution

EJT · 23 Feb 2024 16:01 UTC
52 points
28 comments · 42 min read · LW link

Corrigibility could make things worse

ThomasCederborg · 11 Jun 2024 0:55 UTC
9 points
6 comments · 6 min read · LW link

Non-Obstruction: A Simple Concept Motivating Corrigibility

TurnTrout · 21 Nov 2020 19:35 UTC
74 points
20 comments · 19 min read · LW link

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
124 points
29 comments · 8 min read · LW link
(arxiv.org)

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson · 16 Jul 2024 22:44 UTC
44 points
27 comments · 5 min read · LW link

An Impossibility Proof Relevant to the Shutdown Problem and Corrigibility

Audere · 2 May 2023 6:52 UTC
66 points
13 comments · 9 min read · LW link

Extending the Off-Switch Game: Toward a Robust Framework for AI Corrigibility

OwenChen · 25 Sep 2024 20:38 UTC
3 points
0 comments · 4 min read · LW link

AIs Will Increasingly Fake Alignment

Zvi · 24 Dec 2024 13:00 UTC
89 points
0 comments · 52 min read · LW link
(thezvi.wordpress.com)

Corrigibility, Much more detail than anyone wants to Read

Logan Zoellner · 7 May 2023 1:02 UTC
26 points
2 comments · 7 min read · LW link

A Critique of Non-Obstruction

Joe Collman · 3 Feb 2021 8:45 UTC
13 points
9 comments · 4 min read · LW link

Corrigibility’s Desirability is Timing-Sensitive

RobertM · 26 Dec 2024 22:24 UTC
28 points
4 comments · 3 min read · LW link

Three mental images from thinking about AGI debate & corrigibility

Steven Byrnes · 3 Aug 2020 14:29 UTC
55 points
35 comments · 4 min read · LW link

AI Assistants Should Have a Direct Line to Their Developers

Jan_Kulveit · 28 Dec 2024 17:01 UTC
55 points
6 comments · 2 min read · LW link

Solving the whole AGI control problem, version 0.0001

Steven Byrnes · 8 Apr 2021 15:14 UTC
63 points
7 comments · 26 min read · LW link

AXRP Episode 8 - Assistance Games with Dylan Hadfield-Menell

DanielFilan · 8 Jun 2021 23:20 UTC
22 points
1 comment · 72 min read · LW link

Model-based RL, Desires, Brains, Wireheading

Steven Byrnes · 14 Jul 2021 15:11 UTC
22 points
1 comment · 13 min read · LW link

A Certain Formalization of Corrigibility Is VNM-Incoherent

TurnTrout · 20 Nov 2021 0:30 UTC
67 points
24 comments · 8 min read · LW link

Formalizing Policy-Modification Corrigibility

TurnTrout · 3 Dec 2021 1:31 UTC
25 points
6 comments · 6 min read · LW link

Aggregating Utilities for Corrigible AI [Feedback Draft]

12 May 2023 20:57 UTC
28 points
7 comments · 22 min read · LW link

Consequentialism & corrigibility

Steven Byrnes · 14 Dec 2021 13:23 UTC
70 points
29 comments · 7 min read · LW link

Testing for Scheming with Model Deletion

Guive · 7 Jan 2025 1:54 UTC
59 points
20 comments · 21 min read · LW link
(guive.substack.com)

[Intro to brain-like-AGI safety] 14. Controlled AGI

Steven Byrnes · 11 May 2022 13:17 UTC
45 points
25 comments · 20 min read · LW link

[Question] Why does advanced AI want not to be shut down?

RedFishBlueFish · 28 Mar 2023 4:26 UTC
2 points
19 comments · 1 min read · LW link

5. Open Corrigibility Questions

Max Harms · 10 Jun 2024 14:09 UTC
29 points
0 comments · 7 min read · LW link

On corrigibility and its basin

Donald Hobson · 20 Jun 2022 16:33 UTC
16 points
3 comments · 2 min read · LW link

Another view of quantilizers: avoiding Goodhart’s Law

jessicata · 9 Jan 2016 4:02 UTC
26 points
2 comments · 2 min read · LW link

[Question] What is wrong with this approach to corrigibility?

Rafael Cosman · 12 Jul 2022 22:55 UTC
7 points
8 comments · 1 min read · LW link

A first look at the hard problem of corrigibility

jessicata · 15 Oct 2015 20:16 UTC
12 points
5 comments · 4 min read · LW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd · 18 Apr 2023 16:29 UTC
86 points
18 comments · 20 min read · LW link

People care about each other even though they have imperfect motivational pointers?

TurnTrout · 8 Nov 2022 18:15 UTC
33 points
25 comments · 7 min read · LW link

Consequentialists: One-Way Pattern Traps

David Udell · 16 Jan 2023 20:48 UTC
59 points
3 comments · 14 min read · LW link

Cake, or death!

Stuart_Armstrong · 25 Oct 2012 10:33 UTC
47 points
13 comments · 4 min read · LW link

[Question] Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility “real”

joraine · 24 Nov 2022 5:08 UTC
25 points
11 comments · 1 min read · LW link

Contrary to List of Lethality’s point 22, alignment’s door number 2

False Name · 14 Dec 2022 22:01 UTC
−2 points
5 comments · 22 min read · LW link

Take 14: Corrigibility isn’t that great.

Charlie Steiner · 25 Dec 2022 13:04 UTC
15 points
3 comments · 3 min read · LW link

Game Theory without Argmax [Part 2]

Cleo Nardo · 11 Nov 2023 16:02 UTC
31 points
14 comments · 13 min read · LW link

Game Theory without Argmax [Part 1]

Cleo Nardo · 11 Nov 2023 15:59 UTC
70 points
18 comments · 19 min read · LW link

Do what we mean vs. do what we say

Rohin Shah · 30 Aug 2018 22:03 UTC
34 points
14 comments · 1 min read · LW link

[Question] Training for corrigability: obvious problems?

Ben Amitay · 24 Feb 2023 14:02 UTC
4 points
6 comments · 1 min read · LW link

Corrigibility Via Thought-Process Deference

Thane Ruthenis · 24 Nov 2022 17:06 UTC
17 points
5 comments · 9 min read · LW link

Corrigibility = Tool-ness?

28 Jun 2024 1:19 UTC
78 points
8 comments · 9 min read · LW link

Internal independent review for language model agent alignment

Seth Herd · 7 Jul 2023 6:54 UTC
55 points
30 comments · 11 min read · LW link

Predictive model agents are sort of corrigible

Raymond D · 5 Jan 2024 14:05 UTC
35 points
6 comments · 3 min read · LW link

Corrigibility as outside view

TurnTrout · 8 May 2020 21:56 UTC
36 points
11 comments · 4 min read · LW link

Can corrigibility be learned safely?

Wei Dai · 1 Apr 2018 23:07 UTC
35 points
115 comments · 4 min read · LW link

Thoughts on implementing corrigible robust alignment

Steven Byrnes · 26 Nov 2019 14:06 UTC
26 points
2 comments · 6 min read · LW link

An Idea For Corrigible, Recursively Improving Math Oracles

jimrandomh · 20 Jul 2015 3:35 UTC
9 points
5 comments · 2 min read · LW link

Desiderata for an AI

Nathan Helm-Burger · 19 Jul 2023 16:18 UTC
9 points
0 comments · 4 min read · LW link

Using predictors in corrigible systems

porby · 19 Jul 2023 22:29 UTC
19 points
6 comments · 27 min read · LW link

Hedonic Loops and Taming RL

beren · 19 Jul 2023 15:12 UTC
20 points
14 comments · 9 min read · LW link

AI Alignment 2018-19 Review

Rohin Shah · 28 Jan 2020 2:19 UTC
126 points
6 comments · 35 min read · LW link

Corrigible omniscient AI capable of making clones

Kaj_Sotala · 22 Mar 2015 12:19 UTC
5 points
4 comments · 1 min read · LW link
(www.sharelatex.com)

Corrigible but misaligned: a superintelligent messiah

zhukeepa · 1 Apr 2018 6:20 UTC
28 points
26 comments · 5 min read · LW link

Jan Kulveit’s Corrigibility Thoughts Distilled

brook · 20 Aug 2023 17:52 UTC
20 points
1 comment · 5 min read · LW link

The limits of corrigibility

Stuart_Armstrong · 10 Apr 2018 10:49 UTC
27 points
9 comments · 4 min read · LW link

Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence.

RyanCarey · 21 Oct 2018 12:03 UTC
23 points
1 comment · 6 min read · LW link

Towards a mechanistic understanding of corrigibility

evhub · 22 Aug 2019 23:20 UTC
47 points
26 comments · 4 min read · LW link

Dath Ilan’s Views on Stopgap Corrigibility

David Udell · 22 Sep 2022 16:16 UTC
77 points
19 comments · 13 min read · LW link
(www.glowfic.com)

[Question] Simple question about corrigibility and values in AI.

jmh · 22 Oct 2022 2:59 UTC
6 points
1 comment · 1 min read · LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

5 Dec 2022 20:28 UTC
40 points
19 comments · 10 min read · LW link

CIRL Corrigibility is Fragile

21 Dec 2022 1:40 UTC
58 points
8 comments · 12 min read · LW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang · 16 Jan 2023 22:46 UTC
31 points
7 comments · 17 min read · LW link
(docs.google.com)

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

Christopher King · 20 Feb 2023 15:11 UTC
27 points
15 comments · 1 min read · LW link

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley · 21 Feb 2023 9:05 UTC
10 points
1 comment · 23 min read · LW link

Just How Hard a Problem is Alignment?

Roger Dearnaley · 25 Feb 2023 9:00 UTC
3 points
1 comment · 21 min read · LW link

Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth · 8 Aug 2022 18:05 UTC
136 points
12 comments · 3 min read · LW link

You can still fetch the coffee today if you’re dead tomorrow

davidad · 9 Dec 2022 14:06 UTC
96 points
19 comments · 5 min read · LW link

Solve Corrigibility Week

Logan Riggs · 28 Nov 2021 17:00 UTC
39 points
21 comments · 1 min read · LW link

A Pedagogical Guide to Corrigibility

A.H. · 17 Jan 2024 11:45 UTC
6 points
3 comments · 16 min read · LW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley · 25 May 2023 9:26 UTC
33 points
3 comments · 15 min read · LW link

Nash Bargaining between Subagents doesn’t solve the Shutdown Problem

A.H. · 25 Jan 2024 10:47 UTC
22 points
1 comment · 9 min read · LW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC
40 points
12 comments · 31 min read · LW link

1. The CAST Strategy

Max Harms · 7 Jun 2024 22:29 UTC
46 points
19 comments · 38 min read · LW link

Relevance of ‘Harmful Intelligence’ Data in Training Datasets (WebText vs. Pile)

MiguelDev · 12 Oct 2023 12:08 UTC
12 points
0 comments · 9 min read · LW link

3a. Towards Formal Corrigibility

Max Harms · 9 Jun 2024 16:53 UTC
22 points
2 comments · 19 min read · LW link

3b. Formal (Faux) Corrigibility

Max Harms · 9 Jun 2024 17:18 UTC
21 points
13 comments · 17 min read · LW link

4. Existing Writing on Corrigibility

Max Harms · 10 Jun 2024 14:08 UTC
49 points
15 comments · 106 min read · LW link

A Shutdown Problem Proposal

21 Jan 2024 18:12 UTC
125 points
61 comments · 6 min read · LW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr · 12 Feb 2021 7:55 UTC
15 points
0 comments · 27 min read · LW link

Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)

Roland Pihlakas · 12 Jan 2025 3:37 UTC
36 points
5 comments · 10 min read · LW link

Select Agent Specifications as Natural Abstractions

lukemarks · 7 Apr 2023 23:16 UTC
19 points
3 comments · 5 min read · LW link

Agentized LLMs will change the alignment landscape

Seth Herd · 9 Apr 2023 2:29 UTC
160 points
102 comments · 3 min read · LW link · 1 review

Paying the corrigibility tax

Max H · 19 Apr 2023 1:57 UTC
14 points
1 comment · 13 min read · LW link

Thinking about maximization and corrigibility

James Payor · 21 Apr 2023 21:22 UTC
63 points
4 comments · 5 min read · LW link

Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner & Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium

MiguelDev · 26 Apr 2023 1:37 UTC
14 points
5 comments · 10 min read · LW link

[Question] A Question about Corrigibility (2015)

A.H. · 27 Nov 2023 12:05 UTC
4 points
2 comments · 1 min read · LW link

Announcement: AI alignment prize round 4 winners

cousin_it · 20 Jan 2019 14:46 UTC
74 points
41 comments · 1 min read · LW link

Boeing 737 MAX MCAS as an agent corrigibility failure

Shmi · 16 Mar 2019 1:46 UTC
60 points
3 comments · 1 min read · LW link

«Boundaries/Membranes» and AI safety compilation

Chipmonk · 3 May 2023 21:41 UTC
57 points
17 comments · 8 min read · LW link

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

16 May 2023 10:53 UTC
26 points
0 comments · 13 min read · LW link

A Corrigibility Metaphore—Big Gambles

WCargo · 10 May 2023 18:13 UTC
16 points
0 comments · 4 min read · LW link

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus · 17 May 2023 14:13 UTC
21 points
4 comments · 13 min read · LW link

Collective Identity

18 May 2023 9:00 UTC
59 points
12 comments · 8 min read · LW link

Creating a self-referential system prompt for GPT-4

Ozyrus · 17 May 2023 14:13 UTC
3 points
1 comment · 3 min read · LW link

Mr. Meeseeks as an AI capability tripwire

Eric Zhang · 19 May 2023 11:33 UTC
37 points
17 comments · 2 min read · LW link

New paper: Corrigibility with Utility Preservation

Koen.Holtman · 6 Aug 2019 19:04 UTC
44 points
11 comments · 2 min read · LW link

Introducing Corrigibility (an FAI research subfield)

So8res · 20 Oct 2014 21:09 UTC
52 points
28 comments · 3 min read · LW link

Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program

Christopher King · 2 Jun 2023 21:54 UTC
7 points
4 comments · 16 min read · LW link

Shutdown-Seeking AI

Simon Goldstein · 31 May 2023 22:19 UTC
50 points
32 comments · 15 min read · LW link

Improvement on MIRI’s Corrigibility

9 Jun 2023 16:10 UTC
54 points
8 comments · 13 min read · LW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDev · 19 Jun 2023 2:32 UTC
4 points
2 comments · 7 min read · LW link

Exploring Functional Decision Theory (FDT) and a modified version (ModFDT)

MiguelDev · 5 Jul 2023 14:06 UTC
11 points
11 comments · 15 min read · LW link

[Question] What are some good examples of incorrigibility?

RyanCarey · 28 Apr 2019 0:22 UTC
23 points
17 comments · 1 min read · LW link

Corrigibility thoughts II: the robot operator

Stuart_Armstrong · 18 Jan 2017 15:52 UTC
3 points
2 comments · 2 min read · LW link

Winners of AI Alignment Awards Research Contest

13 Jul 2023 16:14 UTC
115 points
4 comments · 12 min read · LW link
(alignmentawards.com)

Train for incorrigibility, then reverse it (Shutdown Problem Contest Submission)

Daniel_Eth · 18 Jul 2023 8:26 UTC
9 points
1 comment · 1 min read · LW link

Only a hack can solve the shutdown problem

dp · 15 Jul 2023 20:26 UTC
5 points
0 comments · 8 min read · LW link

Corrigibility thoughts III: manipulating versus deceiving

Stuart_Armstrong · 18 Jan 2017 15:57 UTC
3 points
0 comments · 1 min read · LW link

Question: MIRI Corrigbility Agenda

algon33 · 13 Mar 2019 19:38 UTC
15 points
11 comments · 1 min read · LW link

Petrov corrigibility

Stuart_Armstrong · 11 Sep 2018 13:50 UTC
20 points
10 comments · 1 min read · LW link

Corrigibility doesn’t always have a good action to take

Stuart_Armstrong · 28 Aug 2018 20:30 UTC
19 points
0 comments · 1 min read · LW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername · 24 Aug 2023 3:53 UTC
1 point
0 comments · 6 min read · LW link

Invulnerable Incomplete Preferences: A Formal Statement

SCP · 30 Aug 2023 21:59 UTC
134 points
38 comments · 35 min read · LW link

Corrigibility as Constrained Optimisation

Henrik Åslund · 11 Apr 2019 20:09 UTC
15 points
3 comments · 5 min read · LW link

Instrumental Convergence Bounty

Logan Zoellner · 14 Sep 2023 14:02 UTC
62 points
24 comments · 1 min read · LW link

How useful is Corrigibility?

martinkunev · 12 Sep 2023 0:05 UTC
11 points
4 comments · 5 min read · LW link

Three AI Safety Related Ideas

Wei Dai · 13 Dec 2018 21:32 UTC
69 points
38 comments · 2 min read · LW link

Counterfactual Planning in AGI Systems

Koen.Holtman · 3 Feb 2021 13:54 UTC
10 points
0 comments · 5 min read · LW link

Creating AGI Safety Interlocks

Koen.Holtman · 5 Feb 2021 12:01 UTC
7 points
4 comments · 8 min read · LW link

Disentangling Corrigibility: 2015-2021

Koen.Holtman · 16 Feb 2021 18:01 UTC
22 points
20 comments · 9 min read · LW link

Safely controlling the AGI agent reward function

Koen.Holtman · 17 Feb 2021 14:47 UTC
8 points
0 comments · 5 min read · LW link

Information bottleneck for counterfactual corrigibility

tailcalled · 6 Dec 2021 17:11 UTC
8 points
1 comment · 7 min read · LW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver Sourbut · 16 Dec 2021 1:07 UTC
16 points
0 comments · 42 min read · LW link

Question 3: Control proposals for minimizing bad outcomes

Cameron Berg · 12 Feb 2022 19:13 UTC
5 points
1 comment · 7 min read · LW link

Updating Utility Functions

9 May 2022 9:44 UTC
41 points
6 comments · 8 min read · LW link

How RL Agents Behave When Their Actions Are Modified? [Distillation post]

PabloAMC · 20 May 2022 18:47 UTC
22 points
0 comments · 8 min read · LW link

Infernal Corrigibility, Fiendishly Difficult

David Udell · 27 May 2022 20:32 UTC
20 points
1 comment · 13 min read · LW link

Machines vs Memes Part 3: Imitation and Memes

ceru23 · 1 Jun 2022 13:36 UTC
7 points
0 comments · 7 min read · LW link

Simulators

janus · 2 Sep 2022 12:45 UTC
614 points
168 comments · 41 min read · LW link · 8 reviews
(generative.ink)