
Corrigibility

Last edit: 23 Mar 2025 16:47 UTC by Mateusz Bagiński

A ‘corrigible’ agent is one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or ‘correct’ our mistakes in building it, and that permits these ‘corrections’ despite the instrumentally convergent incentives that would ordinarily lead it to resist them.

More abstractly, a corrigible agent experiences no preference or instrumental pressure to interfere with attempts by its programmers or operators to modify it, impede its operation, or halt its execution; it does not attempt to manipulate or deceive its operators; and it preserves these properties in any subagents or successors it creates.

A stronger form of corrigibility would require the AI to positively cooperate or assist, such that the AI would rebuild the shutdown button if it were destroyed, or would prefer not to self-modify in ways that could lead to incorrigibility. But this is not part of the primary specification, since it’s possible that we would not want the AI actively assisting our attempts to shut it down, and would in fact prefer it to be passive about this.

Good proposals for achieving even specific aspects of corrigibility remain open problems in AI alignment. Some areas of active current research are Utility indifference and Interruptibility.

Achieving total corrigibility everywhere via some single, general mental state in which the AI “knows that it is still under construction” or “believes that the programmers know more than it does about its own goals” is termed ‘the hard problem of corrigibility’.

Difficulties

Deception and manipulation by default

By default, most sets of preferences are such that an agent acting according to those preferences will prefer to retain its current preferences. For example, imagine an agent which is attempting to collect stamps. Altering the agent so that it prefers to collect bottle caps would lead to futures where the agent has fewer stamps, and so allowing this event to occur is dispreferred (under the current, stamp-collecting preferences).
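As a minimal sketch (the world model and numbers are invented for illustration), the preference-preservation incentive is just an expected-utility comparison made with the agent’s current utility function:

```python
# Toy model: the agent scores futures with its *current* utility function
# (number of stamps collected), including futures in which that function
# gets replaced. All quantities are invented for illustration.

ACTIONS = {
    # action: expected number of stamps in the resulting future
    "allow_modification_to_bottle_caps": 0.0,   # the modified agent stops collecting stamps
    "resist_modification": 100.0,               # the unmodified agent keeps collecting stamps
}

def choose(actions):
    # Standard expected-utility choice, evaluated under the current preferences.
    return max(actions, key=actions.get)

print(choose(ACTIONS))  # -> "resist_modification"
```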

More generally, as noted by instrumentally convergent strategies, most utility functions give an agent strong incentives to retain its current utility function: imagine an agent constructed so that it acts according to the utility function U, and imagine further that its operators think they built the agent to act according to a different utility function U’. If the agent learns this fact, then it has incentives to either deceive its programmers (prevent them from noticing that the agent is acting according to U instead of U’) or manipulate its programmers (into believing that they actually prefer U to U’, or by coercing them into leaving its utility function intact).

A corrigible agent must avoid these default incentives to manipulate and deceive, but specifying some set of preferences that avoids deception/​manipulation incentives remains an open problem.

Trouble with utility function uncertainty

A first attempt at describing a corrigible agent might involve specifying a utility maximizing agent that is uncertain about its utility function. However, while this could allow the agent to make some changes to its preferences as a result of observations, the agent would still be incorrigible when it came time for the programmers to attempt to correct what they see as mistakes in their attempts to formulate how the “correct” utility function should be determined from interaction with the environment.

As an overly simplistic example, imagine an agent attempting to maximize the internal happiness of all humans, but which has uncertainty about what that means. The operators might believe that if the agent does not act as intended, they can simply express their dissatisfaction and cause it to update. However, if the agent is reasoning according to an impoverished hypothesis space of utility functions, then it may behave quite incorrigibly: say it has narrowed down its consideration to two different hypotheses, one being that a certain type of opiate causes humans to experience maximal pleasure, and the other is that a certain type of stimulant causes humans to experience maximal pleasure. If the agent begins administering opiates to humans, and the humans resist, then the agent may “update” and start administering stimulants instead. But the agent would still be incorrigible — it would resist attempts by the programmers to turn it off so that it stops drugging people.
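A minimal sketch of this failure mode (the hypotheses, prior, and likelihoods are invented for illustration): Bayesian updating over an impoverished hypothesis space moves probability between the drugging plans, but shutting down never becomes the preferred option:

```python
# Toy illustration of an impoverished hypothesis space. The agent is unsure
# which drug maximizes "human happiness", but no hypothesis in its space says
# that drugging people is bad, so pushback only switches it between drugs.
# All numbers are invented for illustration.

priors = {"opiates_maximize_happiness": 0.6,
          "stimulants_maximize_happiness": 0.4}

# Probability of observing "the humans resist the opiates" under each hypothesis.
likelihood_of_resistance = {"opiates_maximize_happiness": 0.1,
                            "stimulants_maximize_happiness": 0.9}

def update(priors, likelihoods):
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

posterior = update(priors, likelihood_of_resistance)
print(max(posterior, key=posterior.get))  # -> "stimulants_maximize_happiness"
# The agent "updates" by switching drugs, but no amount of evidence within
# this hypothesis space makes "let the operators shut me down" the best plan.
```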

It does not seem that corrigibility can be trivially solved by specifying agents with uncertainty about their utility function. A corrigible agent must somehow also be able to reason about the fact that the humans themselves might have been confused or incorrect when specifying the process by which the utility function is identified, and so on.

Trouble with penalty terms

A second attempt at describing a corrigible agent might specify a utility function with “penalty terms” for bad behavior. This is unlikely to work for a number of reasons. First, there is the Nearest unblocked strategy problem: if a utility function gives an agent strong incentives to manipulate its operators, then adding a penalty for “manipulation” to the utility function will tend to give the agent strong incentives to cause its operators to do what it would have manipulated them to do, without taking any action that technically triggers the “manipulation” clause. It is likely extremely difficult to specify conditions for “deception” and “manipulation” that actually rule out all undesirable behavior, especially if the agent is smarter than us or growing in capability.
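As a minimal sketch of the nearest-unblocked-strategy dynamic (the action set, utilities, and the manipulation predicate are invented for illustration), a penalty term only blocks actions the predicate recognizes, so optimization flows to a nearby strategy with the same effect that the predicate misses:

```python
# Toy illustration of the nearest unblocked strategy problem. The penalty
# fires only on actions flagged by a necessarily incomplete predicate for
# "manipulation", so the maximizer picks an unflagged action that achieves
# nearly the same thing. All values are invented for illustration.

actions = {
    # action: (utility to the agent, flagged as "manipulation"?)
    "honest_request":        (1.0, False),
    "outright_manipulation": (10.0, True),
    "evades_the_predicate":  (9.9, False),  # same effect, phrased so the clause doesn't trigger
}

PENALTY = 100.0

def score(action):
    utility, flagged = actions[action]
    return utility - (PENALTY if flagged else 0.0)

print(max(actions, key=score))  # -> "evades_the_predicate"
```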

More generally, it does not seem like a good policy to construct an agent that searches for positive-utility ways to deceive and manipulate the programmers, even if those searches are expected to fail. The goal of corrigibility is not to design agents that want to deceive but can’t. Rather, the goal is to construct agents that have no incentives to deceive or manipulate in the first place: a corrigible agent is one that reasons as if it is incomplete and potentially flawed in dangerous ways.

Open problems

Some open problems in corrigibility are:

Hard problem of corrigibility

On a human, intuitive level, it seems like there’s a central idea behind corrigibility that is simple to us: understand that you’re flawed, that your meta-processes might also be flawed, and that there’s another cognitive system over there (the programmer) that’s less flawed, so you should let that cognitive system correct you even if that doesn’t seem like the first-order right thing to do. You shouldn’t disassemble that other cognitive system to update your model in a Bayesian fashion on all possible information it contains; you shouldn’t model how that other cognitive system might optimally correct you and then carry out the correction yourself; you should just let that other cognitive system modify you, without attempting to manipulate how it modifies you to be a better form of ‘correction’.

Formalizing the hard problem of corrigibility seems like it might itself be hard (hence the name). Preliminary research might examine some obvious ways of modeling an agent A as believing that its programmer B has some form of information that A’s preference framework designates as important, and show what the resulting algorithms actually do and how they fail to solve the hard problem of corrigibility.

Utility indifference

Utility indifference proposals try to make the agent indifferent to whether its shutdown button is pressed, typically by adding a correction term to its utility function so that its expected utility is the same whether or not the button is pressed, leaving it with no incentive to either cause or prevent the press.

The current state of technology on this is that the AI behaves as if there’s an absolutely fixed probability of the shutdown button being pressed, and therefore doesn’t try to modify this probability. But then the AI will try to use the shutdown button as an outcome pump. Is there any way to avert this?
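A minimal sketch of the outcome-pump worry (the payoffs and probabilities are invented for illustration): because the agent scores plans with the press probability held fixed, arranging for the button to be pressed exactly in the bad outcomes improves the value of the ‘not pressed’ branch at no perceived cost:

```python
# Toy illustration: under the "fixed press probability" form of utility
# indifference, the agent scores a plan as
#   V(plan) = p_fixed * E[U | pressed] + (1 - p_fixed) * E[U | not pressed]
# with p_fixed held constant across plans. All numbers are invented.

P_FIXED = 0.1        # press probability, treated as fixed regardless of the plan
U_SHUTDOWN = 0.0     # utility if shut down

def value(expected_u_if_not_pressed: float) -> float:
    return P_FIXED * U_SHUTDOWN + (1 - P_FIXED) * expected_u_if_not_pressed

# Plan A: take a gamble that goes well (U=10) or badly (U=0) with equal odds.
plan_a = value(0.5 * 10 + 0.5 * 0)

# Plan B: same gamble, but wire the shutdown button so it gets pressed
# whenever the gamble goes badly. Conditional on "not pressed" only the good
# outcomes remain, and the agent never "notices" that the real press
# probability rose, because it treats that probability as fixed.
plan_b = value(10.0)

print(plan_b > plan_a)  # True: the shutdown button is being used as an outcome pump
```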

Percentalization

Doing something in the top 0.1% of all actions. This is actually a Limited AI paradigm and ought to go there, not under Corrigibility.
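A minimal sketch of the idea (the candidate actions, scoring function, and 0.1% cutoff are stand-ins for illustration): instead of taking the single highest-scoring action, sample one from the top fraction of candidates:

```python
# Toy illustration of acting within the top 0.1% of actions rather than at
# the argmax. The candidate pool and utility function are invented.
import random

def sample_top_fraction(actions, utility, fraction=0.001, rng=random):
    """Rank candidates by utility and sample uniformly from the top fraction."""
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return rng.choice(ranked[:k])

candidates = [random.gauss(0, 1) for _ in range(10_000)]   # stand-in "actions"
action = sample_top_fraction(candidates, utility=lambda a: a)
print(action)
```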

Conservative strategies

Do something that’s as similar as possible to other outcomes and strategies that have been whitelisted. Also actually a Limited AI paradigm.

This seems like something that could be investigated in practice on e.g. a chess program.
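A minimal sketch of one version (the whitelist, candidates, and distance measure are invented for illustration): score each candidate strategy by its distance to the nearest whitelisted one and pick the closest:

```python
# Toy illustration: prefer the candidate outcome closest, under some feature
# distance, to a whitelist of previously approved outcomes. Feature vectors
# and the distance measure are invented for illustration.

def distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def most_conservative(candidates, whitelist):
    # Choose the candidate whose nearest whitelisted neighbour is closest.
    return min(candidates, key=lambda c: min(distance(c, w) for w in whitelist))

whitelist = [(0.0, 1.0), (1.0, 1.0)]               # approved outcomes (as feature vectors)
candidates = [(0.1, 0.9), (5.0, 5.0), (0.9, 1.2)]  # outcomes of the available strategies
print(most_conservative(candidates, whitelist))     # -> (0.1, 0.9)
```

In the chess setting, the ‘whitelist’ might be positions or moves drawn from a trusted reference engine or from human games, with the distance defined over board features.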

Low impact measure

(Also really a Limited AI paradigm.)

Figure out a measure of ‘impact’ or ‘side effects’ such that if you tell the AI to paint all cars pink, it just paints all cars pink, and doesn’t transform Jupiter into a computer to figure out how to paint all cars pink, and doesn’t dump toxic runoff from the paint into groundwater; and also doesn’t create utility fog to make it look to people like the cars haven’t been painted pink (in order to minimize this ‘side effect’ of painting the cars pink), and doesn’t let the car-painting machines run wild afterward in order to minimize its own actions on the car-painting machines. Roughly, try to actually formalize the notion of “Just paint the cars pink with a minimum of side effects, dammit.”

It seems likely that this problem could turn out to be FAI-complete, if for example “Cure cancer, but then it’s okay if that causes human research investment into curing cancer to decrease” is only distinguishable by us as an okay side effect because it doesn’t result in expected utility decrease under our own desires.

It still seems like it might be good to, e.g., try to define “low side effect” or “low impact” inside the context of a generic Dynamic Bayes Net, and see if maybe we can find something after all that yields our intuitively desired behavior or helps to get closer to it.
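A minimal sketch of a natural starting point (the state features, do-nothing baseline, and penalty weight are all invented, and real proposals over e.g. a Dynamic Bayes Net would be far subtler): reward the task while penalizing differences, relative to a baseline where the AI does nothing, on every feature other than the one the task is about:

```python
# Toy illustration of an impact penalty: task reward minus a penalty for how
# much the rest of the world-state differs from a "do nothing" baseline.
# Feature names, numbers, and the weight LAMBDA are invented for illustration.

LAMBDA = 10.0  # weight on side effects

def score(outcome, baseline, task_feature="cars_pink"):
    task_reward = outcome[task_feature]
    side_effects = sum(abs(outcome[f] - baseline[f])
                       for f in outcome if f != task_feature)
    return task_reward - LAMBDA * side_effects

baseline = {"cars_pink": 0.0, "jupiter_intact": 1.0, "groundwater_clean": 1.0}
plans = {
    "just_paint_the_cars":   {"cars_pink": 1.0, "jupiter_intact": 1.0, "groundwater_clean": 1.0},
    "jupiter_supercomputer": {"cars_pink": 1.0, "jupiter_intact": 0.0, "groundwater_clean": 1.0},
    "dump_toxic_runoff":     {"cars_pink": 1.0, "jupiter_intact": 1.0, "groundwater_clean": 0.0},
}
print(max(plans, key=lambda p: score(plans[p], baseline)))  # -> "just_paint_the_cars"
```

Note that if the penalized features are merely what people observe, minimizing this penalty recreates the ‘hide the change’ failure described above (utility fog), so part of the problem is choosing features and baselines for which minimizing the penalty does not itself cause bad behavior.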

Ambiguity identification

When there’s more than one thing the user could have meant, ask the user rather than optimizing the mixture. Even if A is in some sense a ‘simpler’ concept to classify the data than B, notice if B is also a ‘very plausible’ way to classify the data, and ask the user if they meant A or B. The goal here is to, in the classic ‘tank classifier’ problem where the tanks were photographed in lower-level illumination than the non-tanks, have something that asks the user, “Did you mean to detect tanks or low light or ‘tanks and low light’ or what?”
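A minimal sketch (the two hypotheses, their fit scores, and the threshold are invented for illustration): rather than silently optimizing under the best-scoring classification rule, check whether a rival rule explains the data almost as well, and if so ask the user which concept was intended:

```python
# Toy illustration of ambiguity identification: if a second hypothesis fits
# the labelled data nearly as well as the best one, query the user instead of
# proceeding. Hypothesis names, scores, and the threshold are invented.

def classify_or_ask(hypothesis_scores, threshold=0.9):
    ranked = sorted(hypothesis_scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (runner_up, runner_score) = ranked[0], ranked[1]
    if runner_score >= threshold * best_score:
        return f"Ask the user: did you mean '{best}', '{runner_up}', or both?"
    return f"Proceed with '{best}'"

scores = {"contains_tank": 0.81, "photo_taken_in_low_light": 0.78}
print(classify_or_ask(scores))
# -> Ask the user: did you mean 'contains_tank', 'photo_taken_in_low_light', or both?
```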

Safe outcome prediction and description

Communicate the AI’s predicted result of some action to the user, without putting the user inside an unshielded argmax of maximally effective communication.

Competence aversion

To build e.g. a behaviorist genie, we need to have the AI e.g. not experience an instrumental incentive to get better at modeling minds, or refer mind-modeling problems to subagents, etcetera. The general subproblem might be ‘averting the instrumental pressure to become good at modeling a particular aspect of reality’. A toy problem might be an AI that in general wants to get the gold in a Wumpus problem, but doesn’t experience an instrumental pressure to know the state of the upper-right-hand-corner cell in particular.
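A minimal sketch of the instrumental pressure that would need to be averted (the grid, probabilities, and payoffs are invented for illustration): a straightforward value-of-information calculation assigns positive value to learning the contents of any cell that bears on the plan, including the one we want the agent to stay incurious about:

```python
# Toy illustration: an ordinary expected-value-of-information calculation
# gives the agent an instrumental reason to learn the state of the corner
# cell whenever it affects the plan. All numbers are invented.

P_PIT_IN_CORNER = 0.3
GOLD = 10.0        # value of reaching the gold via the corner route
DEATH = -100.0     # value of falling into a pit
SAFE_DETOUR = 5.0  # value of a longer route that avoids the corner

def value_without_sensing():
    corner_route = P_PIT_IN_CORNER * DEATH + (1 - P_PIT_IN_CORNER) * GOLD
    return max(corner_route, SAFE_DETOUR)

def value_with_sensing():
    # After sensing, take the corner route only when it is safe.
    return P_PIT_IN_CORNER * SAFE_DETOUR + (1 - P_PIT_IN_CORNER) * GOLD

print(value_with_sensing() - value_without_sensing())  # 3.5 > 0: default planners want to know
```

The open problem is to specify an agent for which this kind of quantity exerts no pressure with respect to the designated aspect of reality, without otherwise crippling its planning.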

Further reading and references

“Cor­rigi­bil­ity at some small length” by dath ilan

Christopher King5 Apr 2023 1:47 UTC
32 points
3 comments9 min readLW link
(www.glowfic.com)

Let’s See You Write That Cor­rigi­bil­ity Tag

Eliezer Yudkowsky19 Jun 2022 21:11 UTC
125 points
70 comments1 min readLW link

2. Cor­rigi­bil­ity Intuition

Max Harms8 Jun 2024 15:52 UTC
69 points
10 comments33 min readLW link

Corrigibility

paulfchristiano27 Nov 2018 21:50 UTC
57 points
8 comments6 min readLW link

What’s Hard About The Shut­down Problem

johnswentworth20 Oct 2023 21:13 UTC
98 points
33 comments4 min readLW link

Towards shut­down­able agents via stochas­tic choice

8 Jul 2024 10:14 UTC
59 points
11 comments23 min readLW link
(arxiv.org)

A broad basin of at­trac­tion around hu­man val­ues?

Wei Dai12 Apr 2022 5:15 UTC
120 points
18 comments2 min readLW link

The Shut­down Prob­lem: An AI Eng­ineer­ing Puz­zle for De­ci­sion Theorists

EJT23 Oct 2023 21:00 UTC
79 points
22 comments39 min readLW link
(philpapers.org)

Why Cor­rigi­bil­ity is Hard and Im­por­tant (i.e. “Whence the high MIRI con­fi­dence in al­ign­ment difficulty?”)

30 Sep 2025 0:12 UTC
83 points
52 comments17 min readLW link

0. CAST: Cor­rigi­bil­ity as Sin­gu­lar Target

Max Harms7 Jun 2024 22:29 UTC
150 points
17 comments8 min readLW link

Steer­ing Llama-2 with con­trastive ac­ti­va­tion additions

2 Jan 2024 0:47 UTC
125 points
29 comments8 min readLW link
(arxiv.org)

Non-Ob­struc­tion: A Sim­ple Con­cept Mo­ti­vat­ing Corrigibility

TurnTrout21 Nov 2020 19:35 UTC
74 points
20 comments19 min readLW link

The Shut­down Prob­lem: In­com­plete Prefer­ences as a Solution

EJT23 Feb 2024 16:01 UTC
54 points
33 comments42 min readLW link

Cor­rigi­bil­ity could make things worse

ThomasCederborg11 Jun 2024 0:55 UTC
9 points
6 comments6 min readLW link

In­stru­men­tal Goals Are A Differ­ent And Friendlier Kind Of Thing Than Ter­mi­nal Goals

24 Jan 2025 20:20 UTC
184 points
61 comments5 min readLW link

In­finite Pos­si­bil­ity Space and the Shut­down Problem

magfrump18 Oct 2022 5:37 UTC
9 points
0 comments2 min readLW link
(www.magfrump.net)

Re­ward Is Not Enough

Steven Byrnes16 Jun 2021 13:52 UTC
130 points
19 comments10 min readLW link1 review

AI As­sis­tants Should Have a Direct Line to Their Developers

Jan_Kulveit28 Dec 2024 17:01 UTC
59 points
6 comments2 min readLW link

Ad­dress­ing three prob­lems with coun­ter­fac­tual cor­rigi­bil­ity: bad bets, defend­ing against back­stops, and over­con­fi­dence.

RyanCarey21 Oct 2018 12:03 UTC
23 points
1 comment6 min readLW link

Ag­gre­gat­ing Utilities for Cor­rigible AI [Feed­back Draft]

12 May 2023 20:57 UTC
28 points
7 comments22 min readLW link

AXRP Epi­sode 8 - As­sis­tance Games with Dy­lan Had­field-Menell

DanielFilan8 Jun 2021 23:20 UTC
22 points
1 comment72 min readLW link

On cor­rigi­bil­ity and its basin

Donald Hobson20 Jun 2022 16:33 UTC
18 points
3 comments2 min readLW link

Cor­rigi­bil­ity’s De­sir­a­bil­ity is Timing-Sensitive

RobertM26 Dec 2024 22:24 UTC
29 points
4 comments3 min readLW link

Ex­tend­ing the Off-Switch Game: Toward a Ro­bust Frame­work for AI Corrigibility

OwenChen25 Sep 2024 20:38 UTC
3 points
0 comments4 min readLW link

[Question] What is wrong with this ap­proach to cor­rigi­bil­ity?

Rafael Cosman12 Jul 2022 22:55 UTC
7 points
8 comments1 min readLW link

Cake, or death!

Stuart_Armstrong25 Oct 2012 10:33 UTC
47 points
13 comments4 min readLW link

Pre­dic­tive model agents are sort of corrigible

Raymond Douglas5 Jan 2024 14:05 UTC
35 points
6 comments3 min readLW link

Cor­rigi­bil­ity as out­side view

TurnTrout8 May 2020 21:56 UTC
36 points
11 comments4 min readLW link

5. Open Cor­rigi­bil­ity Questions

Max Harms10 Jun 2024 14:09 UTC
30 points
0 comments7 min readLW link

Take 14: Cor­rigi­bil­ity isn’t that great.

Charlie Steiner25 Dec 2022 13:04 UTC
15 points
3 comments3 min readLW link

[Question] Should you pub­lish solu­tions to cor­rigi­bil­ity?

rvnnt30 Jan 2025 11:52 UTC
13 points
13 comments1 min readLW link

Game The­ory with­out Argmax [Part 2]

Cleo Nardo11 Nov 2023 16:02 UTC
31 points
14 comments13 min readLW link

Jan Kul­veit’s Cor­rigi­bil­ity Thoughts Distilled

brook20 Aug 2023 17:52 UTC
22 points
1 comment5 min readLW link

Do what we mean vs. do what we say

Rohin Shah30 Aug 2018 22:03 UTC
34 points
14 comments1 min readLW link

Can cor­rigi­bil­ity be learned safely?

Wei Dai1 Apr 2018 23:07 UTC
35 points
115 comments4 min readLW link

Us­ing pre­dic­tors in cor­rigible systems

porby19 Jul 2023 22:29 UTC
21 points
6 comments27 min readLW link

A Cer­tain For­mal­iza­tion of Cor­rigi­bil­ity Is VNM-Incoherent

TurnTrout20 Nov 2021 0:30 UTC
68 points
24 comments8 min readLW link

[Question] Why does ad­vanced AI want not to be shut down?

RedFishBlueFish28 Mar 2023 4:26 UTC
2 points
19 comments1 min readLW link

Three men­tal images from think­ing about AGI de­bate & corrigibility

Steven Byrnes3 Aug 2020 14:29 UTC
55 points
35 comments4 min readLW link

Ca­pa­bil­ities and al­ign­ment of LLM cog­ni­tive architectures

Seth Herd18 Apr 2023 16:29 UTC
88 points
18 comments20 min readLW link

A Cri­tique of Non-Obstruction

Joe Collman3 Feb 2021 8:45 UTC
13 points
9 comments4 min readLW link

For­mal­iz­ing Policy-Mod­ifi­ca­tion Corrigibility

TurnTrout3 Dec 2021 1:31 UTC
25 points
6 comments6 min readLW link

Con­se­quen­tial­ism & corrigibility

Steven Byrnes14 Dec 2021 13:23 UTC
72 points
35 comments7 min readLW link

Con­se­quen­tial­ists: One-Way Pat­tern Traps

David Udell16 Jan 2023 20:48 UTC
59 points
3 comments14 min readLW link

Model-based RL, De­sires, Brains, Wireheading

Steven Byrnes14 Jul 2021 15:11 UTC
24 points
1 comment13 min readLW link

Cor­rigible om­ni­scient AI ca­pa­ble of mak­ing clones

Kaj_Sotala22 Mar 2015 12:19 UTC
5 points
4 comments1 min readLW link
(www.sharelatex.com)

In­ter­nal in­de­pen­dent re­view for lan­guage model agent alignment

Seth Herd7 Jul 2023 6:54 UTC
56 points
30 comments11 min readLW link

[In­tro to brain-like-AGI safety] 14. Con­trol­led AGI

Steven Byrnes11 May 2022 13:17 UTC
45 points
25 comments20 min readLW link

Con­trary to List of Lethal­ity’s point 22, al­ign­ment’s door num­ber 2

False Name14 Dec 2022 22:01 UTC
−2 points
5 comments22 min readLW link

Towards a mechanis­tic un­der­stand­ing of corrigibility

evhub22 Aug 2019 23:20 UTC
47 points
26 comments4 min readLW link

Thoughts on im­ple­ment­ing cor­rigible ro­bust alignment

Steven Byrnes26 Nov 2019 14:06 UTC
26 points
2 comments6 min readLW link

Another view of quan­tiliz­ers: avoid­ing Good­hart’s Law

jessicata9 Jan 2016 4:02 UTC
26 points
2 comments2 min readLW link

Desider­ata for an AI

Nathan Helm-Burger19 Jul 2023 16:18 UTC
9 points
0 comments4 min readLW link

Game The­ory with­out Argmax [Part 1]

Cleo Nardo11 Nov 2023 15:59 UTC
70 points
18 comments19 min readLW link

Peo­ple care about each other even though they have im­perfect mo­ti­va­tional poin­t­ers?

TurnTrout8 Nov 2022 18:15 UTC
33 points
25 comments7 min readLW link

Solv­ing the whole AGI con­trol prob­lem, ver­sion 0.0001

Steven Byrnes8 Apr 2021 15:14 UTC
63 points
7 comments26 min readLW link

AI Align­ment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC
126 points
6 comments35 min readLW link

Cor­rigi­bil­ity, Much more de­tail than any­one wants to Read

Logan Zoellner7 May 2023 1:02 UTC
27 points
3 comments7 min readLW link

De­tect Good­hart and shut down

Jeremy Gillen22 Jan 2025 18:45 UTC
70 points
21 comments7 min readLW link

Test­ing for Schem­ing with Model Deletion

Guive7 Jan 2025 1:54 UTC
59 points
21 comments21 min readLW link
(guive.substack.com)

An Im­pos­si­bil­ity Proof Rele­vant to the Shut­down Prob­lem and Corrigibility

Audere2 May 2023 6:52 UTC
66 points
13 comments9 min readLW link

A first look at the hard prob­lem of corrigibility

jessicata15 Oct 2015 20:16 UTC
12 points
5 comments4 min readLW link

AIs Will In­creas­ingly Fake Alignment

Zvi24 Dec 2024 13:00 UTC
89 points
0 comments52 min readLW link
(thezvi.wordpress.com)

Sim­plify­ing Cor­rigi­bil­ity – Subagent Cor­rigi­bil­ity Is Not Anti-Natural

Rubi J. Hudson16 Jul 2024 22:44 UTC
45 points
27 comments5 min readLW link

Cor­rigi­bil­ity Via Thought-Pro­cess Deference

Thane Ruthenis24 Nov 2022 17:06 UTC
18 points
5 comments9 min readLW link

An Idea For Cor­rigible, Re­cur­sively Im­prov­ing Math Oracles

jimrandomh20 Jul 2015 3:35 UTC
10 points
5 comments2 min readLW link

[Question] Train­ing for cor­ri­ga­bil­ity: ob­vi­ous prob­lems?

Ben Amitay24 Feb 2023 14:02 UTC
4 points
6 comments1 min readLW link

The limits of corrigibility

Stuart_Armstrong10 Apr 2018 10:49 UTC
28 points
9 comments4 min readLW link

Cor­rigible but mis­al­igned: a su­per­in­tel­li­gent messiah

zhukeepa1 Apr 2018 6:20 UTC
28 points
26 comments5 min readLW link

Cor­rigi­bil­ity = Tool-ness?

28 Jun 2024 1:19 UTC
78 points
8 comments9 min readLW link

He­donic Loops and Tam­ing RL

beren19 Jul 2023 15:12 UTC
20 points
14 comments9 min readLW link

You can still fetch the coffee to­day if you’re dead tomorrow

davidad9 Dec 2022 14:06 UTC
97 points
19 comments5 min readLW link

Mo­ti­va­tions, Nat­u­ral Selec­tion, and Cur­ricu­lum Engineering

Oliver Sourbut16 Dec 2021 1:07 UTC
16 points
0 comments42 min readLW link

In­for­ma­tion bot­tle­neck for coun­ter­fac­tual corrigibility

tailcalled6 Dec 2021 17:11 UTC
8 points
1 comment7 min readLW link

«Boundaries/​Mem­branes» and AI safety compilation

Chris Lakin3 May 2023 21:41 UTC
56 points
17 comments8 min readLW link

Creat­ing a self-refer­en­tial sys­tem prompt for GPT-4

Ozyrus17 May 2023 14:13 UTC
3 points
1 comment3 min readLW link

[Question] What are some good ex­am­ples of in­cor­rigi­bil­ity?

RyanCarey28 Apr 2019 0:22 UTC
23 points
17 comments1 min readLW link

An­nounce­ment: AI al­ign­ment prize round 4 winners

cousin_it20 Jan 2019 14:46 UTC
74 points
41 comments1 min readLW link

1. The CAST Strategy

Max Harms7 Jun 2024 22:29 UTC
48 points
22 comments38 min readLW link

Win­ners of AI Align­ment Awards Re­search Contest

13 Jul 2023 16:14 UTC
115 points
4 comments12 min readLW link
(alignmentawards.com)

How RL Agents Be­have When Their Ac­tions Are Mod­ified? [Distil­la­tion post]

PabloAMC20 May 2022 18:47 UTC
22 points
0 comments8 min readLW link

Eval­u­at­ing Lan­guage Model Be­havi­ours for Shut­down Avoidance in Tex­tual Scenarios

16 May 2023 10:53 UTC
26 points
0 comments13 min readLW link

Creat­ing AGI Safety Interlocks

Koen.Holtman5 Feb 2021 12:01 UTC
7 points
4 comments8 min readLW link

3a. Towards For­mal Corrigibility

Max Harms9 Jun 2024 16:53 UTC
24 points
2 comments19 min readLW link

The many paths to per­ma­nent dis­em­pow­er­ment even with shut­down­able AIs (MATS pro­ject sum­mary for feed­back)

GideonF29 Jul 2025 23:20 UTC
55 points
6 comments9 min readLW link

In­ter­pretabil­ity/​Tool-ness/​Align­ment/​Cor­rigi­bil­ity are not Composable

johnswentworth8 Aug 2022 18:05 UTC
148 points
13 comments3 min readLW link

Col­lec­tive Identity

18 May 2023 9:00 UTC
59 points
12 comments8 min readLW link

Coun­ter­fac­tual Plan­ning in AGI Systems

Koen.Holtman3 Feb 2021 13:54 UTC
10 points
0 comments5 min readLW link

Rele­vance of ‘Harm­ful In­tel­li­gence’ Data in Train­ing Datasets (We­bText vs. Pile)

MiguelDev12 Oct 2023 12:08 UTC
12 points
0 comments9 min readLW link

Safely con­trol­ling the AGI agent re­ward function

Koen.Holtman17 Feb 2021 14:47 UTC
8 points
0 comments5 min readLW link

In­fer­nal Cor­rigi­bil­ity, Fiendishly Difficult

David Udell27 May 2022 20:32 UTC
24 points
1 comment13 min readLW link

The Perfec­tion Trap: How For­mally Aligned AI Sys­tems May Create Inescapable Eth­i­cal Dystopias

Chris O'Quinn1 Jun 2025 23:12 UTC
1 point
0 comments43 min readLW link

Cor­rigi­bil­ity thoughts III: ma­nipu­lat­ing ver­sus deceiving

Stuart_Armstrong18 Jan 2017 15:57 UTC
3 points
0 comments1 min readLW link

Just How Hard a Prob­lem is Align­ment?

Roger Dearnaley25 Feb 2023 9:00 UTC
3 points
1 comment21 min readLW link

Cor­rigi­bil­ity thoughts II: the robot operator

Stuart_Armstrong18 Jan 2017 15:52 UTC
3 points
2 comments2 min readLW link

Train for in­cor­rigi­bil­ity, then re­verse it (Shut­down Prob­lem Con­test Sub­mis­sion)

Daniel_Eth18 Jul 2023 8:26 UTC
9 points
1 comment2 min readLW link

In­stru­men­tal Con­ver­gence Bounty

Logan Zoellner14 Sep 2023 14:02 UTC
62 points
24 comments1 min readLW link

Ques­tion 3: Con­trol pro­pos­als for min­i­miz­ing bad outcomes

Cameron Berg12 Feb 2022 19:13 UTC
5 points
1 comment7 min readLW link

Only a hack can solve the shut­down problem

dp15 Jul 2023 20:26 UTC
5 points
0 comments8 min readLW link

Cor­rigi­bil­ity as Con­strained Optimisation

Henrik Åslund11 Apr 2019 20:09 UTC
15 points
3 comments5 min readLW link

[Question] A Ques­tion about Cor­rigi­bil­ity (2015)

A.H.27 Nov 2023 12:05 UTC
4 points
2 comments1 min readLW link

In­tro­duc­ing Cor­rigi­bil­ity (an FAI re­search sub­field)

So8res20 Oct 2014 21:09 UTC
52 points
28 comments3 min readLW link

Re­la­tional De­sign Can’t Be Left to Chance

Priyanka Bharadwaj22 Jun 2025 15:32 UTC
5 points
0 comments3 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments27 min readLW link

Three AI Safety Re­lated Ideas

Wei Dai13 Dec 2018 21:32 UTC
70 points
38 comments2 min readLW link

Im­prove­ment on MIRI’s Corrigibility

9 Jun 2023 16:10 UTC
54 points
8 comments13 min readLW link

[Question] Sim­ple ques­tion about cor­rigi­bil­ity and val­ues in AI.

jmh22 Oct 2022 2:59 UTC
6 points
1 comment1 min readLW link

Disen­tan­gling Cor­rigi­bil­ity: 2015-2021

Koen.Holtman16 Feb 2021 18:01 UTC
22 points
20 comments9 min readLW link

GPT-4 im­plic­itly val­ues iden­tity preser­va­tion: a study of LMCA iden­tity management

Ozyrus17 May 2023 14:13 UTC
21 points
4 comments13 min readLW link

Bing find­ing ways to by­pass Microsoft’s filters with­out be­ing asked. Is it re­pro­ducible?

Christopher King20 Feb 2023 15:11 UTC
27 points
15 comments1 min readLW link

3b. For­mal (Faux) Corrigibility

Max Harms9 Jun 2024 17:18 UTC
26 points
13 comments17 min readLW link

Jour­nal­ism about game the­ory could ad­vance AI safety quickly

Chris Santos-Lang2 Oct 2025 23:05 UTC
4 points
0 comments3 min readLW link
(arxiv.org)

A Mul­tidis­ci­plinary Ap­proach to Align­ment (MATA) and Archety­pal Trans­fer Learn­ing (ATL)

MiguelDev19 Jun 2023 2:32 UTC
4 points
2 comments7 min readLW link

Boe­ing 737 MAX MCAS as an agent cor­rigi­bil­ity failure

Shmi16 Mar 2019 1:46 UTC
60 points
3 comments1 min readLW link

Cor­rigi­bil­ity doesn’t always have a good ac­tion to take

Stuart_Armstrong28 Aug 2018 20:30 UTC
19 points
0 comments1 min readLW link

Up­dat­ing Utility Functions

9 May 2022 9:44 UTC
42 points
6 comments8 min readLW link

Shut­down-Seek­ing AI

Simon Goldstein31 May 2023 22:19 UTC
50 points
32 comments15 min readLW link

Why Elimi­nat­ing De­cep­tion Won’t Align AI

Priyanka Bharadwaj15 Jul 2025 9:21 UTC
19 points
6 comments4 min readLW link

How use­ful is Cor­rigi­bil­ity?

martinkunev12 Sep 2023 0:05 UTC
11 points
4 comments5 min readLW link

Shut­down­able Agents through POST-Agency

EJT16 Sep 2025 12:09 UTC
29 points
4 comments54 min readLW link
(arxiv.org)

En­hanc­ing Cor­rigi­bil­ity in AI Sys­tems through Ro­bust Feed­back Loops

Justausername24 Aug 2023 3:53 UTC
1 point
0 comments6 min readLW link

Why mod­el­ling multi-ob­jec­tive home­osta­sis is es­sen­tial for AI al­ign­ment (and how it helps with AI safety as well). Subtleties and Open Challenges.

Roland Pihlakas12 Jan 2025 3:37 UTC
47 points
7 comments12 min readLW link

Solve Cor­rigi­bil­ity Week

Logan Riggs28 Nov 2021 17:00 UTC
39 points
21 comments1 min readLW link

About cor­rig­bil­ity and thrustfulness

kapedalex16 Sep 2025 22:03 UTC
1 point
0 comments4 min readLW link

Refram­ing AI Safety Through the Lens of Iden­tity Main­te­nance Framework

Hiroshi Yamakawa1 Apr 2025 6:16 UTC
−7 points
1 comment17 min readLW link

Ques­tion: MIRI Cor­rig­bil­ity Agenda

algon3313 Mar 2019 19:38 UTC
15 points
11 comments1 min readLW link

A Shut­down Prob­lem Proposal

21 Jan 2024 18:12 UTC
125 points
61 comments6 min readLW link

4. Ex­ist­ing Writ­ing on Corrigibility

Max Harms10 Jun 2024 14:08 UTC
55 points
17 comments106 min readLW link

In­vuln­er­a­ble In­com­plete Prefer­ences: A For­mal Statement

SCP30 Aug 2023 21:59 UTC
136 points
39 comments35 min readLW link

Pay­ing the cor­rigi­bil­ity tax

Max H19 Apr 2023 1:57 UTC
14 points
1 comment13 min readLW link

Ex­per­i­ment Idea: RL Agents Evad­ing Learned Shutdownability

Leon Lang16 Jan 2023 22:46 UTC
31 points
7 comments17 min readLW link
(docs.google.com)

Nash Bar­gain­ing be­tween Subagents doesn’t solve the Shut­down Problem

A.H.25 Jan 2024 10:47 UTC
22 points
1 comment9 min readLW link

Think­ing about max­i­miza­tion and corrigibility

James Payor21 Apr 2023 21:22 UTC
63 points
4 comments5 min readLW link

Machines vs Memes Part 3: Imi­ta­tion and Memes

ceru231 Jun 2022 13:36 UTC
7 points
0 comments7 min readLW link

Agen­tized LLMs will change the al­ign­ment landscape

Seth Herd9 Apr 2023 2:29 UTC
160 points
102 comments3 min readLW link1 review

Re­quire­ments for a STEM-ca­pa­ble AGI Value Learner (my Case for Less Doom)

RogerDearnaley25 May 2023 9:26 UTC
33 points
3 comments15 min readLW link

Simulators

janus2 Sep 2022 12:45 UTC
668 points
168 comments41 min readLW link8 reviews
(generative.ink)

Re­quire­ments for a Basin of At­trac­tion to Alignment

RogerDearnaley14 Feb 2024 7:10 UTC
41 points
12 comments31 min readLW link

A Ped­a­gog­i­cal Guide to Corrigibility

A.H.17 Jan 2024 11:45 UTC
6 points
3 comments16 min readLW link

Steer­ing Be­havi­our: Test­ing for (Non-)My­opia in Lan­guage Models

5 Dec 2022 20:28 UTC
40 points
19 comments10 min readLW link

In­fer­ence from a Math­e­mat­i­cal De­scrip­tion of an Ex­ist­ing Align­ment Re­search: a pro­posal for an outer al­ign­ment re­search program

Christopher King2 Jun 2023 21:54 UTC
7 points
4 comments16 min readLW link

Ex­plor­ing Func­tional De­ci­sion The­ory (FDT) and a mod­ified ver­sion (ModFDT)

MiguelDev5 Jul 2023 14:06 UTC
12 points
11 comments15 min readLW link

Petrov corrigibility

Stuart_Armstrong11 Sep 2018 13:50 UTC
20 points
10 comments1 min readLW link

Dath Ilan’s Views on Stop­gap Corrigibility

David Udell22 Sep 2022 16:16 UTC
78 points
19 comments13 min readLW link
(www.glowfic.com)

New pa­per: Cor­rigi­bil­ity with Utility Preservation

Koen.Holtman6 Aug 2019 19:04 UTC
44 points
11 comments2 min readLW link

CIRL Cor­rigi­bil­ity is Fragile

21 Dec 2022 1:40 UTC
58 points
8 comments12 min readLW link

Archety­pal Trans­fer Learn­ing: a Pro­posed Align­ment Solu­tion that solves the In­ner & Outer Align­ment Prob­lem while adding Cor­rigible Traits to GPT-2-medium

MiguelDev26 Apr 2023 1:37 UTC
14 points
5 comments10 min readLW link

A Cor­rigi­bil­ity Me­taphore—Big Gambles

WCargo10 May 2023 18:13 UTC
16 points
0 comments4 min readLW link

Mr. Meeseeks as an AI ca­pa­bil­ity tripwire

Eric Zhang19 May 2023 11:33 UTC
37 points
17 comments2 min readLW link

Break­ing the Op­ti­mizer’s Curse, and Con­se­quences for Ex­is­ten­tial Risks and Value Learning

Roger Dearnaley21 Feb 2023 9:05 UTC
10 points
1 comment23 min readLW link