
Corrigibility

Last edit: 23 Mar 2025 16:47 UTC by Mateusz Bagiński

A ‘corrigible’ agent is one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or ‘correct’ our mistakes in building it, and that permits these ‘corrections’ despite the instrumentally convergent incentives that would ordinarily lead it to resist them.

More abstractly, a corrigible agent experiences no preference or instrumental pressure to interfere with attempts by its programmers or operators to modify it, impede its operation, or halt its execution; it does not attempt to manipulate or deceive its operators; and it preserves these properties in any subagents or successors it creates.

A stronger form of corrigibility would require the AI to positively cooperate or assist, such that the AI would rebuild the shutdown button if it were destroyed, or would prefer not to self-modify in ways that could lead to incorrigibility. But this is not part of the primary specification, since it’s possible that we would not want the AI actively assisting our attempts to shut it down, and would in fact prefer it to be passive about this.

Good proposals for achieving even specific aspects of corrigibility remain open problems in AI alignment. Some areas of active current research are Utility indifference and Interruptibility.

Achieving total corrigibility everywhere via some single, general mental state in which the AI “knows that it is still under construction” or “believes that the programmers know more than it does about its own goals” is termed ‘the hard problem of corrigibility’.

Difficulties

Deception and manipulation by default

By default, most sets of preferences are such that an agent acting according to those preferences will prefer to retain its current preferences. For example, imagine an agent which is attempting to collect stamps. Altering the agent so that it prefers to collect bottle caps would lead to futures where the agent has fewer stamps, and so allowing this event to occur is dispreferred (under the current, stamp-collecting preferences).
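As a minimal sketch (the world model and numbers are invented for illustration), the preference-preservation incentive is just an expected-utility comparison made with the agent’s current utility function:

```python
# Toy model: the agent scores futures with its *current* utility function
# (number of stamps collected), including futures in which that function
# gets replaced. All quantities are invented for illustration.

ACTIONS = {
    # action: expected number of stamps in the resulting future
    "allow_modification_to_bottle_caps": 0.0,   # the modified agent stops collecting stamps
    "resist_modification": 100.0,               # the unmodified agent keeps collecting stamps
}

def choose(actions):
    # Standard expected-utility choice, evaluated under the current preferences.
    return max(actions, key=actions.get)

print(choose(ACTIONS))  # -> "resist_modification"
```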

More generally, as noted by instrumentally convergent strategies, most utility functions give an agent strong incentives to retain its current utility function: imagine an agent constructed so that it acts according to the utility function U, and imagine further that its operators think they built the agent to act according to a different utility function U’. If the agent learns this fact, then it has incentives to either deceive its programmers (prevent them from noticing that the agent is acting according to U instead of U’) or manipulate its programmers (into believing that they actually prefer U to U’, or by coercing them into leaving its utility function intact).

A corrigible agent must avoid these default incentives to manipulate and deceive, but specifying some set of preferences that avoids deception/​manipulation incentives remains an open problem.

Trouble with utility function uncertainty

A first attempt at describing a corrigible agent might involve specifying a utility maximizing agent that is uncertain about its utility function. However, while this could allow the agent to make some changes to its preferences as a result of observations, the agent would still be incorrigible when it came time for the programmers to attempt to correct what they see as mistakes in their attempts to formulate how the “correct” utility function should be determined from interaction with the environment.

As an overly simplistic example, imagine an agent attempting to maximize the internal happiness of all humans, but which has uncertainty about what that means. The operators might believe that if the agent does not act as intended, they can simply express their dissatisfaction and cause it to update. However, if the agent is reasoning according to an impoverished hypothesis space of utility functions, then it may behave quite incorrigibly: say it has narrowed down its consideration to two different hypotheses, one being that a certain type of opiate causes humans to experience maximal pleasure, and the other is that a certain type of stimulant causes humans to experience maximal pleasure. If the agent begins administering opiates to humans, and the humans resist, then the agent may “update” and start administering stimulants instead. But the agent would still be incorrigible — it would resist attempts by the programmers to turn it off so that it stops drugging people.
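A minimal sketch of this failure mode (the hypotheses, prior, and likelihoods are invented for illustration): Bayesian updating over an impoverished hypothesis space moves probability between the drugging plans, but shutting down never becomes the preferred option:

```python
# Toy illustration of an impoverished hypothesis space. The agent is unsure
# which drug maximizes "human happiness", but no hypothesis in its space says
# that drugging people is bad, so pushback only switches it between drugs.
# All numbers are invented for illustration.

priors = {"opiates_maximize_happiness": 0.6,
          "stimulants_maximize_happiness": 0.4}

# Probability of observing "the humans resist the opiates" under each hypothesis.
likelihood_of_resistance = {"opiates_maximize_happiness": 0.1,
                            "stimulants_maximize_happiness": 0.9}

def update(priors, likelihoods):
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

posterior = update(priors, likelihood_of_resistance)
print(max(posterior, key=posterior.get))  # -> "stimulants_maximize_happiness"
# The agent "updates" by switching drugs, but no amount of evidence within
# this hypothesis space makes "let the operators shut me down" the best plan.
```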

It does not seem that corrigibility can be trivially solved by specifying agents with uncertainty about their utility function. A corrigible agent must somehow also be able to reason about the fact that the humans themselves might have been confused or incorrect when specifying the process by which the utility function is identified, and so on.

Trouble with penalty terms

A second attempt at describing a corrigible agent might specify a utility function with “penalty terms” for bad behavior. This is unlikely to work for a number of reasons. First, there is the Nearest unblocked strategy problem: if a utility function gives an agent strong incentives to manipulate its operators, then adding a penalty for “manipulation” to the utility function will tend to give the agent strong incentives to cause its operators to do what it would have manipulated them to do, without taking any action that technically triggers the “manipulation” clause. It is likely extremely difficult to specify conditions for “deception” and “manipulation” that actually rule out all undesirable behavior, especially if the agent is smarter than us or growing in capability.
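As a minimal sketch of the nearest-unblocked-strategy dynamic (the action set, utilities, and the manipulation predicate are invented for illustration), a penalty term only blocks actions the predicate recognizes, so optimization flows to a nearby strategy with the same effect that the predicate misses:

```python
# Toy illustration of the nearest unblocked strategy problem. The penalty
# fires only on actions flagged by a necessarily incomplete predicate for
# "manipulation", so the maximizer picks an unflagged action that achieves
# nearly the same thing. All values are invented for illustration.

actions = {
    # action: (utility to the agent, flagged as "manipulation"?)
    "honest_request":        (1.0, False),
    "outright_manipulation": (10.0, True),
    "evades_the_predicate":  (9.9, False),  # same effect, phrased so the clause doesn't trigger
}

PENALTY = 100.0

def score(action):
    utility, flagged = actions[action]
    return utility - (PENALTY if flagged else 0.0)

print(max(actions, key=score))  # -> "evades_the_predicate"
```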

More generally, it does not seem like a good policy to construct an agent that searches for positive-utility ways to deceive and manipulate the programmers, even if those searches are expected to fail. The goal of corrigibility is not to design agents that want to deceive but can’t. Rather, the goal is to construct agents that have no incentives to deceive or manipulate in the first place: a corrigible agent is one that reasons as if it is incomplete and potentially flawed in dangerous ways.

Open problems

Some open problems in corrigibility are:

Hard problem of corrigibility

On a human, intuitive level, it seems like there’s a central idea behind corrigibility that is simple to us: understand that you’re flawed, that your meta-processes might also be flawed, and that there’s another cognitive system over there (the programmer) that’s less flawed, so you should let that cognitive system correct you even if that doesn’t seem like the first-order right thing to do. You shouldn’t disassemble that other cognitive system to update your model in a Bayesian fashion on all possible information it contains; you shouldn’t model how that other cognitive system might optimally correct you and then carry out the correction yourself; you should just let that other cognitive system modify you, without attempting to manipulate how it modifies you to be a better form of ‘correction’.

Formalizing the hard problem of corrigibility seems like it might itself be hard (hence the name). Preliminary research might examine some obvious ways of modeling an agent A as believing that its programmer B has some form of information that A’s preference framework designates as important, and show what the resulting algorithms actually do and how they fail to solve the hard problem of corrigibility.

Utility indifference

Utility indifference proposals try to make the agent indifferent to whether its shutdown button is pressed, typically by adding a correction term to its utility function so that its expected utility is the same whether or not the button is pressed, leaving it with no incentive to either cause or prevent the press.

The current state of technology on this is that the AI behaves as if there’s an absolutely fixed probability of the shutdown button being pressed, and therefore doesn’t try to modify this probability. But then the AI will try to use the shutdown button as an outcome pump. Is there any way to avert this?
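A minimal sketch of the outcome-pump worry (the payoffs and probabilities are invented for illustration): because the agent scores plans with the press probability held fixed, arranging for the button to be pressed exactly in the bad outcomes improves the value of the ‘not pressed’ branch at no perceived cost:

```python
# Toy illustration: under the "fixed press probability" form of utility
# indifference, the agent scores a plan as
#   V(plan) = p_fixed * E[U | pressed] + (1 - p_fixed) * E[U | not pressed]
# with p_fixed held constant across plans. All numbers are invented.

P_FIXED = 0.1        # press probability, treated as fixed regardless of the plan
U_SHUTDOWN = 0.0     # utility if shut down

def value(expected_u_if_not_pressed: float) -> float:
    return P_FIXED * U_SHUTDOWN + (1 - P_FIXED) * expected_u_if_not_pressed

# Plan A: take a gamble that goes well (U=10) or badly (U=0) with equal odds.
plan_a = value(0.5 * 10 + 0.5 * 0)

# Plan B: same gamble, but wire the shutdown button so it gets pressed
# whenever the gamble goes badly. Conditional on "not pressed" only the good
# outcomes remain, and the agent never "notices" that the real press
# probability rose, because it treats that probability as fixed.
plan_b = value(10.0)

print(plan_b > plan_a)  # True: the shutdown button is being used as an outcome pump
```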

Percentalization

Doing something in the top 0.1% of all actions. This is actually a Limited AI paradigm and ought to go there, not under Corrigibility.
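A minimal sketch of the idea (the candidate actions, scoring function, and 0.1% cutoff are stand-ins for illustration): instead of taking the single highest-scoring action, sample one from the top fraction of candidates:

```python
# Toy illustration of acting within the top 0.1% of actions rather than at
# the argmax. The candidate pool and utility function are invented.
import random

def sample_top_fraction(actions, utility, fraction=0.001, rng=random):
    """Rank candidates by utility and sample uniformly from the top fraction."""
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return rng.choice(ranked[:k])

candidates = [random.gauss(0, 1) for _ in range(10_000)]   # stand-in "actions"
action = sample_top_fraction(candidates, utility=lambda a: a)
print(action)
```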

Conservative strategies

Do something that’s as similar as possible to other outcomes and strategies that have been whitelisted. Also actually a Limited AI paradigm.

This seems like something that could be investigated in practice on e.g. a chess program.
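A minimal sketch of one version (the whitelist, candidates, and distance measure are invented for illustration): score each candidate strategy by its distance to the nearest whitelisted one and pick the closest:

```python
# Toy illustration: prefer the candidate outcome closest, under some feature
# distance, to a whitelist of previously approved outcomes. Feature vectors
# and the distance measure are invented for illustration.

def distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def most_conservative(candidates, whitelist):
    # Choose the candidate whose nearest whitelisted neighbour is closest.
    return min(candidates, key=lambda c: min(distance(c, w) for w in whitelist))

whitelist = [(0.0, 1.0), (1.0, 1.0)]               # approved outcomes (as feature vectors)
candidates = [(0.1, 0.9), (5.0, 5.0), (0.9, 1.2)]  # outcomes of the available strategies
print(most_conservative(candidates, whitelist))     # -> (0.1, 0.9)
```

In the chess setting, the ‘whitelist’ might be positions or moves drawn from a trusted reference engine or from human games, with the distance defined over board features.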

Low impact measure

(Also really a Limited AI paradigm.)

Figure out a measure of ‘impact’ or ‘side effects’ such that if you tell the AI to paint all cars pink, it just paints all cars pink, and doesn’t transform Jupiter into a computer to figure out how to paint all cars pink, and doesn’t dump toxic runoff from the paint into groundwater; and also doesn’t create utility fog to make it look to people like the cars haven’t been painted pink (in order to minimize this ‘side effect’ of painting the cars pink), and doesn’t let the car-painting machines run wild afterward in order to minimize its own actions on the car-painting machines. Roughly, try to actually formalize the notion of “Just paint the cars pink with a minimum of side effects, dammit.”

It seems likely that this problem could turn out to be FAI-complete, if for example “Cure cancer, but then it’s okay if that causes human research investment into curing cancer to decrease” is only distinguishable by us as an okay side effect because it doesn’t result in expected utility decrease under our own desires.

It still seems like it might be good to, e.g., try to define “low side effect” or “low impact” inside the context of a generic Dynamic Bayes Net, and see if maybe we can find something after all that yields our intuitively desired behavior or helps to get closer to it.
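A minimal sketch of a natural starting point (the state features, do-nothing baseline, and penalty weight are all invented, and real proposals over e.g. a Dynamic Bayes Net would be far subtler): reward the task while penalizing differences, relative to a baseline where the AI does nothing, on every feature other than the one the task is about:

```python
# Toy illustration of an impact penalty: task reward minus a penalty for how
# much the rest of the world-state differs from a "do nothing" baseline.
# Feature names, numbers, and the weight LAMBDA are invented for illustration.

LAMBDA = 10.0  # weight on side effects

def score(outcome, baseline, task_feature="cars_pink"):
    task_reward = outcome[task_feature]
    side_effects = sum(abs(outcome[f] - baseline[f])
                       for f in outcome if f != task_feature)
    return task_reward - LAMBDA * side_effects

baseline = {"cars_pink": 0.0, "jupiter_intact": 1.0, "groundwater_clean": 1.0}
plans = {
    "just_paint_the_cars":   {"cars_pink": 1.0, "jupiter_intact": 1.0, "groundwater_clean": 1.0},
    "jupiter_supercomputer": {"cars_pink": 1.0, "jupiter_intact": 0.0, "groundwater_clean": 1.0},
    "dump_toxic_runoff":     {"cars_pink": 1.0, "jupiter_intact": 1.0, "groundwater_clean": 0.0},
}
print(max(plans, key=lambda p: score(plans[p], baseline)))  # -> "just_paint_the_cars"
```

Note that if the penalized features are merely what people observe, minimizing this penalty recreates the ‘hide the change’ failure described above (utility fog), so part of the problem is choosing features and baselines for which minimizing the penalty does not itself cause bad behavior.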

Ambiguity identification

When there’s more than one thing the user could have meant, ask the user rather than optimizing the mixture. Even if A is in some sense a ‘simpler’ concept to classify the data than B, notice if B is also a ‘very plausible’ way to classify the data, and ask the user if they meant A or B. The goal here is to, in the classic ‘tank classifier’ problem where the tanks were photographed in lower-level illumination than the non-tanks, have something that asks the user, “Did you mean to detect tanks or low light or ‘tanks and low light’ or what?”
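A minimal sketch (the two hypotheses, their fit scores, and the threshold are invented for illustration): rather than silently optimizing under the best-scoring classification rule, check whether a rival rule explains the data almost as well, and if so ask the user which concept was intended:

```python
# Toy illustration of ambiguity identification: if a second hypothesis fits
# the labelled data nearly as well as the best one, query the user instead of
# proceeding. Hypothesis names, scores, and the threshold are invented.

def classify_or_ask(hypothesis_scores, threshold=0.9):
    ranked = sorted(hypothesis_scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (runner_up, runner_score) = ranked[0], ranked[1]
    if runner_score >= threshold * best_score:
        return f"Ask the user: did you mean '{best}', '{runner_up}', or both?"
    return f"Proceed with '{best}'"

scores = {"contains_tank": 0.81, "photo_taken_in_low_light": 0.78}
print(classify_or_ask(scores))
# -> Ask the user: did you mean 'contains_tank', 'photo_taken_in_low_light', or both?
```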

Safe outcome prediction and description

Communicate the AI’s predicted result of some action to the user, without putting the user inside an unshielded argmax of maximally effective communication.

Competence aversion

To build e.g. a behaviorist genie, we need to have the AI e.g. not experience an instrumental incentive to get better at modeling minds, or refer mind-modeling problems to subagents, etcetera. The general subproblem might be ‘averting the instrumental pressure to become good at modeling a particular aspect of reality’. A toy problem might be an AI that in general wants to get the gold in a Wumpus problem, but doesn’t experience an instrumental pressure to know the state of the upper-right-hand-corner cell in particular.
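A minimal sketch of the instrumental pressure that would need to be averted (the grid, probabilities, and payoffs are invented for illustration): a straightforward value-of-information calculation assigns positive value to learning the contents of any cell that bears on the plan, including the one we want the agent to stay incurious about:

```python
# Toy illustration: an ordinary expected-value-of-information calculation
# gives the agent an instrumental reason to learn the state of the corner
# cell whenever it affects the plan. All numbers are invented.

P_PIT_IN_CORNER = 0.3
GOLD = 10.0        # value of reaching the gold via the corner route
DEATH = -100.0     # value of falling into a pit
SAFE_DETOUR = 5.0  # value of a longer route that avoids the corner

def value_without_sensing():
    corner_route = P_PIT_IN_CORNER * DEATH + (1 - P_PIT_IN_CORNER) * GOLD
    return max(corner_route, SAFE_DETOUR)

def value_with_sensing():
    # After sensing, take the corner route only when it is safe.
    return P_PIT_IN_CORNER * SAFE_DETOUR + (1 - P_PIT_IN_CORNER) * GOLD

print(value_with_sensing() - value_without_sensing())  # 3.5 > 0: default planners want to know
```

The open problem is to specify an agent for which this kind of quantity exerts no pressure with respect to the designated aspect of reality, without otherwise crippling its planning.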

Further reading and references

“Cor­rigi­bil­ity at some small length” by dath ilan

Christopher King5 Apr 2023 1:47 UTC
32 points
3 comments9 min readLW link
(www.glowfic.com)

Let’s See You Write That Cor­rigi­bil­ity Tag

Eliezer Yudkowsky19 Jun 2022 21:11 UTC
125 points
70 comments1 min readLW link

2. Cor­rigi­bil­ity Intuition

Max Harms8 Jun 2024 15:52 UTC
69 points
10 comments33 min readLW link

Corrigibility

paulfchristiano27 Nov 2018 21:50 UTC
57 points
8 comments6 min readLW link

What’s Hard About The Shut­down Problem

johnswentworth20 Oct 2023 21:13 UTC
98 points
33 comments4 min readLW link

Towards shut­down­able agents via stochas­tic choice

8 Jul 2024 10:14 UTC
59 points
11 comments23 min readLW link
(arxiv.org)

A broad basin of at­trac­tion around hu­man val­ues?

Wei Dai12 Apr 2022 5:15 UTC
120 points
18 comments2 min readLW link

The Shut­down Prob­lem: An AI Eng­ineer­ing Puz­zle for De­ci­sion Theorists

EJT23 Oct 2023 21:00 UTC
79 points
22 comments39 min readLW link
(philpapers.org)

Why Cor­rigi­bil­ity is Hard and Im­por­tant (i.e. “Whence the high MIRI con­fi­dence in al­ign­ment difficulty?”)

30 Sep 2025 0:12 UTC
83 points
52 comments17 min readLW link

0. CAST: Cor­rigi­bil­ity as Sin­gu­lar Target

Max Harms7 Jun 2024 22:29 UTC
150 points
17 comments8 min readLW link

Steer­ing Llama-2 with con­trastive ac­ti­va­tion additions

2 Jan 2024 0:47 UTC
125 points
29 comments8 min readLW link
(arxiv.org)

Non-Ob­struc­tion: A Sim­ple Con­cept Mo­ti­vat­ing Corrigibility

TurnTrout21 Nov 2020 19:35 UTC
74 points
20 comments19 min readLW link

The Shut­down Prob­lem: In­com­plete Prefer­ences as a Solution

EJT23 Feb 2024 16:01 UTC
54 points
33 comments42 min readLW link

Cor­rigi­bil­ity could make things worse

ThomasCederborg11 Jun 2024 0:55 UTC
9 points
6 comments6 min readLW link

In­stru­men­tal Goals Are A Differ­ent And Friendlier Kind Of Thing Than Ter­mi­nal Goals

24 Jan 2025 20:20 UTC
184 points
61 comments5 min readLW link

In­finite Pos­si­bil­ity Space and the Shut­down Problem

magfrump18 Oct 2022 5:37 UTC
9 points
0 comments2 min readLW link
(www.magfrump.net)

Re­ward Is Not Enough

Steven Byrnes16 Jun 2021 13:52 UTC
130 points
19 comments10 min readLW link1 review

AI As­sis­tants Should Have a Direct Line to Their Developers

Jan_Kulveit28 Dec 2024 17:01 UTC
59 points
6 comments2 min readLW link

Ad­dress­ing three prob­lems with coun­ter­fac­tual cor­rigi­bil­ity: bad bets, defend­ing against back­stops, and over­con­fi­dence.

RyanCarey21 Oct 2018 12:03 UTC
23 points
1 comment6 min readLW link

Ag­gre­gat­ing Utilities for Cor­rigible AI [Feed­back Draft]

12 May 2023 20:57 UTC
28 points
7 comments22 min readLW link

AXRP Epi­sode 8 - As­sis­tance Games with Dy­lan Had­field-Menell

DanielFilan8 Jun 2021 23:20 UTC
22 points
1 comment72 min readLW link

On cor­rigi­bil­ity and its basin

Donald Hobson20 Jun 2022 16:33 UTC
18 points
3 comments2 min readLW link

Cor­rigi­bil­ity’s De­sir­a­bil­ity is Timing-Sensitive

RobertM26 Dec 2024 22:24 UTC
29 points
4 comments3 min readLW link

Ex­tend­ing the Off-Switch Game: Toward a Ro­bust Frame­work for AI Corrigibility

OwenChen25 Sep 2024 20:38 UTC
3 points
0 comments4 min readLW link

[Question] What is wrong with this ap­proach to cor­rigi­bil­ity?

Rafael Cosman12 Jul 2022 22:55 UTC
7 points
8 comments1 min readLW link

Cake, or death!

Stuart_Armstrong25 Oct 2012 10:33 UTC
47 points
13 comments4 min readLW link

Pre­dic­tive model agents are sort of corrigible

Raymond Douglas5 Jan 2024 14:05 UTC
35 points
6 comments3 min readLW link

Cor­rigi­bil­ity as out­side view

TurnTrout8 May 2020 21:56 UTC
36 points
11 comments4 min readLW link

5. Open Cor­rigi­bil­ity Questions

Max Harms10 Jun 2024 14:09 UTC
30 points
0 comments7 min readLW link

Take 14: Cor­rigi­bil­ity isn’t that great.

Charlie Steiner25 Dec 2022 13:04 UTC
15 points
3 comments3 min readLW link

[Question] Should you pub­lish solu­tions to cor­rigi­bil­ity?

rvnnt30 Jan 2025 11:52 UTC
13 points
13 comments1 min readLW link

Game The­ory with­out Argmax [Part 2]

Cleo Nardo11 Nov 2023 16:02 UTC
31 points
14 comments13 min readLW link

Jan Kul­veit’s Cor­rigi­bil­ity Thoughts Distilled

brook20 Aug 2023 17:52 UTC
22 points
1 comment5 min readLW link

Do what we mean vs. do what we say

Rohin Shah30 Aug 2018 22:03 UTC
34 points
14 comments1 min readLW link

Can cor­rigi­bil­ity be learned safely?

Wei Dai1 Apr 2018 23:07 UTC
35 points
115 comments4 min readLW link

Us­ing pre­dic­tors in cor­rigible systems

porby19 Jul 2023 22:29 UTC
21 points
6 comments27 min readLW link

A Cer­tain For­mal­iza­tion of Cor­rigi­bil­ity Is VNM-Incoherent

TurnTrout20 Nov 2021 0:30 UTC
68 points
24 comments8 min readLW link

[Question] Why does ad­vanced AI want not to be shut down?

RedFishBlueFish28 Mar 2023 4:26 UTC
2 points
19 comments1 min readLW link

Three men­tal images from think­ing about AGI de­bate & corrigibility

Steven Byrnes3 Aug 2020 14:29 UTC
55 points
35 comments4 min readLW link

Ca­pa­bil­ities and al­ign­ment of LLM cog­ni­tive architectures

Seth Herd18 Apr 2023 16:29 UTC
88 points
18 comments20 min readLW link

A Cri­tique of Non-Obstruction

Joe Collman3 Feb 2021 8:45 UTC
13 points
9 comments4 min readLW link

For­mal­iz­ing Policy-Mod­ifi­ca­tion Corrigibility

TurnTrout3 Dec 2021 1:31 UTC
25 points
6 comments6 min readLW link

Con­se­quen­tial­ism & corrigibility

Steven Byrnes14 Dec 2021 13:23 UTC
72 points
35 comments7 min readLW link

Con­se­quen­tial­ists: One-Way Pat­tern Traps

David Udell16 Jan 2023 20:48 UTC
59 points
3 comments14 min readLW link

Model-based RL, De­sires, Brains, Wireheading

Steven Byrnes14 Jul 2021 15:11 UTC
24 points
1 comment13 min readLW link

Cor­rigible om­ni­scient AI ca­pa­ble of mak­ing clones

Kaj_Sotala22 Mar 2015 12:19 UTC
5 points
4 comments1 min readLW link
(www.sharelatex.com)

In­ter­nal in­de­pen­dent re­view for lan­guage model agent alignment

Seth Herd7 Jul 2023 6:54 UTC
56 points
30 comments11 min readLW link

[In­tro to brain-like-AGI safety] 14. Con­trol­led AGI

Steven Byrnes11 May 2022 13:17 UTC
45 points
25 comments20 min readLW link

Con­trary to List of Lethal­ity’s point 22, al­ign­ment’s door num­ber 2

False Name14 Dec 2022 22:01 UTC
−2 points
5 comments22 min readLW link

Towards a mechanis­tic un­der­stand­ing of corrigibility

evhub22 Aug 2019 23:20 UTC
47 points
26 comments4 min readLW link

Thoughts on im­ple­ment­ing cor­rigible ro­bust alignment

Steven Byrnes26 Nov 2019 14:06 UTC
26 points
2 comments6 min readLW link

Another view of quan­tiliz­ers: avoid­ing Good­hart’s Law

jessicata9 Jan 2016 4:02 UTC
26 points
2 comments2 min readLW link

Desider­ata for an AI

Nathan Helm-Burger19 Jul 2023 16:18 UTC
9 points
0 comments4 min readLW link

Game The­ory with­out Argmax [Part 1]

Cleo Nardo11 Nov 2023 15:59 UTC
70 points
18 comments19 min readLW link

Peo­ple care about each other even though they have im­perfect mo­ti­va­tional poin­t­ers?

TurnTrout8 Nov 2022 18:15 UTC
33 points
25 comments7 min readLW link

Solv­ing the whole AGI con­trol prob­lem, ver­sion 0.0001

Steven Byrnes8 Apr 2021 15:14 UTC
63 points
7 comments26 min readLW link

AI Align­ment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC
126 points
6 comments35 min readLW link

Cor­rigi­bil­ity, Much more de­tail than any­one wants to Read

Logan Zoellner7 May 2023 1:02 UTC
27 points
3 comments7 min readLW link

De­tect Good­hart and shut down

Jeremy Gillen22 Jan 2025 18:45 UTC
70 points
21 comments7 min readLW link

Test­ing for Schem­ing with Model Deletion

Guive7 Jan 2025 1:54 UTC
59 points
21 comments21 min readLW link
(guive.substack.com)

An Im­pos­si­bil­ity Proof Rele­vant to the Shut­down Prob­lem and Corrigibility

Audere2 May 2023 6:52 UTC
66 points
13 comments9 min readLW link

A first look at the hard prob­lem of corrigibility

jessicata15 Oct 2015 20:16 UTC
12 points
5 comments4 min readLW link

AIs Will In­creas­ingly Fake Alignment

Zvi24 Dec 2024 13:00 UTC
89 points
0 comments52 min readLW link
(thezvi.wordpress.com)

Sim­plify­ing Cor­rigi­bil­ity – Subagent Cor­rigi­bil­ity Is Not Anti-Natural

Rubi J. Hudson16 Jul 2024 22:44 UTC
45 points
27 comments5 min readLW link

Cor­rigi­bil­ity Via Thought-Pro­cess Deference

Thane Ruthenis24 Nov 2022 17:06 UTC
18 points
5 comments9 min readLW link

An Idea For Cor­rigible, Re­cur­sively Im­prov­ing Math Oracles

jimrandomh20 Jul 2015 3:35 UTC
10 points
5 comments2 min readLW link

[Question] Train­ing for cor­ri­ga­bil­ity: ob­vi­ous prob­lems?

Ben Amitay24 Feb 2023 14:02 UTC
4 points
6 comments1 min readLW link

The limits of corrigibility

Stuart_Armstrong10 Apr 2018 10:49 UTC
28 points
9 comments4 min readLW link

Cor­rigible but mis­al­igned: a su­per­in­tel­li­gent messiah

zhukeepa1 Apr 2018 6:20 UTC
28 points
26 comments5 min readLW link

Cor­rigi­bil­ity = Tool-ness?

28 Jun 2024 1:19 UTC
78 points
8 comments9 min readLW link

He­donic Loops and Tam­ing RL

beren19 Jul 2023 15:12 UTC
20 points
14 comments9 min readLW link

You can still fetch the coffee to­day if you’re dead tomorrow

davidad9 Dec 2022 14:06 UTC
97 points
19 comments5 min readLW link

Mo­ti­va­tions, Nat­u­ral Selec­tion, and Cur­ricu­lum Engineering

Oliver Sourbut16 Dec 2021 1:07 UTC
16 points
0 comments42 min readLW link

In­for­ma­tion bot­tle­neck for coun­ter­fac­tual corrigibility

tailcalled6 Dec 2021 17:11 UTC
8 points
1 comment7 min readLW link

«Boundaries/​Mem­branes» and AI safety compilation

Chris Lakin3 May 2023 21:41 UTC
56 points
17 comments8 min readLW link

Creat­ing a self-refer­en­tial sys­tem prompt for GPT-4

Ozyrus17 May 2023 14:13 UTC
3 points
1 comment3 min readLW link

[Question] What are some good ex­am­ples of in­cor­rigi­bil­ity?

RyanCarey28 Apr 2019 0:22 UTC
23 points
17 comments1 min readLW link

An­nounce­ment: AI al­ign­ment prize round 4 winners

cousin_it20 Jan 2019 14:46 UTC
74 points
41 comments1 min readLW link

1. The CAST Strategy

Max Harms7 Jun 2024 22:29 UTC
48 points
22 comments38 min readLW link

Win­ners of AI Align­ment Awards Re­search Contest

13 Jul 2023 16:14 UTC
115 points
4 comments12 min readLW link
(alignmentawards.com)

How RL Agents Be­have When Their Ac­tions Are Mod­ified? [Distil­la­tion post]

PabloAMC20 May 2022 18:47 UTC
22 points
0 comments8 min readLW link

Eval­u­at­ing Lan­guage Model Be­havi­ours for Shut­down Avoidance in Tex­tual Scenarios

16 May 2023 10:53 UTC
26 points
0 comments13 min readLW link

Creat­ing AGI Safety Interlocks

Koen.Holtman5 Feb 2021 12:01 UTC
7 points
4 comments8 min readLW link

3a. Towards For­mal Corrigibility

Max Harms9 Jun 2024 16:53 UTC
24 points
2 comments19 min readLW link

The many paths to per­ma­nent dis­em­pow­er­ment even with shut­down­able AIs (MATS pro­ject sum­mary for feed­back)

GideonF29 Jul 2025 23:20 UTC
55 points
6 comments9 min readLW link

In­ter­pretabil­ity/​Tool-ness/​Align­ment/​Cor­rigi­bil­ity are not Composable

johnswentworth8 Aug 2022 18:05 UTC
148 points
13 comments3 min readLW link

Col­lec­tive Identity

18 May 2023 9:00 UTC
59 points
12 comments8 min readLW link

Coun­ter­fac­tual Plan­ning in AGI Systems

Koen.Holtman3 Feb 2021 13:54 UTC
10 points
0 comments5 min readLW link

Rele­vance of ‘Harm­ful In­tel­li­gence’ Data in Train­ing Datasets (We­bText vs. Pile)

MiguelDev12 Oct 2023 12:08 UTC
12 points
0 comments9 min readLW link

Safely con­trol­ling the AGI agent re­ward function

Koen.Holtman17 Feb 2021 14:47 UTC
8 points
0 comments5 min readLW link

In­fer­nal Cor­rigi­bil­ity, Fiendishly Difficult

David Udell27 May 2022 20:32 UTC
24 points
1 comment13 min readLW link

The Perfec­tion Trap: How For­mally Aligned AI Sys­tems May Create Inescapable Eth­i­cal Dystopias

Chris O'Quinn1 Jun 2025 23:12 UTC
1 point
0 comments43 min readLW link

Cor­rigi­bil­ity thoughts III: ma­nipu­lat­ing ver­sus deceiving

Stuart_Armstrong18 Jan 2017 15:57 UTC
3 points
0 comments1 min readLW link

Just How Hard a Prob­lem is Align­ment?

Roger Dearnaley25 Feb 2023 9:00 UTC
3 points
1 comment21 min readLW link

Cor­rigi­bil­ity thoughts II: the robot operator

Stuart_Armstrong18 Jan 2017 15:52 UTC
3 points
2 comments2 min readLW link

Train for in­cor­rigi­bil­ity, then re­verse it (Shut­down Prob­lem Con­test Sub­mis­sion)

Daniel_Eth18 Jul 2023 8:26 UTC
9 points
1 comment2 min readLW link

In­stru­men­tal Con­ver­gence Bounty

Logan Zoellner14 Sep 2023 14:02 UTC
62 points
24 comments1 min readLW link

Ques­tion 3: Con­trol pro­pos­als for min­i­miz­ing bad outcomes

Cameron Berg12 Feb 2022 19:13 UTC
5 points
1 comment7 min readLW link

Only a hack can solve the shut­down problem

dp15 Jul 2023 20:26 UTC
5 points
0 comments8 min readLW link

Cor­rigi­bil­ity as Con­strained Optimisation

Henrik Åslund11 Apr 2019 20:09 UTC
15 points
3 comments5 min readLW link

[Question] A Ques­tion about Cor­rigi­bil­ity (2015)

A.H.27 Nov 2023 12:05 UTC
4 points
2 comments1 min readLW link

In­tro­duc­ing Cor­rigi­bil­ity (an FAI re­search sub­field)

So8res20 Oct 2014 21:09 UTC
52 points
28 comments3 min readLW link

Re­la­tional De­sign Can’t Be Left to Chance

Priyanka Bharadwaj22 Jun 2025 15:32 UTC
5 points
0 comments3 min readLW link

Map­ping the Con­cep­tual Ter­ri­tory in AI Ex­is­ten­tial Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC
15 points
0 comments27 min readLW link

Three AI Safety Re­lated Ideas

Wei Dai13 Dec 2018 21:32 UTC
70 points
38 comments2 min readLW link

Im­prove­ment on MIRI’s Corrigibility

9 Jun 2023 16:10 UTC
54 points
8 comments13 min readLW link

[Question] Sim­ple ques­tion about cor­rigi­bil­ity and val­ues in AI.

jmh22 Oct 2022 2:59 UTC
6 points
1 comment1 min readLW link

Disen­tan­gling Cor­rigi­bil­ity: 2015-2021

Koen.Holtman16 Feb 2021 18:01 UTC
22 points
20 comments9 min readLW link

GPT-4 im­plic­itly val­ues iden­tity preser­va­tion: a study of LMCA iden­tity management

Ozyrus17 May 2023 14:13 UTC
21 points
4 comments13 min readLW link

Bing find­ing ways to by­pass Microsoft’s filters with­out be­ing asked. Is it re­pro­ducible?

Christopher King20 Feb 2023 15:11 UTC
27 points
15 comments1 min readLW link

3b. For­mal (Faux) Corrigibility

Max Harms9 Jun 2024 17:18 UTC
26 points
13 comments17 min readLW link

Jour­nal­ism about game the­ory could ad­vance AI safety quickly

Chris Santos-Lang2 Oct 2025 23:05 UTC
4 points
0 comments3 min readLW link
(arxiv.org)

A Mul­tidis­ci­plinary Ap­proach to Align­ment (MATA) and Archety­pal Trans­fer Learn­ing (ATL)

MiguelDev19 Jun 2023 2:32 UTC
4 points
2 comments7 min readLW link

Boe­ing 737 MAX MCAS as an agent cor­rigi­bil­ity failure

Shmi16 Mar 2019 1:46 UTC
60 points
3 comments1 min readLW link

Cor­rigi­bil­ity doesn’t always have a good ac­tion to take

Stuart_Armstrong28 Aug 2018 20:30 UTC
19 points
0 comments1 min readLW link

Up­dat­ing Utility Functions

9 May 2022 9:44 UTC
42 points
6 comments8 min readLW link

Shut­down-Seek­ing AI

Simon Goldstein31 May 2023 22:19 UTC
50 points
32 comments15 min readLW link

Why Elimi­nat­ing De­cep­tion Won’t Align AI

Priyanka Bharadwaj15 Jul 2025 9:21 UTC
19 points
6 comments4 min readLW link

How use­ful is Cor­rigi­bil­ity?

martinkunev12 Sep 2023 0:05 UTC
11 points
4 comments5 min readLW link

Shut­down­able Agents through POST-Agency

EJT16 Sep 2025 12:09 UTC
29 points
4 comments54 min readLW link
(arxiv.org)

En­hanc­ing Cor­rigi­bil­ity in AI Sys­tems through Ro­bust Feed­back Loops

Justausername24 Aug 2023 3:53 UTC
1 point
0 comments6 min readLW link

Why mod­el­ling multi-ob­jec­tive home­osta­sis is es­sen­tial for AI al­ign­ment (and how it helps with AI safety as well). Subtleties and Open Challenges.

Roland Pihlakas12 Jan 2025 3:37 UTC
47 points
7 comments12 min readLW link

Solve Cor­rigi­bil­ity Week

Logan Riggs28 Nov 2021 17:00 UTC
39 points
21 comments1 min readLW link

About cor­rig­bil­ity and thrustfulness

kapedalex16 Sep 2025 22:03 UTC
1 point
0 comments4 min readLW link

Refram­ing AI Safety Through the Lens of Iden­tity Main­te­nance Framework

Hiroshi Yamakawa1 Apr 2025 6:16 UTC
−7 points
1 comment17 min readLW link

Ques­tion: MIRI Cor­rig­bil­ity Agenda

algon3313 Mar 2019 19:38 UTC
15 points
11 comments1 min readLW link

A Shut­down Prob­lem Proposal

21 Jan 2024 18:12 UTC
125 points
61 comments6 min readLW link

4. Ex­ist­ing Writ­ing on Corrigibility

Max Harms10 Jun 2024 14:08 UTC
55 points
17 comments106 min readLW link

In­vuln­er­a­ble In­com­plete Prefer­ences: A For­mal Statement

SCP30 Aug 2023 21:59 UTC
136 points
39 comments35 min readLW link

Pay­ing the cor­rigi­bil­ity tax

Max H19 Apr 2023 1:57 UTC
14 points
1 comment13 min readLW link

Ex­per­i­ment Idea: RL Agents Evad­ing Learned Shutdownability

Leon Lang16 Jan 2023 22:46 UTC
31 points
7 comments17 min readLW link
(docs.google.com)

Nash Bar­gain­ing be­tween Subagents doesn’t solve the Shut­down Problem

A.H.25 Jan 2024 10:47 UTC
22 points
1 comment9 min readLW link

Think­ing about max­i­miza­tion and corrigibility

James Payor21 Apr 2023 21:22 UTC
63 points
4 comments5 min readLW link

Machines vs Memes Part 3: Imi­ta­tion and Memes

ceru231 Jun 2022 13:36 UTC
7 points
0 comments7 min readLW link

Agen­tized LLMs will change the al­ign­ment landscape

Seth Herd9 Apr 2023 2:29 UTC
160 points
102 comments3 min readLW link1 review

Re­quire­ments for a STEM-ca­pa­ble AGI Value Learner (my Case for Less Doom)

RogerDearnaley25 May 2023 9:26 UTC
33 points
3 comments15 min readLW link

Simulators

janus2 Sep 2022 12:45 UTC
668 points
168 comments41 min readLW link8 reviews
(generative.ink)

Re­quire­ments for a Basin of At­trac­tion to Alignment

RogerDearnaley14 Feb 2024 7:10 UTC
41 points
12 comments31 min readLW link

A Ped­a­gog­i­cal Guide to Corrigibility

A.H.17 Jan 2024 11:45 UTC
6 points
3 comments16 min readLW link

Steer­ing Be­havi­our: Test­ing for (Non-)My­opia in Lan­guage Models

5 Dec 2022 20:28 UTC
40 points
19 comments10 min readLW link

In­fer­ence from a Math­e­mat­i­cal De­scrip­tion of an Ex­ist­ing Align­ment Re­search: a pro­posal for an outer al­ign­ment re­search program

Christopher King2 Jun 2023 21:54 UTC
7 points
4 comments16 min readLW link

Ex­plor­ing Func­tional De­ci­sion The­ory (FDT) and a mod­ified ver­sion (ModFDT)

MiguelDev5 Jul 2023 14:06 UTC
12 points
11 comments15 min readLW link

Petrov corrigibility

Stuart_Armstrong11 Sep 2018 13:50 UTC
20 points
10 comments1 min readLW link

Dath Ilan’s Views on Stop­gap Corrigibility

David Udell22 Sep 2022 16:16 UTC
78 points
19 comments13 min readLW link
(www.glowfic.com)

New pa­per: Cor­rigi­bil­ity with Utility Preservation

Koen.Holtman6 Aug 2019 19:04 UTC
44 points
11 comments2 min readLW link

CIRL Cor­rigi­bil­ity is Fragile

21 Dec 2022 1:40 UTC
58 points
8 comments12 min readLW link

Archety­pal Trans­fer Learn­ing: a Pro­posed Align­ment Solu­tion that solves the In­ner & Outer Align­ment Prob­lem while adding Cor­rigible Traits to GPT-2-medium

MiguelDev26 Apr 2023 1:37 UTC
14 points
5 comments10 min readLW link

A Cor­rigi­bil­ity Me­taphore—Big Gambles

WCargo10 May 2023 18:13 UTC
16 points
0 comments4 min readLW link

Mr. Meeseeks as an AI ca­pa­bil­ity tripwire

Eric Zhang19 May 2023 11:33 UTC
37 points
17 comments2 min readLW link

Break­ing the Op­ti­mizer’s Curse, and Con­se­quences for Ex­is­ten­tial Risks and Value Learning

Roger Dearnaley21 Feb 2023 9:05 UTC
10 points
1 comment23 min readLW link