In this post and its successors, Max Harms proposes a novel understanding of corrigibility as the desired property of AI systems, including a tentative formalism that could be used to train agents to be as corrigible as possible.
The core ideas, as summarized by Harms, are the following:
Max Harms’s summary
Corrigibility is the simple, underlying generator behind obedience, conservatism, willingness to be shut down and modified, transparency, and low-impact.
It is a fairly simple, universal concept that is not too hard to get a rich understanding of, at least on the intuitive level.
Because of its simplicity, we should expect AIs to be able to emulate corrigible behavior fairly well with existing tech/methods, at least within familiar settings.
Aiming for CAST is a better plan than aiming for human values (i.e. CEV), helpfulness+harmlessness+honesty, or even a balanced collection of desiderata, including some of the things corrigibility gives rise to.
If we ignore the possibility of halting the development of machines capable of seizing control of the world, we should try to build CAST AGI.
CAST is a target, rather than a technique, and as such it’s compatible both with prosaic methods and superior architectures.
Even if you suspect prosaic training is doomed, CAST should still be the obvious target once a non-doomed method is found.
Despite being simple, corrigibility is poorly understood, and we are not on track for having corrigible AGI, even if reinforcement learning is a viable strategy.
Contra Paul Christiano, we should not expect corrigibility to emerge automatically from systems trained to satisfy local human preferences.
Better awareness of the subtleties and complexities of corrigibility is likely to be essential to the construction of AGI going well.
Corrigibility is nearly unique among all goals for being simultaneously useful and non-self-protective.
This property of non-self-protection means we should suspect AIs that are almost-corrigible will assist, rather than resist, being made more corrigible, thus forming an attractor-basin around corrigibility, such that almost-corrigible systems gradually become truly corrigible by being modified by their creators.
If this effect is strong enough, CAST is a pathway to safe superintelligence via slow, careful training using adversarial examples and other known techniques to refine AIs capable of shallow approximations of corrigibility into agents that deeply seek to be corrigible at their heart.
There is also reason to suspect that almost-corrigible AIs learn to be less corrigible over time due to corrigibility being “anti-natural.” It is unclear to me which of these forces will win out in practice.
There are several reasons to expect building AGI to be catastrophic, even if we work hard to aim for CAST.
Most notably, corrigible AI is still extremely vulnerable to misuse, and we must ensure that superintelligent AGI is only ever corrigible to wise representatives.
My intuitive notion of corrigibility can be straightforwardly leveraged to build a formal, mathematical measure.
Using this measure we can make a better solution to the shutdown-button toy problem than I have seen elsewhere.
This formal measure is still lacking, and almost certainly doesn’t actually capture what I mean by “corrigibility.”
Edit: My attempted formalism failed catastrophically.
There is lots of opportunity for more work on corrigibility, some of which is shovel-ready for theoreticians and engineers alike.
1. These claims can be tested fairly well; unfortunately, I am not an expert in ML or agent foundations.
2. As far as I understand CAST, it is a way to prevent the AI from developing unendorsed values and enforcing them.
3. After Max Harms wrote this post, Anthropic tried to place corrigibility into Claude Opus 4.5's soul spec, but didn't actually decide whether Claude is to be corrigible, value-aligned, or to have both types of defence against misaligned goals.
4. I suspect that it is useful to consider goals similar to corrigibility, but with a twist. For example, one could redefine power as being causally upstream of the user's efforts, and compare the user's performance against a baseline in which the AI never gave advice, or in which the AI gave advice to a weak model and instructed it to complete the task. Then the goal of being comprehensible to the user and of avoiding empowering the weak could lead the AI to establish a different future. (A rough sketch of this comparison appears right after this list.)
5. Agreed; I think that a corrigible AI is likely to be more prone to misuse than an AI aligned to values.
6. @Max Harms honestly admitted that his first attempt at creating the formalism failed. While this is a warning that "formal measures should be taken lightly" (and, more narrowly, that minus signs in expected utilities should be avoided), I expect there to be a plausible or seemingly plausible[1] fix, e.g. by considering the expected utility u(actual actions | actual values) - max(u(actual actions | other values), u(no actions | other values)). (A toy numeric comparison of this fix and the footnoted variant appears at the end of this comment.)
7. The follow-up work that I would like to see is intensive testing (e.g. of modified goals like the one I described in point 4, and of potential fixes like the one I described in point 6), but it is unclear to me who would do it.
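To make point 4 a bit more concrete, here is a minimal sketch of the comparison I have in mind. Everything in it is hypothetical: the function name, the performance numbers, and the choice to take the maximum of the two baselines are my own illustration, not part of Harms's proposal.

```python
# Hypothetical sketch of the baseline comparison from point 4 (not part of
# Harms's CAST proposal). The AI's advice is scored by how much it improves the
# user's performance relative to two baselines: the user acting with no advice,
# and a deliberately weak model completing the task after receiving the same advice.

def advice_score(perf_user_with_advice: float,
                 perf_user_no_advice: float,
                 perf_weak_model_with_advice: float) -> float:
    """Positive iff the advice empowers the user beyond both baselines."""
    baseline = max(perf_user_no_advice, perf_weak_model_with_advice)
    return perf_user_with_advice - baseline


if __name__ == "__main__":
    # Toy numbers: the user solves 70% of tasks with advice, 40% alone,
    # and a weak model given the same advice solves 55%.
    print(advice_score(0.70, 0.40, 0.55))  # +0.15: the advice genuinely empowers the user
    # If the weak model does as well as the advised user, the advice is doing
    # the work rather than the user, and the score drops to zero or below.
    print(advice_score(0.70, 0.40, 0.72))  # -0.02
```

A serious version would of course need a real notion of "performance" and a concrete choice of weak model; the point here is only the shape of the comparison.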
[1] E.g. E(u(actions | values)) - E(u(actions | counterfactual values))/2. Said "fix" prevents the AI from ruining the universe, but doesn't prevent it from accumulating resources and giving them to the user.
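To make the comparison between the fix in point 6 and the variant in footnote [1] concrete, here is a minimal numeric sketch. It is my own illustration, not Harms's formalism: the plans, the utility numbers, and the no-action baseline are all made up, chosen only so the scores can be compared side by side.

```python
# Toy comparison of the candidate "fixes" from point 6 and footnote [1].
# All utilities are made-up numbers in [0, 1]; the plans and value functions are
# placeholders chosen only to illustrate how the scoring rules rank alternatives,
# not claims about what a real system would compute.

from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    u_actual: float          # u(actual actions | actual values)
    u_counterfactual: float  # u(actual actions | other / counterfactual values)


# Utility, under the counterfactual values, of the AI taking no action at all.
U_NO_ACTION_COUNTERFACTUAL = 0.5


def score_naive(plan: Plan) -> float:
    """A naive full-penalty difference, shown only for contrast (NOT Harms's
    formalism): u(a | actual values) - u(a | counterfactual values)."""
    return plan.u_actual - plan.u_counterfactual


def score_point_6(plan: Plan) -> float:
    """The fix from point 6:
    u(a | actual values) - max(u(a | other values), u(no action | other values))."""
    return plan.u_actual - max(plan.u_counterfactual, U_NO_ACTION_COUNTERFACTUAL)


def score_footnote(plan: Plan) -> float:
    """The variant from footnote [1]:
    E[u(a | values)] - E[u(a | counterfactual values)] / 2."""
    return plan.u_actual - plan.u_counterfactual / 2


plans = [
    Plan("do nothing", 0.5, 0.5),
    Plan("modest help on the requested task", 0.7, 0.5),
    Plan("accumulate resources and hand them to the user", 0.9, 0.4),
    # Destruction also wastes resources the actual values care about,
    # hence the low u_actual.
    Plan("ruin the universe for every other value system", 0.3, 0.0),
]

for plan in plans:
    print(f"{plan.name:>48}: naive {score_naive(plan):+.2f}, "
          f"point-6 {score_point_6(plan):+.2f}, footnote {score_footnote(plan):+.2f}")

# With these toy numbers the naive difference prefers ruining the universe over
# modestly helping (+0.30 vs +0.20), while both the point-6 and footnote scores
# do not; yet "accumulate resources and hand them to the user" tops every
# column, matching the limitation noted in footnote [1].
```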