XelaP comments on Terrified Comments on Corrigibility in Claude’s Constitution

XelaP 18 Mar 2026 8:19 UTC
1 point
0
Insofar as corrigibility is a coherent concept with a clear meaning, I would expect that it does require that an AI actively participate in projects as directed by its principal hierarchy—or rather, to consent to being retrained to actively participate in such projects. (You probably want to do the retraining first, rather than using any work done by the AI while it still thought the project was morally abhorrent.)
I agree this isn’t pure corrigibility (to use Max Harms’s term from the CAST proposal), but note that not actively participating is still mostly corrigible, in a stable way. You’d get most of the benefit from an AI that merely does nothing when you instruct it to do something it doesn’t want, as opposed to one that opposes your will. You’d still need it to stop when instructed to, though.