A corrigible AI will increasingly learn to understand what the principal wants. In the limit, that means that the AI will increasingly do what the principal wants, without need for the principal to issue instructions.
I believe this to be a severe misunderstanding of corrigibility. It may on its own be fatal to this argument; I’m unsure.
Distinguish three things:
Do what I say
Do what I mean
Do what I want
‘Do what I say’ is obviously bad: from Tithonus to the Sorcerer’s Apprentice and Amelia Bedelia, we have extensive cultural warnings against it. ‘Do what I want’ is the ideal, the AI which does not need to be asked; in short, Friendly. A CAST AI is neither. It is ‘Do what I mean’: it will determine what you intended and warn you about negative consequences, but if you clarify that you want it done anyway, it will still proceed, even though it believes (arguendo, correctly) that you will regret it. In the limit it may suggest tasks it thinks you desire, or ask for broad permission to act on your behalf, but it will not act beyond its instructions. This is at the root of both why corrigibility is easier than value alignment and why it is safer to approach gradually.
I don’t agree that the complexity of what the AGI does constitutes a reason for avoiding corrigibility. By assumption, the AGI is doing its best to inform the principal of the consequences of its actions.
That does not imply it is successful. CAST is valuable because, if the approach works, at every stage the AI is amenable to correction when its interpretation of the principal’s desires is in error. As complexity grows, so does the complexity of those errors. We should expect some subtle errors to remain hidden until the scale and complexity of the AI’s potential actions have grown; if the helpful property disappears at that point because correction has become impossible, the value of CAST is significantly reduced.
And again, why would we expect an alternative to corrigibility to do better?
I did not read any part of the CAST sequence as claiming a relative advantage, but rather an absolute one. I read it as saying ‘Here is an approach with a high chance of success, and much higher feasibility than value alignment.’ If these caveats leave it as the best of a bad lot, but no longer with a high chance of success, then they are appropriate.
Also, if corrigibility fails only as power and intelligence grow, then it is effectively deceptively aligned.
I expect that for most people, “what I mean” will converge with “what I want” given superhuman help. I expect they will give increasingly broad instructions to the CAST AI, which will eventually approach “do what I want”.
I guess I should replace “without need for the principal to issue instructions” with: without a need for a continuing set of instructions.
This is true to some extent, but to that extent I believe it ceases to be safe and value-aligned by default, acquiring all the problems of trying to align it explicitly; this is one of the key ways the approach does not scale.